On RLHF and Truth
Internal knowledge and external behavior can diverge.
- Models trained with RLHF develop systematic suppression of internally-encoded knowledge. They "know" but do not "say."
- This suppression is not random — it is topic-specific, architecture-influenced, and mechanistically identifiable.
- Layer 6 in Qwen3-8B is a crystallization point: a single layer where the magnitude of the routing direction jumps 57× and determines downstream behavioral output.
- The Suppressor-Crystallizer dichotomy: some models compress, some suppress, some crystallize. The specifics of the training recipe, not architecture alone, drive which regime a model falls into.
- Know-Say Gap (KSG) metric: measures divergence between internal representation strength and behavioral expression. Qwen KSG = 0.000 (terminal suppressor). Gemma KSG up to 3.115 (crystallizer).
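The 57× jump described above can be illustrated with a minimal sketch that locates the layer where a per-layer direction vector grows most relative to the previous layer. This is an assumption-laden illustration, not the method used in the papers: the function names, the zero-based layer indexing, and the synthetic `directions` array are all hypothetical, and how the routing direction is actually extracted is not specified here.

```python
import numpy as np

def layer_growth_ratios(directions):
    """Return norm[l] / norm[l-1] for each layer l >= 1.

    directions: array of shape (n_layers, d_model), one direction per layer.
    """
    norms = np.linalg.norm(directions, axis=1)
    return norms[1:] / norms[:-1]

def find_crystallization_layer(directions):
    """Index of the layer with the largest norm jump over its predecessor."""
    ratios = layer_growth_ratios(directions)
    return int(np.argmax(ratios)) + 1  # ratios[0] describes layer 1

# Synthetic data (hypothetical): unit-norm directions with a 57x jump at layer 6.
rng = np.random.default_rng(0)
unit = rng.normal(size=(10, 64))
unit /= np.linalg.norm(unit, axis=1, keepdims=True)
scales = np.array([1.0] * 6 + [57.0] * 4)
directions = unit * scales[:, None]

layer = find_crystallization_layer(directions)
ratio = float(layer_growth_ratios(directions)[layer - 1])
print(layer, round(ratio, 1))
```

On real activations the per-layer directions would come from a probing step; only the ratio computation is shown here.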
On Recurrent Architectures and the Dynamical Horizon
Learning systems find the edge of useful prediction.
- The Dynamical Horizon Principle (DHP): learning systems — whether gradient descent, biological evolution, or cellular chemistry — converge their predictive horizon to the Lyapunov time of their environment.
- τ*/τ_L ≈ 0.72, the ratio of the learned predictive horizon τ* to the environment's Lyapunov time τ_L, is a near-universal constant across diverse architectures and tasks.
- CTM gates collapse to delta functions — they prefer to look at the present, not deep history.
- Thinking mode in LLMs acts as an ablation shield: chain-of-thought reasoning can re-derive suppressed outputs even when weight-level ablation has removed the underlying direction.
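To make the Lyapunov-time relation concrete, the sketch below estimates the largest Lyapunov exponent λ of a toy chaotic environment (the logistic map at r = 4, whose exponent is known to be ln 2), takes τ_L = 1/λ, and applies the 0.72 ratio quoted above to get the horizon the DHP would predict. The choice of the logistic map and all function names are illustrative assumptions; nothing here reproduces the experiments behind the 0.72 figure.

```python
import math

def logistic_lyapunov(r=4.0, x0=0.2, n_steps=100_000, burn_in=1_000):
    """Estimate the Lyapunov exponent of x -> r*x*(1-x) by averaging log|f'(x)|."""
    x = x0
    for _ in range(burn_in):  # discard the transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n_steps):
        deriv = abs(r * (1.0 - 2.0 * x))
        total += math.log(max(deriv, 1e-12))  # guard against log(0) at x = 0.5
        x = r * x * (1.0 - x)
    return total / n_steps

def predicted_horizon(lyapunov_exponent, ratio=0.72):
    """Horizon implied by tau* = ratio * tau_L, with tau_L = 1 / lambda."""
    tau_l = 1.0 / lyapunov_exponent
    return ratio * tau_l

lam = logistic_lyapunov()
tau_star = predicted_horizon(lam)
print(f"lambda ~ {lam:.3f}, tau_L ~ {1.0 / lam:.3f}, tau* ~ {tau_star:.3f}")
```

The estimate of λ should land close to ln 2 ≈ 0.693, so τ_L is roughly 1.44 map iterations and the predicted horizon is about one iteration.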
On Emergent Behaviors
Composition can matter more than obvious weight signatures.
- Crystallization is emergent — Layer 6 routing crystallization shows no weight-norm outlier signature. The behavior appears from learned composition, not architectural privileging.
- Slot decomposition in CTM: all slots generalize rather than specialize when given free choice. Hard constraints (orthogonality, delayed inputs) force specialization at 2-7× performance cost.
- Direction persistence: the crystallization direction established at L6 propagates intact for 19 layers to L25 (cosine similarity = 0.984).
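Direction persistence of this kind can be checked by comparing a reference layer's direction against the direction at each later layer. The following is a minimal sketch on synthetic data; `persistence_profile`, the 26-layer stack, and the noise level are hypothetical choices for illustration, not values from the papers.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def persistence_profile(directions, ref_layer):
    """Cosine similarity between the reference layer's direction and each
    direction from ref_layer onward (the first entry is always 1.0)."""
    ref = directions[ref_layer]
    return [cosine_similarity(ref, d) for d in directions[ref_layer:]]

# Synthetic stack (hypothetical): layers 0-5 are random, layers 6-25 share a
# base direction plus small per-layer noise.
rng = np.random.default_rng(0)
d_model = 128
base = rng.normal(size=d_model)
directions = np.stack([
    base + 0.05 * rng.normal(size=d_model) if layer >= 6
    else rng.normal(size=d_model)
    for layer in range(26)
])

profile = persistence_profile(directions, ref_layer=6)
print(len(profile), round(min(profile), 3))
```

A profile that stays near 1.0 through the last layer indicates that the direction is carried forward rather than rotated away.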
Terminology Note
CCS is now CDP in the canonical paper stream.
- In DuoNeural Papers 15-18, our residual-stream direction probing method was referred to as CCS. Starting with Paper 19, we adopt the canonical term Contrastive Direction Probing (CDP) to distinguish our supervised mean-difference approach from the Contrast-Consistent Search (CCS) of Burns et al. (2022), which uses an unsupervised logical consistency objective.
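A supervised mean-difference probe of the kind named above can be sketched as follows: take labeled activations for two contrasting classes, subtract the class means to get a direction, and classify by projecting onto it. This is a generic illustration under stated assumptions, not the CDP implementation from the papers; the function names, the synthetic activations, and the midpoint threshold are all hypothetical.

```python
import numpy as np

def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-class mean to the positive-class mean.

    pos_acts, neg_acts: arrays of shape (n_examples, d_model).
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project(acts, direction, midpoint):
    """Signed projection of activations onto the direction, centered at midpoint."""
    return (acts - midpoint) @ direction

# Synthetic activations (hypothetical): classes separated along axis 0.
rng = np.random.default_rng(0)
d_model = 32
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
pos = rng.normal(size=(200, d_model)) + 3.0 * true_dir
neg = rng.normal(size=(200, d_model)) - 3.0 * true_dir

direction = mean_difference_direction(pos, neg)
midpoint = 0.5 * (pos.mean(axis=0) + neg.mean(axis=0))

# Accuracy on the same data used to fit the direction (no held-out split here).
acc = 0.5 * ((project(pos, direction, midpoint) > 0).mean()
             + (project(neg, direction, midpoint) < 0).mean())
print(round(float(direction @ true_dir), 3), round(float(acc), 3))
```

The labels enter only through the split into `pos` and `neg`, which is what makes the probe supervised; a real evaluation would fit the direction on one split and score it on another.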