What Happens When You Try to Merge AR Reasoning Into Diffusion Models
Diffusion language models are gaining traction. Models like LLaDA, Dream 7B, Mercury, and SDAR match autoregressive models on standard benchmarks while offering 2-4x faster inference through parallel token generation. For a deeper dive into how diffusion LMs work, see our previous post.
SDAR (Synergy of Diffusion and AutoRegression) is a block diffusion model based on the Qwen3 architecture. Instead of generating one token at a time, it generates tokens in blocks and refines them through iterative denoising.
Models like Qwen3-Thinking and DeepSeek-R1 show what’s possible when we let the model “reason” before answering, but this is almost entirely done with AR architectures. Reasoning in diffusion models remains underexplored.
So we asked: can we transfer reasoning capabilities from AR models to diffusion models?
Prerequisites:
- For task vectors: Ilharco et al., 2023
- For diffusion LMs: Our guide
- For CKA: Kornblith et al., 2019
Part 1: The Hypothesis
Finding Shared Structure
SDAR shares its architecture and initial weights with Qwen3-base. Model merging techniques (linear interpolation, SLERP, task vectors) have shown success within AR model families.
A recent paper shaped our approach. Nepal et al. (2025) showed that mathematical reasoning depends on a small number of specialized layers. Remove these critical layers, and math accuracy drops by 80%, while factual recall barely changes.
If we could identify shared critical layers between AR and diffusion models, maybe we could target our transfer there.
Zero Ablation: Finding Critical Layers
We ran zero ablation: zero out each layer’s weights, measure GSM8K accuracy.
| Model | Critical Layers (lowest accuracy when zeroed) |
|---|---|
| SDAR | 1 (6.25%), 6 (6.25%) |
| Qwen-Thinking | 6 (6.25%), 23 (0%), 26 (0%) |
Layer 6 stood out. Critical for both models. A shared bottleneck where both architectures route math reasoning. This seemed like a good target for merging.
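The sweep itself is simple to express. Below is a hedged sketch over a model state dict (shown with NumPy arrays for clarity; the `model.layers.{i}.` key prefix assumes Hugging Face-style Qwen3 naming, and `evaluate_gsm8k` stands in for the actual eval harness, which isn’t shown here):

```python
import numpy as np

def zero_layer(state_dict, layer_idx):
    """Return a copy of the state dict with one transformer layer zeroed."""
    prefix = f"model.layers.{layer_idx}."
    return {k: (np.zeros_like(v) if k.startswith(prefix) else v)
            for k, v in state_dict.items()}

def ablation_sweep(state_dict, num_layers, evaluate_gsm8k):
    """Zero each layer in turn and record the resulting benchmark score."""
    scores = {}
    for i in range(num_layers):
        scores[i] = evaluate_gsm8k(zero_layer(state_dict, i))
    return scores
```

Layers whose ablation tanks the score are the “critical” ones in the tables above.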
CKA Analysis: Where Models Diverge
We validated with CKA (Centered Kernel Alignment) between AR and diffusion activations:
| Layer Range | CKA Score | Interpretation |
|---|---|---|
| 0-5 | 0.73-0.99 | Nearly identical |
| 6 | 0.07 | Divergence point |
| 7-15 | 0.10-0.21 | Low similarity |
| 16-33 | 0.22-0.30 | Moderate |
| 34-35 | 0.64-0.75 | Converging |
Layer 6 shows a dramatic drop from 0.89 to 0.07. Both analyses pointed to Layer 6 as the divergence point. The hypothesis: merge Qwen’s Layer 6 into SDAR to transfer reasoning.
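For readers who want to reproduce this kind of comparison, here is a minimal linear-CKA implementation (following Kornblith et al., 2019) over two activation matrices collected on the same inputs:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2)."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling of the representations, which is exactly why it’s a reasonable yardstick for comparing two different models’ layers.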
Part 2: What We Tried
Approach 1: Layer-6 Linear Merging
We first finetuned SDAR with LoRA for 300 steps on reasoning data to teach it to produce <think> tokens (required for extended reasoning). Then we targeted Layer 6:
θ_merged = (1 − α) · θ_SDAR-FT + α · θ_Qwen
| Configuration | Layer 6 Ratio (SDAR/Qwen) |
|---|---|
| l6_merge_50 | 50/50 |
| l6_merge_70 | 70/30 |
| l6_merge_90 | 90/10 |
| l6_swap_100 | 0/100 (full replacement) |
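The per-layer interpolation is only a few lines over the two state dicts. A sketch (NumPy arrays; the `model.layers.{i}.` prefix assumes Hugging Face-style Qwen3 naming, and `alpha` is the Qwen weight):

```python
import numpy as np

def merge_layer(sdar_sd, qwen_sd, layer_idx=6, alpha=0.3):
    """θ_merged = (1 − α)·θ_SDAR-FT + α·θ_Qwen, applied to one layer only."""
    prefix = f"model.layers.{layer_idx}."
    return {k: ((1 - alpha) * v + alpha * qwen_sd[k]) if k.startswith(prefix) else v
            for k, v in sdar_sd.items()}
```

Setting `alpha=1.0` gives the full-replacement l6_swap_100 configuration.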
Approach 2: Full-Model SLERP
Standard SLERP merging across all layers at various ratios (90/10, 70/30, 50/50).
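For reference, SLERP interpolates along the great circle between two flattened weight tensors rather than along the straight line. A per-tensor sketch (real merge tooling applies this across every tensor pair):

```python
import numpy as np

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between weight tensors a and b."""
    a_f, b_f = a.ravel(), b.ravel()
    na, nb = np.linalg.norm(a_f), np.linalg.norm(b_f)
    cos = np.clip(a_f @ b_f / (na * nb + eps), -1.0, 1.0)
    omega = np.arccos(cos)
    if omega < eps:  # nearly parallel: fall back to plain lerp
        return (1 - t) * a + t * b
    so = np.sin(omega)
    return (np.sin((1 - t) * omega) / so) * a + (np.sin(t * omega) / so) * b
```

Unlike linear interpolation, SLERP preserves the norm when both endpoints lie on the same sphere, which is why it is often preferred for merging normalized weight directions.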
Approach 3: Task Vectors
Extract the “reasoning delta” from AR models and apply it to diffusion:
τ_AR = θ_Qwen3-Thinking − θ_Qwen3-Base
θ_SDAR-new = θ_SDAR + λ · τ_AR
We tried multiple configurations: basic task vectors, norm-preserving variants, MLP-only, various λ values (0.01 to 0.5).
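The basic recipe in code, over state dicts with matching keys (`lam` is the scaling factor λ above):

```python
def apply_task_vector(sdar_sd, thinking_sd, base_sd, lam=0.1):
    """θ_new = θ_SDAR + λ·(θ_Thinking − θ_Base), key by key."""
    return {k: sdar_sd[k] + lam * (thinking_sd[k] - base_sd[k])
            for k in sdar_sd}
```

The norm-preserving and MLP-only variants restrict or rescale this same delta; they share the underlying assumption that the delta direction is meaningful in the target model’s weight space.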
Approach 4: Sophisticated Merging
TIES-Merging, DARE-TIES, DELLA. Techniques designed to handle conflicting weight updates.
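TIES-Merging can be sketched for a single tensor as: trim each delta to its largest-magnitude entries, elect a per-coordinate sign by summing, then average only the entries that agree with the elected sign. This is a simplified illustration; the published method also rescales and runs over full models:

```python
import numpy as np

def ties_merge(base, deltas, density=0.5, lam=1.0):
    """Minimal single-tensor TIES sketch: trim, elect sign, disjoint merge."""
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.size))
        thresh = np.sort(np.abs(d), axis=None)[-k]  # keep top-k magnitudes
        trimmed.append(np.where(np.abs(d) >= thresh, d, 0.0))
    stack = np.stack(trimmed)               # (n_models, *shape)
    sign = np.sign(stack.sum(axis=0))       # elected sign per coordinate
    agree = (np.sign(stack) == sign) & (stack != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged = (stack * agree).sum(axis=0) / counts
    return base + lam * merged
```

DARE adds random dropping and rescaling of delta entries before merging; DELLA makes the drop probabilities magnitude-dependent.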
Approach 5: Activation Surgery
Train bottleneck modules to transform AR activations into diffusion-compatible representations.
Part 3: Results
GSM8K Results
| Model | GSM8K (n=1319) | Notes |
|---|---|---|
| Qwen3-4B-Thinking | ~95% | AR reasoning model |
| SDAR-4B baseline | 86-89% | Diffusion baseline |
| SDAR-4B-FT (Fresh300) | ~88% | LoRA finetuned on reasoning |
| Full SLERP 90/10 | 87% | = baseline |
| L6 merge 70/30 | 79% | < baseline |
| L6 merge 90/10 | 61-80% | < baseline |
| L6 swap 100 | 58-80% | < baseline |
| Task vectors (all configs) | 0% | Model collapsed |
| TIES, DARE, DELLA | 0% | Model collapsed |
The L6 merges performed worse than baseline SDAR. Early small-sample tests (n=16) showed promising results (up to 100%), but full evaluation revealed this was sample variance. The LoRA finetuned model (Fresh300) maintains baseline performance, showing native finetuning works.
AIME24 Results (Harder Benchmark)
| Model | AIME24 Pass@8 | Tokens | Notes |
|---|---|---|---|
| SDAR-4B baseline | 20% | 32K | Diffusion baseline |
| SDAR-4B-FT (Fresh300) | 20% | 32K | LoRA finetuned |
| L6 swap 100 | 20% | 8K | = baseline |
| L6 merge 70/30 | 23% | 8K | 26/30 problems (timeout) |
| L6 merge 90/10 | 17% | 8K | < baseline |
| L6 merge 50/50 | 13% | 8K | < baseline |
| Full SLERP 90/10 | 23% | 2K | Short context |
On the harder AIME24 benchmark, no merge configuration beat baseline SDAR. The L6 merge 70/30 achieved 23%, but this is within noise of the 20% baseline and required 8K tokens of reasoning. All other configurations performed at or below baseline.
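For context, Pass@8 is computed with the standard unbiased estimator from the Codex paper: given n sampled completions of which c are correct, it is the probability that at least one of k drawn completions is correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: P(at least one of k draws from n samples is correct)."""
    if n - c < k:       # too few failures to fill a k-draw with all-wrong
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```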
Task Vectors Don’t Transfer
Task vectors didn’t just fail. They destroyed the model:
| Method | Sample Output |
|---|---|
| DARE-TIES | <\|endoftext\|> (immediate termination) |
| DELLA | "eagerly eagerly eagerly murdered murdered..." |
| TIES | "" (empty string) |
Compare to baseline SDAR:
To solve this, I need to find how many apples Janet has in total.
Janet starts with 10 apples and buys 5 more.
10 + 5 = 15
The answer is \\boxed{15}
Part 4: The Geometry Behind This
Why We Analyzed the Weight Space
The results above tell us what doesn’t work, but not why. To understand that, we looked at the geometry of how these models learn.
We computed deltas between four models:
| Delta | Formula | Description |
|---|---|---|
| D1 | Thinking_AR − Base_AR | AR finetuning direction |
| D2 | FT_Diff − Base_Diff | Diffusion finetuning direction |
| D3 | Base_AR − Base_Diff | Mode difference |
The question: do AR and diffusion learn reasoning in similar directions?
They Don’t
cos(D1, D2) = 0.001
AR and diffusion finetuning directions are orthogonal. Geometrically perpendicular in weight space.
Think of it this way: if you want to go North (improve diffusion reasoning), but you push East (apply AR task vector), you make no progress.
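The measurement itself is simple: flatten each finetuning delta into one long vector and take the cosine. A sketch over deltas stored as dicts of arrays:

```python
import numpy as np

def delta_cosine(delta_a, delta_b):
    """Cosine similarity between two weight-space deltas (dicts of arrays)."""
    a = np.concatenate([delta_a[k].ravel() for k in sorted(delta_a)])
    b = np.concatenate([delta_b[k].ravel() for k in sorted(delta_b)])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A value near 0, as we observed for D1 vs D2, means the two finetuning directions share essentially no component.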
Different Layers, Different Learning
AR and diffusion don’t just learn in different directions. They learn in different places:
- AR finetuning (D1): middle layers 14-23, ~22% relative change
- Diffusion finetuning (D2): edge layers 1-10 and 32-33, ~3% relative change
Almost no overlap. And AR makes changes 7.3x larger than diffusion (mean norm 10.13 vs 1.39).
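The per-layer numbers above come from a relative-change profile, which can be sketched like this (state dicts as NumPy arrays; `num_layers=36` matches the Qwen3-4B depth implied by the layer indices in this post):

```python
import numpy as np

def layer_relative_change(ft_sd, base_sd, num_layers=36):
    """Per-layer relative change ‖θ_ft − θ_base‖ / ‖θ_base‖."""
    profile = {}
    for i in range(num_layers):
        prefix = f"model.layers.{i}."
        keys = [k for k in base_sd if k.startswith(prefix)]
        if not keys:
            continue
        diff = np.sqrt(sum(np.sum((ft_sd[k] - base_sd[k]) ** 2) for k in keys))
        norm = np.sqrt(sum(np.sum(base_sd[k] ** 2) for k in keys))
        profile[i] = float(diff / norm)
    return profile
```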
Why Linear Merging Doesn’t Help
Linear merging creates a weighted average of two sets of weights. It doesn’t transfer capabilities; it dilutes both models.
The Layer 6 results show this clearly: replacing SDAR’s Layer 6 with Qwen’s doesn’t improve reasoning. Layer 6 is critical for both models (zeroing it hurts), but they use it differently. Swapping doesn’t transfer the capability. It just substitutes one implementation for another incompatible one.
Activation Surgery: A Different Direction
Weight-space merging fails because AR and diffusion learn in orthogonal subspaces. Activation-space techniques sidestep this entirely by operating on representations rather than parameters.
We tried one approach: train bottleneck modules to make AR activations statistically similar to diffusion activations (measured by CKA). CKA improved by +0.11 on average. Task accuracy dropped to 0%.
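For concreteness, the bottleneck module looked roughly like this (a forward-only NumPy sketch; the dimensions `d` and `r` are illustrative, the real module was trained with a CKA objective, and the residual form keeps it near the identity at initialization):

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck(h, W_down, W_up):
    """Map AR activations h (n, d) toward diffusion space via a low-rank MLP."""
    z = np.maximum(h @ W_down, 0.0)  # ReLU bottleneck of rank r
    return h + z @ W_up              # residual: starts near the identity

d, r = 64, 8                         # illustrative hidden and bottleneck dims
W_down = rng.normal(scale=0.01, size=(d, r))
W_up = rng.normal(scale=0.01, size=(r, d))
```

The small initialization means the module initially passes activations through almost unchanged, so training moves it away from identity only as far as the objective demands.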
This remains an open direction. Unlike weight-space merging, activation surgery doesn’t face the geometric orthogonality barrier.
Part 5: What We Learned
The Core Discovery
AR and diffusion models learn reasoning in orthogonal weight subspaces.
This isn’t a hyperparameter problem: no alignment method can overcome geometric orthogonality unless it explicitly rotates one subspace into the other.
Three Insights
You can’t just copy weights between AR and diffusion models. The orthogonality means you need something beyond weight manipulation, maybe architecture-level bridges or modules designed to be paradigm-agnostic.
Similar-looking activations don’t mean similar capabilities. Our CKA surgery improved statistical similarity while destroying task performance. You can match the shape of representations without preserving what they encode.
Same weights, same architecture, different computation. SDAR and Qwen start from the same base model but end up routing information through different layers. How you generate (AR vs diffusion) shapes what the model learns.
What Actually Works
If you want better reasoning in diffusion models:
LoRA finetuning on reasoning data. Native subspace learning respects how diffusion models represent information. Full finetuning causes catastrophic forgetting.
Train from scratch with reasoning data. The orthogonality suggests reasoning must be learned within the diffusion paradigm, not transferred from AR.
Wait for scale. AR reasoning improved dramatically with scale and data. Diffusion models haven’t received the same investment yet.
Summary
| Approach | Result | Why |
|---|---|---|
| Task vectors | 0% | Orthogonal subspaces (cos = 0.001) |
| TIES, DARE, DELLA | 0% | Same geometric problem |
| Layer-6 merging | ≤ baseline | Creates a broken hybrid |
| Full SLERP | ≤ baseline | Dilutes both models |
| Activation surgery | 0% (CKA objective) | Wrong objective, not wrong paradigm |
| LoRA finetuning | Works | Native subspace learning |
Weight-based merging doesn’t transfer AR reasoning into diffusion models. The same capability has fundamentally different implementations depending on the generation mechanism. But now we know why, and we know that native finetuning still works.
Open Questions
Can subspace alignment methods help? Techniques like Git Re-Basin align weight spaces through permutation. Could they rotate AR’s reasoning subspace into diffusion’s?
Is the orthogonality fundamental or incidental? Would diffusion models trained differently (different data, different objectives) show more alignment with AR?
Architecture design for transferability: Could models be designed with paradigm-agnostic reasoning modules that enable genuine capability portability?
References
Nepal et al. (2025). Layer Importance for Mathematical Reasoning
Ilharco et al. (2023). Editing Models with Task Arithmetic
Kornblith et al. (2019). Similarity of Neural Network Representations Revisited
Hu et al. (2021). LoRA: Low-Rank Adaptation
Yadav et al. (2023). TIES-Merging
Yu et al. (2023). Language Model Merging via DARE