Interesting experiment. I just put something out on arXiv that approaches this from a different angle: instead of binary yes/no steering effects, I looked at whether the vocabulary models use during extended self-examination tracks their actual activation dynamics.
tl;dr: it does. When Llama 70B says "loop" during introspection, autocorrelation in its activations is high (r=0.44, p=0.002); when it says "surge", max activation norm is high (r=0.44, p=0.002). Tested across two architectures (Llama, Qwen), with different vocabulary emerging in each but the same principle: words track activations.
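To make the two metrics concrete, here's a minimal sketch of how they could be computed from an activation trace. This is my own illustrative code, not the paper's: the function names, the lag-1 choice, and the synthetic data are all assumptions.

```python
import numpy as np

def lag1_autocorrelation(acts: np.ndarray) -> float:
    """Lag-1 autocorrelation of the mean activation across generation steps.

    acts: (T, d) array of hidden states over T steps (shape is illustrative).
    """
    series = acts.mean(axis=1)            # collapse hidden dim -> (T,)
    a, b = series[:-1] - series[:-1].mean(), series[1:] - series[1:].mean()
    denom = np.sqrt((a**2).sum() * (b**2).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def max_activation_norm(acts: np.ndarray) -> float:
    """Largest L2 norm of any single step's hidden state."""
    return float(np.linalg.norm(acts, axis=1).max())

# Sanity check on synthetic traces: a slowly drifting trajectory (a "loop"-like
# dynamic) should autocorrelate strongly; white noise should not.
rng = np.random.default_rng(0)
drift = np.cumsum(rng.normal(size=(64, 1)), axis=0) + 0.1 * rng.normal(size=(64, 8))
noise = rng.normal(size=(64, 8))
```

The point of the sketch is just that both quantities are cheap scalar summaries of a trajectory, so they can be correlated against vocabulary use per episode.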
The key control is that the correspondence vanishes in descriptive contexts. The model uses the same words ("loop", "expand") when describing other things, with zero correlation against the metrics. So it's not embedding artifacts or frequency effects; it's specific to self-referential processing.
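The test described above can be sketched as a point-biserial correlation: a 0/1 "word appeared in the self-report" indicator correlated with each episode's activation metric, repeated in a descriptive-context control. This is a toy reconstruction on synthetic data, not the paper's analysis; point-biserial correlation is just Pearson correlation with a binary variable, so plain NumPy suffices.

```python
import numpy as np

def point_biserial(word_used: np.ndarray, metric: np.ndarray) -> float:
    """Correlation between a 0/1 word-usage indicator and a continuous metric."""
    return float(np.corrcoef(word_used.astype(float), metric)[0, 1])

rng = np.random.default_rng(1)

# Toy introspection episodes: the word tends to appear when the metric is high.
metric = rng.normal(size=200)
word_intro = ((metric + rng.normal(size=200)) > 0.5).astype(int)
r_intro = point_biserial(word_intro, metric)

# Control: same word in descriptive contexts, generated independently of the
# metric, so the correlation should sit near zero.
word_ctrl = (rng.random(200) > 0.5).astype(int)
r_ctrl = point_biserial(word_ctrl, metric)
```

A significance test on the real data would add a p-value (e.g. via `scipy.stats.pointbiserialr`), but the contrast between the two correlations is the core of the control.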
Paper: https://arxiv.org/abs/2602.11358
Relevant to the confusion-vs-introspection question: if the model's self-report vocabulary systematically tracks activation dynamics, that's harder to explain as pure noise.
It’s interesting that the quantitative predictions for capabilities (benchmarks & revenue) are getting graded rigorously, but the qualitative claims about alignment remain essentially unfalsifiable at the prediction stage. We can’t grade “the model was aligned” until something goes properly wrong.
The mechanistic interpretability work happening now (steering vectors, circuit analysis) might eventually give us quantitative alignment metrics that are as gradeable as SWE-bench scores. Until then, “aligned” is a claim, not a measurement.