There are ML papers floating around with training methods and architectural tweaks (e.g., Block AttnRes or mHC-lite) that might be incorporated into future models.
It seems plausible to me that replacing standard residual skip connections with something more complicated (a toy sketch follows this list):

- scales intelligence somewhat, but not past the frontier;
- makes it slightly harder for existing interp-flavored techniques to generate understanding;
- doesn’t meaningfully affect the relative performance of linear probes vs. output classifiers for inference-time detection of precursors to high-risk misaligned behavior (a toy comparison appears at the end of this comment).
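For concreteness, here’s a minimal sketch of the kind of replacement I have in mind. Everything here (the class name, the number of streams, the mixing scheme) is made up for illustration and isn’t taken from either of the papers mentioned above:

```python
import torch
import torch.nn as nn

class MixedStreamBlock(nn.Module):
    """Hypothetical stand-in for a 'more complicated' residual scheme:
    instead of the standard x + f(x), keep several residual streams and
    mix them with learned weights before each block. Purely illustrative."""

    def __init__(self, d_model: int, n_streams: int = 2):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Learned stream-mixing matrix, initialized to the identity so
        # training starts from ordinary residual behavior.
        self.mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, d_model)
        mixed = torch.einsum("ij,jbd->ibd", self.mix, streams)
        # Run the block on one mixed stream and write it back to all streams.
        return mixed + self.f(mixed[0]).unsqueeze(0)
```

The relevant feature is just that the identity path is no longer a single clean stream, which is roughly why I’d expect existing interp tooling to have a slightly harder time.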
I do think there are cases where models will be able to manipulate the data they’re feeding into white-box methods in a way that affects verdicts, but it’s hard to see these arising naturally before being demonstrated in more contrived scenarios, and I agree with evhub that this would be harder than circumventing black-box safeguards.
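On the probe-vs-classifier point, the comparison I’m imagining is roughly the following. All dimensions and names are placeholders, and both detectors would be trained on labeled examples of the precursor behavior:

```python
import torch
import torch.nn as nn

d_model, d_output_feats, n_classes = 512, 128, 2  # placeholder sizes

# White-box detector: a linear probe on an intermediate residual activation.
linear_probe = nn.Linear(d_model, n_classes)

# Output-side detector: a small classifier over features of the model's
# outputs (e.g., pooled final-layer logits or an embedding of sampled text).
output_classifier = nn.Sequential(
    nn.Linear(d_output_feats, 64),
    nn.ReLU(),
    nn.Linear(64, n_classes),
)

def detect(hidden_acts: torch.Tensor, output_feats: torch.Tensor):
    """Verdicts from both detectors for a single forward pass."""
    probe_logits = linear_probe(hidden_acts)      # reads internals
    clf_logits = output_classifier(output_feats)  # reads outputs only
    return probe_logits.argmax(-1), clf_logits.argmax(-1)
```

The claim in the list above is then that an architectural change like the one sketched earlier shifts the accuracy of both detectors by about the same amount, rather than selectively degrading the probe.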