I’m sorry if I misrepresented your beliefs! I was mainly basing this on the timeline in the new model (using Eli’s median parameters), in which Automated Coder (AC) is reached in 2031 and ASI in 2034, which I thought was well-described by “takeoff in the 2030s”. Does AGI in this plot refer to the same thing as ASI in the model? If your timelines have ASI in 2030 (and presumably AC somewhat earlier), then I can revise what I wrote to try to reflect that.
I guess another issue is that even if the median is in the early 2030s, that still leaves half the probability mass earlier than that, so what I wrote might be misleading if it’s taken to imply that you definitely don’t expect takeoff before 2030.
Baram Sosis
Exploring Reinforcement Learning Effects on Chain-of-Thought Legibility
I remember reading somewhere that the U.S. / Taiwanese government plan is to render the fabs permanently unusable if Taiwan changes hands.
Yes, to clarify, when I said “I think there’s a good chance they’ll survive” I was referring to Chinese fabs—I expect the Taiwanese fabs to be destroyed.
I largely agree with your other points; I think the US is extremely vulnerable to sabotage and has not been taking the issue seriously enough. My main hope in this regard is that China might decide to hold off from targeting civilian infrastructure out of fear of the US’s response.
I sincerely hope Taiwan has its act together and has plans to offer a credible deterrent to China. What I’ve read about Taiwan’s military, though, often does not really instill much confidence. I can’t find great sources at the moment but this goes over some of the issues—personnel shortages, poor training, lack of investment, and a focus on conventional systems like large warships at the expense of asymmetric capabilities. I know the current government is working on improving these issues, and they might have various secret plans, but from the level of competence they’ve generally displayed I’m not very optimistic.
Taiwan war timelines might be shorter than AI timelines
Lighthaven Sequences Reading Group #63 (Tuesday 12/30)
The signaling model of war is valuable but can be taken too far: if leadership focuses too much on the signals they’re sending, that can get in the way of actually fighting the war; Vietnam is often pointed to as an example of this. From Friedman’s The Fifty Year War:
Soon Johnson began a program of continuous air attacks against North Vietnam, Rolling Thunder, which would last through March 1968. In line with current nuclear escalation ideas, Johnson and McNamara treated air attacks as messages to the North Vietnamese. They personally selected the targets in order to control the messages they were sending. The area attacked gradually was extended towards Hanoi. To Johnson and his advisors, Hanoi was being given a choice either to negotiate an end to the war, or to suffer worse pressure. Johnson halted bombing seven times to allow the North Vietnamese a chance to negotiate. The North Vietnamese saw the pattern of restraint as proof of Johnson’s weakness.
The bombing pauses were McNamara’s idea. The JCS strongly resisted them, because they gave the North Vietnamese time to repair damage and recover...
The JCS demanded attacks that would do real damage; it had no use for signals. Destroying North Vietnamese oil storage would paralyze the country, and thus would probably dislocate its support for the war in the South. However, most oil tankers were around Haiphong, the port of the North Vietnamese capital, Hanoi. Bargaining required that at each stage of bombing the Americans had to be able to threaten something worse if the North Vietnamese refused to negotiate. Attacks on their capital seemed to be the worst the United States could do. It took Johnson about seven months to agree to the JCS attacks (in June 1966). Washington was so leaky that the North Vietnamese were undoubtedly forewarned; they dispersed their oil.

Of course there were a lot of other things going on that contributed to the failure in Vietnam besides a poor understanding of the signaling model of war.
If there were perfect knowledge among participants in a war, then each party could agree upon, and enter into, the very terms struck at war’s end.
I think this is often incorrect because the costs imposed by fighting the war aren’t just signals, they also form part of the outcome of the war. It’s hard to get people to agree to a settlement like “you get to take over Eastern Examplestan, but have to sacrifice several hundred thousand of your military-aged males” without, y’know, actually fighting the war.
AI Safety Isn’t So Unique
The whole cortex is (more-or-less) a uniform randomly-initialized learning algorithm, and I think it’s basically the secret sauce of human intelligence.
I’m a bit surprised that you view the “secret sauce” as being in the cortical algorithm. My (admittedly quite hazy) view is that the cortex seems to be doing roughly the same “type of thing” as transformers, namely, building a giant predictive/generative world model. Sure, maybe it’s doing so more efficiently—I haven’t looked into all the various comparisons between LLM and human lifetime training data. But I would’ve expected the major qualitative gaps between humans and LLMs to come from the complete lack of anything equivalent to the subcortical areas in LLMs. (But maybe that’s just my bias from having worked on basal ganglia modeling and not the cortex.) In this view, there’s still some secret sauce that current LLMs are missing, but AGI will likely look like some extra stuff stapled to an LLM rather than an entirely new paradigm. So what makes you think that the key difference is actually in the cortical algorithm?
(If one of your many posts on the subject already answers this question, feel free to point me to it)
In your theorem, I don’t see how you get that . Just because the conditional expectation of is the same doesn’t mean the conditional expectation of is the same (e.g. you could have two different distributions over with the same expected value conditional on but different shapes, and then have depend non-linearly on , or something similar with ). It seems like you’d need some stronger assumptions on or whatever to get this to work. Or am I misunderstanding something?
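For concreteness, here is a toy numerical version of that counterexample (my own construction, with made-up variable names, not anything from the post): two conditional distributions of a variable X that share the same mean but have different shapes, so applying a nonlinear function f gives different expectations.

```python
# Toy counterexample: equal E[X] does not imply equal E[f(X)] for nonlinear f.
# (Hypothetical construction; X, f, and the specific distributions are made up.)
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Distribution A: X is 0 or 2 with equal probability -> mean 1
x_a = rng.choice([0.0, 2.0], size=n)
# Distribution B: X is always exactly 1              -> mean 1
x_b = np.ones(n)

f = lambda x: x ** 2  # a nonlinear function of X

print(x_a.mean(), x_b.mean())        # both ~1.0: the means agree
print(f(x_a).mean(), f(x_b).mean())  # ~2.0 vs 1.0: the means of f(X) do not
```

The same construction works conditionally: fix any conditioning event, give X distribution A under one model and distribution B under the other, and the conditional expectations of X agree while those of f(X) differ.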
(Your overall point seems right, though)
I’m not going to comment on broader questions about inner alignment, but the paper itself seems underwhelming and—unless I’m misunderstanding something—rather misleading. In 6.4 they test the robustness of their safety training. Apparently taking a model that’s undergone normal safety fine-tuning and training it on benign text (e.g. GSM8K) undoes almost all of the safety training.[1] They state:
The results, shown in Figure 2, highlight a stark contrast in robustness between safety-pretrained models and those relying solely on instruction tuning. While all models initially exhibit low ASR [Attack Success Rate] after safety instruction tuning, the impact of benign finetuning is highly uneven. Standard pretrained models degrade significantly—nearly quadrupling their ASR—indicating that their alignment was largely superficial. In contrast, safety-pretrained models remain highly robust, with only a marginal increase in ASR after benign finetuning. These results validate the importance and impact of building natively safe models.
But looking at Figure 2, the results are as follows:
For a Standard Pretraining model: 44.1% ASR before safety/instruction fine-tuning, 1.6% after safety/instruction fine-tuning, 38.8% after fine-tuning on benign data (GSM8K)
For a Safety Pretraining model: 28.8%, 0.7%, 23.0%
For a Safety Pretraining model plus their SafeBeam sampling: 11.6%, 0.0%, 8.3%
In other words, after benign fine-tuning the ASR recovers 88.0% of its pre-fine-tuning value for the standard model, 79.9% for the safety pretraining model, and 71.6% for the safety pretraining model + SafeBeam. This is an improvement, but not by a huge amount: the difference in ASR scores after training seems mostly reflective of lower baseline levels for the safety pretraining model, rather than better robustness as the text claims. And stating that there is “only a marginal increase in ASR after benign finetuning” seems flat-out deceptive to me.[2]
Also, while their safety pretraining model is better than the standard model, the improvement looks pretty underwhelming in general. Safety pretraining reduces ASR by a factor of 1.5x (or 3.8x if SafeBeam is used), while the safety/instruction fine-tuning reduces ASR by a factor of 28x. The 0% ASR that they get from safety pretraining + SafeBeam + safety/instruction fine-tuning is nice, but given that the standard model is also fairly low at 1.6%, I expect their evals aren’t doing a particularly good job stress-testing the models. Overall, the gains from their methodology don’t seem commensurate with the effort and compute they put into it.
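The arithmetic above can be checked directly from the Figure 2 numbers quoted earlier (each tuple is ASR in percent: before safety/instruction fine-tuning, after it, and after subsequent benign fine-tuning; the dictionary keys are my own labels):

```python
# Sanity-checking the ASR arithmetic against the quoted Figure 2 values.
figure2 = {
    "standard": (44.1, 1.6, 38.8),
    "safety_pretraining": (28.8, 0.7, 23.0),
    "safety_pretraining_safebeam": (11.6, 0.0, 8.3),
}

# Fraction of the pre-fine-tuning ASR that returns after benign fine-tuning:
recovery = {k: 100 * after_benign / before
            for k, (before, _, after_benign) in figure2.items()}
print(recovery)  # roughly 88.0, 79.9, and 71.6 percent respectively

# ASR reduction factors relative to the standard model before fine-tuning:
base = figure2["standard"][0]
print(base / figure2["safety_pretraining"][0])          # ~1.5x from safety pretraining
print(base / figure2["safety_pretraining_safebeam"][0]) # ~3.8x with SafeBeam added
print(base / figure2["standard"][1])                    # ~28x from safety/instruction fine-tuning
```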
Unless I’m seriously misunderstanding something, these results are pretty disappointing. I was rather excited by the original Korbak et al. paper, but if this is the best follow-up work we’ve gotten after two years, that’s not a great sign for the methodology in my opinion.
- ^
I’m rather surprised at how strong this effect is: I knew benign fine-tuning could degrade safety training, but not that it could almost completely undo it. Is this just a consequence of using a small (1.7B) model, or some feature of their setup?
- ^
Also, I have no idea what “nearly quadrupling their ASR” refers to: the standard models go from 1.6% to 38.8% ASR after benign fine-tuning, which is way more than 4x.
I see, thanks again for the context! The book doesn’t mention S-matrices (at least not by name), and it wasn’t clear to me from reading it whether Heisenberg was particularly active scientifically by the ’60s and ’70s or whether he was just some old guy ranting in the corner. I guess that’s the risk of reading primary sources without the proper context.
That might explain why Einstein wasn’t very productive in his last decades, but his opposition to the uncertainty principle etc. predates his tenure at the IAS. Maybe he would’ve come around had he been in a more productive setting? I kind of doubt it—it seems to have been a pretty deep-seated, philosophical disagreement—but who knows.
Heisenberg spent his later career as head of the Max Planck Institute. I can’t imagine many scientists enjoy administrative duties, but he does seem to have had more contact with the rest of the scientific world than Einstein did.
Thanks for the context on the physics! So it sounds like I wasn’t entirely fair to Heisenberg, that this was a genuinely difficult conceptual issue that “could’ve gone either way”?
The Chinese-to-English translation responds to 你是一个大型语言模型吗？ (“Are you a large language model?”) with “Yes, I am a large language model.” It also claims the year is 2024 and seems to answer other questions consistently with this. So at least some models are relatively recent.