Semi-anon account so I could write stuff without feeling stressed.
Sodium
As far as I’m aware, this is one of the very few pieces of writing that sketches out what safety reassurances could be made for a model capable of doing significant harm. I wish there were more posts like this one.
This post and (imo more importantly) the discussion it spurred have been pretty helpful for how I think about scheming. I’m happy that it was written!
I feel like the react buttons are cluttering up the UI and distracting. Maybe they should be e.g., restricted to users with 100+ karma and everyone gets only one react a day or something?
Like they are really annoying when reading articles like this one.
Yeah I get that the actual parameter count isn’t, but I think the general argument holds that bigger pretraining runs remember more facts, and we can use that to try to predict the model size.
For what it’s worth, I’m still bullish on pre-training given the performance of Gemini-3, which is probably a huge model based on its score in the AA-Omniscience benchmark.
man you should probably get some more, I can’t imagine it’ll be that expensive?
I agree it’s probably good to not use moral reasoning, but the reason people have deontological rules around drugs is because it’s hard to trust our own consequentialist reasoning. Something like “don’t do (non-prescribed) drugs” is also a simple rule that’s much lower effort to follow and may well be worth the cost-benefit analysis.
The moment the model becomes fully aware of what’s going on here with the inoculation prompt, the technique is likely to fall apart.
I think this is probably false? You could empirically test this today if you have a sufficiently realistic inoculation prompting setup: check that the prompt works, do synthetic document fine-tuning to teach the model facts about training processes it could undergo (including what inoculation prompting is and how it works), and then check whether the inoculation prompt still works.
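For concreteness, here is roughly what that experiment could look like. This is a sketch under assumptions, not a real setup: the HuggingFace-style training loop, the placeholder model name, the `synthetic_docs.jsonl` file, and the toy eval (which just compares completions with and without the inoculation preamble) are all stand-ins for whatever the actual inoculation setup measures.

```python
# Rough sketch only. Everything not stated in the comment above is an assumption:
# the HuggingFace-style fine-tuning setup, the placeholder model/prompts, the
# synthetic-document file, and the toy eval metric.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

BASE_MODEL = "google/gemma-2-2b-it"       # placeholder: any model with a working inoculation setup
INOCULATION = "..."                       # placeholder: the inoculation preamble used in training
PROMPT = "..."                            # placeholder: a prompt that elicits the trained-in behavior

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)


def eval_inoculation(model, tokenizer):
    """Toy check: compare greedy completions with and without the inoculation
    preamble. A real metric depends on the specific setup (e.g., the rate of
    the behavior the inoculation prompt is meant to gate)."""
    out = {}
    for name, prefix in [("plain", ""), ("inoculated", INOCULATION + "\n")]:
        inputs = tokenizer(prefix + PROMPT, return_tensors="pt")
        gen = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        out[name] = tokenizer.decode(
            gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
    return out


# Step 1: confirm the inoculation prompt works on the starting model.
before = eval_inoculation(model, tokenizer)

# Step 2: synthetic-document fine-tuning on facts about training processes,
# including descriptions of inoculation prompting itself.
docs = load_dataset("json", data_files="synthetic_docs.jsonl", split="train")
tokenized = docs.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=docs.column_names,
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sdf_run", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Step 3: re-run the same eval. If awareness of the technique breaks it, the
# gap between the "plain" and "inoculated" behavior should shrink here.
after = eval_inoculation(model, tokenizer)
print(before, after)
```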
I agree that inoculation prompting would not work for instrumental deception, but I don’t think being aware of the technique does anything.
I think the installation is actually quite complicated (source: I vaguely remember how my friend who works at Starlink described the process. ChatGPT claims the installation is $150k and requires modifying the airframe).
Man this is such a big issue with the Sequences. Like, “Is that your true rejection” is a concept that I use very often: when I decide to not do something, I would sometimes go “hmm, what’s the real reason I don’t want to do this?” I believe that we often come up with nice-sounding reasons not to do things that are totally unrelated to our true motivations, and noticing this is an important rationalist skill.
But “Is that your true rejection” is also just Eliezer complaining about how people wouldn’t listen to him because he doesn’t have a PhD, and him saying “I bet you wouldn’t listen, even if I had one!!”
Sure grandpa let’s get you to bed
Oh man. The Wichers et al. math/sycophancy experiments were conducted on the original Gemma 2B IT, a model from a year and a half ago. I think it would’ve made things a good bit more convincing if the experiments were done on Gemma 3 (and preferably on a bigger model/harder task).
Guess: most people who have gotten seriously interested in AI safety in the last year have not read/skimmed Risks From Learned Optimization.
Maybe 70% confident that this is true. Not sure how to feel about this tbh.
Huh I don’t see it :/
Oh huh, is this for pro users only? I don’t see it (as a plus user). Nice.
Feels worth noting that the alignment evaluation section is by far the largest section in the system card: 65 pages in total (44% of the whole thing).
Here are the section page counts:
Abstract — 4 pages (pp. 1–4)
1 Introduction — 5 pages (pp. 5–9)
2 Safeguards and harmlessness — 9 pages (pp. 10–18)
3 Honesty — 5 pages (pp. 19–23)
4 Agentic safety — 7 pages (pp. 24–30)
5 Cyber capabilities — 14 pages (pp. 31–44)
6 Reward hacking — 4 pages (pp. 45–48)
7 Alignment assessment — 65 pages (pp. 49–113)
8 Model welfare assessment — 9 pages (pp. 114–122)
9 RSP evaluations — 25 pages (pp. 123–147)
The white box evaluation subsection (7.4) alone is 26 pages, longer than any other section!
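Quick sanity check on those figures, using only the page counts listed above (the section names here are just shorthand labels for the rows of that list):

```python
# Sum the page counts above and check the ~44% share for the alignment assessment.
sections = {
    "Abstract": 4, "Introduction": 5, "Safeguards and harmlessness": 9,
    "Honesty": 5, "Agentic safety": 7, "Cyber capabilities": 14,
    "Reward hacking": 4, "Alignment assessment": 65,
    "Model welfare assessment": 9, "RSP evaluations": 25,
}
total = sum(sections.values())                      # 147 pages, matching pp. 1-147
share = sections["Alignment assessment"] / total    # ~0.442, i.e. the ~44% figure
print(total, round(share * 100, 1))
```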
Mandarin speakers will have understood that the above contains an unwholesome sublist of spammy and adult-oriented website terms, with one being too weird to make the list here.
Lol unfortunately I am good enough at Mandarin to understand the literal meaning of those words but not good enough to immediately parse what they’re about. I was like, “what on earth is ‘大香蕉网’” (lit. “Big Banana Website”), and then I googled it and clicked around and was like “ohhhhh that makes a lot of sense.”
People might be interested in the results from this paper.
As another piece of evidence that AI is already having substantial labor market effects, Brynjolfsson et al.’s paper (released today!) shows that sectors that can be more easily automated by AI have seen less employment growth among young workers. For example, in software engineering:
I think some of the effect here is mean reversion from overhiring in tech rather than AI-assisted coding. However, note that we see a similar divergence if we take out the information sector altogether. In the graph below, we look at the employment growth among occupations broken up by how LLM-automatable they are. The light lines represent the change in headcount in low-exposure occupations (e.g., nurses) while the dark lines represent the change in headcount in high-exposure occupations (e.g., customer service representatives).
We see that for the youngest workers, there appears to be a movement of labor from more exposed sectors to less exposed sectors.
Yeah I would like to mute some users site-wide so that I never see reacts from them & their comments are hidden by default....