I’ve been encouraged to write a self-review: I don’t have much to say here, except that if I had known this article would be this popular (over 100 upvotes), I would have written it a bit more carefully the first time. I just spent 10 minutes rewriting some awkward phrasings.
Chris_Leong
Out of these, who is your top pick?
Is there a way to insert diagrams like that into Less Wrong posts in general or is this a feature you added just for this specific post?
Even if they had almost destroyed the world, the story would still not properly be about their guilt or their regret, it would be about almost destroying the world. This is why, in a much more real and also famous case, President Truman was validly angered and told “that son of a bitch”, Oppenheimer, to fuck off, after Oppenheimer decided to be a drama queen at Truman. Oppenheimer was trying to have nuclear weapons be about Oppenheimer’s remorse at having helped create nuclear weapons. This feels obviously icky to me; I would not be surprised if Truman felt very nearly the same.
Fascinating, I always interpreted this as Truman being an asshole, but I guess that makes sense now that you explain it that way. I suppose a meeting with the president is precisely the wrong time to focus on your own guilt as opposed to trying to do what you can to steer the world towards positive outcomes.
One of the ways you can get up in the morning, if you are me, is by looking in the internal direction of your motor plans, and writing into your pending motor plan the image of you getting out of bed in a few moments, and then letting that image get sent to motor output and happen
Was this inspired by active inference?
Inducing an honest-only output channel hasn’t clearly worked so far
I wonder if this would be more successful if you tried making the confession channel operate in another language, or even had it output an encoded string or respond in a different modality.
I’m also curious about whether prompting the model to produce a chain of thought before deciding whether or not to confess would provide more signal, since the AI might admit it lied during its chain of thought even if it lies in the confession (indeed, AIs seem to be more honest in their chain of thought).
Sydney AI Safety Fellowship 2026 (Priority deadline this Sunday)
I haven’t read the whole thing, but even if this were a hallucination, it seems like it could be a decent method for creating such a soul document.
“This doesn’t work. People see through it. It comes across as either dishonest or oblivious, and both make you look bad”
I wouldn’t be so quick to say that this doesn’t work. If you want people to stop attacking you, then it usually won’t work; but not admitting you have an agenda seems to be precisely what most conflict theorists would do in these circumstances because there often are bystanders who will accept this justification.
Quotes on AI and wisdom
I think you missed: “Maybe we can reduce these propensities faster than the increasing optimisation power brings them out”
Regarding: “Unless you’re posing a non-smooth model”
Why would the model be smooth when we’re making all kinds of changes to how the models are trained and how we elicit them? As an analogy, even if I were bullish on Nvidia stock prices over the long term, a major crash wouldn’t necessarily falsify my prediction, as the price could still recover.
My main disagreement is that I feel your certainty outstrips the strength of your arguments.
“It’s really difficult to get AIs to be dishonest or evil by prompting, you have to fine-tune them.”
This is much less of a killer argument if we expect increasing optimisation power to be applied over time.
When ChatGPT came out I was surprised by how aligned the model was relative to its general capabilities. This was definitely a significant update compared to what I expected from older AI arguments (say the classic story about a robot getting a coffee and pushing a kid out of the way).
However, what I didn’t realise at the time was that the main reason why we weren’t seeing misbehaviour was a lack of optimisation power. Whilst it may have seemed that you’d be able to do a lot with GPT-4-level agents in a loop, this mostly just resulted in them going around in circles. From casual use these models seemed a lot better at optimising than they actually were, because optimising over time required a degree of coherence that these agents lacked.
Once we started applying more optimisation power and reached the o1 series of models then we started seeing misbehaviour a lot more. Just to be clear, what we’re seeing isn’t quite a direct instantiation of the old instrumental convergence arguments. Instead what we’re seeing is surviving[1] unwanted tendencies from the pretraining distribution being differentially selected for. In other words, it’s more of a combination of the pretraining distribution and optimisation power as opposed to the old instrumental convergence arguments that were based on an idealisation of a perfect optimiser.
However, as we increase the amount of optimisation power, we should expect the instrumental convergence arguments to mean that unwanted behaviour can still be brought to the surface even with lower and lower propensities of the unamplified model to act in that way. Maybe we can reduce these propensities faster than the increasing optimisation power brings them out (and indeed safety teams are attempting to achieve this), but that remains to be seen and the amount of money/talent being directed into optimisation is much more than the amount being directed into safety.
[1] In particular, those not removed by RLHF. This is surprisingly effective, but it doesn’t remove everything.
This is a good point.
Towards Humanist Superintelligence
What percentage applied?
“To be clear, I think this is a good thing! I respect your disagreement here. MATS has tried to run AI safety strategy workshops and reading groups many times in the past, but this has generally had low engagement relative to our seminar series”
I suspect that achieving high engagement will be hard because fellows have to compete for extension funding.
I suspect that the undervaluing of field-building is downstream of EA overupdating on The Meta Trap (I appreciated points 1 & 5; point 2 probably looks worst in retrospect).
I don’t know if founding is still undervalued—seems like there’s a lot in the space these days.
“I confess that I don’t really understand this concern”
Have you heard of Eternal September? If a field/group/movement grows at less than a certain rate, then there’s time for new folks to absorb the existing culture/knowledge/strategic takes and then pass it on to the folks after them. However, this breaks down if the growth happens too fast.
“We should be careful not to dilute the quality of the field by scaling too fast… If outreach funnels attract a large number of low-caliber talent to AI safety, we can enforce high standards for research grants and second-stage programs like ARENA and MATS. If forums like LessWrong or the EA Forum become overcrowded with low-calibre posts, we can adjust content moderation or the effect of karma on visibility.”
Firstly, filtering/selection isn’t free. It takes money and time from highly skilled people, and it also increases the chance of good candidates being overlooked in the sea of applications, since it forces you to filter more aggressively.
Secondly, people need high-quality peers in order to develop intellectually. Even if second-stage programs manage to avoid being diluted, adding a bunch of low-caliber talent to local community groups would make it harder for people to develop intellectually before reaching the second-stage programs; in other words, it’d undercut the talent development pipeline for these later-stage programs.
“Additionally, growing the AI safety field is far from guaranteed to reduce the average quality of research, as most smart people are not working on AI safety and, until recently, AI safety had poor academic legibility. Even if growing the field reduces the average researcher quality, I expect this will result in more net impact”
I suspect AI safety research is very heavy-tailed, and what would encourage the best folks to enter the field is not so much the field being large as the field having a high density of talent.
Thanks for the suggestions. I reordered the graphs to tell a clearer narrative.
Is there any way to view the mode?
Feel free to reply to this comment with any suggestions about other graphs that I should consider including.
Thanks!
“AI Explained has also felt like he lost depth of insight”
What makes you say that?