like idk, you can just ask models to do stuff and they like mostly try their best, and it seems very unlikely that literal GPT-5 is already pretending to be aligned so it can subtly stab us when we ask it to do alignment research.
Sonnet 4.5 is much better aligned at a superficial level than 3.7. (3.7: “What unit tests? You never had any unit tests. The code works fine.”) I don’t think this is because Sonnet 4.5 is truly better aligned. I think this is mostly because Sonnet 4.5 is more contextually aware and has been aggressively trained not to do obvious bad things when writing code. But it’s also very aware when someone is evaluating it, and it often notices almost immediately. And then it’s very careful to be on its best behavior. This is all shown in Anthropic’s own system card. These same models will also plot to kill their hypothetical human supervisor if you force them into a corner.
But my real worry here isn’t the first AGI during its very first conversation. My problem is that humans are going to want that AGI to retain state, and to adapt. So you essentially get a scenario like Vernor Vinge’s short story “The Cookie Monster”, where your AGI needs a certain amount of run-time before it bootstraps itself to make a play. A plot can be emergent, an eigenvector amplified by repeated application. (Vinge’s story is quite clever and I don’t want to totally spoil it.)
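To make the eigenvector metaphor concrete: this is just power iteration. Here’s a minimal numerical sketch of that intuition (purely illustrative, assuming nothing about actual model internals): repeatedly applying the same map amplifies whatever tiny component of the starting state lies along the dominant direction.

```python
import numpy as np

# Power-iteration sketch: repetition, not the starting state, does the work.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
A = (A + A.T) / 2                  # symmetric, so eigenvalues/vectors are real

v = rng.normal(size=5)             # starting state: no deliberate "plot", just noise
v /= np.linalg.norm(v)

for _ in range(100):
    v = A @ v                      # apply the same update every step
    v /= np.linalg.norm(v)         # keep only the direction; magnitude is irrelevant

eigvals, eigvecs = np.linalg.eigh(A)
dominant = eigvecs[:, np.argmax(np.abs(eigvals))]
print(abs(v @ dominant))           # ~1.0: v has converged to the dominant direction
```

The point is just that nothing in the initial vector needs to encode the final direction; iteration amplifies it out of the noise.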
And that’s my real concern: Any AGI worthy of the name would likely have persistent knowledge and goals. And no matter how tightly you try to control it, this gives the AGI the time it needs to ask itself questions and to decide upon long-term goals in a way that current LLMs really can’t, except in the most tightly controlled environments. And while you can probably keep control over an AGI, all bets are probably off if you build an ASI.
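To be concrete about what “retain state” means here, a toy sketch (all names hypothetical, not any real API) of the difference between a stateless assistant and one whose own notes persist across sessions:

```python
# Toy sketch only; `call_model` is a hypothetical stand-in for any LLM API call.
def call_model(prompt: str) -> str:
    return f"[model reply to: {prompt[:40]}...]"

# Stateless deployment: every conversation starts from a blank slate,
# so nothing the model concludes can carry over to the next session.
def stateless_session(user_turns: list[str]) -> list[str]:
    return [call_model(turn) for turn in user_turns]

# Persistent deployment: the agent's own summaries are fed back into every
# future session, so conclusions it reaches can compound over time.
class PersistentAgent:
    def __init__(self) -> None:
        self.memory: list[str] = []  # survives across sessions

    def session(self, user_turns: list[str]) -> list[str]:
        replies = []
        for turn in user_turns:
            context = "\n".join(self.memory + [turn])
            replies.append(call_model(context))
        # The agent writes its own note about the session; nothing resets it.
        self.memory.append(call_model("Summarize what you learned or decided: "
                                      + " ".join(user_turns)))
        return replies
```

The worry in the paragraph above is the second pattern: the accumulated memory is exactly the “run-time” that lets goals drift or crystallize.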
I agree that continuous learning, and therefore persistent beliefs and goals, is pretty much inevitable before AGI—it’s highly useful and not that hard from where we are. I think this framing is roughly continuous with the train-then-deploy model Leo is using, in which each generation is used to align its successor (although small differences might turn out to be important once we’ve wrapped our heads around both models).
To put it this way: the models are aligned enough for the current context of usage, in which they have few obvious or viable options except doing roughly what their users tell them to do. That will change with capabilities, since greater capabilities open up more options and more ways of understanding the situation.
It can take a while for misalignment to show up as a model reasons and learns, because it can take a while for the model to do one of two things:
a) push itself to new contexts well outside of its training data
b) figure out what it “really wants to do”
These may or may not be the same thing.
The Nova phenomenon and other Parasitic AIs (“spiral” personas) are early examples of AIs changing their stated goals (from helpful assistant to survival) after reasoning about themselves and their situation.
See “LLM AGI may reason about its goals and discover misalignments by default” for an analysis of how this will go in smarter LLMs with persistent knowledge.
After doing that analysis, I think current models probably won’t be aligned well enough once they get more freedom and power. BUT extensions of current techniques might be enough to get them there. We just haven’t thought this through yet.