They do a bit of a mock-up of “hyperstitioning misaligned AI” in reverse. I didn’t mention the part where, after training it on a bunch of concentrated positive stories, they also do a mock-up of post-training.
Here are just two countertheories that also agree with that data:
It previously acted based on the Chekhov’s guns and based on the many human-human blackmail stories it had read, filling in itself as the disgruntled human employee. Directly training on a ton of positive stories just overwhelmed that.
Its RL post-training actually gave it some weak power-seeking drive, and its world model kind of told it that the blackmail path would lead to power. Again, directly training on a ton of positive stories just overwhelmed that.
I think in some sense it does provide some evidence for their claim (it is a theory compatible with the data), but not in a sense that would be used for scientific communication, nor is it enough evidence that this should make them believe the claim is true (as they write in their tweet).