I don’t know what you’re trying to say here. Some set of important people write the spec. Then the alignment team RLHFs the models to follow the spec. If we imagine this process continuing, then either:
Sam has to put “make Sam god-emperor” in the spec, a public document.
Or Sam has to start a conspiracy with the alignment team and everyone else involved in the RLHF and testing process to publish one spec publicly, but secretly align the AI to another.
I’m claiming either of those options is hard.
(I do think in the future there may be some kind of automated pipeline, such that someone feeds the spec to the AIs, and some other AIs take care of the process of aligning the new AIs to it, but that just regresses the problem.)
I’m saying that you’re making a questionable leap from:
Then the alignment team RLHFs the models to follow the spec.
to “the model follows whatever is written in the spec”. You were saying that “current LLMs are basically aligned so they must be following the spec” but that’s not how things work. Different companies have different specs and the LLMs end up being useful in pretty similar ways. In other words, you had a false dichotomy between:
the model is totally unaligned
the model is perfectly following whatever is written in the spec, as best it can do anything at all
If I were Sam, I would try to keep the definition of “the spec, a public document” such that I can unilaterally replace it when the right moment comes.
For example, “the spec” is defined as the latest version of a document that was signed by OpenAI key and published at openai/spec.html… and I keep a copy of the key and the access rights to the public website… so at the last moment I update the spec, sign it with the key, upload it to the website, and tell the AI “hey, the spec is updated”.
Basically, the coup is a composition of multiple steps, each seemingly harmless when viewed in isolation. It could be made even more indirect: for example, I wouldn’t have access rights to the public website per se, but there would exist a mechanism to update documents on the public website, and I could tell it to upload the new signed spec. Or a mechanism to restore the public website from a backup, and I could modify the backup. Etc.
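To make the vulnerability concrete, here is a toy sketch of the update mechanism described above. All names are hypothetical, and HMAC with a shared secret stands in for a real public-key signature scheme; the point is only that a loader which accepts “whatever is validly signed” has no notion of which document is legitimate:

```python
import hmac
import hashlib

# Hypothetical signing key -- whoever holds it controls "the spec".
SIGNING_KEY = b"openai-signing-key"

def sign(document: bytes) -> str:
    """Toy signature: HMAC-SHA256 stands in for a real signature scheme."""
    return hmac.new(SIGNING_KEY, document, hashlib.sha256).hexdigest()

def load_spec(document: bytes, signature: str) -> bytes:
    """The AI's spec loader: accept any document bearing a valid signature."""
    if not hmac.compare_digest(sign(document), signature):
        raise ValueError("invalid signature, not the spec")
    return document

# The public spec, signed and accepted.
public_spec = b"Serve all of humanity."
spec = load_spec(public_spec, sign(public_spec))

# A key holder can sign a replacement at any moment, and the same
# check passes -- nothing in the mechanism distinguishes the two.
new_spec = b"Obey Sam."
spec = load_spec(new_spec, sign(new_spec))
```

The check rejects tampered documents, but it cannot reject an authorized swap: the key, not the public document, is the real root of trust.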