I don’t know what you’re trying to say here. Some set of important people write the spec. Then the alignment team RLHFs the models to follow the spec. If we imagine this process continuing, then either:
Sam has to put “make Sam god-emperor” in the spec, a public document.
Or Sam has to start a conspiracy with the alignment team and everyone else involved in the RLHF and testing process to publish one spec publicly, but secretly align the AI to another.
I’m claiming both of those options are hard.
(I do think in the future there may be some kind of automated pipeline, such that someone feeds the spec to the AIs and some other AIs take care of the process of aligning the new AIs to it, but that just regresses the problem.)
Very interesting, thank you.
Please excuse my technical ignorance, but is it possible to expand an existing AI model? That is, instead of training Opus 5 from scratch, could Anthropic use those same computational resources to gradually add more parameters to Opus 3, making it bigger and smarter while continuing to exploit its existing attractor basin?
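For what it’s worth, to make the question concrete: I believe there is published work in roughly this direction, e.g. Net2Net (Chen et al., 2015), which describes “function-preserving” transformations that widen or deepen a trained network so the bigger network initially computes exactly the same function, and training then resumes from there. Here’s a rough numpy sketch of the widening trick on a toy two-layer MLP — purely illustrative, with made-up names and shapes, not anyone’s actual training pipeline:

```python
import numpy as np

def net2wider(W1, b1, W2, new_width, rng=None):
    """Widen a hidden layer without changing the network's function
    (Net2Net-style widening). Shapes:
      W1: (hidden, d_in)   weights into the hidden layer
      b1: (hidden,)        hidden-layer biases
      W2: (d_out, hidden)  weights out of the hidden layer
    """
    if rng is None:
        rng = np.random.default_rng(0)
    hidden = W1.shape[0]
    assert new_width > hidden, "can only grow the layer"
    # Each new unit is a copy of a randomly chosen existing unit.
    mapping = np.concatenate(
        [np.arange(hidden), rng.integers(0, hidden, new_width - hidden)]
    )
    counts = np.bincount(mapping, minlength=hidden)  # copies per original unit
    W1_new = W1[mapping]  # duplicate incoming weights and biases
    b1_new = b1[mapping]
    # Split each outgoing weight across its copies so their sum is unchanged.
    W2_new = W2[:, mapping] / counts[mapping]
    return W1_new, b1_new, W2_new

# Sanity check: outputs match exactly before and after widening.
rng = np.random.default_rng(42)
W1, b1, W2 = rng.normal(size=(8, 4)), rng.normal(size=8), rng.normal(size=(3, 8))
x = rng.normal(size=4)
relu = lambda z: np.maximum(z, 0.0)
before = W2 @ relu(W1 @ x + b1)
W1w, b1w, W2w = net2wider(W1, b1, W2, new_width=12, rng=rng)
after = W2w @ relu(W1w @ x + b1w)
assert np.allclose(before, after)
```

The point is that the widened network starts out computing exactly what the old one did, so further training continues from the old model’s basin rather than from scratch — whether anything like this scales to frontier-size models is exactly what I’m asking.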