Epistemic Status: *You Get About Five Words*; explaining my top takeaway from the post.
> Of course this difference is, to some degree, not so much a difference in the model behavior training of either company as a difference in the documents that each chooses to make public.
This sentence weakly implies the Five Words: “instrumental convergence of model design”. If we take this principle further, we get the idea that Claude N, ChatGPT N and Gemini N are all likely to be very similar as N increases.
This has implications for alignment work that I’m not quite sure how to think about:
- Any current technique that only works on some SoTA models is brittle;
- If Lab X solves a key part of the alignment problem internally (e.g. corrigibility), then either the change makes the models more robustly useful and is quickly adopted by other labs, or it makes the models slightly worse and there is strong pressure to leave the work in a “safety-related bubble”;
- ???
Agreed. I think a more measured phrasing is: “I am confident that, across a significant number of current AI companies, models in the near future will be more similar than one would expect from reading the differences between (to take a representative example) Claude’s Constitution and OpenAI’s Model Spec.”
I personally hold the strong conclusion (the one you are pushing back on) as plausible enough, and important enough if true, that I want to keep tracking how true I think it is as I gather more information.
If even moderately weak forms of “instrumental convergence of model design” end up clearly false in O(5 years), I would be surprised but not confused.
If anyone with more context on modern AI models than I have disagrees, please say so; knowing multiple perspectives on all of this is useful and makes my world-models better.