On a more meta level, I find it concerning that I found this new way in which AIs could set up rogue deployments roughly a year after having first tried to catalog ways in which AIs could set up a rogue deployment.
I think this was predictable, since “Try to catalogue every way in which an AI could set up a rogue deployment” is an argument from exhaustive free association. I also predict that you will think of another one within a year. Conditional on you thinking of a new method, my gut says there’s a 1⁄2 chance that it was something you could have thought of a year ago, and a 1⁄2 chance that it’s based on some new empirical discovery about LLMs made between a year ago and when you think of it (a la “The model triggers emergent misalignment in a distilled mini version of itself.”)
My guess is that for control of AIs around human level, you should be able to use humans to find all the things that an LLM could have thought of. This is an update against long-in-advance small-scale efforts being good enough, not an update against the general methodology.
I think there are domains where you can point at a small category of things that actually capture the most important threats. This exercise was for me an update in the direction of rogue internal deployments having lots of important and non-trivial subcategories, as opposed to few.
Re 1: LLMs have spiky capability levels which don’t line up with “human level” nicely, so I’m not convinced this is the case. For example, they can do extremely good author attribution, come up with seed prompts to induce “personas” in other instances of themselves, etc.
Re 2: Maybe? I’m skeptical here. If the solution space contains elements like “Call an external API”, then I expect there to be a Large (i.e. un-enumerable) number of solutions.
At the intersection of the two: would we have predicted in advance something like “GPT-4o, which is still dumber than most humans, has a persuasive/charismatic persona which people become highly attached to, to the point where OpenAI could not shut it off without significant backlash”?
An internally-deployed AI working on code with employees will be interacting with them in an intellectual capacity (collaborating on code and intellectual problems), which means it will have opportunities to gain favour/influence with them the same way 4o did with a bunch of casual users. (I don’t know if this was part of your threat model a year ago, FWIW.)
(I had elements on persuasion which I think roughly covered what you are describing here. I also had elements on poisoning the data of future models which covers the “triggering emergent misalignment of AIs it trains” you describe above.)