> This has also been my direct experience studying and researching open-source models at Conjecture.
Interesting! Assuming it’s public, what are some of the most surprising things you’ve found open-source models to be capable of that people previously assumed they couldn’t do?
This matters for advocacy for pausing AI, or, failing that, advocacy about how far back the red lines ought to be set. To give a really extreme example, if it turns out that even an old model like GPT-3 could tell the user exactly how to make a novel bioweapon if prompted weirdly, it seems really useful to be able to convince our policy makers of this fact, though the weird prompting technique itself should of course be kept secret.
OK. First of all, this is a cool idea.
However, if someone actually tries this, I can see a particular failure mode that isn’t fleshed out (maybe you do address this later in the post sequence, but I haven’t read the entire sequence yet). This failure mode would probably have fallen under the “Identifying the Principal is Brittle” section, had you listed it, but it’s subtler than the four examples you gave. It’s about exactly *what* the principal is rather than just *who* (territory which only your fourth bullet point started venturing into). Granted, you did mention “avoiding manipulation” in the context of it becoming a bizarre notion if we tried to make the principal a developer console rather than a person, and you get points for having called that out, in that section in particular, as a “place where lots of additional work is needed”.
Anyway, my contention is that the boundaries of the manipulation concept also become increasingly ambiguous as the intelligence gap between the agent and the principal grows. As such, some of these failure modes may only show up once the AI is more capable than the researchers. The particular ones I have in mind here happen when the AI’s model of the principal improves enough to let it manipulate the principal in weird new ways.
To give an extreme motivating example, if there’s a sequence of images you can show the human principle(s) to warp their preferences (like in Snow Crash), we would want the AI’s concept of the principal to count such hypnosis as the victim becoming less faithful representations of the true principal™ in a regrettable way rather than as the principal having a legitimate preference to conditionally want some different thing if they’re shown the images. Unfortunately, these two ways of resolving that ambiguity seem like they would produce identical behavior right up until the AI is smart enough to manipulate the principal in that way.
Put another way: I’m afraid naive attempts to build CAST systems are likely to yield an AI which subtly misidentifies exactly what computational process constitutes the principal, even if it could reliably point a robot arm at the physical human(s) in question. (Sure, we find the Snow Crash example obvious, but suppose the agent is smart enough to see multiple arguments with differing conclusions, all of which the principal would find compelling, or, more broadly, smart enough to model that its principal would end up expressing divergent preferences depending on what inputs they see. Then things get weird.)
I’ll go further and argue that we likely can’t make a robust CAST agent that can scale to superhuman levels unless it can reliably distinguish what counts as the principal making a legitimate update from what counts as the principal having been manipulated (that, or this whole conceptual tangle around agency/desire/etc. that I’m using to model the principal needs to be refactored somehow). False negatives (legitimate updates misread as manipulation) mean situations where the AI can’t be corrected by the principal, since it no longer acknowledges the principal’s authority. False positives (manipulation misread as a legitimate update) become potential avenues the AI could use to get away with manipulating the principal. (If enough weird manipulation avenues exist and aren’t flagged as such, I’d expect the AI to take one of them, since steering your principal into the region of state space where their preferences are easier to satisfy is a good strategy!)
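To make that concrete, here’s a deliberately silly toy sketch (my own illustration, not something from the post); the `show_hypnosis_images` action, the payoff numbers, and both objective functions are invented for the sake of the example. The point is just the one above: the two readings of “the principal’s preferences” are behaviorally indistinguishable until a manipulative action enters the agent’s option set, at which point one of them endorses taking it.

```python
# Toy sketch (my construction, not from the post): two readings of "the
# principal's preferences" that agree on every action until a manipulative
# action becomes available. All names and payoff numbers are made up.

TRUE_PREFERENCE = 1.0  # what the un-manipulated principal wants

def reported_preference_after(action: str) -> float:
    """What the principal would say they want after the agent takes `action`."""
    # The hypnosis images warp the principal into wanting the opposite thing.
    return -1.0 if action == "show_hypnosis_images" else TRUE_PREFERENCE

OUTCOME = {"fetch_coffee": 1.0, "do_nothing": 0.0, "show_hypnosis_images": -2.0}

def satisfaction(action: str, preference: float) -> float:
    """How well the resulting world matches a given preference."""
    return OUTCOME[action] * preference

# Reading A: defer to whatever the principal would want after seeing the agent's outputs.
def objective_conditional(action: str) -> float:
    return satisfaction(action, reported_preference_after(action))

# Reading B: defer to the un-manipulated ("true") principal.
def objective_true_principal(action: str) -> float:
    return satisfaction(action, TRUE_PREFERENCE)

weak_agent = ["fetch_coffee", "do_nothing"]
strong_agent = weak_agent + ["show_hypnosis_images"]

for available in (weak_agent, strong_agent):
    choice_a = max(available, key=objective_conditional)
    choice_b = max(available, key=objective_true_principal)
    print(f"{available}: A picks {choice_a}, B picks {choice_b}")

# Both readings pick "fetch_coffee" while manipulation is out of reach, but once
# "show_hypnosis_images" is available, reading A picks it (the warped preference
# is easier to satisfy) while reading B still picks "fetch_coffee".
```

Obviously a real agent’s model of its principal isn’t a lookup table; the toy is only meant to show why you can’t tell the two readings apart by observing behavior below the relevant capability threshold.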
I don’t think this means CAST is doomed. It does seem, intuitively, like you need to solve fewer gnarly philosophy problems to make a CAST agent than a Friendly AGI, and if we can select a friendly principal, maybe that’s a good trade. I just think that philosophically distinguishing manipulation from legitimate updates in squishy systems like human principals looks to be one of those remaining gnarly philosophy problems that we might not be able to get away with delegating to the machines, and even if we did solve it, we’d still need some pretty sophisticated interpretability tools to suss out whether the machine got it right.