I’m having trouble understanding how the maze example is different from the cat example. The maze AI was trained on a set of mazes that had a red door along the shortest path, so it learned to go to those red doors. When it was deployed on a different set of mazes, the goal it had learned didn’t match up with the goal its programmers wanted it to have. This seems like the same type of out-of-distribution behavior that you illustrated with the AI that learned to look for white animals rather than cats.
You presented the maze AI as different from the cat AI because it had an outer goal of “find the shortest path through the maze” and implemented that goal by iterating the inner goal of “breadth-first search for a red door”. The inner goal is aligned with the outer goal for all training mazes, but not for the real mazes. But couldn’t you frame the cat AI the same way? Maybe it has an outer goal of “check for a cat” and it implements that with an inner goal of “divide the image into a set of shapes that each contain only colors within [margin] of the average color. If there is at least one shape that’s within [margin] of white and has [shape], return yes; otherwise return no.”
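For concreteness, the kind of inner rule I’m imagining could be sketched like this (all names, thresholds, and the stand-in for the [shape] test are made up purely for illustration; it’s just a guess at what such a rule might look like, not the post’s actual model):

```python
import numpy as np

def contains_cat(image: np.ndarray, margin: float = 30.0, block: int = 16) -> bool:
    """Hypothetical 'inner goal': chop the image into blocks, keep blocks whose
    colours all stay within `margin` of the block's average colour, and return
    yes if any such block is within `margin` of white (the [shape] test from the
    description above is elided here)."""
    white = np.array([255.0, 255.0, 255.0])
    h, w, _ = image.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            region = image[y:y + block, x:x + block].reshape(-1, 3).astype(float)
            avg = region.mean(axis=0)
            uniform = np.all(np.abs(region - avg) <= margin)      # one roughly uniform "shape"
            nearly_white = np.all(np.abs(avg - white) <= margin)  # "within [margin] of white"
            if uniform and nearly_white:
                return True
    return False
```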
How is the maze AI fundamentally different from the cat AI? Why is the inner/outer alignment model of thinking about an AI system more useful than thinking about it as a single optimizer that was trained on a flawed distribution?
(This is the second time someone has asked this, so the fault probably lies with the post, and I should edit it somehow.)
The difference is that the maze AI is running a search. (The classifier isn’t; it’s just applying a bunch of rules.) This matters because search is where the whole thing gets dangerous. If you recall the last part on deceptive and proxy alignment: those concepts only make sense once we’re in the business of optimizing, i.e., running a search for actions that score well according to some utility function. In that setting, it makes sense to think of the inner thing as an “optimizer” or “agent” that has goals/wants things/etc.
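To make “running a search” concrete, here’s a minimal sketch of the kind of thing I mean: breadth-first search over maze positions, returning the shortest path that ends at a red door. (The maze encoding and all names here are made up for illustration, not the actual system from the post.)

```python
from collections import deque

def shortest_path_to_red_door(maze, start):
    """maze: list of rows; '#' is a wall, '.' is open floor, 'R' is a red door.
    Breadth-first search over positions; returns the shortest path (a list of
    (row, col) positions) from `start` to any red door, or None if unreachable."""
    rows, cols = len(maze), len(maze[0])
    frontier = deque([[start]])   # queue of partial paths, shortest first
    seen = {start}
    while frontier:
        path = frontier.popleft()
        r, c = path[-1]
        if maze[r][c] == 'R':     # the learned inner goal: stop at a red door
            return path
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != '#' and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(path + [(nr, nc)])
    return None

# Example: the search machinery itself is goal-agnostic; only the 'R' test above
# encodes what it is searching *for*.
maze = [list("..#R"),
        list(".#.."),
        list("....")]
print(shortest_path_to_red_door(maze, (2, 0)))
```

The point of the sketch is that the loop evaluates many candidate paths against a success criterion; that criterion is exactly the slot where a misaligned inner goal can sit.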
What’s the conceptual difference between “running a search” and “applying a bunch of rules”? Whatever rules the cat AI is applying to the image must be implemented by some step-by-step algorithm, and it seems to me like that could probably be represented as running a search over some space. Similarly, you could abstract away the step-by-step understanding of how breadth-first search works and say that the maze AI is applying the rule of “return the shortest path to the red door”.
Yeah, very good question. The honest answer is that I don’t know; I had this distinction in mind when I wrote the post, but pressed on it, I’m not sure there’s a simple way to capture it. Someone on the AstralCodexTen article just asked the same thing, and the best I came up with was “the set of possible outputs is very large and contains harmful elements”. That would certainly be a necessary criterion: if every output is harmless, the system can’t be dangerous. (GPT already fails this condition.)
But even if there is no qualitative step, you can view it as a spectrum of competence, and deceptive/proxy alignment become a possibility at some point on that spectrum. Not having a crisp characterization doesn’t make the dangerous behavior go away.
I like this thread; I think it represents an important piece of the puzzle. I’m hoping to write something more detailed on it soon, but here’s a brief take for now.
My take is roughly: search/planning is one important ingredient of ‘consequentialism’ (in fact it is perhaps definitional, the way I understand consequentialism). When you have a consequentialist adversary (with strategic awareness[1]), you should (all other things equal) expect it to be more resilient to your attempts to put it out of action. Why? A system that behaves similarly in training but isn’t a consequentialist must have learned some heuristics during training. Those heuristics will sometimes be lucky and rest on abstractions which generalise some amount outside of the training distribution. But the further you go from the training distribution, the less likely it is that those abstractions and heuristics remain suitable, so you should be more optimistic about taking such a system down if you need to. In contrast, a consequentialist system can refine or even replace its heuristics in response to changes in inputs (by doing search/planning).
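As a toy illustration of the contrast (entirely hypothetical names, not anyone’s actual proposal): the heuristic system just applies whatever rules training baked in, while the consequentialist one keeps searching over actions against its own model and utility, so it can still act sensibly when those baked-in rules stop applying.

```python
def heuristic_policy(observation, learned_rules):
    """Fixed heuristics: apply the first training-time rule whose condition matches.
    Off-distribution, there may simply be no matching rule left."""
    for condition, action in learned_rules:
        if condition(observation):
            return action
    return None

def consequentialist_policy(observation, actions, world_model, utility):
    """Search/planning: simulate each available action with an internal world model
    and pick whichever predicted outcome the utility function scores highest --
    the 'rules' are regenerated on the fly for whatever situation comes up."""
    return max(actions, key=lambda a: utility(world_model(observation, a)))
```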
Another (perhaps independent?) ingredient is the ability to refine and augment the abstractions and world/strategic model on which the planning rests (play/experimentation). I would be even more pessimistic about a playful consequentialist adversary, because I’d expect its consequentialism to keep working even further (perhaps indefinitely far) outside the training distribution, given the opportunity to experiment.
[1] Roughly, I mean ‘knows about humans and ways to interact with and influence them’; see https://www.alignmentforum.org/posts/HduCjmXTBD4xYTegv/draft-report-on-existential-risk-from-power-seeking-ai (and some discussion here: https://www.alignmentforum.org/posts/cCMihiwtZx7kdcKgt/comments-on-carlsmith-s-is-power-seeking-ai-an-existential).