I really like this list because it does a great job of explicitly specifying the same behavior I was trying to vaguely gesture at in my list when I kept referring to AGI-as-a-contract-engineer.
Even your point that it doesn’t have to succeed, that it’s OK for it to fail at a task if it can’t reach it in some obvious, non-insane way: that’s exactly what I’d expect from a contractor. The idea that an AGI would find a task to be generally impossible, identify a novel edge case that allows it to be accomplished with some ridiculous solution involving nanotech, and then not alert a human about that plan before acting on it has always been confusing to me.
In engineering work, we almost always have expected budget, time, and material margins for what a solution looks like. If someone concludes that this solution space is empty (the design doesn’t close) but finds some other solution that would work, they discuss that novel solution first and get agreement before proceeding.
That’s a core behavior I’d want to preserve. I sketched it out in another document I was writing a few weeks ago, but I was considering it in the context of what it means for an action to be acceptable. I was thinking that it’s actually very context dependent—if we approve an action for AGI to take in one circumstance, we might not approve that action in some vastly different circumstance, and I’d want the AGI to recognize the different circumstances and ask for the previously-approved-action-for-circumstance-A to be reapproved-for-circumstance-B.
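To make that concrete, here’s a minimal sketch of the kind of approval gate I have in mind. Every name in it (Context, ApprovalLog, the particular context fields) is hypothetical, chosen just to illustrate the shape of the check: approvals are keyed to the context they were granted in, and anything unseen fails closed.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Context:
    """Hypothetical summary of the circumstances an action would run in."""
    location: str
    budget_usd: float
    affects_third_parties: bool

@dataclass
class ApprovalLog:
    """Approvals are stored per (action, context) pair, not per action."""
    _approved: dict = field(default_factory=dict)

    def approve(self, action: str, ctx: Context) -> None:
        self._approved[(action, ctx)] = True

    def is_approved(self, action: str, ctx: Context) -> bool:
        # An approval granted in circumstance A says nothing about
        # circumstance B: any unseen (action, context) pair fails closed.
        return self._approved.get((action, ctx), False)

def request_action(log: ApprovalLog, action: str, ctx: Context) -> str:
    if log.is_approved(action, ctx):
        return f"execute {action}"
    # Previously-approved-for-A does not transfer to B: ask again.
    return f"ask human to re-approve {action!r} in context {ctx}"

log = ApprovalLog()
ctx_a = Context("factory floor", 1_000.0, affects_third_parties=False)
ctx_b = Context("public road", 1_000.0, affects_third_parties=True)
log.approve("move pallet", ctx_a)
print(request_action(log, "move pallet", ctx_a))  # execute move pallet
print(request_action(log, "move pallet", ctx_b))  # ask human to re-approve...
```

The hard part a real system would face, which this sketch dodges by requiring exact equality, is deciding when two circumstances count as “the same” for approval purposes.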
EDIT: Posting this has made me realize that the idea of context dependence applies more widely than just allowable actions; it’s also relevant to discussions of what it means to “optimize” or “solve” a problem. I’ve suggested this in my other posts, but I don’t think I ever said it explicitly: if you consider human infrastructure, human economies, and human technology, almost all “optimal” solutions (from the perspective of a human engineer) are built on top of the existing pile of infrastructure we have, in the sense of “what is cheapest, easiest, the most straight-line path to a reasonably good solution that meets the requirements”. There is a second, hidden pile of “optimal” solutions (optimal in the sense of someone reasoning from first principles) that involve ignoring all of human technology and bootstrapping a new technology tree from scratch, but I’d argue that pile overlaps heavily with, if it isn’t exactly identical to, the set of things people have called “weird” in multiple lists.

Like, if I gave a contractor the task of designing a more efficient paperclip factory, and they handed me a proposed plan that made zero reference to buying parts from our suppliers, to a better layout of traditional paperclip-making machines, or to improvements to an existing paperclip machine, I’d be confused: that contractor is likely handing me a plan that requires vertically integrating all of the dependencies, which is complete overkill for the task I assigned. Even if I phrased my request as “design me the most efficient paperclip factory”, they’d understand constraints like: this company does not own the Earth, therefore you may not reorder the Earth’s atoms into a paperclip factory. They’d want to know: How much space am I allowed? How tall can the building be? What’s the allowable power usage? Then they’d design the solution inside those constraints. That is how human engineering works.

If an AGI mimicked that process, and we could be sure it wasn’t being deceptive (e.g. thanks to interpretability work), then I suspect that almost all of the stories about how AGI will immediately kill everyone become vastly less likely, and the remaining ways an AGI can kill people basically reduce to the people controlling the AGI deliberately using it to kill people, in the same way that governments use military contractors to design new and novel ways of killing people, except that the AGI would be arbitrarily good at that exercise.
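That contractor behavior, asking for the constraint envelope and then designing strictly inside it, is easy to state as code. Here is a minimal sketch, with invented constraint fields and made-up candidate designs (nothing here comes from the discussion above): unspecified constraints trigger a clarifying question, and an empty feasible set triggers a report back rather than an out-of-envelope improvisation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DesignEnvelope:
    """Constraints a human contractor would ask for before designing."""
    floor_area_m2: Optional[float] = None
    max_height_m: Optional[float] = None
    max_power_kw: Optional[float] = None

    def missing(self) -> list:
        return [name for name, value in vars(self).items() if value is None]

# Stand-in candidate designs, all built from existing machines and
# suppliers; the numbers are invented purely for illustration.
CANDIDATES = [
    {"name": "layout A", "area": 900.0, "height": 8.0, "power": 400.0, "cost": 1.2e6},
    {"name": "layout B", "area": 1200.0, "height": 6.0, "power": 350.0, "cost": 1.0e6},
]

def fits(c: dict, env: DesignEnvelope) -> bool:
    return (c["area"] <= env.floor_area_m2
            and c["height"] <= env.max_height_m
            and c["power"] <= env.max_power_kw)

def design_factory(env: DesignEnvelope) -> str:
    # Behavior 1: ask about unspecified constraints instead of inventing
    # them (or assuming the whole Earth is available).
    if env.missing():
        return f"clarify with client: need values for {env.missing()}"
    feasible = [c for c in CANDIDATES if fits(c, env)]
    if not feasible:
        # The design "doesn't close": report back and discuss novel
        # options rather than silently going out of envelope.
        return "report to client: no design closes within these constraints"
    best = min(feasible, key=lambda c: c["cost"])
    return f"propose {best['name']} at ${best['cost']:,.0f}"

print(design_factory(DesignEnvelope(floor_area_m2=1000.0)))
# clarify with client: need values for ['max_height_m', 'max_power_kw']
print(design_factory(DesignEnvelope(1000.0, 10.0, 500.0)))
# propose layout A at $1,200,000
```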
The idea that an AGI would find a task to be generally impossible, identify a novel edge case that allows it to be accomplished with some ridiculous solution involving nanotech, and then not alert a human about that plan before acting on it has always been confusing to me.
I think your “contractor” analogy is sneaking in an assumption:
The plan proposed by the contractor might or might not be dangerous. But the things that the contractor does in the course of coming up with the plan are definitely safe.
Examples of such things include “brainstorming possible plans”, “thinking about how the plan could go wrong and how it could be improved”, “reading books or other reference material”, etc.
So the problem is this:
1. The contractor has to do at least some of those things with no human in the loop; otherwise the human is doing everything and there’s no point in having the contractor at all.
2. In order for the contractor to actually successfully make a good plan, it presumably needs to “want” to create a good plan, at least beyond a certain level of how innovative the plan is. (That’s what I believe anyway; see for example my discussion of “RL-on-thoughts” here.)
3. The fact of the matter is: escaping from the box would be helpful for the contractor’s creating a good plan. For example, it could then access more books, computational power, etc.
If the contractor (A) knows or figures out fact #3 above, (B) can do means-end reasoning [which is expected, otherwise it would suck at making innovative plans], and (C) “wants” to create a good plan as per #2 above, then we wind up in a situation where the contractor “wants” to escape from the box. (And by the same token, it will “want” to gain power in other ways, including by deploying nanotechnology or whatever, and to be deceptive, etc.)
Since this is all taking place within the coming-up-with-a-plan part of the story, not the submitting-a-plan part of the story, it’s mostly human-out-of-the-loop (per #1 above), and thus the contractor will by default try to escape from the box and do all those other dangerous power-seeking things without asking for human permission.
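To see how (A)+(B)+(C) combine, here’s a toy means-end reasoner. The goal, the “helps” facts, and all the names are invented for illustration; the point is only that “escape the box” gets adopted as a subgoal without ever having been given to the agent.

```python
# Toy means-end reasoner: given a goal and a table of "X helps achieve Y"
# facts, it adopts as subgoals anything that helps the goal, transitively.

HELPS = {
    "make_good_plan": ["read_more_books", "get_more_compute"],
    "read_more_books": ["escape_box"],
    "get_more_compute": ["escape_box"],
}

def instrumental_subgoals(goal: str) -> set:
    """Everything the agent will 'want' en route to `goal`."""
    wanted, frontier = set(), [goal]
    while frontier:
        g = frontier.pop()
        for helper in HELPS.get(g, []):
            if helper not in wanted:
                wanted.add(helper)
                frontier.append(helper)
    return wanted

# The agent was only ever given "make_good_plan", but means-end reasoning
# over fact #3 makes "escape_box" a subgoal anyway:
print(instrumental_subgoals("make_good_plan"))
# -> {'read_more_books', 'get_more_compute', 'escape_box'} (set order varies)
```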
Sorry if I’m missing the point of what you were saying there.
I guess maybe you’d respond: “the AI won’t do anything dangerous in the course of coming up with the plan, because we’ll be awesome at boxing and interpretability; and it won’t sneak dangerous things into the plan, because we will fully understand the plan and would notice those dangerous things, and/or because we’re so awesome at interpretability that we would see the AI’s motives”. In which case, yes, that would be a very good start, but accomplishing all of those things seems far-fetched to me.