This is a helpful response. I think I rounded to agents because in my head I see corrigibility as a property of agents, and I don’t really know what “corrigible goal” even means. Your point about constraints is illuminating, as I tend not to focus on constraints when thinking about corrigibility. But let me see if I understand what you’re trying to say.
Suppose we’re optimizing for paperclips, and we form a plan to build paperclip factories to accomplish that (top-level) goal. Building factories can then be seen as a subgoal, but of course we should be careful when building paperclip factories not to inadvertently ruin our ability to make paperclips. One way of protecting the terminal goal even when focusing on subgoals is to forward-check actions to see if they conflict with the destination. (This is similar to how a corrigible agent might check for confirmation from its principal before doing something with heavy, irreversible consequences.) Forward-checking, for obvious reasons, requires there to actually be a terminal goal to check against, and we should not expect this to work in an agent “without a terminal goal.” But there’s another way to prevent optimizing a subgoal from inadvertently hurting global success: constrain the optimization. If we can limit the kinds of changes that we make when pursuing the subgoal to nice, local, reversible ones, then we can pursue building paperclip factories myopically, expecting that we won’t inadvertently produce side-effects that ruin the overall ability to make paperclips. This is especially useful when pursuing several subgoals in parallel, as forward-checking a combination of moves is combinatorially costly; better to have the agent’s parallel actions constrained to nice parts of the space.
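As a toy sketch of the contrast (entirely mine; `simulate`, `terminal_value`, and the `Action` fields are hypothetical stand-ins, not a proposal for how a real planner would represent things):

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    reversible: bool   # can the change be cheaply undone?
    local: bool        # does it only touch the subgoal's own resources?

def forward_check(action, state, simulate, terminal_value):
    """Strategy 1: simulate the action and check it against the terminal goal.
    Requires an explicit terminal goal and a (possibly expensive) world model."""
    return terminal_value(simulate(state, action)) >= terminal_value(state)

def within_constraints(action):
    """Strategy 2: only allow nice, local, reversible changes.
    Makes no reference to the terminal goal, so it stays cheap even when
    many subgoals are pursued in parallel."""
    return action.reversible and action.local
```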
If it turns out there’s a natural kind of constraint that shows up when making plans in a complex world, such that optimizing under that set of constraints is naturally unlikely to harm ability to accomplish goals in general, then perhaps we have some hope in naming that natural kind, and building agents which are always subject to these constraints, regardless of what they’re working on.
Is that right?
(This is indeed a very different understanding of what you were saying than I originally had. Apologies for the misunderstanding.)
If I were a singleton AGI, but not such a Jupiter brain that I could deal with the combinatorial explosion of directly jointly-optimizing every motion of every robot, I would presumably set up an internal “free market” with spot-prices for iron ore and robot-hours and everything else. Then I would iteratively cycle through all my decision-points and see if there are ways to “make money” locally, and then update virtual “prices” accordingly.
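A minimal sketch of the kind of local price-updating loop this describes (a tâtonnement-style toy with made-up numbers and a fake demand function, not a claim about how the real bookkeeping would work):

```python
# Virtual spot prices for two internal resources, nudged toward the point
# where the plans' total demand matches the available supply.
prices = {"iron_ore": 1.0, "robot_hours": 1.0}
supply = {"iron_ore": 100.0, "robot_hours": 50.0}

def demand_at(prices):
    # Stand-in for cycling through every local decision point and totaling
    # how much of each resource the locally "profitable" plans want.
    return {"iron_ore": 120.0 / prices["iron_ore"],
            "robot_hours": 40.0 / prices["robot_hours"]}

for _ in range(200):
    demand = demand_at(prices)
    for r in prices:
        excess = demand[r] - supply[r]
        prices[r] = max(1e-6, prices[r] * (1 + 0.05 * excess / supply[r]))

print(prices)  # ~1.2 for iron_ore, ~0.8 for robot_hours: demand roughly equals supply
```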
In fact, I think there’s probably a theorem that says that the optimal solution of a complex resource allocation problem is isomorphic to a system where things have prices. (Something to do with Lagrange multipliers? Shrug.)
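Concretely, I think the theorem being gestured at is linear-programming duality; a rough sketch (my reconstruction, nothing rigorous): the Lagrange multipliers, i.e. the dual variables on the resource constraints, are per-unit resource prices.

```latex
% Primal: allocate scarce resources b across activities x to maximize value.
\max_{x \ge 0} \; c^{\top} x \quad \text{s.t.} \quad A x \le b
% Dual: price each resource so that no activity is profitable at those prices,
% and minimize the total valuation of the resource stock.
\min_{y \ge 0} \; b^{\top} y \quad \text{s.t.} \quad A^{\top} y \ge c
% Strong duality: the two optima coincide, and the optimal y are exactly the
% Lagrange multipliers (shadow prices) of the constraints A x \le b.
```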
(Fun fact: In the human world, propagating prices within firms—e.g. if the couch is taking up 4m² of the 4000m² floor space at the warehouse, then that couch is “charged” 0.1% of the warehouse upkeep costs, etc.—is very rarely done but leads directly to much better decisions and massive overall profit increases! See here.)
Externalities are not an issue in this virtual “economy” because I can “privatize” everything—e.g. I can invent fungible allowances to pollute the atmosphere in thus-and-such way etc. This is all just a calculation trick happening in my own head, so there aren’t coordination problems or redistribution concerns or information asymmetries or anything like that. Since I understand everything (even if I can’t juggle it all in my head simultaneously), I’ll notice if there’s some relevant new unpriced externality and promptly give it a price.
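In the LP framing sketched above, an externality allowance is just one more constraint row (my gloss, not part of the original comment): e.g. a cap on total emissions whose multiplier is the allowance price.

```latex
% Add an emissions cap as one more resource constraint:
\sum_j e_j \, x_j \le E_{\max}
% Its Lagrange multiplier \lambda_E at the optimum is the per-unit price of an
% emission allowance in the virtual economy.
```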
So then (this conception of) corrigibility would correspond to something like “abiding by this particular system of (virtual) property rights”. (Including all the weird “property rights” like purchasing allowances to emit noise or heat or run conscious minds or whatever, and including participating in the enterprise of discovering new unpriced externalities.) Do you agree?
This is true, if we interpret “optimal solution” as the set of Pareto efficient allocations and make some very general assumptions about monotonic utility and demand independence![1] The idea of the proof is to take the dual of the resource allocation problem to turn it into a resource valuation problem, and by the strong duality theorem these have the same optima (the solution to the dual of a resource allocation problem is a vector of prices). This is an Arrow-Debreu model, so it is known that markets will clear at these prices. By the First Welfare Theorem, we know that markets that clear result in a Pareto efficient outcome, and by the Second Welfare Theorem, we know that any Pareto efficient outcome can be supported by some set of prices produced by this process. Any process optimizing an allocation of scarce resources is isomorphic to a set of utility-maximizing agents exchanging those resources in a free market. Pretty cool!
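A tiny numerical illustration of that correspondence, with made-up numbers (assuming I'm remembering scipy's API right: the HiGHS backend reports constraint marginals as res.ineqlin.marginals, which are the shadow prices up to sign):

```python
import numpy as np
from scipy.optimize import linprog

value = np.array([3.0, 5.0])     # value per unit of each of two goods
A = np.array([[1.0, 2.0],        # resource usage per unit of each good
              [3.0, 1.0]])
b = np.array([10.0, 15.0])       # resource endowments

# linprog minimizes, so negate the objective to maximize total value.
res = linprog(-value, A_ub=A, b_ub=b, bounds=[(0, None)] * 2, method="highs")

allocation = res.x                  # optimal allocation: [4, 3]
prices = -res.ineqlin.marginals     # resource shadow prices: [2.4, 0.2]
print(allocation, prices)
# At these prices each produced good is exactly paid for by the resources it
# consumes: 1*2.4 + 3*0.2 = 3.0 and 2*2.4 + 1*0.2 = 5.0.
```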
Agents weakly prefer more of a good to less of it, and agents’ demand for a good doesn’t depend on other people having it.
Can we remove one of the assumptions, or are both assumptions necessary to get the result stated?
I’m pretty sure they need to be at least locally true to guarantee what I said. Are there specific scenarios you’re imagining?
For what it’s worth, in most cases you’ll probably be able to reframe the resource allocation problem so that these assumptions hold. For example, you can account for negative-utility goods by taking the complement and giving it a positive price. If you want to account for agents assigning a “negative” price to pollution, you can instead model how much they value clean air and give that a positive price, satisfying monotonic utility.
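In symbols, the reframing is just a change of variable (my notation, with $\bar{A}$ some fixed maximum air quality):

```latex
u(x, p) \text{ with } \frac{\partial u}{\partial p} < 0
\quad\Longrightarrow\quad
\tilde{u}(x, a) := u(x, \bar{A} - a) \text{ with } \frac{\partial \tilde{u}}{\partial a} > 0,
\qquad a := \bar{A} - p .
```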
One of the scenarios I’m imagining is one where network effects exist, such that you don’t want something if you’re the only person who has it, but you do want it if others have it.
Arguably, a lot of public goods/social media are like this, where there’s zero demand at small scale but lots of demand once the size grows beyond a threshold.
In essence, I’m asking if we can remove the demand independence assumption and still get an isomorphism between optimal solutions to scarce resources and a system having prices.
Ah, well if there are interdependent demand functions then there is no guarantee of a (general) equilibrium, so the starting resource allocation problem may have 0 or many solutions. So I would say no, the isomorphism doesn’t hold in that case.
And this is why some goods, like public safety, can’t be provided by markets, because the assumption of demand independence is violated.
Hm, I think that’s more of a supply independence thing, what economists would call “non-excludable”. If the government funds a police force, it’s not as if they protect some citizens but not others. But that’s not a violation of the demand independence assumption because I care about living in a country with a strong police force regardless of whether you want that or not.
Goods with interdependent demand, from Ferraris to Facebook, generally do get provided by markets in real life; they just don’t have stable prices. Interdependence breaks some of the equilibrium models because it can cause divergences or circularity in your demand function, and then there’s no fixed point in positive price/demand space.
Luxury goods are a real-life example of this. Here’s an intuition-prompting setup:
Suppose I’m rich and I buy a Gucci bag
You’re poor, but you want to look rich so you also buy a Gucci handbag
Now I don’t think the bag is as exclusive, so I don’t want mine any more
Now that the rich guy isn’t wearing it anymore, you don’t want yours either
But now no one has it, so it seems exclusive again, so now I want it
Repeat
This doesn’t mean markets won’t provide Gucci bags (obviously, they do), but there isn’t a price equilibrium; the price will fluctuate forever. In terms of the original point, the Gucci bag allocation problem isn’t isomorphic to a market equilibrium, because there is no such equilibrium.
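A throwaway simulation of that loop, with entirely made-up motives, just to show the best-response dynamics never settle:

```python
# "rich" wants the bag only when "poor" doesn't have it (exclusivity);
# "poor" wants the bag only when "rich" has it (signaling).
rich_has, poor_has = True, False       # step 0: the rich guy buys a bag
for step in range(8):
    rich_has, poor_has = (not poor_has), rich_has   # simultaneous best responses
    print(step, rich_has, poor_has)
# Output cycles (True,True) -> (False,True) -> (False,False) -> (True,False) -> ...
# These best-response dynamics have no fixed point, hence no stable price.
```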
Any examples of where the starting resource allocation has more than 1 solution/equilibrium, assuming demand independence is violated?
Sure, I think social media is probably the best example of this. Suppose there are two platforms, A and B, and each site is worth more when more people are on it. Our “resource allocation” problem is to maximize utility, so we want to get everyone on the same site. There are two equilibria here; we can either set the price for A much higher than for B and everyone will move to B, or vice versa.
If the demand functions weren’t interdependent and every agent just got some amount of utility from A and some amount of utility from B, there would be exactly one equilibrium price.
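Setting prices aside for a second, here’s the coordination structure underneath, as a quick check that both all-on-A and all-on-B are stable (toy payoffs of my own: a user’s utility on a platform is just its user count):

```python
def payoff(platform, counts):
    # Network effect: a user's payoff equals the number of users on their platform.
    return counts[platform]

for everyone_on in ("A", "B"):
    counts = {"A": 0, "B": 0}
    counts[everyone_on] = 10
    other = "B" if everyone_on == "A" else "A"
    stay = payoff(everyone_on, counts)
    switch = payoff(other, {everyone_on: 9, other: 1})  # one user defects alone
    print(f"everyone on {everyone_on}: stay={stay}, switch alone={switch}")
# No individual gains by switching alone, so both outcomes are equilibria.
```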
This seems mostly right. I think there still might be problems where identifying and charging for relevant externalities is computationally harder than routing around them. For instance, if you’re dealing with a civilization (such as humanity) that responds to your actions in complex and chaotic ways, it may be intractable to efficiently price “reputation damage”; instead you might want to be overly cautious (i.e. “impose constraints”) and think through deviations from that cautious baseline on a case-by-case basis (i.e. “forward-check”). Again, I think your point is mostly right, and a useful frame; it makes me less likely to expect the kinds of hard constraints that Wentworth and Lorell propose to show up in practice.
TBC, I don’t particularly expect hard constraints to show up; that was more a way of illustrating the underlying concept. The same underlying concept in the market-style picture would be: across many different top-level goals, there are convergent ways of carving up “property rights”. So, a system can be generally corrigible by “respecting the convergent property rights”, so to speak.
Yup, exactly, and good job explaining it too.
:)
Now that I feel like we’re at least on the same page, I’ll give some thoughts.
This is a neat idea, and one that I hadn’t thought of before. Thanks!
I think I particularly like that it might give us a natural way of naming constraints that would be useful to point at.
I am unsure how much these constraints actually get strongly reified in practice. When planning in simple contexts, I expect forward-checking to be more common. The centrality of forward-checking in my conception of the relationship between terminal and instrumental goals is a big part of where I think I originally got confused and misunderstood you.
One of the big reasons I don’t focus so much on constraints when thinking about corrigibility is that I think constraints are usually either brittle or crippling. I think corrigible agents will, for example, try to keep their actions reversible, but I don’t see a way to instantiate this as a constraint in a way that both allows normal action and forbids Goodharting. Instead, I tend to think about heuristics that fall back on getting help from the principal. (“I have a rough sense of how reversible things should normally be, and if it looks like I might be going outside the normal bounds I’ll stop and check.”)
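As a cartoon of that heuristic (the threshold, the irreversibility estimate, and ask_principal are all invented placeholders, not a design proposal):

```python
NORMAL_IRREVERSIBILITY = 0.2   # rough sense of how irreversible actions normally are

def maybe_act(action, estimated_irreversibility, ask_principal):
    """Act autonomously inside the normal bounds; otherwise stop and check."""
    if estimated_irreversibility <= NORMAL_IRREVERSIBILITY:
        return action                           # normal case: just act
    if ask_principal(f"This looks unusually irreversible: {action}. Proceed?"):
        return action                           # principal explicitly approved
    return None                                 # otherwise, don't act
```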
Thus, my guess is that if one naively tries to implement an agent that is genuinely constrained according to the natural set of “instrumental constraints”, or whatever we want to call them, the constraints will end up effectively paralyzing it.
The thing that allows a corrigible agent not to be paralyzed, in my mind, is the presence of a principal. But if I’m understanding you right, “instrumental constraint” satisfying agents don’t (necessarily) have a principal. This seems like a major difference between this idea and corrigibility.
I have some additional thoughts on how exactly the Scylla and Charybdis of being paralyzed by constraints and cleverly bypassing constraints kills you, for example with regard to resource accumulation/protection, but I think I want to end by noting a sense that naively implementing these in some kind of straightforward constrained-optimizer isn’t where the value of this idea lies. Instead, I am most interested in whether this frame can be used as a generator for corrigibility heuristics (and/or a corrigibility dataset). 🤔