I wrote this post imagining “strategy-stealing assumption” as something you would assume for the purpose of an argument; for example, I might want to justify an AI alignment scheme by arguing “Under a strategy-stealing assumption, this AI would result in an OK outcome.” The post was motivated by trying to write up another argument where I wanted to use this assumption; I spent a bit of time trying to think through what the assumption was, and decided it was likely to be of independent interest. (Although that other argument hasn’t yet appeared in print.)
I’d be happy to have a better name for the research goal of making it so that this kind of assumption is true. I agree this isn’t great. (And then I would probably be able to use that name in the description of this assumption as well.)
(See also the concept of “decoupled RL” from some DeepMind folks.)
“Social cost” in economics usually refers to the sum of private costs (Wikipedia), such that a transfer from one person to another would have no social cost.
“Internalizing an externality” usually means making the private costs better reflect the social costs (so it’s not relevant what is done with the tax revenue).
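To make the two definitions above concrete, here is a toy numeric sketch (the cost figures are hypothetical, chosen only for illustration):

```python
# Toy illustration (hypothetical numbers): a producer's private cost per
# unit is 10, and production imposes an external cost of 4 per unit on
# third parties.
private_cost = 10.0
external_cost = 4.0

# Social cost is the sum of private costs, including costs borne by others.
social_cost = private_cost + external_cost  # 14.0

# A Pigouvian tax equal to the external cost "internalizes" it: the
# producer's private cost now reflects the social cost. Note the incentive
# effect doesn't depend on what is done with the tax revenue.
pigouvian_tax = external_cost
private_cost_after_tax = private_cost + pigouvian_tax

assert private_cost_after_tax == social_cost
```

A pure transfer (e.g. the 4.0 of tax revenue moving to the government) shows up as a cost to one party and a gain to another, so it nets to zero social cost, consistent with the definition above.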
By way of illustration, AlphaZero’s long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.
Short-term preferences are the value function one or a few moves out. If the algorithm is “reasonable,” then its short-term preferences-on-reflection are the true function P(I win the game | I make this move). You could also talk about intermediate degrees of reflection.
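The distinction can be sketched with a toy game in place of chess (this is my own stand-in example, not anything from AlphaZero): a subtraction game where players alternately take 1 or 2 stones and whoever takes the last stone wins. A cheap heuristic policy plays the role of the policy network, and exhaustive search plays the role of unbounded MCTS.

```python
from functools import lru_cache

# Hypothetical toy game: from n stones, take 1 or 2; whoever takes the
# last stone wins. Positions with n % 3 == 0 are lost for the mover.

def policy_prior(n):
    # "Short-term preference": a cheap heuristic that always grabs 2
    # stones when it can (a stand-in for the raw policy network output).
    return 2 if n >= 2 else 1

@lru_cache(maxsize=None)
def wins(n):
    # True value P(I win | it's my move with n stones), computed by
    # exhaustive search (a stand-in for a very large amount of MCTS).
    if n == 0:
        return False  # opponent took the last stone; I lose
    return any(not wins(n - take) for take in (1, 2) if take <= n)

def reflected_move(n):
    # "Short-term preference-on-reflection": the move backed by search.
    for take in (1, 2):
        if take <= n and not wins(n - take):
            return take
    return policy_prior(n)  # no winning move exists; fall back on the prior

# From n = 4 the prior says "take 2", but that hands the opponent a won
# position (n = 2); search instead finds "take 1", leaving the lost n = 3.
assert policy_prior(4) == 2
assert reflected_move(4) == 1
```

The long-term preference (win the game) is fixed throughout; only the quality of the short-term move recommendation changes with the degree of reflection.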
Now that I understand “corrigible” isn’t synonymous with “satisfying my short-term preferences-on-reflection,” “corrigibility is relatively easy to learn” doesn’t seem enough to imply these things.
I agree that you still need the AI to be trying to do the right thing (even though we don’t e.g. have any clear definition of “the right thing”), and that seems like the main way that you are going to fail.
As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or “true” preferences) are to let the AI keep running, it will act against my (actual physical) attempts to shut it down, and therefore it’s not corrigible_MIRI.
Note that “corrigible” is not synonymous with “satisfying my short-term preferences-on-reflection” (that’s why I said: “our short-term preferences, including (amongst others) our preference for the agent to be corrigible.”)
I’m just saying that when we talk about concepts like “remain in control” or “become better informed” or “shut down,” those all need to be taken as concepts-on-reflection. We’re not satisfying current-Paul’s judgment of “did I remain in control?”; we’re satisfying the on-reflection notion of “did I remain in control?”
Whether an act-based agent is corrigible depends on our preferences-on-reflection (this is why the corrigibility post says that act-based agents “can be corrigible”). It may be that our preferences-on-reflection are for an agent to not be corrigible. It seems to me that we may want to enforce corrigibility in all cases, for robustness reasons, even if it’s not what we’d prefer-on-reflection.
That said, even without any special measures, saying “corrigibility is relatively easy to learn” is still an important argument about the behavior of our agents, since it hopefully means that (i) our agents will behave corrigibly, (ii) our agents will do something better than behaving corrigibly, according to our preferences-on-reflection, or (iii) our agents are making a predictable mistake in optimizing our preferences-on-reflection (which might be ruled out by them simply being smart enough and understanding the kinds of argument we are currently making).
By “corrigible” I think we mean “corrigible by X” with the X implicit. It could be “corrigible by some particular physical human.”
(In that post I did use narrow in the way we are currently using short-term, contrary to my claim in the grandparent. Sorry for the confusion this caused.)
Like I mentioned above, I mostly think of narrow value learning as a substitute for imitation learning or approval-direction, realistically to be used as a distillation step rather than as your whole AI. In particular, an agent trained with narrow value learning is probably not aligned+competitive in a way that would allow you to apply this kind of strategy-stealing argument.
In “concrete approval-directed agents” I’m talking about a different design; it’s not related to narrow value learning.
I don’t use narrow and short-term interchangeably. I’ve only ever used it in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning.
One of us just misunderstood (1); I don’t think there is any difference.
I mean preferences about what happens over the near future, but the way I rank “what happens in the near future” will likely be based on its consequences (further in the future, in other possible worlds, etc.). So I took (1) to be basically equivalent to (2).
“Terminal preferences over the near future” is not a thing I often think about and I didn’t realize it was a candidate interpretation (normally when I write about short-term preferences I’m writing about things like control, knowledge, and resource acquisition).
By “short” I mean short in sense (1) and (2). “Short” doesn’t imply anything about senses (3), (4), (5), or (6) (and “short” and “long” don’t seem like good words to describe those axes, though I’ll keep using them in this comment for consistency).
By “preferences-on-reflection” I mean long in sense (3) and neither in sense (6). There is a hypothesis that “humans with AI help” is a reasonable way to capture preferences-on-reflection, but they aren’t defined to be the same. I don’t use understandable and evaluable in this way.
I think (4) and (5) are independent axes. (4) just sounds like “is your AI good at optimizing,” not a statement about what it’s optimizing. In the discussion with Eliezer I’m arguing against it being linked to any of these other axes. (5) is a distinction about two senses in which an AI can be “optimizing my short-term preferences-on-reflection.”
When discussing perfect estimations of preferences-on-reflection, I don’t think the short vs. long distinction is that important. “Short” is mostly important when talking about ways in which an AI can fall short of perfectly estimating preferences-on-reflection.
Assuming my interpretation is correct, my confusion is that you say we shouldn’t expect a situation where “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy” (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.
I introduced the term “preferences-on-reflection” in the previous comment to make a particular distinction. It’s probably better to say something like “actual preferences” (though this is also likely to be misinterpreted). The important property is that I’d prefer to have an AI that satisfies my actual preferences than to have any other kind of AI. We could also say “better by my lights” or something else.
There’s a hypothesis that “what I’d say after some particular idealized process of reflection” is a reasonable way to capture “actual preferences,” but I think that’s up for debate. For example, it could fail if me-on-reflection is selfish and has values opposed to current-me; and it could certainly fail for any particular process of reflection, so it might just happen to be the case that no process of reflection satisfies it.
The claim I usually make is that “what I’d say after some particular idealized process of reflection” describes the best mechanism we can hope to find for capturing “actual preferences,” because whatever else we might do to capture “actual preferences” can just be absorbed into that process of reflection.
“Actual preferences” is a pretty important concept here, I don’t think we could get around the need for it, I’m not sure if there is disagreement about this concept or just about the term being used for it.
All three of these corrigible AIs deal with much narrower preferences than “acquire flexible influence that I can use to get what I want.” The narrow value learner post, for example, says:
Imitation learning, approval-direction, and narrow value learning are not intended to exceed the overseer’s capabilities. These are three candidates for the distillation step in iterated distillation and amplification.
The AI we actually deploy, which I’m discussing in the OP, is produced by imitating (or learning the values of, or maximizing the approval of) an even smarter AI—whose valuations of resources reflect everything that unaligned AIs know about which resources will be helpful.
Corrigibility is about short-term preferences-on-reflection. I see how this is confusing. Note that the article doesn’t make sense at all when interpreted in the other way. For example, the user can’t even tell whether they are in control of the situation, so what does it mean to talk about their preference to be in control of the situation if these aren’t supposed to be preferences-on-reflection? (Similarly for “preference to be well-informed” and so on.) The desiderata discussed in the original corrigibility post seem basically the same as the user not being able to tell what resources will help them achieve their long-term goals, but still wanting the AI to accumulate those resources.
I also think the act-based agents post is correct if “preferences” means preferences-on-reflection. It’s just that the three approaches listed at the top are limited to the capabilities of the overseer. I think that distinguishing between preferences-as-elicited and preferences-on-reflection is the most important thing to disambiguate here. I usually use “preference” to mean preference-on-idealized-reflection (or whatever “actual preference” should mean, acknowledging that we don’t have a real ground truth definition), which I think is the more typical usage. I’d be fine with suggestions for disambiguation.
If there’s somewhere else I’ve equivocated in the way you suggest, then I’m happy to correct it. It seems like a thing I might have done in a way that introduces an error. I’d be surprised if it hides an important problem (I think the big problems in my proposal are lurking other places, not here), and I think that in the corrigibility post I have these concepts straight.
One thing you might have in mind is the following kind of comment:
If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences.
That is, you might be concerned: “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy.” I’m saying that you shouldn’t expect this to happen, if the AI is well-calibrated and has enough of an understanding of humans to understand e.g. this discussion we are currently having—if it decides not to be corrigible, we should expect it to be right on average.
This seems too glib, if “long-term preferences” are in some sense the “right” preferences, e.g., if under reflective equilibrium we would wish that we currently put a lot more weight on long-term preferences. Even if we only give unaligned AIs a one-time advantage (which I’m not sure about), that could still cause us to lose much of the potential value of the universe.
To be clear, I am worried about people not understanding or caring about the long-term future, and AI giving them new opportunities to mess it up.
I’m particularly concerned about things like people giving their resources to some unaligned AI that seemed like a good idea at the time, rather than simply opting out of competition so that unaligned AIs might represent a larger share of future-influencers. This is another failure of strategy-stealing that probably belongs in the post—even if we understand alignment, there may be plenty of people not trying to solve alignment and instead doing something else, and the values generated by that “something else” will get a natural boost.
To sum up, I think there’s a fundamental tension between corrigibility (in the sense of respecting the human user’s short-term preferences) and long-term success/competitiveness, which underlies many of the specific failure scenarios described in the OP, and worse, makes it unclear how “strategy-stealing” can work at all.
By short-term preference I don’t mean “Start a car company, I hear those are profitable,” I mean more like “Make me money, and then make sure that I remain in control of that company and its profits,” or even better “acquire flexible influence that I can use to get what I want.”
(This is probably not the response you were looking for. I’m still mostly intending to give up on communication here over the short term, because it seems too hard. If you are confused by particular things I’ve said feel free to quote them so that I can either clarify, register a disagreement, or write them off as sloppy or mistaken comments.)
The solution you propose also sounds really complicated when people have to optimize the timing of when they make capital gains with times when the risk-free rate is low.
When you sell assets you deduct the amount you paid for them. The proposal is to multiply that basis by the total amount of risk-free interest that would have accumulated over the intervening period, which can be calculated by looking up a single number in a table. I agree that using the risk-free rate when you sell would be insane.
(From the perspective of tax optimization, I think this is much simpler than the status quo. From the perspective of tax accounting, this mechanism takes the place of the distinction between long-term and short-term capital gains, and is radically simpler than that.)
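The basis-adjustment arithmetic described above can be sketched in a few lines. The purchase price, sale price, and risk-free factor below are hypothetical numbers of my own choosing; the “table lookup” is represented by a single compounding factor:

```python
# Sketch of the proposed basis adjustment (all numbers hypothetical).
# You bought an asset for 100, held it 10 years, and sold for 250. The
# basis is scaled up by the total risk-free interest accumulated over the
# holding period -- a single factor looked up in a table -- so only the
# return in excess of the risk-free rate is taxed.

purchase_price = 100.0
sale_price = 250.0
years_held = 10

# Suppose the published table says risk-free interest compounded at
# roughly 2%/year over this period.
risk_free_factor = 1.02 ** years_held   # about 1.219

adjusted_basis = purchase_price * risk_free_factor   # about 121.90
taxable_gain = sale_price - adjusted_basis           # about 128.10

# The status quo would instead tax sale_price - purchase_price = 150,
# regardless of how long you held or what risk-free rates prevailed.
```

Note the factor depends only on the purchase and sale dates, not on when rates happened to be high or low within the holding period, which is why there is no timing game to optimize.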
What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)
As long as the user and AI appreciate the arguments we are making right now, then we shouldn’t expect it to do worse than stealing the unaligned AI’s strategy. There is all the usual ambiguity about “what the user wants,” but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user’s view) by doing what others are doing.
(I think I won’t have time to engage much on this in the near future. It seems plausible that I am skipping enough steps or using language in an unfamiliar enough way that this won’t make sense to readers, in which case so it goes; it’s also possible that I’m missing something.)
This is basically the argument in my second bullet. You can make what you will of that argument, I think I still believe it—taxing savings just seems strictly worse than a similarly-progressive tax on income.
(I also don’t much like a wealth tax for the same reason.)
These are good options when available. You should start by setting all the Pigouvian taxes at optimal levels and then go from there. Having not thought about it very much, taxes on the unimproved value of land seem good but can’t fund something like a modern government without major distortions, so you’ll end up with lots of other stuff in your basket.
Also, my current understanding is that risk compensation is definitely not the large majority of investment returns. The last chapter of Cochrane’s Asset Pricing text has a great discussion of the topic. The main conclusion is that explaining returns via risk exposure requires unrealistically high levels of risk aversion—like, one or two orders of magnitude above the risk aversion levels implied by other activities.
What’s the competing explanation?
Haven’t looked at the historical numbers, but in recent times it seems like (i) with log utility and a naive model of “future=past,” optimal leverage is around 2x, (ii) most investors are much more risk averse than log utility (even for casino risk). So it seems like things basically add up here for most of the market. Was the situation an order of magnitude different in the past?
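The “optimal leverage around 2x under log utility” claim can be checked with the standard Merton fraction f* = (μ − r) / (γσ²), where γ is relative risk aversion (γ = 1 for log utility). The excess-return and volatility figures below are rough stand-ins of my own, not numbers from the discussion:

```python
# Back-of-the-envelope check of "optimal leverage ~2x under log utility"
# using the Merton fraction f* = (mu - r) / (gamma * sigma^2).
# Inputs are rough assumed values for broad equities, not sourced figures.

excess_return = 0.06   # assumed equity return over the risk-free rate
volatility = 0.18      # assumed annualized std. dev. of equity returns

def merton_fraction(excess_return, volatility, gamma):
    # Optimal fraction of wealth in the risky asset for a CRRA investor
    # with relative risk aversion gamma (gamma = 1 is log utility).
    return excess_return / (gamma * volatility ** 2)

# Log utility: roughly 2x leverage, matching point (i) above.
print(round(merton_fraction(excess_return, volatility, gamma=1.0), 2))  # ~1.85

# An investor only twice as risk averse as log already wants less than
# full equity exposure, illustrating point (ii).
print(round(merton_fraction(excess_return, volatility, gamma=2.0), 2))  # ~0.93
```

On these assumed inputs, most of the market holding well under 1x equity exposure is consistent with risk aversion somewhat above log, which is the sense in which “things basically add up.”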