Yeah, I said “goal inference” instead of “value learning” but I mean the same thing. The “ambitious” part is that we are trying to do much better than humans, which I was taking for granted in this post (it’s six months older than ambitious vs. narrow value learning).
A lot of magic is happening in the prior over utility functions and optimization algorithms, removing that magic is the open problem.
(I’m pessimistic about making progress on that problem, and instead try to define value by using the human policy to guide a process of deliberation rather than trying to infer some underlying latent structure.)
I agree you should model the human as some kind of cognitively bounded agent. The question is how.
Often the bar is very visible though, which makes it trickier. I think outside might be a good option.
My first reaction was that this discussion focuses too much on sitting on your own, or inviting someone else from your group to move elsewhere. After all, large groups can also shrink when people leave a large group and go join a smaller group.
But on second thought, forming new groups is necessary; otherwise the number of groups will just keep decreasing and the remaining groups will eventually end up large. So in fact it seems very important that you can either split off subgroups or else go sit on your own.
I think in some spaces it is much easier for large groups to fission into small groups and then drift to different spaces. This might be an important consideration for party spaces / layouts.
Other random frictions:
Leaving a group signals that you don’t enjoy the conversation or people in it, so might happen too rarely.
Small groups are harder to leave and so riskier to join or start.
If the quality of conversations declines with size, here is one reason that groups might get too large (from the perspective of a benevolent social planner):
Conversations vary in their desirability—whether because of random drift in topics, the presence of fun people, access to the comfortable seating, or whatever. So by default some conversations will be better than others. This is partially visible from outside, whether by observing how much people are laughing, looking for the cool kids, or just gravitating to the couch.
If any conversation looks better than the others, then selfish conversational participants will preferentially join that conversation. We’d expect this to continue until the marginal party-goer is indifferent among the conversations. This will cause the good conversations to become larger and larger until they are no better than any other conversation.
To see how this leads to a problem, suppose that you have a house with a large number of spaces for conversation, one of which is nicer than the others—and simplify everything maximally, assuming that all conversation participants are interchangeable and that adding more people really just makes discussions worse (ignore the fact that you would obviously never throw a party in this world).
Then you end up with a bunch of 2 person groups, and one N person group using the nice space, where N is just large enough that the N person conversation is no better or worse than one of the 2 person groups. The net effect is exactly the same total welfare as if you had no nice space at all.
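Here is that equilibrium as a toy calculation (all numbers invented): say the per-person quality of an n-person conversation is 10 - n in an ordinary space, and the one nice space adds a flat bonus of 4.

```python
# Toy model of selfish joining; the quality parameters are made up.
BASE, BONUS = 10.0, 4.0

def quality(n, nice=False):
    """Per-person quality of an n-person conversation."""
    return BASE - n + (BONUS if nice else 0.0)

# Selfish party-goers pile into the nice space until it is no better
# than an ordinary 2-person conversation.
n = 2
while quality(n + 1, nice=True) >= quality(2):
    n += 1

print(f"nice-space group grows to n = {n}")        # n = 6
print(f"quality there: {quality(n, nice=True)}")   # 8.0
print(f"ordinary 2-person baseline: {quality(2)}") # 8.0
```

With these numbers the nice-space conversation grows to 6 people and its quality falls to exactly the 2-person baseline, so the nice space contributes nothing to total welfare.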
Of course the same thing happens if you add one person who is the life of the party, improving the quality of whatever conversation they are in. If you add just one such person, then the group containing them will grow until it is large enough to totally offset their value add.
The effect is less stark once you add enough cool kids and nice spaces—eventually a rising tide lifts all boats—but in general this kind of dynamic could lead to a leveling down to whatever the quality of the “reservation conversation” is, obliterating any gains from nice spaces, particularly fun people, or conversations that happened to go in a really interesting and fulfilling direction. (I’m not really sure about this last one, since if conversations sometimes go in an interesting direction then that also increases the expected value of starting a new conversation.)
To the extent that this is an important dynamic, there are possible fixes. A very brutish solution is just having a strong norm against joining conversations once they reach a certain size. If you could exogenously determine social judgments, you could deem it impolite to join 4-5 person conversations, taboo to join 6 person conversations, and good etiquette to leave 4-6 person conversations unless you are feeling particularly engaged.
Here is one reason that you’d expect people to sit on their own too little (from the perspective of a benevolent social planner):
If you are sitting on your own, the expected amount of time before someone joins you depends directly on how much the other party-goers want your company. So at any given moment, being seen sitting on your own is evidence of unpopularity. I think most people can feel that in their bones. So going and sitting on your own is guaranteed to generate some evidence that you are unpopular, the only question is how much of it. For people who aren’t constantly analyzing the signaling consequences of everything they do, this may just translate into feeling surprisingly uncomfortable about sitting on their own, even though they normally wouldn’t mind a few minutes of solitude.
It would be some work to actually find the equilibrium, but I think that on average you are going to take a hit (in terms of others’ estimates of your popularity) from striking out on your own. I’d be interested if anyone actually solves it.
If true this is a little bit weird—you take an action that we’d expect popular people to take more often, and then people update negatively about your popularity? The trick is that people are much better able to observe “who is sitting on their own right now” than to track the exact sequence of events that produced each group. For example, suppose people scan the room once every few minutes. Then they can notice someone sitting on their own, but if they see a group of 2 people they don’t know who joined whom, and so can’t tell whose popularity they should update positively about.
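Here is a toy Bayesian version of that story (all numbers invented). Even if popular people strike out on their own more often, they get joined sooner, so a random scan is more likely to catch an unpopular person alone:

```python
# Snapshot observers: they see who is alone NOW, not who did what.
prior = {"popular": 0.5, "unpopular": 0.5}
p_start_solo = {"popular": 0.3, "unpopular": 0.2}    # popular people strike out more
minutes_alone = {"popular": 2.0, "unpopular": 10.0}  # but get joined sooner
party_minutes = 120.0

# P(a random scan catches this person sitting alone)
seen_alone = {k: p_start_solo[k] * minutes_alone[k] / party_minutes
              for k in prior}
z = sum(prior[k] * seen_alone[k] for k in prior)
posterior = {k: round(prior[k] * seen_alone[k] / z, 2) for k in prior}
print(posterior)  # {'popular': 0.23, 'unpopular': 0.77}
```

So the update runs against popularity, despite popular people sitting alone more often, exactly because observers see snapshots rather than histories.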
To the extent that’s an important dynamic, there are lots of possible fixes.
One simple idea is to designate a space for forming new conversations that isn’t visible from the rest of the party. If I want to start a new conversation with someone random, I go to the designated room. If it’s just me, I do some math or browse the internet or whatever (personally I don’t mind solitude, but do mind awkwardness). When other people join, then we go somewhere else and it’s business as usual.
Of course you could do this better with a machine. I can pull out my phone and press the “new conversation” button, get told if someone else has also pressed “new conversation,” and then start a new group with them. This would be an easy app to make (everyone enters their name and sees a checkbox; when two people within 100 feet check the box, their boxes get unchecked and the second person sees the first person’s name). I would try it.
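A minimal sketch of that matching logic (the names and structure here are mine, and the 100-foot proximity check is stubbed out as a set you pass in; a real app would also want timeouts):

```python
import time

class NewConversationMatcher:
    """In-memory version of the "new conversation" checkbox."""

    def __init__(self):
        self.waiting = {}  # name -> time they checked the box

    def press(self, name, nearby):
        """`name` checks the box; `nearby` is the set of people within
        range. Returns the matched partner's name, or None if waiting."""
        for other in sorted(self.waiting, key=self.waiting.get):
            if other in nearby:
                del self.waiting[other]  # both boxes get unchecked
                return other             # second presser sees the first's name
        self.waiting[name] = time.time()
        return None

matcher = NewConversationMatcher()
print(matcher.press("alice", nearby=set()))    # None: alice waits
print(matcher.press("bob", nearby={"alice"}))  # 'alice': matched
```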
(ETA: a bolder and sillier solution is to have it be obvious who started the conversation even to observers who quickly scan the room, e.g. because the first person takes a designated sitting-on-your-own seat. Then in theory the positive update from being the first participant in a happening group should offset the negative update from sitting on your own.)
If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact. I have yet to see one, although they could exist. The evidence so far points towards “objective impact is sufficient”.
We’d like to build AI systems that help us resolve the tricky situation that we’re in: systems that help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.
I think the default “terrible” scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.
That may ultimately culminate with a catastrophic bang, but if it does it’s not going to be because we wanted the AI to have a small impact and it had a large impact. It’s probably going to be because we have a very limited idea what is going on, but we don’t feel like we have the breathing room to step back and chill out (at least not for long) because we don’t believe that everyone else is going to give us time.
If I’m trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does “low impact” mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?
(And realistically I doubt we’ll fail at alignment with a bang—it’s more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn’t let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)
It seems like “low objective impact” is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that’s a coherent situation to think about and plan for, but we shouldn’t mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it’s the best hope if you were very pessimistic about what I consider “mainline” alignment.)
The result in the paper is “No theory satisfies the three assumptions (Q, C, S).”
The table in the paper says that MWI violates assumption S and is “?” on the other two assumptions.
Unsurprisingly, when I look at the assumptions they all seem wrong or incoherent, depending on how you make the fuzzy statements precise. I’d guess most LW-ers are in a similar place (as are most quantum computing people probably), so this wouldn’t really change any minds around here.
(Also their discussion of many-worlds sounds a bit silly. Nowhere in their table of interpretations is the natural one, “the wavefunction is all there is.”)
Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it’s in the test distribution. But it does little to prevent a malignant 1% from very conservatively trying (and usually failing) to ascertain whether it’s in the test distribution.
The question is roughly: can you be good enough at finding problematic inputs that, if a problematic input exists, it is much more likely to be found during the adversarial training step than at deployment time? If so, then a model with high reward probably won’t do anything bad at deployment time (exactly how good you need to be depends on exactly how overwhelmingly strong the prior for malign behavior is).
This likely involves introducing some asymmetry between the adversary’s task and the test-time task. Examples of possible asymmetries include: allowing the adversary to produce different kinds of inputs (e.g. setting activations arbitrarily), letting the adversary poke around inside the model to find bad cases (e.g. doing gradient descent, combining with interpretability, or backwards chaining through the model), or giving the adversary a much weaker job (e.g. only asking them to produce an input on which the model thinks anything remotely suspicious, as judged by interpretability, or finding any conditions under which the model fails to instantly shut down a malicious scheme that it previously initiated).
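To make the quantitative point concrete, here is a toy calculation (every number is invented):

```python
# If adversarial training is far more likely to surface a bad input than
# deployment is to stumble on it, most malign models get filtered out.
p_malign = 0.01    # prior: training produced a model with some bad input
p_catch = 0.999    # P(adversarial training finds that input | it exists)
p_trigger = 0.01   # P(deployment hits that input | it exists, uncaught)

# Condition on the model passing adversarial training:
p_pass = p_malign * (1 - p_catch) + (1 - p_malign)
p_malign_given_pass = p_malign * (1 - p_catch) / p_pass
deployment_risk = p_malign_given_pass * p_trigger

print(f"P(malign | passed training) = {p_malign_given_pass:.1e}")  # ~1e-5
print(f"P(bad behavior deployed)    = {deployment_risk:.1e}")      # ~1e-7
# The more overwhelming the prior for malign behavior, the larger
# p_catch has to be to keep this risk acceptable.
```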
(I don’t know what “prevent” means here.)
Interpretability makes it harder for the malignant 1% to be hidden, but doesn’t prevent malignant cognition it can’t detect.
I agree that if your interpretability doesn’t allow you to detect the agent thinking “Am I on the training distribution?” then it won’t work. So then the question is: can interpretability be strong enough that if the model is thinking something catastrophically dangerous, you have a very good chance of noticing it? (Again, how good you need to be depends on how overwhelming is the prior for bad behavior.)
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it’s in the test distribution.
I don’t think this fact makes bad behavior unlikely on its own. But interpretability or adversarial training could exploit the fact that your model is trained by gradient descent. Regularization can strengthen the basic observation that a malign failure needs to use parts of the model that are exercised during training, making it easier to identify possible malign failures by examining behavior on the training distribution. We can potentially exploit the fact that gradient descent updates the model continuously.
I’m still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they’re in the test distribution, and executing a treacherous turn if so.
(“Well-hidden” here means “well-hidden by gradient descent,” not “well-hidden by the agent,” right?)
But if e.g. the overseer is only inspecting the distilled agent’s justifications for its behavior, and something like its verbal loop, I don’t see how things can work out.
You can imagine the overseer as inspecting the agent’s actions, and probing the agent’s behavior in hypothetical situations. The overseer only “looks inside” the agent’s head as a way to help evaluate behavior or identify possibly problematic situations (and there is not necessarily any explicit looking, it could be a weight/activation sharing scheme where the overseer is able to draw on the agent’s internal cognitive state in the way that the agent is).
1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from their overseer. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer’s comment.)
Agents can be viewed as expected utility maximizers as soon as their preferences over lotteries become consistent, which probably happens very early in training.
2. If the agents reason using TDT, the effective utility functions of these act-based agents will optimize for long-term instrumental preferences of their users, because in expectation, adopting this policy raises whatever short-term instrumental reward they’ll get.
This seems like a non sequitur / magical thinking. We can be more precise:
The policy which uses TDT to try to increase its own future reward will be straightforwardly selected against by the outer optimization.
The inner process may nevertheless use TDT if TDT doesn’t diverge from CDT on the training distribution, or it might learn to use TDT but “look nice” so that it doesn’t get selected against. From the perspective of my scheme this isn’t importantly different from e.g. the case where the agent just has a completely different goal and is waiting for a treacherous turn.
3. Expected utility maximizers for discounted future rewards of reward signals are hopelessly incorrigible.
I certainly agree that if we have an agent that is unaligned (but instrumentally deciding to appear aligned), that will be incorrigible. I don’t think that the other stuff about TDT adds anything beyond the usual situation though.
4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to finding strategies for determining how to / whether to execute a treacherous turn.
Yes, that’s always the threat model for adversarial training / verification / interpretability.
5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.
Of course this is the step where I disagree. We can talk briefly about the possible options (what we care about is some kind of disjunction over these options, along with all other options that we haven’t considered):
Adversarial training. The hope is that you can construct situations that will elicit a treacherous turn. You could do this by a combination of (a) probing a very broad distribution over possible future situations, including every situation you can think of that could appear in the near future, (b) having weight/activation sharing between the agent and the adversary—the hope is that if the agent is able to use that 1% of its brain to think of whether to use a treacherous turn, the adversary could in the same way use that 1% in order to identify situations that might elicit a treacherous turn, (c) allowing the adversary to submit “abstract” attacks that cover a large space of possible situations.
Verification. Speaking very vaguely, the hope is that instead of merely running the model on individual inputs, you can run it on some kind of relaxed input (e.g. intervals, or putative moments of a distribution over inputs, etc.; a concrete sketch of one such relaxation appears after this list of options). Then we verify that not only does the model behave acceptably on actual inputs, it behaves acceptably on these relaxed inputs. I don’t have a simple non-technical description of this idea, but I feel relatively confident that the intuition behind your claim is not precise enough to actually rule out this kind of solution.
Interpretability. The hope is that you can look inside the model and see the “should I do a treacherous turn?” computation. If the agent is doing any optimization to obscure this reasoning from us, then we want to look for that optimization instead. So we are only up against normal gradient descent here. Gradient descent will be pushing against whatever kinds of treacherous turns can be detected by our machinery. But it will only push for things that improve behavior on the training distribution, so a treacherous turn can only appear where it is coupled with the machinery that is producing good predictions, giving us more traction for interpretability. In general, this looks like a tricky problem, but I don’t think you can get nearly as far as saying “a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.” I think you should generally be skeptical when you have a vague intuitive appeal up against a universal quantifier.
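For concreteness, here is the promised sketch of one standard “relaxed input” for the verification option above: interval bound propagation through a tiny ReLU network (the weights are random placeholders; in a real verification setting you would check that no input in the box can produce an unacceptable output):

```python
import numpy as np

# Propagate a whole box [lo, hi] of inputs through the network at once,
# getting sound bounds on every output reachable from that box.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)

def affine_bounds(lo, hi, W, b):
    """Exact interval image of x -> W @ x + b over the box [lo, hi]."""
    mu, r = (lo + hi) / 2, (hi - lo) / 2
    center, radius = W @ mu + b, np.abs(W) @ r
    return center - radius, center + radius

def network_bounds(lo, hi):
    lo, hi = affine_bounds(lo, hi, W1, b1)
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)  # ReLU is monotone
    return affine_bounds(lo, hi, W2, b2)

lo, hi = network_bounds(np.full(4, -0.1), np.full(4, 0.1))
print(list(zip(lo.round(2), hi.round(2))))  # guaranteed output ranges
```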
I’m currently intuiting that there’s a broad basin of “seeming corrigible until you can perform a treacherous turn”, but not a broad basin of true corrigibility
I don’t think this argument says very much about whether there is a broad basin of attraction around corrigibility; if your agent waits for a treacherous turn and then behaves badly, that’s not in the basin. The point of being a broad basin is that “executes a treacherous turn” now needs to be a discrete thing to kick you out of the basin, it can’t be an infinitesimal degradation of performance. But we still have the question: even if “bad behavior” is a discrete event, can we actually use techniques for optimizing worst-case performance to avoid it?
I mentioned two seemingly valid approaches, that would lead to different beliefs for the human, and asked how the AI could choose between them. You then went up a level of meta, to preferences over the deliberative process itself.
The AI was choosing what text to show Petrov. I suggested the AI choose the text based on the features that would lead Petrov (or an appropriate idealization) to say that one text or the other is better, e.g. informativeness, concision, etc. I wouldn’t describe that as “going up a level of meta.”
But I don’t think the meta preferences are more likely to be consistent—if anything, probably less so. And the meta-meta-preferences are likely to be completely underdefined, except in a few philosophers.
It seems to me like Petrov does have preferences about descriptions that the AI could provide, e.g. views about which are accurate, useful, and non-manipulative. And he probably has views about what ways of thinking about things are going to improve accuracy. If you want to call those “meta preferences” then you can do that, but then why think that those are undefined?
Also it’s not like we are passing to the meta level to avoid inconsistencies in the object level. It’s that Petrov’s object-level preference looks like “option #1 is better than option #2, but ‘whichever option I’d pick after thinking for a while’ is better than either of them.”
Doing corrigibility without keeping an eye on the outcome seems, to me, to be similar to many failed AI safety approaches—focusing on the local “this sounds good”, rather than on the global “but it may cause extinction of sentient life”.
This doesn’t seem right to me.
Though we are assuming that neither the AI nor the human is supposed to look at the conclusion, this may just result in either a random walk, or an optimisation pressure by hidden processes inside the definition.
Thinking about a problem without knowing the answer in advance is quite common. The fact that you don’t know the answer doesn’t mean that it’s a random walk. And the optimization pressure isn’t hidden—when I try to answer a question by thinking harder about it, there is a huge amount of optimization pressure to get to the right answer, it’s just that it doesn’t take the form of knowing which answer is correct and then backwards chaining from that to figure out what deliberative process would lead to the correct answer.
It doesn’t seem like you need sophisticated technology to “decide to make a decision without taking information X into account” in this case—the AI can just make the decision on the basis of particular features that aren’t X.
I’m saying Petrov has preferences over what text to read based on characteristics of the text (and implicitly over the deliberative process implied by that text)—does it make true claims, does it engage with his sympathies in a way that he endorses, does it get quickly to the point, etc.
Those preferences over text (and hence over deliberative process) will ultimately lead to Petrov’s preferences changing in one way or another, but it’s his preferences about text that imply his meta-preferences rather than the other way around.
Similarly, when I choose how I want to deliberate or reflect, I’m looking at the process of deliberation itself and deciding what process I think is best. That process then leads to some outcome, which I endorse because it was the outcome of the deliberative process I endorsed. I’m not picking a conclusion and then preferring the deliberative process that leads to that conclusion. If I’m in a state such that I’d prefer to pick a conclusion and then choose the deliberative process that leads to it, then I’m not deliberating (in the epistemic sense) at all; my preferences are already settled.
I don’t see any of this as conflicting with corrigibility. If the AI is involved in my deliberative process, whether by choosing how to explain something or what evidence to show me or whatever, then the corrigible thing to do is to (try to) help me deliberate in the way that I would prefer to deliberate (as opposed to influencing my deliberation in a way that is intended to achieve any other end). Of course my values will change; my values are constantly changing, and any AI that is embedded in the world in a realistic way is going to have an influence on the way our values change. The point of aligned AI in general is to help us get what we want, including what we want about the process by which our values change.
We seem to have a persistent disagreement about this point. I understand the position Wei Dai outlined in this thread and consider that to be an understandable quantitative disagreement—about the relative importance of value drift caused by errors in our understanding of deliberation (compared to what I consider the alignment problem proper). My view could change on that point, especially if I came to be more optimistic about narrow-sense alignment. If your position is different from that one, then I don’t yet understand it.
Predictably, if given A, Petrov will warn his superiors (and maybe set off a nuclear war), and, if given B, he will not.
Petrov has all kinds of preferences about what kind of introductory text is “best,” and the goal of the AI is to give the one that Petrov considers best. Petrov’s preferences about introductory text will not be based on backwards chaining from the effects on Petrov (otherwise he wouldn’t need to read the textbook), they will be based on features of the text itself. Likewise, the AI’s decision shouldn’t be based on backwards chaining from the effects on Petrov.
The planner’s behavior is an example of implicit extortion (it even follows the outline from that post: initially rewarding the desired behavior at low cost, briefly paying a large cost to both reward and penalize, and then transitioning to very cheap extortion). An RL agent that can be manipulated to cooperate by this mechanism can just as easily be made to hand the planner daily “protection” money. This suggests that agents that are successful in the real world will probably be at least somewhat resistant to this kind of extortion (or else the world will be some kind of weird equilibrium of these extortion games), either constitutionally or because of legal protections against this kind of extortion.
It seems like a satisfying model of / solution to this problem should somehow leverage the fact that cooperation is positive sum, such that an agent ought to be OK with the outcome.
If you were to actually apply the ideas from this paper, I think the interesting work is done by society agreeing that the planner has the right to use coercive violence to achieve their desired end. At that point it seems easiest to just describe this as a law against defection. The role of the planner seems exactly analogous to a legislator, reaching agreement about how the planner ought to behave is exactly as hard as reaching agreement about legislation, and there is no way to achieve the same outcome without such an agreement. Interpreted as a practical guide to legislation, I don’t think this kind of heuristic adds much beyond conventional political economy.
(Of course, in a world with powerful AI systems, such laws will be enforced primarily by other AI systems. That seems like a tricky problem, but I don’t see it as being meaningfully distinct from the normal alignment problem.)