Erik Jenner
PhD student in AI safety at CHAI (UC Berkeley)
Solution to the free will homework problem
I’m wondering if regularization techniques could be used to make the pure deception regime unstable.
As a simple example, consider a neural network that is trained with gradient descent and weight decay. If the parameters can be (approximately) split into a set that determines the mesa-objective and a set for everything else, then the gradient of the loss with respect to the “objective parameters” would be zero in the pure deception regime, so weight decay would ensure that the mesa-objective couldn’t be maintained.
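To spell out the mechanism (my notation, just making the previous paragraph explicit): with learning rate $\eta$ and weight decay coefficient $\lambda$, a parameter $\theta_{\text{obj}}$ that receives exactly zero gradient is updated as
$$\theta_{\text{obj}} \leftarrow \theta_{\text{obj}} - \eta\big(\underbrace{\nabla_{\theta_{\text{obj}}} L}_{=0} + \lambda\,\theta_{\text{obj}}\big) = (1 - \eta\lambda)\,\theta_{\text{obj}},$$
so after $t$ steps it has shrunk by a factor of $(1-\eta\lambda)^t$, and whatever mesa-objective it encoded decays away.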
The learned algorithm might be able to prevent this by “hacking” its gradient as mentioned in the post, making the parameters that determine the mesa-objective also have an effect on its output. But intuitively, this should at least make it more difficult to reach a stable pure deception regime.
Of course, regularization is a double-edged sword because, as has been pointed out, the shortest algorithms that perform well on the base objective are probably not robustly aligned.
[Question] What is a decision theory as a mathematical object?
Gradient hacking is usually discussed in the context of deceptive alignment. This is probably where it has the largest relevance to AI safety but if we want to better understand gradient hacking, it could be useful to take a broader perspective and study it on its own (even if in the end, we only care about gradient hacking because of its inner alignment implications). In the most general setting, gradient hacking could be seen as a way for the agent to “edit its source code”, though probably only in a very limited way. I think it’s an interesting question which kinds of edits are possible with gradient hacking, for example whether an agent could improve its capabilities this way.
The (not so) paradoxical asymmetry between position and momentum
That sounds right to me, and I agree that this is sometimes explained badly.
Are you saying that this explains the perceived asymmetry between position and momentum? I don’t see how that’s the case; you could say exactly the same thing in the dual perspective (to get a precise momentum, you need to “sum up” lots of different position eigenstates).
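(To make the duality explicit: the two bases are related by a Fourier transform that is symmetric up to a sign in the exponent,
$$|p\rangle = \frac{1}{\sqrt{2\pi\hbar}}\int e^{ipx/\hbar}\,|x\rangle\,\mathrm{d}x, \qquad |x\rangle = \frac{1}{\sqrt{2\pi\hbar}}\int e^{-ipx/\hbar}\,|p\rangle\,\mathrm{d}p,$$
so any statement about building sharp position states out of momentum eigenstates has an exact mirror image in the other direction.)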
If you were making a different point that went over my head, could you elaborate?
Interesting thoughts re anthropic explanations, thanks!
I agree that asymmetry doesn’t tell us which one is more fundamental, and I wasn’t aiming to argue for either one being more fundamental (though position does feel more fundamental to me, and that may have shown through). What I was trying to say was only that they are asymmetric on a cognitive level, in the sense that they don’t feel interchangeable, and that there must therefore be some physical asymmetry.
Still, I should have been more specific than saying “asymmetric”, because not just any kind of asymmetry in the Hamiltonian can explain the cognitive asymmetry. For the “forces decay with distance in position space” asymmetry, I think it’s reasonably clear why it leads to cognitive asymmetry, but for the “position occurs as an infinite power series” asymmetry, it’s not clear to me whether it has noticeable macro-level effects.
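To make the two kinds of asymmetry concrete (standard form, not something from the original post): a typical Hamiltonian looks like
$$H = \frac{p^2}{2m} + V(x),$$
where $p$ enters only quadratically, while $V$ is in general an arbitrary function of $x$ (an infinite power series when expanded), and interaction terms like $V(x_1 - x_2)$ decay with distance in position space rather than in momentum space.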
Performance deteriorating implies that the prior p is not yet a fixed point of p*=D(A(p*)).
At least in the case of AlphaZero, isn’t the performance deterioration from A(p*) to p*? I.e. A(p*) is full AlphaZero, while p* is the “Raw Network” in the figure. We could have converged to the fixed point of the training process (i.e. p*=D(A(p*))) and still have performance deterioration if we use the unamplified model compared to the amplified one. I don’t see a fundamental reason why p* = A(p*) should hold after convergence (and I would have been surprised if it held for e.g. chess or Go and reasonably sized models for p*).
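In symbols, the fixed point I have in mind is
$$p_{t+1} = D(A(p_t)) \quad\longrightarrow\quad p^* = D(A(p^*)),$$
i.e. the raw network no longer changes when trained to imitate its own amplified version. That’s compatible with $A(p^*)$ (network plus search) still being strictly stronger than $p^*$ on its own; nothing about the fixed point forces $p^* = A(p^*)$.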
I enjoyed reading this! And I hadn’t seen the interpretation of a logistic preference model as approximating Gaussian errors before.
Since you seem interested in exploring this more, some comments that might be helpful (or not):
What is the largest number of elements we can sort with a given architecture? How does training time change as a function of the number of elements?
How does the network architecture affect the resulting utility function? How do the maximum and minimum of the unnormalized utility function change?
I’m confused why you’re using a neural network; given the small size of the input space, wouldn’t it be easier to just learn a tabular utility function (i.e. one value for each input, namely its utility)? It’s the largest function space you can have but will presumably also be much easier to train than a NN.
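For concreteness, here’s a minimal sketch of the tabular approach I mean, trained on pairwise comparisons with the logistic (Bradley–Terry) model; the data is made up and this is my own toy code, not anything from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 8                       # one utility parameter per possible input
true_u = rng.normal(size=n_items)

# Sample noisy comparisons (winner, loser) from the logistic preference model.
data = []
for i in range(n_items):
    for j in range(n_items):
        if i == j:
            continue
        p_i_wins = 1 / (1 + np.exp(-(true_u[i] - true_u[j])))
        data.append((i, j) if rng.random() < p_i_wins else (j, i))

# Fit the tabular utilities by gradient ascent on the log-likelihood
# (plus a tiny L2 penalty so utilities stay bounded if one item wins everything).
u = np.zeros(n_items)
lr = 0.5
for _ in range(2000):
    grad = np.zeros(n_items)
    for winner, loser in data:
        p_win = 1 / (1 + np.exp(-(u[winner] - u[loser])))
        grad[winner] += 1 - p_win
        grad[loser] -= 1 - p_win
    u += lr * (grad / len(data) - 1e-3 * u)

# Utilities are only identified up to an additive constant, so compare centered values.
print(np.corrcoef(true_u - true_u.mean(), u - u.mean())[0, 1])
```

(With only 8 possible inputs this fits essentially instantly, which is part of why the architecture choice seems like a side issue here.)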
Questions like the ones you raise could become more interesting in settings with much more complicated inputs. But I think in practice, the expensive part of preference/reward learning is gathering the preferences, and the most likely failure modes revolve around things related to training an RL policy in parallel to the reward model. The architecture etc. seem a bit less crucial in comparison.
Which portion of possible comparisons needs to be presented (on average) to infer the utility function?
I thought about this and very similar questions a bit for my Master’s thesis before changing topics; happy to chat about that if you want to go down this route. (Though I didn’t think about inconsistent preferences, just about the effect of noise. Without either, the answer should just be the $\Theta(n \log n)$ comparisons needed for sorting, I guess.)
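(Rough argument for that count: a comparison-based procedure has to distinguish between the $n!$ possible orderings, and each binary comparison gives at most one bit of information, so you need at least $\log_2 n! = \Theta(n \log n)$ comparisons, which sorting achieves; as a fraction of all $\binom{n}{2}$ possible comparisons, that’s $\Theta(\log n / n)$.)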
How far can we degenerate a preference ordering until no consistent utility function can be inferred anymore?
You might want to think more about how to measure this, or even what exactly it would mean if “no consistent utility function can be inferred”. In principle, for any (not necessarily transitive) set of preferences, we can ask what utility function best approximates these preferences (e.g. in the sense of minimizing loss). The approximation can be exact iff the preferences are consistent. Intuitively, slightly inconsistent preferences lead to a reasonably good approximation, and very inconsistent preferences probably admit only very bad approximations. But there doesn’t seem to be any point where we can’t infer the best possible approximation at all.
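As a toy illustration of the “best approximation always exists” point (my own example code, not anything from the post): take the cyclic preferences A ≻ B, B ≻ C, C ≻ A and fit a tabular utility under the logistic loss. The minimizer is perfectly well defined, it just assigns (roughly) equal utility to all three items:

```python
import numpy as np

# Cyclic (inconsistent) preferences over items A=0, B=1, C=2, as (winner, loser) pairs.
prefs = [(0, 1), (1, 2), (2, 0)]

rng = np.random.default_rng(0)
u = rng.normal(size=3)            # start from random utilities
for _ in range(5000):
    grad = np.zeros(3)
    for w, l in prefs:
        p = 1 / (1 + np.exp(-(u[w] - u[l])))
        grad[w] += 1 - p
        grad[l] -= 1 - p
    u += 0.1 * grad

print(u - u.mean())               # converges to ~[0, 0, 0]: the best fit exists, but is flat
```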
Related to this (but a bit more vague/speculative): it’s not obvious to me that approximating inconsistent preferences using a utility function is the “right” thing to do. At least in cases where human preferences are highly inconsistent, this seems kind of scary. Not sure what we want instead (maybe the AI should point out inconsistencies and ask us to please resolve them?).
I didn’t see the proposals, but I think that almost all of the difficulty will be in how you can tell good from bad reporters by looking at them. If you have a precise enough description of how to do that, you can also use it as a regularizer. So the post hoc vs a priori thing you mention sounds more like a framing difference to me than fundamentally different categories. I’d guess that whether a proposal is promising depends mostly on how it tries to distinguish between the good and bad reporter, not whether it does so via regularization or via selection after training (since you can translate back and forth between those anyway).
(Though as a side note, I’d usually expect regularization to be much more efficient in practice, since if your training process has a bias towards the bad reporter, it might be hard to get any good reporters at all.)
If I’m mistaken, I’d be very interested to hear an example of a strategy that fundamentally only works once you have multiple trained models, rather than as a regularizer!
Reward model hacking as a challenge for reward learning
I just tried the following prompt with GPT-3 (default playground settings):
Assume “mouse” means “world” in the following sentence. Which is bigger, a mouse or a rat?
I got “mouse” 2 out of 15 times. As a control, I got “rat” 15 times in a row without the first sentence. So there’s at least a hint of being able to do this in GPT-3; I wouldn’t be surprised at all if GPT-4 could do this one reliably.
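In case anyone wants to replicate this, here’s a minimal sketch using the (legacy) OpenAI completions API; the model name and sampling settings are just my guesses at what “default playground settings” corresponds to:

```python
import openai  # requires the pre-1.0 openai package (legacy Completions endpoint)

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    'Assume "mouse" means "world" in the following sentence. '
    "Which is bigger, a mouse or a rat?"
)

for _ in range(15):
    resp = openai.Completion.create(
        model="text-davinci-002",  # assumption: the playground default at the time
        prompt=prompt,
        max_tokens=5,
        temperature=0.7,           # assumption: default playground temperature
    )
    print(resp["choices"][0]["text"].strip())
```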
Breaking down the training/deployment dichotomy
I basically agree; ensuring that failures are fine during training would sure be great. (And I also agree that if we have a setting where failure is fine, we want to use that for a bunch of evaluation/red-teaming/...). As two caveats, there are definitely limits to how powerful an AI system you can sandbox IMO, and I’m not sure how feasible sandboxing even weak-ish models is from the governance side (WebGPT-style training just seems really useful).
Some feedback, particularly for deciding what future work to pursue: think about which seem like the key obstacles, and which seem more like problems that are either not crucial to get right, or that should definitely be solvable with a reasonable amount of effort.
For example, humans being suboptimal planners and not knowing everything the AI knows seem like central obstacles for making IRL work, and potentially extremely challenging. Thinking more about those could lead you to think that IRL isn’t a promising approach to alignment after all. Or, if you do get the sense that these can be solved, then you’ve made progress on something important, rather than a minor side problem. On the other hand, e.g. “Human Values are context dependent” doesn’t seem like a crucial obstacle for IRL to me.
One framing of this idea is Research as a stochastic decision process: For IRL to work as an alignment solution, a bunch of subproblems need to be solved, and we don’t know whether they’re tractable. We want to fail fast, i.e. if one of the subproblems is intractable, we’d like to find out as soon as possible so we can work on something else.
Another related concept is that we should think about worlds where iterative design fails: some problems can be solved by the normal iterative process of doing science: see what works, fix things that don’t work. We should expect those problems to be solved anyway. So we should focus on things that don’t get solved this way. One example in the context of IRL is again that humans have wrong beliefs/don’t understand the consequences of actions well enough. So we might learn a reward model using IRL, and when we start training using RL it looks fine at first, but we’re actually in a “going out with a whimper”-style situation.
For the record, I’m quite skeptical of IRL as an alignment solution, in part because of the obstacles I mentioned, and in part because it just seems that other feedback modalities (such as preference comparisons) will be better if we’re going the reward learning route at all. But I wanted to focus mainly on the meta point and encourage you to think about this yourself.
I’m curious what you’d think about this approach for addressing the suboptimal planner sub-problem: “Include models from cognitive psychology about human decision-making in IRL, to allow IRL to better understand the decision process.”
Yes, this is one of two approaches I’m aware of (the other being trying to somehow jointly learn human biases and values, see e.g. https://arxiv.org/abs/1906.09624). I don’t have very strong opinions on which of these is more promising, they both seem really hard. What I would suggest here is again to think about how to fail fast. The thing to avoid is spending a year on a project that’s trying to use a slightly more realistic model of human planning, and then realizing afterwards that the entire approach is doomed anyway. Sometimes this is hard to avoid, but in this case I think it makes more sense to start by thinking more about the limits of this approach. For example, if our model of human planning is slightly misspecified, how does that affect the learned reward function, and how much regret does that lead to? If slight misspecifications are already catastrophic, then we can probably forget about this approach, since we’ll surely only get a crude approximation of human planning.
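As one example of the kind of quick sanity check I mean (my own toy calculation, not something from the literature): suppose the human is Boltzmann-rational in a one-step setting, picking actions with
$$\pi(a) \propto \exp\big(\beta_{\text{true}}\, R(a)\big),$$
and we do IRL assuming rationality $\beta_{\text{model}}$. Then the inferred reward is $\hat R(a) = \tfrac{1}{\beta_{\text{model}}} \log \pi(a) = \tfrac{\beta_{\text{true}}}{\beta_{\text{model}}} R(a) + \text{const}$, so misspecifying $\beta$ only rescales the reward and leaves the optimal policy (and hence regret) unchanged in this toy setting. The worry is about structural misspecifications, e.g. assuming far-sighted planning when the human is actually myopic, which can change the ordering of $\hat R$ and lead to real regret; quantifying how quickly that happens is the fail-fast question.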
Also worth thinking about other obstacles to IRL. One issue is “how do we actually implement this?”. Reward model hacking seems like a potentially hard problem to me if we just do a naive setup of reward model + RL agent. Or if you want to do something more like CIRL/assistance games, you need to figure out how to get a (presumably learned) agent to actually reason in a CIRL-like way (Rohin mentions something related in the second-to-last bullet here). Arguably those obstacles feel more like inner alignment, and maybe you’re more interested in outer alignment. But (1) if those turn out to be the bottlenecks, why not focus on them?, and (2) if you want your agent to do very specific cognition, such as reasoning in a CIRL-like way, then it seems like you might need to solve a harder inner alignment problem, so even if you’re focused on outer alignment there are important connections.
I think there’s a third big obstacle (in addition to “figuring out a good human model seems hard”, and “implementing the right agent seems hard”), namely that you probably have to solve something like ontology identification even if you have a good model of human planning/knowledge.
But I’m not aware of any write-up explicitly about this point.

ETA: I’ve now written a more detailed post about this here.

Also, do you have a pointer for something to read on preference comparisons?
If you’re completely unfamiliar with preference comparisons for reward learning, then Deep RL from Human Preferences is a good place to start. More recently, people are using this to fine-tune language models, see e.g. InstructGPT or Learning to summarize from human feedback. People have also combined human demonstrations with preference comparisons (https://arxiv.org/abs/1811.06521), but usually that just means pretraining on demonstrations and then fine-tuning with preference comparisons (I think InstructGPT did this as well). AFAIK there isn’t really a canonical reference comparing IRL and preference comparisons and telling you which one you should use in which cases.
How are you dealing with ontology identification?
Great point, some rambly thoughts on this: one way in which ontology identification could turn out to be like no-free-lunch theorems is that we actually just get the correct translation by default. I.e. in ELK report terminology, we train a reporter using the naive baseline and get the direct translator. This seems related to Alignment by default, and I think of them the same way (i.e. “this could happen, but it seems very scary to rely on it without better arguments for why it should happen”). I’d say one reason we don’t think much about no-free-lunch theorems as a key obstacle to AI is that we’ve seen tons of cases where good generalization happens because the world is low entropy. I don’t think we’ve seen that kind of evidence for ontology identification not being a problem in practice. That said, “I think ontology identification will be easy, here’s why” is another valid response to the question from this post.
A related point would be “Should we think about ontology identification explicitly, or just work on other stuff and eventually solve it implicitly?” My first instinct is to directly tackle ontology identification, but I could see cases where a solution to ontology identification is actually easier to find from another lens. I do think though that that other lens will have to tackle a similarly difficult and central problem; just working on approaches that essentially assume away the ontology identification problem will very likely not lead to progress on ontology identification.
For the examples: do you mean examples of thinking about ontology identification being useful for solving it, or examples of how a solution would be helpful for alignment?
I might not have exactly the kind of example you’re looking for, since I’d frame things a bit differently. So I’ll just try to say more about the question “why is it useful to explicitly think about ontology identification?”
One answer is that thinking explicitly about ontology identification can help you notice that there is a problem that you weren’t previously aware of. For example, I used to think that building extremely good models of human irrationality via cogsci for reward learning was probably not very tractable, but could at least lead to an outer alignment solution. I now think you’d also have to solve ontology identification, so I’m now very skeptical of that approach. As you point out in another comment, you could technically treat ontology identification as part of human irrationality (not sure if you’d call this the “usual/obvious way” in this setting?). But what you notice when separating out ontology identification is that if you have some way of solving the ontology identification part, you should probably just use that for ELK and skip the part where you model human irrationalities really well.
Another part of my answer is that ontology identification is not an obviously better frame for any single specific problem, but it can be used as a unifying frame to think about problems that would otherwise look quite different. So some examples of where ontology identification appears:
The ELK report setting: you want to give better-informed preference comparisons
The case I mentioned above: you’ve done some cognitive science and are able to learn/write down human rewards in terms of the human ontology, but still need to translate them
You think that your semi-supervised model already has a good understanding of what human values/corrigibility/… are, and your plan is to retarget the search or to otherwise point an optimizer at this model’s understanding of human values. But you need to figure out where exactly in the AI human values are represented
To prevent your AI from becoming deceptive, you want to be able to tell whether it’s thinking certain types of thoughts (such as figuring out whether it could currently take over the world). This means you have to map AI thoughts into things we can understand
You want clear-cut criteria for deciding whether you’re interpreting some neuron correctly. This seems very similar to asking “How do we determine whether a given ontology translation is correct?” or “What does it even mean for an ontology translation to be ‘correct’?”
I think ontology identification is a very good framing for some of these even individually (e.g. getting better preference comparisons), and not so much for others (e.g. if you’re only thinking about avoiding deception, ontology identification might not be your first approach). The interesting thing is that these problems seemed pretty different to me without the concept of ontology identification, but suddenly look closely related once we reframe them.
In case you weren’t aware, this is no longer true if the state space has “holes” (formally: if its first cohomology group is non-zero). For example, if the state space is the Euclidean plane without the origin, you can have a vector field on that space which has no curl but isn’t conservative (and thus is not the gradient of any utility function).
Why this might be relevant:
1. Maybe state spaces with holes actually occur, in which case removing the curl of the PVF wouldn’t always be sufficient to get a utility function
2. The fact that zero curl only captures the concept of transitivity for certain state spaces could be a hint that conservative vector fields are a better concept to think about here than irrotational ones (even if it turns out that we only care about simply connected state spaces in practice)
EDIT: an example of an irrotational 2D vector field which is not conservative is $v(x,y) = \left(\frac{-y}{x^2+y^2}, \frac{x}{x^2+y^2}\right)$, defined for $(x,y) \in \mathbb{R}^2 \setminus \{0\}$.
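For completeness, the standard computation showing this (not in the original comment): the curl vanishes everywhere away from the origin,
$$\partial_x\!\left(\frac{x}{x^2+y^2}\right) - \partial_y\!\left(\frac{-y}{x^2+y^2}\right) = \frac{y^2 - x^2}{(x^2+y^2)^2} - \frac{y^2 - x^2}{(x^2+y^2)^2} = 0,$$
but the circulation around the unit circle is $\oint v \cdot \mathrm{d}l = \int_0^{2\pi} \big(\sin^2\theta + \cos^2\theta\big)\,\mathrm{d}\theta = 2\pi \neq 0$, so $v$ cannot be the gradient of any function on the punctured plane.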