Erik Jenner
PhD student in AI safety at CHAI (UC Berkeley)
I’m wondering if regularization techniques could be used to make the pure deception regime unstable.
As a simple example, consider a neural network that is trained with gradient descent and weight decay. If the parameters can be (approximately) split into a set that determines the mesa-objective and a set for everything else, then the gradient of the loss with respect to the “objective parameters” would be zero in the pure deception regime, so weight decay would ensure that the mesa-objective couldn’t be maintained.
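A minimal sketch of this dynamic (my own toy numbers, not from the comment): if the loss gradient with respect to an "objective parameter" is exactly zero, a gradient step with weight decay reduces to `w <- w * (1 - lr * wd)`, so the parameter decays geometrically.

```python
# Toy illustration: an "objective parameter" in the pure deception regime.
lr, wd = 0.1, 0.01   # hypothetical learning rate and weight-decay coefficient
w = 1.0              # parameter encoding the mesa-objective

for _ in range(1000):
    grad = 0.0                 # pure deception: loss gradient w.r.t. w is zero
    w -= lr * (grad + wd * w)  # SGD step with (decoupled) weight decay

print(w)  # ~0.37 after 1000 steps; w -> 0 geometrically, erasing the objective
```

Each step multiplies `w` by `(1 - lr * wd) = 0.999`, so after 1000 steps `w ≈ e^{-1} ≈ 0.37`, and it keeps shrinking toward zero.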
The learned algorithm might be able to prevent this by “hacking” its gradient as mentioned in the post, making the parameters that determine the mesa-objective also have an effect on its output. But intuitively, this should at least make it more difficult to reach a stable pure deception regime.
Of course regularization is a double-edged sword because as has been pointed out, the shortest algorithms that perform well on the base objective are probably not robustly aligned.
Gradient hacking is usually discussed in the context of deceptive alignment. This is probably where it has the largest relevance to AI safety but if we want to better understand gradient hacking, it could be useful to take a broader perspective and study it on its own (even if in the end, we only care about gradient hacking because of its inner alignment implications). In the most general setting, gradient hacking could be seen as a way for the agent to “edit its source code”, though probably only in a very limited way. I think it’s an interesting question which kinds of edits are possible with gradient hacking, for example whether an agent could improve its capabilities this way.
That sounds right to me, and I agree that this is sometimes explained badly.
Are you saying that this explains the perceived asymmetry between position and momentum? I don’t see how that’s the case: you could say exactly the same thing in the dual perspective (to get a precise momentum, you need to “sum up” lots of different position eigenstates).
If you were making a different point that went over my head, could you elaborate?
Interesting thoughts re anthropic explanations, thanks!
I agree that asymmetry doesn’t tell us which one is more fundamental, and I wasn’t aiming to argue for either one being more fundamental (though position does feel more fundamental to me, and that may have shown through). What I was trying to say was only that they are asymmetric on a cognitive level, in the sense that they don’t feel interchangeable, and that there must therefore be some physical asymmetry.
Still, I should have been more specific than saying “asymmetric”, because not any kind of asymmetry in the Hamiltonian can explain the cognitive asymmetry. For the “forces decay with distance in position space” asymmetry, I think it’s reasonably clear why this leads to cognitive asymmetry, but for the “position occurs as an infinite power series” asymmetry, it’s not clear to me whether this has noticeable macro effects.
Performance deteriorating implies that the prior p is not yet a fixed point of p*=D(A(p*)).
At least in the case of AlphaZero, isn’t the performance deterioration from A(p*) to p*? I.e. A(p*) is full AlphaZero, while p* is the “Raw Network” in the figure. We could have converged to the fixed point of the training process (i.e. p*=D(A(p*))) and still have performance deterioration if we use the unamplified model compared to the amplified one. I don’t see a fundamental reason why p* = A(p*) should hold after convergence (and I would have been surprised if it held for e.g. chess or Go and reasonably sized models for p*).
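The distinction can be seen in a toy scalar model (entirely made up — `A` and `D` here are hypothetical one-line stand-ins for amplification and distillation):

```python
def A(p):           # amplification: e.g. adding search on top of the raw network
    return p + 2.0  # assumed fixed skill boost from search

def D(q):           # distillation: the raw network recovers 90% of the target
    return 0.9 * q

p = 0.0
for _ in range(200):  # iterate training: p <- D(A(p))
    p = D(A(p))

print(p)         # converges to p* = 18, solving p* = 0.9 * (p* + 2)
print(A(p) - p)  # the amplified model still beats the raw network by 2
```

Training converges to the fixed point `p* = D(A(p*))`, yet `A(p*) ≠ p*`: performance still "deteriorates" when you drop from the amplified model to the raw network, even at convergence.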
I enjoyed reading this! And I hadn’t seen the interpretation of a logistic preference model as approximating Gaussian errors before.
Since you seem interested in exploring this more, some comments that might be helpful (or not):
What is the largest number of elements we can sort with a given architecture? How does training time change as a function of the number of elements?
How does the network architecture affect the resulting utility function? How do the maximum and minimum of the unnormalized utility function change?
I’m confused why you’re using a neural network; given the small size of the input space, wouldn’t it be easier to just learn a tabular utility function (i.e. one value for each input, namely its utility)? It’s the largest function space you can have but will presumably also be much easier to train than a NN.
Questions like the ones you raise could become more interesting in settings with much more complicated inputs. But I think in practice, the expensive part of preference/reward learning is gathering the preferences, and the most likely failure modes revolve around things related to training an RL policy in parallel to the reward model. The architecture etc. seem a bit less crucial in comparison.
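A tabular utility of the kind suggested above can be fit directly with the logistic (Bradley-Terry) preference model — a minimal sketch with made-up items and noiseless preferences:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 items, tabular utility = one learnable value per item.
true_u = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # ground-truth utilities
n = len(true_u)
u = np.zeros(n)
lr = 0.5

for _ in range(2000):
    i, j = rng.integers(0, n, size=2)  # sample a random comparison
    if i == j:
        continue
    label = 1.0 if true_u[i] > true_u[j] else 0.0   # noiseless preference
    p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))        # P(i preferred over j)
    grad = p - label                                 # d(logistic loss)/d(u_i - u_j)
    u[i] -= lr * grad
    u[j] += lr * grad

print(np.argsort(u))  # should recover the true ordering 0 < 1 < 2 < 3 < 4
```

The unnormalized utilities drift apart without bound under noiseless preferences (only differences matter in the Bradley-Terry model), but the recovered ordering is what we care about.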
Which portion of possible comparisons needs to be presented (on average) to infer the utility function?
I thought about this and very similar questions a bit for my Master’s thesis before changing topics, happy to chat about that if you want to go down this route. (Though I didn’t think about inconsistent preferences, just about the effect of noise. Without either, the answer should just be $O(n \log n)$, I guess, as in comparison-based sorting.)
How far can we degenerate a preference ordering until no consistent utility function can be inferred anymore?
You might want to think more about how to measure this, or even what exactly it would mean if “no consistent utility function can be inferred”. In principle, for any (not necessarily transitive) set of preferences, we can ask what utility function best approximates these preferences (e.g. in the sense of minimizing loss). The approximation can be exact iff the preferences are consistent. Intuitively, slightly inconsistent preferences lead to a reasonably good approximation, and very inconsistent preferences probably admit only very bad approximations. But there doesn’t seem to be any point where we can’t infer the best possible approximation at all.
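To illustrate the "best approximation always exists" point: fitting a tabular utility to a maximally inconsistent 3-cycle (a minimal sketch, my own setup) converges to complete indifference — a well-defined optimum, just a poor one:

```python
import numpy as np

# Cyclic preferences: a > b, b > c, c > a, as (winner, loser) index pairs.
prefs = [(0, 1), (1, 2), (2, 0)]

def loss(u):
    # Total logistic (Bradley-Terry) loss over the three comparisons.
    return sum(np.log1p(np.exp(-(u[w] - u[l]))) for w, l in prefs)

u = np.zeros(3)
for _ in range(500):  # plain gradient descent on the logistic loss
    g = np.zeros(3)
    for w, l in prefs:
        p = 1.0 / (1.0 + np.exp(-(u[w] - u[l])))
        g[w] -= 1.0 - p
        g[l] += 1.0 - p
    u -= 0.1 * g

print(u, loss(u))  # u stays at 0: the best "approximation" is indifference,
                   # with loss 3*log(2) — well-defined, just a bad fit
```

By the cycle's symmetry every item wins once and loses once, so the gradient vanishes at equal utilities and the loss can't drop below `3*log(2)`; the best-fit utility function exists, it's just maximally uninformative.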
Related to this (but a bit more vague/speculative): it’s not obvious to me that approximating inconsistent preferences using a utility function is the “right” thing to do. At least in cases where human preferences are highly inconsistent, this seems kind of scary. Not sure what we want instead (maybe the AI should point out inconsistencies and ask us to please resolve them?).
I didn’t see the proposals, but I think that almost all of the difficulty will be in how you can tell good from bad reporters by looking at them. If you have a precise enough description of how to do that, you can also use it as a regularizer. So the post hoc vs a priori thing you mention sounds more like a framing difference to me than fundamentally different categories. I’d guess that whether a proposal is promising depends mostly on how it tries to distinguish between the good and bad reporter, not whether it does so via regularization or via selection after training (since you can translate back and forth between those anyway).
(Though as a side note, I’d usually expect regularization to be much more efficient in practice, since if your training process has a bias towards the bad reporter, it might be hard to get any good reporters at all.)
If I’m mistaken, I’d be very interested to hear an example of a strategy that fundamentally only works once you have multiple trained models, rather than as a regularizer!
I just tried the following prompt with GPT-3 (default playground settings):
Assume “mouse” means “world” in the following sentence. Which is bigger, a mouse or a rat?
I got “mouse” 2 out of 15 times. As a control, I got “rat” 15 times in a row without the first sentence. So there’s at least a hint of being able to do this in GPT-3, wouldn’t be surprised at all if GPT-4 could do this one reliably.
I basically agree, ensuring that failures are fine during training would sure be great. (And I also agree that if we have a setting where failure is fine, we want to use that for a bunch of evaluation/red-teaming/...). As two caveats, there are definitely limits to how powerful an AI system you can sandbox IMO, and I’m not sure how feasible sandboxing even weak-ish models is from the governance side (WebGPT-style training just seems really useful).
Some feedback, particularly for deciding what future work to pursue: think about which seem like the key obstacles, and which seem more like problems that are either not crucial to get right, or that should definitely be solvable with a reasonable amount of effort.
For example, humans being suboptimal planners and not knowing everything the AI knows seem like central obstacles for making IRL work, and potentially extremely challenging. Thinking more about those could lead you to think that IRL isn’t a promising approach to alignment after all. Or, if you do get the sense that these can be solved, then you’ve made progress on something important, rather than a minor side problem. On the other hand, e.g. “Human Values are context dependent” doesn’t seem like a crucial obstacle for IRL to me.
One framing of this idea is Research as a stochastic decision process: For IRL to work as an alignment solution, a bunch of subproblems need to be solved, and we don’t know whether they’re tractable. We want to fail fast, i.e. if one of the subproblems is intractable, we’d like to find out as soon as possible so we can work on something else.
Another related concept is that we should think about worlds where iterative design fails: some problems can be solved by the normal iterative process of doing science: see what works, fix things that don’t work. We should expect those problems to be solved anyway. So we should focus on things that don’t get solved this way. One example in the context of IRL is again that humans have wrong beliefs/don’t understand the consequences of actions well enough. So we might learn a reward model using IRL, and when we start training using RL it looks fine at first, but we’re actually in a “going out with a whimper”-style situation.
For the record, I’m quite skeptical of IRL as an alignment solution, in part because of the obstacles I mentioned, and in part because it just seems that other feedback modalities (such as preference comparisons) will be better if we’re going the reward learning route at all. But I wanted to focus mainly on the meta point and encourage you to think about this yourself.
I’m curious what you’d think about this approach for addressing the suboptimal planner sub-problem: “Include models from cognitive psychology about human decision-making in IRL, to allow IRL to better understand the decision process.”
Yes, this is one of two approaches I’m aware of (the other being trying to somehow jointly learn human biases and values, see e.g. https://arxiv.org/abs/1906.09624). I don’t have very strong opinions on which of these is more promising, they both seem really hard. What I would suggest here is again to think about how to fail fast. The thing to avoid is spending a year on a project that’s trying to use a slightly more realistic model of human planning, and then realizing afterwards that the entire approach is doomed anyway. Sometimes this is hard to avoid, but in this case I think it makes more sense to start by thinking more about the limits of this approach. For example, if our model of human planning is slightly misspecified, how does that affect the learned reward function, and how much regret does that lead to? If slight misspecifications are already catastrophic, then we can probably forget about this approach, since we’ll surely only get a crude approximation of human planning.
Also worth thinking about other obstacles to IRL. One issue is “how do we actually implement this?”. Reward model hacking seems like a potentially hard problem to me if we just do a naive setup of reward model + RL agent. Or if you want to do something more like CIRL/assistance games, you need to figure out how to get a (presumably learned) agent to actually reason in a CIRL-like way (Rohin mentions something related in the second-to-last bullet here). Arguably those obstacles feel more like inner alignment, and maybe you’re more interested in outer alignment. But (1) if those turn out to be the bottlenecks, why not focus on them?, and (2) if you want your agent to do very specific cognition, such as reasoning in a CIRL-like way, then it seems like you might need to solve a harder inner alignment problem, so even if you’re focused on outer alignment there are important connections.
I think there’s a third big obstacle (in addition to “figuring out a good human model seems hard”, and “implementing the right agent seems hard”), namely that you probably have to solve something like ontology identification even if you have a good model of human planning/knowledge.
But I’m not aware of any write-up explicitly about this point.

ETA: I’ve now written a more detailed post about this here.

Also do you have a pointer for something to read on preference comparisons?
If you’re completely unfamiliar with preference comparisons for reward learning, then Deep RL from Human Preferences is a good place to start. More recently, people are using this to fine-tune language models, see e.g. InstructGPT or Learning to summarize from human feedback. People have also combined human demonstrations with preference comparisons (https://arxiv.org/abs/1811.06521), but usually that just means pretraining using demonstrations and then fine-tuning with preference comparisons (I think InstructGPT did this as well). AFAIK there isn’t really a canonical reference comparing IRL and preference comparisons and telling you which one you should use in which cases.
Great point, some rambly thoughts on this: one way in which ontology identification could turn out to be like no-free-lunch theorems is that we actually just get the correct translation by default. I.e. in ELK report terminology, we train a reporter using the naive baseline and get the direct translator. This seems related to Alignment by default, and I think of them the same way (i.e. “this could happen, but it seems very scary to rely on without better arguments for why it should happen”). I’d say one reason we don’t think much about no-free-lunch theorems as a key obstacle to AI is that we’ve seen tons of cases where good generalization happens because the world is low entropy. I don’t think we’ve seen that kind of evidence for ontology identification not being a problem in practice. That said, “I think ontology identification will be easy, here’s why” is another valid response to the question from this post.
A related point would be “Should we think about ontology identification explicitly, or just work on other stuff and eventually solve it implicitly?” My first instinct is to directly tackle ontology identification, but I could see cases where a solution to ontology identification is actually easier to find from another lens. I do think though that that other lens will have to tackle a similarly difficult and central problem; just working on approaches that essentially assume away the ontology identification problem will very likely not lead to progress on ontology identification.
For examples, do you mean examples of thinking about ontology identification being useful to solve ontology identification, or examples of how a solution would be helpful for alignment?
I might not have exactly the kind of example you’re looking for, since I’d frame things a bit differently. So I’ll just try to say more about the question “why is it useful to explicitly think about ontology identification?”
One answer is that thinking explicitly about ontology identification can help you notice that there is a problem that you weren’t previously aware of. For example, I used to think that building extremely good models of human irrationality via cogsci for reward learning was probably not very tractable, but could at least lead to an outer alignment solution. I now think you’d also have to solve ontology identification, so I’m now very skeptical of that approach. As you point out in another comment, you could technically treat ontology identification as part of human irrationality (not sure if you’d call this the “usual/obvious way” in this setting?). But what you notice when separating out ontology identification is that if you have some way of solving the ontology identification part, you should probably just use that for ELK and skip the part where you model human irrationalities really well.
Another part of my answer is that ontology identification is not an obviously better frame for any single specific problem, but it can be used as a unifying frame to think about problems that would otherwise look quite different. So some examples of where ontology identification appears:
The ELK report setting: you want to give better informed preference comparisons
The case I mentioned above: you’ve done some cognitive science and are able to learn/write down human rewards in terms of the human ontology, but still need to translate them
You think that your semi-supervised model already has a good understanding of what human values/corrigibility/… are, and your plan is to retarget the search or to otherwise point an optimizer at this model’s understanding of human values. But you need to figure out where exactly in the AI human values are represented
To prevent your AI from becoming deceptive, you want to be able to tell whether it’s thinking certain types of thoughts (such as figuring out whether it could currently take over the world). This means you have to map AI thoughts into things we can understand
You want clear-cut criteria for deciding whether you’re interpreting some neuron correctly. This seems very similar to asking “How do we determine whether a given ontology translation is correct?” or “What does it even mean for an ontology translation to be ‘correct’?”
I think ontology identification is a very good framing for some of these even individually (e.g. getting better preference comparisons), and not so much for others (e.g. if you’re only thinking about avoiding deception, ontology identification might not be your first approach). But the interesting thing is that these problems seemed pretty different to me without the concept of ontology identification, but suddenly look closely related if we reframe them.
Thanks! Starting from the paper you linked, I also found this, which seems extremely related: https://arxiv.org/abs/2103.15758. Will look into those more.
Just for context, I’m usually assuming we already have a good AI model and just want to find the dashed arrow (but that doesn’t change things too much, I think). As for why this diagram doesn’t solve worst-case ELK, the ELK report contains a few paragraphs on that, but I also plan to write more about it soon.
Yep, the nice thing is that we can write down this commutative diagram in any category,[1] so if we want probabilities, we can just use the category with distributions as objects and Markov kernels as morphisms. I don’t think that’s too strict, but have admittedly thought less about that setting.
[1]: We do need some error metric to define “approximate” commutativity.
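One minimal way to make “approximate” commutativity precise (my own notation, not from the thread): equip each hom-set with a metric $d$ and require the two paths around the square to be close, e.g.

```latex
% Square with horizontal arrows f (top) and g (bottom),
% vertical arrows \alpha, \beta; commutativity up to \varepsilon:
d\bigl(\beta \circ f,\; g \circ \alpha\bigr) \le \varepsilon
```

In the Markov-kernel category, $d$ could for instance be the worst-case (over inputs) total variation distance between the two composite kernels.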
Thanks for the comments!
One can define deception as a type of distributional shift. [...]
I technically agree with what you’re saying here, but one of the implicit claims I’m trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.
This was an interesting read, especially the first section!
I’m confused by some aspects of the proposal in section 4, which makes it harder to say what would go wrong. As a starting point, what’s the training signal in the final step (RL training)? I think you’re assuming we have some outer-aligned reward signal, is that right? But then it seems like that reward signal would have to do the work of making sure that the AI only gets rewarded for following human instructions in a “good” way—I don’t think we just get that for free. As a silly example, if we rewarded the AI whenever it literally followed our commands, then even with this setup, it seems quite clear to me we’d at best get a literal-command-following AI, and not an AI that does what we actually want. (Not sure if you even meant to imply that the proposal solved that problem, or if this is purely about inner alignment).
The complexity regularizer should ensure the AI doesn’t develop some separate procedure for interpreting commands (which might end up crucially flawed/misaligned). Instead, it will use the same model of humans it uses to make predictions, and inaccuracies in it would equal inaccuracies in predictions, which would be purged by the SGD as it improves the AI’s capabilities.
Since this sounds to me like you are saying this proposal will automatically lead to commands being interpreted the way we mean them, I’ll say more on this specifically: the AI will presumably have not just a model of what humans actually want when they give commands (even assuming that’s one of the things it internally represents). It should just as easily be able to interpret commands literally using its existing world model (it’s something humans can do as well if we want to). So which of these you get would depend on the reward signal, I think.
For related reasons, I’m not even convinced you get something that’s inner-aligned in this proposal. It’s true that if everything works out the way you’re hoping, you won’t be starting with pre-existing inner-misaligned mesa objectives, you just have a pure predictive model and GPS. But then there are still lots of objectives that could be represented in terms of the existing predictive model that would all achieve high reward. I don’t quite follow why you think the objective we want would be especially likely—my sense is that even if “do what the human wants” is pretty simple to represent in the AI’s ontology, other objectives will be too (as one example, if the AI is already modeling the training process from the beginning of RL training, then “maximize the number in my reward register” might also be a very simple “connective tissue”).
Thanks for the interesting comments!
Briefly, I think Katja’s post provides good arguments for (1) “things will go fine given slow take-off”, but this post interprets it as arguing for (2) “things will go fine given AI never becomes dangerously capable”. I don’t think the arguments here do quite enough to refute claim (1), although I’m not sure they are meant to, given the scope (“we are not discussing”).
Yeah, I didn’t understand Katja’s post as arguing (1), otherwise we’d have said more about that. Section C contains reasons for slow take-off, but my crux is mainly how much slow takeoff really helps (most of the reasons I expect iterative design to fail for AI still apply, e.g. deception or “getting what we measure”). I didn’t really see arguments in Katja’s post for why slow takeoff means we’re fine.
We don’t necessarily need to reach some “safe and stable state”. X-risk can decrease over time rapidly enough that total x-risk over the lifespan of the universe is less than 1.
Agreed, and I think this is a weakness of our post. I have a sense that most of the arguments you could make using the “existentially secure state” framing could also be made more generally, but I haven’t figured out a framing I really like yet unfortunately.
EtA: I am still more concerned about “not enough samples to learn human preferences” than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven’t scrutinized it too much (but would be interested to discuss it cooperatively).
Would be interested in discussing this more at some point. Given your comment, I’d now guess I dismissed this too quickly and there are things I haven’t thought of. My spontaneous reasoning for being less concerned about this is something like “the better our models become (e.g. larger and larger pretrained models), the easier it should be to make them output things humans approve of”. An important aspect is also that this is the type of problem where it’s more obvious if things are going wrong (i.e. iterative design should work here—as long as we can tell the model isn’t aligned yet, it seems more plausible we can avoid deploying it).
Two responses:
For “something that is very difficult to achieve (i.e. all of humanity is currently unable to achieve it)”, I didn’t have in mind things like “cure a disease”. Humanity might currently not have a cure for a particular disease, but we’ve found many cures before. This seems like the kind of problem that might be solved even without AGI (e.g. AlphaFold already seems helpful, though I don’t know much about the exact process). Instead, think along the lines of “build working nanotech, and do it within 6 months” or “wake up these cryonics patients”, etc. These are things humanity might do at some point, but they’re clearly outside the scope of what we can currently do within a short timeframe. If you tell a human “build nanotech within 6 months”, they don’t solve it the expected way, they just fail. Admittedly, our post is pretty unclear where to draw the boundary, and in part that’s because it seems hard to tell where it is exactly. I would guess it’s below nanotech or cryonics (and lots of other examples) though.
It shouldn’t be surprising that humans mostly do things that aren’t completely unexpected from the perspective of other humans. We all roughly share a cognitive architecture and our values. Plans of the form “Take over the world so I can revive this cryonics patient” just sound crazy to us; after all, what’s the point of reviving them if that kills most other humans? If we could instill exactly the right sense of which plans are crazy into an AI, that seems like major progress in alignment! Until then, I don’t think we can make the conclusion from humans to AI that easily.
In case you weren’t aware, this is no longer true if the state space has “holes” (formally: if its first cohomology group is non-zero). For example, if the state space is the Euclidean plane without the origin, you can have a vector field on that space which has no curl but isn’t conservative (and thus is not the gradient of any utility function).
Why this might be relevant:
1. Maybe state spaces with holes actually occur, in which case removing the curl of the PVF wouldn’t always be sufficient to get a utility function
2. The fact that zero curl only captures the concept of transitivity for certain state spaces could be a hint that conservative vector fields are a better concept to think about here than irrotational ones (even if it turns out that we only care about simply connected state spaces in practice)
EDIT: an example of an irrotational 2D vector field which is not conservative is $v(x,y) = \left(\frac{-y}{x^2+y^2},\ \frac{x}{x^2+y^2}\right)$, defined for $(x,y) \in \mathbb{R}^2 \setminus \{0\}$.
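The example can be checked numerically (my own sketch; `v` and `curl` are ad-hoc helpers): the curl vanishes away from the origin, yet the circulation around the unit circle is $2\pi$, so $v$ cannot be the gradient of any utility function.

```python
import math

# v(x, y) = (-y/(x^2+y^2), x/(x^2+y^2)) on R^2 minus the origin.
def v(x, y):
    r2 = x * x + y * y
    return -y / r2, x / r2

def curl(x, y, h=1e-5):
    # Scalar curl in 2D: d v2/dx - d v1/dy, via central differences.
    dv2_dx = (v(x + h, y)[1] - v(x - h, y)[1]) / (2 * h)
    dv1_dy = (v(x, y + h)[0] - v(x, y - h)[0]) / (2 * h)
    return dv2_dx - dv1_dy

print(curl(1.0, 2.0))  # ~0: irrotational away from the origin

# Circulation around the unit circle, which encloses the hole at the origin:
n = 100_000
total = 0.0
for k in range(n):
    t = 2 * math.pi * k / n
    vx, vy = v(math.cos(t), math.sin(t))
    # Dot product with the tangent (-sin t, cos t), times the arc step.
    total += (vx * -math.sin(t) + vy * math.cos(t)) * (2 * math.pi / n)

print(total)  # ~2*pi: nonzero circulation, so v is not a gradient field
```

On the unit circle the integrand is identically 1, so the circulation is exactly $2\pi$ — Stokes' theorem fails to force it to zero precisely because the domain has a hole.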