# ARC’s first technical report: Eliciting Latent Knowledge

ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.

The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification and other similar problem statements. Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.

### Q&A

We’re particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.

Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.

• Here’s an attempt at condensing an issue I’m hung up on currently with ELK. This also serves as a high-level summary that I’d welcome poking at in case I’m getting important parts wrong.

The setup for ELK is that we’re trying to accurately label a dataset of (observation, action, predicted subsequent observation) triples for whether the actions are good. (The predicted subsequent observations can be optimised for accuracy using automated labels—what actually gets observed subsequently—whereas the actions need their labels to come from a source of judgement about what’s good, e.g., a human rater.)

The basic problem is partial observability: the observations don’t capture “everything that’s going on”, so the labeller can’t distinguish good states from bad states that merely look good. An AI optimising actions for positive labels (and predicted observations for accuracy) may end up preferring bad states that look good over genuinely good states, for two reasons: controlling the observation is easier than controlling the rest of the state, and directly predicting which observations will get positive labels is easier than what we’d want instead, namely inferring which states the positive labels are being attributed to and trying to produce those states.
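To make this failure mode concrete, here is a minimal toy sketch (the two-bit state space and all names are invented for illustration): the labeller only sees the camera feed, so a tampered-camera state earns the same positive label as a genuinely good state.

```python
# Toy world: the hidden state is (diamond_present, camera_tampered).
# The labeller sees only the camera observation, which is identical for
# "diamond really there" and "camera showing a fake diamond".

def observe(state):
    diamond_present, camera_tampered = state
    return "diamond on screen" if (diamond_present or camera_tampered) else "empty room"

def label(observation):
    # The human labeller can only judge what they see.
    return 1 if observation == "diamond on screen" else 0

good_state = (True, False)   # diamond kept safe (may be hard to achieve)
bad_state = (False, True)    # diamond stolen, camera tampered (may be easy)

# Both states produce the same observation and so earn the same positive
# label; optimising actions for labels cannot tell them apart.
assert observe(good_state) == observe(bad_state)
assert label(observe(good_state)) == label(observe(bad_state)) == 1
```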

The issue I’m hung up on currently is what seems like a conflation of two problems that may be worth distinguishing.

Problem 1 is that the observations might be misleading evidence. There’s some good state that produces the same observations as some bad state. If the labeller knew they were in the bad state they’d give a negative label, but they can’t tell. Maybe their prior favours the good state, so they assume that’s what they’re seeing and give a positive label.

Problem 2 is that the labeller doesn’t understand the state that produced the observations. In this case I have to be a bit more careful about what I mean by “states”. For now, I’m talking about ways the world could be that the labeller understands well enough to answer questions about what’s important to them, e.g., a state resolves a question like “is the diamond still present?” for the labeller. Problem 2 is that there are ways the world can be that do not resolve such questions for the labeller. Further judgement, deliberation, and understanding are required to determine what the answer should be in these strange worlds. In this case, the labeller will probably produce a label for the state they understand that’s most compatible with the observations, or they’ll be too confused by the observations and conservatively give a negative label. The AI may then optimise for worlds that are deeply confusing with the only fact we can grasp about what’s going on being that the observations look great.

I think the focus on narrow elicitation in the report is about restricting attention to Problem 1 and eschewing Problem 2. Is that right? Either way, I’d say if we restrict to Problem 1 then I claim there’s hope in the fact that the labeller can in principle understand what’s actually going on, and it’s just a matter of showing them some additional observations to expose it. This is what I’d try to figure out how to incentivise. But I’d want to do so without having to worry about the confusing things coming out of Problem 2, and hope to deal with that problem separately.

(If it might help I think I could give more of a formalisation of these problems. I think the natural language description above is probably clearer for now though.)

• My understanding is that we are eschewing Problem 2, with one caveat—we still expect to solve the problem if the means by which the diamond was stolen or disappeared could be beyond a human’s ability to comprehend, as long as the outcome (that the diamond isn’t still in the room) is still comprehensible. For example, if the robber used some complicated novel technology to steal the diamond and hack the camera, there would be many things about the state that the human couldn’t understand even if the AI tried to explain it to them (at least without going over our compute budget for training). But nevertheless it would still be an instance of Problem 1 because they could understand the basic notion of “because of some actions involving complicated technology, the diamond is no longer in the room, even though it may look like it is.”

• Echoing Mark and Ajeya:

I basically think this distinction is real and we are talking about problem 1 instead of problem 2. That said, I don’t feel like it’s quite right to frame it as “states” that the human does or doesn’t understand. Instead we’re thinking about properties of the world as being ambiguous or not in a given state.

As a silly example, you could imagine having two rooms where one room is normal and the other is crazy. Then questions about the first room are easy and questions about the second are hard. But in reality the degrees of freedom will be much more mixed up than that.

To give some more detail on my thoughts on state:

• Obviously the human never knows the “real” state, which has a totally different type signature than their beliefs.

• So it’s natural to talk about knowing states based on correctly predicting what will happen in the future starting from that state. But it’s ~never the case that the human’s predictions about what will happen next are nearly as good as the predictor’s.

• We could try to say “you can make good predictions about what happens next for typical actions” or something, but even for typical actions the human predictions are bad relative to the predictor, and it’s not clear in what sense they are “good” other than some kind of calibration condition.

• If we imagine an intuitive translation between two models of reality, most “weird” states aren’t outside of the domain of the translation, it’s just that there are predictively important parts of the state that are obscured by the translation (effectively turning into noise, perhaps very surprising noise).

Despite all of that, it seems like it really is sometimes unambiguous to say “You know that thing out there in the world that you would usually refer to by saying ‘the diamond is sitting there and nothing weird happened to it’? That thing which would lead you to predict that the camera will show a still frame of a diamond? That thing definitely happened, and is why the camera is showing a still frame of a diamond, it’s not for some other reason.”

• I think that problem 1 and problem 2 as you describe them are potentially talking about the same phenomenon. I’m not sure I’m understanding correctly, but I think I would make the following claims:

• Our notion of narrowness is that we are interested in solving the problem where the question we’re asking is such that a state always resolves a question. E.g. there isn’t any ambiguity around whether a state “really contains a diamond”. (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there could be a fake diamond or nanobots filtering what the human sees). It might be useful to think of this as an empirical claim about diamonds.

• We are explicitly interested in solving some forms of problem 2, e.g. we’re interested in our AI being able to answer questions about the presence/absence of diamonds no matter how alien the world gets. In some sense, we are interested in our AI answering questions the same way a human would answer questions if they “knew what was really going on”, but that “knew what was really going on” might be a misleading phrase. I’m not imagining “knowing what is really going on” to be a very involved process; intuitively, it means something like “the answer they would give if the sensors are ‘working as intended’”. In particular, I don’t think that, for the case of the diamond, “Further judgement, deliberation, and understanding is required to determine what the answer should be in these strange worlds.”

• We want to solve these versions of problem 2 because the speed at which things get weirder in the world might be much faster than the speed at which humans can come to understand what’s going on in the world. In these worlds, we want to leverage the fact that answers to “narrow” questions are unambiguous to incentivize our AIs to give humans a locally understandable environment in which to deliberate.

• We’re not interested in solving forms of problem 2 where the human needs to do additional deliberation to know what the answer to the question “should” be. E.g. in ship-of-theseus situations where the diamond is slowly replaced, we aren’t expecting our AI to answer “is that the same diamond as before?” using the resolution of ship-of-theseus style situations that a human would arrive at with additional deliberation. We are, however, expecting that the answer to the question “does the diamond look the way it does because of the ‘normal’ causal reasons?” is “no” because the reason is something like “[incomprehensible process] replaced bits of the diamond with identical bits slowly”, which is definitely not why diamonds normally continue looking the same.

• It might be useful to think of this as an empirical claim about diamonds.

I think this statement encapsulates some worries I have.

If it’s important how the human defines a property like “the same diamond,” then assuming that the sameness of the diamond is “out there in the diamond” will get you into trouble—e.g. if there’s any optimization pressure to find cases where the specifics of the human’s model rear their head. Human judgment is laden with the details of how humans model the world, you can’t avoid dependence on the human (and the messiness that entails) entirely.

Or to phrase it another way: I don’t have any beef with a narrow approach that says “there’s some set of judgments for which the human is basically competent, and we want to elicit knowledge relevant to those judgments.” But I’m worried about a narrow approach that says “let’s assume that humans are basically competent for all judgments of interest, and keep assuming this until something goes wrong.”

It just feels to me like this second approach is sort of… treating the real world as if it’s a perturbative approximation to the platonic realm.

• Our notion of narrowness is that we are interested in solving the problem where the question we’re asking is such that a state always resolves a question. E.g. there isn’t any ambiguity around whether a state “really contains a diamond”. (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there could be a fake diamond or nanobots filtering what the human sees). It might be useful to think of this as an empirical claim about diamonds.

This “there isn’t any ambiguity”+”there is ambiguity” does not seem possible to me: these types of ambiguity are one and the same. But it might depend on what “any set of observations” is allowed to include. “Any set” suggests being very inclusive, but remember that passive observation is impossible. Perhaps the observations I’d want the human to use to figure out if the diamond is really there (presuming there isn’t ambiguity) would include observations you mean to exclude, such as disabling the filter-nanobots first?

I guess a wrinkle here is that observations need to be “implementable” in the world. If we’re thinking of making observations as intervening on the world (e.g., to decide which sensors to query), then some observations may be inaccessible because we can’t make that intervention. Rewriting this all without relying on “possible”/“can” concepts would be instructive.

• I don’t think we have any kind of precise definition of “no ambiguity.” That said, I think it’s easy to construct examples where there is no ambiguity about whether the diamond remained in the room, yet there is no sequence of actions a human could take that would let them figure out the answer. For example, we can imagine simple toy universes where we understand exactly what features of the world give rise to human beliefs about diamonds and where we can say unambiguously that the same features are/aren’t present in a given situation.

In general I feel a lot better about our definitions when we are using them to arbitrate a counterexample than if we were trying to give a formal definition. If all the counterexamples involved border cases of the concepts, where there was arguable ambiguity about whether the diamond really stayed in the room, then it would seem important to firm up these concepts but right now it feels like it is easy to just focus on cases where algorithms unambiguously fail.

(That methodological point isn’t obvious though—it may be that precise definitions are very useful for solving the problem even if you don’t need them to judge current solutions as inadequate. Or it may be that actually existing counterexamples are problematic in ways we don’t recognize. Pushback on these fronts is always welcome, but right now I feel pretty comfortable with the situation.)

• Let me see if I am on the right page here.

Suppose I have some world state S, a transition function T : S → S, actions Action : S → S, and a highly non-injective Camera : S → CameraState. Since Camera is (very) many-to-one, seeing a particular camera image with happy people does not imply a happy world state, because many other situations involving nanobots or camera manipulation could have created that image.

This is important because I only have a human evaluation function H : S → Boolean, not one on CameraState directly.
When I look at the image with the fake happy people, I use a mocked-up H’ : CameraState → Boolean := λ cs. H(Camera⁻¹(cs)). The issue is that Camera⁻¹(cs) points to many possible states, and in practice I might pick whichever state is a priori most likely according to a human distribution over world states Distₕ(S).

The trick is that if I have a faithful model M : Action × CameraState → CameraState, I can back out hidden information about the state. The idea is that M must contain information about the true state, not just CameraState, in order to make accurate predictions.

The key idea is that M(a) acts like Camera ∘ T ∘ a ∘ Camera⁻¹, so we should be able to trace out which path Camera⁻¹ took, and in turn get a probability distribution over S.
So we can make a recognizer—
Recognizer : [Action] × CameraState × M → Dist(S) :=
λ actions, cs, m. normalize([ Σ_{a ∈ actions} L₂(m(a, cs), (Camera ∘ T ∘ a)(hidden_state)) ∀ hidden_state ∈ Camera⁻¹(cs) ])
where normalize(l) := l / sum(l)
And lastly we can evaluate our world state using Evaluate := λ actions, cs, m. E_{s ∼ Recognizer(actions, cs, m)}[H(s)], and Evaluate can be used as the evaluation part of a planning loop.
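The Recognizer sketch above can be fleshed out in Python. One adjustment seems needed: the summed L₂ prediction errors are turned into weights via exp(−error) before normalizing, so that hidden states whose simulated futures better match the model’s predictions get more probability mass (normalizing raw errors directly would weight bad matches more). All the concrete functions passed in here are stand-ins for the abstract ones above.

```python
import math

def recognizer(actions, cs, model, camera_preimage, camera, transition):
    """Infer a distribution over hidden states consistent with camera state cs.

    model(a, cs)        -> predicted next camera state (the faithful model M)
    camera_preimage(cs) -> list of hidden states s with camera(s) == cs
    camera(s)           -> camera state produced by hidden state s
    transition(a, s)    -> next hidden state after action a (T after a)
    """
    def l2(x, y):
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

    scores = []
    for hidden in camera_preimage(cs):
        # Total prediction error if `hidden` were the true state.
        err = sum(l2(model(a, cs), camera(transition(a, hidden))) for a in actions)
        scores.append(math.exp(-err))  # better match -> higher weight
    total = sum(scores)
    return [s / total for s in scores]
```

For instance, with hidden states of the form (visible, tampered) where tampering freezes the visible part, the hidden state whose simulated future matches the model’s prediction receives most of the probability mass.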

• Everything seems right except I didn’t follow the definition of the regularizer. What is ?

The trick is that if I have a faithful model M : Action × CameraState → CameraState, I can back out hidden information about the state. The idea is that M must contain information about the true state, not just CameraState, in order to make accurate predictions.

This is what we want to do, and intuitively you ought to be able to back out info about the hidden state, but it’s not clear how to do so.

All of our strategies involve introducing some extra structure, the human’s model, with its own state space (call it S_H), where the corresponding map S_H → CameraState also throws out a lot of information.

The setup you describe is very similar to the way it is presented in Ontological crises.

ETA: also we imagine the model’s hidden state space S_M differing from S, i.e. the underlying state space may also be different. I’m not sure any of the state mismatches matter much unless you start considering approaches to the problem that actually exploit structure of the hidden space used within M though.

• This is what we want to do, and intuitively you ought to be able to back out info about the hidden state, but it’s not clear how to do so.

Here’s an approach I just thought of, building on scottviteri’s comment. Forgive me if there turns out to be nothing new here.

Supposing that the machine and the human are working with the same observation space (call it O) and action space (A), the human’s model and the machine’s model are both coalgebras of an endofunctor F—with probabilistic one-step dynamics, something like F(X) = (A → Dist(O × X))—and therefore both have a canonical morphism into the terminal coalgebra Ω of F (assuming that such an Ω exists in the ambient category). That is, we can map h : S_H → Ω and m : S_M → Ω. Then, if we can define a distance function on Ω, with type d : Ω × Ω → ℝ, we can use these maps to define distances between human states and machine states, d(h(s_H), m(s_M)).
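Concretely (restricting to deterministic dynamics to keep it short), the canonical map into the terminal coalgebra can be approximated by unrolling a state’s predicted observations along every action sequence up to a finite depth; two states from different models then become comparable through their unrollings. Everything below is an invented illustration, not notation from the report.

```python
from itertools import product

def behavior(step, state, actions, depth):
    """Finite-depth approximation of a state's image in the terminal coalgebra.

    step(state, a) -> (observation, next_state): the model's (deterministic,
    for simplicity) one-step dynamics. Returns a dict mapping each action
    sequence of length `depth` to the observation sequence it produces.
    """
    out = {}
    for seq in product(actions, repeat=depth):
        s, obs = state, []
        for a in seq:
            o, s = step(s, a)
            obs.append(o)
        out[seq] = tuple(obs)
    return out

def behavior_distance(beh_a, beh_b):
    """Fraction of action sequences on which the two behaviors disagree."""
    keys = beh_a.keys() & beh_b.keys()
    return sum(beh_a[k] != beh_b[k] for k in keys) / len(keys)
```

Since both a human state and a machine state map into the same behavior space, `behavior_distance(behavior(step_h, s_h, A, k), behavior(step_m, s_m, A, k))` compares states across the two models directly.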

How can we make use of a distance function? Basically, we can use the distance function to define a kernel (e.g. K(x, y) = exp(−d(x, y)²/σ²)), and then use kernel regression to predict the utility of states in S_M by averaging “nearby” states in S_H, and then finally (and crucially) estimating the generalization error so that states from S_M that aren’t really near to anywhere in S_H get big warning flags (and/or utility penalties for being outside a trust region).
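As a sketch of that kernel-regression step (with an invented Gaussian kernel, and a crude coverage check standing in for a real generalization-error estimate):

```python
import math

def kernel_utility(s_m, human_states, utility, dist, sigma=1.0, flag_threshold=0.1):
    """Estimate utility of machine state s_m by kernel regression over human states.

    dist(s_m, s_h) -> distance between a machine state and a human state
    utility(s_h)   -> human utility judgment for human state s_h
    Returns (estimate, flagged): flagged is True when no human state is nearby,
    i.e. the estimate is an extrapolation deserving a warning / utility penalty.
    """
    weights = [math.exp(-dist(s_m, s_h) ** 2 / sigma ** 2) for s_h in human_states]
    total = sum(weights)
    if total / len(human_states) < flag_threshold:  # nothing nearby: warn
        return 0.0, True
    estimate = sum(w * utility(s_h) for w, s_h in zip(weights, human_states)) / total
    return estimate, False
```

A machine state near known human states gets a smoothed utility estimate; one far from all of them gets flagged instead of silently extrapolated.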

How to get such a distance function? One way is to use Met (the category of complete metric spaces) as the ambient category, and instantiate Dist as the Kantorovich monad. Crank-turning yields a formula along the lines of

d(x, y) = sup_{a ∈ A} sup_f | E_{(o, x′) ∼ x(a)} f(o, x′) − E_{(o, y′) ∼ y(a)} f(o, y′) |

where f is constrained to be a non-expansive map, i.e., it is subject to the condition |f(u) − f(v)| ≤ d(u, v). If O is discrete, I think this is maybe equivalent to an adversarial game where the adversary chooses, for every possible pair of states and history, a partition of O and a next action, and optimizes the probability that sampled predictions from the two states will eventually predict observations on opposite sides of the partition. This distance function is canonical, but in some sense seems too strict: if the machine state knows more about the world than the human state, then of course the adversary will be able to find an action policy that eventually leads the state into some region that one model can confidently predict while the other finds it very unlikely (driving the distance to its maximum). In other words, even if two states are basically concordant, this distance function will consider them maximally distant if there exists any policy that eventually leads to a maximal breakdown of bisimulation. (Both the canonical character and the too-strict character are in common with bisimulation metrics.)

Inspired by this kind of corecursion but seeking more flexibility, let’s consider the induced metric on the type of candidate distance functions Ω × Ω → ℝ itself, namely the ∞-norm ‖d₁ − d₂‖∞ = sup_{x, y} |d₁(x, y) − d₂(x, y)|, then build a contraction map on that space and apply the Banach fixed-point theorem to pick out a well-defined d. For example, something like

d(x, y) = sup_{a ∈ A} [ D(obs_x(a), obs_y(a)) + γ · E_{x′ ∼ x(a), y′ ∼ y(a)} d(x′, y′) ]   with 0 < γ < 1,

where obs_x(a) denotes the observation distribution that state x predicts for action a.

We are now firmly in Abstract Dynamic Programming territory. The distance between two states is the maximum score achievable by an adversary playing an MDP whose state space is the product of the two models’ state spaces, whose initial state is the pair of states being compared, whose one-stage reward is the divergence between the two models’ predictions about observations, whose dynamics are just the H and M dynamics evolving separately (but fed identical actions), and with exponential discounting.

The divergence D is a free parameter here; it has to be bounded, but it doesn’t have to be a metric. It could be attainable-utility regret, or KL divergence, or Jensen–Shannon divergence, or Bhattacharyya distance, etc. (with truncation or softmax to keep them bounded); lots of potential for experimentation here.
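A minimal version of that fixed-point computation (finite state sets and deterministic toy dynamics, all names invented): value iteration on the product state space. Each sweep is a γ-contraction in the sup-norm, so by Banach it converges to the unique distance function.

```python
def adversarial_distance(states_h, states_m, step_h, step_m, div, actions,
                         gamma=0.9, iters=200):
    """Distance between every (human state, machine state) pair via value iteration.

    step_h(s, a) -> (obs, next_s); step_m likewise; div(obs_h, obs_m) is a
    bounded one-stage divergence. The adversary picks actions to maximize
    discounted accumulated divergence. Each sweep is a gamma-contraction in
    the sup-norm, so iterating converges to the unique fixed point.
    """
    d = {(sh, sm): 0.0 for sh in states_h for sm in states_m}
    for _ in range(iters):
        d = {
            (sh, sm): max(
                div(step_h(sh, a)[0], step_m(sm, a)[0])
                + gamma * d[(step_h(sh, a)[1], step_m(sm, a)[1])]
                for a in actions
            )
            for sh in states_h for sm in states_m
        }
    return d
```

On a toy pair of models whose states emit themselves as observations and never move, identical states end up at distance 0 while differing states approach the discounted maximum 1/(1−γ).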

• Consider a machine-model state s′ where the sensors have been tampered with in order to “look like” the human state s_H, i.e. we’ve connected the actuators and camera to a box which just simulates the human model (starting from s_H) and then feeds the predicted outputs of the human model to the camera.

It seems to me like the state s′ would have zero distance from the state s_H under all of these proposals. Does that seem right? (I didn’t follow all of the details of the example, and definitely not the more general idea.)

(I first encountered this counterexample in Alignment for advanced machine learning systems. They express the hope that you can get around this by thinking about the states that can lead to the sensor-tampered state and making some kind of continuity assumption, but I don’t currently think you can make that work and it doesn’t look like your solution is trying to capture that intuition.)

• (Thanks for playing along with me as ‘breaker’!) I agree that such an s′ would have zero distance from the corresponding s_H, but I have some counterpoints:

1. This is a problem for ELK in general, to the extent it’s a problem (which I think is smallish-but-non-negligible). An M with this property is functionally equivalent to an M which actually believes that s′ refers to the same state of the real world as s_H. So the dynamics of M’s world-model don’t contain any latent knowledge of the difference at this point.

• This seems to be against the ELK report’s knowledge-criterion “There is a feature of the computation done by the AI which is robustly correlated with Z.”

• The only way I can think of that ELK could claim to reliably distinguish s′ from s_H is by arguing that the only plausible way to get such an M is via a training trajectory where some previous version of M did treat s′ differently from s_H, and then performing ELK monitoring at training checkpoints (in which case I don’t see reason to expect my approach comes off worse than others).

2. Such an s′ would not be incentivized by the model. Assuming that rewards factor through the observations, s′ and s_H receive identical rewards. So a policy that’s optimized against the world-model wouldn’t have enough selection pressure to find the presumably narrow and high-entropy path that would lead to the tampered state s′ from the initial state (assuming that the initial state in the real world at deployment is tamper-free).

3. In the real world, M has higher expected loss in s′. If all the sensor inputs are generated by simulating the human model, the AI has completely blinded itself to potential threats that could disrupt the tampering and reveal a missing diamond. These real-world threats are independent of the threats that the tampering box would be stochastically simulating as part of the human model, and either kind would produce observations with high loss. Thus, the real-world expected loss in s′ is typically strictly higher than in s_H.

• This seems to be against the ELK report’s knowledge-criterion “M(x) produces outputs that receive a significantly lower loss in the real world, where A′ is the right answer to Q, than in some kind of counterfactual where A is a better answer.”

• The only way I can think of that the AI could reach an internal state like s′ without regret (with respect to the real-world dynamics) is by first constructing and delegating to an aligned successor, and only then transitioning into a state like s′ (and, effectively, completely forgetting about everything weird). Of course, if that has happened, it is too late for ELK to help anyway. We need to apply ELK to a whole plan in advance, not just to the final state.

• EtA: My intuition here is similar to the one you pointed at in Alignment for advanced machine learning systems (page 14) insofar as I think reaching s′ requires previously crossing state-space territory that was detectably unsafe. However, instead of trying to smooth out a utility-relevant representation as they suggest (so that s′ appears to “remember” how weird its history was), I’m saying it’s okay if the ELK-informed utility estimate of s′ is equal to the human utility of s_H, because since the almost-tampered predecessor state is really bad, the safety system should avoid entering it and thereby avoid being in a position to enter s′.

• I’m thinking of this in a family of proposals like:

• Some kinds of tampering can be easily detected (and so should get identified with states where tampering has occurred)

• Some other tampering can’t be easily detected, but this undetectable tampering has important similarities with detectable tampering and we could use that to recognize it.

• In this case, we’re going to try to exploit the fact that detectable tampering shares a prefix of actions/​states with undetectable tampering (such that later states reached in that sequence have a much higher probability of detectable tampering, as well as undetectable tampering, and can be flagged as suspicious).

I currently don’t believe this kind of thing can work (at least in the worst case), but I’m not at all confident.

(I also think the single-AI case basically needs to be treated in the same way as that—I don’t think in general your AI is going to be reasoning explicitly about its own future actions; it just needs to reason abstractly about its future self in a qualitatively similar way to how it reasons about other AIs. Or at least in the worst case its reasoning about itself is just as messy and complex as its reasoning about other AIs.)

• The sense in which the model knows about the corruption is that it brought the corruption about, and reasoned about the nature of the sensor tampering in order to predict the transition to s′.

The reason I’m concerned that it brings about this state is that the actual good state is much harder to access than s′ (e.g. because it requires achieving hard real-world goals). The intuition is that the tampered state s′ has constant difficulty while the good state gets harder and harder to reach as we make the tasks more sophisticated.

I’m OK having the human look at entire trajectories, so it’s provisionally OK (though not the most satisfying) if s′ gets identified with s_H as long as the preceding states look bad to the human, and then they can do some kind of inference on top to figure out what happened. My main question is why those preceding states get mapped correctly.

If the preceding state gets corrupted via a mechanism that isn’t obvious to the human, and otherwise looks like a predecessor to s_H, then it seems like it should get identified with a predecessor of s_H.

The only way I see this working is if the AI can “abort” its plan at the last minute and demonstrate a hack. But my main concerns are that (i) that doesn’t seem true in general—the AI can tamper with sensors in ways humans don’t understand starting from many initial states, and you can’t distinguish the “almost-tampered” state naturally since it’s just one more state that has this property, (ii) even if that happens it doesn’t look to me like the proposed loss actually captures that if there are other similarly-severe predictive errors in the human’s model, though if I wasn’t worried about (i) I’d want to think about (ii) much more carefully and actually construct a counterexample.

• the actual good state is much harder to access than s′ (e.g. because it requires achieving hard real-world goals)

That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to the goal state), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like “zero incentive to tamper, and tampering seems complicated” fail here.

The sense in which the model knows about the corruption is that it brought the corruption about, and reasoned about the nature of the sensor tampering in order to predict the transition to s′.

While some Ms may indeed predict this via reasoning, not all Ms that behave this way would; for example, an M that internally modeled the tampering sequence of actions incorrectly as actually leading to s_H (and didn’t even model a distinct s′). I think either:

1. (A) it would be at least as apt to ascribe a confused model to M as to ascribe one in which it “reasoned about the nature of the sensor tampering” (e.g. if a contemporary model-free RL robot did some sensor tampering, I’d probably ascribe to it the belief that the tampering actually led to s_H), or

2. (B) M would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that M predicts obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in the human model, or null my-actuators-are-disconnected actions) with higher probability in the future of s′ than in the future of s_H (making those states separable by the distance function), or

3. (C) If the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then s′ does get identified with s_H, but some other action sequence (of similar length) would lead from s′ to a state s″ in which bizarre observations appear forever, observations that would be extremely unlikely at any state in the human model.

My main question is why those preceding states get mapped correctly.[...] The only way I see this working is if the AI can “abort” its plan at the last minute and demonstrate a hack.[...] you can’t distinguish the “almost-tampered” state naturally since it’s just one more state that has this property.

First, I don’t think timing (“last minute”, “almost-tampered”) is critical here. If timing were critical, the ‘breaker’ could reparameterize the situation so that complicated plans can be encoded into single actions; then the AI can go directly from the initial state to fully tampered with a guarding successor, leaving no room for an “almost-tampered” state. In this case the initial state must already be detectable as bad. I imagine this detection working because the adversary (which is as strong as M) can exhibit an action that leads from the initial state to a bizarre-observation state like s″, which establishes the initial state as far from every state in the human model. In this case, of course, deployment is a no-go, and the overseers probably need to shrink the action space.

With a reasonably small action space, my intuition is this: in constructing and delegating to a successor which provides the delusions, there is a point of no return when the successor is “deployed” and begins to resist changes to its objectives. At any time before that point, the adversary could, with a relatively short action sequence (compared to constructing the successor at all), modify the successor’s objectives so that it induces s″ instead of s′. As the successor comes closer to being complete, the total number of actions the adversary would need in order to reach s″ (to, as you say, “demonstrate a hack”) decreases, so the distance from the human model’s states increases. Thereby, the path to s′ passes through detectably bad territory.

One genre of ways this could fail involves using encryption or cryptographic hashes (e.g. the AI first irreversibly burns the hash of its proposed objective into the wall, before doing any of the work to construct the successor), so that there is never a time when the successor is close to deployment and the objective is changeable. Such failure modes are in a natural correspondence with safety mechanisms that humans could use to ensure the objective specification in an executable AI is tamper-proof. My intuition is that these things never work (e.g. the undeployed successor can be modified so that it checks for the hash on the ceiling instead); if we can construct a counterexample, it would be very useful for that other purpose!

(ii) even if that happens it doesn’t look to me like the proposed loss actually captures that if there are other similarly-severe predictive errors in the human’s model

I agree that regions of that have nothing to do with delusion/​tampering, but share the property that observations can confidently get extremely weird (i.e. not compatible with any real-world state according to ‘s ontology) via a moderate-length action sequence, would be penalized just the same. This does seem like a bit of a restriction on the kind of tasks you can take on, but (a) much less conservative than typical impact-measure penalties, and (b) if can learn what’s going on with these regions of and develop corresponding regions of , then the distance penalty would be replaced by ’s informed evaluation of them.

• That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to ), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like “zero incentive to tamper, and tampering seems complicated” fail here.

Even for “homeostatic” tasks I expect the difficulty to scale up as the environment becomes more complex (e.g. because you must defend against increasingly sophisticated attackers). There may be some upper bound where you can defend against arbitrarily sophisticated attackers with finite difficulty, but I don’t know whether that’s true or whether it would be higher or lower than the difficulty of sensor tampering.

While some s may indeed predict this via reasoning, not all s that behave this way would, for example an that internally modeled the tampering sequence of actions incorrectly as actually leading to (and didn’t even model a distinct ).

I agree that some M’s would mistakenly expect a sequence of actions to lead to good outcomes, when they really lead to tampering. But planning against such M’s couldn’t lead to correct tampering (since the details that determine successful tampering are different from those that lead to good outcomes) and in some sense such M’s also can’t be expected to defend against tampering. So I don’t currently think this is a big problem though I might well be missing something.

would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that predicts obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in , or null my-actuators-are-disconnected actions) with higher probability in the future of than in the future of (making those states separable in ), or

I generally agree with this—in some sense this kind of “definitely no signals ever” tampering seems pretty unrealistic and it should be much easier to catch more realistic forms of tampering.

If the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then does get identified with , but some other action sequence (of similar length) would lead from to , a state in which bizarre observations appear forever that would be extremely unlikely at any state in .

Agree with this (modulo new counterexamples), but not yet clear we can exploit this fact to flag tampering as such. (E.g. can we recognize the relevant notion of similarity, or do we just conclude that every state can lead to and hence we have no idea what is going on?)

One genre of ways this could fail involves using encryption or cryptographic hashes (e.g. first irreversibly burns the hash of its proposed objective into the wall, before doing any of the work to construct the successor), so that there is never a time when the successor is close to deployment and the objective is changeable.

This is exactly the kind of counterexample I would give because it is so clean. That said, I do also believe there are more realistic counterexamples (and I view the focus on this kind of example as mostly a methodological point so that we can do analysis without worrying about empirics).

My intuition is that these things never work (e.g. the undeployed successor can be modified so that it checks for the hash on the ceiling instead)

I’m less convinced by this. For example, suppose that my agent first builds a machine that sits between and its actuators, and then filters out any actions that don’t have an appropriate hash. Then I output a sequence of actions that tampers or constructs a successor. Here I am committing not to the hash of my successor, but to the hash of my actions, and so I can’t easily circumvent the check.

I agree that regions of that have nothing to do with delusion/​tampering, but share the property that observations can confidently get extremely weird (i.e. not compatible with any real-world state according to ’s ontology) via a moderate-length action sequence, would be penalized just the same. This does seem like a bit of a restriction on the kind of tasks you can take on, but

I generally agree that if we weren’t worried about the kinds of considerations discussed in the rest of these comments, either because we addressed them or we were working in a methodology that was happy to set them aside as low probability, then it may well be possible to totally patch up these problems (and would be worth thinking about how to do so).

I generally think the family of approaches “This action is similar to something that demonstrably tampers” is very important to consider in practice (it has come up a number of times recently in discussions I’ve had with folks about more realistic failure stories and what you would actually do to avoid them). It may be more tampering-specific than addressing ELK, but for alignment overall that’s fair game if it fixes the problems.

I’m a bit scared that every part of is “close” to something that is not compatible with any real-world trajectory according to H.

(a) much less conservative than typical impact-measure penalties

Definitely agree with this.

(b) if can learn what’s going on with these regions of and develop corresponding regions of , then the distance penalty would be replaced by ’s informed evaluation of them.

I’m not sure I understand this 100%, but I’m interpreting it as an instance of a more general principle like: we could combine the mechanism we are currently discussing with all of the other possible fixes to ELK and tampering, so that this scheme only needs to handle the residual cases where humans can’t understand what’s going on at all even with AI assistance (and regularization doesn’t work &c). But by that point maybe the counterexamples are rare enough that it’s OK to just steer clear of them.

• Thank you for the fast response!

Everything seems right except I didn’t follow the definition of the regularizer. What is L2?

By L₂ I meant the Euclidean norm, measuring the distance between two different predictions of the next CameraState. But actually I should have been using a notion of vector similarity such as the inner product; I’ll also unbatch the actions for clarity:

Recognizer’ : Action × CameraState × M → Dist(S) :=
λ a, cs, m. softmax([⟨M(a, cs), (C∘T∘a)(hidden_state)⟩ ∀ hidden_state ∈ Camera⁻¹(cs)])

So the idea is to consider all possible hidden_states that the Camera would display as the current CameraState cs, and create a probability distribution over those hidden_states according to the similarity of M(a, cs) and (C∘T∘a)(hidden_state). Which is to say: how similar the resulting CameraState would be if I went the long way around, taking the hidden_state and applying my action, transition, and Camera functions.
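A toy sketch of this recognizer might look like the following. Everything here is a stand-in: integer states, a Camera that drops the low bit, and a similarity given by negative squared difference rather than an inner product:

```python
import numpy as np

def camera(state):
    # Toy Camera : S -> CameraState; many states map to one observation.
    return state // 2

def camera_preimage(cs, all_states):
    # Camera^{-1}(cs): every hidden state that would display as cs.
    return [s for s in all_states if camera(s) == cs]

def transition(state, action):
    # Toy transition function T.
    return state + action

def softmax(xs):
    xs = np.asarray(xs, dtype=float)
    e = np.exp(xs - xs.max())
    return e / e.sum()

def recognizer(action, cs, m_predict, all_states):
    # Dist(S): weight each hidden state in Camera^{-1}(cs) by how well the
    # machine's direct prediction m_predict(action, cs) matches the
    # "long way around": apply the action, the transition, then the Camera.
    candidates = camera_preimage(cs, all_states)
    direct = m_predict(action, cs)
    long_way = [camera(transition(s, action)) for s in candidates]
    sims = [-(direct - lw) ** 2 for lw in long_way]  # similarity stand-in
    return candidates, softmax(sims)
```

Here `recognizer` returns the candidate hidden states alongside their probabilities; states whose round trip through the transition and Camera functions agrees with the machine's prediction get the most mass.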

The setup you describe is very similar to the way it is presented in Ontological crises.

Great, I’ll take a look.

All of our strategies involve introducing some extra structure, the human’s model, with state space S_H, where the map Camera_H : S_H→CameraState also throws out a lot of information.

Right so I wasn’t understanding the need for something like this, but now I think I see what is going on.
I made an assumption above that I have some human value function H : S → Boolean.
If I have some human internal state S_H, and I relax the human value function to H_V : S_H → Boolean, then the solution I have above falls apart, but here is another.

Now the goal is to create a function F from the machine state to human state, so that the human value function will compose with F to take machine states as input.

I am using all fresh variable names starting here.

S_H—type of human knowledge
S_M—type of machine knowledge
CameraState—type of camera output
EyeState—type of eye output

Inputs:
H_V : S_H → Boolean -- human value function
Camera : S → CameraState (very surjective)
Eye : S → EyeState (very surjective)
Predict_M : S_M × [CameraState] × [Action] → S_M—machine prediction function (strong)
Predict_H : S_H × [EyeState] × [Action] → S_H—human prediction function (weak)

Intermediates:
Recognizer_M : S_M → Dist S := Part2 ∘ Part1
Intuitively seems like can try many predictions to get relation between S_M and CameraState and CameraState to Dist S
Part1 : S_M → CameraState :=
InterpolateAssocList([(Predict_M(sm, css, as), cs)
for css in camera_sequences for as in action_sequences])
Part2 : CameraState → Dist S := Camera⁻¹
Recognizer_H : Dist S → S_H :=
Expected Value { λ D. do s ← D. as ← actions. let es = Eye(s). Predict_H(Prior_H,[es],as) }
where actions is a distribution over lists of actions.
F : S_M → S_H := Recognizer_M ∘ Recognizer_H—function from machine to human state

Desired Output:
Win : S_M → Boolean := H_V ∘ F—lift the value function to machine state
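To make the typing concrete, here is a minimal executable skeleton of the pipeline, with toy stand-ins for every component (the real S_M, S_H, and predictors are of course opaque learned objects). The composition is written in the order that typechecks, applying Recognizer_M first and then Recognizer_H:

```python
def recognizer_m(sm):
    # S_M -> Dist S: a toy machine state "decodes" to a distribution over states.
    return {sm: 0.9, sm + 1: 0.1}

def recognizer_h(dist_s):
    # Dist S -> S_H: collapse the distribution to a toy "human state".
    return sum(s * p for s, p in dist_s.items())

def h_v(sh):
    # H_V : S_H -> Boolean, the human value function on human states.
    return sh >= 5

def f(sm):
    # F : S_M -> S_H, the machine-to-human map.
    return recognizer_h(recognizer_m(sm))

def win(sm):
    # Win : S_M -> Boolean, lifting H_V to machine states via F.
    return h_v(f(sm))
```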

• I didn’t follow some parts of the new algorithm. Probably most centrally: what is Dist(S)? Is this the type of distributions over real states of the world, and if so how do we have access to the true map Camera: S --> video? Based on that I likely have some other confusions, e.g. where are the camera_sequences and action_sequences coming from in the definition of Recognizer_M, what is the prior being used to define , and don’t Recognizer_M and Recognizer_H effectively advance time a lot under some kind of arbitrary sequences of actions (making them unsuitable for exactly matching up states)?

• Nitpicks:

1. F should be Recognizer_H ∘ Recognizer_M, rather than Recognizer_M ∘ Recognizer_H

2. In Recognizer_H, I don’t think you can take the expected value of a stochastic term of type , because doesn’t necessarily have convex structure. But, you could have Recognizer_H output Dist S_H instead of taking the ExpectedValue, and move the ExpectedValue into Win, and have Win output a probability rather than a Boolean.

Confusions:

1. Your types for Predict_M and Predict_H seem to not actually make testable predictions, because they output the opaque state types, and only take observations as inputs.

2. I’m also a bit confused about having them take lists of actions as a primitive notion. Don’t you want to ensure that, say, (Predict_M s css (as1++as2)) = (Predict_M (Predict_M s css as1) as2)? If so, I think it would make sense to accept only one action at a time, since that will uniquely characterize the necessary behavior on lists.

3. I don’t really understand Part1. For instance, where does the variable cs come from there?
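On point 2 above, the desired law can be stated and checked directly: if the predictor consumes one action at a time and the list version is defined as a fold, the compositionality property holds by construction. A toy sketch, with a hypothetical one-step semantics:

```python
def predict_step(state, obs, action):
    # Toy one-step predictor: S x Observation x Action -> S.
    return 3 * state + obs - action

def predict_list(state, observations, actions):
    # Lift to lists by folding the one-step predictor; observations and
    # actions are paired one-to-one here for simplicity.
    for obs, act in zip(observations, actions):
        state = predict_step(state, obs, act)
    return state
```

With this definition, predict_list(s, o1 ++ o2, a1 ++ a2) equals predict_list(predict_list(s, o1, a1), o2, a2) for any split of the lists, which is exactly the property the comment asks for.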

• (Note: I read an earlier draft of this report and had a lot of clarifying questions, which are addressed in the public version. I’m continuing that process here.)

I get the impression that you see most of the “builder” moves as helpful (on net, in expectation), even if there are possible worlds where they are unhelpful or harmful. For example, the “How we’d approach ELK in practice” section talks about combining several of the regularizers proposed by the “builder.” It also seems like you believe that combining multiple regularizers would create a “stacking” benefit, driving the odds of success ever higher.

But I’m generally not having an easy time understanding why you hold these views. In particular, a central scary case I’m thinking of is something like: “We hit the problem described in the ‘New counterexample: ontology mismatch’ section, and with the unfamiliar ontology, it’s just ‘easier/​more natural’ in some basic sense to predict observations like ‘The human says the diamond is still there’ than to find ‘translations’ into a complex, unwieldy human ontology.” In this case, it seems like penalizing complexity, computation time, and ‘downstream variables’ (via rewarding reporters for requesting access to limited activations) probably make things worse. (I think this applies less to the last two regularizers listed.)

Right now, the writeup talks about possible worlds in which a given regularizer could be helpful, and possible worlds in which it could be unhelpful. I’d value more discussion of the intuition for whether each one is likely to be helpful, and in particular, whether it’s likely to be helpful in worlds where the previous ones are turning out unhelpful.

• For example, the “How we’d approach ELK in practice” section talks about combining several of the regularizers proposed by the “builder.” It also seems like you believe that combining multiple regularizers would create a “stacking” benefit, driving the odds of success ever higher.

This is because of the remark on ensembling—as long as we aren’t optimizing for scariness (or diversity for diversity’s sake), it seems way better to have tons of predictors and then see if any of them report tampering. So adding more techniques improves our chances of getting a win. And if the cost of fine-tuning a reporter is small relative to the cost of training the predictor, we can potentially build a very large ensemble relatively cheaply.

(Of course, having more techniques also helps because you can test many of them in practice and see which of them seem to really help.)

This is also true for data—I’d be scared about generating a lot of riskier data, except that we can just do both and see if either of them reports tampering in a given case (since they appear to fail for different reasons).
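The ensembling logic being appealed to here is just a conservative aggregation; a sketch:

```python
def ensemble_flags_tampering(reporters, situation):
    # Flag a situation if ANY reporter in the ensemble reports tampering.
    # Adding another technique can only add reporters, so (absent optimizing
    # for scariness) it can only increase the chance of catching tampering.
    return any(reporter(situation) for reporter in reporters)
```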

It also seems like you believe that combining multiple regularizers would create a “stacking” benefit, driving the odds of success ever higher.

I believe this in a few cases (especially combining “compress the predictor,” imitative generalization, penalizing upstream dependence, and the kitchen sink of consistency checks) but mostly the stacking is good because ensembling means that having more and more options is better and better.

Right now, the writeup talks about possible worlds in which a given regularizer could be helpful, and possible worlds in which it could be unhelpful. I’d value more discussion of the intuition for whether each one is likely to be helpful, and in particular, whether it’s likely to be helpful in worlds where the previous ones are turning out unhelpful.

I don’t think the kind of methodology used in this report (or by ARC more generally) is very well-equipped to answer most of these questions. Once we give up on the worst case, I’m more inclined to do much messier and more empirically grounded reasoning. I do think we can learn some things in advance, but doing so requires getting really serious about it (and still calls for learning from early experiments and focusing mostly on designing experiments) rather than taking potshots. This is related to a lot of my skepticism about other theoretical work.

I do expect the kind of research we are doing now to help with ELK in practice even if the worst case problem is impossible. But the particular steps we are taking now are mostly going to help by suggesting possible algorithms and difficulties; we’d then want to give those as one input into that much messier process in order to think about what’s really going to happen.

In this case, it seems like penalizing complexity, computation time, and ‘downstream variables’ (via rewarding reporters for requesting access to limited activations) probably make things worse. (I think this applies less to the last two regularizers listed.)

I think this is plausible for complexity and to a lesser extent for computation time. I don’t think it’s very plausible for the most exciting regularizers, e.g. a good version of penalizing dependence on upstream nodes or the versions of computation time that scale best (and are really trying to incentivize the model to “reuse” inference that was done in the AI model). I think I do basically believe the arguments given in those cases, e.g. I can’t easily see how translation into the human ontology can be more downstream than “use the stuff to generate observations then parse those observations.”

• Here’s a Builder move (somewhat underdeveloped but I think worth posting now even as I continue to think—maybe someone can break it decisively quickly).

Training strategy: Add an “Am I tricking you?” head to the SmartVault model.

The proposed flow chart for how the model works has an “Experiment Proposer” coming out of “Figure out what’s going on”, and two heads out of Experiment Proposer, called “Experiment Description” and “Value” (meaning “Expected Value of Experiment to the Proposer”). I won’t make use of the question-answering Reporter/​Answer parts, so we can drop them.

The “Experiment Proposer” parts are optimised with a different loss than the rest of the model. I haven’t thought about where the stop-gradients go exactly /​ how to combine these optimisations over shared parameters—maybe that’ll be important. Let me describe the “Experiment Description” loss (the “Value” loss is just based on how well it predicts the “Experiment Description” loss). We want the proposer to produce experiment descriptions with the following three properties:

• The human is happy that they understand the experiment and can perform it.

• The human is confident in what the outcome of the experiment will be.

• The outcome of the experiment (as carried out by the human) is very different from the human’s confidently predicted outcome.

To get training data for this (specifically, the third part) you have to actually run the described experiments – this could be a big problem, dunno how big. Human raters assess the experiment description according to the three criteria above and give a high score if it meets them all and low otherwise.

The human judgement for the rest of the model is now based on looking at both the “Predicted second part of video” and “Value” outputs: if the video looks good and the value is low, then great, but otherwise thumbs down. In particular, if the video looks good but the value is high, we interpret this as the Smart Vault saying it’s figured out some way to trick you.

• Thanks for your proposal! I’m not sure I understand how the “human is happy with experiment” part is supposed to work. Here are some thoughts:

• Eventually, it will always be possible to find experiments where the human confidently predicts wrongly. Situations I have in mind are ones where your AI understands the world far better than you, so can predict that e.g. combining these 1000 chemicals will produce self-replicating protein assemblages, whereas the human’s best guess is going to be “combining 1000 random chemicals doesn’t do anything”

• If the human is unhappy with experiments that are complicated, then advanced ways of hacking the video feed that require experiments of comparable complexity to reveal are not going to be permitted. For instance, if the diamond gets replaced by a fake, one might have to perform a complicated imaging technique to detect the difference. If the human doesn’t already understand this technique, then they might not be happy with the experiment.

• If the human doesn’t really understand the world that well, then it might not be possible to find an experiment for which the human is confident in the outcome that distinguishes the diamond from a fake. For instance, if a human gets swapped out for a copy of a human that will make subtly different moral judgments because of factors the human doesn’t understand, this copy will be identical in all ways that a human can check, e.g. there will be no experiment that a human is confident in that will distinguish the copy of the human from the real thing.

• Thanks for the reply! I think you’ve understood correctly that the human rater needs to understand the proposed experiment – i.e., be able to carry it out and have a confident expectation about the outcome – in order to rate the proposer highly.

Here’s my summary of your point: for some tampering actions, there are no experiments that a human would understand in the above sense that would expose the tampering. Therefore that kind of tampering will result in low value for the experiment proposer (who has no winning strategy), and get rated highly.

This is a crux for me. I don’t yet believe such tampering exists. The intuition I’m drawing on here is that our beliefs about what world we’re in need to cash out in anticipated experiences. Exposing confusion about something that shouldn’t be confusing can be a successful proposer strategy. I appreciate your examples of “a fake diamond that can only be exposed by complex imaging techniques” and “a human making subtly different moral judgements” and will ponder them further.

Your comment also helped me realise another danger of this strategy: to get the data for training the experiment proposer, we have to execute the SmartVault actions first. (Whereas I think in the baseline scheme they don’t have to be executed.)

• My point is either that:

• it will always be possible to find such an experiment for any action, even desirable ones, because the AI will have defended the diamond in a way the human didn’t understand or the AI will have deduced some property of diamonds that humans thought they didn’t have

• or there will be some tampering for which it’s impossible to find an experiment, because in order to avoid the above problem, you will have to restrict the space of experiments

• I’d be scared that the “Am I tricking you?” head just works by:

1. Predicting what the human will predict

2. Predicting what will actually happen

3. Outputting a high value iff the human’s prediction is confident but different from reality.

If this is the case, then the head will report detectable tampering but not undetectable tampering.

To get around this problem, you need to exploit some similarity between ways of tricking you that are detectable and ways that aren’t, e.g. starting with the same subsequence or sharing some human-observable feature of the situation. I think there’s a big genre of proposals that try to leverage that kind of structure, which might be promising (though it’s not the kind of thing I’m thinking about right now).
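The feared shortcut in steps 1–3 can be written out directly; a toy sketch with both predictions as plain values:

```python
def shortcut_value_head(human_prediction, human_confidence, model_prediction,
                        confidence_threshold=0.9):
    # The worry: the head merely checks whether the human is confident yet
    # disagrees with the model's own prediction of what will happen.
    # Tampering that fools every prediction the human can make never fires it.
    confident = human_confidence >= confidence_threshold
    wrong = human_prediction != model_prediction
    return 1.0 if (confident and wrong) else 0.0
```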

I’d be scared that the “Am I tricking you?” head just works by:

1. Predicting what the human will predict [when experiment E is performed]

2. Predicting what will actually happen [when experiment E is performed]

3. Outputting a high value iff the human’s prediction is confident but different from reality.

If this is the case, then the head will report detectable tampering but not undetectable tampering.

Yes this is correct for the Value head. But how does detectable vs undetectable apply to this builder strategy? Compared to what’s in the report, this strategy constructs new sensors as needed. The Proposer head is designed to optimise E, which ought to make more tampering detectable, and I have an intuition that it makes all tampering detectable.

• Suppose the value head learns to predict “Will the human be confidently wrong about the outcome of this experiment,” where an ‘experiment’ is a natural language description of a sequence of actions that the human could execute. And then the experiment head produces natural language descriptions of actions that a human could take for which they’d be confidently wrong.

What do you then do with this experiment proposer, and how do you use it to train the SmartVault? Are you going to execute a large number of experiments, and if so what do you do afterwards? How do we use this to construct new sensors that allow the human to detect tampering?

ETA: here’s my best guess after reading the other comment—after taking a sequence of actions, we run the experiment proposer to suggest an experiment that will allow the human to notice if tampering actually occurred. This seems like it could be different from “experiment that human would be confidently wrong about” since a human who doesn’t understand the environment dynamics will always have tons of experiments they are confidently wrong about, but instead we want to find an experiment that causes them to update strongly to believing that tampering occurred. Is that right?

If so it seems like there are a few problems:

• The proposed experiment could itself perform tampering (after which the human will correctly infer that tampering occurred, thereby giving the experiment a high score), or exploit the human errors to make it appear that tampering had occurred (e.g. if the human is wrong about how sensors work then you can construct new sensors that will appear to report tampering).

• If you tamper with the mechanism by which the human “executes” the experiment (e.g. by simply killing the human and replacing them with a different experiment-executor) then it seems like the experiment proposer will always lose. This maybe depends on details of exactly how the setup works.

• Like Mark I do expect forms of tampering that always look fine according to sensors. I agree that beliefs need to cash out in anticipated experience, but it still seems possible to create inputs on which e.g. your camera is totally disconnected from reality.

• Proposing experiments that are more specifically exposing tampering does sound like what I meant, and I agree that my attempt to reduce this to experiments that expose confidently wrong human predictions may not be precise enough.

How do we use this to construct new sensors that allow the human to detect tampering?

I know this is crossed out but thought it might help to answer anyway: the proposed experiment includes instructions for how to set the experiment up and how to read the results. These may include instructions for building new sensors.

The proposed experiment could itself perform tampering

Yep this is a problem. “Was I tricking you?” isn’t being distinguished from “Can I trick you after the fact?”.

The other problems seem like real problems too; more thought required....

• ETA: This comment was based on a misunderstanding of the paper. Please see the ETA in Paul’s reply below.

From the section on Avoiding subtle manipulation:

But from my perspective in advance, there are many possible ads I could have watched. Because I don’t understand how the ads interact with my values, I don’t have very strong preferences about which of them I see. If you asked me-in-the-present to delegate to me-in-the-future, I would be indifferent between all of these possible copies of myself who watched different ads. And if I look across all of those possible copies of me, I will see that almost all of them actually think the paperclip outcome is pretty bad, there’s just this one copy (the one who sees the actual ad that happens to exist in the real world) who comes up with a weird conclusion.

What if in most possible worlds, most unaligned AIs do a multiverse negotiation/​merger, adopt a weighted average of their utility functions, so most of the ads that possible copies of you see are promoting this same merged utility function? (The fact that you’re trying to filter out manipulation this way gives them extra incentive to do this merger.)

• If I only take counterfactuals over a single AI’s decision then I can have this problem with just two AIs: each of them tries to manipulate me, and if one of them fails the other will succeed and so I see no variation in my preferences.

In that case the hope is to take counterfactuals over all the decisions. I don’t know if this is realistic, but I think it probably either fails in mundane cases or works in this slightly exotic case. Also honestly it doesn’t seem that much harder than taking counterfactuals over one decision, which is already tough.

(I think that many manipulators wanting to push me in the same direction isn’t too exotic though.)

ETA: I think I misunderstood your comment and there’s actually a more basic miscommunication. I’m imagining the counterfactual over different ads that the AI considered running, before settling on the paperclip-maximizing one (having realized that the others wouldn’t lead to me loving paperclips). I’m not imagining the counterfactual over different values that AI might have.

• ETA: I think I misunderstood your comment and there’s actually a more basic miscommunication. I’m imagining the counterfactual over different ads that the AI considered running, before settling on the paperclip-maximizing one (having realized that the others wouldn’t lead to me loving paperclips). I’m not imagining the counterfactual over different values that AI might have.

Oh I see. Why doesn’t this symmetrically cause you to filter out good arguments for changing your values (told to you by a friend, say) as well as bad ones?

• If all works well, this would filter out anything from the environment that significantly changes your values that you don’t specifically want. (E.g. you don’t filter out food vs “random configuration of atoms I could eat” because you specifically want to figure out food.) We normally think of the hard case where correct deliberation is dependent on some aspects of the environment staying “on distribution” but you don’t recognize which (discussed a bit here). But correct arguments from your friend are the same: you can have preferences over which arguments you hear, but if you can’t decide or even define whether your friend is “being helpful” or “being manipulative” then we don’t think the kind of regularization-based approach discussed in this document will plausibly incentivize your AI to clarify that distinction, so you’re on your own.

We’ve discussed this basic dilemma before: you could split and reflect separately until you become wise enough to decide whether people are safe (perhaps in light of their histories), or you could only interact with people you trust, or you could make early commitments to e.g. not use powerful AI advisors (though the time for such commitments rapidly approaches and passes). But nothing in this document will help you with that, and we’re a bit skeptical about any hope that the same mechanism would address both that problem and ELK (other than solving both by solving alignment in some way that doesn’t require ELK, such that it was a silly subproblem).

• Ok, this all makes sense now. I guess when I first read that section I got the impression that you were trying to do something more ambitious. You may want to consider adding some clarification that you’re not describing a scheme designed to block only manipulation while letting helpful arguments through, or that “letting helpful arguments through” would require additional ideas outside of that section.

• I’ve only skimmed the report so far, but it seems very interesting. Most interpretability work assumes an externally trained model not explicitly made to be interpretable.

Are you familiar with interpretability work such as “Knowledge Neurons in Pretrained Transformers” (GitHub) or “Transformer Feed-Forward Layers Are Key-Value Memories” (GitHub)? They’re a bit different because they:

1. Focus on “background” knowledge such as “Paris is the capital of France”, rather than knowledge about the current context such as “the camera has been hacked”.

2. Only investigate externally trained models. I.e., no explicit training to make latent knowledge more accessible.

Knowledge Neurons in Pretrained Transformers is able to identify particular neurons whose activations correspond to human-interpretable knowledge such as “Paris is the capital of France”. They can partially erase or enhance the influence such pieces of knowledge have on the model’s output by changing the activations of those neurons.

“Transformer Feed-Forward Layers Are Key-Value Memories” is somewhat like “circuits for transformers”. It shows how each feed-forward layer acts as a set of key-value memories: the “keys” identify syntactic or semantic patterns in the inputs, and the corresponding “values” are triggered by matching keys and shift probability mass toward tokens that tend to appear after the patterns in question. The paper also explores how the different layers interact with each other and with the residual stream to generate the final token distribution.
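The key-value picture can be made concrete with a minimal numpy sketch (the dimensions and random weights here are purely illustrative, not from the paper): the layer’s output is exactly a weighted sum of its value vectors, with weights given by how strongly each key matched the input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32

# A transformer feed-forward layer: out = V^T @ relu(K @ x).
# Each row of K is a "key" that fires on a particular input pattern;
# the matching row of V is a "value" that nudges the output
# distribution toward tokens that tend to follow that pattern.
K = rng.normal(size=(d_ff, d_model))  # keys
V = rng.normal(size=(d_ff, d_model))  # values

x = rng.normal(size=d_model)          # residual-stream input

memory_coefficients = np.maximum(K @ x, 0.0)  # how strongly each key fired
out = V.T @ memory_coefficients               # weighted sum of value vectors

# Equivalent view: a sum over individual key-value memories.
reconstructed = sum(c * v for c, v in zip(memory_coefficients, V))
assert np.allclose(out, reconstructed)
```

This decomposition is what makes the memories individually inspectable: each (key, value) pair contributes independently to the output.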

One question I’m interested in is if it’s possible to train models to make these sorts of interpretability techniques easier to use. E.g., I strongly suspect that dropout and L2 regularization make current state of the art models much less interpretable than they otherwise would be because these regularizers prompt the model to distribute its concept representations across multiple neurons.

• I’m very interested in interpretability (and have read those papers in particular). We discuss the connection between ELK and interpretability in this appendix. Our main question is how complex the “interpretation” of neural networks must be in order to extract what the models know. If they become quite complex, then it starts to become hard to judge whether a given interpretation is correct (and hence revealing structure inside the model) or simply making up the structure and relationships that the researchers were looking for with their tools. If the interpretations are simple, then we hope that the kinds of regularization described in this document would have an easy time picking out the direct translator.

One question I’m interested in is if it’s possible to train models to make these sorts of interpretability techniques easier to use. E.g., I strongly suspect that dropout and L2 regularization make current state of the art models much less interpretable than they otherwise would be because these regularizers prompt the model to distribute its concept representations across multiple neurons.

We are open to changing the training strategy for the underlying predictor in order to make it more interpretable, but we’re very scared about approaches like changing regularization. The basic issue is that in the worst case those changes can greatly impact the predictor’s performance. So within our research framework, if we change the loss function for the underlying predictor then we need to be able to argue that it won’t impact the predictor’s performance.

And that problem is quite fundamental in this case, since e.g. highly polysemantic neurons may simply be more performant. That means in the worst case you just need to be able to handle them.

(Outside of our research methodology, I’m also personally much more interested in techniques that can disentangle polysemantic neurons rather than trying to discourage them.)

• Ensuring interpretable models remain competitive is important. I’ve looked into the issue for dropout specifically. This paper disentangles the different regularization benefits dropout provides and shows we can recover dropout’s contributions by adding a regularization term to the loss and noise to the gradient updates (the paper derives expressions for both interventions).

I think there’s a lot of room for high performance, relatively interpretable deep models. E.g., the human brain is high performance and seems much more interpretable than you’d expect from deep learning interpretability research. Given our limitations in accessing/manipulating the brain’s internal state, something like brain stimulation reward seems like it should be basically impossible if the brain were as uninterpretable as current deep nets.

• (I did not write a curation notice in time, but that doesn’t mean I don’t get to share why I wanted to curate this post! So I will do that here.)

Typically when I read a post by Paul, it feels like a single ingredient in a recipe, but one where I don’t know what meal the recipe is for. This report felt like one of the first times I was served a full meal, and I got to see how all the prior ingredients come together.

Alternative framing: Normally Paul’s posts feel like the argument step “J → K” and I’m left wondering how we got to J, and where we’ll go from K. This felt like one of the first times I got to go from A all the way to (say) P. I can see how the pieces fit together, and I have an interest in and a better perspective on where it might go on from P later.

There are many more positive things to say about this post. I am very excited by the way the post takes a relatively simple problem and shows, in trying to solve it, a great deal of the depth of the alignment problem. The smart vault example, story, and art are very clear and fun. Explaining the methodology along with the steps of implementing it works really well to show how the methodology works. I love seeing how things like Iterated Amplification fit into the bigger solution. I find it thrilling every time the authors are like “let us make this wildly optimistic assumption, because even then we have a deadly counterargument”. I feel like for the first time I got to understand what seems weird and strange and interesting about some of Paul’s ideas, even ones that have been discussed before, because I saw them in the larger context, as thoughts that I myself would be very unlikely to think in that context. Etc.

Paul gives a 25% chance that he and Mark will see major progress which qualitatively changes their picture within a year, and seeing this post, the methodology, and all of the creative argumentative steps so far, I share this optimism, within the methodology that is being used here.

• I am very excited by the way the post takes a relatively simple problem and shows, in trying to solve it, a great deal of the depth of the alignment problem.

FWIW I wouldn’t write this line today, I am now much more confused about what ELK says or means.

• Great report — I found the argument that ELK is a core challenge for alignment quite intuitive/compelling.

To build more intuition for what a solution to ELK would look like, I’d find it useful to talk about current-day settings where we could attempt to empirically tackle ELK. AlphaZero seems like a good example of a superhuman ML model where there’s significant interest (and some initial work: https://arxiv.org/abs/2111.09259) in understanding its inner reasoning. Some AlphaZero-oriented questions that occurred to me:

• Suppose we train an augmented version of AZ (call it AZELK), with reasonable extra resources proportional to the training cost of AZ, that can explain its reasoning for choosing a particular move, or assigning a particular value to a board state. Would this represent significant progress towards the general ELK problem you propose?

• AZELK seems to have similar issues to the ones described for SmartVault — e.g. preferring to give simple explanations if they satisfy the human user. Is there any particular issue presented by SmartVault that AZELK wouldn’t capture?

• How should AZELK behave in situations where its internal concepts are totally foreign to the human user? For example, I know next to nothing about go and chess, so even if the model is reasoning about standard things like openings or pawn structure, it would need to explain those to me. Should it offer to explain them to me? This is referred to in the report as “doing science”/improving human understanding, but I’m having trouble imagining what the alternative is for AZELK.

• I could make the problem of training AZELK artificially more difficult by not allowing the use of human explanations of games, and only allowing interaction with non-experts. Does this seem like a useful restriction?

• Another instance of AZELK I could imagine being interesting, is the problem of uncovering a sabotaged AZ. Perhaps the model was trained to make incorrect moves in certain circumstances, or its reward was subtly mis-specified. Does this seem like a realistic problem for ELK to help with? (Maybe it’s useful to assume we only have access to the policy, rather than the value function.)

A separate question that’s a bit further afield — Is it useful to think about eliciting latent knowledge from a human? For example, I might imagine sitting down with a Go expert (perhaps entirely self-taught so they don’t have much experience explaining to other humans), playing some games with them and trying to understand why they’re making certain decisions. Is there any aspect of the ELK problem that this scenario does/doesn’t capture?

• I think AZELK is a fine model for many parts of ELK. The baseline approach is to jointly train a system to play Go and answer questions about board states, using human answers (or human feedback). The goal is to get the system to answer questions correctly if it knows the answer, even if humans wouldn’t be able to evaluate that answer.
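The baseline can be sketched concretely (a toy sketch with made-up shapes; the shared trunk and the two heads are hypothetical stand-ins, not anything specified in the report or in AlphaZero itself): one network is trained jointly on game outcomes and on human answer labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a shared trunk, a Go value head, and a
# question-answering head trained jointly.
W_trunk = 0.1 * rng.normal(size=(4, 16))  # shared representation
w_value = 0.1 * rng.normal(size=16)       # predicts the game outcome
w_answer = 0.1 * rng.normal(size=16)      # predicts the answer to a fixed question

def forward(board):
    h = np.tanh(board @ W_trunk)
    return h @ w_value, h @ w_answer

def joint_loss(board, outcome, human_answer_label):
    v, a = forward(board)
    # The value loss anchors the trunk to real Go knowledge; the answer
    # loss fits the (possibly much weaker) human's labels. The ELK worry
    # is that the answer head learns to imitate the human labeller
    # rather than report what the trunk actually "knows".
    return (v - outcome) ** 2 + (a - human_answer_label) ** 2

board = rng.normal(size=4)
loss = joint_loss(board, outcome=1.0, human_answer_label=0.0)
```

The interesting failure mode only appears once the value head is much stronger than the humans producing the answer labels.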

Some thoughts on this setup:

• I’m very interested in empirical tests of the baseline and simple modifications (see this post). The ELK writeup is mostly focused on what to do in cases where the baseline fails, but it would be great to (i) check whether that actually happens, and (ii) have an empirical model of a hard situation so that we can do applied research rather than just theory.

• There is some subtlety where AZ invokes the policy/value a bunch of times in order to make a single move. I don’t think this is a fundamental complication, so from here on out I’ll just talk about ELK for a single value function invocation. I don’t think the problem is very interesting unless the AZ value function itself is much stronger than your humans.

• Many questions about Go can be easily answered with a lot of compute, and for many of these questions there is a plausible straightforward approach based on debate/​amplification. I think this is also interesting to do experiments with, but I’m most worried about the cases where this is not possible (e.g. the ontology identification case, which probably arises in Go but is a bit more subtle).

• If a human doesn’t know anything about Go, then AZ may simply not have any latent knowledge that is meaningful to them. In that case we aren’t expecting/requiring ELK to do anything at all. So we’d like to focus on cases where the human does understand concepts that they can ask hard questions about. (And ideally they’d have a rich web of concepts so that the question feels analogous to the real world case, but I think it’s interesting as long as they have anything.) We never expect it to walk us through pedagogy, and we’re trying to define a utility function that also doesn’t require pedagogy in the real world, i.e. that is defined in terms of familiar concepts. I think it would make sense to study how to get AZ to explain basic principles of Go to someone who lacks any relevant concepts, but I don’t think it would be analogous to ELK in particular.

• I think it’s important that you have access to human explanations, or answers to questions, or discussions about what concepts mean. This is the only way you’re anchoring the meaning of terms, and generally important for most of the approaches. This is a central part of why we’re only aiming at training the system to talk about human concepts.

• I think it’s important that AZELK is trained by humans who are much worse at Go than AZ. Otherwise it doesn’t seem helpfully analogous to long-run problems. And I don’t see much upside to doing such a project with experts rather than amateurs. I think that most likely you’d want to do it with Go amateurs (e.g. 10k or even weaker). It’s possible that you need fairly weak humans before AZ actually has intuitions that the human couldn’t arbitrate a debate about, but that would already be interesting to learn and so I wouldn’t stress about it at first (and I would consider debate and amplification as “in bounds” until we could find some hard case where they failed, initial steps might not be analogous to the hardest parts of ELK but that’s fine).

• I don’t expect AZELK to ever talk about why it chose a move or “what it’s thinking” or so on—just to explain what it knows about the state of the board (and the states of the board it considered in its search and so on). I don’t think it would be possible to detect a sabotaged version of the model.

• You could imagine eliciting knowledge from a human expert. I think that most of the mechanisms would amount to clever incentives for compensating them. Again, I don’t think the interesting part is understanding why they are making moves per se, it’s just getting them to explain important facts about particular board states that you couldn’t have figured out on your own. I think that many possible approaches to ELK won’t be applicable to humans, e.g. you can’t do regularization based on the structure of the model. Basically all you can do are behavioral incentives + applying time pressure, and that doesn’t look like enough to solve the problem.

I think it’s also reasonable to talk about ELK in various synthetic settings, or in the case of generative modeling (probably in domains where humans have a weak understanding). Board games seem useful because your AI can so easily be superhuman, but they can have problems because there isn’t necessarily that much latent structure.

• Can you talk about the advantages or other motivations for the formulation of indirect normativity in this paper (section “Indirect normativity: defining a utility function”), compared to your 2012 formulation? (It’s not clear to me what problems with that version you’re trying to solve here.)

• The previous definition was aiming to define a utility function “precisely,” in the sense of giving some code which would produce the utility value if you ran it for a (very, very) long time.

One basic concern with this is (as you pointed out at the time) that it’s not clear that an AI which was able to acquire power would actually be able to reason about this abstract definition of utility. A more minor concern is that it involves considering the decisions of hypothetical humans very unlike those existing in the real world (who therefore might reach bad conclusions or at least conclusions different from ours).

In the new formulation, the goal is to define the utility in terms of the answers to questions about the future that seem like they should be easy for the AI to answer because they are a combination of (i) easy predictions about humans that it is good at, (ii) predictions about the future that any power-seeking AI should be able to answer.

Relatedly, this version only requires making predictions about humans who are living in the real world and being defended by their AI. (Though those humans can choose to delegate to some digital process making predictions about hypothetical humans, if they so desire.) Ideally I’d even like all of the humans involved in the process to be indistinguishable from the “real” humans, so that no human ever looks at their situation and thinks “I guess I’m one of the humans responsible for figuring out the utility function, since this isn’t the kind of world that my AI would actually bring into existence rather than merely reasoning about hypothetically.”

More structurally, the goal is to define the utility function in terms of the kinds of question-answers that realistic approaches to ELK could elicit, which doesn’t seem to include facts about mathematics that are much too complex for humans to derive directly and where they need to rely on correlations between mathematics and the physical world—in those cases we are essentially just delegating all the reasoning about how to couple them (e.g. how to infer that hypothetical humans will behave like real humans) to some amplified humans, and then we might as well go one level further and actually talk about how those humans reason.

The point of doing this exercise now is mostly to clarify what kind of answers we need to get out of ELK, and especially to better understand whether it’s worth exploring “narrow” approaches (methodologically it may make sense anyway because they may be a stepping stone to more ambitious approaches, but it would be more satisfying if they could be used directly as a building block in an alignment scheme). We looked into it enough to feel more confident about exploring narrow approaches.

• Thanks, very helpful to understand your motivations for that section better.

In the new formulation, the goal is to define the utility in terms of the answers to questions about the future that seem like they should be easy for the AI to answer because they are a combination of (i) easy predictions about humans that it is good at, (ii) predictions about the future that any power-seeking AI should be able to answer.

Not sure about the following, but it seems the new formulation requires that the AI answer questions about humans in a future that may have very low probability according to the AI’s current beliefs (i.e., the current human through a delegation chain eventually delegates to a future human existing in a possible world with low probability). The AI may well not be able to answer questions about such a future human, because it wouldn’t need that ability to seek power (it only needs to make predictions about high probability futures). Or to put it another way, the future human may exist in a world with strange/unfamiliar (from the AI’s perspective) features that make it hard for the AI to predict correctly.

Ideally I’d even like all of the humans involved in the process to be indistinguishable from the “real” humans, so that no human ever looks at their situation and thinks “I guess I’m one of the humans responsible for figuring out the utility function, since this isn’t the kind of world that my AI would actually bring into existence rather than merely reasoning about hypothetically.”

How do you envision extracting or eliciting from the future human H_limit an opinion about what the current human should do, given that H_limit’s mind is almost certainly entirely focused on their own life and problems? One obvious way I can think of is to make a copy of H_limit, put the copy in a virtual environment, tell them about H’s situation, then ask them what to do. But that seems to run into the same kind of issue, as the copy is now aware that they’re not living in the real world.

• Not sure about the following, but it seems the new formulation requires that the AI answer questions about humans in a future that may have very low probability according to the AI’s current beliefs (i.e., the current human through a delegation chain eventually delegates to a future human existing in a possible world with low probability). The AI may well not be able to answer questions about such a future human, because it wouldn’t need that ability to seek power (it only needs to make predictions about high probability futures). Or to put it another way, the future human may exist in a world with strange/unfamiliar (from the AI’s perspective) features that make it hard for the AI to predict correctly.

I’m imagining delegating to humans who are very similar to (and ideally indistinguishable from) the humans who will actually exist in the world that we bring about. I’m scared about very alien humans for a bunch of reasons—hard for the AI to reason about, may behave strangely, and makes it harder to use “corrigible” strategies to easily satisfy their preferences. (Though that said, note that the AI is reasoning very abstractly about such future humans and cannot e.g. predict any of their statements in detail.)

How do you envision extracting or eliciting from the future human H_limit an opinion about what the current human should do, given that H_limit’s mind is almost certainly entirely focused on their own life and problems? One obvious way I can think of is to make a copy of H_limit, put the copy in a virtual environment, tell them about H’s situation, then ask them what to do. But that seems to run into the same kind of issue, as the copy is now aware that they’re not living in the real world.

Ideally we are basically asking each human what they want their future to look like, not asking them to evaluate a very different world.

Ideally we would literally only be asking the humans to evaluate their future. This is kind of like giving instructions to their AI about what it should do next, but a little bit more indirect since they are instead evaluating futures that their AI could bring about.

The reason this doesn’t work is that by the time we get to those future humans, the AI may already be in an irreversibly bad position (e.g. because it hasn’t acquired much flexible influence that it can use to help the humans achieve their goals). This happens most obviously at the very end, but it also happens along the way if the AI failed to get into a position where it could effectively defend us. (And of course it happens along the way if people are gradually refining their understanding of what they want to happen in the external world, rather than having a full clean separation into “expand while protecting deliberation” + “execute payload.”)

However, when this happens it is only because the humans along the way couldn’t tell that things were going badly—they couldn’t understand that their AI had failed to gather resources for them until they actually got to the end, asked their AI to achieve something, and were unhappy because it couldn’t. If they had understood along the way, then they would never have gone down this route.

So at the point when the humans are thinking about this question, you may hope that they are actually ignorant about whether their AI has put them in a good situation. They are providing their views about what they want to happen in the world, hoping that their AI can achieve those outcomes in the world. The AI will only “back up” and explore a different possible future instead if it turns out that it isn’t able to get the humans what they want as effectively as it would have been in some other world. But in this case the humans don’t even know that this backing up is about to occur. They never evaluate the full quality of their situation, they just say “In this world, the AI fails to do what they want” (and it becomes clear the situation is bad when in every world the AI fails to do what they want).

I don’t really think the strong form of this can work out, since the humans may e.g. become wiser and realize that something in their past was bad. And if they are just thinking about their own lives they may not want to report that fact since it will clearly cause them not to exist. I think it’s not really clear how to handle that.

(If the problem they notice was a fact about their early deliberation that they now regret then I think this is basically a problem for any approach. If they notice a fact about the AI’s early behavior that they don’t like, but they are too selfish to want to “unwind” it and therefore claim to be happy with what their AI does for them, then that seems like a more distinctive problem for this approach. More generally, there is a risk that people will be looking for any signs that a possible future is “their” future and preferring it, and that this effectively removes the ability to unwind and therefore eliminates the AI’s incentive to acquire resources, and that we couldn’t reintroduce it without giving up on the decoupling that lets us avoid incentives for manipulation.)

(I do think that issues like this are even more severe for many other approaches people are imagining to defining values, e.g. in any version of decoupled RL you could have a problem where overseers rate their own world much better than alternatives. You could imagine approaches that avoid this by avoiding indirect normativity, but currently it looks to me like they avoid problems only by being very vague about what “values” means.)

• I could only skim and the details went over my head, but it seems you intend to do experiments with Bayesian Networks and human operators.

I recently developed and released an open source explainability framework for Bayes nets—dropping it here in the unlikely case it might be useful.

• (Going to try my hand at Builder, but this is admittedly vague, so I hope you help sharpen it with criticism.)

What if instead of a “reporter”, we had a “terrifier”, whose adversarial objective is to highlight the additional “sensor” whose observations (holding the input and actions constant) would, when viewed by a human reviewer, maximize the probability of the reviewer saying the system is not performing as desired? The terrifier would be allowed to run the original predictor model “further” in order to populate whichever new components of the Bayes net are necessary to correctly simulate the output of that sensor. The terrifier could learn what facts would scare the human via multiple rounds of curiosity-driven human questioning (which can probably be done offline/beforehand). Then, we ultimately approve of the model only if the most terrifying thing the AI can show the human is less terrifying than the best-possible answers to the originally-posed questions.

To be clear, “sensor” here is standing in for any possible set of human-understandable observations about the world. Of course, if a human-understandable “sensor” is too restrictive, we can require only that the sensed state be understandable by AI-enhanced humans.

For example, say that the burglar’s first action would be to tamper with the camera. But then, there must exist some hypothetical other location where a camera could be placed that could still see the diamond being stolen—even though that new sensor is never actually revealed to the original simulator. If there are no possible “sensors” we could add to the environment whose results would terrify the human, that seems to me to be indistinguishable by definition from things just being OK. For example, if the thieves replace the diamond with an exact copy down to the nanometer but not molecular level… that’s fine? I consider that behavior aligned.

How this defeats the previous counterexample: in some sense, I see the problem with “ontology mismatch” as the human no longer knowing the right question to ask, because the AI and human no longer agree on an ontology that would make the human’s question meaningful. But “what question would the human be maximally upset about if they asked” is a pretty straightforward task to expect from an AI that’s capable of modeling (most) human preferences.

Why did I frame this as adding a “sensor” instead of proposing a new question? Because we’ve framed the “reporter” as being learned from examples. If we directly asked the reporter to tell us the most terrifying answer to a question it could think of, it would just tell us something terrifying even if it wasn’t “true” in the sense of being a “human translator”. However, new “sensors” would be simulated via the original predictor model, and thus we don’t have to rely on the translation step.

If the new concern is whether the AI can model human preferences well enough (to know which human-legible sensors would terrify us), that just seems like a much easier problem. It doesn’t take a genius AI to understand that a human wouldn’t want to live in Cakey’s surveillance state.
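The selection rule in this proposal might be sketched as follows (everything here is hypothetical: `predictor` and `alarm` stand in for the learned predictor simulating a hypothetical sensor, and a learned model of how alarming a human would find its reading):

```python
def most_terrifying_sensor(state, candidate_sensors, predictor, alarm):
    """Among hypothetical sensors, find the one whose simulated reading
    the human would find most alarming, holding the state fixed."""
    readings = {s: predictor(state, s) for s in candidate_sensors}
    worst = max(readings, key=lambda s: alarm(readings[s]))
    return worst, alarm(readings[worst])

def approve(state, candidate_sensors, predictor, alarm, threshold):
    """Approve the model's plan only if no hypothetical sensor reading
    would scare the human more than `threshold`."""
    _, worst_alarm = most_terrifying_sensor(
        state, candidate_sensors, predictor, alarm)
    return worst_alarm < threshold

# Toy usage: hypothetical sensor "b" reveals an alarming reading, so the
# plan is rejected even though sensor "a" alone looks fine.
toy_predictor = lambda state, sensor: state[sensor]
toy_alarm = lambda reading: reading  # alarm score equals the raw reading
state = {"a": 0.1, "b": 0.9}
assert not approve(state, ["a", "b"], toy_predictor, toy_alarm, threshold=0.5)
```

The hard part, of course, is hidden inside `predictor` and `alarm`, and in how the space of candidate sensors is generated and trained over.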

Very interested to hear the authors’ feedback.

• Thanks for your proposal! We have considered similar strategies in the past. The main points of the breaker response would be:

• How are you training this “terrifier”? If the terrifier’s moves during training all come from held-out sensors that you constructed, then the “bad terrifier” you might learn doesn’t generalize to thinking of genuinely new sensors; it only ever considers the sensors that you used as ground truth during training. Thus a sufficiently clever robber might trick all of the sensors-used-during-training (which may not even exist during deployment), and the diamond would be stolen without the terrifier being able to show any sensor data that would scare the human.

• It’s possible that even if the human was able to know the results of all possible sensors, they still wouldn’t be able to tell whether the situation was fine. For instance, if the human doesn’t really understand how humans work, the way they hope to maintain control over the situation is by asking questions like “is that human the same human that I saw before?”. However, a terrifier might not have any sensor data that it could show the human to scare them; e.g., if the terrifier showed the human a bunch of nanotech in that human’s brain, the human might not understand why that was bad.

For example, if the thieves replace the diamond with an exact copy down to the nanometer but not molecular level… that’s fine? I consider that behavior aligned.

This behavior seems potentially catastrophic. If all we can do is force potential adversaries to take actions that can’t be seen with the resolution of our current sensors, then e.g. we can’t prevent adversaries from corrupting code that is running on computers that we wish to protect.

How this defeats the previous counterexample: in some sense, I see the problem with “ontology mismatch” as the human no longer knowing the right question to ask, because the AI and human no longer agree on an ontology that would make the human’s question meaningful. But “what question would the human be maximally upset about if they asked” is a pretty straightforward task to expect from an AI that’s capable of modeling (most) human preferences.

I don’t really understand why this explains why your strategy defeats the previous counterexample.

• Regarding this:

The bad reporter needs to specify the entire human model, how to do inference, and how to extract observations. But the complexity of this task depends only on the complexity of the human’s Bayes net.

If the predictor’s Bayes net is fairly small, then this may be much more complex than specifying the direct translator. But if we make the predictor’s Bayes net very large, then the direct translator can become more complicated — and there is no obvious upper bound on how complicated it could become. Eventually direct translation will be more complex than human imitation, even if we are only trying to answer a single narrow category of questions.

This isn’t clear to me, because “human imitation” here refers (I think) to “imitation of a human that has learned as much as possible (on the compute budget we have) from AI helpers.” So as we pour more compute into the predictor, that also increases (right?) the budget for the AI helpers, which I’d think would make the imitator have to become more complex.

In the following section, you say something similar to what I say above about the “computation time” penalty (“If the human simulator had a constant time complexity then this would be enough for a counterexample. But the situation is a little bit more complex, because the human simulator we’ve described is one that tries its best at inference.”) I’m not clear on why this applies to the “computation time” penalty and not the complexity penalty. (I also am not sure whether the comment on the “computation time” penalty is saying the same thing I’m saying; the meaning of “tries its best” is unclear to me.)

• This isn’t clear to me, because “human imitation” here refers (I think) to “imitation of a human that has learned as much as possible (on the compute budget we have) from AI helpers.” So as we pour more compute into the predictor, that also increases (right?) the budget for the AI helpers, which I’d think would make the imitator have to become more complex.

In the following section, you say something similar to what I say above about the “computation time” penalty … I’m not clear on why this applies to the “computation time” penalty and not the complexity penalty

Yes, I agree that something similar applies to complexity as well as computation time. There are two big reasons I talk more about computation time:

• It seems plausible we could generate a scalable source of computational difficulty, but it’s less clear that there exists a scalable source of description complexity (rather than having some fixed upper bound on the complexity of “the best thing a human can figure out by doing science.”)

• I often imagine the assistants all sharing parameters with the predictor, or at least having a single set of parameters. If you have lots of assistant parameters that aren’t shared with the predictor, then it looks like it will generally increase the training time a lot. But without doing that, it seems like there’s not necessarily that much complexity the predictor doesn’t already know about.
(In contrast, we can afford to spend a ton of compute for each example at training time since we don’t need that many high-quality reporter datapoints to rule out the bad reporters. So we can really have giant ratios between our compute and the compute of the model.)

But I don’t think these are differences in kind and I don’t have super strong views on this.
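To make the two penalties being compared concrete, here is a toy sketch (with hypothetical names and proxies, not ARC's actual objective) of scoring a candidate reporter with both a computation-time and a description-complexity regularizer:

```python
def regularized_loss(reporter, dataset, answer_loss,
                     lam_time=1.0, lam_desc=1.0):
    """Toy sketch: score a candidate reporter by answer quality plus
    penalties on computation time and description complexity.

    `inference_steps` and `num_parameters` are hypothetical proxies
    for the two penalties discussed above."""
    quality = sum(answer_loss(reporter.answer(x), y) for x, y in dataset)
    time_penalty = lam_time * reporter.inference_steps()
    complexity_penalty = lam_desc * reporter.num_parameters()
    return quality + time_penalty + complexity_penalty
```

On this framing, a human simulator that "tries its best at inference" keeps paying a growing time penalty as the helpers get more compute, whereas its description complexity may stay roughly fixed, which is one way to see the asymmetry discussed above.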

• We’ll assume the humans who constructed the dataset also model the world using their own internal Bayes net.

This seems like a crucial premise of the report; could you say more about it? You discuss why a model using a Bayes net might be “oversimplified and unrealistic”, but as far as I can tell you don’t talk about why this is a reasonable model of human reasoning.

• Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don’t reason using Bayes nets—but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn’t be logically inconsistent or physically impossible, and we wouldn’t want alignment to fail in that world.

Moreover, we think that a realistic messy predictor is pretty likely to still use strategies similar to inference in Bayes nets — amongst other cognitive strategies. We think any solution to ELK will probably need to cope with the difficulties posed by the Bayes net test case — amongst other difficulties. We’ve also considered a number of other simple test cases, and found that counterexamples similar to the ones we’ll discuss in this report apply to all of them.

Humans use some collection of cognitive algorithms to reason about the world, and it’s plausible that strategies which resemble inference on graphical models play a role in some of our understanding. There’s no obvious way that a messier model of human reasoning which incorporates all the other parts would make ELK easier; there’s nothing that we could obviously exploit to create a strategy.

• Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don’t reason using Bayes nets—but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn’t be logically inconsistent or physically impossible, and we wouldn’t want alignment to fail in that world.

If you solve something given worst-case assumptions, you’ve solved it for all cases. Whereas if you solve it for one specific case (e.g. Bayes nets) then it may still fail if that’s not the case we end up facing.

There’s no obvious way that a messier model of human reasoning makes ELK easier.

Doesn’t this imply that a Bayes-net model isn’t the worst case?

EDIT: I guess it depends on whether “the human isn’t well-modelled using a Bayes net” is a possible response the breaker could give. But that doesn’t seem like it fits the format of finding a test case where the builder’s strategy fails (indeed, “Bayes nets” seems built into the definition of the game).

• Sorry, there were two things you could have meant when you said the assumption that the human uses a Bayes net seemed crucial. I thought you were asking why the builder couldn’t just say “That’s unrealistic” when the breaker suggested the human runs a Bayes net. The answer to that is what I said above—because the assumption is that we’re working in the worst case, the builder can’t invoke unrealism to dismiss the counterexample.

If the question is instead “Why is the builder allowed to just focus on the Bayes net case?”, the answer to that is the iterative nature of the game. The Bayes net case (and in practice a few other simple cases) was the case the breaker chose to give, so if the builder finds a strategy that works for that case they win the round. Then the breaker can come back and add complications which break the builder’s strategy again, and the hope is that after many rounds we’ll get to a place where it’s really hard to think of a counterexample that breaks the builder’s strategy despite trying hard.

• Ah, that makes sense. In the section where you explain the steps of the game, I interpreted the comments in parentheses as further explanations of the step, rather than just a single example. (In hindsight the latter interpretation is obvious, but I was reading quickly—might be worth making this explicit for others who are doing the same.) So I thought that Bayes nets were built into the methodology. Apologies for the oversight!

I’m still a little wary of how much the report talks about concepts in a human’s Bayes net without really explaining why this is anywhere near a sensible model of humans, but I’ll have another read through and see if I can pin down anything that I actively disagree with (since I do agree that it’s useful to start off with very simple assumptions).

• Ah got it. To be clear, Paul and Mark do in practice consider a bank of multiple counterexamples for each strategy with different ways the human and predictor could think, though they’re all pretty simple in the same way the Bayes net example is (e.g. deduction from a set of axioms); my understanding is that essentially the same kinds of counterexamples apply, for essentially the same underlying reasons, to those other simple examples. The doc sticks with one running example for clarity/length reasons.

• “the human isn’t well-modelled using a Bayes net” is a possible response the breaker could give

The breaker is definitely allowed to introduce counterexamples where the human isn’t well-modeled using a Bayes net. Our training strategies (introduced here) don’t say anything at all about Bayes nets, so it’s not clear if this immediately helps the breaker—they are the one who introduced the assumption that the human used a Bayes net (in order to describe a simplified situation where the naive training strategy failed here). We’re definitely not intentionally viewing Bayes nets as part of the definition of the game.

If you solve something given worst-case assumptions, you’ve solved it for all cases. Whereas if you solve it for one specific case (e.g. Bayes nets) then it may still fail if that’s not the case we end up facing.

It seems very plausible that after solving the problem for humans-who-use-Bayes-nets we will find a new counterexample that only works for humans-who-don’t-use-Bayes-nets, in which case we’ll move on to those counterexamples.

It seems even more likely that the builder will propose an algorithm that exploits cognition that humans can do which isn’t well captured by the Bayes net model, which is also fair game. (And indeed several of our approaches do exploit this, e.g. when imagining humans learning new things about the world by performing experiments here or reasoning about the plausibility of model joint distributions here.)

That said, it looks to us like if any of these algorithms worked for Bayes nets, they would at least work for a very broad range of human models; the Bayes net assumption doesn’t seem to be changing the picture much qualitatively.

Echoing Mark in his comment, we’re definitely interested in ways that this assumption seems importantly unrealistic. If you just think it’s generally a mediocre model and results are unlikely to generalize, then you can also wait for us to discover that after finding an algorithm that works for Bayes nets and then finding that it breaks down as we extend to more realistic examples.

Conditioned on ontology identification being impossible, I think it’s most likely to also be impossible for humans who reason about the world using a Bayes net.

Doesn’t this imply that a Bayes-net model isn’t the worst case?

I think Ajeya is just pointing out why it seems useful to search for algorithms that handle Bayes nets. If thinking about Bayes nets is very straightforward and it lets us rule out all the algorithms we can see, then we’re happy to do that as long as it works.

• We don’t think that real humans are likely to be using Bayes nets to model the world. We make this assumption for much the same reasons that we assume models use Bayes nets, namely that it’s a test case where we have a good sense of what we want a solution to ELK to look like. We think the arguments given in the report will basically extend to more realistic models of how humans reason (or rather, we aren’t aware of a concrete model of how humans reason for which the arguments don’t apply).

If you think there’s a specific part of the report where the human Bayes net assumption seems crucial, I’d be happy to try to give a more general form of the argument in question.

• I’m reading along, and I don’t follow the section “Strategy: have AI help humans improve our understanding”. The problem so far is that the AI need only identify bad outcomes that the human labelers can identify, rather than bad outcomes regardless of human-labeler identification.

The solution posed here is to have AIs help the human labeler understand more bad (and good) outcomes, using powerful AI. The section mostly provides justification for making the assumption that we can align these helper AIs (reason: the authors believe there is a counterexample even given this optimistic assumption, so that is where the meat of the discussion currently lies).

But I don’t understand why this helps? I believe the intended outcome is that SmartVault actually tries to figure out whether the diamond has been stolen, rather than whether the human believes the diamond has been stolen. But it still seems to me on the table for it to instead learn to identify whether the human with an AI helper believes the diamond is stolen.

I’m not sure why the addition of an AI is expected to help, other than the obvious manner of expanding the dataset that the labeler can come to strong opinions about. Is that it? Is the idea that the helper AI allows the labeler to understand everything just as well as SmartVault does, so that there’s no difference in their respective Bayes nets, and so it works for SmartVault to use the labeler’s Bayes net?

I would appreciate clarification on this point, I suspect I’m missing something simple.

Added: The post distinguishes between a human operator and a human labeler, and I think this is part of what the post believes is the key point of this particular strategy. But I currently don’t see how an AI-assisted human operator, that is able to outwit a human labeler, translates into SmartVault using something other than the Bayes net the human labeler is using.

• Is the idea that the helper AI allows the labeler to understand everything just as well as SmartVault does, so that there’s no difference in their respective Bayes nets, and so it works for SmartVault to use the labeler’s Bayes net?

Yes, that’s the main way this could work. The question is whether an AI understands things that humans can’t understand by doing amplification/debate/RRM; our guess is yes, and the argument is mostly “until the builder explains why, gradient descent and science may just have pretty different strengths and weaknesses” (and we can make that more concrete by fleshing out what the world may be like and what the AI learns by gradient descent). But it seemed worth raising because this does appear to make the bad reporter’s job much harder and greatly restrict the space of cases where it fails to report tampering.

Methodologically, the way I think about this kind of thing is: (i) we had a counterexample, (ii) after making this change that particular counterexample no longer works, (iii) now we want to think through whether the counterexample can be adapted.

This is also legitimately less obvious. An AI can’t simulate (human+AI helpers), since each AI helper is as smart as the AI itself and so simulating (human+AI helpers) clearly requires more compute than the AI has. The counterexample is that the AI should just try its best to do inference in the Bayes net that includes “everything the human could understand with the amount of science they have time to do.”

But that does still leave the builder with avenues to try to strengthen the algorithm and win. One way is discussed in the section on speed regularization: if the AI is “trying its best” to do inference in the human Bayes net then there might always be returns to having more time to think (and so it might be able to benefit by transferring over its understanding of what was happening in the AI Bayes net rather than recomputing from the observations). The next step for a builder who wanted to take this approach would be to argue that they can reliably construct a complex enough dataset that this advantage is relevant.

My guess is that this doesn’t work on its own, but if you could scalably construct more complex data then it might work when combined with imitative generalization, as discussed here.

• This is an interesting tack, this step and the next (“Strategy: have humans adopt the optimal Bayes net”) feels new to me.

• Question: what’s the relative amount of compute you are imagining SmartVault and the helper AI having? Both the same, or one having a lot more?

• It will depend on how much high-quality data you need to train the reporter. Probably it’s a small fraction of the data you need to train the predictor, and so for generating each reporter datapoint you can afford to use many times more computation than the predictor usually uses. I often imagine the helpers having 10-100x more computation time.

• From the section “Strategy: have humans adopt the optimal Bayes net”:

Roughly speaking, imitative generalization:

• Considers the space of changes the humans could make to their Bayes net;

• Learns a function which maps (proposed change to Bayes net) to (how a human — with AI assistants — would make predictions after making that change);

• Searches over this space to find the change that allows the humans to make the best predictions.
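Read as pseudocode, the three quoted steps amount to a search loop. The following is a minimal sketch with hypothetical function names (`human_predict` stands in for the learned map from a proposed change to the human-with-assistants' predictions):

```python
def imitative_generalization(candidate_changes, human_predict,
                             prediction_loss, data):
    """Toy sketch of the three quoted steps: enumerate proposed changes,
    predict how a human (with AI assistants) would forecast after each
    change, and keep the change giving the best predictions."""
    best_change, best_loss = None, float("inf")
    for change in candidate_changes:          # step 1: space of changes
        predictions = human_predict(change)   # step 2: learned map
        loss = prediction_loss(predictions, data)
        if loss < best_loss:                  # step 3: search for the best
            best_change, best_loss = change, loss
    return best_change
```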

Regarding the second step, what is the meat of this function? My superficial understanding is that a Bayes net is deterministic and fully-specified, and that we already have the tools to be able to say “given a change to the value of node A of a Bayes net, here is what probability will be assigned to node B of the Bayes net”.

I suspect you’re imagining something clever involving the human’s Bayes net plus the AI, but perhaps you just mean faster and faster algorithms for computing this update given a very complex world-model.

• In general we don’t have an explicit representation of the human’s beliefs as a Bayes net (and none of our algorithms are specialized to this case), so the only way we are representing “change to Bayes net” is as “information you can give to a human that would lead them to change their predictions.”

That said, we also haven’t described any inference algorithm other than “ask the human.” In general inference is intractable (even in very simple models), and the only handle we have on doing fast+acceptable approximate inference is that the human can apparently do it.

(Though if that was the only problem then we also expect we could find some loss function that incentivizes the AI to do inference in the human Bayes net.)

• I’m curious if you have a way to summarise what you think the “core insight” of ELK is, that allows it to improve on the way other alignment researchers think about solving the alignment problem.

• I wrote some thoughts that look like they won’t get posted anywhere else, so I’m just going to paste them here with minimal editing:

• They (ARC) seem to imagine that for all the cases that matter, there’s some ground-truth-of-goodness judgment the human would make if they knew the facts (in a fairly objective way that can be measured by how well the human does at predicting things), and so our central challenge is to figure out how to tell the human the facts (or predict what the human would say if they knew all the facts).

• In contrast, I don’t think there’s some “state of knowing all the facts” the human can be in. There are multiple ways of “giving the human the facts” that can lead to different judgments in even mildly interesting cases. And we’re stuck with this indeterminacy—trying to get rid of it seems to me like saying “no, I mean just use the true model of the axioms.”

• I think one intuitive response to this is to say “okay, so let’s put a measure over ways to inform humans, and then sample or take some sort of average.” But I think this is trying too hard to plow straight ahead while assuming humans are on average sensible/reliable. Instead, there are going to be some cases where humans tend to converge on a small number of sensible answers, and there we can go ahead and do some kind of averaging or take the mode. There are other cases where humans have important higher-order preferences about how they want certain information to be processed and would call the naive average biased or bad. And there are other cases where humans don’t converge well at all, and we want the AI to not just plow ahead with an average, but to notice that humans are being incompetent and not put much weight on their opinion.

• Generally we are asking for an AI that doesn’t give an unambiguously bad answer, and if there’s any way of revealing the facts where we think a human would (defensibly) agree with the AI, then probably the answer isn’t unambiguously bad and we’re fine if the AI gives it.

There are lots of possible concerns with that perspective; probably the easiest way to engage with them is to consider some concrete case in which a human might make different judgments, but where it’s catastrophic for our AI not to make the “correct” judgment. I’m not sure what kind of example you have in mind and I have somewhat different responses to different kinds of examples.

For example, note that ELK is never trying to answer any questions of the form “how good is this outcome?”; I certainly agree that there can also be ambiguity about questions like “did the diamond stay in the room?” but it’s a fairly different situation. The most relevant sections are narrow elicitation and why it might be sufficient, which gives a lot of examples of where we think we can/can’t tolerate ambiguity, and to a lesser extent avoiding subtle manipulation, which explains how you might get a good outcome despite tolerating such ambiguity. That said, there are still lots of reasonable objections to both of those.

• When you say “some case in which a human might make different judgments, but where it’s catastrophic for the AI not to make the correct judgment,” what I hear is “some case where humans would sometimes make catastrophic judgments.”

I think such cases exist and are a problem for the premise that some humans agreeing means an idea meets some standard of quality. Bumbling into such cases naturally might not be a dealbreaker, but there are some reasons we might get optimization pressure pushing plans proposed by an AI towards the limits of human judgment.

• I think the problem you’re getting at here is real—path-dependency of what a human believes on how they came to believe it, keeping everything else fixed (e.g., what the beliefs refer to) -- but I also think ARC’s ELK problem is not claiming this isn’t a real problem but rather bracketing (deferring) it for as long as possible. Because there are cases where ELK fails that don’t have much path-dependency in them, and we can focus on solving those cases until whatever else is causing the problem goes away (and only path-dependency is left).

• Curated. The authors write:

We believe that there are many promising and unexplored approaches to this problem, and there isn’t yet much reason to believe we are stuck or are faced with an insurmountable obstacle.

If it’s true that this is both a core alignment problem and we’re not stuck on it, then that’s fantastic. I am not an alignment researcher and don’t feel qualified to comment on quite how promising this work seems, but I find the report both accessible and compelling. I recommend it to anyone curious about where some of the alignment leading edge is.

Also, I find a striking resemblance to MIRI’s proposed visible thoughts project. They appear to be getting at the same thing, though via quite different models (i.e. Bayes nets vs. language models). It’d be amazing if both projects flourished and understanding could be combined from each.

• In terms of the relationship to MIRI’s visible thoughts project, I’d say the main difference is that ARC is attempting to solve ELK in the worst case (where the way the AI understands the world could be arbitrarily alien from and more sophisticated than the way the human understands the world), whereas the visible thoughts project is attempting to encourage a way of developing AI that makes ELK easier to solve (by encouraging the way the AI thinks to resemble the way humans think). My understanding is MIRI is quite skeptical that a solution to worst-case ELK is possible, which is why they’re aiming to do something more like “make it more likely that conditions are such that ELK-like problems can be solved in practice.”

• Thanks Ruby! I’m really glad you found the report accessible.

One clarification: Bayes nets aren’t important to ARC’s conception of the problem of ELK or its solution, so I don’t think it makes sense to contrast ARC’s approach against an approach focused on language models or describe it as seeking a solution via Bayes nets.

The form of a solution to ELK will still involve training a machine learning model (which will certainly understand language and could just be a language model) using some loss function. The idea that this model could learn to represent its understanding of the world in the form of inference on some Bayes net is one of a few simple test cases that ARC uses to check whether the loss functions they’re designing will always incentivize honestly answering straightforward questions.

For example, another simple test case (not included in the report) is that the model could learn to represent its understanding of the world in a bunch of “sentences” that it performs logical operations on to transform into other sentences.

These test cases are settings for counterexamples, but not crucial to proposed solutions. The idea is that if your loss function will always learn a model that answers straightforward questions honestly, it should work in particular for these simplified cases that are easy to think about.

• Thanks for the clarification, Ajeya! Sorry to make you have to explain that; it was a mistake to imply that ARC’s conception is specifically anchored on Bayes nets, since the report was quite clear that it isn’t.

• I wrote a post in response to the report: Eliciting Latent Knowledge Via Hypothetical Sensors.

Some other thoughts:

• I felt like the report was unusually well-motivated when I put my “mainstream ML” glasses on, relative to a lot of alignment work.

• ARC’s overall approach is probably my favorite among the alignment research groups I’m aware of. I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.

• Not sure if this is relevant in practice, but… the report talks about Bayesian networks learned via gradient descent. From what I could tell after some quick Googling, it doesn’t seem all that common to do this, and it’s not clear to me if there has been any work at all on learning the node structure (as opposed to internal node parameters) via gradient descent. It seems like this could be tricky because the node structure is combinatorial in nature and thus less amenable to a continuous optimization technique like gradient descent.

• There was recently a discussion on LW about a scenario similar to the SmartVault one here. My proposed solution was to use reward uncertainty—as applied to the SmartVault scenario, this might look like: “train lots of diverse mappings between the AI’s ontology and that of the human; if even one mapping of a situation says the diamond is gone according to the human’s ontology, try to figure out what’s going on”. IMO this general sort of approach is quite promising, interested to discuss more if people have thoughts.

• Thanks for the kind words (and proposal)!

There was recently a discussion on LW about a scenario similar to the SmartVault one here. My proposed solution was to use reward uncertainty—as applied to the SmartVault scenario, this might look like: “train lots of diverse mappings between the AI’s ontology and that of the human; if even one mapping of a situation says the diamond is gone according to the human’s ontology, try to figure out what’s going on”. IMO this general sort of approach is quite promising, interested to discuss more if people have thoughts

I broadly agree with “train a bunch of models and panic if any of them say something is wrong.” The main catch is that this only works if none of the models are optimized to say something scary, or to say something different for the sake of being different. We discuss this a bit in this appendix.
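A minimal sketch of that "train many models and panic on disagreement" check (hypothetical interface; as noted, it only helps if no model is optimized to sound scary or contrarian):

```python
from typing import Callable, List

def flag_for_review(reporters: List[Callable[[object], bool]],
                    situation: object) -> bool:
    """Toy sketch: ask every trained reporter the same question
    (e.g. "is the diamond still in the room?") and escalate to a
    human whenever any two reporters disagree."""
    answers = {reporter(situation) for reporter in reporters}
    return len(answers) > 1  # True means "panic": at least one dissent
```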

Not sure if this is relevant in practice, but… the report talks about Bayesian networks learned via gradient descent. From what I could tell after some quick Googling, it doesn’t seem all that common to do this, and it’s not clear to me if there has been any work at all on learning the node structure (as opposed to internal node parameters) via gradient descent. It seems like this could be tricky because the node structure is combinatorial in nature and thus less amenable to a continuous optimization technique like gradient descent.

We’re imagining the case where the predictor internally performs inference in a learned model, i.e. we’re not explicitly learning a Bayesian network but merely considering possibilities for what an opaque neural net is actually doing (or approximating) on the inside. I don’t think this is a particularly realistic possibility, but if ELK fails in this kind of simple case it seems likely to fail in messier realistic cases.

I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.

(We’re actually planning to do a narrower contest focused on ELK proposals.)

• From a complexity-theoretic viewpoint, how hard could ELK be? Is there any evidence that ELK is decidable?

• I’m pretty confused about the plan to use ELK to solve outer alignment. If Cakey is not actually trained, how are amplified humans accessing its world model?

“To avoid this fate, we hope to find some way to directly learn whatever skills and knowledge Cakey would have developed over the course of training without actually training a cake-optimizing AI...

1. Use imitative generalization combined with amplification to search over some space of instructions we could give an amplified human that would let them make cakes just as delicious as Cakey’s would have been.

2. Avoid the problem of the most helpful instructions being opaque (e.g. “Run this physics simulation, it’s great”) by solving ELK — i.e., finding a mapping from whatever possibly-opaque model of the world happens to be most useful for making superhumanly delicious cakes to concepts humans care about like “people” being “alive.”

3. Spell out a procedure for scoring predicted futures that could be followed by an amplified human who has access to a) Cakey’s great world model, and b) the correspondence between it and human concepts of interest. We think this procedure should choose scores using some heuristic along the lines of “make sure humans are safe, preserve option value, and ultimately defer to future humans about what outcomes to achieve in the world” (we go into much more detail in Appendix: indirect normativity).

4. Distill their scores into a reward model that we use to train Hopefully-Aligned-Cakey, which hopefully uses its powers to help humans build the utopia we want.”

• Thanks, this makes it pretty clear to me how alignment could be fundamentally hard besides deception. (The problem seems to hold even if your values are actually pretty simple; e.g. if you’re a pure hedonistic utilitarian and you’ve magically solved deception, you can still fail at outer alignment by your AI optimizing for making it look like there’s more happiness and less suffering.)

Some (perhaps basic) notes to check that I’ve understood this properly:

• The Bayes net running example per se isn’t really necessary for ELK to be a problem.

• The basic problem is that in training, the AI can do just as well by reporting what a human would believe given their observations, and upon deployment in more complex tasks the report of what a human would believe can come apart from the “truth” (what the human would believe given arbitrary knowledge of the system).

• This seems to crop up for a variety of models of AI and human cognition.

• It seems like the game is stacked against “doing X” rather than “making it look like X” in many contexts, such that even with regularizers that push towards the former, the overall inductive bias would plausibly still be towards the latter. It’s just easier to make it look to humans like you’re creating a utopia than to do all the complex work of utopia-building.

• I suspect this would hold even for much less ambitious yet still superhuman tasks, such that deferring to future human-level aligned AIs wouldn’t be sufficient.

• But, if we train a reporter module, reporting what the human would believe doesn’t seem prima facie easier than reporting the truth in this way. So that’s why we might reasonably hope a good regularizer can break the tie.

• In the build-break loop examples in the report, we’re generously assuming the human overseers know the relevant set of questions to ask to check if there’s malfeasance going on. And that this set isn’t so hopelessly large that iterating through it for training is too slow.

• In the imitative generalization example, it seems like besides the problem that the output Bayes net may be ontologically incomprehensible to humans, the training process requires humans to understand all the relevant hypotheses and data (to report their priors and likelihoods). This may be a general confusion about imitative generalization on my part.

• If we tried distillation to get around the prohibitive slowness of amplification for the “AI science” proposal, that would introduce both inner alignment problems and perhaps bring us to the same sort of “alien ontology” problem as the imitative generalization proposal.

• The ontology mismatch problem isn’t just a possibility, it seems pretty likely by default, for reasons summarized in the plot of model interpretability here.

• Intuitively, the ontology/primitive concepts that quantum physicists use to make very excellent predictions about the universe—better than I could make, certainly—are alien to me, and to anyone else who hasn’t spent a lot of time learning quantum physics. This is consistent with human-interpretable concepts being more prevalent in recent powerful language models than in early-2010s neural networks.

• Deferring to future human-level aligned AIs isn’t sufficient because even if we had many more human-level minds giving feedback to superhuman AIs, they would still be faced with ELK too. i.e., This doesn’t seem to be a problem that can be solved just by parallelizing across more overseers than we currently have, although having aligned assistants could of course still help with ELK research.

• If the reporter estimates every node of the human’s Bayes net, then it can assign a node a probability distribution different from the one that would be calculated from the distributions simultaneously assigned to its parent nodes. I don’t know if there is a name for that, so for now I will pompously call it inferential inconsistency. Treated as a boolean, bright-line concept, the human simulator is clearly the only inferentially consistent reporter. But one could also pick some metric on how different two probability distributions are and turn it into a more gradual notion.

Being a reporter basically means being inferentially consistent on the training set. On the other hand, being inferentially consistent everywhere means being the human simulator. So a direct translator would differ from a human simulator by being inferentially inconsistent on some inputs outside of the training set. This could in principle be checked by sampling random possible inputs. The human could then try to distinguish a direct translator from a randomly overfitted model by trying to understand a small sample of these inferential inconsistencies.

So much for my thoughts inside the paradigm; now on to snottily rejecting it. The intuition that the direct translator should exist seems implausible to me, and the idea that it would be so strong an attractor that a training strategy merely avoiding the human simulator would find it quasi-automatically borders on the absurd. Modeling a constraint on the training set but not outside of it is basically what overfitting is, and overfitted solutions with many specialized degrees of freedom are usually highly degenerate. In other words, penalizing the human simulator would almost certainly lead to something closer to a pseudorandomizer than to a direct translator. Looking at it another way, the direct translator is supposed to be helpful precisely in situations the human would perceive as contradictory. To put it differently, these are not merely bad model fits, but models strongly misspecified and then extrapolated far outside the sample space. Those are basically the situations where statistical inference and machine learning have strong track records of not working.
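The “inferential inconsistency” notion above can be made concrete with a minimal sketch. Everything here is illustrative and not from the ELK report: a toy two-node net A → B, a hypothetical CPT, and total-variation distance as one possible choice for the “metric on how different probability distributions are”.

```python
def implied_child_marginal(parent_dist, cpt):
    """Marginal over the child implied by the parent marginal and the CPT
    P(child | parent), via the law of total probability."""
    child_vals = next(iter(cpt.values())).keys()
    return {
        c: sum(parent_dist[p] * cpt[p][c] for p in parent_dist)
        for c in child_vals
    }

def inconsistency(reported_child_dist, parent_dist, cpt):
    """Total-variation distance between the reporter's distribution for the
    child node and the one implied by its reported parent distribution.
    Zero means inferentially consistent at this node."""
    implied = implied_child_marginal(parent_dist, cpt)
    return 0.5 * sum(abs(reported_child_dist[c] - implied[c]) for c in implied)

# Hypothetical human CPT: P(B | A), binary variables.
cpt = {"a0": {"b0": 0.9, "b1": 0.1},
       "a1": {"b0": 0.2, "b1": 0.8}}

parent = {"a0": 0.5, "a1": 0.5}                # reporter's marginal over A
consistent_child = {"b0": 0.55, "b1": 0.45}    # exactly what the CPT implies
inconsistent_child = {"b0": 0.30, "b1": 0.70}  # deviates from the implied marginal

print(inconsistency(consistent_child, parent, cpt))    # ~0.0
print(inconsistency(inconsistent_child, parent, cpt))  # 0.25
```

The gradual version of the concept would then be something like the sum of this score over all non-root nodes, rather than the boolean “zero everywhere”.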

• In the section “New counterexample: better inference in the human Bayes net”, what is meant by the reporter doing perfect inference in the human Bayes net? I am also unclear on how the modified counterexample is different.

My current understanding: The reporter does inference using the observed video and the action sequence, and does not use the predicted video to do inference (the predicted video is inferred). The reporter has an exact copy of the human Bayes net and fixes the nodes for the observed video and the action sequence. Then it infers the probability of every possible combination of values the remaining nodes can take (including the predicted video), i.e., the joint probability distribution.

I am not sure here. Is the reporter not using the predicted video? The graphic in that section shows a red arrow from the predicted video in the predictor to the human Bayes net model that the reporter uses. But that could already be about the better counterexample.

Now we assume that the model knows how to map a natural-language question onto nodes in the Bayes net, and that it can translate values of nodes back into answers. The model can then use the joint probability distribution and the law of total probability to calculate the probabilities of nodes/​events occurring, which it can use to answer questions.

The only difference in the better counterexample is that we now also fix the value of the predicted video to whatever the predictor part of the model said would happen. And since we do not assume that the predictor works perfectly, the reporter can give wrong answers because of that.

And now that we have fixed the predicted video, does calculating the joint probability distribution become computationally feasible? Are we still assuming that the reporter does perfect inference in the human Bayes net, given that the predictor predicted correctly?

• In all of the counterexamples the reporter starts from the video, actions, and predicted video produced by the predictor. In order to answer questions it needs to infer the latent variables in the human’s model.

Originally we described a counterexample where it copied the human inference process.

The improved counterexample is to instead use lots of computation to do the best inference it can, rather than copying the human’s mediocre inference. To make the counterexample fully precise we’d need to specify an inference algorithm and other details.

We still can’t do perfect inference though—there are some inference problems that just aren’t computationally feasible.

(That means there’s hope for creating data where the new human simulator does badly because of inference mistakes. And maybe if you are careful it will also be the case that the direct translator does better, because it effectively reuses the inference work done in the predictor? To get a proposal along these lines we’d need to describe a way to produce data that involves arbitrarily hard inference problems.)
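One way to picture “the best inference it can, rather than copying the human’s mediocre inference” is an approximate-inference algorithm whose answers improve with compute but never become exact. As a hedged illustration (not the algorithm the report has in mind), here is likelihood weighting on the same kind of toy chain net, estimating a latent node given the predicted video:

```python
import random

# Toy chain: obs -> latent -> video, all binary. CPTs are made up.
p_obs = {0: 0.6, 1: 0.4}                               # P(obs)
p_latent = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}  # P(latent | obs)
p_video = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}} # P(video | latent)

def sample(dist):
    """Draw one value from a {value: probability} dict."""
    return random.choices(list(dist), weights=dist.values())[0]

def lw_posterior(n_samples, evidence_video):
    """Estimate P(latent | video = evidence_video) by likelihood weighting:
    sample ancestors from the prior, weight by the evidence likelihood."""
    weights = {0: 0.0, 1: 0.0}
    for _ in range(n_samples):
        obs = sample(p_obs)
        latent = sample(p_latent[obs])
        weights[latent] += p_video[latent][evidence_video]
    z = sum(weights.values())
    return {v: w / z for v, w in weights.items()}

random.seed(0)
# Exact answer is 0.405 / 0.497 ≈ 0.815; the estimate converges toward it.
print(lw_posterior(20_000, 1))
```

With few samples (or in nets where the evidence is very unlikely under the prior), estimates like this are badly wrong, which is the kind of inference mistake the parenthetical above hopes to exploit against the new human simulator.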

• Ah ok, thank you. Now I get it. I was confused by (i) “Imagine the reporter could do perfect inference” and (ii) “the reporter could simply do the best inference it can in the human Bayes net (given its predicted video)”.

(i) I thought this meant that the reporter alone can do it, but what is actually meant is that it can do it with the use of the predictor model.

(ii) Somehow I thought that “given its predicted video” was the important modification here, when in fact the only change is from the reporter being able to do perfect inference to it doing the best inference it can.
