# Ramana Kumar

Karma: 1,179
• 9 Dec 2022 16:28 UTC
LW: 3 AF: 2
0 ∶ 0
AF

(Bold direct claims, not super confident—criticism welcome.)

The approach to ELK in this post is unfalsifiable.

A counterexample to the approach would need to be a test-time situation in which:

1. The predictor correctly predicts a safe-looking diamond.

2. The predictor “knows” that the diamond is unsafe.

3. The usual “explanation” (e.g., heuristic argument) for safe-looking-diamond predictions on the training data applies.

Points 2 and 3 are in direct conflict: the predictor knowing that the diamond is unsafe rules out the usual explanation for the safe-looking predictions.

So now I’m unclear what progress has been made. This looks like simply defining “the predictor knows P” as “there is a mechanistic explanation of the outputs starting from an assumption of P in the predictor’s world model”, then declaring ELK solved by noting we can search over and compare mechanistic explanations.

• 9 Dec 2022 11:08 UTC
LW: 2 AF: 1
0 ∶ 0
AF

I think you’re right—thanks for this! It makes sense now that I recognise the quote was in a section titled “Alignment research can only be done by AI systems that are too dangerous to run”.

• 8 Dec 2022 11:46 UTC
LW: 2 AF: 1
0 ∶ 0
AF

“We can compute the probability that a cell is alive at timestep 1 if each of it and each of its 8 neighbors is alive independently with probability 10% at timestep 0.”

We the readers (or, I guess, specifically the heuristic argument itself) can do this, but the “scientists” cannot, because the

“scientists don’t know how the game of life works”.

Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque?
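(For what it’s worth, the quoted computation is one anyone can replicate by brute-force enumeration under Conway’s standard rules; the sketch below is illustrative, not from the post, and takes the 10% figure from the quote.)

```python
from itertools import product

# Probability that each of the 9 cells (center + 8 neighbors) is alive
# at timestep 0, independently. The 10% figure is from the quoted passage.
p = 0.1

prob_alive = 0.0
for state in product([0, 1], repeat=9):
    center, live_neighbors = state[0], sum(state[1:])
    # Conway's rules: a live cell survives with 2 or 3 live neighbors;
    # a dead cell is born with exactly 3 live neighbors.
    if (center and live_neighbors in (2, 3)) or (not center and live_neighbors == 3):
        # Weight of this configuration under independent 10% aliveness.
        weight = 1.0
        for cell in state:
            weight *= p if cell else 1 - p
        prob_alive += weight

print(prob_alive)  # ≈ 0.0479
```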

Another thing confusing to me along these lines:

“for example they may have noticed that A-B patterns are more likely when there are fewer live cells in the area of A and B”

where do they (the scientists) notice these fewer live cells? Do they have some deep interpretability technique for examining the generative model and “seeing” its grid of cells?

• They have a strong belief that in order to do good alignment research, you need to be good at “consequentialist reasoning,” i.e. model-based planning, that allows creatively figuring out paths to achieve goals.

I think this is a misunderstanding, and that approximately zero MIRI-adjacent researchers hold this belief (that good alignment research must be the product of good consequentialist reasoning). What seems more true to me is that they believe that better understanding consequentialist reasoning—e.g., where to expect it to be instantiated, what form it takes, how/why it “works”—is potentially highly relevant to alignment.

• I’m focusing on the code in Appendix B.

What happens when self.diamondShard’s assessment of whether some consequences contain diamonds differs from ours? (Assume the agent’s world model is especially good.)

• upweights actions and plans that lead to

How is it determined what the actions and plans lead to?

• 28 Nov 2022 15:21 UTC
LW: 6 AF: 4
0 ∶ 0
AF

We expect an explanation in terms of the weights of the model and the properties of the input distribution.

We have a model that predicts a very specific pattern of observations, corresponding to “the diamond remains in the vault.” We have a mechanistic explanation π for how those correlations arise from the structure of the model.

Now suppose we are given a new input on which our model predicts that the diamond will appear to remain in the vault. We’d like to ask: in this case, does the diamond appear to remain in the vault for the normal reason π?

A problem with this: π can explain the predictions on both train and test distributions without all the test inputs corresponding to safe diamonds. In other words, the predictions can be made for the “normal reason” π even when the normal reason of the diamond being safe doesn’t hold.

(elaborating the comment above)

Because π is a mechanistic (as opposed to teleological, or otherwise reference-sensitive) explanation, its connection to what we would like to consider “normal reasons” has been weakened if not outright broken.

On the training distribution suppose we have two explanations for the “the diamond remains in the vault” predicted observations.

First there is ɸ, the explanation that there was a diamond in the vault and the cameras were working properly, etc. and the predictor is a straightforward predictor with a human-like world-model (ɸ is kinda loose on the details of how the predictor works, and just says that it does work).

Then there is π, which is an explanation that relies on various details about the circuits implemented by the weights of the predictor that traces abstractly how this distribution of inputs produces outputs with the observed properties, and uses various concepts and abstractions that make sense of the particular organisation of this predictor’s weights. (π is kinda glib about real world diamonds but has plenty to say about how the predictor works, and some of what it says looks like there’s a model of the real world in there.)

We might hope that a lot of the concepts π is dealing in do correspond to natural human things like object permanence or diamonds or photons. But suppose not all of them do, and/or there are some subtle mismatches.

Now on some out-of-distribution inputs that produce the same predictions, we’re in trouble when π is still a good explanation of those predictions but ɸ is not. This could happen because, e.g., π’s version of “object permanence” is just broken on this input, and was never really about object permanence but rather about a particular group of circuits that happen to do something object-permanence-like on the training distribution. Or maybe π refers to the predictor’s alien diamond-like concept that humans wouldn’t agree with if they understood it but does nevertheless explain the prediction of the same observations.

Is it an assumption of your work here (or maybe a desideratum of whatever you find to do mechanistic explanations) that the mechanistic explanation is basically in terms of a world model or simulation engine, and we can tell that’s how it’s structured? I.e., it’s not some arbitrary abstract summary of the predictor’s computation. (And also that we can tell that the world model is good by our lights?)

• Partitions (of some underlying set) can be thought of as variables like this:

• The number of values the variable can take on is the number of parts in the partition.

• Every element of the underlying set has some value for the variable, namely, the part that that element is in.

Another way of looking at it: say we’re thinking of a variable X as a function f from the underlying set S to X’s domain D. Then we can equivalently think of X as the partition of S into the preimages of f, with (up to) |D| parts.

In what you quoted, we construct the underlying set by taking all possible combinations of values for the “original” variables. Then we take all partitions of that to produce all “possible” variables on that set, which will include the original ones and many more.

• 25 Nov 2022 17:38 UTC
LW: 6 AF: 3
2 ∶ 0
AF

I agree with you—and yes we ignore this problem by assuming goal-alignment. I think there’s a lot riding on the pre-SLT model having “beneficial” goals.

# Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

25 Nov 2022 14:36 UTC
36 points
(vkrakovna.wordpress.com)

# Threat Model Literature Review

1 Nov 2022 11:03 UTC
54 points

# Clarifying AI X-risk

1 Nov 2022 11:03 UTC
100 points
• 20 Sep 2022 11:01 UTC
LW: 2 AF: 2
0 ∶ 0
AF

I’ll take a stab at answering the questions for myself (fairly quick takes):

1. No, I don’t care about whether a model is an optimiser per se. I care only insofar as being an optimiser makes it more effective as an agent. That is, if it’s robustly able to achieve things, it doesn’t matter how. (However, it could be impossible to achieve things without being shaped like an optimiser; this is still unresolved.)

2. I agree that it would be nice to find definitions such that capacity and inclination split cleanly. Retargetability is one approach to this, e.g., operationalised as fine-tuning effort required to redirect inclinations.

3. I think there are two: incorrect labels (when the feedback provider isn’t capable enough to assess the examples it needs to evaluate), and underspecification (leading to goal misgeneralisation).

4. Goal misgeneralisation. More broadly (to also include capability misgeneralisation), robustness failures.

5. No, I don’t think they’re important to distinguish.

• 8 Sep 2022 10:24 UTC
LW: 8 AF: 4
2 ∶ 0
AF
in reply to: janus’s comment on: Simulators

I think Dan’s point is good: that the weights don’t change, and the activations are reset between runs, so the same input (including rng) always produces the same output.

I agree with you that the weights and activations encode knowledge, but Dan’s point is still a limit on learning.

I think there are two options for where learning may be happening under these conditions:

• During the forward pass. Even though the function always produces the same output for a given input, the computation of that output involves some learning.

• Using the environment as memory. Think of the neural network function as a choose-your-own-adventure book that includes responses to many possible situations depending on which prompt is selected next by the environment (which itself depends on the last output from the function). Learning occurs in the selection of which paths are actually traversed.

These can occur together. E.g., the “same character” as was invoked by prompt 1 may be invoked by prompt 2, but they now have more knowledge (some of which was latent in the weights, some of which came in directly via prompt 2; but all of which was triggered by prompt 2).

• 5 Sep 2022 15:47 UTC
LW: 3 AF: 3
1 ∶ 0
AF

Expanding a bit on why: I think this will fail because the house-building AI won’t actually be very good at instrumental reasoning, so there’s nothing for the sticky goals hypothesis to make use of.

• 5 Sep 2022 15:43 UTC
LW: 3 AF: 3
1 ∶ 0
AF

I agree with this prediction directionally, but not as strongly.

I’d prefer a version where we have a separate empirical reason to believe that the training and finetuning approaches used can support transfer of something (e.g., some capability), to distinguish goal-not-sticky from nothing-is-sticky.

• 30 Aug 2022 13:35 UTC
LW: 3 AF: 1
2 ∶ 0
AF
in reply to: TurnTrout’s comment

Hm, no, not really.

OK let’s start here then. If what I really want is an AI that plays tic-tac-toe (TTT) in the real world well, what exactly is wrong with saying the reward function I described above captures what I really want?

There are several claims which are not true about this function:

Neither of those claims seemed right to me. Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is? [I presume the problem you’re getting at isn’t as simple as wanting desires to look like (image, digit-label, goodness) tuples as opposed to (image, correct digit-label) tuples.]