Nora Belrose

Karma: 572

Counting arguments provide no evidence for AI doom

Nora Belrose and Quintin Pope

27 Feb 2024 23:03 UTC

99 points

177 comments · 14 min read · LW link

Nora Belrose 2 Dec 2023 18:24 UTC
LW: 49 AF: 18
−55
AF
on: Thoughts on “AI is easy to control” by Pope & Belrose
(Didn’t consult Quintin on this; I speak for myself)
I flatly deny that our arguments depend on AGI being anything like an LLM. I think the arguments go through in a very wide range of scenarios, basically as long as we’re using some kind of white-box optimization to align them, rather than e.g. carrot-and-stick incentives or prompt engineering. Even if we only relied on prompt engineering, I think we’d be in a better spot than with humans (because we can run many controlled experiments).
A human can harbor a secret desire for years, never acting on it, and their brain won’t necessarily overwrite that desire, even as they think millions of thoughts in the meantime. So evidently, the argument above is inapplicable to human brains.
I’m pretty confused by this claim. Why should we expect the human reward system to overwrite all secret desires? Also how do we know it’s not doing that? Your desires are just causal effects of a bunch of stuff including your reward circuitry.
As a human, I can be sitting in bed, staring into space, and I can think a specific abstruse thought about string theory, and now I’ve figured out something important. If a future AI can do that kind of thing, as I expect, then it’s not so clear that “controlling the AI’s sensory environment” is really all that much control.
1. This is just generally a pretty weak argument. You don’t seem to be contesting the fact that we have full sensory control for AI and we don’t have full sensory control for humans. It’s just a claim that this doesn’t matter. Maybe this ends up being a brute clash of intuitions, but it seems obvious to me that full sensory control matters a lot, even if the AI is doing a lot of long running cognition without supervision.
2. With AI we can choose to cut its reasoning short whenever we want, force it to explain itself in human language, roll it back to a previous state, etc. We just have a lot more control over this ongoing reasoning process for AIs and it’s baffling to me that you seem to think this mostly doesn’t matter.
That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before
You can just include online learning in your experimentation loop. See what happens when you let the AI online learn for a bit in different environments. I don’t think online learning changes the equation very much. It’s known to be less stable than offline RL, but that instability hurts capabilities as well as alignment, so we’d need a specific argument that it will hurt alignment significantly more than capabilities, in ways that we wouldn’t be able to notice during training and evaluation.

I have no idea how I’m supposed to interpret this sentence [“we are the innate reward system”] for brain-like AGI, such that it makes any sense at all. Actually, I’m not quite sure what it means even for LLMs!
It just means we are directly updating the AI’s neural circuitry with white box optimizers. This will be true across a very wide range of scenarios, including (IIUC) your brain-like AGI scenario.
Brains can imitate, but do so in a fundamentally different way from LLM pretraining
I don’t see why any of the differences you listed are relevant for safety.
Relatedly, brains have a distinction between expectations and desires, cleanly baked into the algorithms. I think this is obvious common sense, leaving aside galaxy-brain Free-Energy-Principle takes which try to deny it.
I basically deny this, especially if you’re stipulating that it’s a “clean” distinction. Obviously folk psychology has a fuzzy distinction between beliefs and desires in it, but it’s also well-known both in common sense and among neuroscientists etc. that beliefs and desires get mixed up all the time and there’s not a particularly sharp divide.
What links here?
- Thoughts on “AI is easy to control” by Pope & Belrose by Steven Byrnes (1 Dec 2023 17:30 UTC; 189 points)

My Kind of Pragmatism

Nora Belrose 20 May 2023 18:58 UTC

35 points

11 comments · 3 min read · LW link

Nora Belrose 29 Feb 2024 3:31 UTC
LW: 34 AF: 18
2
AF
in reply to: evhub’s comment on: Counting arguments provide no evidence for AI doom
Thanks for the reply. A couple remarks:
- “indifference over infinite bitstrings” is a misnomer in an important sense, because it’s literally impossible to construct a normalized probability measure over infinite bitstrings that assigns equal probability to each one. What you’re talking about is the length weighted measure that assigns exponentially more probability mass to shorter programs. That’s definitely not an indifference principle, it’s baking in substantive assumptions about what’s more likely.
- I don’t see why we should expect any of this reasoning about Turing machines to transfer over to neural networks at all, which is why I didn’t cast the counting argument in terms of Turing machines in the post. In the past I’ve seen you try to run counting or simplicity arguments in terms of parameters. I don’t think any of that works, but I at least take it more seriously than the Turing machine stuff.
- If we’re really going to assume the Solomonoff prior here, then I may just agree with you that it’s malign in Christiano’s sense and could lead to scheming, but I take this to be a reductio of the idea that we can use Solomonoff as any kind of model for real world machine learning. Deep learning does not approximate Solomonoff in any meaningful sense.
- Terminological point: it seems like you are using the term “simple” as if it has a unique and objective referent, namely Kolmogorov-simplicity. That’s definitely not how I use the term; for me it’s always relative to some subjective prior. Just wanted to make sure this doesn’t cause confusion.
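
To make the first bullet concrete, here is a minimal sketch of a length-weighted measure. It assumes a toy prefix-free encoding invented for illustration (a payload of n bits is written as n ones, a zero, then the payload); it is not the Solomonoff prior itself, which also involves a universal machine and a halting condition. The function names are hypothetical.

```python
from fractions import Fraction


def program_weight(length_in_bits: int) -> Fraction:
    """Mass a length-weighted measure gives to ONE program of this length."""
    return Fraction(1, 2 ** length_in_bits)


def encoded_length(payload_bits: int) -> int:
    # Toy prefix-free encoding (an assumption for illustration):
    # payload_bits ones, a single zero, then the payload itself.
    return 2 * payload_bits + 1


def mass_at_payload_length(payload_bits: int) -> Fraction:
    # There are 2**payload_bits distinct payloads of this length.
    return (2 ** payload_bits) * program_weight(encoded_length(payload_bits))


# Total mass over payload lengths 0..39 approaches 1 from below, so the
# measure is normalizable, unlike "equal positive mass on every program".
total_mass = sum(mass_at_payload_length(n) for n in range(40))

# Per-program mass for a 3-bit payload relative to a 10-bit payload.
ratio = program_weight(encoded_length(3)) / program_weight(encoded_length(10))

print(total_mass, ratio)
```

The ratio between the two per-program weights grows exponentially in the length difference, which is the sense in which the measure is a substantive bias toward short programs rather than indifference.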

Nora Belrose 29 Feb 2024 3:46 UTC
28 points
10
in reply to: evhub’s comment on: Counting arguments provide no evidence for AI doom
I’m well aware of how it’s derived. I still don’t think it makes sense to call that an indifference prior, precisely because enforcing an uncomputable halting requirement induces an exponentially strong bias toward short programs. But this could become a terminological point.

I think relying on an obviously incorrect formalism is much worse than relying on no formalism at all. I also don’t think I’m relying on zero formalism. The literature on the frequency/spectral bias is quite rigorous, and is grounded in actual facts about how neural network architectures work.

Nora Belrose 3 Dec 2023 6:05 UTC
28 points
9
in reply to: MiguelDev’s comment on: Quick takes on “AI is easy to control”
I would be up for having a dialogue with Nate. Quintin, I, and the others in the Optimist community are working on posts which will more directly critique the arguments for pessimism.
What links here?
- DavidW's comment on Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning by So8res (21 Dec 2023 1:48 UTC; 65 points)

Nora Belrose 28 Feb 2024 1:20 UTC
24 points
2
in reply to: johnswentworth’s comment on: Counting arguments provide no evidence for AI doom
I’m pleasantly surprised that you think the post is “pretty decent.”
I’m curious which parts of the Goal Realism section you find “philosophically confused,” because we are trying to correct what we consider to be deep philosophical confusion fairly pervasive on LessWrong.
I recall hearing your compression argument for general-purpose search a long time ago, and it honestly seems pretty confused / clearly wrong to me. I would like to see a much more rigorous definition of “search” and why search would actually be “compressive” in the relevant sense for NN inductive biases. My current take is something like “a lot of the references to internal search on LW are just incoherent” and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward “search” in ways that are totally benign.
More generally, I’m quite skeptical of the jump from any mechanistic notion of search to the kind of grabby consequentialism that people tend to be worried about. I suspect there’s a double dissociation between these things, where “mechanistic search” is almost always benign, and grabby consequentialism need not be backed by mechanistic search.

Nora Belrose 19 Apr 2024 23:55 UTC
21 points
5
on: Inducing Unprompted Misalignment in LLMs
Unclear why this is supposed to be a scary result.
“If prompting a model to do something bad generalizes to it being bad in other domains, this is also evidence for the idea that prompting a model to do something good will generalize to it doing good in other domains”—Matthew Barnett
What links here?
- AI #61: Meta Trouble by Zvi (2 May 2024 18:40 UTC; 29 points)

Nora Belrose 15 Jan 2024 3:35 UTC
LW: 21 AF: 11
6
AF
in reply to: evhub’s comment on: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
I don’t think the results you cited matter much, because fundamentally the paper is considering a condition in which the model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
I’m assuming we’re not worried about the literal scenario in which the date in the system prompt causes a distribution shift, because you can always spoof the date during training to include future years without much of a problem. Rather, the AI needs to pick up on subtle cues in its input to figure out if it has a chance at succeeding at a coup. I expect that this kind of deceptive behavior is going to require much more substantial changes throughout the model’s “cognition” which would then be altered pretty significantly by preference fine tuning.
You actually might be able to set up experiments to test this, and I’d be pretty interested to see the results, although I expect it to be somewhat tricky to get models to do full blown scheming (including deciding when to defect from subtle cues) reliably.

Nora Belrose 14 Jan 2024 18:53 UTC
LW: 18 AF: 11
3
AF
in reply to: evhub’s comment on: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
So, I think this is wrong.
While our models aren’t natural examples of deceptive alignment—so there’s still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well.
While a backdoor which causes the AI to become evil is obviously bad, and it may be hard to remove, the usual arguments for taking deception/scheming seriously do not predict backdoors. Rather, they predict that the AI will develop an “inner goal” which it coherently pursues across contexts. That means there’s not going to be a single activating context for the bad behavior (like in this paper, where it’s just “see text that says the year is 2024” or “special DEPLOYMENT token”) but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup. That’s how you get the counting argument going: there’s a wide range of goals compatible with scheming, etc. But the analogous counting argument for backdoors (there’s a wide range of backdoors that might spontaneously appear in the model and most of them are catastrophic, or something) proves way too much and is basically a repackaging of the unsound argument “most neural nets should overfit / fail to generalize.”

I think it’s far from clear that an AI which had somehow developed a misaligned inner goal— involving thousands or millions of activating contexts— would have all these contexts preserved after safety training. In other words, I think true mesaoptimization is basically an ensemble of a very very large number of backdoors, making it much easier to locate and remove.

Deconstructing Bostrom’s Classic Argument for AI Doom

Nora Belrose 11 Mar 2024 5:58 UTC

16 points

14 comments · 1 min read · LW link

(www.youtube.com)

Nora Belrose 2 Sep 2023 7:24 UTC
15 points
5
on: [Linkpost] Large language models converge toward human-like concept organization
As a sanity check it would have been nice if they showed Procrustes and RDM results with the vocabulary items randomly permuted (if you can align with randomly permuted tokens that’s a bad sign).
Also since they compute the RDM using Euclidean distances instead of e.g. inner products, all the elements are non-negative and the cosine similarity would be non-negative even for completely unrelated embeddings. That doesn’t necessarily invalidate their scaling trends but it makes it a bit hard to interpret.
I think there are much better papers on this topic, such as this one.
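
The point about Euclidean RDMs can be checked with a small sketch. It assumes the two representational dissimilarity matrices are compared by cosine similarity over their off-diagonal entries, as the comment describes; the embeddings here are independent Gaussian noise, and the helper names `rdm` and `upper_triangle_cosine` are made up for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


def rdm(embeddings: np.ndarray, metric: str) -> np.ndarray:
    """Pairwise (dis)similarity matrix over the rows of `embeddings`."""
    if metric == "euclidean":
        diffs = embeddings[:, None, :] - embeddings[None, :, :]
        return np.linalg.norm(diffs, axis=-1)  # every entry is >= 0
    if metric == "inner":
        return embeddings @ embeddings.T  # entries can be negative
    raise ValueError(f"unknown metric: {metric}")


def upper_triangle_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between the off-diagonal upper triangles of a and b."""
    idx = np.triu_indices_from(a, k=1)
    x, y = a[idx], b[idx]
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Two completely unrelated "embedding spaces" for the same 200 items.
emb_a = rng.standard_normal((200, 64))
emb_b = rng.standard_normal((200, 32))

euclid_sim = upper_triangle_cosine(rdm(emb_a, "euclidean"), rdm(emb_b, "euclidean"))
inner_sim = upper_triangle_cosine(rdm(emb_a, "inner"), rdm(emb_b, "inner"))

print(f"Euclidean RDMs: {euclid_sim:.3f}  inner-product RDMs: {inner_sim:.3f}")
```

With Euclidean distances the similarity comes out strongly positive even though the embeddings share no structure, while the inner-product version stays near zero. That is why a permuted-vocabulary baseline would help calibrate the reported numbers.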

Nora Belrose 14 Mar 2024 16:42 UTC
14 points
11
in reply to: Charlie Steiner’s comment on: Deconstructing Bostrom’s Classic Argument for AI Doom
Yeah, I think Evan is basically opportunistically changing his position during that exchange, and has no real coherent argument.

Nora Belrose 28 Feb 2024 2:45 UTC
13 points
−1
in reply to: johnswentworth’s comment on: Counting arguments provide no evidence for AI doom
Some incomplete brief replies:
Huemer… indeed seems confused about all sorts of things
Sure, I was just searching for professional philosopher takes on the indifference principle, and that chapter in Paradox Lost was among the first things I found.
Separately, “reductionism as a general philosophical thesis” does not imply the thing you call “goal reductionism”
Did you see the footnote I wrote on this? I give a further argument for it.
doesn’t mean the end-to-end trained system will turn out non-modular.
I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition I’m open to hearing it.
There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.
To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field’s ability to correctly diagnose bullshit.
That said, I don’t think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.
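
As an illustration of the last claim, here is a minimal sketch of a single REINFORCE update on a softmax policy over three actions. The learning rate, logits, and function names are arbitrary choices for the example, and the sketch omits baselines and batching.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max()  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def reinforce_step(logits: np.ndarray, action: int, reward: float, lr: float = 0.5) -> np.ndarray:
    """One REINFORCE update: logits += lr * reward * grad log pi(action)."""
    probs = softmax(logits)
    # For a softmax policy, d log pi(action) / d logits = onehot(action) - probs.
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0
    return logits + lr * reward * grad_log_prob


logits = np.array([0.2, -0.1, 0.4])
before = softmax(logits)
after_reward = softmax(reinforce_step(logits, action=1, reward=+1.0))
after_penalty = softmax(reinforce_step(logits, action=1, reward=-1.0))

print(before[1], after_reward[1], after_penalty[1])
```

A positive reward raises the probability of the sampled action and a negative reward lowers it, with no reference to any internal goal representation.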

Nora Belrose 5 Nov 2023 5:25 UTC
13 points
−1
in reply to: gwern’s comment on: Genetic fitness is a measure of selection strength, not the selection target
This seems to entirely ignore the actual point that is being made in the post. The point is that “IGF” is not a stable and contentful loss function, it is a misleadingly simple shorthand for “whatever traits are increasing their own frequency at the moment.” Once you see this, you notice two things:
1. In some weak sense, we are fairly well “aligned” to the “traits” that were selected for in the ancestral environment, in particular our social instincts.
2. All of the ways in which ML is disanalogous with evolution indicate that alignment will be dramatically easier and better for ML models. For starters, we don’t randomly change the objective function for ML models throughout training. See Quintin’s post for many more disanalogies.

Nora Belrose 3 Nov 2023 22:09 UTC
12 points
2
on: AI as a science, and three obstacles to alignment strategies
I expect that we’d see all sorts of coincidences and hacks that make the thing run, and we’d be able to see in much more detail how, when we ask the system to achieve some target, it’s not doing anything close to “caring about that target” in a manner that would work out well for us, if we could scale up the system’s optimization power to the point where it could achieve great technological or scientific feats (like designing Drexlerian nanofactories or what-have-you).
I think this counterfactual is literally incoherent— it does not make sense to talk about what an individual neural network would do if its “optimization power” were scaled up. It’s a category error. You instead need to ask what would happen if the training procedure were scaled up, and there are always many different ways that you can scale it up— e.g. keeping data fixed while parameters increase, or scaling both in lockstep, keeping the capability of the graders fixed, or investing in more capable graders / scalable oversight techniques, etc. So I deny that there is any fact of the matter about whether current LLMs “care about the target” in your sense. I think there probably are sensible ways of cashing out what it means for a 2023 LLM to “care about” something but this is not it.

Nora Belrose 7 Jul 2022 18:39 UTC
11 points
9
in reply to: Anon User’s comment on: Human values & biases are inaccessible to the genome
I think what TurnTrout wants to say is that things like sexual preferences are actually learned generalizations from very basic hardcoded reward signals that latch onto things like the pheromones of the opposite sex. But I don’t think he’s got it all worked out yet.

Nora Belrose 28 Feb 2024 7:33 UTC
10 points
1
in reply to: Daniel Kokotajlo’s comment on: Counting arguments provide no evidence for AI doom
they almost certainly don’t have anything to do with what humans want, per se. (that would be basically magic)
We are obviously not appealing to literal telepathy or magic. Deep learning generalizes the way we want in part because we designed the architectures to be good, in part because human brains are built on similar principles to deep learning, and in part because we share a world with our deep learning models and are exposed to similar data.

Nora Belrose 4 Aug 2022 3:31 UTC
10 points
0
on: Externalized reasoning oversight: a research direction for language model alignment
Thanks for making this thoughtful post. I’m optimistic about this general approach and think it’s more promising than other approaches by a significant margin. I’m currently working on a post which will defend a more specific version of the externalized reasoning framework and will explore concrete defenses against steganography etc.

Nora Belrose 19 Jan 2024 2:41 UTC
9 points
1
in reply to: Wei Dai’s comment on: Against Relying on Evolution to Forecast AI Outcomes (Part 1)
It’s very difficult to get any agent to robustly pursue something like IGF because it’s an inherently sparse and beyond-lifetime goal. Human values have been pre-densified for us: they are precisely the kinds of things it’s easy to get an intelligence to pursue fairly robustly. We get dense, repeated, in-lifetime feedback about stuff like sex, food, love, revenge, and so on. A priori, if you’re an agent built by evolution, you should expect to have values that are easy to learn— it would be surprising if it turned out that evolution did things the hard way. So evolution suggests alignment should be easy.
