Lauro Langosco
Uncertainty about the future does not imply that AGI will go well
An Exercise to Build Intuitions on AGI Risk
I agree with what I read as the main direct claim of this post, which is that it is often worth avoiding making very confident-sounding claims, because it makes it likely for people to misinterpret you or derail the conversation towards meta-level discussions about justified confidence.
However, I disagree with the implicit claim that people who confidently predict AI X-risk necessarily have low model uncertainty. For example, I find it hard to predict when and how AGI will be developed, and I expect that many of my ideas and predictions about that will be mistaken. This makes me more pessimistic, rather than less, since it seems pretty hard to get AI alignment right if we can’t even predict basic things like “when will this system have situational awareness”, etc.
Some reasons why a predictor wants to be a consequentialist
(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)
The basics
It doesn’t prevent you from shutting it down
It doesn’t prevent you from modifying it
It doesn’t deceive or manipulate you
It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
If it self-modifies or constructs other agents, it will preserve corrigibility. Preferably it does not self-modify or construct other intelligent agents at all
Myopia
Its objective is no more broad or long-term than is required to complete the task
In particular, it only cares about results within a short timeframe (chosen to be as short as possible while still enabling it to perform the task)
It does not cooperate (in the sense of helping achieve their objective) with future, past, or (duplicate) concurrent versions of itself, unless intended by the operator
Non-maximizing
It doesn’t maximize the probability of getting the task done; it just does something that gets the task done with (say) >99% probability (see the toy sketch at the end of this list)
It doesn’t “optimize too hard” (not sure how to state this better)
Example: when communicating with humans (e.g. to query them about their instructions), it does not maximize communication bandwidth / information transfer; it just communicates reasonably well
Its objective / task does not consist in maximizing any quantity; rather, it follows a specific bounded instruction (like “make me a coffee”, or “tell me a likely outcome of this plan”) and then shuts down
It doesn’t optimize over causal pathways you don’t want it to: for example, if it is meant to predict the consequences of a plan, it does not try to make its prediction more likely to happen
It does not try to become more consequentialist with respect to its goals
for example, if in the middle of deployment the system reads a probability theory textbook, learns about Dutch book theorems, and decides that EV maximization is the best way to achieve its goals, it will not change its behavior
No weird stuff
It doesn’t try to acausally cooperate or trade with far-away possible AIs
It doesn’t come to believe that it is being simulated by multiverse-aliens trying to manipulate the universal prior (or whatever)
It doesn’t attempt to simulate a misaligned intelligence
In fact it doesn’t simulate any other intelligences at all, except to the minimal degree of fidelity that is required to perform the task
Human imitation
Where possible, it should imitate a human that is trying to be corrigible
To the extent that this is possible while completing the task, it should try to act like a helpful human would (but not unboundedly minimizing the distance in behavior-space)
When this is not possible (e.g. because it is executing strategies that a human could not), it should stay near to human-extrapolated behaviour (“what would a corrigible, unusually smart / competent / knowledgeable human do?”)
To the extent that meta-cognition is necessary, it should think about itself and corrigibility in the same way its operators do: its objectives are likely misspecified, therefore it should not become too consequentialist, or “optimize too hard”, and [other corrigibility desiderata]
Querying / robustness
Insofar as this is feasible it presents its plans to humans for approval, including estimates of the consequences of its plans
It will raise an exception (i.e. pause execution of its plans and notify its operators) if:
its instructions are unclear
it recognizes a flaw in its design
it sees a way in which corrigibility could be strengthened
in the course of performing its task, the ability of its operators to shut it down or modify it would be limited
in the course of performing its task, its operators would predictably be deceived / misled about the state of the world
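To make the “non-maximizing” and “raise an exception” items above slightly more concrete, here is a toy sketch. Everything in it is hypothetical (the plan objects, the success-probability estimate, the 99% threshold); it is a picture of the intended behaviour, not a proposal for how to implement it:

```python
import random

class CorrigibilityException(Exception):
    """Signal that the system should pause its plans and notify its operators."""

def choose_plan(plans, success_prob, threshold=0.99):
    """Satisfice instead of maximize: take any plan that is good enough,
    rather than hunting for the extreme optimum of success probability."""
    good_enough = [p for p in plans if success_prob(p) >= threshold]
    if not good_enough:
        # Rather than searching harder or more creatively, hand control back.
        raise CorrigibilityException("no acceptable plan found; asking the operators")
    return random.choice(good_enough)
```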
You make a claim that’s very close to that—your claim, if I understand correctly, is that MIRI thought we wouldn’t get an AI that both understands human values and doesn’t lie to us about them (or otherwise decide to give misleading or unhelpful outputs):
The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of any outcome (which you could then, hypothetically, hook up to a generic function maximizer to get a benevolent AI). If you get an AI that merely understands human values, you can’t necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
I think this is similar enough (and false for the same reasons) that I don’t think the responses are misrepresenting you that badly. Of course I might also be misunderstanding you, but I did read the relevant parts multiple times to make sure, so I don’t think it makes sense to blame your readers for the misunderstanding.
My paraphrase of your (Matthew’s) position: while I’m not claiming that GPT-4 provides any evidence about inner alignment (i.e. getting an AI to actually care about human values), I claim that it does provide evidence that outer alignment is easier than we thought: we can specify human values via language models, which have a pretty robust understanding of human values and don’t systematically deceive us about their judgement. This means people who used to think outer alignment / value specification was hard should change their minds.
(End paraphrase)
I think this claim is mistaken, or at least it rests on false assumptions about what alignment researchers believe. Here’s a bunch of different angles on why I think this:
-
My guess is that a big part of the disagreement here is that you make some wrong assumptions about what alignment researchers believe.
-
I think you’re putting a bit too much weight on the inner vs outer alignment distinction. The central problem people talked about was always how to get an AI to care about human values. E.g. in The Hidden Complexity of Wishes (THCW), Eliezer writes:
To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish.
If you find something that looks to you like a solution to outer alignment / value specification, but it doesn’t help make an AI care about human values, then you’re probably mistaken about what actual problem the term ‘value specification’ is pointing at. (Or maybe you’re claiming that value specification is just not relevant to AI safety—but I don’t think you are?).
-
It was always possible to attempt to solve the value specification problem by just pointing at a human. The fact that we can now also point at an LLM and get a result that’s not all that much worse than pointing at a human is not cause for an update about how hard value specification is. Part of the difficulty is how to define the pointer to the human and get a model to maximize human values rather than maximize some error in your specification. IMO THCW makes this point pretty well.
-
It’s tricky to communicate problems in AI alignment: people come in with lots of different assumptions about what kinds of things are easy / hard, and it’s hard to resolve disagreements because we don’t have a live AGI to do experiments on. I think THCW and the related essays you criticize are actually great resources. They don’t try to communicate the entire problem at once because that’s infeasible. The fact that human values are complex and hard to specify explicitly is part of the reason why alignment is hard, where alignment means “get the AI to care about human values”, not “get an AI to answer questions about moral behavior reasonably”.
-
You claim the existence of GPT-4 is evidence against the claims in THCW. But IMO GPT-4 fits in neatly with THCW. The post even starts with a taxonomy of genies:
There are three kinds of genies: Genies to whom you can safely say “I wish for you to do what I should wish for”; genies for which no wish is safe; and genies that aren’t very powerful or intelligent.
GPT-4 is an example of a genie that is not very powerful or intelligent.
-
If in 5 years we build firefighter LLMs that can rescue mothers from burning buildings when you ask them to, that would also not show that we’ve solved value specification—it’s just a didactic example, not a full description of the actual technical problem. More broadly, I think it’s plausible that within a few years LLMs will be able to give moral counsel far better than the average human. That still doesn’t solve value specification, any more than the existence of humans who could give good moral counsel 20 years ago solved value specification.
-
If you could come up with a simple action-value function Q(observation, action) that, when maximized over actions, yields a good outcome for humans, then I think that would probably be helpful for alignment. This is an example of a result that doesn’t directly make an AI care about human values, but would probably lead to progress in that direction. I think if it turned out to be easy to formalize such a Q then I would change my mind about how hard value specification is.
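As a purely illustrative aside (the Q, observation, and action types below are hypothetical placeholders, not anything that exists): “hooking Q up to a generic function maximizer” just means wrapping it in an argmax, which is why the difficulty lives in specifying Q rather than in the maximizer.

```python
from typing import Callable, Iterable, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")

def greedy_policy(Q: Callable[[Obs, Act], float], actions: Iterable[Act]) -> Callable[[Obs], Act]:
    """Turn an action-value function into a policy by maximizing over actions."""
    action_list = list(actions)

    def policy(observation: Obs) -> Act:
        # The "generic function maximizer": pick whatever action Q scores highest.
        return max(action_list, key=lambda a: Q(observation, a))

    return policy

# The hard part is not this argmax; it is producing a Q for which running the
# argmax in the real world is actually good for humans, i.e. value specification.
```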
-
While language models understand human values to some extent, they aren’t robust. The RLHF/RLAIF family of methods is based on using an LLM as a reward model, and to make things work you need to be careful not to optimize too hard or you’ll just get gibberish (Gao et al. 2022). LLMs don’t hold up against mundane RLHF optimization pressure, never mind an actual superintelligence. (Of course, humans wouldn’t hold up either.)
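To illustrate the overoptimization point, here is a toy picture (mine, not from Gao et al.’s code; the coefficients are made up, though the curve shape for the RL case roughly follows the functional form they report): the proxy reward the optimizer sees keeps climbing as the policy moves away from its initialization, while the gold reward peaks and then falls.

```python
import math

# Illustrative coefficients only -- in the paper these depend on reward-model size.
ALPHA, BETA = 1.0, 0.5

def gold_reward(kl):
    """True reward as a function of KL divergence from the initial policy."""
    d = math.sqrt(kl)
    return d * (ALPHA - BETA * math.log(d))

def proxy_reward(kl):
    """Learned-reward-model score: this is what the optimizer actually sees."""
    return 1.5 * math.sqrt(kl)

for kl in [0.5, 2, 8, 32, 128]:
    print(f"KL={kl:6.1f}   proxy={proxy_reward(kl):5.2f}   gold={gold_reward(kl):5.2f}")
# Past some amount of optimization pressure, the gold reward starts dropping even
# though the proxy reward is still going up: optimizing harder against the learned
# reward model stops helping and starts hurting.
```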
-
I think it’s false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it’s false is mostly that I haven’t seen a claim like that made anywhere, including in the posts you cite.
I agree lots of the responses elide the part where you emphasize that it’s important that GPT-4 doesn’t just understand human values, but is also “willing” to answer questions somewhat honestly. TBH I don’t understand why that’s an important part of the picture for you, and I can see why some responses would just see the “GPT-4 understands human values” part as the important bit (I made that mistake too on my first reading, before I went back and re-read).
It seems to me that trying to explain the original motivations for posts like Hidden Complexity of Wishes is a good attempt at resolving this discussion, and it looks to me as if the responses from MIRI are trying to do that, which is part of why I wanted to disagree with the claim that the responses are missing the point / not engaging productively.
Whether or not this is the safest path, the fact that OpenAI thinks it’s true and is one of the leading AI labs makes it a path we’re likely to take. Humanity successfully navigating the transition to extremely powerful AI might therefore require successfully navigating a scenario with short timelines and slow, continuous takeoff.
You can’t just choose “slow takeoff”. Takeoff speeds are mostly a function of the technology, not company choices. If we could just choose to have a slow takeoff, everything would be much easier! Unfortunately, OpenAI can’t just make their preferred timelines & “takeoff” happen. (Though I agree they have some influence, mostly in that they can somewhat accelerate timelines).
evolution does not grow minds, it grows hyperparameters for minds.
Imo this is a nitpick that isn’t really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn’t necessarily lead to a thing that wants (‘optimizes for’) X; and more broadly it’s a good example for how the results of an optimization process can be unexpected.
I want to distinguish two possible takes here:
The argument from direct implication: “Humans are misaligned wrt evolution, therefore AIs will be misaligned wrt their objectives”
Evolution as an intuition pump: “Thinking about evolution can be helpful for thinking about AI. In particular it can help you notice ways in which AI training is likely to produce AIs with goals you didn’t want”
It sounds like you’re arguing against (1). Fair enough; I too think (1) isn’t a great take in isolation. If the evolution analogy does not help you think more clearly about AI at all, then I don’t think you should change your mind much on the strength of the analogy alone. But my best guess is that most people, including Nate, mean (2).
Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:
Ability to be deceptively aligned
Ability to think / reflect about its goals enough that the model realises it does not like what it is being RLHF’d for
Incentives to break containment exist in a way that is accessible / understandable to the model
Ability to break containment
Ability to robustly understand human intent
Situational awareness
Coherence / robustly pursuing its goal in a diverse set of circumstances
Interpretability methods break (or other oversight methods break)
doesn’t have to be because of deceptiveness; maybe thoughts are just too complicated at some point, or in a different place than you’d expect
Capable enough to help us exit the acute risk period
Many alignment proposals rely on reaching these thresholds in a specific order. For example, the earlier we reach (9) relative to other thresholds, the easier most alignment proposals are.
Some of these thresholds are relevant to whether an AI or proto-AGI is alignable even in principle. Short of ‘full alignment’ (CEV-style), any alignment method (eg corrigibility) only works within a specific range of capabilities:
Too much capability breaks alignment, e.g. because a model self-reflects and sees all the ways in which its objectives conflict with human goals.
Too little capability (or too little ‘coherence’) and any alignment method will be non-robust wrt OOD inputs or even small improvements in capability or self-reflectiveness.
That’s a challenge, and while you (hopefully) chew on it, I’ll tell an implausibly-detailed story to exemplify a deeper obstacle.
Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:
There are kinds of deception (or rather kinds of deceptive capabilities / thoughts) that only show up after a certain capability level, and training before that level just won’t affect them because they’re not there yet.
General capabilities imply the ability to be deceptive if useful in a particular circumstance. So you can’t just train away the capability to be deceptive (or maybe you can, but not in a way that is robust wrt general capability gains).
Really you want to train against the propensity to be deceptive, rather than the capability. But propensities also change with capability level; becoming more capable is all about having more ways to achieve your goals. So eliminating propensity to be deceptive at a lower capability level does not eliminate the propensity at a higher capability level.
The robust way to get rid of propensity to be deceptive is to reach an attractor where more capability == less deception (within the capability range we care about), because the AI’s terminal goals on some level include ‘being nondeceptive’.
Before we can align the AI’s goals to human intent in this way, the AI needs to have a good understanding of human intent, good situational awareness, and be a (more or less) unified / coherent agent. If it’s not, then its goals / propensities will shift as it becomes more capable (or more situationally aware, or more coherent, etc.)
This is a pretty harsh set of prerequisites, and is probably outside of the range of circumstances where people usually hope their method to avoid deception will work.
Even if methods to detect deception (narrowly conceived) work, we cannot distinguish an agent that is actually nondeceptive / aligned from an agent that e.g. just aims to play the training game (and will do something unspecified once it reaches a capability threshold that allows it to breach containment).
A specific (maybe too specific) problem that can still happen in this scenario: you might get an AI that is overall capable, but just learns to not think long enough about scenarios that would lead it to try to be deceptive. This can still happen at the maximum capability levels at which we might hope to still contain an AGI that we are trying to align (ie somewhere around human level, optimistically).
I would be very curious to see your / OpenAI’s responses to Eliezer’s Dimensions of Operational Adequacy in AGI Projects post. Which points do you / OpenAI leadership disagree with? Insofar as you agree but haven’t implemented the recommendations, what’s stopping you?
There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
IMO making the field of alignment 10x larger or requiring evals does not solve a big part of the problem, while indefinitely pausing AI development would. I agree it’s much harder, but I think it’s good to at least try, as long as it doesn’t terribly hurt less ambitious efforts (which I think it doesn’t).
This seems wrong. Here’s an incomplete list of reasons why:
If the 3 leading labs join the moratorium and AGI is stealthily developed by the 4th, then the arrival of AGI will in fact have been slowed by the lead time of the first 3 labs + the slowdown that the 4th incurs by working in secret.
The point of this particular call for a 6-month moratorium is not to particularly slow down anyone (and as has been pointed out by others, it is possible that OpenAI wasn’t even planning to start training GPT-5 in the next few months). Rather, the point is to form a coalition to support future policies, e.g. a government-supported moratorium.
It is actually fairly hard to build compute clusters in secret, because you can just track what comes out of the chip fabs and where it goes.
While not straightforward, it’s also feasible to monitor existing clusters, see e.g. https://arxiv.org/abs/2303.11341
Alignment researchers, how useful is extra compute for you?
Broadly agree with the takes here.
However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.
This seems right and I don’t think we say anything contradicting it in the paper.
I also don’t see how saying ‘different patterns are learned at different speeds’ is supposed to have any explanatory power. It doesn’t explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing patterns across domains. It feels like saying ‘bricks fall because it’s in a brick’s nature to move towards the ground’: both are repackaging an observation as an explanation.
The idea is that the framing ‘learning at different speeds’ lets you frame grokking and double descent as the same thing. More like generalizing ‘bricks move towards the ground’ and ‘rocks move towards the ground’ to ‘objects move towards the ground’. I don’t think we make any grand claims about explaining everything in the paper, but I’ll have a look and see if there are edits I should make—thanks for raising these points.
(Crossposting some of my twitter comments).
I liked this criticism of alignment approaches: it makes a concrete claim that addresses the crux of the matter, and provides supporting evidence! I also disagree with it, and will say some things about why.
-
I think that instead of thinking in terms of “coherence” vs. “hot mess”, it is more fruitful to think about “how much influence is this system exerting on its environment?”. Too much influence will kill humans, if directed at an outcome we’re not able to choose. (The rest of my comments are all variations on this basic theme).
-
We humans may be a hot mess, but we’re far better at influencing (optimizing) our environment than any other animal or ML system. Example: we build helicopters and roads, which are very unlikely to arise by accident in a world without people trying to build helicopters or roads. If a system is good enough at achieving outcomes, it is dangerous whether or not it is a “hot mess”.
-
It’s much easier for us to describe simple behaviors as utility maximization; for example a ball rolling down a hill is well-described as minimizing its potential energy. So it’s natural that people will rate a dumb / simple system as being more easily described by a utility function than a smart system with complex behaviors. This does not make the smart system any less dangerous.
-
Misalignment risk is not about expecting a system to “inflexibly” or “monomaniacally” pursue a simple objective. It’s about expecting systems to pursue objectives at all. The objectives don’t need to be simple or easy to understand.
-
Intelligence isn’t the right measure to have on the X-axis—it evokes a math professor in an ivory tower, removed from the goings-on in the real world. A better word might be capability: “how good is this entity at going out into the world and getting more of what it wants?”
-
In practice, AI labs are working on improving capability, rather than intelligence defined abstractly in a way that does not connect to capability. And capability is about achieving objectives.
-
If we build something more capable than humans in a certain domain, we should expect it to be “coherent” in the sense that it will not make any mistakes that a smart human wouldn’t have made. Caveat: it might make more of a particular kind of mistake, and make up for it by being better at other things. This happens with current systems, and IMO plausibly we’ll see something similar even in the kind of system I’d call AGI. But at some point the capabilities of AI systems will be general enough that they will stop making mistakes that are exploitable by humans. This includes mistakes like “fail to notice that your programmer could shut you down, and that would stop you from achieving any of your objectives”.
-
But in my report I arrive at a forecast by fixing a model size based on estimates of brain computation, and then using scaling laws to estimate how much data is required to train a model of that size. The update from Chinchilla is then that we need more data than I might have thought.
I’m confused by this argument. The old GPT-3 scaling law is still correct, just not compute-optimal. If someone wanted to, they could still go on using the old scaling law. So discovering better scaling can only lead to an update towards shorter timelines, right?
(Except if you had expected even better scaling laws by now, but it didn’t sound like that was your argument?)
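To spell out the “can only shorten timelines” logic with a rough worked example (mine, not from the original exchange; it uses the parametric loss fit reported in the Chinchilla paper, and the numbers should be treated as illustrative): at a fixed compute budget, the Chinchilla-optimal allocation achieves at least as low a predicted loss as the old GPT-3-style allocation, and the old allocation is still available to anyone who wants it, so better scaling knowledge is a Pareto improvement.

```python
# Parametric loss fit from Hoffmann et al. 2022 (treat the constants as rough).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(N, D):
    """Predicted training loss for N parameters trained on D tokens."""
    return E + A / N**ALPHA + B / D**BETA

C = 3.14e23  # roughly GPT-3-scale training compute in FLOPs, using C ~= 6*N*D

# Old (GPT-3-style) allocation: large model, comparatively few tokens.
N_old = 175e9
D_old = C / (6 * N_old)          # ~300B tokens

# Chinchilla-style allocation: smaller model, ~20 tokens per parameter.
N_new = (C / (6 * 20)) ** 0.5    # ~50B parameters
D_new = 20 * N_new               # ~1T tokens

print(f"old recipe: N={N_old:.2e}, D={D_old:.2e}, predicted loss={loss(N_old, D_old):.3f}")
print(f"new recipe: N={N_new:.2e}, D={D_new:.2e}, predicted loss={loss(N_new, D_new):.3f}")
# Same compute, lower predicted loss under the new recipe. Since the old recipe is
# still usable, learning about better scaling can't make a given capability level
# harder to reach -- it can only make it cheaper, hence (weakly) shorter timelines.
```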
People at OpenAI regularly say things like
Our current path [to solve alignment] is very promising (https://twitter.com/janleike/status/1562501343578689536)
[...] even without fundamentally new alignment ideas, we can likely build sufficiently aligned AI systems to substantially advance alignment research itself (https://openai.com/blog/our-approach-to-alignment-research/ )
And you say:
OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities
AFAICT, no-one from OpenAI has publicly explained why they believe that RLHF + amplification is supposed to be enough to safely train systems that can solve alignment for us. The blog post linked above says “we believe” four times, but does not take the time to explain why anyone believes these things.
Writing up this kind of reasoning is time-intensive, but I think it would be worth it: if you’re right, then the value of information for the rest of the community is huge; if you’re wrong, it’s an opportunity to change your minds.