Rohin seems to think the point is “Simply knowing that an agent is intelligent lets us infer that it is goal-directed” but Eliezer doesn’t seem to think that corrigible (hence not goal-directed) agents are impossible to build. (That’s actually one of MIRI’s research objectives even though they take a different approach from Paul’s.)
I think the point (from Eliezer’s perspective) is “Simply knowing that an agent is intelligent lets us infer that it is an expected utility maximizer”. The main implication is that there is no way to affect the details of a superintelligent AI except by affecting its utility function, since everything else is fixed by math (specifically the VNM theorem). Note that this is (or rather, appears to be) a very strong condition on what alignment approaches could possibly work—you can throw out any approach that isn’t going to affect the AI’s utility function. I think this is the primary reason for Eliezer making this argument. Let’s call this the “intelligence implies EU maximization” claim.
Separately, there is another claim that says “EU maximization by default implies goal-directedness” (or the presence of convergent instrumental subgoals, if you prefer that instead of goal-directedness). However, this is not required by math, so it is possible to avoid this implication, by designing your utility function in just the right way.
Corrigibility is possible under this framework by working against the second claim, i.e. designing the utility function in just the right way that you get corrigible behavior out. And in fact this is the approach to corrigibility that MIRI looked into.
I am primarily taking issue with the “intelligence implies EU maximization” argument. The problem is, “intelligence implies EU maximization” is true, it just happens to be vacuous. So I can’t say that that’s what I’m arguing against. This is why I rounded it off to arguing against “intelligence implies goal-directedness”, though this is clearly a bad enough summary that I shouldn’t be saying that any more.
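To spell out the sense in which I mean “vacuous” (a rough sketch of the standard construction, glossing over technicalities):

```latex
% Rough version of the construction, ignoring stochasticity details:
% given any policy \pi, define a utility function over full histories h by
U(h) =
  \begin{cases}
    1 & \text{if every action in } h \text{ is the one } \pi \text{ would take given the history so far,} \\
    0 & \text{otherwise.}
  \end{cases}
% Then \pi achieves the maximum possible expected utility of 1, so "this agent
% is an EU maximizer" on its own rules out no behavior whatsoever.
```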
Strong agree. One of my subgoals during teaching is often to confuse students. See also this video, which basically captures the reason why.
That was the summary :P The full thing was quite a bit longer. I also didn’t want to misquote Eric.
Maybe the shorter summary is: there are two axes which we can talk about. First, will systems be transparent, modular and structured (call this CAIS-like), or will they be opaque and well-integrated? Second, assuming that they are opaque and well-integrated, will they have the classic long-term goal-directed AGI-agent risks or not?
Eric and I disagree on the first one: my position is that for any particular task, while CAIS-like systems will be developed first, they will gradually be replaced by well-integrated ones, once we have enough compute, data, and model capacity.
I’m not sure how much Eric and I disagree on the second one: I think it’s reasonable to predict that the resulting systems are specialized for particular bounded tasks and so won’t be running broad searches for long-term plans. I would still worry about inner optimizers; I don’t know what Eric thinks about that worry.
This summary is more focused on my beliefs than Eric’s, and is probably not a good summary of the intent behind the original comment, which was “what does Eric think Rohin got wrong in his summary + opinion of CAIS”, along with some commentary from me trying to clarify my beliefs.
Updates were mainly about actually carving up the space in the way above. Probably others, but I often find it hard to introspect on how my beliefs are updating.
Eric and I have exchanged a few emails since I posted this summary; I’m posting some of it here (with his permission), edited by me for conciseness and clarity. The paragraphs in the quotes are Eric’s, but I have rearranged and omitted some of them for better flow in this comment.
There is a widespread intuition that AGI agents would by nature be more integrated, flexible, or efficient than comparable AI services. I am persuaded that this is wrong, and stems from an illusion of simplicity that results from hiding mechanism in a conceptually opaque box, a point that is argued at some length in Section 13.
Overall, I think that many of us have been in the habit of seeing flexible optimization itself as a problem, when optimization is instead (in the typical case) a strong constraint on a system’s behavior (see Section 8). Flexibility of computation in pursuit of optimization for bounded tasks seems simply useful, regardless of planning horizon, scope of considerations, or scope of required knowledge.
I agree that AGI agents hide mechanism in an opaque box. I also agree that the sort of optimization that current ML does, which is very task-focused, is a strong constraint on behavior. There seems to be a different sort of optimization that humans are capable of, where we can enter a new domain and perform well in it very quickly; I don’t have a good understanding of that sort of optimization, and I think that’s what the classic AGI agent risks are about.
Relatedly, I’ve used the words “monolithic AGI agent” a bunch in the summary and the post. Now, I want to instead talk about whether AI systems will be opaque and well-integrated, since that’s the main crux of our disagreement. It’s plausible to me that even if they are opaque and well-integrated, you don’t get the classic AGI agent risks, because you don’t get the sort of optimization I was talking about above.
In this connection, you cite the power of end-to-end training, but Section 17.4 (“General capabilities comprise many tasks and end-to-end relationships”) argues that, because diverse tasks encompass many end-to-end relationships, the idea that a broad set of tasks can be trained “end to end” is mistaken, a result of the narrowness of current trained systems in which services form chains rather than networks that are more wide than deep. We should instead expect that broad capabilities will best be implemented by sets of systems (or sets of end-to-end chains of systems) that comprise well-focused competencies: Systems that draw on distinct subtask competencies will typically be easier to train and provide more robust and general performance (Section 17.5). Modularity typically improves flexibility and generality, rather than impeding it.
Note that the ability to employ subtask components in multiple contexts constitutes a form of transfer learning, and [...] this transfer learning can carry with it task-specific aspects of behavioral alignment.
This seems like the main crux of the disagreement. My claim is that for any particular task, given enough compute, data and model size, an opaque, well-integrated, unstructured AI system will outperform a transparent, modular collection of services. This is only on the axis of performance at the task: I agree that the structured system will generalize better out of distribution (which leads to robustness, flexibility, and better transfer learning). I’m basing this primarily off of empirical evidence and intuitions:
For many tasks so far (computer vision, NLP, robotics), transitioning from a modular architecture to end-to-end deep learning led to large boosts in performance.
My impression is that many interdisciplinary academics are able to transfer ideas and intuitions from one of their fields to the other, allowing them to make big contributions that more experienced researchers could not. This suggests that patterns of problem-solving from one field can transfer to another in a non-trivial way, which you could best achieve with well-integrated systems.
Psychology research can be thought of as an attempt to systematize/modularize our knowledge about humans. Despite a huge amount of work in psychology, our internal, implicit, well-integrated models of humans are way better than our explicit theories.
Humans definitely solve large tasks in a very structured way; I hypothesize that this is because for those tasks the limits of human compute/data/brain size prevent us from getting the benefits of an unstructured heuristic approach.
Speaking of integration:
Regarding integration, I’ve argued that classic AGI-agent models neither simplify nor explain general AI capabilities (Section 13.3), including the integration of competencies. Whatever integration of functionality one expects to find inside an opaque AGI agent must be based on mechanisms that presumably apply equally well to integrating relatively transparent systems of services. These mechanisms can be dynamic, rather than static, and can include communication via opaque vector embeddings, jointly fine-tuning systems that perform often-repeated tasks, and matching of tasks to services (including service-development services) in semantically meaningful “task spaces” (discussed in Section 39, “Tiling task-space with AI services can provide general AI capabilities”).
[...]
Direct lateral links between competencies such as organic synthesis, celestial mechanics, ancient Greek, particle physics, image interpretation, algorithm design, traffic planning (etc.) are likely to be sparse, particularly when services perform object-level tasks. This sparseness is, I think, inherent in natural task-structures, quite independent of human cognitive limitations.
(The paragraphs above were written in a response to me while I was still using the phrase “AGI agents”)
I expect that the more you integrate the systems of services, the more opaque they will become. The resulting system will be less interpretable; it will be harder to reason about what information particular services do not have access to (Section 9.4); and it is harder to tell when malicious behavior is happening. The safety affordances identified in CAIS no longer apply because there is not enough modularity between services.
Re: sparseness inherent in task-structures, I think this is a result of human cognitive limitations but don’t know how to argue more for that perspective.
What would you say is the primary source of the problem?
The fact that humans don’t generalize well out of distribution, especially on moral questions; and the fact that progress can cause distribution shifts that cause us to fail to achieve our “true values”.
What do you think the implications of this are?
Um, nothing in particular.
I’m not sure why you ask this though.
It’s very hard to understand what people actually mean when they say things, and a good way to check is to formulate an implication of (your model of) their model that they haven’t said explicitly, and then see whether you were correct about that implication.
Other strategies I want to put in this cluster include formal verification, informed oversight and factorization.
Why informed oversight? It doesn’t feel like a natural fit to me. Perhaps you think any oversight fits in this category, as opposed to the specific problem pointed to by informed oversight? Or perhaps there was no better place to put it?
Corrigibility is largely about making systems that are superintelligent without being themselves fully agentic.
This seems very different from the notion of corrigibility that is “a system that is trying to help its operator”. Do you think that these are two different notions, or are they different ways of pointing at the same thing?
A lot of this doesn’t seem specific to AI. Would you agree that AI accelerates the problem and makes it more urgent, but isn’t the primary source of the problem you’ve identified?
How would you feel about our chances for a good future if AI didn’t exist (but we still go forward with technological development, presumably reaching space exploration eventually)? Are human safety problems an issue then? Some of the problems, like intentional value manipulation, do seem to become significantly easier.
FYI I’ve had this experience as well, though it’s not particularly strong or common.
I see a few criticisms about how this doesn’t really solve the problem, it only delays it because we expect a unified agent to outperform the combined services.
Not sure if you’re talking about me, but I suspect that my criticism could be read that way. Just want to clarify that I do think “we expect a unified agent to outperform the combined services” but I don’t think this means we shouldn’t pursue CAIS. That strategic question seems hard and I don’t have a strong opinion on it.
(But maybe these questions aren’t very important if the main point here isn’t offering RLSP as a concrete technique for people to use but more that “state of the world tells us a lot about what humans care about”.)
Yeah, I think that’s basically my position.
But to try to give an answer anyway, I suspect that the benefits of having a lot of data via large-scale IRL will make it significantly outperform RLSP, even if you could get a longer time horizon on RLSP. There might be weird effects where the RLSP reward is less Goodhart-able (since it tends to prioritize keeping the state the same) that make the RLSP reward better to maximize, even though it captures fewer aspects of “what humans care about”. On the other hand, RLSP is much more fragile; slight errors in dynamics / features / action space will lead to big errors in the inferred reward; I would guess this is less true of large-scale IRL, so in practice I’d guess that large-scale IRL would still be better. But both would be bad.
I’m confused that this idea is framed as an alternative to impact measures, because I thought the main point of impact measures is “prevent catastrophe” and this doesn’t aim to do that.
I didn’t mean to frame it as an alternative to impact measures, but it is achieving some of the things that impact measures achieve. Partly I wrote this post to explicitly say that I don’t imagine RLSP being a drop-in replacement for impact measures, even though it might seem like that could be true. I guess I didn’t communicate that effectively.
In the AI that RLSP might be a component of, what is doing the “prevent catastrophe” part?
That depends more on the AI part than on RLSP. I think the actual contribution here is the observation that the state of the world tells us a lot about what humans care about, and the RLSP algorithm is meant to demonstrate that it is in principle possible to extract those preferences.
If I were forced to give an answer to this question, it would be that RLSP would form a part of a norm-following AI, and that because the AI was following norms it wouldn’t do anything too crazy. However, RLSP doesn’t solve any of the theoretical problems with norm-following AI.
But the real answer is that this is an observation that seems important, but I don’t have a story for how it leads to us solving AI safety.
Can you also compare the pros and cons of this idea with other related ideas, for example large-scale IRL? (I’m imagining attaching recording devices to lots of people and recording their behavior over say months or years and feeding that to IRL.)
Any scenario I construct with RLSP has clear problems, and similarly large-scale IRL also has clear problems. If you provide particular scenarios I could analyze those.
For example, if you literally think just of running RLSP with a time horizon of a year vs. large-scale IRL over a year and optimizing the resulting utility function, large-scale IRL should do better because it has way more data to work with.
It seems like there’s gotta be a principled way to combine this idea with inverse reward design. Is that something you’ve thought about?
Yeah, I agree they feel very composable. The main issue is that the observation model in IRD requires a notion of a “training environment” that’s separate from the real world, whereas RLSP assumes that there is one complex environment in which you are acting.
Certainly if you first trained your AI system in some training environments and then deployed it in the real world, you could use IRD during training to get a distribution over reward functions, and then use that distribution as your prior when running RLSP. It’s maybe plausible that if you did this you could simply optimize the resulting reward function, rather than doing risk-averse planning (which is how IRD gets the robot to avoid lava); that would be cool. It’s hard to test because none of the IRD environments satisfy the key assumption of RLSP (that humans have optimized the environment for their preferences).
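A minimal sketch of what that composition might look like; `ird_posterior` and `rlsp_update` are passed in as hypothetical stand-ins for the actual IRD and RLSP algorithms, which I’m not implementing here:

```python
# Hypothetical sketch of composing IRD and RLSP; `ird_posterior` and
# `rlsp_update` are stand-ins for the real algorithms, not implementations.

def ird_then_rlsp(proxy_reward, training_envs, deployment_state,
                  ird_posterior, rlsp_update):
    # Step 1: IRD treats the hand-specified proxy reward as evidence about the
    # true reward, yielding a distribution over reward functions.
    reward_distribution = ird_posterior(proxy_reward, training_envs)
    # Step 2: use that distribution as the prior for RLSP, which updates it
    # based on the observed state of the (already human-optimized) real world.
    return rlsp_update(prior=reward_distribution, observed_state=deployment_state)
```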
:)
Ah, you’re right, we don’t really agree, I misunderstood.
I think we basically agree on the actual object-level thing and I’m mostly disagreeing on the use of “tragedy of the commons” as a description of it. I don’t think this is important though, so I’d prefer to drop it.
Tbc, I agree with this:
If there is a cost to reducing Xrisk (which I think is a reasonable assumption), then there will be an incentive [...] to underinvest in reducing Xrisk. There’s still *some* incentive to prevent Xrisk, but to some people everyone dying is not much worse than just them dying.
Academic documents, as I interpret them, aim to be acceptable to the academic community or considered academic.
There are good non-signaling reasons for academic documents being the way that they are. Consider the following properties of academia:
A field is huge, such that it is very hard to learn all of it
The group of people working on the field is enormous, requiring decentralized coordination
Fields of inquiry take decades, meaning that there needs to be a way of onboarding new people
Consider how you might try to write explanatory posts for such a group that are shorter than books, and I suspect you’ll recover many of the properties of academic articles (perhaps modernized, e.g. links instead of citations).
Alignment Newsletter #45
Whether it is necessary to simulate the past to figure out the cost of deviating from the present state, I am not sure.
You seem to be proposing low-impact AI / impact regularization methods. As I mentioned in the post:
we are gaining significantly on the “do what we want” desideratum: the point of inferring preferences is that we do not also penalize positive impacts that we want to happen.
Almost everything we want to do is irreversible / impactful / entropy-increasing, and many things that we don’t care about are also irreversible / impactful / entropy-increasing. If you penalize irreversibility / impact / entropy, then you will prevent your AI system from executing strategies that would be perfectly fine and even desirable. My intuition is that typically this would prevent your AI system from doing anything interesting (e.g. replacing CEOs).
Simulating the past is one way that you can infer preferences from the state of the world; it’s probably not the best way and I’m not tied to that particular strategy. The important bit is that the state contains preference information and it is possible in theory to extract it.
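To illustrate the flavor of the idea (a toy sketch only, not the actual RLSP algorithm; `simulate_human` is a hypothetical stand-in for a noisily-rational model of the human):

```python
import numpy as np

# Toy sketch: weight candidate reward functions by how likely a noisily-rational
# human optimizing that reward would have been to leave the world in the state
# we actually observe.

def posterior_over_rewards(candidate_rewards, initial_states, observed_state,
                           simulate_human, n_samples=100):
    """simulate_human(reward, initial_state) is a hypothetical stand-in that
    rolls out a noisily-rational human policy and returns the final state.
    States are assumed to be directly comparable with ==."""
    likelihoods = []
    for reward in candidate_rewards:
        # Estimate P(observed_state | reward) as the fraction of simulated
        # pasts that end in the observed state.
        hits = 0
        for _ in range(n_samples):
            s0 = initial_states[np.random.randint(len(initial_states))]
            hits += (simulate_human(reward, s0) == observed_state)
        likelihoods.append(hits / n_samples)
    likelihoods = np.array(likelihoods) + 1e-12  # avoid division by zero
    return likelihoods / likelihoods.sum()       # uniform prior over candidates
```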
You can use RL for the distillation step.
Yeah, I know, my main uncertainty was with how exactly that cashes out into an algorithm (in particular, RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in).
The algorithm still needs REINFORCE and a value function baseline (since you need to e.g. output words one at a time), and “RL” seems like the normal way to talk about that algorithm/problem. You could instead call it “contextual bandits.”
I get the need for REINFORCE; I’m not sure I understand the value function baseline part.
Here’s a thing you might be saying that would explain the value function baseline: this problem is equivalent to a sparse-reward RL problem, where:
The states are the question + in-progress answer
The actions are “append the word w to the answer”
All actions produce zero reward except for the action that ends the answer, which produces reward equal to the overseer’s answer to “How good is answer <answer> to question <question>?”
And we can apply RL algorithms to this problem.
Is that equivalent to what you’re saying?
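For concreteness, here is a minimal sketch of that formulation; `overseer_score` is a hypothetical stand-in for asking the overseer “How good is answer <answer> to question <question>?”:

```python
# Minimal sketch of the sparse-reward MDP described above.

END_TOKEN = "<END>"

class AnswerMDP:
    def __init__(self, question, vocabulary, overseer_score):
        self.question = question
        self.vocabulary = list(vocabulary) + [END_TOKEN]  # actions: append a word, or end
        self.overseer_score = overseer_score              # reward only at the end

    def initial_state(self):
        # State = question plus the in-progress answer (initially empty).
        return (self.question, ())

    def step(self, state, word):
        question, answer_so_far = state
        if word == END_TOKEN:
            # Ending the answer is the only action with nonzero reward.
            answer = " ".join(answer_so_far)
            return None, self.overseer_score(question, answer), True
        # All other actions extend the answer and give zero reward.
        return (question, answer_so_far + (word,)), 0.0, False
```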
You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question) in which case it’s generic RL.
Just to make sure I’m understanding correctly, this is recursive reward modeling, right?
Does “imitation learning” refer to an autoregressive model here? I think of IRL+RL as a possible mechanism for imitation learning, and it’s normally the kind of algorithm I have in mind when talking about “imitation learning” (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)
Yeah, that was bad wording on my part. I was using “imitation learning” to refer both to the problem of imitating the behavior of an agent, as well as the particular mechanism of behavioral cloning, i.e. collecting a dataset of many question-answer pairs and performing gradient descent using e.g. cross-entropy loss.
I agree that IRL + RL is a possible mechanism for imitation learning, in the same way that behavioral cloning is a possible mechanism for imitation learning. (This is why I was pretty confident that my first option was not the right one.)
I’m seeing a one-hour old empty comment, I assume it got accidentally deleted somehow?
ETA: Nvm, I can see it on LessWrong, but not on the Alignment Forum.
I agree with Wei Dai that the schemes you’re describing do not sound like imitation learning. Both of the schemes you describe sound to me like RL-IA. The scheme that you call imitation-IA seems like a combined random search + gradients method of doing RL. There’s an exactly analogous RL algorithm for the normal RL setting: just take the algorithm you have, and replace all instances of M2(“How good is answer X to Y?”) with R(X, Y), where R is the reward function.
One way that you could do imitation-IA would be to compute the amplified model’s answers a bunch of times to get a dataset and train on that dataset.
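A toy sketch of that distillation step; `amplified_answer` and `train_step` are hypothetical stand-ins for querying the amplified model and for one supervised (behavioral cloning) update on the distilled model:

```python
import random

# Toy sketch of the imitation-IA distillation step described above.

def distill_by_imitation(questions, amplified_answer, train_step, n_epochs=1):
    # Collect a dataset of (question, answer) pairs from the amplified model.
    dataset = [(q, amplified_answer(q)) for q in questions]
    # Train the distilled model to imitate those answers with supervised learning.
    for _ in range(n_epochs):
        random.shuffle(dataset)
        for question, answer in dataset:
            train_step(question, answer)  # e.g. cross-entropy on the answer tokens
    return dataset
```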
I am also not sure exactly what it means to use RL in iterated amplification. There are two different possibilities I could imagine:
Using a combination of IRL + RL to achieve the same effect as imitation learning. The hope here would be that IRL + RL provides a better inductive bias for imitation learning, helping with sample efficiency.
Instead of asking the amplified model to compute the answer directly, we ask it to provide a measure of approval, e.g. by asking “How good is answer X to Y?”, or by asking “Which is a better answer to Y, X1 or X2?” and learning from that signal (see optimizing with comparisons), using some arbitrary RL algorithm.
I’m quite confident that RL+IA is not meant to be the first kind. But even with the second kind, one question does arise—typically with RL we’re trying to optimize the sum of rewards across time, whereas here we actually only want to optimize the one-step reward that you get immediately (which is the point of maximizing approval and having a stronger overseer). So then I don’t really see why you want RL, which typically is solving a hard credit assignment problem that doesn’t arise in the one-step setting.
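To illustrate the one-step point, here is a toy sketch in which the whole answer is treated as a single action, so optimizing approval is a contextual bandit update rather than multi-step RL with credit assignment; `sample_answer`, `log_prob_grad`, and `approval` are hypothetical stand-ins:

```python
# Toy sketch of the one-step framing: the whole answer is one "action", the
# overseer's approval is the only reward, and there is no sum over future
# rewards to assign credit for. `params` and the returned gradient are assumed
# to be numpy-array-like so the arithmetic below works.

def one_step_approval_update(params, question, sample_answer, log_prob_grad,
                             approval, baseline=0.0, learning_rate=1e-3):
    answer = sample_answer(params, question)          # one action = a full answer
    reward = approval(question, answer)               # overseer's one-step approval
    advantage = reward - baseline                     # baseline only reduces variance
    grad = log_prob_grad(params, question, answer)    # grad of log pi(answer | question)
    # No discounting and no credit assignment across timesteps: only this reward.
    return params + learning_rate * advantage * grad
```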
Planned newsletter opinion: This seems like a real problem, but I’m not sure how important it is. I am most optimistic that the last approach will “just work”, where we solve alignment and there are enough overseers who care about getting these questions right that we do solve these philosophical problems. However, I’m very uncertain about this since I haven’t thought about it enough (it seems like a question about humans rather than about AI). Regardless of importance, it does seem to have almost no one working on it and could benefit from more thought.