I’ve watched your talk at SERI now.
One question I have is how you hope to define a good notion of “acceptable” without a notion of intent. In your talk, you mention looking at why the model does what it does, in addition to just looking at what it does. This makes sense to me (I talk about similar things), but it seems just about as fraught as the notion of mesa-objective:
It requires approximately the same “magic transparency tech” as we need to extract mesa-objectives.
Even with magical transparency tech, it requires additional insight as to which reasoning is acceptable vs unacceptable.
If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason? More generally, what do you think “acceptability” might look like?
(By no means do I mean to say your view is crazy; I am just looking for your explanation.)
This is interesting to me for two reasons:
[Mainly] Several proposals for avoiding the instrumental policy work by penalizing computation. But I have a really shaky philosophical grip on why that’s a reasonable thing to do, and so all of those solutions end up feeling weird to me. I can still evaluate them based on what works on concrete examples, but things are slippery enough that plan A is getting a handle on why this is a good idea.
In the long run I expect to have to handle learned optimizers by having the outer optimizer instead directly learn whatever the inner optimizer would have learned. This is an interesting setting to look at how that works out. (For example, in this case the outer optimizer just needs to be able to represent the hypothesis “There is a program that has property P and runs in time T’ ” and then do its own search over that space of faster programs.)
The speed prior still delegates to better search algorithms though. For example, suppose that someone is able to fill in a 1000 bit program using only 2^500 steps of local search. Then the local search algorithm has speed prior complexity 500 bits, so will beat the object-level program. And the prior we’d end up using is basically “2x longer = 2 more bits” instead of “2x longer = 1 more bit,” i.e. we end up caring more about speed because we delegated.
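The arithmetic in this example can be sketched in a few lines. This is only an illustration under stated assumptions: `speed_prior_bits` is a hypothetical helper, the local-search wrapper is treated as having negligible description length, and the object-level program is given its best case (negligible runtime).

```python
def speed_prior_bits(description_bits, log2_runtime):
    """Speed-prior complexity in bits: description length plus log2(runtime)."""
    return description_bits + log2_runtime

# Assumed, illustrative values taken from the example in the text.
# Object-level program: 1000 bits, best case of negligible runtime.
object_level_bits = speed_prior_bits(description_bits=1000, log2_runtime=0)

# Local search: negligible description (an assumption), 2^500 search steps.
local_search_bits = speed_prior_bits(description_bits=0, log2_runtime=500)

# The search-based program is cheaper under the speed prior, so it wins.
print(object_level_bits, local_search_bits)
```

Under these assumptions the search costs 500 bits against at least 1000 for the object-level program, which is the sense in which the speed prior still delegates to a sufficiently good search algorithm.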
The actual limit on how much you care about speed is given by whatever search algorithms work best. I think it’s likely possible to “expose” what is going on to the outer optimizer (so that it finds a hypothesis like “This local search algorithm is good” and then uses it to find an object-level program, rather than directly finding a program that bundles both of them together). But I’d guess intuitively that it’s just not even meaningful to talk about the “simplest” programs or any prior that cares less about speed than the optimal search algorithm.
In traditional settings, we are searching for a program M that is simpler than the property P. For example, the number of parameters in our model should be smaller than the size of the dataset we are trying to fit if we want the model to generalize. (This isn’t true for modern DL because of subtleties with SGD optimizing imperfectly and implicit regularization and so on, but spiritually I think it’s still fine.)
But this breaks down if we start doing something like imposing consistency checks and hoping that those change the result of learning. Intuitively it’s also often not true for scientific explanations—even simple properties can be surprising and require explanation, and can be used to support theories that are much more complex than the observation itself.
It’s quite plausible that in these cases we want to be doing something other than searching over programs. This is pretty clear in the “scientific explanation” case, and maybe it’s the way to go for the kinds of alignment problems I’ve been thinking about recently.
A basic challenge with searching over programs is that we have to interpret the other data. For example, if “correspondence between two models of physics” is some kind of different object like a description in natural language, then some amplified human is going to have to be thinking about that correspondence to see if it explains the facts. If we search over correspondences, some of them will be “attacks” on the human that basically convince them to run a general computation in order to explain the data. So we have two options: (i) perfectly harden the evaluation process against such attacks, (ii) try to ensure that there is always some way to just directly do whatever the attacker convinced the human to do. But (i) seems quite hard, and (ii) basically requires us to put all of the generic programs in our search space.
It’s also quite plausible that we’ll just give up on things like consistency conditions. But those come up frequently enough in intuitive alignment schemes that I at least want to give them a fair shake.
The speed prior is calibrated such that this never happens if the learned optimizer is just using brute force—if it needs to search over 1 extra bit then it will take 2x longer, offsetting the gains.
That means that in the regime where P is simple, the speed prior is the “least you can reasonably care about speed”—if you care even less, you will just end up pushing the optimization into an inner process that is more concerned with speed and is therefore able to try a bunch of options.
(However, this is very mild, since the speed prior cares only a tiny bit about speed. Adding 100 bits to your program is the same as letting it run 2^100 times longer, so you are basically just optimizing for simplicity.)
To make this concrete, suppose that I instead used the kind-of-speed prior, where taking 4x longer is equivalent to using 1 extra bit of description complexity. And suppose that P is very simple relative to the complexities of the other objects involved. Suppose that the “object-level” program M has 1000 bits and runs in 2^2000 time, so has kind-of-speed complexity 2000 bits. A search that uses the speed prior will be able to find this algorithm in 2^3000 time, and so will have a kind-of-speed complexity of 1500 bits. So the kind-of-speed prior will just end up delegating to the speed prior.
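The numbers above can be checked with a small sketch. The helper `complexity_bits` and its `doublings_per_bit` parameter are hypothetical names introduced for illustration, and the search wrapper's own description length is assumed to be negligible.

```python
def complexity_bits(description_bits, log2_runtime, doublings_per_bit):
    """Description length plus a runtime penalty.

    doublings_per_bit=1 is the speed prior (2x longer = 1 bit);
    doublings_per_bit=2 is the kind-of-speed prior (4x longer = 1 bit).
    """
    return description_bits + log2_runtime / doublings_per_bit

# Object-level program M: 1000 bits, runs in 2^2000 time.
m_kind_of_speed = complexity_bits(1000, 2000, doublings_per_bit=2)

# Under the speed prior M costs 1000 + 2000 = 3000 bits, so a search that
# uses the speed prior finds it in roughly 2^3000 time.
m_speed = complexity_bits(1000, 2000, doublings_per_bit=1)

# That search, with negligible description length (an assumption), is then
# scored by the kind-of-speed prior.
search_kind_of_speed = complexity_bits(0, m_speed, doublings_per_bit=2)

print(m_kind_of_speed, search_kind_of_speed)
```

With these inputs the direct program costs 2000 bits and the delegating search costs 1500 bits, so the kind-of-speed prior prefers to delegate.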
Suppose I am interested in finding a program M whose input-output behavior has some property P that I can probabilistically check relatively quickly (e.g. I want to check whether M implements a sparse cut of some large implicit graph). I believe there is some simple and fast program M that does the trick. But even this relatively simple M is much more complex than the specification of the property P.
Now suppose I search for the simplest program running in time T that has property P. If T is sufficiently large, then I will end up getting the program “Search for the simplest program running in time T’ that has property P, then run that.” (Or something even simpler, but the point is that it will make no reference to the intended program M since encoding P is cheaper.)
I may be happy enough with this outcome, but there’s some intuitive sense in which something weird and undesirable has happened here (and I may get in a distinctive kind of trouble if P is an approximate evaluation). I think this is likely to be a useful maximally-simplified example to think about.
I agree about the general principle, even if I don’t think this particular thing is an example because of the “not maximizing sum of future rewards” thing.
I realize that unit-type-checking ML is pretty uncommon and might just be insane
Nah, it’s a great trick.
The two parameter distances seem like they’re in whatever distance metric you’re using for parameter space, which seems to be very different from the logprobs.
The trick here is that L2 regularization / weight decay is equivalent to having a Gaussian prior on the parameters, so you can think of that term as log N(θ0, σ) (minus an irrelevant additive constant), where σ is set to imply whatever hyperparameter you used for your weight decay.
This does mean that you are committing to a Gaussian prior over the parameters. If you wanted to include additional information like “moving towards zero is more likely to be good” then you would not have a Gaussian centered at θ0, and so the corresponding log prob would not be the nice simple “L2 distance to θ0”.
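A minimal numerical sketch of the equivalence described above, assuming the penalty is written as `weight_decay * ||θ − θ0||²` and an isotropic Gaussian with standard deviation `sigma`. The function names and the specific numbers are illustrative, not taken from the original discussion.

```python
import math

def l2_penalty(theta, theta0, weight_decay):
    """Weight-decay style penalty: weight_decay * squared L2 distance to theta0."""
    return weight_decay * sum((t - t0) ** 2 for t, t0 in zip(theta, theta0))

def gaussian_log_prob(theta, theta0, sigma):
    """Log density of theta under an isotropic Gaussian N(theta0, sigma^2 I)."""
    sq_dist = sum((t - t0) ** 2 for t, t0 in zip(theta, theta0))
    log_norm = len(theta) * math.log(sigma * math.sqrt(2 * math.pi))
    return -sq_dist / (2 * sigma ** 2) - log_norm

weight_decay = 0.01                        # assumed hyperparameter
sigma = math.sqrt(1 / (2 * weight_decay))  # sigma implied by that weight decay
theta0 = [0.5, -1.0, 2.0]

# penalty + log-prob should be the same constant for any theta.
theta_a = [0.0, 0.0, 0.0]
theta_b = [1.5, -3.0, 0.25]
gap_a = l2_penalty(theta_a, theta0, weight_decay) + gaussian_log_prob(theta_a, theta0, sigma)
gap_b = l2_penalty(theta_b, theta0, weight_decay) + gaussian_log_prob(theta_b, theta0, sigma)
print(math.isclose(gap_a, gap_b))
```

The two gaps agree because the penalty equals the negative log density up to the normalizing constant, which does not depend on θ.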
My admittedly-weak physics intuitions are usually that you only want to take an exponential (or definitely a log-sum-exp like this) of unitless quantities, but it looks like it has maybe the unit of our distance in parameter space. That makes it weird to integrate over possible parameters, which introduces another unit of parameter space, and then take the logarithm of it.
I think this intuition is correct, and the typical solution in ML algorithms is to empirically scale all of your quantities such that everything works out (which you can interpret from the unit-checking perspective as “finding the appropriate constant to multiply your quantities by such that they become the right kind of unitless”).
I guess my sense is that most biological systems are going to be ‘package deals’ instead of ‘cleanly separable’ as much as possible—if you already have a system that’s doing learning, and you can tweak that system in order to get something that gets you some of the benefits of a VoI framework (without actually calculating VoI), I expect biology to do that.
Hmm, I guess I mostly disagree because:
I see this as sorta an unavoidable aspect of how the system works, so it doesn’t really need an explanation;
You’re jumping to “the system will maximize sum of future rewards” but I think RL in the brain is based on “maximize rewards for this step right now” (…and by the way “rewards for this step right now” implicitly involves an approximate assessment of future prospects.) See my comment “Humans are absolute rubbish at calculating a time-integral of reward”.
I’m all for exploration, value-of-information, curiosity, etc., just not involving this particular mechanism.
Thanks Aryeh for collecting these! I added them to a new Project Ideas section in my AI Safety Resources list.
Sure, but don’t you agree that it’s a very confusing use of the term?
Maybe? Idk, according to me the goal of alignment is “create a model that is motivated to help us”, and so misalignment = not-alignment = “the model is not motivated to help us”. Feels pretty clear to me but illusion of transparency is a thing.
I am making a claim that for the purposes of alignment of capable systems, you do want to talk about “motivation”. So to the extent GPT-N / Codex-N doesn’t have a motivation, but is existentially risky, I’m claiming that you want to give it a motivation. I wouldn’t say this with high confidence but it is my best guess for now.
(Also the fact that gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment is further evidence that you’re implying agency for different people)
I think Gwern is using “agent” in a different way than you are ¯\_(ツ)_/¯
I don’t think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He’d probably be more specific than me just because he’s worked with it a lot more than I have.)
Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious?
It doesn’t seem like whether something is obvious or not should determine whether it is misaligned—it’s obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.
Almost all specification gaming examples are subtle, or tricky, or exploiting bugs.
I think that’s primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.
Very interesting, thanks.
A few points on examples from humans in capacity-to-succeed-through-deception (tricking in the transcript):
It’s natural that we don’t observe anyone successfully doing this, since success entails not being identified as deceptive. This could involve secrecy, but more likely things like charisma and leverage of existing biases.
When making comparisons with very-smart-humans, I think it’s important to consider very-smart-across-all-mental-dimensions-humans (including charisma etc).
It may be that people have paths to high utility (which may entail happiness, enlightenment, meaning, contentment… rather than world domination) that don’t involve the risks of a deceptive strategy. If human utility were e.g. linear in material resources, things may look different.
Human deception is often kept in check by cost-of-punishments outweighing benefit-of-potential-success. With AI agents the space of meaningful punishments will likely look different.
But in experiments, they’re not synchronized; the former happens faster than the latter.
This has the effect of incentivizing learning, right? (A system that you don’t yet understand is, in total, more rewarding than an equally yummy system that you do understand.) So it reminds me of exploration in bandit algorithms, which makes sense given the connection to motivation.
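The bandit analogy can be illustrated with the standard UCB1 index, which adds an exploration bonus that shrinks as an arm is pulled more often. This is only a sketch of the analogy, with made-up numbers; it does not claim that the system discussed here computes such a bonus.

```python
import math

def ucb1_index(mean_reward, pulls, total_pulls):
    """UCB1 index: empirical mean plus an exploration bonus for rarely tried arms."""
    return mean_reward + math.sqrt(2 * math.log(total_pulls) / pulls)

# Two options that are "equally yummy" (same empirical mean), but one is
# well understood (many pulls) and the other is not (few pulls).
understood = ucb1_index(mean_reward=0.5, pulls=90, total_pulls=100)
not_yet_understood = ucb1_index(mean_reward=0.5, pulls=10, total_pulls=100)

# The less-explored option gets the higher index, so it is preferred.
print(not_yet_understood > understood)
```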
Sorry for ascribing to you beliefs you don’t have. I guess I’m just used to people here and in other places assuming goals and agency in language models, and also some of your choices of words sounded very goal-directed/intentional stance to me.
Maybe you’re objecting to the “motivated” part of that sentence? But I was saying that it isn’t motivated to help us, not that it is motivated to do something else.
Sure, but don’t you agree that it’s a very confusing use of the term? Like, if I say GPT-3 isn’t trying to kill me, I’m not saying it is trying to kill anyone, but I’m sort of implying that it’s the right framing to talk about it. In this case, the “motivated” part did trigger me, because it implied that the right framing is to think about what Codex wants, which I don’t think is right (and apparently you agree).
Maybe you’re objecting to words like “know” and “capable”? But those don’t seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.
Agreed with you there.
As an aside, this was Codex rather than GPT-3, though I’d say the same thing for both.
True, but I don’t feel like there is a significant difference between Codex and GPT-3 in terms of size or training to warrant different conclusions with regard to ascribing goals/agency.
I don’t care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn’t count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.
First, I think I interpreted “misalignment” here to mean “inner misalignment”, hence my answer. I also agree that all examples in Victoria’s doc are showing misalignment. That being said, I still think there is a difference with the specification gaming stuff.
Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious? Almost all specification gaming examples are subtle, or tricky, or exploiting bugs. They’re things that I would expect a human to fail to find, even given the precise loss and training environment. Whereas I expect any human to complete buggy code with buggy code once you explain to them that Codex looks for the most probable next token based on all the code.
But there doesn’t seem to be a real disagreement between us: I agree that GPT-3/Codex seem fundamentally unable to get really good at the “Chatbot task” I described above, which is what I gather you mean by “solving my problem”.
(By the way, I have an old post about formulating this task that we want GPT-3 to solve. It was written before I actually studied GPT-3 but it holds up decently well, I think. I also did some experiments on GPT-3 with EleutherAI people on whether bigger models get better at answering more variations of the prompt for the same task.)
Planned summary for the Alignment Newsletter:
Under a [longtermist](https://forum.effectivealtruism.org/tag/longtermism) lens, one problem to worry about is that even after building AI systems, humans will spend more time competing with each other rather than figuring out what they want, which may then lead to their values changing in an undesirable way. For example, we may have powerful persuasion technology that everyone uses to persuade people to their line of thinking; it seems bad if humanity’s values are determined by a mix of effective persuasion tools, especially if persuasion significantly diverges from truth-seeking.
One solution to this is to coordinate to _pause_ competition while we deliberate on what we want. However, this seems rather hard to implement. Instead, we can at least try to _decouple_ competition from deliberation, by having AI systems acquire <@flexible influence@>(@The strategy-stealing assumption@) on our behalf (competition), and having humans separately think about what they want (deliberation). As long as the AI systems are competent enough to shield the humans from the competition, the results of the deliberation shouldn’t depend too much on competition, thus achieving the desired decoupling.
The post has a bunch of additional concrete details on what could go wrong with such a plan that I won’t get into here.
I think that this is a very good example where the paper (based on your summary) and your opinion assume some sort of higher agency/goals in GPT-3 than what I feel we have evidence for.
Where do you see any assumption of agency/goals?
(I find this some combination of sad and amusing as a commentary on the difficulty of communication, in that I feel like I tend to be the person pushing against ascribing goals to GPT.)
For a simulator-like model, this is not misalignment, this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem.
You can obviously say “it’s an agent that does really care about the context”, but it doesn’t look like it adds anything to the picture,
Agreed, which is why I didn’t say anything like that?
Rohin’s opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.
I think that this is a very good example where the paper (based on your summary) and your opinion assume some sort of higher agency/goals in GPT-3 than what I feel we have evidence for. Notably, there are IMO pretty good arguments (mostly by people affiliated with EleutherAI, I’m pushing them to post on the AF) that GPT-3 seems to work more like a simulator of language-producing processes (for lack of a better word) than as an agent trying to predict the next token.
Like what you write here:
They also probe the model for bad behavior, including misalignment. In this context, they define misalignment as a case where the user wants A, but the model outputs B, and the model is both capable of outputting A and capable of distinguishing between cases where the user wants A and the user wants B.
For a simulator-like model, this is not misalignment, this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem. In that sense, GPT-3 fails the “chatbot task”: for a lot of the things it’s great at doing, you have to handcraft (or constrain) the prompts to make it work; it won’t find out precisely what you mean.
Or to put it differently: people who are good at making GPT-3 do what they want have learned to not use it like a smart agent figuring out what you really mean, but more like a “prompt continuation engine”. You can obviously say “it’s an agent that does really care about the context”, but it doesn’t look like it adds anything to the picture, and I have the gut feeling that being agenty makes it harder to do that task (as you need a very un-goal-like goal).
(I think this points to what you mention in that comment, about approval-directedness being significantly less goal-directed: if GPT-3 is agenty, it looks quite a lot like a sort of approval-directed agent.)
The way I’m thinking about AGI algorithms (based on how I think the neocortex works) is, there would be discrete “features” but they all come in shades of applicability from 0 to 1, not just present or absent. And by the same token, the reward wouldn’t perfectly align with any “features” (since features are extracted from patterns in the environment), and instead you would wind up with “features” being “desirable” (correlated with reward) or “undesirable” (anti-correlated with reward) on a continuous scale from -∞ to +∞. And the agent would try to bring about “desirable” things rather than maximize reward per se, since the reward may not perfectly line up with anything in its ontology / predictive world-model. (Related.)
So then you sometimes have “a thing that pattern-matches 84% to desirable feature X, but also pattern-matches 52% to undesirable feature Y”.
That kinda has some spiritual similarity to model splintering I think, but I don’t think it’s exactly the same … for example I don’t think it even requires a distributional shift. (Or let me know if you disagree.) I don’t see how to import your model splintering ideas into this kind of algorithm more faithfully than that.
Anyway, I agree with “conservatism & asking for advice”. I guess I was thinking of conservatism as something like balancing good and bad aspects but weighing the bad aspects more. So maybe “a thing that pattern-matches 84% to desirable feature X, but also pattern-matches 52% to undesirable feature Y” is actually net undesirable, because the Y outweighs the X, after getting boosted up by the conservatism correction curve.
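A toy sketch of this weighing, under loud assumptions: the desirability values (+1 and −1) and the conservatism factor of 2 are made up, and a plain multiplicative boost on the negative terms stands in for whatever “conservatism correction curve” one would actually use. `net_desirability` is a hypothetical helper.

```python
def net_desirability(matches, conservatism=1.0):
    """Sum match_strength * desirability, boosting negative terms by `conservatism`.

    matches: list of (match_strength in [0, 1], desirability on a real-valued scale).
    """
    total = 0.0
    for match_strength, desirability in matches:
        contribution = match_strength * desirability
        if contribution < 0:
            contribution *= conservatism  # weigh the bad aspects more
        total += contribution
    return total

# "Pattern-matches 84% to desirable feature X, 52% to undesirable feature Y",
# with assumed desirabilities of +1 for X and -1 for Y.
matches = [(0.84, +1.0), (0.52, -1.0)]

plain = net_desirability(matches)                       # no conservatism
cautious = net_desirability(matches, conservatism=2.0)  # assumed boost factor

print(plain > 0, cautious < 0)
```

Without the correction the thing comes out net desirable; with the assumed boost, Y outweighs X and it comes out net undesirable.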
And as for asking for advice, I was thinking, if you get human feedback about this specific thing, then after you get the advice it would pattern-match 100% to desirable feature Z, and that outweighs everything else.
As for “when advice fails”, I do think you ultimately need some kind of corrigibility, but earlier on there could be something like “the algorithm that chooses when to ask questions and what questions to ask does not share the same desires as the algorithm that makes other types of decisions”, maybe.
Are you the historical origin of the robustness-centric approach?
Idk, probably? It’s always hard for me to tell; so much of what I do is just read what other people say and make the ideas sound sane to me. But stuff I’ve done that’s relevant:
Talk at CHAI saying something like “daemons are just distributional shift” in August 2018, I think. (I remember Scott attending it.)
Talk at FHI in February 2020 that emphasized a risk model where objectives generalize but capabilities don’t.
Talk at SERI conference a few months ago that explicitly argued for a focus on generalization over objectives.
Especially relevant stuff other people have done that has influenced me:
Two guarantees (arguably this should be thought of as the origin)
(My views were pretty set by the time Evan wrote the clarifying inner alignment terminology post; it’s possible that his version that’s closer to generalization-focused was inspired by things I said, you’d have to ask him.)