Vlad Mikulik (Vladimir Mikulik)

Karma: 679

Utility ≠ Reward

Vlad Mikulik 5 Sep 2019 17:28 UTC

130 points

24 comments, 1 min read, LW link, 2 reviews

2-D Robustness

Vlad Mikulik 30 Aug 2019 20:27 UTC

85 points

8 comments, 2 min read, LW link

Clarifying Consequentialists in the Solomonoff Prior

Vlad Mikulik 11 Jul 2018 2:35 UTC

20 points

16 comments, 6 min read, LW link

Vlad Mikulik 2 Jun 2019 15:38 UTC
LW: 20 AF: 12
AF
on: Selection vs Control
(I am unfortunately currently bogged down with external academic pressures, and so cannot engage with this at the depth I’d like to, but here’s some initial thoughts.)

I endorse this post. The distinction explained here seems interesting and fruitful.

I agree with the idea to treat selection and control as two kinds of analysis, rather than as two kinds of object – I think this loosely maps onto the distinction we make between the mesa-objective and the behavioural objective. The former takes the selection view of the learned algorithm; the latter takes the control view.

At least speaking for myself (the other authors might have different thoughts on this), the decision to talk explicitly in terms of the selection view in the mesa-optimiser post is based on an intuition that selectors, in general, have more coherently defined counterfactual behaviour. That is, given a very different input, a selector will still select an output that scores well on its mesa-objective, because that’s how selectors work. Whereas a controller, to the degree it optimises for an objective, seems more likely to just completely stop working on a different input. I have fairly low confidence in this argument, however: it seems to me that one can plausibly have pretty coherent counterfactual behaviour in a very broad distribution even without doing selection. And since it is ultimately the behaviour that does the damage, it would be good to have a working distinction that is based purely on that. We (the mesa-optimisation authors) haven’t been able to come up with one.

Another reason to be interested in selectors is that in RL, the learned algorithm is supposed to fill a controller role. So, restricting attention to selectors allows us to talk at least somewhat meaningfully about non-optimiser agents, which is otherwise difficult, as any learned agent is in a controller-shaped context.

In any case, I hope that more work happens on this problem, either dissolving the need to talk about optimisation, or at least making all these distinctions more precise. The vagueness of everything is currently my biggest worry about the legitimacy of mesa-optimiser concerns.

Vlad Mikulik 8 Jun 2019 18:47 UTC
LW: 15 AF: 7
AF
in reply to: tom4everitt’s comment on: Risks from Learned Optimization: Introduction
Thanks for an insightful comment. I think your points are good to bring up, and though I will offer a rebuttal I’m not convinced that I am correct about this.

What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions for calling something an agent with an objective in such a way as to minimise the leaks?

Distinguishing “this system looks like it optimises for X” from “this system internally uses an evaluation of X to make decisions” is useful from the point of view of making the abstraction more robust. The former doesn’t make clear what makes the abstraction “work”, and so when to expect it to fail. The latter will at least tell you what kind of failures to expect in the abstraction: places where the evaluation of X doesn’t connect to the rest of the system like it’s supposed to. In particular, you’re right that if the learned environment model doesn’t generalise, the mesa-objective won’t be predictive of behaviour. But that’s actually a prediction of taking this view. On the other hand, it is unclear if taking the behavioural view would predict that the system will change its behaviour off-distribution (partly because it’s unclear what exactly grounds the similarities in behaviour on-distribution).

I think it definitely is useful to also think about the behavioural objective in the way you describe, because the later concerns we raise basically do also translate to coherent behavioural objectives. And I welcome more work trying to untangle these concepts from one another, or trying to dissolve any of them as unnecessary. I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.

Re: DQN

You’re also right to point out DQN as an interesting edge case. But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function. The Q function is regressed to the episode returns. If the learning goes well, the Q function is literally representing the agent’s objective (indeed, it’s not really selected to maximise return; it’s selected to be accurate at predicting return). Contrast this with e.g. policy optimisation trained agents, which are not supposed to directly represent an objective, but are supposed to score well on it. (Someone good at running RL experiments maybe should look into comparing the coherence of revealed preferences of DQN agents with PPO agents. I’d read that paper.)
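A minimal sketch of the contrast drawn above, in plain Python. The function names and the toy numbers are illustrative assumptions, not taken from any DQN or PPO implementation. The value-based agent acts by an argmax over estimates that are trained to predict return; the policy-style agent samples from action preferences that are never asked to predict anything. Regressing to full discounted returns follows the wording of the comment; practical DQN implementations typically regress to bootstrapped one-step targets instead.

```python
import math
import random


def greedy_action(q_values):
    """Value-based (DQN-style) control: pick the action with the highest estimated return."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def discounted_return(rewards, gamma):
    """Regression target for Q in this sketch: the discounted return observed from a step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def sample_action(logits, rng):
    """Policy-style control: sample from a softmax over action preferences.

    The logits are only trained to make good actions likely; they are not
    estimates of return, so no objective is explicitly represented here.
    """
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Hypothetical toy values, for illustration only.
q_values = [0.1, 0.7, 0.3]
chosen = greedy_action(q_values)
target = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
sampled = sample_action([0.0, 2.0, -1.0], random.Random(0))
```

Here `chosen` is the index of the largest Q estimate and `target` is the value the Q estimate for that step would be regressed towards.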

Vlad Mikulik 20 May 2021 13:57 UTC
LW: 13 AF: 7
AF
on: Knowledge Neurons in Pretrained Transformers
Thanks for the link. This has been on my reading list for a little bit and your recco tipped me over.
Mostly I agree with Paul’s concerns about this paper.
However, I did find the “Transformer Feed-Forward Layers Are Key-Value Memories” paper they reference more interesting—it’s more mechanistic, and their results are pretty encouraging. I would personally highlight that one more, as it’s IMO stronger evidence for the hypothesis, although not conclusive by any means.

Some experiments they show:
- Top-k activations of individual ‘keys’ do seem like coherent patterns in prefixes, and as we move up the layers, these patterns become less shallow and more semantics-driven. (Granted, it’s not clear how good the methodology there is, as to qualify as a pattern, it needs to occur in 3 out of top-25 prefixes. There are 3.6 patterns on average in each key. But this is curious enough to keep looking into.)
- The ‘value’ distributions corresponding to the keys are in fact somewhat predictive of the actual next word for those top-k prefixes, and exhibit a kind of ‘calibration’: while the distributions themselves aren’t actually calibrated, they are more correct when they assign a higher probability.
I also find it very intriguing that you can just decode the value distributions using the embedding matrix a la Logit Lens.
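The decoding step mentioned above can be sketched roughly as follows: take a feed-forward "value" vector, multiply it by the token embedding matrix, and read the result as a distribution over the vocabulary. The three-word vocabulary, the 2-dimensional embeddings and the value vector are made-up toy numbers; a real model would use its own embedding matrix and usually a final layer norm, which this sketch omits.

```python
import math


def decode_value_vector(value, embedding, vocab, top_k=2):
    """Project a value vector onto the vocabulary via the embedding matrix.

    embedding has one row per vocabulary item; each logit is the dot product
    of that row with the value vector, followed by a softmax.
    """
    logits = [sum(e * v for e, v in zip(row, value)) for row in embedding]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [x / z for x in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical toy vocabulary and embeddings, for illustration only.
vocab = ["cat", "dog", "the"]
embedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
value = [2.0, 0.5]
top = decode_value_vector(value, embedding, vocab)
```

With these numbers the value vector points mostly along the "cat" embedding, so "cat" ranks first and "dog" second.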

Vlad Mikulik 2 Jun 2019 20:03 UTC
LW: 12 AF: 7
AF
in reply to: abramdemski’s comment on: Selection vs Control

to what extent are mesa-controllers with simple behavioural objectives going to be simple?

I’m not sure what “simple behavioural objective” really means. But I’d expect that for tasks requiring very simple policies, controllers would do, whereas the more complicated the policy required to solve a task, the more one would need to do some kind of search. Is this what we observe? I’m not sure. AlphaStar and OpenAI Five seem to do well enough in relatively complex domains without any explicit search built into the architecture. Are they using their recurrence to search internally? Who knows. I doubt it, but it’s not implausible.

certain kinds of mesa-controllers can be simple: the mesa-controllers which are more like my rocket example (explicit world-model; explicit representation of objective within that world model; but, optimal policy does not use any search).

The rocket example is interesting. I guess the question for me there is, what sorts of tasks admit an optimal policy that can be represented in this way? Here it also seems to me like the more complex an environment, the more implausible it seems that a powerful policy can be successfully represented with straightforward functions. E.g., let’s say we want a rocket not just to get to the target, but to self-identify a good target in an area and pick a trajectory that evades countermeasures. I would be somewhat surprised if we can still represent the best policy as a set of non-searchy functions. So I have this intuition that for complex state spaces, it’s hard to find pure controllers that do the job well.

Vlad Mikulik 24 Jun 2019 17:58 UTC
LW: 10 AF: 6
AF
in reply to: Rohin Shah’s comment on: Risks from Learned Optimization: Introduction
You’re completely right; I don’t think we meant to have ‘more formally’ there.

Vlad Mikulik 8 Jun 2019 18:15 UTC
LW: 10 AF: 4
AF
in reply to: Richard_Ngo’s comment on: Risks from Learned Optimization: Introduction
I think humans are fairly weird because we were selected for an objective that is unlikely to be what we select for in our AIs.

That said, if we model AI success as driven by model size and compute (with maybe innovations in low-level architecture), then I think that the way humans represent objectives is probably fairly close to what we ought to expect.

If we model AI success as mainly innovative high-level architecture, then I think we will see more explicitly represented objectives.

My tentative sense is that for AI to be interpretable (and safer) we want it to be the latter kind, but given enough compute the former kind of AI will give better results, other things being equal.

Here, what I mean by low-level architecture is something like “we’ll use lots of LSTMs instead of lots of plain RNNs, but keep the model structure simple: plug in the inputs, pass it through some layers, and read out the action probabilities”, and high-level is something like “let’s organise the model using this enormous flowchart with all of these various pieces that each are designed to take a particular role; here’s the observation embedding, here’s the search in latent model space, here’s the …”

Vlad Mikulik 21 Feb 2021 19:01 UTC
LW: 7 AF: 4
AF
on: Formal Solution to the Inner Alignment Problem
Thanks for the post and writeup, and good work! I especially appreciate the short, informal explanation of what makes this work.
Given my current understanding of the proposal, I have one worry which makes me reluctant to share your optimism about this being a solution to inner alignment:
The scheme doesn’t protect us if somehow all top-n demonstrator models have correlated errors. This could happen if they are coordinating, or more prosaically if our way to approximate the posterior leads to such correlations. The picture I have in my head for the latter is that we train a big ensemble of neural nets and treat a random sample from that ensemble as a random sample from the posterior, although I don’t know if that’s how it’s actually done.
A lot of the work is done by the assumption that the true demonstrator is in the posterior, which means that at least one of the top-performing models will not have the same correlated errors. But I’m not sure how true this assumption will be in the neural-net approximation I describe above. I worry about inner alignment failures because I don’t really trust the neural net prior, and I can imagine training a bunch of neural nets to have correlated weirdnesses about them (in part because of the neural net prior they share, and in part because of things like Adversarial Examples Are Not Bugs, They Are Features). As such it wouldn’t be that surprising to me if it turned out that ensembles have certain correlated errors, and in particular don’t really represent anything like the demonstrator.
I do feel safer using this method than I would deferring to a single model, so this is still a good idea on balance. I just am not convinced that it solves the inner alignment problem. Instead, I’d say it ameliorates its severity, which may or may not be sufficient.
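The worry about correlated errors can be made concrete with a toy sketch. It assumes, as a reading of the proposal that this comment does not confirm, that the imitator gives each action the minimum probability assigned by any of the top-n models and spends the leftover probability on querying the demonstrator. The function name and the numbers are hypothetical.

```python
def imitate_or_query(model_probs):
    """Combine the top-n models' action distributions conservatively.

    model_probs is a list of dicts mapping action -> probability, one per model.
    Each action gets the minimum probability any model assigns to it; whatever
    probability mass is left over is the chance of deferring to the demonstrator.
    """
    actions = model_probs[0].keys()
    action_probs = {a: min(m[a] for m in model_probs) for a in actions}
    query_prob = 1.0 - sum(action_probs.values())
    return action_probs, query_prob


# Models that disagree: much of the mass goes to asking the demonstrator.
diverse = [{"safe": 0.9, "bad": 0.1}, {"safe": 0.2, "bad": 0.8}]
diverse_probs, diverse_query = imitate_or_query(diverse)

# Models that share the same error: nothing triggers a query.
correlated = [{"safe": 0.05, "bad": 0.95}, {"safe": 0.05, "bad": 0.95}]
correlated_probs, correlated_query = imitate_or_query(correlated)
```

In the second case every model agrees on the bad action, so the query probability is essentially zero and the shared error passes straight through, which is the failure mode described above.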

Vlad Mikulik 29 Sep 2019 0:13 UTC
LW: 7 AF: 4
AF
in reply to: Rohin Shah’s comment on: A simple environment for showing mesa misalignment
By that I didn’t mean to imply that we care about mesa-optimisation in particular. I think that this demo working “as intended” is a good demo of an inner alignment failure, which is exciting enough as it is. I just also want to flag that the inner alignment failure doesn’t automatically provide an example of a mesa-optimiser.

Vlad Mikulik 9 Jun 2019 18:53 UTC
7 points
in reply to: Ofer’s comment on: Risks from Learned Optimization: Introduction
To some extent, but keep in mind that in another sense, the behavioural objective of maximising paperclips is totally consistent with playing along with the base objective for a while and then defecting. So I’m not sure the behaviour/mesa- distinction alone does the work you want it to do even in that case.

Vlad Mikulik 3 Apr 2021 15:57 UTC
LW: 6 AF: 3
AF
on: My take on Michael Littman on “The HCI of HAI”
Thanks for a great post.
---
One nice point that this post makes (which I suppose was also prominent in the talk, but I can only guess, not being there myself) is that there’s a kind of progression we can draw (simplifying a little):
- Human specifies what to do (Classical software)
- Human specifies what to achieve (RL)
- Machine infers a specification of what to achieve (IRL)
- Machine collaborates with human to infer and achieve what the human wants (Assistance games)
Towards the end, this post describes an extrapolation of this trend:
- Machine and human collaboratively figure out what the human even wants to do in the first place.
‘Helping humans figure out what they want’ is a deep, complex and interesting problem, and I’d love it if more folks were thinking through what solutions to it ought to look like. This seems particularly urgent because human motivations can be affected even by algorithms that were not designed to solve this problem—for example, think of recommender systems shaping their users’ habits—and which therefore aren’t doing what we’d want them to do.
---
Another nice point is the connection between ML algorithm design and HCI. I’ve been meaning to write something looking at RL as ‘technique for communicating and achieving human intent’ (and, as a corollary, at AI safety as a kind of human-centred algorithm design), but it seems that I’ve been scooped by Michael :)
I note that not everyone sees RL from this frame. Some RL researchers view it as a way of understanding intelligence in the abstract, without connecting reward to human values.
---
One thing I’m a little less sure of is the conclusion you draw from your examples of changing intentions. While the examples convince me that the AI ought to have some sophistication about the human’s intentions—for example, being aware that human intentions can change—it’s not obvious that the right move is to ‘pop out’ further and assume there is something ‘bigger’ that the human’s intentions should be aligned with. Could you elaborate on your vision of what you have in mind there?

Vlad Mikulik 13 Sep 2020 15:16 UTC
LW: 6 AF: 4
AF
on: Mesa-Search vs Mesa-Control
I’ve thought of two possible reasons so far.
Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan’s mesa-optimization post, just replacing search with RL.
More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.
I would be more inclined towards a more general version of the latter view, in which gradient updates just aren’t a very effective way to track within-episode information.
The central example of learning-to-learn is a policy that effectively explores/exploits when presented with an unknown bandit from within the training distribution. An optimal policy essentially needs to keep track of sufficient statistics of the reward distributions for each action. If you’re training a memoryless policy for a fixed bandit problem using RL, then the only way of tracking the sufficient stats you have is through your weights, which are changed through the gradient updates. But the weight-space might not be arranged in a way that’s easily traversed by local jumps. On the other hand, a meta-trained recurrent agent can track sufficient stats in its activations, traversing the sufficient statistic space in whatever way it pleases—its updates need not be local.
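As a small illustration of what "sufficient statistics" means here, the sketch below assumes a Bernoulli bandit (rewards are 0 or 1) and a uniform Beta(1, 1) prior; both are assumptions made for the example, not details from the comment. For that case, per-arm success and failure counts are all an optimal policy needs to remember, and a recurrent agent could in principle carry something equivalent in its activations.

```python
class BernoulliBanditStats:
    """Per-arm success/failure counts: sufficient statistics for a Bernoulli bandit."""

    def __init__(self, n_arms):
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms

    def update(self, arm, reward):
        """Record one pull; reward is assumed to be 0 or 1."""
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

    def posterior_mean(self, arm):
        """Mean of the Beta posterior under an assumed uniform Beta(1, 1) prior."""
        s, f = self.successes[arm], self.failures[arm]
        return (s + 1) / (s + f + 2)


stats = BernoulliBanditStats(n_arms=2)
for reward in (1, 1, 0):
    stats.update(0, reward)
```

After these three pulls, arm 0 has two successes and one failure, while arm 1 is still at its prior. Note that the update is a jump in statistic space that involves no gradient step at all.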
This has an interesting connection to MAML, because a converged memoryless MAML solution on a distribution of bandit tasks will presumably arrange the part of its weight-space that encodes bandit sufficient statistics in a way that makes it easy to traverse via SGD. That would be a neat (and not difficult) experiment to run.

Vlad Mikulik 18 Dec 2019 22:18 UTC
LW: 6 AF: 4
AF
in reply to: Rohin Shah’s comment on: Is the term mesa optimizer too narrow?
We’re probably in agreement, but I’m not sure what exactly you mean by “retreat to malign generalisation”.

For me, mesa-optimisation’s primary claim isn’t that agents are well-described as optimisers (call it Optimisers), which I’m happy to drop. It is the claim (call it Mesa≠Base) that whatever the right way to describe them is, in general their intrinsic goals are distinct from the reward.

That’s a specific (if informal) claim about a possible source of malign generalisation. Namely, that when intrinsic goals differ arbitrarily from the reward, then systems that competently pursue them may lead to outcomes that are arbitrarily bad according to the reward. Humans don’t pose a counterexample to that, and it seems prima facie conceptually clarifying, so I wouldn’t throw it away. I’m not sure if you propose to do that, but strictly, that’s what “retreating to malign generalisation” could mean, as malign generalisation itself makes no reference to goals.

One might argue that until we have a good model of goal-directedness, Mesa≠Base reifies goals more than is warranted, so we should drop it. But I don’t think so – so long as one accepts goals as meaningful at all, the underlying model need only admit a distinction between the goal of a system and the criterion according to which a system was selected. I find it hard to imagine a model or view that wouldn’t allow this – this makes sense even in the intentional stance, whose metaphysics for goals is pretty minimal.

It’s a shame that Mesa≠Base is so entangled with Optimisers. When I think of mesa-optimisation, I tend to think more about the former than about the latter. I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers. The Inner Alignment Problem might be it, though it feels like it’s not quite specific enough.

Vlad Mikulik 16 Dec 2019 13:00 UTC
LW: 6 AF: 4
AF
on: Is the term mesa optimizer too narrow?
I’m sympathetic to what I see as the message of this post: that talk of mesa-optimisation is too specific given that the practical worry is something like malign generalisation. I agree that it makes extra assumptions on top of that basic worry, which we might not want to make. I would like to see more focus on inner alignment than on mesa-optimisation as such. I’d also like to see a broader view of possible causes for malign generalisation, which doesn’t stick so closely to the analysis in our paper. (In hindsight our analysis could also have benefitted from taking a broader view, but that wasn’t very visible at the time.)

At the same time, speaking only in terms of malign generalisation (and dropping the extra theoretical assumptions of a more specific framework) is too limiting. I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment. I’m not sure that mesa-optimisation is the right view for that, but I do think that the right view will have something to do with goal-directedness.

Vlad Mikulik 26 Sep 2019 9:57 UTC
LW: 6 AF: 5
AF
on: A simple environment for showing mesa misalignment
I have now seen a few suggestions for environments that demonstrate misaligned mesa-optimisation, and this is one of the best so far. It combines being simple and extensible with being compelling as a demonstration of pseudo-alignment if it works (fails?) as predicted. I think that we will want to explore more sophisticated environments with more possible proxies later, but as a first working demo this seems very promising. Perhaps one could start even without the maze, just a gridworld with keys and boxes.

I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy. There is room for philosophical disagreement there. Even with that, a working demo of this environment would in my opinion be a good thing, as we would have a concrete agent to disagree about.

Vlad Mikulik 5 Sep 2019 23:14 UTC
6 points
in reply to: Gordon Seidoh Worley’s comment on: Utility ≠ Reward
Thanks for raising this. While I basically agree with evhub on this, I think it is unfortunate that the linguistic justification is messed up as it is. I’ll try to amend the post to show a bit more sensitivity to the Greek not really working as intended.
Though I also think that “the opposite of meta”-optimiser is basically the right concept, I feel quite dissatisfied with the current terminology, with respect to both the “mesa” and the “optimiser” parts. This is despite us having spent a substantial amount of time and effort on trying to get the terminology right! My takeaway is that it’s just hard to pick terms that are both non-confusing and evocative, especially when naming abstract concepts. (And I don’t think we did that badly, all things considered.)
If you have ideas on how to improve the terms, I would like to hear them!

Vlad Mikulik 9 Jun 2022 16:54 UTC
LW: 5 AF: 4
AF
on: Who models the models that model models? An exploration of GPT-3’s in-context model fitting ability
Nice work!
This section in Anthropic’s work on Induction heads seems highly relevant—I would be interested in seeing an extension of your analysis that looks at what induction heads do in these tasks.

If we believe the claims in that paper, then in-context learning of any kind seems to be driven by a fairly simple mechanism not unlike kNN: induction attention heads. Since it’s pretty tractable to locate induction heads in an automated way, we could potentially take a look at the actual mechanism being used to implement these predictions and verify/falsify the hypotheses you make about how GPT makes these predictions. (Although you’d probably have to switch to an open-source model.)
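For readers unfamiliar with the mechanism, here is a toy sketch of the behaviour usually attributed to an induction head: find an earlier occurrence of the current token and predict whatever followed it. This is a symbolic caricature with a hypothetical function name; real induction heads do this softly through attention and can match on similarity rather than exact identity.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token. Returns None if the
    current token has not been seen before."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


seen_before = induction_predict(["A", "B", "C", "A"])
never_seen = induction_predict(["x", "y"])
```

The first call completes the pattern [A][B] ... [A] with "B"; the second finds no earlier match and returns `None`.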

Vlad Mikulik 2 Sep 2020 15:42 UTC
LW: 5 AF: 3
AF
in reply to: nostalgebraist’s comment on: interpreting GPT: the logit lens
You might want to look into NMF, which, unlike PCA/SVD, doesn’t aim to create an orthogonal projection. It works well for interpretability because its components cannot cancel each other out, which makes its features more intuitive to reason about. I think it is essentially what you want, although I don’t think it will allow you to find directly the ‘larger set of almost orthogonal vectors’ you’re looking for.
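To show what the non-negativity constraint looks like in practice, here is a small pure-Python sketch of NMF using the standard multiplicative update rules for squared error. The toy matrix, the rank and the helper names are illustrative assumptions; in real use one would reach for a library implementation. Because both factors stay non-negative, components can only add to a reconstruction, never cancel each other.

```python
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Squared Frobenius distance between v and the reconstruction w @ h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, steps=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (n x m) into w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(steps):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical toy data, for illustration only.
v = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 0.0, 4.0]]
w0, h0 = nmf(v, rank=2, steps=0)
w, h = nmf(v, rank=2, steps=200)
initial_error = squared_error(v, w0, h0)
final_error = squared_error(v, w, h)
```

The updates only ever multiply by non-negative ratios, so every entry of `w` and `h` stays non-negative throughout, in contrast to PCA/SVD components, which are free to take either sign.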
