# Humans can be assigned any values whatsoever…

*(Re)Posted as part of the AI Alignment Forum sequence on* *Value Learning.*

Rohin’s note: In the last post, we saw that a good broad value learning approach would need to understand the systematic biases in human planning in order to achieve superhuman performance. Perhaps we can just use machine learning again and learn the biases and reward simultaneously? This post by Stuart Armstrong (original here) and the associated paper say: “Not without more assumptions.”

This post comes from a theoretical perspective that may be alien to ML researchers; in particular, it makes an argument that simplicity priors do not solve the problem pointed out here, where simplicity is based on Kolmogorov complexity (which is an instantiation of the Minimum Description Length principle). The analog in machine learning would be an argument that regularization would not work. The proof used is specific to Kolmogorov complexity and does not clearly generalize to arbitrary regularization techniques; however, I view the argument as being suggestive that regularization techniques would also be insufficient to address the problems raised here.

Humans have no values… nor does any agent, unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values.

### An agent with no clear preferences

There are three buttons in this world, B(0), B(1), and X, and one agent H.

B(0) and B(1) can be operated by H, while X can be operated by an outside observer. H will initially press button B(0); if ever X is pressed, the agent will switch to pressing B(1). If X is pressed again, the agent will switch back to pressing B(0), and so on. After a large number of turns N, H will shut off. That’s the full algorithm for H.

So the question is, what are the values/preferences/rewards of H? There are three natural reward functions that are plausible:

- R(0), which is linear in the number of times B(0) is pressed.
- R(1), which is linear in the number of times B(1) is pressed.
- R(2) = I(E,X)R(0) + I(O,X)R(1), where I(E,X) is the indicator function for X being pressed an even number of times, and I(O,X) = 1 − I(E,X) is the indicator function for X being pressed an odd number of times.

For R(0), we can interpret H as an R(0)-maximising agent which X overrides. For R(1), we can interpret H as an R(1)-maximising agent which X releases from constraints. And R(2) is the “H is always fully rational” reward. Semantically, these make sense for the various R(i)’s being a true and natural reward, with X = “coercive brain surgery” in the first case, X = “release H from annoying social obligations” in the second, and X = “switch which of R(0) and R(1) gives you pleasure” in the last case.

But note that there are no semantic implications here: all that we know is H, with its full algorithm. If we wanted to deduce its true reward for the purpose of something like Inverse Reinforcement Learning (IRL), what would it be?
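The button-pressing agent is small enough to simulate directly. The sketch below is only an illustration under stated assumptions: it labels the agent's two buttons B(0) and B(1) and the observer's button X, encodes the observer's presses as a list of turn numbers (a hypothetical encoding, with at most one press per turn, taking effect before the agent acts), and reads the third reward per turn, i.e. a turn scores one point when the button pressed matches the current parity of X presses. The function names `run_agent` and `rewards` are made up for this example.

```python
from typing import Iterable, List, Tuple


def run_agent(n_turns: int, x_presses: Iterable[int]) -> List[int]:
    """Simulate the agent: return the button index (0 or 1) pressed on each turn."""
    toggles = set(x_presses)
    current = 0  # the agent starts by pressing B(0)
    pressed = []
    for turn in range(n_turns):
        if turn in toggles:
            current = 1 - current  # each press of X switches the agent's button
        pressed.append(current)
    return pressed


def rewards(pressed: List[int], x_presses: Iterable[int]) -> Tuple[int, int, int]:
    """Score one trajectory under the three candidate rewards (R0, R1, R2)."""
    toggles = set(x_presses)
    r0 = sum(1 for button in pressed if button == 0)  # counts B(0) presses
    r1 = sum(1 for button in pressed if button == 1)  # counts B(1) presses
    r2 = 0
    parity = 0  # 0 while X has been pressed an even number of times, else 1
    for turn, button in enumerate(pressed):
        if turn in toggles:
            parity = 1 - parity
        if button == parity:  # assumed per-turn reading of the indicator reward
            r2 += 1
    return r0, r1, r2


trajectory = run_agent(10, [3, 7])
print(trajectory)                   # [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
print(rewards(trajectory, [3, 7]))  # (6, 4, 10)
```

Under this reading the same trajectory is a perfect score for the third reward and an interrupted attempt at either of the first two, which is the ambiguity the example is pointing at: the behaviour alone does not say which description is the right one.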

### Modelling human (ir)rationality and reward

Now let’s talk about the preferences of an actual human. We all know that humans are not always rational. But even if humans were fully rational, the fact remains that we are physical, and vulnerable to things like coercive brain surgery (and in practice, to a whole host of other more or less manipulative techniques). So there will be the equivalent of “button X” that overrides human preferences. Thus, “not immortal and unchangeable” is in practice enough for the agent to be considered “not fully rational”.

Now assume that we’ve thoroughly observed a given human h (including their internal brain wiring), so we know the human policy π(h) (which determines their actions in all circumstances). This is, in practice, all that we can ever observe: once we know π(h) perfectly, there is nothing more that observing h can teach us.

Let R be a possible human reward function, and **R** the set of such rewards. A human (ir)rationality planning algorithm p (hereafter referred to as a planner) is a map from **R** to the space of policies (thus p(R) says how a human with reward R will actually behave; for example, this could be bounded rationality, rationality with biases, or many other options). Say that the pair (p, R) is compatible if p(R) = π(h). Thus a human with planner p and reward R would behave as h does.

What possible compatible pairs are there? Here are some candidates:

- (p(0), R(0)), where p(0) and R(0) are some “plausible” or “acceptable” planner and reward functions (what this means is a big question).
- (p(1), R(1)), where p(1) is the “fully rational” planner, and R(1) is a reward that fits to give the required policy.
- (p(2), R(2)), where R(2) = −R(1), and p(2) = −p(1), where −p(R) is defined as p(−R); here p(2) is the “fully anti-rational” planner.
- (p(3), R(3)), where p(3) maps all rewards to π(h), and R(3) is trivial and constant.
- (p(4), R(4)), where p(4) = −p(0) and R(4) = −R(0).
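To make the five candidates concrete, here is a minimal sketch on a toy problem. Everything specific in it is an assumption made for illustration: the candidates are numbered 0 to 4 in the order listed, three states stand in for histories, rewards depend only on a (state, action) pair, the "plausible" planner is a rational planner with one hard-coded bias, and all names (`PI_H`, `biased`, `anti`, and so on) are invented here.

```python
from typing import Callable, Dict, Tuple

Policy = Dict[str, str]
Reward = Dict[Tuple[str, str], float]
Planner = Callable[[Reward], Policy]

STATES = ["s0", "s1", "s2"]
ACTIONS = ["a", "b"]

# The observed human policy (toy data, not taken from the post).
PI_H: Policy = {"s0": "a", "s1": "a", "s2": "b"}


def negate(reward: Reward) -> Reward:
    return {key: -value for key, value in reward.items()}


def rational(reward: Reward) -> Policy:
    """Candidate 1's planner: take the highest-reward action in every state."""
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def biased(reward: Reward) -> Policy:
    """Candidate 0's planner: rational, except a bias picks the worst action in s2."""
    policy = rational(reward)
    policy["s2"] = min(ACTIONS, key=lambda a: reward[("s2", a)])
    return policy


def anti(planner: Planner) -> Planner:
    """The negated planner, defined by (-p)(R) = p(-R)."""
    return lambda reward: planner(negate(reward))


def constant(reward: Reward) -> Policy:
    """Candidate 3's planner: ignore the reward and return the observed policy."""
    return dict(PI_H)


# Candidate 0's reward: the human "really" prefers action a everywhere.
R0: Reward = {(s, a): 1.0 if a == "a" else 0.0 for s in STATES for a in ACTIONS}
# Candidate 1's reward: fitted so that whatever the human does is optimal.
R1: Reward = {(s, a): 1.0 if PI_H[s] == a else 0.0 for s in STATES for a in ACTIONS}
# Candidate 3's reward: trivial and constant.
R3: Reward = {(s, a): 0.0 for s in STATES for a in ACTIONS}

PAIRS: Dict[int, Tuple[Planner, Reward]] = {
    0: (biased, R0),
    1: (rational, R1),
    2: (anti(rational), negate(R1)),
    3: (constant, R3),
    4: (anti(biased), negate(R0)),
}


def is_compatible(planner: Planner, reward: Reward) -> bool:
    """A pair is compatible when the planner applied to the reward gives the observed policy."""
    return planner(reward) == PI_H


for index, (planner, reward) in PAIRS.items():
    print(index, is_compatible(planner, reward))  # every pair prints True
```

All five pairs reproduce the same policy even though their rewards disagree about which actions are good: in state s2, for example, candidate 0's reward favours a, candidate 1's favours b, and candidate 3's is indifferent.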

### Distinguishing among compatible pairs

How can we distinguish between compatible pairs? At first appearance, we can’t. That’s because, by the definition of compatible, all pairs produce the correct policy π(h). And once we have π(h), further observations of h tell us nothing.

I initially thought that Kolmogorov or algorithmic complexity might help us here. But in fact:

**Theorem:** The pairs (p(i), R(i)), i ≥ 1, are either simpler than (p(0), R(0)), or differ in Kolmogorov complexity from it by a constant that is independent of (p(0), R(0)).

**Proof:** The cases of i = 4 and i = 2 are easy, as these differ from i = 0 and i = 1 by two minus signs. Given (p(0), R(0)), a fixed-length algorithm computes π(h). Then a fixed-length algorithm defines p(3) (by mapping input to π(h)). Furthermore, given π(h) and any history η, a fixed-length algorithm computes the action a(η) the agent will take; then a fixed-length algorithm defines R(1)(η, a(η)) = 1 and R(1)(η, b) = 0 for b ≠ a(η).

So the Kolmogorov complexity can shift between p and R (all in R for i = 1, 2, all in p for i = 3), but it seems that the complexity of the pair doesn’t go up during these shifts.

This is puzzling. It seems that, in principle, one cannot assume anything about h’s reward at all! R(2) = −R(1), R(4) = −R(0), and p(3) is compatible with any possible reward R. If we give up the assumption of human rationality, which we must, it seems we can’t say anything about the human reward function. So it seems IRL must fail.

How I understand the main point:

The goal is to get superhuman performance aligned with human values Rh. How might we achieve this? By learning the human values. Then we can use a perfect planner p⋆ to find the best actions to align the world with the human values. This will have superhuman performance, because humans’ planning algorithms are not perfect. They don’t always find the best actions to align the world with their values.

How do we learn the human values? By observing human behaviour, i.e. their actions in each circumstance. This is modelled as the human policy π(h).

Behaviour is the known outside view of a human, and values+planner is the unknown inside view. We need to learn both the values and the planner such that p(R)=π(h).

Unfortunately, this equation is underdetermined. We only know π(h). p and R can vary independently.

Are there differences among the (p,R) candidates? One thing we could look at is their Kolmogorov complexity. Maybe the true candidate has the lowest complexity. But this is not the case, according to the article.

Yep, basically that. ^_^

Out of curiosity, is there an intuitive explanation as to why these are different? Is it mainly because ambitious value learning inevitably has to deal with lots of (systematic) mistakes in the data, whereas normally you’d make sure that the training data doesn’t contain (many) obvious mistakes? Or are there examples in ML where you can retroactively correct mistakes imported from a flawed training set?

(I’m not sure “training set” is the right word for the IRL context. Applied to ambitious value learning, what I mean would be the “human policy”.)

Update: Ah, it seems like the next post is all about this! :) My point about errors seems like it might be vaguely related, but the explanation in the next post feels more satisfying. It’s a different kind of problem because you’re not actually interested in predicting observable phenomena anymore, but instead are trying to infer the “latent variable” – the underlying principle(?) behind the inputs. The next post in the sequence also gives me a better sense of why people say that ML is typically “shallow” or “surface-level reasoning”.

Interestingly, humans are able to predict each other’s values in most cases, and this helps our society to exist. Relationships, markets, just walking outside: all of it is based on our ability to read the intentions of other people successfully.

However, many bad events happen when we don’t understand each other’s intentions: this enables scammers and interpersonal conflicts.

Only across small inferential gaps. That works for most cases only because people interact inside bubbles, groups based on similarity. Interactions between random people would be mostly puzzling.

I think that right now we don’t know how to bridge the gap between the thing that presses the buttons on the computer, and a fuzzy specification of a human as a macroscopic physical object. And so if you are defining “human” as the thing that presses the buttons, and you can take actions that fully control which buttons get pressed, it makes sense that there’s not necessarily a definition of what this “human” wants.

If we actually start bridging the gap, though, I think it makes lots of sense for the AI to start building up a model of the human-as-physical-object which also takes into account button presses, and in that case I’m not too pessimistic about regularization.

I think of the example as illustrative, but the real power of the argument comes from the planner+reward formalism and the associated impossibility theorem. The fact that Kolmogorov complexity doesn’t help is worrying. It’s possible that other regularization techniques work where Kolmogorov complexity doesn’t, but that raises the question of what is so special about these other regularization techniques.

Suppose we start our AI off with the intentional stance, where we have a high-level description of these human objects as agents with desires and plans, beliefs and biases and abilities and limitations.

What I’m thinking when I say we need to “bridge the gap” is that I think if we knew what we were doing, we could stipulate that some set of human button-presses is more aligned with some complicated object “hDesires” than not, and the robot should care about hDesires, where hDesires is the part of the intentional stance description of the physical human that plays the functional role of desires.

For the reasonable option, two other statements hold true, at least one of which fails for all totally unreasonable rules of similar Kolmogorov complexity that I can think of.

1) π(h) is good at optimizing R, (much better than random).

2) p(R) is quickly computable, as opposed to the fully rational planner with every bias turned into a goal, which would be slow to compute (I think).

Shifting all biases to goals should also increase the complexity of the goal function.

Even just insisting that R is simple (low Kolmogorov complexity), and p is effective (displays many bits of optimization pressure towards R) should produce results more sane than these. (Maybe subtly flawed?)

EDIT: This has a tendency to locate simple instrumental subgoals, e.g. maximise entropy.
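One possible reading of "bits of optimization pressure" can be tried on a toy problem. The sketch below is an assumption-laden illustration, not the commenter's definition: it scores a policy by the negative log of the fraction of all policies that do at least as well under a given reward, and it uses invented toy data (three states, two actions, a policy `PI_H`, and four hand-written rewards).

```python
import math
from itertools import product
from typing import Dict, Tuple

Policy = Dict[str, str]
Reward = Dict[Tuple[str, str], float]

STATES = ["s0", "s1", "s2"]
ACTIONS = ["a", "b"]
PI_H: Policy = {"s0": "a", "s1": "a", "s2": "b"}  # toy observed policy


def value(policy: Policy, reward: Reward) -> float:
    """Total reward collected by following the policy once in every state."""
    return sum(reward[(s, policy[s])] for s in STATES)


def optimization_bits(policy: Policy, reward: Reward) -> float:
    """log2(number of policies / number of policies scoring at least as well)."""
    all_policies = [dict(zip(STATES, choice))
                    for choice in product(ACTIONS, repeat=len(STATES))]
    target = value(policy, reward)
    at_least = sum(1 for q in all_policies if value(q, reward) >= target)
    return math.log2(len(all_policies) / at_least)


# Hypothetical rewards: a "plausible" one, one fitted to the policy,
# the negation of the fitted one, and a constant one.
R_PLAUSIBLE: Reward = {(s, a): 1.0 if a == "a" else 0.0 for s in STATES for a in ACTIONS}
R_FITTED: Reward = {(s, a): 1.0 if PI_H[s] == a else 0.0 for s in STATES for a in ACTIONS}
R_NEGATED: Reward = {key: -v for key, v in R_FITTED.items()}
R_CONSTANT: Reward = {(s, a): 0.0 for s in STATES for a in ACTIONS}

for name, reward in [("plausible", R_PLAUSIBLE), ("fitted", R_FITTED),
                     ("negated", R_NEGATED), ("constant", R_CONSTANT)]:
    print(name, optimization_bits(PI_H, reward))
```

In this toy case the negated and constant rewards score 0 bits, so an effectiveness requirement does filter them out, but the fitted reward scores 3 bits against 1 bit for the plausible one. On this reading, effectiveness alone favours the "every bias is a goal" reward, so the simplicity requirement on R would have to do the remaining work.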

I agree you can add in more assumptions in order to get better results. The hard part is a) how you know that your assumptions are always correct, and b) how you know when you have enough assumptions that you will actually find the correct p and R.

(You might be interested in Inferring Reward Functions from Demonstrators with Unknown Biases, which takes a similar perspective as you quite explicitly, and Resolving human values, completely and adequately, which takes this perspective implicitly.)