[Question] Can coherent extrapolated volition be estimated with Inverse Reinforcement Learning?

Jade Bishop 15 Apr 2019 3:23 UTC

12 points

Inverse Reinforcement Learning Reinforcement Learning Coherent Extrapolated Volition

Given the following conditions, is it possible to approximate the coherent extrapolated volition (CEV) of humanity to a “good enough” level?

Some form of reward/cost function estimation is used, such as inverse reinforcement learning (IRL) or inverse optimal control (IOC). The details of the specific IRL/IOC algorithm are not important, only the fact that the reward/cost function is estimated. For the unfamiliar, IRL is essentially the opposite of traditional reinforcement learning: given a set of observations and actions, it tries to determine the reward, utility, or value function (treated as interchangeable here) of the agent(s) that generated the set.
An agent is able to observe another (presumably human) agent’s behaviour and update its estimate of the reward function based on that, without having direct sensory input from them. This is roughly what mirror neurons are thought to do. Technologically speaking, this is probably the most difficult part to achieve, but it is not too important for the purpose of this question.
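
To make the first condition concrete, here is a minimal sketch of reward estimation from observed behaviour. It is not any particular published IRL algorithm: it assumes a single-state problem with three actions and an agent that picks actions with probability proportional to exp(reward), and it recovers the reward by maximum likelihood. The function names (`sample_demonstrations`, `estimate_reward`) and the numbers are made up for illustration.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_demonstrations(true_reward, n, rng):
    """Simulate an agent that picks action a with probability softmax(reward)[a]."""
    probs = softmax(true_reward)
    return rng.choices(range(len(true_reward)), weights=probs, k=n)

def estimate_reward(actions, num_actions, steps=2000, lr=0.5):
    """Gradient ascent on the mean log-likelihood of the observed actions."""
    counts = [0] * num_actions
    for a in actions:
        counts[a] += 1
    freqs = [c / len(actions) for c in counts]
    reward = [0.0] * num_actions
    for _ in range(steps):
        probs = softmax(reward)
        # d(mean log-likelihood)/d(reward[a]) = observed frequency - model probability
        reward = [r + lr * (f - p) for r, f, p in zip(reward, freqs, probs)]
    # Rewards are only identifiable up to an additive constant, so centre them.
    mean = sum(reward) / num_actions
    return [r - mean for r in reward]

rng = random.Random(0)
true_reward = [1.0, 0.0, -1.0]          # hypothetical "true" preferences
demos = sample_demonstrations(true_reward, 5000, rng)
estimated = estimate_reward(demos, num_actions=3)
print(estimated)
```

The estimate only recovers the reward because the simulated agent really does follow the assumed choice model; whether real human behaviour fits any such model is exactly what is in question.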

Here is my reasoning to believe that this approximation will in fact work:

First, assume that all of the conditions above hold.

The estimated reward function is continuously updated with data from every individual it meets, using some form of weighted experience replay system so as to not overwrite previously-learned information.
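
As a rough illustration of what a “weighted experience replay system” could look like, here is a sketch under my own assumptions: observations are stored per individual, eviction removes data from whoever is most over-represented, and sampling weights each item by the inverse of its owner’s item count so that every individual contributes equally in expectation. The class name `WeightedReplayBuffer` and this particular weighting scheme are hypothetical choices, not an established design.

```python
import random
from collections import defaultdict

class WeightedReplayBuffer:
    """Replay buffer that keeps early-met individuals from being drowned out."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []                  # list of (individual_id, observation)
        self.rng = random.Random(seed)

    def _counts(self):
        """Number of stored observations per individual."""
        counts = defaultdict(int)
        for who, _ in self.items:
            counts[who] += 1
        return counts

    def add(self, individual_id, observation):
        if len(self.items) >= self.capacity:
            # Evict a random observation from the most-represented individual.
            counts = self._counts()
            biggest = max(counts, key=counts.get)
            idxs = [i for i, (who, _) in enumerate(self.items) if who == biggest]
            self.items.pop(self.rng.choice(idxs))
        self.items.append((individual_id, observation))

    def sample(self, k):
        """Draw k items with replacement; each individual has equal total weight."""
        counts = self._counts()
        weights = [1.0 / counts[who] for who, _ in self.items]
        return self.rng.choices(self.items, weights=weights, k=k)

buffer = WeightedReplayBuffer(capacity=6)
for i in range(10):
    buffer.add("alice", i)   # many observations from one person
for i in range(2):
    buffer.add("bob", i)     # few observations from another
batch = buffer.sample(4)
```

Equal weighting per individual is only one option; weighting by recency or by estimated informativeness would be just as easy to swap in.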

Given that IRL/IOC can already estimate the reward function of one agent, or even a specific class of agents such as streaked shearwater birds¹, with a sufficiently complex system this algorithm should be able to extend to complex (read: human) agents.

As the number of observations n approaches infinity (or some sufficiently large number), the estimated reward function should approach one that is a “good enough” approximation of the coherent extrapolated volition of humanity.

Note that there does not need to exist some actual reward function that is natively used by real humans, evaluated by their brain. As long as human behaviour can be sufficiently approximated by a neural network, this will hold; given the wide abilities of neural networks, from classifiers to learning agents to machine translation, I don’t see this as too much of a stretch.

However, I do anticipate certain objections to this explanation. Let me run through a few of them.

Humans are too complex to have their behaviours estimated by inverse reinforcement learning.
- This seems to me like an argument from human exceptionalism, or anthropocentrism. I don’t see any reason for this to be true. Various animals already demonstrate many behaviours considered by anthropocentrists to be unique to humans, such as tool use in various primates and birds, the ability of crows to recognise faces, and the ability of parrots to mimic speech and perform simple arithmetic. Given these examples, I don’t see a compelling anthropocentric objection to this approach.
Getting the input and output necessary to perform online (i.e. real-time) inverse reinforcement learning is infeasible.
- This is one of the most compelling counterarguments to this approach. However, I think that even if recreating “mirror neurons” (i.e. neurons that fire both when the agent does something and when it observes someone else doing it) is too difficult, another approach could be used. A sufficiently realistic VRMMORPG-like environment (Virtual Reality Massively Multiplayer Online Role-Playing Game) could be used to collect input sensory data and behaviours from players. If players are properly incentivised to act as they would in a real environment, then with a sufficient amount of pre-training, a “close-enough” approximation of the CEV should be possible.
“Close-enough” doesn’t even mean anything!
- This is also an issue, yes. There are a number of ways to define “close-enough”, but I choose to leave the choice of which up to you. Some examples are: “functionally indistinguishable”, “functionally indistinguishable within a society”, “functionally indistinguishable within an intra-societal community”, or “functionally indistinguishable within a small group”. These aren’t exhaustive, and I can see any number of ways to define “close-enough”.
What do you mean by approximating the CEV? Isn’t it by definition incomprehensible to ourselves when extrapolated so far out? Doesn’t that mean it would be impossible to approximate it from individual observations?
- This is where it gets dicey. Since we don’t know the CEV, how do we know if we have successfully approximated it? Can it even be approximated? One of the issues I thought of while writing this is that individual human behaviour may not converge to the CEV. My expectation is that as the number of humans sampled, as well as the number of samples taken from each individual, grows, the estimate approaches the volition an individual would have if they had the same resources as the entirety of the observed population. My assumption is that this is equivalent to the CEV, which may not be true.

I’d be interested to see any rebuttals to my responses to these counterarguments, as well as any counterarguments that I didn’t bring up, of which there are definitely many. Also, if I made any mistakes or if anything in this post isn’t clear, feel free to ask and I’ll clarify it.

Footnotes

¹ Hirakawa, Tsubasa, Takayoshi Yamashita, Toru Tamaki, Hironobu Fujiyoshi, Yuta Umezu, Ichiro Takeuchi, Sakiko Matsumoto, and Ken Yoda. “Can AI Predict Animal Movements? Filling Gaps in Animal Trajectories Using Inverse Reinforcement Learning.” Ecosphere 9, no. 10 (2018).


habryka 15 Apr 2019 18:25 UTC
15 points
Did you read Rohin Shah’s value learning sequence? It covers this whole area in a good amount of detail, and I think answers your question pretty straightforwardly:
Existing error models for inverse reinforcement learning tend to be very simple, ranging from Gaussian noise in observations of the expert’s behavior or sensor readings, to the assumption that the expert’s choices are randomized with a bias towards better actions.
In fact humans are not rational agents with some noise on top. Our decisions are the product of a complicated mess of interacting processes, optimized by evolution for the reproduction of our children’s children. It’s not clear there is any good answer to what a “perfect” human would do. If you were to find any principled answer to “what is the human brain optimizing?” the single most likely bet is probably something like “reproductive success.” But this isn’t the answer we are looking for.
I don’t think that writing down a model of human imperfections, which describes how humans depart from the rational pursuit of fixed goals, is likely to be any easier than writing down a complete model of human behavior.
We can’t use normal AI techniques to learn this kind of model, either — what is it that makes a model good or bad? The standard view — “more accurate models are better” — is fine as long as your goal is just to emulate human performance. But this view doesn’t provide guidance about how to separate the “good” part of human decisions from the “bad” part.
Here is a link to the full sequence: https://www.lesswrong.com/s/4dHMdK5TLN6xcqtyc
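
To illustrate why a simple error model does not pin down preferences, here is a small sketch using one common formalization of “choices randomized with a bias towards better actions”: action probabilities proportional to exp(beta * reward). The function name `boltzmann_policy` and the specific numbers are assumptions for illustration. Two very different stories about the chooser produce exactly the same behaviour, so observed behaviour alone cannot tell them apart.

```python
import math

def boltzmann_policy(rewards, beta):
    """Choice probabilities proportional to exp(beta * reward)."""
    scaled = [beta * r for r in rewards]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Story A: mild preferences, but a careful, nearly deterministic chooser.
careful = boltzmann_policy([0.1, 0.0, -0.1], beta=10.0)

# Story B: strong preferences, but a noisy, error-prone chooser.
noisy = boltzmann_policy([2.0, 0.0, -2.0], beta=0.5)

# Only the product beta * reward matters, so the two policies coincide.
print(careful)
print(noisy)
```

This is the easy case, where the error model is at least known up to one parameter; the quoted passage argues the real situation is harder, because the structure of human error is itself unknown.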
- Rohin Shah 16 Apr 2019 17:33 UTC
  3 points
  Parent
  Fwiw the quoted section was written by Paul Christiano, and I have used that blog post in my sequence (with permission).
  Also, for this particular question you can read just Chapter 1 of the sequence.
  - habryka 16 Apr 2019 17:38 UTC
    3 points
    Parent
    Ah, yes. Sorry. Should have made the authorship of that quote clearer.
- Jade Bishop 15 Apr 2019 20:03 UTC
  1 point
  Parent
  Thank you for your feedback! I haven’t read this yet, but it comes pretty close to a discussion I had with a friend over this post.
  
  Essentially, her argument started with a simple counterexample: she bought peanut M&Ms when she didn’t want to, and didn’t realise she was doing it until afterwards. In an earlier, similar situation, where she was hungry and in the same place, she had wanted peanut M&Ms to satisfy her hunger, but this time she didn’t want them. She knew she didn’t want peanut M&Ms, and didn’t consciously decide to get them against that want; in this sense, I think a parallel can be drawn with akrasia, where rationality alone isn’t enough.
  
  Her point was this: there has to be a line drawn between “intentional conscious action” and “the result of a complex system of interacting parts that puppets the meat sack that holds our brain, sometimes in ways we don’t intend.” On a base level, this could result in, say, an AI that acts like a normal human but sometimes buys peanut M&Ms against its own volition. On an agent-based level where an AI is no more or less capable than a human, this isn’t much of an issue, and such things could make individual AI agents more convincing.
  
  But if you want to make a superintelligent AI to run your ideal utopia, you don’t want it to decide to feed everyone peanut M&Ms against their will on a whim.
  
  The biggest issue is that we can’t determine the difference between “intentional action” and “unintentional response”. If we could, it would (according to her) be trivial to find out what the CEV of humanity is, no estimation needed.
  
  My largest assumption was that the lowest common denominator of human behaviour is “principled reasoning in pursuit of fixed, though unstated, goals”. More realistically, as another friend (and the post you linked) pointed out, the lowest common denominator of human behaviour is going to be “reproduce”, which has very unfortunate implications for the Friendliness of this hypothetical agent.
  
  A number of things could be done to ameliorate this, such as not including any means to reproduce or any data supporting reproduction in the trajectories, but they all seem inadequate or ad hoc. I don’t want to staple together a bunch of things I barely understand and declare it the Solution To AI (not that I was attempting to do that, anyway), especially when the issue isn’t necessarily with the technology and theory. As the peanut-M&M-purchasing friend put it, the technology is sufficient but this post overestimates humans. This wasn’t what I expected to have an issue with, and it shifts the problem from “improve technology and theories” to… what, “improve humans”? I’m at a loss as to where to go from here; inverse reinforcement learning has a demonstrable use case and benefits, but the data is… not good. Garbage in, garbage out. Is it really possible to improve human behaviour (or our analysis/collection of human behaviour) to achieve better results?
  - Rohin Shah 16 Apr 2019 17:35 UTC
    3 points
    Parent
    There’s a lot of speculation about related-ish topics in Chapter 3 of the sequence linked above.
