We don’t need a special module to get an everyday definition of doorknobs, and likewise I don’t think we need a special module to get an everyday definition of human motivation.
I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam’s razor paper is that the goals of an agent are not something that you can know without putting your own interpretation on it—even if you know every physical fact about the universe. And two very different interpretations can be equally valid, with no way of distinguishing between them.
(I like the anthropomorphising/dehumanising symmetry, but I’m focusing on the aspects of dehumanising that cause you to make errors of interpretation. For example, out-groups are perceived as being coherent, acting in concert without disagreements, and often being explicitly evil. This is an error, not just a reduction in social emotions)
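To make the underdetermination point concrete, here is a toy sketch (my own construction with made-up numbers, just in the spirit of the degenerate pairs discussed in the paper): a fully rational planner pursuing a reward R and a fully anti-rational planner paired with −R produce exactly the same behaviour, so no amount of behavioural data distinguishes the two interpretations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
reward = rng.normal(size=(n_states, n_actions))  # a made-up "true" reward table

def greedy_policy(planner_sign, R):
    # A "planner" that is fully rational (sign = +1) or fully anti-rational
    # (sign = -1): it picks the action maximising sign * R in each state.
    return np.argmax(planner_sign * R, axis=1)

# Interpretation A: a rational agent pursuing reward R.
policy_a = greedy_policy(+1, reward)
# Interpretation B: an anti-rational agent pursuing reward -R.
policy_b = greedy_policy(-1, -reward)

# The two interpretations ascribe opposite goals, yet the observed behaviour
# is identical.
assert np.array_equal(policy_a, policy_b)
```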
It’s your first day working at the factory, and you’re assigned to shadow Alice as she monitors the machines on the line. She walks over to the Big Machine and says, “Looks like it’s flooping again,” whacks it, and then says “I think that fixed it”. This happens a few times a day perpetually. Over time, you learn what flooping is, kinda. When the Big Machine is flooping, it usually (but not always) makes a certain noise, it usually (but not always) has a loose belt, and it usually (but not always) has a gear that shifted out of place. Now you know what it means for the Big Machine to be flooping, although there are lots of edge cases where neither you nor Alice has a good answer for whether or not it’s truly flooping, vs sorta flooping, vs not flooping.
By the same token, you could give some labeled examples of “wants to take a walk” to the aliens, and they can find what those examples have in common and develop a concept of “wants to take a walk”, albeit with edge cases.
Then you can also give labeled examples of “wants to braid their hair”, “wants to be accepted”, etc., and after enough cycles of this, they’ll get the more general concept of “want”, again with edge cases.
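Here’s a minimal sketch of what I have in mind (everything here is made up for illustration: the features, the labels, and the plain logistic regression stand in for whatever learning machinery the aliens actually have). The point is just that a handful of labeled examples yields a fuzzy category, with edge cases coming out as intermediate scores rather than crisp answers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical behavioural features: [put on shoes, looked at the door, said "walk"]
X = np.array([
    [1, 1, 1],   # clearly wants to take a walk
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],   # clearly doesn't
    [0, 1, 0],
    [0, 0, 1],
])
y = np.array([1, 1, 1, 0, 0, 0])  # the labels the aliens are given

concept = LogisticRegression().fit(X, y)

# An edge case gets an intermediate probability rather than a crisp yes/no.
print(concept.predict_proba([[1, 0, 0]])[0, 1])
```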
I don’t think I’m saying anything that goes against your Occam’s razor paper. As I understood it (and you can correct me!!), that paper was about fitting observations of humans to a mathematical model of “boundedly-rational agent pursuing a utility function”, and proved that there’s no objectively best way to do it, where “objectively best” includes things like fidelity and simplicity. (My perspective on that is, “Well yeah, duh, humans are not boundedly-rational agents pursuing a utility function! The model doesn’t fit! There’s no objectively best way to hammer a square peg into a round hole! (ETA: the model doesn’t fit except insofar as the model is tautologically applicable to anything)”)
I don’t see how the paper rules out the possibility of building an unlabeled predictive model of humans, and then getting a bunch of examples labeled “This is human motivation”, and building a fuzzy concept around those examples. The more labeled examples there are, the more tolerant you are of different inductive biases in the learning algorithm. In the limit of astronomically many labeled examples, you don’t need a learning algorithm at all, it’s just a lookup table.
This procedure has nothing to do with fitting human behavior into a model of a boundedly-rational agent pursuing a utility function. It’s just an effort to consider all the various things humans do with their brains and bodies, and build a loose category in that space using supervised learning. Why not?
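As a sketch of the two-step procedure I’m imagining (all the specific components here—PCA, a nearest-neighbour classifier, random data—are placeholders for illustration, not a claim about how it would really be done):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Step 1: a stand-in for the unlabeled predictive model of humans -- here just
# PCA over fake behavioural records; a real system would be a far richer model.
behaviour_records = rng.normal(size=(1000, 50))
world_model = PCA(n_components=10).fit(behaviour_records)

# Step 2: a handful of records labeled "this is / isn't an instance of human
# motivation"; ordinary supervised learning builds a loose category around
# them in the model's representation space.
labelled_records = behaviour_records[:40]
labels = rng.integers(0, 2, size=40)  # placeholder labels
concept = KNeighborsClassifier(n_neighbors=5).fit(
    world_model.transform(labelled_records), labels)

# With astronomically many labeled examples this degenerates into a lookup
# table, and the choice of inductive bias stops mattering.
new_record = rng.normal(size=(1, 50))
print(concept.predict_proba(world_model.transform(new_record)))
```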
It was “any sort of agent pursuing a reward function”.
Sorry for the stupid question, but what’s the difference between “boundedly-rational agent pursuing a reward function” and “any sort of agent pursuing a reward function”?
A boundedly-rational agent is assumed to be mostly rational, falling short of full rationality only because it fails to figure things out in enough detail.
Humans are occasionally rational, often biased, often inconsistent; they sometimes consciously act against their own best interests, often follow heuristics without thinking, and sometimes do think things through. This doesn’t seem to correspond to what is normally understood as “boundedly-rational”.
Gotcha, thanks. I have corrected my comment two above by striking out the words “boundedly-rational”, but I think the point of that comment still stands.