dxu comments on Alignment By Default

dxu 16 Aug 2020 16:18 UTC
6 points
0
I like this post a lot, and I think it points out a key crux between what I would term the “Yudkowsky” side (which seems to mostly include MIRI, though I’m not too sure about individual researchers’ views) and “everybody else”.

In particular, the disagreement seems to crystallize over the question of whether “human values” really are a natural abstraction. I suspect that if Eliezer thought that they were, he would be substantially less worried about AI alignment than he currently is (though naturally all of this is my read on his views).

You do provide some reasons to think that human values might be a natural abstraction, both in the post itself and in the comments, but I don’t see these reasons as particularly compelling ones. The one I view as the most compelling is the argument that humans seem to be fairly good at identifying and using natural abstractions, and therefore any abstract concept that we seem to be capable of grasping fairly quickly has a strong chance of being a natural one.

However, I think there’s a key difference between abstractions that are developed for the purposes of prediction, and abstractions developed for other purposes (by which I mostly mean “RL”). To the extent that a predictor doesn’t have sufficient computational power to form a low-level model of whatever it’s trying to predict, I definitely think that the abstractions it develops in the process of trying to improve its prediction will to a large extent be natural ones. (You lay out the reasons for this clearly enough in the post itself, so I won’t repeat them here.)

It seems to me, though, that if we’re talking about a learning agent that’s actually trying to take actions to accomplish things in some environment, there’s a substantial amount of learning going on that has nothing to do with learning to predict things with greater accuracy! The abstractions learned in order to select actions from a given action-space in an attempt to maximize a given reward function—these, I see little reason to expect will be natural. In fact, if the computational power afforded to the agent is good but not excellent, I expect mostly the opposite: a kludge of heuristics and behaviors meant to address different subcases of different situations, with not a whole lot of rhyme or reason to be found.

As agents go, humans are definitely of the latter type. And, therefore, I think the fact that we intuitively grasp the concept of “human values” isn’t necessarily an argument that “human values” are likely to be natural, in the way that it would be for e.g. trees. The concept of a tree would have been developed as a predictive abstraction, whereas the concept of “human values” seems to mainly consist of what I’ll term a reward abstraction. And it’s quite plausible to me that reward abstractions are only legible by default to agents which implement that particular reward abstraction, and not otherwise. If that’s true, then the fact that humans know what “human values” are is merely a consequence of the fact that we happen to be humans, and therefore have a huge amount of mind-structure in common.

To the extent that this is comparable to the branching pattern of a tree (which is a comparison you make in the post), I would argue that it increases rather than lessens the reason to worry: much like a tree’s branch structure is chaotic, messy, and overall high-entropy, I expect human values to look similar, and therefore not really encompass any kind of natural category.
- johnswentworth 16 Aug 2020 21:49 UTC
  2 points
  0
  > To the extent that this is comparable to the branching pattern of a tree (which is a comparison you make in the post), I would argue that it increases rather than lessens the reason to worry: much like a tree’s branch structure is chaotic, messy, and overall high-entropy, I expect human values to look similar, and therefore not really encompass any kind of natural category.
  Bit of a side-note, but the high entropy of tree branching comes from trees using the biological equivalent of random number generators when “deciding” when/whether to form a branch. The distribution of branch length-ratios/counts/angles is actually fairly simple and stable, and is one of the main characteristics which makes particular tree species visually distinctive. See L-systems for the basics, or SpeedTree for the industrial-grade version (and some really beautiful images).
  It’s that distribution which is the natural abstraction—i.e. the distribution summarizes information about branching which is relevant to far-away trees of the same species.
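
A rough way to see the point about stable distributions is a toy stochastic branching process, loosely in the spirit of a stochastic L-system. This is only an illustrative sketch, not how any real tree or SpeedTree works: the function names, the child-count rule, and the 0.7 mean length ratio are all made-up assumptions. Each seed produces a different individual tree, while the summary statistics of the branching stay close to the parameters of the shared rule.

```python
import random
import statistics


def grow_tree(rng, depth=0, max_depth=7, length=1.0):
    """Grow one random tree; return a list of (parent_length, child_length) pairs.

    Hypothetical toy rule (an assumption for illustration only): every branch
    at a depth below max_depth spawns 2 or 3 children, and each child's length
    is the parent's length times a ratio drawn around 0.7.
    """
    pairs = []
    if depth >= max_depth:
        return pairs
    for _ in range(rng.choice([2, 2, 3])):
        # Clamp the ratio so child branches stay shorter than their parent.
        ratio = min(0.95, max(0.3, rng.gauss(0.7, 0.1)))
        child = length * ratio
        pairs.append((length, child))
        pairs.extend(grow_tree(rng, depth + 1, max_depth, child))
    return pairs


def summarize(pairs):
    """Return (branch count, mean child/parent length ratio) for one tree."""
    ratios = [child / parent for parent, child in pairs]
    return len(pairs), statistics.mean(ratios)


# Five "trees of the same species": same rule, different random seeds.
trees = [grow_tree(random.Random(seed)) for seed in range(5)]
summaries = [summarize(tree) for tree in trees]

for seed, (count, mean_ratio) in enumerate(summaries):
    print(f"tree {seed}: {count} branches, mean length ratio {mean_ratio:.3f}")
```

In this sketch the exact list of branches differs from tree to tree (the high-entropy part), but the mean length ratio is nearly the same for every tree. That shared statistic plays the role of the information about one tree that is relevant to far-away trees of the same species.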
- johnswentworth 16 Aug 2020 21:15 UTC
  2 points
  0
  I think there’s a subtle confusion here between two different claims:
  - Human values evolved as a natural abstraction of some territory.
  - Humans’ notion of “human values” is a natural abstraction of humans’ actual values.
  It sounds like your comment is responding to the former, while I’m claiming the latter.
  A key distinction here is between humans’ actual values, and humans’ model/notion of our own values. Humans’ actual values are the pile of heuristics inherited from evolution. But humans also have a model of their values, and that model is not the same as the underlying values. The phrase “human values” necessarily points to the model, because that’s how words work—they point to models. My claim is that the model is a natural abstraction of the actual values, not that the actual values are a natural abstraction of anything.
  This is closely related to this section from the OP:
  > Human values are basically a bunch of randomly-generated heuristics which proved useful for genetic fitness; why would they be a “natural” abstraction? But remember, the same can be said of trees. Trees are a complicated pile of organic spaghetti code, but “tree” is still a natural abstraction, because the concept summarizes all the information from that organic spaghetti pile which is relevant to things far away. In particular, it summarizes anything about one tree which is relevant to far-away trees.
  Roughly speaking, the concept of “human values” summarizes anything about the values of one human which is relevant to the values of far-away humans.
  Does that make sense?