I do think these are better quotes. It’s possible that there was some update here between 2008 and 2013 (roughly when I started seeing the more live discussion happening), since I do really remember “the problem is not getting the AI to understand, but to care” being a common refrain even back then (e.g. see the Robby post I linked).
I claim that this paragraph didn’t age well in light of the deep learning revolution: “running a neural network [...] over a set of winning and losing sequences of chess moves” basically is how AlphaZero learns from self-play!
I agree that this paragraph aged less well than other paragraphs, though I do think it is still correct (Edit: Eh, it might be wrong; it depends a bit on how similar the neural networks of the 50s are to today’s). It sure did turn out to be correct by a narrower margin than Eliezer probably thought at the time, but my sense is it’s still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and the follow-up systems were trained by pure self-play.
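To make concrete what I have in mind by “a straightforward neural net trained on winning and losing chess moves” (this is purely my own illustrative sketch, not something from Eliezer’s post or from DeepMind’s papers; the board encoding, layer sizes, and training setup are made-up placeholders), the naive version of the idea would look roughly like this:

```python
# Illustrative sketch only (hypothetical setup): a plain feed-forward net that
# classifies chess positions as "came from a won game" vs. "came from a lost game",
# which is roughly the naive reading of training on winning and losing sequences
# of moves. Note what is missing: there is no search and no self-play loop here,
# whereas AlphaZero trains its network on targets produced by MCTS self-play.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder encoding: 8x8 board, 12 piece types, one-hot planes flattened.
        self.net = nn.Sequential(
            nn.Linear(8 * 8 * 12, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 1),  # logit for "this position came from a winning game"
        )

    def forward(self, boards):
        return self.net(boards)

def train_step(model, optimizer, boards, game_outcomes):
    """boards: (batch, 768) float tensor; game_outcomes: (batch, 1) floats in {0, 1}."""
    logits = model(boards)
    loss = F.binary_cross_entropy_with_logits(logits, game_outcomes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = PositionClassifier()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    boards = torch.rand(32, 8 * 8 * 12)              # placeholder batch of encoded positions
    outcomes = torch.randint(0, 2, (32, 1)).float()  # 1 = from a won game, 0 = from a lost game
    print(train_step(model, optimizer, boards, outcomes))
```

To actually play, you would then score each legal move by evaluating the resulting position with this classifier and picking the best one, and the question in dispute is how strong that gets you without any search on top.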
But in any case, I think your basic point of “Eliezer did not predict the Deep Learning revolution as it happened” here is correct, though I don’t think this specific paragraph has a ton of relevance to the discussion at hand.
The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like “good”, that can’t be fully delineated in any training data you can give the AI during its childhood.
I do think this paragraph seems like a decent quote, though I think at this point it makes sense to break it out into different pieces.
I think Eliezer is saying that what matters is whether we can point the AI at what we care about “during its childhood”, i.e. during relatively early training, before it has already developed a bunch of proxy objectives from its training.
I think the key question about the future that Eliezer was opining on is then whether, by the time we expect AIs to actually be able to have a close-to-complete understanding of what we mean by “goodness”, we still have any ability to shape their goals.
My model is that indeed, Eliezer was surprised, as I think most people were, that AIs of 2022 are as good at picking up complicated concept boundaries and learning fuzzy human concepts as they are, while still being quite incompetent at many other tasks. However, the statement that “AIs of 2022 basically understand goodness, or at least will soon enough understand goodness while we are still capable of meaningfully changing their goals” strikes me as highly dubious, and the basic arguments for thinking that this capability will come only after the AI has reached a capability level where we have little ability to shape its goals still seem correct to me, and like, one of the primary reasons for doom.
The reason why it still seems substantially out of AIs’ reach is that our values do indeed seem quite fragile and to change substantially on reflection, such that it’s currently out of the reach of even a very smart human to fully understand what we mean by “goodness”.
Eliezer talks about this in the comment section you linked (actually, a great comment section between Eliezer and Shane Legg that I found quite insightful to read and am glad to have stumbled upon):
A moderately strong and unFriendly intelligence, operating in the current world without yet having replaced that world with paperclips, would certainly find it natural to form the category of “Things that (some) humans approve of”, and contrast it to “Things that will trigger a nuclear attack against me before I’m done creating my own nanotechnology.” But this category is not what we call “morality”. It naturally—from the AI’s perspective—includes things like bribes and deception, not just the particular class of human-approval-eliciting phenomena that we call “moral”.
Is it worth factoring out phenomena that elicit human feelings of righteousness, and working out how (various) humans reason about them? Yes, because this is an important subset of ways to persuade the humans to leave you alone until it’s too late; but again, that natural category is going to include persuasive techniques like references to religious authority and nationalism.
But what if the AI encounters some more humanistic, atheistic types? Then the AI will predict which of several available actions is most likely to make an atheistic humanist human show sympathy for the AI. This naturally leads the AI to model and predict the human’s internal moral reasoning—but that model isn’t going to distinguish anything along the lines of moral reasoning the human would approve of under long-term reflection, or moral reasoning the human would approve knowing the true facts. That’s just not a natural category to the AI, because the human isn’t going to get a chance for long-term reflection, and the human doesn’t know the true facts.
The natural, predictive, manipulative question, is not “What would this human want knowing the true facts?”, but “What will various behaviors make this human believe, and what will the human do on the basis of these various (false) beliefs?”
In short, all models that an unFriendly AI forms of human moral reasoning, while we can expect them to be highly empirically accurate and well-calibrated to the extent that the AI is highly intelligent, would be formed for the purpose of predicting human reactions to different behaviors and events, so that these behaviors and events can be chosen manipulatively.
But what we regard as morality is an idealized form of such reasoning—the idealized abstracted dynamic built out of such intuitions. The unFriendly AI has no reason to think about anything we would call “moral progress” unless it is naturally occurring on a timescale short enough to matter before the AI wipes out the human species. It has no reason to ask the question “What would humanity want in a thousand years?” any more than you have reason to add up the ASCII letters in a sentence.
Now it might be only a short step from a strictly predictive model of human reasoning, to the idealized abstracted dynamic of morality. If you think about the point of CEV, it’s that you can get an AI to learn most of the information it needs to model morality, by looking at humans—and that the step from these empirical models, to idealization, is relatively short and traversable by the programmers directly or with the aid of manageable amounts of inductive learning. Though CEV’s current description is not precise, and maybe any realistic description of idealization would be more complicated.
But regardless, if the idealized computation we would think of as describing “what is right” is even a short distance of idealization away from strictly predictive and manipulative models of what humans can be made to think is right, then “actually right” is still something that an unFriendly AI would literally never think about, since humans have no direct access to “actually right” (the idealized result of their own thought processes) and hence it plays no role in their behavior and hence is not needed to model or manipulate them.
Which is to say, an unFriendly AI would never once think about morality—only a certain psychological problem in manipulating humans, where the only thing that matters is anything you can make them believe or do. There is no natural motive to think about anything else, and no natural empirical category corresponding to it.
I think this argument is basically correct, and indeed, while current systems definitely are good at having human abstractions, I don’t think they really are anywhere close to having good models of the results of our coherent extrapolated volition, which is what Eliezer is talking about here. (To be clear, I do also separately think that LLMs are thinking about concepts for reasons other than deceiving or modeling humans, though like, I don’t think this changes the argument very much. I don’t think LLMs care very much about thinking carefully about morality, because it’s not very useful for predicting random internet text.)
I think separately, there is a different, indirect normativity approach that starts with “look, yes, we are definitely not going to get the AI to understand what our ultimate values are before the end, but maybe we can get it to understand a concept like ‘being conservative’ or ‘being helpful’ in enough detail that we can use it to supervise smarter AI systems, and then bootstrap ourselves into an aligned superintelligence”.
And I think indeed that plan looks better now than it likely looked to Eliezer in 2008, but I do want to distinguish it from the things that Eliezer was arguing against at the time, which were not about learning approaches to indirect normativity, but were arguments about how the AI would just learn all of human values by being pointed at a bunch of examples of good things and bad things, which still strikes me as extremely unlikely.
it’s still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and the follow-up systems were trained by pure self-play.

AlphaGo without the MCTS was still pretty strong:
We also assessed variants of AlphaGo that evaluated positions using just the value network (λ = 0) or just rollouts (λ = 1) (see Fig. 4b). Even without rollouts AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go.
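(For context on the λ in that quote, as I understand the paper: AlphaGo evaluates a search-tree leaf by mixing the value network’s estimate of the position with the result of a fast rollout played out from it, roughly V(leaf) = (1 − λ) · v_θ(leaf) + λ · z_rollout, so λ = 0 means relying only on the value network and λ = 1 means relying only on rollouts. This is my paraphrase of the paper’s leaf-evaluation formula, so don’t treat the notation as exact.)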
Even with just the policy network and no search at all, it could play at a solid amateur level:
We evaluated the performance of the RL policy network in game play, sampling each move...from its output probability distribution over actions. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi, a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi.
I may be misunderstanding this, but it sounds like the network that did nothing but get good at guessing the next move in professional games was able to play at roughly the same level as Pachi, which, according to DeepMind, had a rank of 2d.
Yeah, I mean, to be clear, I do definitely think you can train a neural network to somehow play chess via nothing but classification. I am not sure whether you could do it with a feed-forward neural network, and it’s a bit unclear to me whether the neural networks of the 50s are the same thing as the neural networks of the 2000s, but it sure does seem like you can just throw a magic category absorber at chess and then have it play OK chess.
My guess is that modern networks are not meaningfully more complicated, and that the difference from back then was indeed just scale and a few tweaks, but I am not super confident and haven’t looked much into the history here.
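To gesture at what I mean by “just scale and a few tweaks” (purely my own illustrative sketch, with made-up layer sizes, and not a claim about any specific historical system):

```python
# Illustrative only: a 1950s-perceptron-flavored model next to a modern-flavored MLP.
# The point is that the modern one is mostly the same kind of object, just deeper,
# wider, and with a few tweaks (ReLU non-linearities, end-to-end gradient training),
# rather than a fundamentally different architecture.
import torch.nn as nn

# Perceptron flavor: a single linear map. (The original perceptron then applied a
# hard threshold to this output at prediction time and was trained with the
# perceptron update rule rather than backprop.)
perceptron_style = nn.Linear(in_features=784, out_features=1)

# Modern flavor: the same kind of linear maps, stacked and interleaved with ReLUs,
# and trained end-to-end with backpropagation and a modern optimizer.
modern_mlp = nn.Sequential(
    nn.Linear(784, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10),
)
```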