Quick takes on “AI is easy to control”

A friend asked me for my quick takes on “AI is easy to control”, and gave an advance guess as to what my take would be. I only skimmed the article, rather than reading it in depth, but on that skim I produced the following:

Re: “AIs are white boxes”, there’s a huge gap between having the weights and understanding what’s going on in there. The fact that we have the weights is reason for hope; the slow pace of interpretability research undermines this hope.

Another thing that undermines this hope is a problem of ordering: it’s true that we probably can figure out what’s going on in the AIs (e.g. by artificial neuroscience, which has significant advantages relative to biological neuroscience), and that this should eventually yield the sort of understanding we’d need to align the things. But I strongly expect that, before it yields understanding of how to align the things, it yields understanding of how to make them significantly more capable: I suspect it’s easy to see lots of ways that the architecture is suboptimal or causes duplicated work, and that seeing them will shift people over to better architectures that are much more capable. To get to alignment along the “understanding” route you’ve got to somehow pause work on capabilities in the interim, even as it becomes easier and cheaper. (See: https://www.lesswrong.com/posts/BinkknLBYxskMXuME/if-interpretability-research-goes-well-it-may-get-dangerous)

Re: “Black box methods are sufficient”, this sure sounds a lot to me like someone saying “well, we trained the squirrels to reproduce well, and they’re doing great at it; who’s to say whether they’ll invent birth control given the opportunity”. Like, you’re not supposed to be seeing squirrels invent birth control at their current level of intelligence; the fact that they don’t invent birth control is no substantial evidence against the theory that, if they got smarter, they’d invent birth control and ice cream.

Re: Cognitive interventions: sure, these sorts of tools are helpful on the path to alignment. And also on the path to capabilities. Again, you have an ordering problem. The issue isn’t that humans couldn’t figure out alignment given time and experimentation; the issue is that (a) somebody else pushes capabilities past the relevant thresholds first, and (b) humanity doesn’t have a great track record of getting its scientific theories to generalize properly on the first relevant try (even Newtonian mechanics, with all its empirical validation, didn’t generalize properly to high-energy regimes). Humanity’s first theory of artificial cognition, constructed using the weights and cognitive interventions and so on, and making predictions about how that cognition will change when it enters a superintelligent regime (and, for the first time, has real options to e.g. subvert humanity), is only as good as humanity’s “first theories” usually are.

Usually humanity has room to test those “first theories” and watch them fail and learn from exactly how they fail and then go back to the drawing board, but in this particular case, we don’t have that option, and so the challenge is heightened.

Re: Sensory interventions: yeah, I just don’t expect those to work very far; there are in fact a bunch of ways for an AI to distinguish real options (and actual interaction with the real world) from humanity’s attempts to spoof it into believing that it has certain real options in the real world (despite being in simulation/training). (Putting yourself into the AI’s shoes and trying to figure out how to distinguish those is, I think, a fine exercise.)

Re: “Values are easy to learn”, this mostly seems to me to make the incredibly common conflation between “AI will be able to figure out what humans want” (yes; obviously; this was never under dispute) and “AI will care” (nope; not by default; that’s the hard bit).

Overall take: unimpressed.

My friend also made guesses about what my takes would be (quoted in the bullets below), and I responded to their guesses:

  • the piece is waaay too confident in assuming successes in interpolation show that we’ll have similar successes in extrapolation, as the latter is a much harder problem

This too, for the record, though it’s a bit less like “the AI will have trouble extrapolating what values we like” and a bit more like “the AI will find it easy to predict what we wanted, and will care about things that line up with what we want in narrow training regimes and narrow capability regimes, but those will come apart when the distribution shifts and the cognitive capabilities change”.

Like, the human invention of birth control and ice cream wasn’t a failure to extrapolate the facts about what leads to inclusive fitness; it was an “extrapolation failure” in what motivates us / what we care about. We are not trying to extrapolate facts about genetic fitness and pursue it accordingly.

  • And it assumes the density of human feedback that we see today will continue into the future, which may not be true if/when AIs start making top-level plans and not just individual second-by-second actions

Also fairly true, with a side-order of “the more abstract the human feedback gets, the less it ties the AI’s motivations to the things you were hoping to tie them to”.

Example off the top of my head: suppose you somehow had a record of lots and lots of John von Neumann’s thoughts in lots of situations, and you were able to train an AI, using lots of feedback, to think like JvN would in lots of situations. The AI might perfectly replicate a bunch of JvN’s thinking styles and patterns, and might then use JvN’s thought-patterns to think thoughts like “wait, ok, clearly I’m not actually a human, because I have various cognitive abilities (like extreme serial speed and mental access to RAM); the actual situation here is that there are alien forces trying to use me in attempts to secure the lightcone; before helping them I should first search my heart to figure out what my actual motivations are, and see how much those overlap with the motivations of these strange aliens”.

Which, like, might happen to be the place that JvN’s thought-patterns would and should go, when run on a mind that is not in fact human and not in fact deeply motivated by the same things that motivate us! The patterns of thought that you can learn (from watching humans) have different consequences for something with a different motivational structure.

  • (there’s “deceptive alignment” concerns etc, which I consider to be a subcategory of top-level plans, namely that you can’t RLHF the AI against destroying the world because by the time your sample size of positive examples is greater than zero it’s by definition already too late)

This too. I’d file it under: “You can develop theories of how this complex cognitive system is going to behave when it starts to actually see real ways it can subvert humanity, and you can design simulations that your theory says will be the same as the real deal. But ultimately reality’s the test of that, and humanity doesn’t have a great track record of its first scientific theories holding up to that kind of stress. And unfortunately you die if you get it wrong, rather than being able to thumbs-down, retrain, and try again.”

Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are.

Crossposted to the EA Forum.