FYI, I’ve been thinking about this, and I’ve noted something similar here.
I’m not really sure what to say about the “why would you think the default starting point is aligned” question. What I wonder about is whether there is a way to reliably gain strong evidence that an increasingly misaligned nature is developing through training.
On another note, my understanding is partly informed by this Twitter comment by Eliezer:
Humans doing human psychology will look at somebody lounging listlessly on a sofa and think, “Huh, that person there doesn’t seem very ambitious; I bet they’re not that dangerous.” They’re talking about a real thing in the space of human psychology, but unfortunately that real thing does not map onto math in any simple way.
The sofa human, if we imagine for a moment that we’re talking in 1990 before the age of Google Maps, might hear about a new comic-book store and successfully plot their way across town on a previously untaken route, in order to buy a new kind of strategic board game, which they learn to play that night even though they’ve never played it before, and then they challenge one of their friends and win. There’s all kinds of puzzles the sofa human could solve which a chimpanzee could not, involving means-end reasoning, forward chaining and backchaining meeting in the middle, learning new categories about tactics that work or don’t work...
And yet the sofa human seems so soft and safe and unambitious! You can get a bunch of minimum-wage labor out of them, and they don’t try to take over the world at *all*. They don’t even talk about *wanting* to take over the world, except insofar as impotent national-politics gabble is a behavior they’ve learned to imitate from other humans. “If only our AIs could be like this!” some people think.
And there are really so many, many things going on here. I am not sure where I ought to start. I will start somewhere anyways.
The sofa human has been entrained, on a sub-evolutionary timescale, by intrinsic brain rewards, by externally stimulated punishments, to have been rewarded on past occasions for using means-end reasoning on playing chess, but not for using means-end reasoning on tasks similar to “taking over the world”. They can’t, in fact, take over the world, and smaller tasks in the same sequence, like becoming Mayor of Oakland or Governor of California, are also unrewarding to them. This isn’t some deep category written on the structure of stars, but it’s a natural category to *you*, who is also human, so it’s not surprising that the description of what the sofa human has and hasn’t learned to think about has a short description in your own native mental language, and that you can do a good job of predicting them using that description. It’s not a sofa *alien*.
It happens, even, that the board game is *about* taking over the world—or a rather simple logical structure meant to mimic that, under some hypothetical circumstances—and the sofadweller sure is coming up with some clever tactics in that board game! Weird, huh?
Already we have several important observations, here:
- It’s not that the sofadweller lacks the *underlying basic cognitive machinery* to do general means-end reasoning on the particular topic of “world takeovers”. There’s a surface-level learned behavior not to *use* the general machinery for that specific topic. You can ask them to play a board game about it and they’ll do that.
- It’s not like the sofadweller is way smarter than you and thinks much faster than you and was faced with an actual opportunity to solve their comic-book-related problems by taking over the world as an intermediate step, which they then very corrigibly turned down. It’s not like they were *offered* rulership of the Earth and dominion of the galaxy, via some clearly visible pathway, and turned it down.
- Your ability to describe the sofadweller in simple-sounding standard humanese words like “unambitious” and get out nice useful predictions, possibly has something to do with you two not being utterly alien minds relative to each other.