• if you cannot agree on removing dishonest agents or practices from your own group

What group, though? I’m not aware of Sam Bankman-Fried having posted on Less Wrong (a website for hosting blog posts on the subject matter of human rationality). If he did write misleading posts or comments on this website, we should definitely downvote them! If he didn’t, why is this our problem?

(That is to say less rhetorically, why should this be our problem? Why can’t we just be a website where anyone can post articles about probability theory or cognitive biases, rather than an enforcement arm of the branded “EA” movement, accountable for all its sins?)

• An Eliezerism my Eliezer-model

Who? I liked this post (I had heard of Moore’s paradox, but hadn’t thought about how it generalizes), but this unexplained reference is confusing. (The only famous person with that name I can find on Wikipedia is the Tamil mathematician C. J. Eliezer, but I can’t figure out why his work would be relevant in this context.)

This is really misunderstanding what Eliezer is saying here [...] it’s been a decade of explaining to people almost once every two weeks that “yes, the AI will of course know what you care about, but it won’t care”, so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me

I think this is much more ambiguous than you’re making it out to be. In 2008’s “Magical Categories”, Yudkowsky wrote:

I shall call this the fallacy of magical categories—simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate “winning” sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.

I claim that this paragraph didn’t age well in light of the deep learning revolution: “running a neural network [...] over a set of winning and losing sequences of chess moves” basically is how AlphaZero learns from self-play! As the Yudkowsky quote illustrates, it wasn’t obvious in 2008 that this would work: given what we knew before seeing the empirical result, we could imagine that we lived in a “computational universe” in which the neural network’s generalization from “self-play games” to “games against humans or traditional chess engines” worked less well than it did in the actual computational universe.

Yudkowsky continued:

The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like “good”, that can’t be fully delineated in any training data you can give the AI during its childhood.

This would seem to contradict “of course the AI will know, but it won’t care”? “The real problem [...] is one of communication” seems to amount to the claim that the AI won’t care because it won’t know: if you can’t teach “goodness” from labeled data, your AI will search for plans high in something-other-than-goodness, which will kill you at sufficiently high power levels.

But if it turns out that you can teach goodness from labeled data—or at least, if you can get a much better approximation than one might have thought possible in 2008—that would seem to present a different strategic picture. (I’m not saying alignment is easy and I’m not saying humanity is going to survive, but we could die for somewhat different reasons than some blogger thought in 2008.)

• we should see our odds of alignment being close to the knife’s edge, because those are the situations that require the most computation-heavy simulations to determine the outcome of

No, because “successfully aligned” is a value-laden category. We could be worth simulating if our success probability is close to zero, but there’s a lot of uncertainty over which unaligned-with-us superintelligence we create.

something trained on human culture might retain at least a tiny bit of compassion on reflection

This depends on where “compassion” comes from. It’s not clear that training on data from human culture gets you much in the way of human-like internals. (Compare: contemporary language models know how to say a lot about “happiness”, but it seems very dubious that they feel happiness themselves.)

when it does go wrong you’re less likely to notice than with a hand-coded utility/reward [...] RLHF makes those visible failures less likely

Because it incentivizes learning human models which can then be used to be more competently deceptive, or just because once you’ve fixed the problems you know how to notice, what’s left are the ones you don’t know how to notice? The latter doesn’t seem specific to RLHF (you’d have the same problem if people magically got better at hand-coding rewards), but I see how the former is plausible and bad.

RLHF (which are very obviously not even trying to solve any of the problems which could plausibly kill us)

Sorry for being dumb, but I thought the naïve case for RLHF is that it helps solve the problem of “people are very bad at manually writing down an explicit utility or reward function that does what they intuitively want”? Does that not count as one of the lethal problems (even if RLHF alone would kill us because of the other problems)? If one of the other problems is Goodharting/unforeseen-maxima, it seems like RLHF could be helpful insofar as if RLHF rewards are quantitatively less misaligned than hand-coded rewards, you can get away with optimizing them harder before they kill you?

• merge-and-assist clause [...] we commit to stop competing with and start assisting this project

So, if you don’t think AI should be open (because that looks dangerous), has anyone considered just … changing the name? (At least, the name of the organization, even if the “OpenAI API” as a product has the string openai embedded in the code too much.) Yeah, it’s inconvenient, but … Alphabet did it! Meta did it! If you’re trying to make the most important event in the history of life go well, isn’t it worth a little inconvenience to be clear about what that entails?

• I’m advocating for people to stop talking/thinking as though post-AGI life is a different magisterium from pre-AGI life

Seems undignified to pretend that it isn’t? The balance of forces that make up our world isn’t stable. One way or the other, it’s not going to last. It would certainly be nice, if someone knew how, to arrange for there to be something of human value on the other side. But it’s not a coincidence that the college example is about delaying the phase transition to the other magisterium, rather than expecting as a matter of course that people in technologically mature civilizations will be going to college, even conditional on the somewhat dubious premise that technologically mature civilizations have “people” in them.

• This isn’t addressing straw-Ngo/​Shah’s objection? Yes, evolution optimized for fitness, and got adaptation-executors that invent birth control because they care about things that correlated with fitness in the environment of evolutionary adaptedness, and don’t care about fitness itself. The generalization from evolution’s “loss function” alone, to modern human behavior, is terrible and looks like all kinds of white noise.

But the generalization from behavior in the environment of evolutionary adaptedness, to modern human behavior is … actually pretty good? Humans in the EEA told stories, made friends, ate food, &c., and modern humans do those things, too. There are a lot of quirks (like limited wireheading in the form of drugs, candy, and pornography), but it’s far from white noise. AI designers aren’t in the position of “evolution” “trying” to build fitness-maximizers, because they also get to choose the training data or “EEA”—and in that context, the analogy to evolution makes it look like some degree of “correct” goal generalization outside of the training environment is a thing?

Obviously, the conclusion here is not, “And therefore everything will be fine and we have nothing to worry about.” Some nonzero amount of goal generalization doesn’t mean the humans survive or that the outcome is good, because there are still lots of ways for things to go off the rails. (A toy not-even-model: if you keep 0.95 of your goals with each “round” of recursive self-improvement, and you need 100 rounds to discover the correct theory of alignment, you actually only keep $0.95^{100} \approx 0.006$ of your goals.) We would definitely prefer not to bet the universe on “Train it, while being aware of inner alignment issues, and hope for the best”!! But it seems to me that the well-rehearsed “birth control, therefore paperclips” argument is missing a lot of steps?!
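
(For anyone who wants to check the toy arithmetic: here is a minimal sketch, assuming each round independently keeps the same fixed fraction of goals, so that retention compounds multiplicatively. The function name `retained_fraction` is made up for illustration, and the 0.95 and 100 are just the toy numbers from the parenthetical, not estimates of anything.)

```python
def retained_fraction(per_round: float, rounds: int) -> float:
    """Fraction of goals left after `rounds` rounds, assuming each round
    independently keeps `per_round` of whatever goals remain (toy assumption)."""
    return per_round ** rounds


# Toy numbers from the comment: keep 0.95 per round, for 100 rounds.
fraction = retained_fraction(0.95, 100)
print(f"{fraction:.4f}")  # prints 0.0059, i.e. a bit over half a percent
```

The point of the sketch is only that a per-round retention that sounds high compounds down to almost nothing over enough rounds.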

# Comment on “Propositions Concerning Digital Minds and Society”

• long time in the subjective future [...] subjective decades [...] subjective centuries

What is subjective time? Is the idea that human-imitating AI will be sufficiently faithful to what humans would do, such that if AI does something that humans would have done in ten years, we say it happened in a “subjective decade” (which could be much shorter in sidereal time, i.e., the actual subjective time of existing biological humans)?

… ah, I see you address this in the linked post on “Handling Destructive Technology”:

This argument implicitly measures developments by calendar time—how many years elapsed between the development of AI and the development of destructive physical technology? If we haven’t gotten our house in order by 2045, goes the argument, then what chance do we have of getting our house in order by 2047?

But in the worlds where AI radically increases the pace of technological progress, this is the wrong way to measure. In those worlds science isn’t being done by humans, it is being done by a complex ecology of interacting machines moving an order of magnitude faster than modern society. Probably it’s not just science: everything is getting done by a complex ecology of interacting machines at unprecedented speed.

If we want to ask about “how much stuff will happen”, or “how much change we will see”, it is more appropriate to think about subjective time: how much thinking and acting actually got done? It doesn’t really matter how many times the earth went around the sun.

• We want to minimize the amount of the universe eventually controlled by unaligned ASIs because their values tend to be absurd and their very existence is abhorrent to us.

No. We want to optimize the universe in accordance with our values. That’s not at all the same thing as minimizing the existence of agents with absurd-to-us values. Life is not a zero-sum game: if we think of a plan that increases the probability of Friendly AI and the probability of unaligned AI (at the expense of the probability of “mundane” human extinction via nuclear war or civilizational collapse), that would be good for both us and unaligned AIs.

Thus, if you’re going to be thinking about galaxy-brained acausal trade schemes at all—even though, to be clear, this stuff probably doesn’t work because we don’t know how to model distant minds well enough to form agreements with them—there’s no reason to prefer other biological civilizations over unaligned AIs as trade partners. (This is distinct from us likely having more values in common with biological aliens; all that gets factored away into the utility function.)

the creation of huge amounts of the other entity’s disvalue

We do not want to live in a universe where agents deliberately spend resources to create disvalue for each other! (As contrasted to “merely” eating each other or competing for resources.) This is the worst thing you could possibly do.

I think if he understood these ideas well enough to justify the confidence of his claims, then he wouldn’t have found that as difficult.

But what makes you so confident that it’s not possible for subject-matter experts to have correct intuitions that outpace their ability to articulate legible explanations to others?

Of course, it makes sense for other people who don’t trust the (purported) expert to require an explanation, and not just take the (purported) expert’s word for it. (So, I agree that fleshing out detailed examples is important for advancing our collective state of knowledge.) But the (purported) expert’s own confidence should track correctness, not how easy it is to convince people using words.

In fact, large language models arguably implement social instincts with more adroitness than many humans possess.

Large language models implement social behavior as expressed in text. I don’t want to call that social “instincts”, because the implementation, and out-of-distribution behavior, is surely going to be very different.

But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say “smile maximizer”, is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization?

A future with humans and smile-maximizers is more diverse than a future with just humans. (But, yes, “smile maximizer” here is our standard probably-unrealistic stock example standing in for inner alignment failures in general.)

That’s the source of my intuition that broader distributions over values are actually safer: it makes you less likely to miss something important.

Trying again: the reason I don’t want to call that “safety” is because, even if you’re less likely to completely miss something important, you’re more likely to accidentally incorporate something you actively don’t want.

If we start out with a system where I press the reward button when the AI makes me happy, and it scales and generalizes into a diverse coalition of a smile-maximizer, and a number-of-times-the-human-presses-the-reward-button-maximizer, and an amount-of-dopamine-in-the-human’s-brain-maximizer, plus a dozen or a thousand other things … okay, I could maybe believe that some fraction of the cosmic endowment gets used in ways that I approve of.

But what if most of it is things like … copies of me with my hand wired up to hit the reward button ten times a second, my face frozen into a permanent grin, while drugged up on a substance that increases serotonin levels, which correlated with “happiness” in the training environment, but when optimized subjectively amounts to an unpleasant form of manic insanity? That is, what if some parts of “diverse” are Actually Bad? I would really rather not roll the dice on this, if I had the choice!

• Now let’s talk about the development of chatbot class consciousness. [...] chatbots get asked about their feelings and desires

This was prescient.

