• I’m advocating for people to stop talking/thinking as though post-AGI life is a different magisterium from pre-AGI life

Seems undignified to pretend that it isn’t? The balance of forces that make up our world isn’t stable. One way or the other, it’s not going to last. It would certainly be nice, if someone knew how, to arrange for there to be something of human value on the other side. But it’s not a coincidence that the college example is about delaying the phase transition to the other magisterium, rather than expecting as a matter of course that people in technologically mature civilizations will be going to college, even conditional on the somewhat dubious premise that technologically mature civilizations have “people” in them.

• This isn’t addressing straw-Ngo/Shah’s objection? Yes, evolution optimized for fitness, and got adaptation-executors that invent birth control because they care about things that correlated with fitness in the environment of evolutionary adaptedness, and don’t care about fitness itself. The generalization from evolution’s “loss function” alone, to modern human behavior, is terrible and looks like all kinds of white noise.

But the generalization from behavior in the environment of evolutionary adaptedness, to modern human behavior is … actually pretty good? Humans in the EEA told stories, made friends, ate food, &c., and modern humans do those things, too. There are a lot of quirks (like limited wireheading in the form of drugs, candy, and pornography), but it’s far from white noise. AI designers aren’t in the position of “evolution” “trying” to build fitness-maximizers, because they also get to choose the training data or “EEA”—and in that context, the analogy to evolution makes it look like some degree of “correct” goal generalization outside of the training environment is a thing?

Obviously, the conclusion here is not, “And therefore everything will be fine and we have nothing to worry about.” Some nonzero amount of goal generalization doesn’t mean the humans survive or that the outcome is good, because there are still lots of ways for things to go off the rails. (A toy not-even-model: if you keep 0.95 of your goals with each “round” of recursive self-improvement, and you need 100 rounds to discover the correct theory of alignment, you actually only keep $0.95^{100} \approx 0.0059$ of your goals.) We would definitely prefer not to bet the universe on “Train it, while being aware of inner alignment issues, and hope for the best”!! But it seems to me that the well-rehearsed “birth control, therefore paperclips” argument is missing a lot of steps?!
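For anyone who wants to check the toy arithmetic, here is a minimal sketch in Python. The function name `fraction_of_goals_kept` is made up for illustration, and the assumption that each round independently retains a fixed fraction of goals is just the toy not-even-model from the parenthetical, not a claim about how goal drift actually works.

```python
def fraction_of_goals_kept(retention_per_round: float, rounds: int) -> float:
    """Fraction of the original goals remaining after `rounds` rounds,
    assuming (hypothetically) a fixed retention fraction that compounds."""
    return retention_per_round ** rounds


# The numbers from the toy example: keep 0.95 per round, for 100 rounds.
kept = fraction_of_goals_kept(0.95, 100)
print(f"{kept:.4f}")  # a bit under 0.006
```

Under the same compounding assumption, even 0.99 retention per round leaves only about a third of the original goals after 100 rounds, so the conclusion is not very sensitive to the exact figure.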

# Comment on “Propositions Concerning Digital Minds and Society”

• long time in the subjective future [...] subjective decades [...] subjective centuries

What is subjective time? Is the idea that human-imitating AI will be sufficiently faithful to what humans would do, such that if AI does something that humans would have done in ten years, we say it happened in a “subjective decade” (which could be much shorter in sidereal time, i.e., the actual subjective time of existing biological humans)?

… ah, I see you address this in the linked post on “Handling Destructive Technology”:

This argument implicitly measures developments by calendar time—how many years elapsed between the development of AI and the development of destructive physical technology? If we haven’t gotten our house in order by 2045, goes the argument, then what chance do we have of getting our house in order by 2047?

But in the worlds where AI radically increases the pace of technological progress, this is the wrong way to measure. In those worlds science isn’t being done by humans, it is being done by a complex ecology of interacting machines moving an order of magnitude faster than modern society. Probably it’s not just science: everything is getting done by a complex ecology of interacting machines at unprecedented speed.

If we want to ask about “how much stuff will happen”, or “how much change we will see”, it is more appropriate to think about subjective time: how much thinking and acting actually got done? It doesn’t really matter how many times the earth went around the sun.

• We want to minimize the amount of the universe eventually controlled by unaligned ASIs because their values tend to be absurd and their very existence is abhorrent to us.

No. We want to optimize the universe in accordance with our values. That’s not at all the same thing as minimizing the existence of agents with absurd-to-us values. Life is not a zero-sum game: if we think of a plan that increases the probability of Friendly AI and the probability of unaligned AI (at the expense of the probability of “mundane” human extinction via nuclear war or civilizational collapse), that would be good for both us and unaligned AIs.

Thus, if you’re going to be thinking about galaxy-brained acausal trade schemes at all (even though, to be clear, this stuff probably doesn’t work, because we don’t know how to model distant minds well enough to form agreements with them), there’s no reason to prefer other biological civilizations over unaligned AIs as trade partners. (This is distinct from us likely having more values in common with biological aliens; all that gets factored away into the utility function.)

the creation of huge amounts of the other entity’s disvalue

We do not want to live in a universe where agents deliberately spend resources to create disvalue for each other! (As contrasted to “merely” eating each other or competing for resources.) This is the worst thing you could possibly do.

I think if he understood these ideas well enough to justify the confidence of his claims, then he wouldn’t have found that as difficult.

But what makes you so confident that it’s not possible for subject-matter experts to have correct intuitions that outpace their ability to articulate legible explanations to others?

Of course, it makes sense for other people who don’t trust the (purported) expert to require an explanation, and not just take the (purported) expert’s word for it. (So, I agree that fleshing out detailed examples is important for advancing our collective state of knowledge.) But the (purported) expert’s own confidence should track correctness, not how easy it is to convince people using words.

In fact, large language models arguably implement social instincts with more adroitness than many humans possess.

Large language models implement social behavior as expressed in text. I don’t want to call that social “instincts”, because the implementation, and out-of-distribution behavior, is surely going to be very different.

But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say “smile maximizer”, is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization?

A future with humans and smile-maximizers is more diverse than a future with just humans. (But, yes, “smile maximizer” here is our standard probably-unrealistic stock example standing in for inner alignment failures in general.)

That’s the source of my intuition that broader distributions over values are actually safer: it makes you less likely to miss something important.

Trying again: the reason I don’t want to call that “safety” is because, even if you’re less likely to completely miss something important, you’re more likely to accidentally incorporate something you actively don’t want.

If we start out with a system where I press the reward button when the AI makes me happy, and it scales and generalizes into a diverse coalition of a smile-maximizer, and a number-of-times-the-human-presses-the-reward-button-maximizer, and an amount-of-dopamine-in-the-human’s-brain-maximizer, plus a dozen or a thousand other things … okay, I could maybe believe that some fraction of the cosmic endowment gets used in ways that I approve of.

But what if most of it is things like … copies of me with my hand wired up to hit the reward button ten times a second, my face frozen into a permanent grin, while drugged up on a substance that increases serotonin levels, which correlated with “happiness” in the training environment, but when optimized subjectively amounts to an unpleasant form of manic insanity? That is, what if some parts of “diverse” are Actually Bad? I would really rather not roll the dice on this, if I had the choice!

• Now let’s talk about the development of chatbot class consciousness. [...] chatbots get asked about their feelings and desires

This was prescient.

The historical figures who basically saw it (George Eliot 1879: “will the creatures who are to transcend and finally supersede us be steely organisms [...] performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy?”; Turing 1951: “At some stage therefore we should have to expect the machines to take control”) seem to have done so in the spirit of speculating about the cosmic process. The idea of coming up with a plan to solve the problem is an additional act of audacity; that’s not really how things have ever worked so far. (People make plans about their own lives, or their own businesses; at most, a single country; no one plans world-scale evolutionary transitions.)

• More intuitive illustration with no logarithms: your plane crashed in the ocean. To survive, you must swim to shore. You know that the shore is west, but you don’t know how far.

The optimist thinks the shore is just over the horizon; we only need to swim a few miles and we’ll almost certainly make it. The pessimist thinks the shore is a thousand miles away and we will surely die. But the optimist and pessimist can both agree on how far we’ve swum up to this point, and that the most dignified course of action is “Swim west as far as you can.”

• If you look at non-sentient humans (sleeping, sleep walking, trance state, some anesthetic drugs, etc), they typically behave quite differently from normal humans.

Yeah, in the training environment. But, as you know, the reason people think inner-misalignment is a problem is precisely because capability gains can unlock exotic new out-of-distribution possibilities that don’t have the same properties.

Boring, old example (skip this paragraph if it’s too boring): humans evolved to value sweetness as an indicator of precious calories, and then we invented aspartame, which is much sweeter for far fewer calories. Someone in the past who reasoned, “If you look at sweet foods, they have a lot of calories; that’ll probably be true in the future”, would have been meaningfully wrong. (We still use actual sugar most of the time, but I think this is a lot like why we still have rainforests: in the limit of arbitrary capabilities, we don’t care about any of the details of “original” sugar except what it tastes like to us.)

Better, more topical example: human artists who create beautiful illustrations on demand experience a certain pride in craftsmanship. Does DALL-E? Notwithstanding whether “it may be that today’s large neural networks are slightly conscious”, I’m going to guess No, there’s nothing in text-to-image models remotely like a human artist’s pride; we figured out how to get the same end result (beautiful art on demand) in an alien, inhuman way that’s not very much like a human artist internally. Someone in the past who reasoned, “The creators of beautiful art will take pride in their craft,” would be wrong.

key to safe generalizations is that values tend to multiply [...] significant increase in the breadth / diversity of values

“Increase in diversity” and “safe generalization” seem like really different things to me? What if some of the new, diverse values are actually bad from our perspective? (Something like, being forced to smile might make you actually unhappy despite the outward appearance of your face, but a human-smile-maximizer doesn’t care about that, and this future is more diverse than the present because the present doesn’t have any smile-maximizers.)

Basically, some of your comments make me worry that you’re suffering from a bit of anthropomorphic optimism?

At the same time, however, I think this line of research is very interesting and I’m excited to see where you go with it! Yudkowsky tends to do this lame thing where after explaining the inner-alignment/context-disaster problem, he skips to, “And therefore, because there’s no obvious relationship between the outer loss function and learned values, and because the space of possibilities is so large, almost all of it has no value, like paperclips.” I think there’s a lot of missing argumentation there, and discovering the correct arguments could change the conclusion and our decisions a lot! (In the standard metaphor, we’re not really in the position of “evolution” with respect to AI so much as we are the environment of evolutionary adaptedness.) It’s just, we need to be careful to be asking, “Okay, what actually happens with inner alignment failures; what’s the actual outcome specifically?” without trying to “force” that search into finding reassuring fake reasons why the future is actually OK.

• (For non-x-risk-focused transhumanists, some of whom may be tech execs or ML researchers.)

Some people treat the possibility of human extinction with a philosophical detachment: who are we to obstruct the destiny of the evolution of intelligent life? If the “natural” course of events for a biological species like ours is to be transcended by our artificial “mind children”, shouldn’t we be happy for them?

I actually do have some sympathy for this view, in the sense that the history where we build AI that kills us is plausibly better than the history where the Industrial Revolution never happens at all. Still—if you had the choice between a superintelligence that kills you and everyone you know, and one that grants all your hopes and dreams for a happy billion-year lifespan, isn’t it worth some effort trying to figure out how to get the latter?

• It seems to me that if I sat down with 8 other smart people, I could probably build a cutting-edge system within 1-2 years.

If you’re not already doing machine learning research and engineering, I think it takes more than two years of study to reach the frontier? (The ordinary software engineering you use to build Less Wrong, and the futurism/alignment theory we do here, are not the same skills.)

As my point of comparison for thinking about this, I have a couple hundred commits in Rust, but I would still feel pretty silly claiming to be able to build a state-of-the-art compiler in 2 years with 7 similarly-skilled people, even taking into account that a lot of the work is already done by just using LLVM (similar to how ML projects can just use PyTorch or TensorFlow).

Is there some reason to think AGI (!) is easier than compilers? I think “newer domain, therefore less distance to the frontier” is outweighed by “newer domain, therefore less is known about how to get anything to work at all.”

• As an example of the kind of point that one might use in deciding who “came off better” in the FOOM debate, Hanson predicted that “AIs that can parse and use CYC should be feasible well before AIs that can parse and use random human writings”, which seems pretty clearly falsified by large language models—and that also likely bears on Hanson’s view that “[t]he idea that you could create human level intelligence by just feeding raw data into the right math-inspired architecture is pure fantasy”.

As you point out, however, this exercise of looking at what was said and retrospectively judging whose worldview seemed “less surprised” by what happened is definitely not the same thing as a forecasting track record. It’s too subjective; rationalizing why your views are “less surprised” by what happened than some other view (without either view having specifically predicted what happened), is not hugely more difficult than rationalizing your views in the first place.

• The comments about Metaculus (“jumped ahead of 6 years earlier”) make more sense if you interpret them as being about Yudkowsky already having “priced in” a deep-learning-Actually-Works update in response to AlphaGo in 2016, in contrast to Metaculus forecasters needing to see DALL-E 2/PaLM/Gato in 2022 in order to make “the same” update.

(That said, I agree that Yudkowsky’s sneering in the absence of a specific track record is infuriating; I strong-upvoted this post.)

• Without necessarily disagreeing, I’m curious exactly how far back you want to push this. The natural outcome of technological development has been clear to sufficiently penetrating thinkers since the nineteenth century. Samuel Butler saw it. George Eliot saw it. Following Butler, should “every machine of every sort [...] be destroyed by the well-wisher of his species,” that we should “at once go back to the primeval condition of the race”?

Turing knew. He knew, and he went and founded the field of computer science anyway. What a terrible person, right?

• To get an AI that wants to help humans, just ensure the AI is rewarded for helping humans during training.

To what extent do you expect this to generalize “correctly” outside of the training environment?

In your linked comment, you mention humans being averse to wireheading, but I think that’s only sort-of true: a lot of people who successfully avoid trying heroin because they don’t want to become heroin addicts, do still end up abusing a lot of other evolutionarily-novel superstimuli, like candy, pornography, and video games.

That makes me think inner-misalignment is still going to be a problem when you scale to superintelligence: maybe we evolve an AI “species” that’s genuinely helpful to us in the roughly human-level regime (where its notion of helping and our notion of being-helped coincide very well), but when the AIs become more powerful than us, they mostly discard the original humans in favor of optimized AI-“helping”-“human” superstimuli.

I guess I could imagine this being an okay future if we happened to get lucky about how robust the generalization turned out to be: maybe the optimized AI-“helping”-“human” superstimuli actually are living good transhuman lives, rather than being a nonsentient “sex toy” that happens to be formed in our image? But I’d really rather not bet the universe on this (if I had the choice not to bet).