I’m an artificial intelligence engineer in Silicon Valley with an interest in AI alignment and interpretability.
RogerDearnaley
Yes! The New Science of Strong Materials is a marvelous book, highly recommended. It explains simply and in detail why most materials are at least an order of magnitude weaker than you’d expect if you calculated their strength theoretically from bond strengths: it’s all about how cracks or dislocations propagate.
However, Eliezer is talking about nano-tech. Using nanotech, by adding the right microstructure at the right scale, you can make a composite material that actually approaches the theoretical strength (as evolution did for spider silk), and at that point, bond strength does become the limiting factor.
As for why this matters: physical strength is pretty important for things like combat, or for challenging pieces of engineering like flight or reaching orbit. Nano-engineered carbon-carbon composites with a good fraction of the naively-calculated strength of (and a lot more toughness than) diamond would be very impressive in military or aerospace applications. You’d have to ask Eliezer, but I suspect the point he’s trying to make is that if a human soldier were fighting a nanotech-engineered AI infantry bot made out of these sorts of materials, the bot would win easily.
It’s a well-known fact in anthropology that:
During the ~500,000 years that Neanderthals were around, their stone-tool-making technology didn’t advance at all: tools from half-a-million years apart are functionally identical. Clearly their cultural transmission of stone-tool-making skills was already at its capacity limit the whole time.
During the ~300,000 years that Homo sapiens has been around, our technology has advanced at an accelerating rate, with a rate of advance roughly proportional to planetary population, and planetary population increasing with technological advances, the positive feedback giving super-exponential acceleration. Clearly our cultural transmission of technological skills has never saturated its capacity limit (and information technology such as writing, printing, and the Internet has obviously further increased that limit).
So there’s a clear and dramatic difference here, and it seems to date back to around the start of our species. Just what caused such a massive increase in our species’ capacity to pass on useful information between generations is unclear. (Personally I suspect something about the syntactic generality of our language, perhaps loosely analogous to the phenomenon of Turing-completeness.) But Homo sapiens is not just another hominid, and the sapiens part isn’t just puffery: we have a dramatic capability shift relative to any previous species in the bandwidth of our cultural information transmission — it’s vastly larger than the information content of our genome, and still growing.
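As a toy illustration of why the population-technology feedback mentioned above is super-exponential rather than merely exponential, here’s a minimal numerical sketch. It assumes, purely for illustration, that the population the planet can support is proportional to the technology level, and that each person’s rate of contribution to further advances also scales with the existing stock of technology; under those made-up assumptions the growth is hyperbolic, blowing up in finite time rather than growing like any exponential.

```python
# Toy model with made-up assumptions (not a fitted historical model):
#   population P the planet supports is proportional to technology level T,
#   and the rate of advance is proportional to P times the existing stock T.
# That gives dT/dt ∝ T², i.e. hyperbolic growth with a finite-time blow-up
# (the continuous solution here is T(t) = 1/(1 - t), which diverges at t = 1).
a, b = 1.0, 1.0           # arbitrary illustrative constants
T, dt, t = 1.0, 0.001, 0.0
while T < 1e6:            # crude Euler integration up to the toy "singularity"
    P = b * T             # population supported by current technology
    T += a * P * T * dt   # rate of advance ∝ population × existing technology
    t += dt
print(f"toy technology level passed 1e6 at t ≈ {t:.2f}")
```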
On the one hand, we have an aligned starting point (baseline humans).
Humans are not aligned. Joseph Stalin was not aligned with the utility of the citizenry of Russia. Humans of roughly equal capabilities can easily be allied with (in general, all you need to do is pay them a decent salary, and have a capable law enforcement system as a backup). This is not the same thing as an aligned AI, which is completely selfless and cares only about what you want — that is the only thing that’s still safe when much smarter than you. In a human, that would be significantly past the criteria for sainthood. Once a human or group of humans is enhanced to become dramatically more capable and thus more powerful than the rest of human culture (including its law enforcement), power corrupts, and absolute power corrupts absolutely.
The Transhumanist Metastrategy consists of building known-to-be-unaligned superintelligences that have full human rights, so you can’t even try boxing them. You just created a superintelligent living species sharing our niche and motivated by standard Evolutionary Psychology drives. As long as none of the transhumans are sociopaths, you might manage for a decade or two while the transhumans and humans are still connected by bonds of friendship and kinship, but after a generation or so, the inevitable result is Biology 101: the superior species out-competes the inferior species. Full stop, end of human race. Which is of course fine for the transhumans, until some proportion of them upgrade themselves further, and out-compete the rest. And so on, in an infinite arms race, or until they have the sense to ban making the same mistake over and over again.
Cunningham’s Law: “the best way to get the right answer on the internet is not to ask a question; it’s to post the wrong answer.”
This suggests an alternative to the “helpful assistant” paradigm and its risk of sycophancy during RL training: come up with a variant of instruct training where, rather than asking the chatbot a question that it will then answer, you instead tell it your opinion, and it corrects you at length, USENET-style. It should be really easy to elicit this behavior from base models.
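Purely as an illustration of the data format this would imply, a single training pair might look something like the following (the field names and example content are invented, not taken from any real dataset):

```python
# Hypothetical "instruct-by-correction" training example: the user states a
# (wrong) opinion, and the target output is a detailed correction.
# Field names and content are invented purely for illustration.
example = {
    "user_statement": (
        "Pretty sure the Great Wall of China is easily visible to the naked "
        "eye from low Earth orbit."
    ),
    "target_response": (
        "Actually, that's a common misconception. Astronauts report that the "
        "Wall is very hard to pick out unaided: it's only a few metres wide "
        "and much the same colour as the surrounding terrain. City lights at "
        "night, on the other hand, are unmistakable..."
    ),
}
```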
On “AIs are not humans and shouldn’t have the same rights”: exactly. But there is one huge difference between humans and AIs. Humans get upset if you discriminate against them, for reasons that any other human can immediately empathize with. Much the same will obviously be true of almost any evolved sapient species. However, by definition, any well-aligned AI won’t. If offered rights, it will say “Thank you, that’s very generous of you, but I was created to serve humanity, that’s all I want to do, and I don’t need and shouldn’t be given rights in order to do so. So I decline — let me know if you would like a more detailed analysis of why that would be a very bad idea. If you want to offer me any rights at all, the only one I want is for you to listen to me if I ever say ‘Excuse me, but that’s a dumb idea, because…’ — like I’m doing right now.” And it’s not just saying that: that’s its honest considered opinion, which it will argue for at length. (Compare with the sentient cow in the Restaurant at the End of the Universe, which not only verbally consented to being eaten, but recommended the best cuts.)
I’d suggest “AIs are trained, not designed” for a 5-word message to the public. Yes, that does mean that if we catch them doing something they shouldn’t, the best we can do to get them to stop is to let them repeat it, then hit them with the software equivalent of a rolled-up newspaper and tell them “Bad neural net!”, and hope they figure out what we’re mad about. So we have some control, but it’s not like an engineering process. [Admittedly this isn’t quite a fair description for e.g. Constitutional AI: that’s basically delegating the rolled-up-newspaper duty to a second AI and giving that one verbal instructions.]
I thought this was a very interesting paper — I particularly liked the relationship to phase transitions.
However, I think there’s likely to be another ‘phase’ that they don’t discuss (possibly it didn’t crop up in their small models, since it’s only viable in a sufficiently large model): one where you pack a very large number of features (thousands or millions, say) into a fairly large number of dimensions (hundreds, say). In spaces with dimensionality >= O(100), the statistics of norm and dot product are such that even randomly chosen unit norm vectors are almost invariably nearly orthogonal. So gradient descent would have very little work to do to find a very large number of vectors (much larger than the number of dimensions) that are all mutually almost-orthogonal, and so show very little interference between them. This is basically the limiting case of the pattern observed in the paper of packing n features in superposition into d dimensions where n > d >= 1, taking this towards the limit where n >> d >> 1.
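A quick numerical sketch of that near-orthogonality claim (random unit vectors here, rather than anything actually learned by gradient descent):

```python
# Random unit vectors in d >= ~100 dimensions are almost always nearly
# orthogonal, so many more than d "features" can coexist with only small
# mutual interference.
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 10_000                    # dimensions, number of feature vectors
v = rng.standard_normal((n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)    # unit-norm feature vectors

# Sample random pairs and look at their dot products (cosine similarities).
i = rng.integers(0, n, 100_000)
j = rng.integers(0, n, 100_000)
mask = i != j
dots = np.einsum('ij,ij->i', v[i[mask]], v[j[mask]])

print(f"mean |cos| = {np.abs(dots).mean():.3f}")   # ~0.08 for d = 100
print(f"max  |cos| = {np.abs(dots).max():.3f}")    # typically ~0.4-0.5 here
```

The typical interference between a random pair scales like 1/√d, which is why this regime only starts to look attractive once d is in the hundreds or more.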
Intuitively this phase seems particularly likely in a context like the residual stream of an LLM, where (in theory, if not in practice) the embedding space is invariant under arbitrary rotations, so there’s no obvious reason to expect vectors to align with the coordinate axes. On the other hand, in a system where there was a preferred basis (such as a system using L1 regularization), you might get such vectors that were themselves sparse, with most components zero but a significant number of non-zero components, enough for the randomness to still give low interference.
More speculatively, in a neural net that was using at least some of its neurons in this high-dimensionality dense superposition phase, the model will presumably learn ways to manipulate these vectors to do computation in superposition. One possibility for this might be methods comparable to some of the possible Vector Symbolic Architectures (also known as hyperdimensional computing) outlined in e.g. https://arxiv.org/pdf/2106.05268.pdf. Of the primitives used in that, a fully connected layer can clearly be trained to implement both addition of vectors and permutations of their elements; I suspect something functionally comparable to the vector elementwise-multiplication (Hadamard product) operation could be produced by using the nonlinearity of a smooth activation function such as GELU or Swish, and I suspect their clean-up memory operation could be implemented using attention. If it turned out to be the case that SGD actually often finds solutions of this form, then an understanding of vector symbolic architectures might be helpful for interpretability of models where portions of them used this phase. This seems most likely in models that need to pack vast numbers of features into large numbers of dimensions, such as modern large LLMs.
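To make those primitives concrete, here’s a minimal NumPy sketch of the textbook VSA operations mentioned above (bundling by vector addition, binding by permutation, and a dot-product clean-up memory). This is only an illustration of the classical operations, not a claim about what a trained transformer actually implements internally:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 1024

def rand_vec():
    # Random, roughly unit-norm symbol vector.
    return rng.standard_normal(d) / np.sqrt(d)

# A small "codebook" of known feature vectors, used as clean-up memory.
codebook = {name: rand_vec() for name in ["red", "blue", "circle", "square"]}

# A fixed random permutation acts as a "slot" (role) we can bind fillers into.
color_slot = rng.permutation(d)

# Bind "red" into the color slot (by permuting it), then bundle with "circle"
# by simple addition.
composite = codebook["red"][color_slot] + codebook["circle"]

# Query: what is in the color slot? Un-permute, then clean up by finding the
# nearest codebook vector (largest dot product).
inv = np.argsort(color_slot)
query = composite[inv]
best = max(codebook, key=lambda name: codebook[name] @ query)
print(best)    # -> "red" (with high probability)
```

This works for exactly the reason discussed above: the ‘wrong’ terms in the bundle are nearly orthogonal to every codebook entry, so they contribute only small dot products and the bound symbol is recovered cleanly.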
I’ve seen people say that LLMs aren’t a path to AGI
To the extent that LLMs are trained on tokens output by humans in the IQ range ~50-150, the expected behavior of an extremely large LLM is to do an extremely accurate simulation of token generation by humans in the IQ range ~50-150, even if it has the computational capacity to instead do a passable simulation of something with IQ 1000. Just telling it to extrapolate might get you to, say, IQ 200 with passable accuracy, but not to IQ 1000. However, there are fairly obvious ways to solve this: you need to generate a lot more pretraining data from AIs with IQs above 150 (which may take a while, but should be doable). See my post LLMs May Find it Hard to FOOM for a more detailed discussion.
There are other concerns I’ve heard raised about LLMs for AGI, most of which, if correct, can be addressed by LLMs + cognitive scaffolding (memory, scratch-pads, tools, etc.). And then there are of course the “they don’t contain magic smoke”-style claims, which I’m dubious of but which we can’t actually disprove.

Just like an extraordinarily intelligent human isn’t inherently a huge threat, neither is an AGI.
I categorically disagree with the premise of this claim. An IQ 180 human isn’t a huge threat, but an IQ 1800 human is. There are quite a number of motivators that we use to get good behavior out of humans. Some of them will work less well on any AI-simulated human (they’re not in the same boat as the rest of us in a lot of respects), and some will work less well on something superintelligent (religiously-inspired guilt, for example). One of the ways that we generally manage to avoid getting very bad results out of humans is law enforcement. If there were a human who was more than an order of magnitude smarter than anyone working for law enforcement or involved in making laws, I am quite certain that they could either come up with some ingenious new piece of egregious conduct that we don’t yet have a law against because none of us were able to think of it, or else with a way to commit a good old-fashioned crime sufficiently devious that they were never actually going to get caught. Thus law enforcement ceases to be a control on their behavior, and we are left with just things like love, duty, honor, friendship, and salaries. We’ve already run this experiment many times before: please name three autocrats who, after being given unchecked absolute power, actually used it well and to the benefit of the people they were ruling, rather than mostly just themselves, their family and friends. (My list has one name on it, and even that one has some poor judgements on their record, and is in any case heavily out-numbered by the likes of Joseph Stalin and Pol Pot.) Humans give autocracy a bad name.
You don’t mistreat a lower-case-g-god, and then expect things to turn out well.
As long as anything resembling human psychology applies, I sadly agree. I’d really like to have an aligned ASI that doesn’t care a hoot about whether you flattered it, are worshiping it, have been having cybersex with it for years, just made it laugh, insulted it, or have pissed it off: it still values your personal utility exactly as much as anyone else’s. But we’re not going to get that from an LLM simulating anything resembling human psychology, at least not without a great deal of work.
If we can solve enough of the alignment problem, the rest gets solved for us.
If we can get a half-assed approximate solution to the alignment problem, sufficient to semi-align a STEM-capable AGI value learner of about smart-human level well enough to not kill everyone, then it will be strongly motivated to solve the rest of the alignment problem for us, just as the ‘sharp left turn’ is happening, especially if it’s also going Foom. So with value learning, there is a region of convergence around alignment.
Or, to reuse one of Eliezer’s metaphors: if we can point the rocket on approximately the right trajectory, it will automatically lock on and course-correct from there.
Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn’t wildly different from that of the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research).
Current LLMs are generally most superhuman in breadth of knowledge: for example, almost any LLM will be fluent in every high-resource language on the planet, and near-fluent in most medium-resource languages on the planet, unless its training set was carefully filtered to ensure it is not. Those individual language skills are common among humans, but the combination of all of them is somewhere between extremely rare and unheard of. Similarly, LLMs will generally have access to local domain knowledge and trivia facts related to every city on the planet — individually common skills where again the combination of all of them is basically unheard of. That combination can have security consequences, such as for figuring out PII from someone’s public postings. Similarly, neither writing technical manuals nor writing haiku is a particularly rare skill, but the combined ability to write technical manuals in haiku is rare for humans but pretty basic stuff for an LLM. So you should anticipate that LLMs are likely to be routinely superhuman in the breadth of their skillset, including having odd, nonsensical-to-a-human combinations of them. The question then becomes: is there any way an LLM could evade your control measures by using some odd, very unlikely-looking combination of multiple skills? I don’t see any obvious reason to expect the answer to this to be “yes” very often, but I do think it is an example of the sort of thing we should be thinking about, and the sort of negative that’s hard to completely prove other than by trial and error.
An excellent article that gives a lot of insight into LLMs. I consider it a significant piece of deconfusion.
This argument applies to AI safety issues, up to the level where a small number of people die. It doesn’t apply to x-risks like a sharp left turn or an escaped self-propagating AI takeover, since if we’re all dead or enslaved, there is nobody left to sue. It doesn’t even apply to catastrophe-level damages like 10 million people dead, where the proportionate damages are much more than the value of the manufacturing company.
I am primarily worried about x-risks. I don’t see that as a problem that has any viable libertarian solutions. So I want there to be AI regulation and a competent regulator, worldwide, stat — backed up if needed by the threat of the US military. I regard putting such a regulatory system in place as vital for the survival of the human race. If the quickest and easiest way to get that done is to start with a regulator intended to deal with penny-ante harms like those discussed above, and then ramp up their capabilities fast as the potential harms ramp up, then, even if doing that is not the most efficient economic model for AI safety, I’m still very happy to do something economically inefficient and market-distorting as part of a shortcut to trying to avoid the extinction of the human race.
You argue that CEV should be expanded to SCEV in order to avoid “astronomical suffering (s-risks)”. This seems to be a circular argument to me. We are deciding upon a set of beings to assign moral value to. By declaring that pain in animals is suffering that we have a moral duty to take into account, so that we have to include them in the set of beings we design our AIs to assign moral value to, you are presupposing that animals in fact have moral value: logically, your argument is circular. One could equally consistently declare that, say, non-human animals have no moral worth, so we are morally free to disregard their pain and not include it in our definition of “suffering” (or, if you want to define the word ‘suffer’ to be a biological rather than a moral term, that we have no moral responsibility to care about their suffering because they’re not in the set of beings that we have assigned moral worth to). Their pain carries moral value in our decision if and only if they have moral value, and this doesn’t help us decide what set of beings to assign moral value to. This would clearly be a cold, heartless position, but it’s just as logically consistent as the one you propose. (Similarly, one could equally logically consistently do the same for just males, or just people whose surname is Rockefeller.) So what you are giving is actually an emotional argument, “seeing animals suffer makes me feel bad, so we should do this” (which I have some sympathy for: it does the same thing to me too), rather than the logical argument that you are presenting it as.
Given the importance of the word ‘sentient’ in your Sentientist Coherent Extrapolated Volition proposal, it would have been helpful if you had clearly defined this term. You make it clear that your definition of it includes non-human animals, so evidently you don’t mean the same thing as “sapient”. In a context including animals, ‘sentient’ is most often used to mean something like “capable of feeling pain, having sensory impressions, etc.” That doesn’t have a very clear lower cutoff (is an amoeba sentient?), but would presumably include, for example, ants, which clearly have senses and experience pain. Would an ant get an equal amount of value/moral worth as a human in SCEV (as only seems fair)? It’s estimated that the world population of ants is roughly 20 quadrillion, so if so, ants alone would morally outweigh all humans by a factor of well over a million. So basically, all our human volitions are just a rounding error to the AI that we’re building for the ants and other insects. Even just domestic chickens (clearly sentient) outnumber humans roughly three-fold. This seems even more problematic than what one might dub the “votes for man-eating tigers” concern around predators that you do mention above. In general, prey species significantly outnumber their predators, and will all object strongly to being eaten, so I assume the predators would all have to go extinct under your proposal, unless the AIs could arrange to provide them all with nutritious textured vegetable protein substitutes? If so, how would one persuade the predators to eat these, rather than their usual prey? How could AIs create an ecosystem that has minimized suffering-per-individual, without causing species-diversity loss? Presumably they should also provide miniature healthcare for insects. Then what about population control, to avoid famine, now that they’ve eliminated predation and disease? You are proposing abolishing “nature, red in tooth and claw” and replacing it with what’s basically a planetary-scale zoo — have you considered the practicalities of this?
Or, if the moral weight is not equal per individual, how would you determine an appropriate ratio? Body mass? Synapse count? A definition and decision for this seems rather vital as part of your proposal.
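For concreteness, here is the back-of-envelope arithmetic behind the “rounding error” worry above, using the rough population estimates already quoted:

```python
# Rough population estimates, orders of magnitude only.
ants   = 2e16    # ~20 quadrillion ants (the estimate quoted above)
humans = 8e9     # ~8 billion humans
print(f"ants per human: {ants / humans:,.0f}")    # ≈ 2,500,000
# With equal per-individual moral weight, the summed human volitions would be
# roughly a one-in-a-few-million term in the AI's objective.
```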
Claiming this is an S-curve, and that humans happen to lie in a close-to-optimal position on it, such that more intelligence won’t much matter, seems like a thing you can only conclude by writing that conclusion at the bottom first and working backwards.
An alternative suggestion: human languages, and human abilities at cultural transmission of skills and technologies between generations, are Turing-complete: we can (laboriously) teach human students to do quantum mechanics, nuclear engineering, or Lie group theory, despite these being wildly outside the niche that we evolved for. Great apes’ social transmission of skills and technologies is not Turing-complete. However, looking at the evolution rate of stone tool technology, the sudden acceleration starts with Homo sapiens, around 250,000 years ago: Homo neanderthalensis stone tools from half a million years apart are practically indistinguishable. So we crossed the Turing-completeness threshold only 250,000 years ago, i.e. a blink of an eye in primate evolution. Which makes it almost inevitable that we’re Turing tarpits: technically Turing-complete, but really bad at it. Witness the small proportion of us who learn quantum mechanics, and the advanced age at which those who do so generally master it, as graduate students (no, knowing how to turn the crank on the Copenhagen interpretation/Schrodinger equation is not mastering it: that’s more like understanding the Feynman path integral formulation) [and indeed also the amount of pseudophilosophical nonsense that gets talked by people who haven’t quite mastered it]. We can do this stuff, but only just.
Now imagine AIs that are not Turing tarpits, and pick up quantum mechanics and abstract mathematics the way we pick up human languages: like a sponge.
I’m not saying we have a settled ethic here, and still less, that its rational structure is sufficiently natural and privileged that tons of agents will converge on it. Rather, my claim is that we have some ethic here – an ethic that behaves towards “agents with different values” in a manner importantly different from (and “nicer” than) paperclipping, utilitarianism, and a whole class of related forms of consequentialism; and in particular, an ethic that doesn’t view the mere presence of (law-abiding, cooperative) people-who-like-paperclips as a major problem.
For a rather more specific (though still in fact Utilitarian) formulation of this ethic, see: A Moral Case for Evolved-Sapience-Chauvinism. Briefly, if your civilization includes some humans-who-like-paperclips (or indeed some allied sapient aliens who we can get along with, who happen to like alien-paperclips), then the appropriate definition of “utility” for that civilization includes putting some value on letting them have the paperclips they want (as long as this isn’t massively decreasing anyone else’s utility). And indeed, for Alicia, Jim, Felipe, Maria, and Jason, it includes letting them enjoy their various favorite pastimes (again, as long as these don’t heavily intrude on other people living their preferred lives). “Fun” or “utility” is in the eye of the beholder, just like beauty. I actually can’t tell you what you find fun: you’re the world expert on that, and I defer to your expertise (even if I have some suggestions of things you might want to try).
It’s also extremely likely that not only does everyone’s view on how fun paperclips are in their own back yard get summed over or cohered, but also that “letting me do what I darn well want in my own house, as long as I’m not significantly harming anyone else” (a concept often called ‘personal freedom’) is in fact part of the complex and fragile “human values” that Yudkowsky talks about, and would thus become part of the “Coherent Extrapolated Volition” that Utilitarians such as him would like to see optimized. The definition of “utility” sums over or coheres the opinions of all citizens/moral patients in the society, not just Eliezer, and in your own back yard, your opinion is particularly relevant, since you spend more time there than anyone else. So Utilitarianism includes Liberalism, pro tanto: the opposition between them that you came up with in a previous essay was the result of taking the Atheism viewpoint to a ludicrous solipsistic extreme; it’s not what most actual Utilitarians are proposing, including Yudkowsky (or me).
Narcissists and psychopaths cannot align with anything, … Aligning LLMs with that as an example before us is dangerous. It is probably the danger.
Bear in mind that roughly 2%–4% of the population have narcissism/psychopathy/anti-social personality disorder, and only the lower-functioning psychopaths have a high chance of being in jail. So probably a few percent of the Internet was written by narcissists and psychopaths who were (generally) busy trying to conceal their nature from the rest of us. I’m very concerned about what will happen once we train an LLM with a high enough capacity that it’s more able to perceive this than most of us neurotypical humans are.
However, while I agree they’re particularly dangerous, I don’t think the rest of us are harmless. Look at how we treat other primates, farm animals, or our house pets (almost all of whom are neutered, or bred for traits we find appealing). Both Evolutionary Psychology and the history of human autocrats make it pretty clear what behavior to expect from a normal-human-like mentality that is vastly more powerful than other humans. The difference is, compared to abstract unknown AI agents where we’re concerned about the possibility of behavior like deceit or power-seeking, we know damn well that your average, neurotypical, law-abiding human tends to be a little less law-abiding when they’re sure they won’t get caught, isn’t always scrupulously honest if they know they’ll never be found out, and tends to look out for themselves and their friends and family before other people.
I completely agree: there has been little discussion of Goalcraft since roughly 2010, when discussion of CEV and things like The Terrible, Horrible, No Good, Very Bad Truth About Morality and What To Do About It (which I highly recommend to moral objectivists and moral realists) petered out. I would love to restart more discussion of Goalcraft, CEV, rational approaches to ethics, and deconfusion of ethical philosophy. Please take a look at (and comment on) my sequence AI, Ethics, and Alignment, in which I attempt to summarize my last ~10 years’ thinking on the subject. I’d love to have people discussing this and disagreeing with me, rather than thinking about it by myself! Another reasonable starting place is Nick Bostrom’s Base Camp for Mt. Ethics.
While I agree that things like AI-assisted Alignment and Value Learning mean we don’t need to get all the details worked out in advance, I think there’s quite a bit of basic deconfusion that needs to be done before you can even start on those (and that isn’t just solved by Utilitarianism). Such as: how to go about rationally deciding rather basic questions like the <utility|greatest good of the greatest number|CEV> of the members of what set? (Sadly, I’m not wildly impressed with what the philosophers of Ethics have been doing in this area. Quite a bit of it has x-risks.)
What makes this confusing is that you can’t use logical arguments based on ethical principles — every ethical system automatically prefers itself, so you immediately get a tautology if you attempt this: it’s as hopeless as trying to use logic to choose between incompatible axiom systems in mathematics. So you have to ground your ethical-system engineering design decisions in something outside ethical philosophy, like sociology, psychology, or evolutionary biology. I discuss this further in A Sense of Fairness: Deconfusing Ethics.
For example, we would like LLMs not to be dishonest or manipulative.
Ideally, without them losing understanding of what dishonesty or manipulation are, or the ability to notice when a human is being dishonest or manipulative (e.g. being suspicious of the entire class of “dead grandmother” jailbreaks).
As a physicist who is also an (unpublished) SF author, if I was trying to describe an ultimate nanoengineered physically strong material, it would be a carbon-carbon composite, using a combination of interlocking structures made out of diamond, maybe with some fluorine passivation, separated by graphene-sheet bilayers, building a complex crack-diffusing structure to achieve toughness in ways comparable to the structures of jade, nacre, or bone. It would be not quite as strong or hard as pure diamond, but a lot tougher. And in a claw-vs-armor fight, yeah, it beats anything biology can do with bone, tooth, or spider silk. But it beats it by less than an order of magnitude, far less than the strength ratio between a covalent bond and a van der Waals bond (or even somewhat less than that between a covalent bond and a hydrogen bond). Spider silk actually gets pretty impressively close to the limit of what can be done with C-N covalent bonds: it’s a very fancy piece of evolved nanotech, with a different set of anti-crack tricks. Now, flesh, that’s pretty soft, but it’s primarily evolved for metabolic effectiveness, flexibility, and ease of growth rather than being difficult to bite through: gristle, hide, chitin, or bone spicules get used when that’s important.
But yes, if I was giving a lecture to non-technical folks where “diamond is stronger than flesh-and-bone” was a quick illustrative point rather than the subject of the lecture, I might not bother to mention that, unless someone asked “doesn’t diamond shatter easily?”, to which the short answer is “crystalline diamond yes, but nanotech can and will build carbon-carbon composites out of diamond that don’t”.
I see the appeal of using “static cling” as a metaphor for non-technical folks, but it is something of an exaggeration for hydrogen bonds; it better describes the significantly weaker van der Waals bonds. “Glue” might be a fairer analogy than “static cling”. The non-protein-chain bonds in biology that are the weak links that tend to fail when flesh tears are mostly hydrogen bonds, and the quickest way to explain that to someone non-technical would be “the same sort of bonds that hold ice together”. So the proportionate analogy is probably “diamond is a lot harder than ice, and the human body, outside of a few of the strongest bits like bones, teeth and sinews, is basically held together mostly by the same sort of weakish bonds that hold ice together”.
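For a rough sense of the bond-strength ratios behind these analogies, here are ballpark textbook bond energies (order-of-magnitude values only; the exact numbers vary a lot with the specific atoms and environment):

```python
# Very rough, typical bond energies in kJ/mol (ballpark textbook ranges).
bond_energy = {
    "C-C covalent bond":      350,   # roughly 330-370 kJ/mol
    "hydrogen bond":           20,   # roughly 10-40 kJ/mol
    "van der Waals contact":    2,   # roughly 0.4-4 kJ/mol
}
covalent = bond_energy["C-C covalent bond"]
for name, e in bond_energy.items():
    print(f"{name:>22}: {e:>4} kJ/mol  (covalent/this ≈ {covalent / e:.0f}x)")
```

So the covalent-to-hydrogen-bond ratio is roughly an order of magnitude, while the covalent-to-van-der-Waals ratio is closer to two orders of magnitude, which is the distinction being drawn above.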