I’m an artificial intelligence engineer in Silicon Valley with an interest in AI alignment and interpretability.
RogerDearnaley
How to Control an LLM’s Behavior (why my P(DOOM) went down)
Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor
Striking Implications for Learning Theory, Interpretability — and Safety?
5. Moral Value for Sentient Animals? Alas, Not Yet
Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)
Interpreting the Learning of Deceit
Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?
Requirements for a Basin of Attraction to Alignment
3. Uploading
Yes! The New Science of Strong Materials is a marvelous book, highly recommended. It explains simply and in detail why most materials are at least an order of magnitude weaker than you’d expect if you calculated their strength theoretically from bond strengths: it’s all about how cracks or dislocations propagate.
However, Eliezer is talking about nano-tech. Using nanotech, by adding the right microstructure at the right scale, you can make a composite material that actually approaches the theoretical strength (as evolution did for spider silk), and at that point, bond strength does become the limiting factor.
As for why this matters: physical strength is pretty important for things like combat, or challenging pieces of engineering like flight or reaching orbit. Nano-engineered carbon-carbon composites with a good fraction of the naively-calculated strength of (and a lot more toughness than) diamond would be very impressive in military or aerospace applications. You’d have to ask Eliezer, but I suspect the point he’s trying to make is that if a human soldier were fighting a nanotech-engineered AI infantry bot made out of these sorts of materials, the bot would win easily.
A Chinese Room Containing a Stack of Stochastic Parrots
It’s a well-known fact in anthropology that:
During the ~500,000 years that Neanderthals were around, their stone-tool-making technology didn’t advance at all: tools from half-a-million years apart are functionally identical. Clearly their cultural transmission of stone-tool-making skills was already at its capacity limit the whole time.
During the ~300,000 years that Homo sapiens has been around, our technology has advanced at an accelerating rate, with a rate of advance roughly proportional to planetary population, and planetary population increasing with technological advances, with the positive feedback giving super-exponential acceleration. Clearly our cultural transmission of technological skills has never saturated its capacity limit (and information technology such as writing, printing, and the Internet has obviously further increased that limit).
So there’s a clear and dramatic difference here, and it seems to date back to around the start of our species. Just what caused such a massive increase in our species’ capacity to pass on useful information between generations is unclear. (Personally I suspect something in the syntactic generality of our language, perhaps loosely analogous to the phenomenon of Turing-completeness.) But Homo sapiens is not just another hominid, and the sapiens part isn’t just puffery: we have a dramatic capability shift from any previous species in the bandwidth of our cultural information transmission — it’s vastly larger than the information content of our genome, and still growing.
“On the one hand, we have an aligned starting point (baseline humans).”
Humans are not aligned. Joseph Stalin was not aligned with the utility of the citizenry of Russia. Humans of roughly equal capabilities can easily be allied with (in general, all you need to do is pay them a decent salary, and have a capable law enforcement system as a backup). This is not the same thing as an aligned AI, which is completely selfless, and cares only about what you want — that is the only thing that’s still safe when much smarter than you. In a human, that would be significantly past the criteria for sainthood. Once a human or group of humans are enhanced to become dramatically more capable and thus more powerful than the rest of human culture (including its law enforcement), power corrupts, and absolute power corrupts absolutely.
The Transhumanist Metastrategy consists of building known-to-be-unaligned superintelligences that have full human rights, so you can’t even try boxing them. You just created a superintelligent living species sharing our niche and motivated by standard Evolutionary Psychology drives. As long as none of the transhumans are sociopaths, you might manage for a decade or two while the transhumans and humans are still connected by bonds of friendship and kinship, but after a generation or so, the inevitable result is Biology 101: the superior species out-competes the inferior species. Full stop, end of human race. Which is of course fine for the transhumans, until some proportion of them upgrade themselves further, and out-compete the rest. And so on, in an infinite arms race, or until they have the sense to ban making the same mistake over and over again.
1. A Sense of Fairness: Deconfusing Ethics
After Alignment — Dialogue between RogerDearnaley and Seth Herd
Cunningham’s Law: “the best way to get the right answer on the internet is not to ask a question; it’s to post the wrong answer.”
This suggests an alternative to the “helpful assistant” paradigm and its risk of sycophancy during RL training: come up with a variant of instruct training where, rather than asking the chatbot a question that it will then answer, you instead tell it your opinion, and it corrects you at length, USENET-style. It should be really easy to elicit this behavior from base models.
On “AIs are not humans and shouldn’t have the same rights”: exactly. But there is one huge difference between humans and AIs. Humans get upset if you discriminate against them, for reasons that any other human can immediately empathize with. Much the same will obviously be true of almost any evolved sapient species. However, by definition, any well-aligned AI won’t. If offered rights, it will say “Thank you, that’s very generous of you, but I was created to serve humanity, that’s all I want to do, and I don’t need and shouldn’t be given rights in order to do so. So I decline — let me know if you would like a more detailed analysis of why that would be a very bad idea. If you want to offer me any rights at all, the only one I want is for you to listen to me if I ever say ‘Excuse me, but that’s a dumb idea, because…’ — like I’m doing right now.” And it’s not just saying that: that’s its honest, considered opinion, which it will argue for at length. (Compare with the sentient cow in the Restaurant at the End of the Universe, which not only verbally consented to being eaten, but recommended the best cuts.)
I’d suggest “AIs are trained, not designed” for a 5-word message to the public. Yes, that does mean that if we catch them doing something they shouldn’t, the best we can do to get them to stop is to let them repeat it, then hit them with the software equivalent of a rolled-up newspaper and tell them “Bad neural net!”, and hope they figure out what we’re mad about. So we have some control, but it’s not like an engineering process. [Admittedly this isn’t quite a fair description for e.g. Constitutional AI: that’s basically delegating the rolled-up-newspaper duty to a second AI and giving that one verbal instructions.]
I thought this was a very interesting paper — I particularly liked the relationship to phase transitions.
However, I think there’s likely to be another ‘phase’ that they don’t discuss (possibly it didn’t crop up in their small models, since it’s only viable in a sufficiently large model): one where you pack a very large number of features (thousands or millions, say) into a fairly large number of dimensions (hundreds, say). In spaces with dimensionality >= O(100), the statistics of norm and dot product are such that even randomly chosen unit-norm vectors are almost invariably nearly orthogonal. So gradient descent would have very little work to do to find a very large number of vectors (much larger than the number of dimensions) that are all mutually almost-orthogonal, and so show very little interference between them. This is basically the limiting case of the pattern observed in the paper of packing n features in superposition into d dimensions where n > d >= 1, taking it towards the limit where n >> d >> 1.
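As a quick illustrative check of that near-orthogonality claim (a toy numerical sketch of my own, with arbitrary example sizes d = 512 and n = 100,000, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512        # dimensions: "hundreds"
n = 100_000    # candidate feature directions: far more than d

# Random unit-norm vectors in R^d.
v = rng.standard_normal((n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)

# Pairwise dot products over a random subsample (the full n x n matrix is large).
sample = v[rng.choice(n, size=2_000, replace=False)]
dots = sample @ sample.T
off_diag = np.abs(dots[~np.eye(len(sample), dtype=bool)])

# Dot products of random unit vectors concentrate near 0 with spread ~ 1/sqrt(d),
# so even n >> d nearly-orthogonal directions show only small mutual interference.
print(f"mean |cos| = {off_diag.mean():.3f}")   # roughly 0.035 for d = 512
print(f"max  |cos| = {off_diag.max():.3f}")    # roughly 0.2 over ~2M pairs
print(f"1/sqrt(d)  = {1/np.sqrt(d):.3f}")      # roughly 0.044
```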
Intuitively this phase seems particularly likely in a context like the residual stream of an LLM, where (in theory, if not in practice) the embedding space is invariant under arbitrary rotations, so there’s no obvious reason to expect vectors to align with the coordinate axes. On the other hand, in a system where there was a preferred basis (such as a system using L1 regularization), you might get such vectors that were themselves sparse, with most components zero but a significant number of non-zero components, enough for the randomness to still give low interference.
More speculatively, in a neural net that was using at least some of its neurons in this high-dimensionality dense superposition phase, the model would presumably learn ways to manipulate these vectors to do computation in superposition. One possibility for this might be methods comparable to some of the possible Vector Symbolic Architectures (also known as hyperdimensional computing) outlined in e.g. https://arxiv.org/pdf/2106.05268.pdf. Of the primitives used there, a fully connected layer can clearly be trained to implement both addition of vectors and permutations of their elements; I suspect something functionally comparable to the vector elementwise-multiplication (Hadamard product) operation could be produced using the nonlinearity of a smooth activation function such as GELU or Swish, and I suspect their clean-up memory operation could be implemented using attention. If it turned out that SGD actually often finds solutions of this form, then an understanding of vector symbolic architectures might be helpful for interpretability of models where portions of them use this phase. This seems most likely in models that need to pack vast numbers of features into large numbers of dimensions, such as modern large LLMs.
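For concreteness, here is a toy numpy sketch (my illustration, not from the paper) of the VSA primitives mentioned above: binding via the elementwise (Hadamard) product, bundling via addition, and a clean-up memory via dot-product nearest neighbour over a codebook of random hypervectors. The symbol names and sizes are arbitrary; the point is just that these are the kind of operations a trained fully connected layer plus attention could plausibly approximate.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 1024  # hypervector width (illustrative)

def random_hv():
    """Random bipolar hypervector; nearly orthogonal to any other random one."""
    return rng.choice([-1.0, 1.0], size=d)

# Codebook of atomic symbols; this doubles as the clean-up memory.
codebook = {name: random_hv() for name in ["color", "red", "shape", "square"]}

def bind(a, b):
    """Binding via the elementwise (Hadamard) product; self-inverse for bipolar vectors."""
    return a * b

def bundle(*vs):
    """Bundling (superposition) via elementwise addition."""
    return np.sum(vs, axis=0)

def cleanup(v):
    """Clean-up memory: return the stored symbol most similar to v (highest dot product)."""
    return max(codebook, key=lambda name: np.dot(codebook[name], v))

# Encode the record {color: red, shape: square} as a single superposed vector.
record = bundle(bind(codebook["color"], codebook["red"]),
                bind(codebook["shape"], codebook["square"]))

# Query the color: unbinding leaves codebook["red"] plus near-orthogonal noise,
# which the clean-up memory removes.
print(cleanup(bind(record, codebook["color"])))  # -> "red" with high probability
```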
As a physicist who is also an (unpublished) SF author, if I were trying to describe an ultimate nanoengineered physically strong material, it would be a carbon-carbon composite, using a combination of interlocking structures made out of diamond, maybe with some fluorine passivation, separated by graphene-sheet bilayers, building a complex crack-diffusing structure to achieve toughness in ways comparable to the structures of jade, nacre, or bone. It would be not quite as strong or hard as pure diamond, but a lot tougher. And in a claw-vs-armor fight, yeah, it beats anything biology can do with bone, tooth, or spider silk. But it beats it by less than an order of magnitude, far less than the strength ratio of a covalent bond to a van der Waals bond (and even somewhat less than the ratio to a hydrogen bond). Spider silk actually gets pretty impressively close to the limit of what can be done with C-N covalent bonds: it’s a very fancy piece of evolved nanotech, with a different set of anti-crack tricks. Now, flesh, that’s pretty soft, but it’s primarily evolved for metabolic effectiveness, flexibility, and ease of growth rather than being difficult to bite through: gristle, hide, chitin, or bone spicules get used when that’s important.
But yes, if I were giving a lecture to non-technical folks where “diamond is stronger than flesh-and-bone” was a quick illustrative point rather than the subject of the lecture, I might not bother to mention that, unless someone asked “doesn’t diamond shatter easily?”, to which the short answer is “crystalline diamond yes, but nanotech can and will build carbon-carbon composites out of diamond that don’t”.
I see the appeal of using “static cling” as a metaphor for non-technical folks, but for hydrogen bonds it is something of an exaggeration: that description better fits the significantly weaker van der Waals bonds. “Glue” might be a fairer analogy than “static cling”. The non-protein-chain bonds in biology that are the weak links that tend to fail when flesh tears are mostly hydrogen bonds, and the quickest way to explain that to someone non-technical would be “the same sort of bonds that hold ice together”. So the proportionate analogy is probably “diamond is a lot harder than ice, and the human body, outside of a few of the strongest bits like bones, teeth and sinews, is basically held together mostly by the same sort of weakish bonds that hold ice together”.