This is a pretty good essay, and I’m glad you wrote it. I’ve been thinking similar thoughts recently and have been attempting to put them into words. These realizations have left me somewhat more optimistic, and somewhat more uncertain about my models of alignment.
Anyway, on to my disagreements.
> It’s hard when you’ve[2] read Dreams of AI Design and utterly failed to avoid the same mistakes yourself.
I don’t think that “Dreams of AI Design” was an adequate essay to get people to understand this. These distinctions are subtle and, as you can tell, drawing them is not an epistemological skill that comes naturally to us. “Dreams of AI Design” is about confusing the symbol with the substance: `'5` with `5`, in Lisp terms, or the variable name `five` with the value `5`, in more general programming-language terms. It is about ensuring that all the symbols you think with actually map onto some substance. It is not about the more subtle art of noticing that you are incorrectly equivocating between a pre-theoretic concept such as “optimization pressure” and the actual process of gradient updates.

I suspect that Eliezer has made at least one such mistake, one that may have made him significantly more pessimistic about our chances of survival. I know I’ve made this mistake dozens of times. I mean, my username is “mesaoptimizer”, and I no longer endorse that term or concept as a way of thinking about the relevant parts of the alignment problem.
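To make the symbol/substance distinction concrete, here is a minimal Common Lisp sketch of my own (not from either essay). I use the symbol `five` rather than the literal `'5`, since numbers in Lisp are self-evaluating and quoting them changes nothing:

```lisp
;; A quoted symbol is just a name; evaluating it is what maps the name onto a value.
(defvar five 5)        ; bind the name FIVE to the value 5

'five                  ; => FIVE  (the symbol itself, unevaluated)
five                   ; => 5     (the value the name maps onto)
(eq 'five five)        ; => NIL   (the name and the value are distinct objects)
(symbol-value 'five)   ; => 5     (evaluation: the map from symbol to substance)
```

Thinking with `'five` when you mean `five` is harmless only so long as something, somewhere, actually performs that evaluation.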
> It’s hard when your friends are using the terms, and you don’t want to be a blowhard about it and derail the conversation by explaining your new term.
I’ve started to learn to be less neurotic about ensuring that people’s vaguely defined terms map onto something concrete, mainly because I’ve come to value the fact that these vaguely defined terms, so long as they are not incorrectly equivocated, hold valuable information that we might otherwise lose. Perhaps you might find this helpful.
> When I try to point out such (perceived) mistakes, I feel a lot of pushback, and somehow it feels combative.
I empathize with those pushing back. To someone who has learned to translate these terms, ad hoc, into more concrete and locally relevant formulations, what you are stating seems obvious; under that assumption, it looks like you are making a fuss about something that doesn’t really matter, and even reaching for examples to prove your point. On the other hand, I expect that ad-hoc adjustment of such terms is insufficient for doing productive alignment research: the epistemological skill you are trying to point at is extremely important for people working in this domain.
I’m uncertain about how confused senior alignment researchers are when it comes to these words and concepts. It is likely that some have cached mistaken equivocations and are therefore either too pessimistic, failing to see how certain alignment approaches could pan out, or too optimistic, believing that we have a non-trivial probability of getting our hands on a science accelerator. And deference causes a cascade in which everyone (by inference or by explicit communication) adopts these incorrect equivocations as well.