I broadly agree that a lot of discussion about AI x-risk is confused due to the use of suggestive terms. Of the ones you’ve listed, I would nominate “optimizer”, “mesa optimization”, “(LLMs as) simulators”, “(LLMs as) agents”, and “utility” as probably the most problematic. I would also add “deception/deceptive alignment”, “subagent”, “shard”, “myopic”, and “goal”. (It’s not a coincidence that so many of these terms seem to be related to notions of agency or subcomponents of agents; this seems to be the main place where sloppy reasoning can slide in.)
I also agree that I’ve encountered a lot of people who confidently predict Doom on the basis of subtle word games.
However, I also agree with Ryan’s comment that these confusions seem much less common when we get to actual senior AIS researchers or people who’ve worked significantly with real models. (My guess is that Alex would disagree with me on this.) Most conversations I’ve been in that used these confused terms tended to involve MATS fellows or other very junior people (I don’t interact much with people more junior than that, unfortunately, so I’m not sure). I’ve also had several conversations with people who seemed relieved at how reasonable and not confused the relevant researchers have been (e.g. with Alexander Gietelink-Oldenziel).
I suspect that a lot of the confusions stem from the way that the majority of recruitment/community building is conducted—namely, by very junior people recruiting even more junior people (e.g. via student groups). Not only is there very limited communication bandwidth available for reaching potential new recruits (which encourages arguments by analogy or via suggestive words), but the people doing the communicating are also likely to lean heavily on those suggestive terms themselves (in large part because they’re very junior, and likely not technical researchers).[1] There are also historical reasons why this is the case: a lot of early EA/AIS people were philosophers, and so presented detailed philosophical arguments (often routing through longtermism) about specific AI doom scenarios, which in turn suffered lossy compression during communication, as opposed to more robust general arguments (e.g. Ryan Greenblatt’s example of “holy shit AI (and maybe the singularity), that might be a really big deal”).[2]
Similarly, on LessWrong, I suspect that the majority of commenters are not people who have deeply engaged with a lot of the academic ML literature or have spent significant time doing AIS or even technical ML work.
And I’d also point a finger at a lot of the communication from MIRI in particular as a cause of these confusions: e.g. the “sharp left-turn” concept seems to be communicated primarily via metaphor and cryptic sayings, while their communications about Reward Learning and Human Values seem in retrospect to have been at least misleading if not fundamentally confused. I suspect that the relevant people involved have much better models, but I think this did not come through in their communication.
I’m not super sure what to do about it; the problem of suggestive names (or, in general, of smuggling connotations into technical work) is not unique to this community, nor is it one that can be fixed by reading a single article or two (as your post emphasizes). I’d even argue this community does better than a large fraction of academics (even ML academics).
John mentioned using specific, concrete examples as a way to check your concepts. If we’re quoting old rationalist foundation texts, then this passage from “Surely You’re Joking, Mr. Feynman” is relevant:
“I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples. For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball) – disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on. Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say, ‘False!’”
Unfortunately, in my experience, general instructions of the form “create concrete examples when listening to a chain of reasoning involving suggestive terms” do not seem to work very well, even if examples of doing so are provided, so I’m not sure there’s a scalable solution here.
My preferred approach is to give the reader concrete examples to chew on as early as possible, but this runs into the failure mode of contingent facts about the example being taken as a general point (or even worse, the failure mode where the reader assumes that the concrete case is the general point being made). I’d consider mathematical equations (even if they are only toy examples) to be helpful as well, assuming you strip away the suggestive terms and focus only on the syntax/semantics. But I find that I also have a lot of difficulty getting other people to create examples I’d actually consider concrete. Frustratingly, many “concrete” examples I see smuggle in even more suggestive terms or connotations, and sometimes even fail to capture any of the semantics of the original idea.
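To give a sense of what I mean by a stripped-down toy equation (this is my own illustrative sketch, with made-up notation, not something from the post): instead of saying “the trained policy wants to maximize reward,” one can write the bare claim as

```latex
% Toy claim with the suggestive words removed: the learned policy \pi^*
% is (approximately) a maximizer of expected return over some fixed
% policy class \Pi, for trajectories \tau and a return function R.
\pi^{\ast} \in \arg\max_{\pi \in \Pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[ R(\tau) \right]
```

Here the only content is an argmax over a fixed class; whether anything deserving words like “wants” or “goal” applies becomes a separate question that has to be argued for explicitly, rather than a connotation smuggled in by the terminology.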
So in the end, maybe I have nothing better than to repeat Alex’s advice at the end of the post:
All to say: Do not trust this community’s concepts and memes, if you have the time. Do not trust me, if you have the time. Verify.
At the end of the day, while saying “just be better” does not serve as actionable advice, there might not be an easier answer.
To be clear, I think that many student organizers and community builders in general do excellent work that is often incredibly underappreciated. I’m making a specific claim about the immediate causal reasons for why this is happening, and not assigning fault. I don’t see an easy way for community builders to do better, short of abandoning specialization and requiring everyone to be a generalist who also does technical AIS work.
That being said, I think that it’s worth trying to make detailed arguments concretizing general concerns, in large part to make sure that the case for AI x-risk doesn’t “come down to a set of subtle word games” (e.g. I like Ajeya’s doom story). After all, it’s worth concretizing a general concern and making sure that concrete instantiations of it are actually possible. I just think that detailed arguments (where the details matter) often get compressed in ways that end up depending on suggestive names, especially in cases with limited communication bandwidth.