I actually don’t think the disagreement here is one of definitions. Looking up Webster’s definition of control, the most relevant meaning is: “a device or mechanism used to regulate or guide the operation of a machine, apparatus, or system.” This seems... fine? We might differ on some nuances if we really drilled down into the details, but I think the more significant difference here is the relevant context.
Absent some minor quibbles, I’d be willing to concede that an AI-powered HelperBot could control the placement of a chair, within reasonable bounds of precision, with a reasonably low failure rate. I’m not particularly worried about it, say, slamming the chair down too hard, causing a splinter to fly into its circuitry and transform it into MurderBot. Nor am I worried about the chair placement setting off some weird “butterfly effect” that somehow has the same result. I’m going to go out on a limb and just say that chair placement seems like a pretty safe activity, at least when considered in isolation.
The reason I used the analogy “I may well be able to learn the thing if I am smart enough, but I won’t be able to control for the person I will become afterwards” is that it is an example of the reference class of contexts that SNC is concerned with. Another is: “what is the expected shift to the global equilibrium if I construct this new invention X to solve problem Y?” In your chair analogy, this would be like the process of learning to place the chair (rewiring some aspect of its thinking process), or inventing an upgraded chair and releasing this novel product into the economy (changing its environmental context). This is still a somewhat silly toy example, but hopefully you see the distinction between these types of processes and the relatively straightforward matter of placing a physical object. It isn’t so much about straightforward mistakes (though those can be relevant) as it is about introducing changes to the environment that shift its point of equilibrium. Remember, AGI is a nontrivial thing that affects the world in nontrivial ways, so these ripple effects (including feedback loops that affect the AGI itself) need to be accounted for, even if that isn’t a class of problem that today’s engineers often bother with because it Isn’t Their Job.
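(For concreteness, here is a toy numerical sketch of what I mean by an intervention shifting the environment’s equilibrium. Every name, dynamic, and number in it is invented purely for illustration; it is not part of the SNC argument itself, just a picture of the feedback-loop shape I’m pointing at.)

```python
# Toy sketch only: an agent nudges a 1-D "environment" toward a target, and
# each intervention also shifts the baseline the environment relaxes to.
# All parameters here are made up for illustration.

def simulate(steps=50, target=1.0, gain=0.5, drift_per_intervention=0.02):
    state = 0.0        # current environment state
    equilibrium = 0.0  # where the environment settles when left alone
    for _ in range(steps):
        action = gain * (target - state)                      # agent corrects toward its target
        equilibrium += drift_per_intervention * abs(action)   # side effect: the baseline itself moves
        state += action + 0.1 * (equilibrium - state)         # environment responds to both
    return state, equilibrium

final_state, final_equilibrium = simulate()
print(f"final state: {final_state:.3f}, shifted equilibrium: {final_equilibrium:.3f}")
```

The agent “succeeds” at its local task, yet the point the system drifts back to when the agent stops acting is no longer where it started; that second quantity is the one the chair-placement framing leaves out.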
Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self-destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant. Similarly, if the alignment problem as it is commonly understood by Yudkowsky et al. is not solved pre-AGI and a rogue AI turns the world into paperclips or whatever, that would not make SNC invalid, only irrelevant. By analogy, global warming isn’t going to prevent the Sun from exploding, even though the former could very well affect how much people care about the latter.
Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force. Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers. The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against. If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.
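(Again purely as a sketch of my own, not part of the argument: a one-variable toy where a state is pulled toward a “values” attractor and a competing “drift” attractor at the same time. Both pull strengths are made-up parameters; the only point is that where the system settles is governed by their ratio, which is exactly the quantity I’m saying should be investigated rather than assumed.)

```python
# Toy sketch only: a scalar state pulled toward a "values" attractor at 0 and
# a competing "drift" attractor at 1. The pull strengths are arbitrary.

def settle(values_pull, drift_pull, steps=1000, dt=0.01):
    x = 0.0  # start fully at the values attractor
    for _ in range(steps):
        x += dt * (values_pull * (0.0 - x) + drift_pull * (1.0 - x))
    return x

for drift in (0.1, 1.0, 10.0):
    print(f"drift pull {drift:>4}: settles near {settle(1.0, drift):.2f}")
```

With weak drift the state stays near the values attractor; with strong drift it ends up near the other basin. Nothing here says which regime an ASI-plus-environment system is actually in; that is the open question.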
> Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self-destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant.
Yup, that’s a good point; I edited my original comment to reflect it.
> Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force. Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers. The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against. If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.
With that being said, we have come to a point of agreement. It was a pleasure to have this discussion with you. It made me think of many fascinating things that I wouldn’t have thought about otherwise. Thank you!