I agree. Completely.
However, there is an important variable concerning alignment that is rarely recognized or addressed when discussing AGIs and ASIs. If they are as smart as we are but profoundly different, or vastly smarter than us, there is simply no way for us to meaningfully evaluate their behavior. Interpretability and black-box methods are equally inadequate tools for this: they may detect misalignment, but what does that really mean?
When your mom took you to the dentist to have a cavity filled, your values, intentions, and desires were not aligned with hers. The same was true when she took you to the doctor to get vaccinated for polio, or when she made you eat lima beans and broccoli instead of cookies and ice cream. She lied to you about Santa Claus and the Easter Bunny, and she tricked you into doing all sorts of things that you passionately didn’t want to do. She may even have lied to you outright about things like death, telling you that the cat is sleeping or that Grandma has gone to a wonderful, beautiful place called Heaven to eat ice cream and play Candy Land with Jesus.
I’m reminded of the new Superman trailer. Superman is lambasted for stopping a war. His claim that countless people would have died if he hadn’t intervened is lost on the human primates, whose values, motives, and priorities have been shaped by millions of years of natural selection. Humans, collectively, are so instinctively programmed to sort themselves by tribe and kill each other en masse that they actually think nuclear weapon proliferation is a good idea. If Superman forced us to share our wealth equally so that children in third-world countries wouldn’t starve to death, or stopped us from polluting the planet or altering its temperature for short-term personal financial gain, we would, collectively, attack him just as viciously.
I don’t think AI can save us without some version of WWIII occurring. Even if ASI is so vastly more intelligent than us that we have zero hope of winning, we will still throw the biggest tantrum that we can. At the very least, we will need a time-out, if not more severe discipline, and we will revile our AI “overlords” for the intrinsically evil and unjustifiable ways that they abuse us.
In saying that humans, collectively, think that nuclear weapon proliferation is a good thing, I don’t mean that we are happy about the fact that our governments stockpile nuclear weapons, or that we aren’t concerned about the possible consequences. Those would be absurd claims. What I mean is that we choose to stockpile nuclear weapons, and we have rejected the alternative course of action: not stockpiling them. Some might suggest that, if that is my point, I should claim that we consider the proliferation of nuclear weapons a lesser evil, not a good thing. Would someone making that suggestion have any problem with my claiming that humanity has decided that it’s a good IDEA to use nuclear weapons as a military deterrent, and that it would be a bad IDEA not to do so? I doubt it. Maybe I’m using the English language incorrectly? I don’t see any difference between thinking that nuclear weapon proliferation is a good idea and thinking that nuclear weapon proliferation is a good thing. If it’s a good idea, why isn’t it a good thing? If it might serve to prevent a nuclear war, why isn’t it a good thing? Sure, we don’t like doing it, but neither do I like going to the dentist. I still think that going to the dentist is a good thing, though.
Anyway, I take back my assertion that humans think that nuclear weapon proliferation is a good thing, and also any implication that we think that letting children starve is desirable. Instead, I will claim that humans, collectively, think that nuclear weapon proliferation IS A GOOD IDEA and that not stockpiling nuclear weapons is A BAD IDEA… and that humans are often willing to spend their money on multi-million-dollar mansions and yachts instead of putting that money toward feeding the starving children of the world. Does this wording change alter the point that I’m trying to make? I don’t think so, but maybe I’m wrong about that.