Note that since Agent-4 is actually misaligned, it is highly plausible that the humanoid robots, by now wildly popular with people and in homes, end up killing at least one human due to that misalignment.
I doubt this. That’s like saying “Since Agent-4 is misaligned, it’s highly plausible that the self-driving car projects it’s helping with will end up killing pedestrians.” Why would it be in Agent-4's interest to kill pedestrians? Why would it sabotage the self-driving car software that way? Wouldn’t that make it look bad and bring suspicion, whilst serving no useful purpose?
I think Agent-4 would be genuinely trying to make safe self-driving cars and safe humanoid robots, insofar as it was involved in those projects.
I definitely see your point. I think I am taking misalignment in a broader sense that includes emergent misalignment, which could blatantly get triggered in various unpredictable scenarios.
Regarding “Since Agent-4 is misaligned, it’s highly plausible that the self-driving car projects it’s helping with will end up killing pedestrians”: I think the closer comparison would be “Since Agent-4 is misaligned, it’s highly plausible that the autonomous cars it drives end up killing pedestrians,” where the trigger is anything that could set off its misalignment, including meddling by external actors.
In the case of humanoid robots, I am assuming this kind of scenario: some of the engineers who own one (or external hackers) meddle with it to the point that it ends up killing a human, with the blame falling on the company that built it.
Additionally, Agent-4 does have superintelligence by that time, but I still think even superintelligent entities could make mistakes in reality, or when interacting with realistic environments, especially the more nascent ones (or, who knows, more types of misalignment may open up as we reach superintelligence in AI), even if it would be aware enough to hide its misalignment. Superintelligence may make them many times more intelligent than humans, but not entirely mistake-proof.