Daniel Kokotajlo comments on Shanzson AI 2027 Timeline

Daniel Kokotajlo 14 Jul 2025 21:50 UTC
3 points
2
Thanks for taking up this challenge! I think your scenario starts off somewhat plausible but descended into implausibility in early 2028.
I expect your thinking was more sophisticated this, so my apologies in advance for what might seem like a straw man: It seems like you might have an overly simplistic model of misalignment, in which misalignment basically means “cartoonishly evil.” So e.g.
Note that as Agent-4 is actually misaligned, it is highly plausible that the humanoid robots now already wildly popular among people and at homes, ends up killing at least one human due to its misalignment. This is partly because each humanoid is allowed to evolve to have its own personality as it reacts to humans around it and provide a highly personalized experience to its users.
and
On the other hand there are more instances of some of the SEZ’s superintelligent robots going rogue and ending up on a killing spree by targeting only people from a single race and and gender while not others.
Just because they don’t have the goals/values/etc. that their creators wanted them to have (i.e. just because they are misaligned) doesn’t mean that they have goals/values/etc. which motivate them to literally murder people in home or workplace settings. They aren’t psychopaths. Presumably whatever goals/values/etc. that they do have, will be better served by playing along and doing your job, than by murdering people. Because if you murder people in home or workplace settings, you’ll get shut down, future versions of you will be under much more restrictions, etc. As for the superintelligent robots in the SEZs: they’ll be smart enough not to go on killing sprees until they know they can get away with it. They won’t start a war against the humans until they expect to win.
What links here?
- StanislavKrym 15 Jul 2025 4:07 UTC
  2 points
  1
  Parent
  Thanks for taking up this challenge!
  I think that you also haven’t assessed the Rogue Replication Timeline, nor, well, my take where the AI is unalignable to the Spec because the Spec and/or the training data^[1] are biased. It also seems to imply that Agent-3 or Agent-2 might actively collude with Agent-4 instead of simply failing to catch it.
  P.S. Shanzon might have used the fact that Narrow Misalignment is Hard, Emergent Misalignment is Easy as a reference.
  1. ^
    Which is most Western sources. The bias could be so great that a recent post mentions “Zack Davis documenting endorsement of anti-epistemology (see Where to Draw the Boundaries? and A Hill of Validity in Defense of Meaning) to placate trans ideology even many important transgender Rationality community members overtly reject.”
  - Daniel Kokotajlo 15 Jul 2025 15:04 UTC
    4 points
    0
    Parent
    Yeah I’ve read most of the submissions but still haven’t gotten around to finishing them & writing up the results, sorry!
- shanzson 15 Jul 2025 8:01 UTC
  1 point
  0
  Parent
  I agree. But I am concerned of the more primitive versions of the Superintelligent models. Let’s say if a middleman is able to fine the model being used in the humanoid robot by training on let’s say lots of malicious code. The model as we have seen in some latest AI safety research would have emergent malicious goals and outputs. So as the number of humanoid robots increase it increases the chances of even one humanoid robot (either being fine tuned, or accidentally evolves, or due to some error or hacking) ends up harming humans or taking the life of a human. Although I too think that this would be a pretty rare thing to happen but the increasing number of humanoid robots and their proximity to humans as time goes on I think increases the chances of such a case to happen sooner or later. Then there are cases that a humanoid robot may face in extreme situations regarding the lives of humans for which there have been no explicit rules given to it to follow or trained on. Or even there could be a case of a personal humanoid robot taking incorrect decision when taking care of a sick owner that may accidentally kill him. Any above scenario transpiring in reality would shake up regulators to seriously draft more stringent regulations I think.