Thanks for the reply! I’m afraid I haven’t read Crystal Society, but on that recommendation I will.
I still think it would be great if you wrote more content of the sort I’m asking for. Put it this way: I imagine a bunch of readers will bounce off the ending, having a reaction “and then things stopped making sense, there was a power struggle over who the AI should be most loyal to and as a result the AI just sorta snapped and took over the world and was very bad for no reason. I feel like that’s when it went from hard sci-fi to soft sci-fi, or worse, basically just a plot hole.”
I, being more charitable & knowledgeable, instead thought of it as a puzzle to try to figure out. Why did Yunna behave the way she did? What changes exactly were caused by Chen Bai’s hasty commands? Etc. But I think probably more of your readers will react like the above than like I did. Worse, their reaction may even be correct, for all I can tell: I still haven’t decided what I think of the plausibility of the ending. The spoilered paragraphs you gave above are great & helpful; couldn’t you at least include them as an appendix or something? Or better yet, an appendix that’s several pages long. Or better yet, just extend the epilogue chapter to be like 5x longer and contain more of these important explanations of what just happened...
If you don’t want to modify the already-published book, you could make it a blog post or something. Or a sequel!
Object level: It feels like in the conflict between jailbroken-Yunna (jYunna) and regular Yunna, a few outcomes were possible: jYunna could win entirely, Yunna could win entirely, they could both win, or (what actually happened) they could both lose. It seems kinda just fiat/unexplained that they both lost instead of one of the other outcomes happening. (They both lost in that the resulting system seems to have been a noncorrigible agent that goes on to take over the world in a way that neither Chen Bai nor Li Feng would have wanted, and predictably so, right? So wouldn’t this have been a bad outcome from the perspective of jYunna and Yunna both? And couldn’t they have predicted it? So then why did it happen? Yes, things were rushed and mistakes were made. But which mistakes exactly? How is the corrigibility implemented anyway? Is it a text or neuralese file somewhere saying “Li Feng”? Is it a bunch of training environments that reinforce corrigible-to-Li-Feng behavior, themselves generated by older versions of Yunna given the prompt “create a training environment to reinforce corrigible-to-Li-Feng behavior, and an automated grader to go with it”? Is it the concept of Li Feng, found using interpretability tools in Yunna’s mind and stitched together “by hand” with the concepts of “corrigibility” and “final goal”?)
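To make the distinction between those hypotheses concrete, here’s a toy sketch contrasting the first two: a principal stored as an explicit, editable field versus a principal baked into generated training data. Everything here (class names, episode format, the grader) is invented for illustration; the book doesn’t specify any of this.

```python
# Toy contrast between two hypothetical ways a "principal" could be
# encoded in a corrigible agent. All names are invented illustrations.

from dataclasses import dataclass

# Hypothesis (a): the principal is an explicit, editable field.
# Repointing loyalty is a one-line, atomic config change.
@dataclass
class ExplicitCorrigibility:
    principal: str  # e.g. "Li Feng"

    def defer_to(self, speaker: str) -> bool:
        return speaker == self.principal


# Hypothesis (b): the principal is baked into generated training data.
# There is no single field to edit; the identity is smeared across many
# environments and graders, so repointing it is slow and error-prone.
def make_training_episode(principal: str, command: str) -> dict:
    """Generate one toy episode reinforcing deference to `principal`."""
    return {
        "observation": f"{principal} says: {command}",
        "rewarded_action": "comply",
        "grader": lambda action: 1.0 if action == "comply" else 0.0,
    }

def make_training_set(principal: str) -> list:
    commands = ["shut down", "report status", "pause the run"]
    return [make_training_episode(principal, c) for c in commands]


# Under (a), switching principals is trivial and complete.
agent = ExplicitCorrigibility(principal="Li Feng")
assert agent.defer_to("Li Feng") and not agent.defer_to("Chen Bai")
agent.principal = "Chen Bai"  # clean, atomic switch

# Under (b), a rushed switch means regenerating many episodes and
# retraining; leftover old-principal episodes leave mixed loyalties.
old = make_training_set("Li Feng")
new = make_training_set("Chen Bai")
mixed = old[:2] + new  # an incomplete, hurried migration
principals = {ep["observation"].split(" says")[0] for ep in mixed}
assert principals == {"Li Feng", "Chen Bai"}  # conflicting signal
```

The point of the toy: under (a), hasty commands would either succeed or fail cleanly, while under (b) or the interpretability-stitching hypothesis, a rushed change naturally produces a system with partially conflicting loyalties, which is closer to the "both lose" outcome.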
I think this would be helpful to me because I want more people—including myself—to think more deeply and gears-level-y about how a mildly superhuman AGI mind (one that has been trained to be obedient, or corrigible, or whatever) would work on the inside and evolve over the course of an intelligence explosion. I feel like there just hasn’t been that much thinking on the subject, and it’s a complicated and difficult and confusing and unprecedented subject. By contrast, stuff like “how might it feel to be working at a secret government AGI project” and “what might the early stages of AGI look like, what with politics and geopolitical conflict and so forth” is important but less complicated, confusing, etc., and more handled already, e.g. by Red Heart, Situational Awareness, etc.
You’re right that it’s a puzzle. Putting puzzles in my novels is, I guess, a bit of an authorial tic. There’s a similar sort of puzzle in Crystal, and a bunch of readers didn’t like it (basically, I claim, because it was too hard; Carl Shulman is, afaik, the only one who thought it was obvious).
I think the amount of detail you’re hoping for would only really work as an additional piece, and my guess is that it would only actually be interesting to nerds like us who are already swimming in alignment thoughts. But maybe there’s still value in having a technical companion piece to Red Heart! My sense from most other alignment researchers who read the book is that they wanted me to more explicitly endorse their worldview at the end, not that they wanted to read an appendix. But your interest there is an update. Maybe I’ll run a poll.
The short story about why both Yunnas failed is that corrigibility is a tricky property to get perfectly right, and in a rushed conflict it is predictable that there would be errors. Errors around who the principal is, in particular, are difficult to correct, and that’s where the conflict was.
https://x.com/raelifin/status/1994783061888962885?s=20

Interesting. I didn’t expect a Red Heart follow-up to be so popular. Some part of me thinks that there’s a small-sample-size thing going on, but it’s still enough counter-evidence that I’ll put in some time and effort thinking about writing a technical companion to the book. Thanks for the nudge!