I am going to interpret this as a piece of genre subversion, where the genre is “20k-word allegorical AI alignment dialogue by Eliezer Yudkowsky,” and I have to say that it worked on me. I was entirely convinced that this was just another alignment dialogue piece (albeit one with some really confusing plot points) and was somewhat puzzled as to why you were writing yet another one of those. That meant I was entirely taken aback by the plot elements in the final sections. Touché.
Doesn’t seem like a genre subversion to me; it’s just a bit clever/meta while still centrally being an allegorical AI alignment dialogue. IDK what the target audience is, though (maybe Eliezer just felt inspired to write this).
So far as I can tell, there are still a number of EAs out there who did not get the idea of “the stuff you do with gradient descent does not pin down the thing you want to teach the AI, because it’s a large space and your dataset underspecifies that internal motivation” and who go, “Aha, but you have not considered that by TRAINING the AI we are providing a REASON for the AI to have the internal motivations I want! And have you also considered that gradient descent doesn’t locate a RANDOM element of the space?”
I don’t much expect that the primary proponents of this talk can be rescued, but maybe the people they propagandize can be.
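To make the underdetermination point concrete, here is a minimal toy sketch (purely illustrative, not an actual training run, and every name in it is made up for the example): two candidate functions that agree perfectly on the training set, so no amount of fitting that data can distinguish them, yet they diverge off-distribution.

```python
# Toy illustration (assumed/illustrative): the training data underdetermines
# which function gets learned.
import numpy as np

# "Training set": the behaviour we reward.
x_train = np.array([0.0, 1.0, 2.0])
y_train = x_train  # target behaviour: the identity function

f_intended = lambda x: x                          # the motivation we hoped to instill
f_alternative = lambda x: x + np.sin(np.pi * x)   # agrees on every training point

# Both achieve (numerically) zero training loss, so the data cannot tell them apart.
print(np.mean((f_intended(x_train) - y_train) ** 2))     # 0.0
print(np.mean((f_alternative(x_train) - y_train) ** 2))  # ~0.0

# Off-distribution, they diverge.
x_test = 2.5
print(f_intended(x_test), f_alternative(x_test))  # 2.5 vs 3.5
```

Which of the training-equivalent candidates gradient descent actually lands on is decided by its inductive biases, not by which one we meant.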
It appears that the content of this story under-specifies/mis-specifies your internal motivations when writing it, at least relative to the search space and inductive biases of the learning process that is me.
Ha, I was about to type something very similar. The above comment in particular is also ambiguous: it could alternatively be read as coming from someone who has decided that nothing in the LLM reference class will ever be capable enough to be threatening or “incorrigible” in any meaningful sense, because of underspecified inductive biases that promote ad-hoc shortcut learning, and maybe intrinsic limitations to human text as a medium to learn representations from. Someone might argue, “Aha, but by TRAINING the AI to predict the next token or perform longform tasks we are providing a REASON for the AI to develop coherent, homuncular internal motivations that I FEAR! And have you also considered that gradient descent doesn’t locate a RANDOM element of the space?”
rot13: Gur snvyfnsr jnf gur bayl cneg gung fhecevfrq zr, naq gur qvrtrgvp rkcynangvba sbe ubj guvf frghc pbhyq unir unccrarq ng nyy jnf phgr.
Honestly, I think this genuinely was the most parsimonious explanation Eliezer could think of (I’m unsure whether planned or post hoc); I reacted with “oh, I am now somewhat less confused by this setting.”