So far as I can tell, there are still a number of EAs out there who did not get the idea of “the stuff you do with gradient descent does not pin down the thing you want to teach the AI, because it’s a large space and your dataset underspecifies that internal motivation” and who go, “Aha, but you have not considered that by TRAINING the AI we are providing a REASON for the AI to have the internal motivations I want! And have you also considered that gradient descent doesn’t locate a RANDOM element of the space?”
I don’t much expect that the primary proponents of this talk can be rescued, but maybe the people they propagandize can be.
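To make the underspecification point concrete, here is a toy sketch (my own illustration, not anything from the thread): two small networks trained by gradient descent on the same data both drive the training loss to essentially zero, yet behave differently outside the training distribution. Nothing in the loss picks between them; the dataset simply does not pin that down. All names here are hypothetical scaffolding for the illustration.

```python
# Toy illustration (not from the original discussion): the training data
# underspecifies the learned function. Two nets fit the data equally well
# but diverge off-distribution.
import torch
import torch.nn as nn

# Training data: y = x^2 sampled only on [-1, 1] (a narrow slice of input space).
x_train = torch.linspace(-1, 1, 20).unsqueeze(1)
y_train = x_train ** 2

def train_net(seed: int) -> nn.Module:
    # Same architecture, same data, same optimizer; only the seed differs.
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                        nn.Linear(64, 64), nn.Tanh(),
                        nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x_train), y_train)
        loss.backward()
        opt.step()
    return net

net_a, net_b = train_net(1), train_net(2)
x_ood = torch.tensor([[3.0], [5.0], [10.0]])  # far outside the training range
with torch.no_grad():
    print("train losses:",
          nn.functional.mse_loss(net_a(x_train), y_train).item(),
          nn.functional.mse_loss(net_b(x_train), y_train).item())
    print("off-distribution outputs:",
          net_a(x_ood).squeeze().tolist(),
          net_b(x_ood).squeeze().tolist())
```

Both runs look identical from inside the training distribution; which extrapolation you actually got was decided by the inductive biases and the seed, not by anything the dataset specified.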
It appears that the content of this story under-specifies/mis-specifies your internal motivations when writing it, at least relative to the search space and inductive biases of the learning process that is me.
Ha, I was about to type something very similar. The above comment in particular is also ambiguous: it could alternatively be read as coming from someone who has decided that nothing in the LLM reference class will ever be capable enough to be threatening or “incorrigible” in any meaningful sense, because of underspecified inductive biases that promote ad-hoc shortcut learning, and maybe intrinsic limitations of human text as a medium to learn representations from. Someone might argue, “Aha, but by TRAINING the AI to predict the next token or perform longform tasks we are providing a REASON for the AI to develop coherent, homuncular internal motivations that I FEAR! And have you also considered that gradient descent doesn’t locate a RANDOM element of the space?”