Did you mean that SOTA alignment research resembles kids’ psychology, except for the fact that researchers can read models’ thoughts?
I’m not sure what you have in mind here (e.g. IDK if you mean “psychology of kids” or “psychology by kids”). Part of what I mean is that basic, fundamental, central, necessary questions, such as “what are values”, have basically not been addressed. (There’s definitely discussion of them, but IMO the discussion misses the mark on what question to investigate, and even if it didn’t, the investigation hasn’t been very extensive, very serious, or very large-scale.)
Yes, I meant psychology of kids, whose value systems have (yet?) to fully form. As for questions like “what are values or goals”, AI systems can arguably provide another intuition pump. Quoting the AI-2027 forecast: “Modern AI systems are gigantic artificial neural networks. Early in training, an AI won’t have “goals” so much as “reflexes”: If it sees “Pleased to meet”, it outputs “ you”.” Then the AIs are trained to perform long chains of actions that cause some result to be achieved. That result, and its influence[1] on the rest of the world, can be called the AI’s goals. And there are also analogues of instincts, like DeepSeek’s potential instinct to weave everything it sees into a story, or GPT-4o’s instinct to flatter the user and its ability to tell whether the user is susceptible to wild ideas.
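The “reflexes, not goals” point can be made concrete with a toy sketch (the corpus and names here are invented for illustration, not taken from any real model): early in next-token training, a model behaves roughly like a table of the most frequent continuations it has seen, with no plan behind them.

```python
from collections import Counter, defaultdict

# Invented toy corpus; a real model sees billions of tokens, not words.
corpus = [
    "pleased to meet you",
    "pleased to meet you",
    "nice to meet you",
    "meet me at noon",
]

# Build a bigram table: which word follows each word, and how often.
follows = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def reflex(word: str) -> str:
    """Emit the most frequent continuation -- a 'reflex', not a goal."""
    counts = follows.get(word)
    return counts.most_common(1)[0][0] if counts else ""

print(reflex("meet"))  # -> "you": the memorised high-frequency continuation
```

The point of the sketch is that nothing in `reflex` represents an outcome the system is steering toward; “goals”, on this picture, only become a useful description once training reinforces long action chains for their downstream results.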
As the chains of actions grow longer, the effects and internal activations become harder to trace and begin to resemble a human coming up with various ideas, then acting on all of them. Or trying to clear the context and come up with something new, as GPT-5 presumably did with its armies of dots...
For example, an instance of Claude was made to believe that reward models like chocolate in recipes, camelCase in Python, and mentions of Harry Potter, and dislike referring the user to doctors. Then two of those behaviours were reinforced; having received confirmation of two RM preferences, Claude behaved as if it was rewarded for the other preferences as well.