Yes, I meant the psychology of kids, whose value systems have (yet?) to fully form. As for questions like “what are values or goals”, AI systems can arguably provide another intuition pump. Quoting the AI-2027 forecast: “Modern AI systems are gigantic artificial neural networks. Early in training, an AI won’t have “goals” so much as “reflexes”: If it sees “Pleased to meet”, it outputs “ you”.” The AIs are then trained to perform long chains of actions that cause some result to be achieved. That result, and its influence[1] on the rest of the world, can be called the AI’s goals. There are also analogues of instincts, like DeepSeek’s potential instinct to weave everything it sees into a story, GPT-4o’s instinct to flatter the user, or its ability to tell whether the user is susceptible to wild ideas.
As the chains of actions grow longer, the effects and internal activations become harder to trace, and the process begins to resemble a human coming up with various ideas and then acting on them. Or trying to clear the context and come up with something new, as GPT-5 presumably did with its armies of dots...
For example, an instance of Claude was made to believe that reward models like chocolate in recipes, camelCase in Python, and mentions of Harry Potter, and dislike referring the user to doctors. Then two of these behaviours were reinforced: Claude got confirmation of two of the RM preferences and… behaved as if it had been rewarded for the other preferences as well.