If you’re wondering “what alignment research is like”, there’s no such thing. Most people don’t do real alignment research, and the people that do have pretty varied ways of working. You’ll be forging your own path.
Then what is proper alignment research? SOTA alignment research includes things like: showing that training models in a hack-filled environment misaligns them unless hacking is framed as a good act (e.g. models should hack the reward since it helps their hosts understand that the environment is hacky); the finding that emergent misalignment can be caused by LOTS of things, including aesthetic preferences (and, funnily enough, scatological answers); the finding that the CoT can be obfuscated by training on the output, unless, of course, one follows Kokotajlo’s proposal; agents which audit the models for alignment; etc.
Did you mean that SOTA alignment research resembles kids’ psychology, except that researchers can read the models’ thoughts? If so, then the important problems alignment research has failed to solve would be like adults’ psychology, or more general concepts like AIs becoming the new proletariat. Or problems which @Wei Dai tried to legibilize, some of which I don’t actually endorse, like his case for superastronomical waste being possible.
Basically I’m saying that it barely makes sense to call SOTA alignment research alignment research. This may sound harsh, so just to clarify, this isn’t a knock on it. I don’t follow it much, but some of it seems like good research; it’s definitely helpful to make AI risks more legible to other people, and some of this research helps with that; and arguably, on the margin, really good legibilization research in general is significantly more important than actual alignment research, because it helps with slowing down capabilities research, which is more likely to work and more likely to help soon.
Just from the perspective of “is this research building towards understanding how to align an AGI”, here’s a tacky analogy that maybe communicates a bit, where the task is “get to the moon, starting from Ancient Greece tech”:
- SOTA would be mostly stuff like using the naked eye to make nicer sketches of the moon.
- Some interpretability might be like developing stone-grinding methods; eventually you might use it to develop telescopes and look at the moon more closely, which is still irrelevant; but eventually, eventually, you could develop other lenses, like for cameras, which is still not a solution at all but would help.
- Stuff like cheese vectors is like going around and mixing random substances together; it won’t work, but it’s kinda hitting at the right sector (because you could hit upon reactive substances, eventually leading you to find explosive substances, and eventually rocket propellant).
- Most “agent foundations” would be like mathematicians screwing around. It’s interesting, it’s a healthy sort of research to have in your portfolio and in your community, and in fact it does build up in the long term towards calculus, which will be very relevant for calculating all sorts of things, such as propellant amounts, trajectories, and lots of things in manufacturing; but it also takes a really, really long time.
- Then there’s some secret thing, unknown to me and to anyone else AFAIK, that somehow makes you figure out alignment within 50 years (/ figure out rockets getting to the moon in 100 years, starting from Ancient Greece tech).
I’m not sure what you have in mind here (e.g. IDK if you mean “psychology of kids” or “psychology by kids”). Part of what I mean is that basic, fundamental, central, necessary questions, such as “what are values”, have basically not been addressed. (There’s definitely discussion of them, but IMO the discussion misses the mark on what question to investigate, and even if it didn’t, it hasn’t been very much, very serious, or very large-scale investigation.)
Yes, I meant psychology of kids, whose value systems have (yet?) to fully form. As for questions like “what are values or goals”, AI systems can arguably provide another intuition pump. Quoting the AI-2027 forecast: “Modern AI systems are gigantic artificial neural networks. Early in training, an AI won’t have ‘goals’ so much as ‘reflexes’: If it sees ‘Pleased to meet’, it outputs ‘ you’.” Then the AIs are trained to carry out long chains of actions which cause a result to be achieved. That result, and its influence[1] on the rest of the world, can be called the AI’s goals. There are also analogues of instincts, like DeepSeek’s potential instinct to write everything it sees into a story, GPT-4o’s instinct to flatter the user, or its ability to tell whether the user is susceptible to wild ideas.
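The “reflexes, not goals” point can be made concrete with a toy sketch (my own illustration, not from the AI-2027 text, and nothing like a real neural network): a pure reflex model is just a conditional lookup trained on co-occurrence counts. It maps a prompt to the continuation it has seen most often, with no goal anywhere in the system.

```python
from collections import Counter, defaultdict

def train_reflexes(corpus, prompt_len=3):
    """Build a stimulus -> response table from raw text.

    For every window of `prompt_len` words, count which word follows it,
    then keep only the single most frequent continuation per prompt.
    """
    counts = defaultdict(Counter)
    for line in corpus:
        words = line.split()
        for i in range(len(words) - prompt_len):
            prompt = tuple(words[i:i + prompt_len])
            counts[prompt][words[i + prompt_len]] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}

# Tiny made-up corpus for illustration.
corpus = [
    "pleased to meet you said the host",
    "pleased to meet you too",
]
reflexes = train_reflexes(corpus)
print(reflexes[("pleased", "to", "meet")])  # prints: you
```

The point of the sketch is that nothing here represents an outcome the system is steering towards; “goals” only become a useful description once training selects for long action chains that reliably cause a result.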
As the chains of actions grow longer, the effects and internal activations become harder to trace, and begin to resemble a human coming up with various ideas and then acting on them all. Or trying to clear the context and come up with something new, as GPT-5 presumably did with its armies of dots...
For example, an instance of Claude was made to believe that reward models like chocolate in recipes, camelCase in Python, and mentions of Harry Potter, and don’t like referring the user to doctors. Then two of these behaviours were reinforced: Claude got confirmation of two RM preferences and… behaved as if it had been rewarded for the two other preferences as well.
“SOTA alignment research includes stuff like showing that training the models on a hack-filled environment misaligns them unless hacking is framed as a good act”
I am not sure that these are examples of the kind of alignment research @TsviBT meant, since the post concerns AGI.
SOTA alignment researchers at Anthropic can:
- prove the existence of phenomena through explicitly demonstrating them.
- make empirical observations and proofs about the behaviour of contemporary models.
- offer conjectures about the behaviour of future models.
Nobody at Anthropic can offer (to my knowledge) a substantial scientific theory that would give reason to be extremely confident that any technique they’ve found will extend to models in the future. I am not sure if they have ever explicitly claimed that they can.
I doubt that Anthropic actually promised to be able to do so. What they promised in their scaling policy was to write down the ASL-4-level security measures that they would adopt by the time they decide to deploy[1] an ASL-4-level model: “Our ASL-4 measures aren’t yet written (our commitment[2] is to write them before we reach ASL-3), but may require methods of assurance that are unsolved research problems today, such as using interpretability methods to demonstrate mechanistically that a model is unlikely to engage in certain catastrophic behaviors.” IIRC someone claimed that if Anthropic found alignment of capable models to be impossible, then Anthropic would shut itself down.
As far as I understand, Anthropic’s research on economics also fails to account for the Intelligence Curse rendering the masses totally useless to both governments and corporations, leaving governments without even an incentive to pay a UBI.
As for the alignment research @TsviBT likely meant, I tried to cover this in the very next paragraph. There are also disagreements among people who work on high-level problems, and the fact that we have yet to study anything resembling General Intelligences aside from humans and SOTA LLMs.
What I don’t understand is whether deployment here means external or internal deployment. This is crucial because defining deployment to be external would allow them to use Agent-4 internally, have Agent-4 create Agent-5, and present a FALSE case for Agent-5 being aligned.
Alas, said commitment was violated.