“SOTA alignment research includes stuff like showing that training the models on a hack-filled environment misaligns them unless hacking is framed as a good act”
I am not sure that these are examples of the kind of alignment research TsviBT meant, as the post concerns AGI.
SOTA alignment researchers at Anthropic can:
- prove the existence of phenomena by explicitly demonstrating them.
- make empirical observations and proofs about the behaviour of contemporary models.
- offer conjectures about the behaviour of future models.
Nobody at Anthropic can offer (to my knowledge) a substantial scientific theory that would give reason to be extremely confident that any technique they’ve found will extend to future models. I am not sure they have ever explicitly claimed that they can.
I doubt that Anthropic actually promised to be able to do so. What they promised in their Responsible Scaling Policy was to write down ASL-4-level security measures that they would implement by the time they decide to deploy[1] an ASL-4-level model: “Our ASL-4 measures aren’t yet written (our commitment[2] is to write them before we reach ASL-3), but may require methods of assurance that are unsolved research problems today, such as using interpretability methods to demonstrate mechanistically that a model is unlikely to engage in certain catastrophic behaviors.” IIRC, someone claimed that if Anthropic found alignment of capable models to be impossible, then Anthropic would shut itself down.
As far as I understand, Anthropic’s research on economics also fails to account for the Intelligence Curse rendering the masses totally useless to both governments and corporations, leaving governments without even an incentive to pay a UBI.
As for the alignment research @TsviBT likely meant, I tried to cover it in the very next paragraph. There are also the disagreements among people who work on high-level problems, and the fact that we have yet to study anything resembling a general intelligence aside from humans and SOTA LLMs.
What I don’t understand is whether the deployment in question is external or internal. This is crucial, because defining deployment as external only would allow them to use Agent-4 internally, have Agent-4 create Agent-5, and present a FALSE case for Agent-5 being aligned.
Alas, said commitment was violated.