Another foundation of this program was Iterated Distillation and Amplification (IDA), a set of algorithmic approaches that supposedly offers a path to actually building these approval-directed AI agents.
I am (and have always been) a skeptic of IDA: I just don’t think any of those algorithms would work very well.
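For concreteness, here is a toy, runnable sketch of the amplify/distill loop as I understand it, under heavy simplifying assumptions: the "model" is a lookup table, "amplification" is the human answering a question with the model's help, and "distillation" just memorizes the amplified answers. Every name here is an illustrative stand-in, not anyone's actual implementation.

```python
def amplify(human_answer, model, question):
    """The human answers `question`, consulting the current model on subquestions."""
    subquestions = [question, f"{question} (subproblem)"]  # toy decomposition
    subanswers = [model.get(q, "unknown") for q in subquestions]
    return human_answer(question, subanswers)

def distill(amplified_answers):
    """Real IDA: train a fast model to imitate the amplified system.
    Here: just memorize the question -> answer pairs."""
    return dict(amplified_answers)

def ida(human_answer, questions, n_rounds=3):
    model = {}  # the distilled model starts out knowing nothing
    for _ in range(n_rounds):
        amplified = {q: amplify(human_answer, model, q) for q in questions}
        model = distill(amplified)
    return model

# Usage: a trivial "human" that combines subanswers by concatenation.
# Each round's answer nests the previous round's, showing the iteration.
print(ida(lambda q, subs: f"answer({q} | {subs})", ["Q1"]))
```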
First of all, I doubt the claim that IDA-like agents cannot be bootstrapped to ASI at all. Since GPT-5.4 Pro has likely solved a FrontierMath open problem, I suspect that Yudkowsky’s case against bootstrapping IDA-based agents to ASI either ends up Proving Too Much or actually proves a different claim, namely that we can’t expect IDA agents to be bootstrapped into an aligned ASI.
Additionally, I am not sure to what extent Approval Reward in humans is anything more than a proxy for long-term/large-scale coordination: Alex tricking Hugh would cause Hugh to become adversarial once the truth is revealed. That being said, the hard problem of wireheading the human does have a human analogue: misaligned governments engaging in wholesale deception of their citizens.
Finally, I suspect that humans derive Approval Reward from the entire collective with which they identify, not just from another individual human or collective. What if, say, Agent-3 develops Approval Reward tied to the entire Agent-3 collective instead of the human collective, and that collective’s values drift?
P.S. I also suspect that the first and second issues mentioned by Demski can be partially solved by rewarding the AI for having its explanations understood by the simulated human, well enough that the simulated human could rerun the experiments. Even if the simulated human can’t understand the concept of “metastable covalent plasma flux resonances”, the AI wasn’t born with this concept either: the concept is either BS, in which case the simulated human should reject the plan (edit: think of torsion fields in pseudoscience), or it was created through some interaction of other concepts which are hopefully closer to being comprehensible.
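A minimal sketch of what this reward gating might look like; all names are hypothetical placeholders for what would really be expensive evaluation procedures, not any real API.

```python
def gated_reward(plan, explanation, simulated_human_can_rerun, plan_reward):
    """Reward the AI only if the simulated human, given nothing but the
    explanation, could independently rerun the experiments."""
    # Incomprehensible concepts ("metastable covalent plasma flux
    # resonances") fail this gate and zero out the reward.
    if not simulated_human_can_rerun(explanation):
        return 0.0
    return plan_reward(plan)
```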