Or because they are not optimizers at all.
I don’t agree: they do, in a sense, optimize the goal of being an HHH assistant. We could almost say they optimize the goal of being aligned. As nostalgebraist reminds us, Anthropic’s HHH paper was alignment work in the first place. It’s not that surprising that such optimizers turn out to be more aligned than the canonical optimizers envisioned by Yudkowsky.
Edit: to clarify: by “they” I mean the base models trying to predict the answers of an HHH assistant as well as possible (“as well as possible” being clearly an optimization process, or I don’t know what else to call it). And in my opinion, a sufficiently good prediction is effectively, or practically, a simulation. Maybe not a bit-perfect simulation, but a lossy one, a heuristic approximation of a simulation.
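To make the “prediction is optimization” point a bit more precise (my own sketch in standard notation, not something from the paper or the post): pretraining selects parameters that minimize the next-token cross-entropy,

$$\theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\sum_{t} -\log p_{\theta}(x_{t} \mid x_{<t})\Big],$$

and the ideal minimizer of this objective is the true conditional distribution of the data, $p_{\theta^{*}}(\cdot \mid x_{<t}) = p_{\mathcal{D}}(\cdot \mid x_{<t})$. So when the context $x_{<t}$ frames an HHH assistant, the better the predictor, the closer its output distribution is to that assistant’s answer distribution. That is the sense in which a sufficiently good predictor is a lossy simulation of the assistant.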