Re counterfactual impact: the biggest shift came from talking to Nate at BAGI, after which I wrote this post on disentangling arguments about AI risk, in which I identified the “target loading problem”. This seems roughly equivalent to inner alignment, but was meant to avoid the difficulties of defining an “inner optimiser”. At some subsequent point I changed my mind and decided it was better to focus on inner optimisers—I think this was probably catalysed by your paper, or by conversations with Vlad which were downstream of the paper. The paper definitely gave me some better terminology to mentally latch onto, which helped steer my thoughts in more productive directions.
Re 2d robustness: this is a good point. So maybe we could say that the process orthogonality thesis is somewhat true, in a “spherical cow” sense. There are some interventions that only affect capabilities, or only alignment. And it’s sometimes useful to think of alignment as being all about the reward function, and capabilities as involving everything else. But as with all spherical cow models, this breaks down when you look at it closely—e.g. when you’re thinking about the “curriculum” which an agent needs to undergo to become generally intelligent. Does this seem reasonable?
Also, I think that many other people believe in the process orthogonality thesis to a greater extent than I do. So even if we don’t agree about how much it breaks down, if this is a convenient axis which points in roughly the direction of our disagreement, then I’d still be happy about that.