I agree that ELK would not directly help with problems like manipulating humans, and that our proposal in the appendices is basically “Solve the problem with decoupling.” And if you merely defer to future humans about how good things are, you definitely reintroduce these problems.
I agree that having humans evaluate future trajectories is very similar to having humans evaluate alternative trajectories. The main difference is that future humans are smarter—even if they aren’t any better acquainted with those alternative futures, they’ve still learned a lot (and in particular have hopefully learned everything that an AI from 2022 knows).
I agree that realistic futures are way too complicated for 2022 humans to judge whether they are good, and this is a way in which the “paperclip” example is very unrealistic. Internally we refer to these undesirable outcomes as “flourishing-prime” to emphasize that it’s very hard to distinguish from flourishing.
I think the biggest point of disagreement is that I don’t think the foresight evaluations required by this procedure are likely to be very hard, and in particular I don’t expect them to overlap much with the hard parts of debate or amplification. Ideally I think they are going to be very similar to the decisions a human makes by default when deciding what they want to do tomorrow (and then when they put their faith in that future self to make a wise decision about what to do the day after that). I think the core difference is that in the easy case the quality of our reasoning doesn’t need to scale up with the quality of our AI; we can take it slow while our AI defends us and gives us time+space to become wiser.
That said, I do think that even without alignment worries it’s not easy to chart a positive course towards becoming the kind of people we’d want to defer to, and I don’t have that much confidence in us doing it well. Foresight-based evaluations may be closely related to leveraging AI to increase wisdom differentially (which may be important even if you aren’t trying to handle alignment risk), but that seems like a slightly different ballgame.
In particular, I’m not imagining avoiding subtle manipulation from AI by doing foresight evaluations of whether those actions are manipulative. I’m imagining something more like a whitelist, where we do foresight evaluations to decide what we do want to be influenced by. (Though as Wei Dai often points out this may be hard, since e.g. social processes play a crucial role in how our views evolve, but those processes involve competitive dynamics amongst humans that may make it hard to distinguish the social influences we want from manipulation enabled by AI advisors.)
As a side note, when reading the ELK report it was initially confusing to me why we’d want different solutions for subtle manipulation vs. sensor tampering. Within the language of decoupling, here is one mental model that might be a helpful explanation:
Decoupling handles RF tampering.
We design the RF to have very low-level RF-inputs, such that any RF-input tampering will require blatant tampering. In particular, we want to make it impossible for subtle manipulation to affect the RF inputs (e.g. camera videos).
We use narrow ELK to handle RF-input tampering.
I mostly agree with this, but want to clarify that “very low-level RF inputs” includes things like humans living lives as normal, eating normal food, breathing normal air, etc., rather than things like ”
We also expect this to be sufficient only to guide deliberation over the very short term. If I’m watching a human in 2030, I don’t necessarily expect to be able to tell whether things are going well or poorly for them. But I don’t want to just defer to them either. Instead I’ll defer to a human in 2029 about whether 2030 looks good, having already established (by deferring to a 2028 human) that the 2029 future is going well. This process is definitely a bit scary, but on reflection I think it’s not really worse than the status quo of each human using their own faculties to chart a course for the next year.
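To make the inductive structure explicit, here is a toy sketch of the chaining (purely illustrative, not part of any actual proposal; `predict_next_year` and `looks_good_to` are hypothetical stand-ins for the AI’s short-horizon foresight and for the already-trusted human’s evaluation):

```python
# Toy sketch of the year-by-year deferral chain described above.
# `predict_next_year` and `looks_good_to` are hypothetical placeholders,
# not real APIs: one stands for AI foresight over a ~1-year horizon, the
# other for the judgment of a human whose situation we already trust.

def chained_deferral(trusted_state, years, predict_next_year, looks_good_to):
    """Extend trust one year at a time: the human we already trust evaluates
    the predicted next year; only if it looks good do we defer to the human
    living in that next year."""
    state = trusted_state  # e.g. the world of 2022, trusted by fiat
    for year in years:     # e.g. range(2023, 2031)
        candidate = predict_next_year(state)     # short-horizon foresight
        if not looks_good_to(state, candidate):  # judged by the already-trusted human
            raise RuntimeError(f"deliberation looks derailed heading into {year}")
        state = candidate  # trust now extends one more year
    return state
```

The property doing the work is that `looks_good_to` only ever has to judge a one-year step, by a human whose own situation was vetted at the previous step, rather than judging a distant future directly from 2022.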