I found this post a bit odd, in that I was assuming the context was comparing
“Plan A: Humans solve alignment” -versus-
“Plan B: Humans outsource the solving of alignment to AIs”
If that’s the context, you can say “Plan B is a bad plan because humans are too incompetent to know what they’re looking for, or recognize a good idea when they see it, etc.”. OK sure, maybe that’s true. But if it’s true, then both plans are doomed! It’s not an argument to do Plan A, right?
To be clear, I don’t actually care much, because I already thought that Plan A was better than Plan B anyway (for kinda different reasons from you—see here).
I think the missing piece here is that people who want to outsource the solving of alignment to AIs are usually trying to avoid engaging with the hard problems of alignment themselves. So the key difference is that, in Plan B, the people doing the outsourcing usually haven’t attempted to understand the problem very deeply.
Also, Plan B is currently being used to justify accelerating various dangerous tech by folks with no solid angles on Plan A...
I don’t agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later. I expect alignment researchers to be central to automation attempts.
It seems to me like the basic equation is something like: “If today’s alignment researchers would be able to succeed given a lot more time, then they also are reasonably likely to succeed given access to a lot of human-level-ish AIs.” There are reasons this could fail (perhaps future alignment research will require major adaptations and different skills such that today’s top alignment researchers will be unable to assess it; perhaps there are parallelization issues, though AIs can give significant serial speedup), but the argument in this post seems far from a knockdown.
Also, it seems worth noting that non-experts work productively with experts all the time. There are lots of shortcomings and failure modes, but the video is a parody.
“I don’t agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later.”
Indeed, I think you’re a good role model in this regard and hope more people will follow your example.
I took John to be arguing that we won’t get a good solution out of this paradigm (so long as the humans doing it aren’t experts at alignment), rather than that we couldn’t recognize a good solution if it were proposed.
Separately, I think that recognizing good solutions is potentially pretty fraught, especially the more capable the system we’re outsourcing to is. Anything about a proposed solution that we don’t know how to measure, or that we don’t understand, could be exploited, and almost by definition it’s really hard to tell when such failures are present. E.g., a plan with many steps where it’s hard to verify there won’t be unintended consequences, a theory of interpretability which leaves out a key piece we’d need to detect deception, and so on. It’s really hard to know/trust that things like that won’t happen when we’re dealing with quite intelligent machines (potentially optimizing against us), and it seems hard to get this sort of labor out of not-very-intelligent machines, for reasons similar to those John points out in his last post: before a field is paradigmatic, outsourcing doesn’t really work, since it’s difficult to specify questions well when we don’t even know what questions to ask in the first place.
In general these sorts of outsourcing plans seem to me to rely on a narrow range of AI capability levels (one which I’m not even sure exists): smart enough to solve novel scientific problems, but not smart enough to successfully deceive us if it tried. That makes me feel pretty skeptical about such plans working.