I took John to be arguing that we won’t get a good solution out of this paradigm (so long as the humans doing it aren’t experts at alignment), rather than that we couldn’t recognize a good solution if one were proposed.
Separately, I think that recognizing good solutions is potentially pretty fraught, and more so the more capable the system we’re outsourcing to is. Anything about a proposed solution that we don’t know how to measure or don’t understand could be exploited, and almost by definition it’s really hard to tell that those failures exist. E.g., a plan with many steps where it’s hard to verify there won’t be unintended consequences, or a theory of interpretability which leaves out a key piece we’d need to detect deception. It’s really hard to know or trust that things like that won’t happen when we’re dealing with quite intelligent machines (potentially optimizing against us), and it seems hard to get this sort of labor out of not-very-intelligent machines, for reasons similar to those John points out in his last post: before a field is paradigmatic, outsourcing doesn’t really work, since it’s difficult to specify questions well when we don’t even know what questions to ask in the first place.
In general, these sorts of outsourcing plans seem to me to rely on a narrow range of AI capability levels (one which I’m not even sure exists): smart enough to solve novel scientific problems, but not smart enough to successfully deceive us if it tried. That makes me pretty skeptical that such plans will work.