a company successfully solves control for “high-stakes”/“concentrated” threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters, self-exfiltrating, or causing any similar catastrophic action.
This seems like quite a difficult task, especially since we (or some other trusted agent) need to read the outputs of these AIs to get any value from their safety research. Schemers could self-exfiltrate by manipulating us somehow.
The company probably can’t initially tell the schemers apart from the aligned AIs, or indeed tell which of the projects are succeeding on a day-to-day basis. (It’s just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it’s likely much easier to adjudicate whose solutions make sense to adopt.
This is similar to how, in well-functioning scientific fields, it’s not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work. That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
I think self-exfiltration via manipulation seems pretty hard. I think we’re likely to have transformatively useful systems that can’t do that, for some amount of time. (Especially since there’s no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and try to mitigate and estimate the risk.
That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
Some scientists do optimize for (something like) “impressiveness” of their work, regardless of whether it’s good or bad. It’s true that they don’t intentionally compromise impressiveness in order to make the work more misleading. That said, if some models optimize for impressiveness, and some models compromise impressiveness to make their work more misleading, then I guess the humans should be somewhat more likely to use the more impressive work? So maybe that limits the degree of influence that intentionally misleading work could have.
I am definitely pretty worried about humans’ ability to judge what work is good or bad. In addition to the disanalogy you mention, I’d also highlight (i) non-experts typically struggle to judge which expert is right in a dispute, and if humans are slow & dumb enough relative to the AIs, humans might be analogous to non-experts, and (ii) it’s plausible to me that human science benefits a lot from the majority of scientists (and maybe especially the most capable scientists?) having at least a weak preference for truth, e.g. because it’s easier to recognize a few bad results when you can lean on the presumed veracity of most established results.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work
If you ask each variant to review the research of the other variants, then the schemers need to optimise their research for looking good to each variant. But the optimistic assumption is that at least one variant is an equally capable non-schemer.
then the schemers need to optimise their research for looking good to each variant
Not necessarily. The nice AIs also need to be able to win a debate against the schemers, as judged by humans. It’s not enough for the variants to be able to recognize poor research if they can’t show their work in a way the schemers can’t refute.
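To make the dynamic in this exchange concrete, here is a minimal toy sketch in Python. Everything in it is assumed for illustration: the number of variants, which ones are schemers, the chance an honest reviewer flags misleading work, and the chance the humans side with the honest critic in the resulting debate. A misleading artifact gets adopted either when no variant flags it, or when the flagging variant loses the human-judged debate.

```python
# Toy sketch of the cross-review-plus-debate setup discussed above.
# All names and probabilities are made up for illustration.
import random

random.seed(0)

VARIANTS = ["A", "B", "C", "D"]   # parallel model variants
SCHEMERS = {"B", "D"}             # unknown to the humans in this toy setup
P_HONEST_DETECTS = 0.9            # assumed: chance an honest reviewer flags misleading work
P_HONEST_WINS_DEBATE = 0.7        # assumed: chance humans side with the honest critic


def endorses(reviewer: str, author: str) -> bool:
    """Does this reviewer endorse the author's research artifact?"""
    if reviewer in SCHEMERS:
        return True  # schemers rubber-stamp misleading work
    if author in SCHEMERS:
        return random.random() > P_HONEST_DETECTS  # honest reviewer may miss the problem
    return True  # honest reviewers endorse honest work


def adopted(author: str) -> bool:
    """Is the author's artifact ultimately adopted by the humans?"""
    critics = [r for r in VARIANTS if r != author and not endorses(r, author)]
    if not critics:
        return True  # no variant objected, so the humans adopt it
    # Flagged work goes to a human-judged debate; detection alone is not
    # enough -- the honest critic also has to win that debate.
    return random.random() > P_HONEST_WINS_DEBATE


trials = 10_000
hits = sum(adopted(author) for author in SCHEMERS for _ in range(trials))
print(f"misleading artifacts adopted: {hits / (len(SCHEMERS) * trials):.2f}")
```

With these made-up numbers, roughly 30% of the misleading artifacts still get adopted, which is the point above: having an equally capable non-schemer catch the problem doesn’t help much unless it can also win the human-judged debate.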