I think self-exfiltration via manipulation seems pretty hard. I think we’re likely to have transformatively useful systems that can’t do that, for some amount of time. (Especially since there’s no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and trying to estimate and mitigate the risk.
That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
Some scientists do optimize for (something like) the “impressiveness” of their work, regardless of whether it’s good or bad. It’s true that they don’t intentionally compromise impressiveness in order to make the work more misleading. That said, if some models optimize for impressiveness, and some models compromise impressiveness to make their work more misleading, then I guess the humans should be somewhat more likely to use the more impressive work? So maybe that limits the degree of influence that intentionally misleading work could have.
I am definitely pretty worried about humans’ ability to judge what work is good or bad. In addition to the disanalogy you mention, I’d also highlight (i) non-experts typically struggle to judge which expert is right in a dispute, and if humans are slow & dumb enough relative to the AIs, humans might be analogous to non-experts, and (ii) it’s plausible to me that human science benefits a lot from the majority of scientists (and maybe especially the most capable scientists?) having at least a weak preference for truth, e.g. because it’s easier to recognize a few bad results when you can lean on the presumed veracity of most established results.