One obvious question for me, as someone who prefers to analyze safety problems through near-term perspectives whenever possible: what if the models we currently have access to are the most trusted models we’ll ever have? Would these kinds of security methods work, or are these models not powerful enough?
My reasonably informed guesses:
No, they are not close to powerful enough. Not only could they be deliberately fooled; more importantly, they’d break things all the time even when nobody was trying to fool them.
That won’t stop people from selling the stuff you propose in the short term… or from buying it.
In the long term, the threat actors probably aren’t human; humans might not even be setting the high-level goals. Those objectives, and the targets available, might change a great deal, and the basic software landscape will probably change a lot too… hopefully with AI producing a lot of provably correct software. At that point, I’m not sure I want to risk any guesses.
I don’t know how long the medium term is.