This analogy falters a bit if you consider the research proposals that use advanced AI to police itself (a.k.a., tigers controlling tigers). I hope we can scale robust versions of that.
I’ve worked a bit on these kinds of proposals and I’m fairly confident that they fundamentally don’t scale indefinitely.
The limiting factor is how well a model can distinguish a genuine chance to misbehave from the honeypots you're using to catch it out, and it turns out models can do this pretty well.
(There are mitigations, but the mitigations introduce further problems that aren't obviously easier to deal with.)
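To make the "limiting factor" point concrete, here's a toy back-of-the-envelope sketch (my own illustration, not something from this thread, and all parameters `h`, `d`, `f` are made-up assumptions): if the model only misbehaves on opportunities it judges not to be honeypots, the per-defection probability of getting caught collapses as its honeypot-discrimination improves.

```python
# Toy model of honeypot-based catching (illustrative only; parameters are assumptions).
# Setup: a deployed model occasionally gets a chance to misbehave. A fraction `h`
# of those chances are honeypots planted by overseers. Before misbehaving, the
# model runs its own "is this a honeypot?" check, with true-positive rate `d`
# (correctly flags honeypots) and false-positive rate `f` (wrongly flags genuine
# opportunities). It only misbehaves on chances it judges to be genuine.

def p_caught_per_defection(h: float, d: float, f: float) -> float:
    """Probability that a single act of misbehaviour lands on a honeypot."""
    defect_on_honeypot = h * (1.0 - d)        # honeypot slips past the model's check
    defect_on_real = (1.0 - h) * (1.0 - f)    # genuine opportunity passes the check
    total_defections = defect_on_honeypot + defect_on_real
    return defect_on_honeypot / total_defections if total_defections else 0.0

if __name__ == "__main__":
    h, f = 0.5, 0.05  # even if half of all apparent opportunities are honeypots...
    for d in (0.0, 0.5, 0.9, 0.99):
        print(f"discrimination d={d:.2f}: "
              f"P(caught per defection) = {p_caught_per_defection(h, d, f):.3f}")
    # As d -> 1 the catch probability goes to 0: the scheme's leverage is bounded
    # by how well the model can tell real opportunities from your honeypots.
```

With these (made-up) numbers the catch probability per defection falls from roughly 0.5 at d=0 to about 0.01 at d=0.99, which is the sense in which discrimination ability, not honeypot volume, is the bottleneck.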