A simple analogy for why the “using LLMs to control LLMs” approach is flawed:
It’s like training 10 mice to control 7 chinchillas, which control 4 mongooses, which control 3 raccoons, which rein in one tiger.
A lot has to go right for this to work, and you’d better hope there are no capability jumps like the one between raccoons and tigers.
I just wanted to release this analogy into the wild for any public- or political-facing people to pick up, if it’s useful for persuasion.
This analogy falters a bit if you consider the research proposals that use advanced AI to police itself (a.k.a., tigers controlling tigers). I hope we can scale robust versions of that.
I’ve worked a bit on these kinds of proposals and I’m fairly confident that they fundamentally don’t scale indefinitely.
The limiting factor is how well a model can tell its own bad behaviour apart from the honeypots you’re using to catch it out, and, as it turns out, models can do this pretty well.
(There are mitigations, but they introduce further problems that aren’t obviously easier to deal with.)
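To make the limiting factor concrete, here’s a rough toy simulation (my own sketch, not a real control setup: the Gaussian “looks like a honeypot” score, the discrimination parameter `d_prime`, and the `catch_rate` helper are all illustrative assumptions). The model only defects when it’s confident an opportunity isn’t a honeypot; the better it can tell the two apart, the less often a defection attempt lands on a honeypot and gets caught.

```python
import random

def catch_rate(d_prime: float, n_trials: int = 200_000, threshold: float = -1.0) -> float:
    """Fraction of attempted defections that land on a honeypot (i.e., get caught)."""
    caught = attempted = 0
    for _ in range(n_trials):
        is_honeypot = random.random() < 0.5
        # The model's internal "this looks like a honeypot" score:
        # honeypots score higher by d_prime on average (its discrimination ability).
        score = random.gauss(d_prime if is_honeypot else 0.0, 1.0)
        # Strategy: only defect when confident the opportunity is real.
        if score < threshold:
            attempted += 1
            caught += is_honeypot
    return caught / attempted if attempted else 0.0

for d in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"discrimination d'={d:.1f}: caught on {catch_rate(d):.2%} of attempted defections")
```

Under these toy assumptions, a model with no discrimination (d'=0) gets caught on roughly half of its attempted defections, but by around d'=2 the catch rate already falls below 1%, which is the sense in which the approach stops scaling.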