Hey, building safety monitors is a direction I’m considering myself.
The main thing I’m concerned about is: I might call them “safety monitors”, but in practice they might be “things that make AI more reliable to use in a commercial context”. In the counterfactual world where nobody builds these safety monitors, do people simply deploy AI less because they know AIs are unreliable? And if AIs are less reliable, is building them less profitable?
Another way to phrase this is: “Could this project also be built by the capabilities team of [some AGI company that doesn’t care about safety beyond what they need for making money]?”. I’d be especially worried about this if your product is able to get an AI to automate large parts of technical AI Safety work—this sounds like it is mostly making AIs more capable, and (if I understand correctly) you’re considering selling this product.
On the other hand:
I think there’s a lot of potential for a project like this to find reproducible examples of misalignment in realistic tasks at scale[1] (which seems like a bottleneck for many current AI Safety research agendas).
The hard part here seems to be your customer’s privacy, since the example might involve e.g. their proprietary code. I think there could be solutions here that would work for everyone, but they don’t seem trivial.
Edit: I’m not generally against for-profit AI Safety companies; just replying with considerations I see with “safety monitors” and “automate AI safety research” as for-profit companies
[1] I think you mentioned this in the post, but neither I nor Claude can find it.
The main downside here seems to be that having good monitors will make AIs more capable for commercial use cases, which means developing foundation models will be (knowingly) more profitable, less likely to be a bubble, and so on.
What do you think?
Also, got opinions about the “sharing examples of misalignment while preserving privacy” part?
Also, you wrote:
These employees aren’t all working on things like safety monitors (or whatever your startup would specialize in).