Hey, building safety monitors is a direction I’m considering myself.
The main thing I’m concerned about is: I might call them “safety monitors”, but in practice they might be “things that make AI more reliable for use in a commercial context”. In the counterfactual world where nobody builds these safety monitors, do people simply deploy AI less because they know AIs are unreliable? And if AIs are less reliable, is building them less profitable?
Another way to phrase this is: “Could this project also be built by the capabilities team of [some AGI company that doesn’t care about safety beyond what they need for making money]?”. I’d be especially worried about this if your product is able to get an AI to automate large parts of technical AI Safety work—this sounds like it is mostly making AIs more capable, and (if I understand correctly) you’re considering selling this product.
On the other hand:
I think there’s a lot of potential for a project like this to find reproducible examples of misalignment in realistic tasks at scale[1] (which seems like a bottleneck for many current AI Safety research agendas).
The hard part here seems to be your customers’ privacy, since the examples might involve e.g. their proprietary code. I think there could be solutions here that would work for everyone, but they don’t seem trivial.
Edit: I’m not generally against for-profit AI Safety companies; just replying with considerations I see with “safety monitors” and “automate AI safety research” as for-profit companies
I think it’s plausible that good monitors will make consumer applications of AI more capable and effective. In some sense, safety is a relevant blocker for parts of that at the moment.
Though, I think it is quite unlikely to push the frontier and I think the negative externalities of non-lab developers being faster at coding are very small. On average, it just seems to increase productivity.
I’d also expect that the monitors we build, which are not directly targeted at making frontier AIs more effective, won’t happen to be more effective at that than the 100s of employees who push the boundaries of the frontier full-time.
So on balance, I think the risk is pretty low and the benefits are high. This was one of the considerations we thought through in depth before making the decision to make monitors.
The main downside here seems to be that having good monitors will make AIs more capable for commercial use cases, which means developing foundation models will be (knowingly) more profitable, less likely to be a bubble, and so on.
What do you think?
Also, got opinions about the “sharing examples of misalignment while preserving privacy” part?
Also, you wrote:
“the 100s of employees who push the boundaries of the frontier full-time”
These employees aren’t all working on things like safety monitors (or whatever your startup would specialize in).
[1] I think you mentioned this in the post, but neither Claude nor I can find it.