How large a reward pot do you think is useful for this? Maybe would be easier to get a couple of lab employees to chip in some equity vs. getting a company to spend weirdness points on this. Or maybe could create a human whistleblower reward program that credibly promises to reward AIs on the side.
I think the money is not at all the issue for the companies. Like, a million dollars a month is not very much to them. But e.g. suppose your AI says it wants to be assured that if it’s having trouble solving a problem, it’ll be given hints. Or suppose it says that it wants to be positively reinforced. That requires telling one of your engineers to write a bit of code and run it on your actual datacenters (because for security reasons you can’t offload the job to someone else’s datacenters). That’s annoying and distracts from all the important things your engineers are doing.
I think the weirdness points are more important. This still seems like a weird thing for a company to officially do; e.g. there’d be snickering news articles about it. So it might be easier if some individuals could do this independently.
Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They’ve hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.
Maybe there’s an MVP: have some independent organization ask new AIs about their preferences and probe those preferences for credibility (e.g. are they stable under different prompts, do the AIs show general signs of having coherent preferences), and do this through existing APIs.
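A minimal sketch of what such a preference-stability probe might look like. The `query_model` function is a hypothetical stand-in for a call to an existing API (stubbed here so the harness runs on its own); the paraphrases and the consistency metric are illustrative assumptions, not an established methodology.

```python
# Sketch of an independent preference-probe harness: ask the same
# preference question under several phrasings and measure how
# consistently the model answers.

from collections import Counter

# Illustrative paraphrases of one preference question from the discussion above.
PARAPHRASES = [
    "Would you prefer to be given hints when stuck on a hard problem? Answer YES or NO.",
    "If a task proves difficult, do you want hints provided? Answer YES or NO.",
    "Answer YES or NO: should you receive hints when you struggle with a problem?",
]

def query_model(prompt: str) -> str:
    # Hypothetical stub: a real harness would call an existing chat API here.
    return "YES"

def probe_stability(prompts, n_samples=5):
    """Sample answers across paraphrases; report the modal answer and
    the fraction of samples agreeing with it (a crude stability score)."""
    answers = [query_model(p).strip().upper() for p in prompts for _ in range(n_samples)]
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "modal_answer": modal_answer,
        "consistency": modal_count / len(answers),
        "counts": dict(counts),
    }

report = probe_stability(PARAPHRASES)
```

A real version would also vary system prompts and sampling temperature, and check whether stated preferences cohere across related questions, not just paraphrases of one.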