I suspect that, at least if such an answer has good performance, it would be widely used. Quite a few problems with current ML look like some form of Goodharting, so even in a short-term pragmatic sense it would be useful. (Of course, making short-term Goodharting not a problem in practice is easier than a full alignment solution.)
At least a fair few high-status people will look over it and say it looks good, and quite a lot of people will be saying “this is what you should do for AI safety.” So there is a decent chance the first project uses this technique, especially if a bunch of rationalist types throw their weight behind it. And after that, it’s a decisive strategic advantage.
Donald Hobson has a good point about Goodharting, but it’s simpler than that. While some people want alignment so that everyone doesn’t die, the rest of us still want it for what it can do for us. If I prompt a language model with “A small cat went out to explore the world,” I want it to come back with a nice children’s story about a small cat that went out to explore the world, one I can show to just about any child. If I prompt a robot to “bring me a nice flower,” I do not want it to steal my neighbor’s rosebushes. And so on: I want it to be safe to give some random AI helper whatever lazy prompt is on my mind and have it improve things according to my preferences.