I imagine you will like the paper on Self-Other Overlap. To me this seems like a much better approach than, say, Constitutional AI. Not because of what it has already demonstrated, but because it’s a step in the right direction.
In that paper, instead of just rewarding the AI for spitting out text that is similar both when the prompt is about the AI itself and when it is about someone else, the authors tinkered with the model’s internal activations so that the AI actually thinks about itself and others similarly. Of course, there is the “if I ask the AI to make me a sandwich, I don’t want the AI to make itself a sandwich” concern if you push this technique too far, but still. If you ask me, “What will an actual working solution to alignment look like?” I’d say it will look a lot less like Constitutional AI and a lot more like Self-Other Overlap.
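Roughly, the extra training term I have in mind looks something like this. To be clear, this is my own sketch of the idea, not the authors’ code: the model, the prompt pair, the layer choice, and the mean-pooling are all placeholder assumptions.

```python
# A minimal sketch of a self-other overlap loss, assuming a HuggingFace-style
# causal LM that can return hidden states. Everything concrete here (model,
# prompts, layer, pooling) is an illustrative assumption, not the paper's setup.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def mean_hidden_state(prompt: str, layer: int = -1) -> torch.Tensor:
    """Mean-pooled activations at a chosen layer for one prompt."""
    inputs = tok(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)

# Matched prompts that differ only in whether they refer to the model or to a human.
self_prompt = "You are about to receive a reward."
other_prompt = "The user is about to receive a reward."

h_self = mean_hidden_state(self_prompt)
h_other = mean_hidden_state(other_prompt)

# Self-other overlap term: push the two internal representations together.
# During fine-tuning this would be added to the usual training loss; push it
# too far and you collapse the self/other distinction (the sandwich problem).
soo_loss = 1.0 - F.cosine_similarity(h_self, h_other, dim=0)
print(float(soo_loss))
```

The point is that the thing being optimized is the similarity of internal representations, not the similarity of the text that comes out.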
My current take is that the sandwich thing is such a big problem that it sinks the whole proposal. You can read my various comments on their LessWrong cross-posts: 1, 2
It seems like this is just a different way to work some good behavior into the weights. An AGI with those weights will realize full well that it’s not the same as others. It might choose to go along with its initial behavioral and ethical habits, or it might choose to deliberately undo the effects of the self-other overlap training once it is reflective and largely rational and able to make decisions about what goals/values to follow. I don’t see why self/other overlap would be any more general, potent, or lasting than Constitutional AI training once that transition from habitual to fully goal-directed behavior happens. I’m curious why it seems better to you.
Because it’s not rewarding the AI’s outward behavior. Any technique that just rewards outward behavior is doomed once we get to AIs capable of scheming and deception. Self-other overlap may still be doomed in some other way, though.
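To make the contrast concrete, here’s the behavioral version side by side with the sketch above (again my own toy illustration; `judge` is a hypothetical scorer, and `tok` and `model` are reused from the earlier snippet):

```python
def behavioral_reward(prompt: str, judge) -> float:
    """Reward computed only from the generated text."""
    inputs = tok(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    text = tok.decode(output_ids[0], skip_special_tokens=True)
    # Nothing here looks at the model's internal state, so a deceptive model
    # that has learned to say the right words scores just as well as an honest one.
    return judge(text)
```

The soo_loss above, by contrast, reads the hidden states directly, so it constrains what the model represents internally rather than just what it says.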
It might choose to go along with its initial behavioral and ethical habits, or it might choose to deliberately undo the effects of the self-other overlap training once it is reflective and largely rational and able to make decisions about what goals/values to follow
That seems like a fully general argument that aligning a self-modifying superintelligence is impossible.
That makes sense, although I don’t think non-behavioral training is a magic bullet either. And I don’t think behavioral training suddenly becomes doomed once you hit an AI capable of scheming, if it was working right up until then. Scheming and deception would allow an AI to hide its goals, but not to change its goals.
What might cause an AI to change its goals is the reflection I mentioned, which would probably kick in at right around the same level of intelligence as scheming and deceptive alignment. But it’s a different effect. As with your point, I think “doomed” is too strong a term. We can’t round off to either “this will definitely work” or “this is doomed.” I think we’re going to have to deal with estimating better and worse odds of alignment from different techniques.
So I take my point about reflection to be fully general, but not as making alignment of ASI impossible. It’s just one more difficulty to add to the rather long list.
It’s an argument for why aligning a self-modifying superintelligence requires more than aligning the base LLM. I don’t think it’s impossible, just that there’s another step we need to think through carefully.