I don't think that better pretraining filtering is useful for mitigating emergent misalignment.
I just read a story about a judge using ChatGPT to (help) decide whether particular language was racially charged. How good is it going to be at that sort of thing if all the racially charged uses of this or that language have been filtered?
More generally, I don’t think the kind of “alignment” that you can potentially address with that kind of filtering is important. If you make it impossible to elicit naughty words from something, or even if you manage to make it totally incapable of thinking about some subject, that doesn’t mean you’ve “aligned” it in any useful way. You’ve made it stupider, not more moral.
As for emergence: if you keep playing whack-a-mole, removing everything you identify as potentially useful for priming output that could be intentionally misused, you seem to be setting yourself up for really unpredictable, truly emergent behavior, as opposed to predictable repetition of patterns the model has already seen.
… and porn specifically seems to be way, way, way, way, way down any reasonable list of what it’d be important to keep a model from mimicking anyway. I don’t think I’d even put it on any such list at all.