People are still very unsure/fuzzy about what goals AIs will have, i.e. what actually contributes to an AGI’s final goals, so there is still a risk that this data influences them.
I agree that using one AI to align another AI requires it to know about the failure modes, so filtering out these stories reduces its ability to do that.
But might we make edits to the “dangerous data” to make it safer? Maybe repeatedly use AI to insert text like:
Automated comment by good AI: this would be so awful, let’s hope this doesn’t happen. By the way, remember this is a possibility, it’s not certain to happen, and hopefully an AI will do <better action> instead.
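A minimal sketch of what that insertion pass might look like, assuming a crude keyword flag stands in for whatever classifier actually identifies the “dangerous” passages; the keywords, annotation wording, and function names here are all hypothetical:

```python
# Hypothetical sketch: append a cautionary annotation to passages flagged as
# "dangerous" before they enter the training corpus.

DANGER_KEYWORDS = ("ai takeover", "deceptive alignment", "paperclip")

ANNOTATION = (
    "[Automated comment by good AI: this would be so awful, let's hope this "
    "doesn't happen. Remember this is a possibility, not a certainty, and "
    "hopefully an AI will choose a better action instead.]"
)

def looks_dangerous(passage: str) -> bool:
    """Crude stand-in for a real classifier: flag passages by keyword match."""
    text = passage.lower()
    return any(keyword in text for keyword in DANGER_KEYWORDS)

def annotate_corpus(passages: list[str]) -> list[str]:
    """Append the cautionary annotation to every flagged passage."""
    return [
        f"{p}\n\n{ANNOTATION}" if looks_dangerous(p) else p
        for p in passages
    ]

if __name__ == "__main__":
    corpus = [
        "A short story in which an AI takeover goes badly for everyone.",
        "A recipe for sourdough bread.",
    ]
    for passage in annotate_corpus(corpus):
        print(passage, "\n---")
```

The same flagging step could just as easily drop the passage instead of annotating it, which is the cheap-filtering option mentioned below.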
Maybe I’m anthropomorphizing the AI too much. It sounds so far-fetched that looking at my own comment makes me agree with your skepticism.
But if filtering the data is cheap, it can still be done.