I think a lot of people discussing AI risk have long worried that their own writings might end up in an AI’s training data and influence it for the worse. Few would have expected that influence to roughly double the bad behaviour.
It seems to take a lot of data to produce the effect, but then again there is a lot of data on the internet discussing how AIs are expected to misbehave.
PS: I’m not suggesting we stop trying to illustrate AI risk. peterbarnett’s idea of filtering such content out of the training data is the right approach.