Would a reasonable way to summarize this be that if you train on pretend reward hacking you get emergent misalignment that takes the form of pretending (playacting) misbehaving and being evil, whereas if you here train on realistic reward hacking examples it starts realistically (and in some ways strategically) misbehaving and doing other forms of essentially reward hacking instead?
Zvi
Karma: 54,454
New Statement Calls For Not Building Superintelligence For Now
AI #139: The Overreach Machines
On Dwarkesh Patel’s Podcast With Andrej Karpathy
Bubble, Bubble, Toil and Trouble
AI #138 Part 2: Watch Out For Documents
AI #138 Part 1: The People Demand Erotic Sycophants
Monthly Roundup #35: October 2025
Trade Escalation, Supply Chain Vulnerabilities and Rare Earth Metals
OpenAI #15: More on OpenAI’s Paranoid Lawfare Against Advocates of SB 53
2025 State of AI Report and Predictions
Yes.
Knowing that, hopefully you wouldn’t?
Oh, of course, how silly of me!
AI #137: An OpenAI App For That
NEPA, Permitting and Energy Roundup #2
Bending The Curve
Medical Roundup #5
I was not aware of this at the time.
I wouldn’t obviously even put AMD on the list given that they’re up on rather big single stock news, but yes, good note, there is that.