After reading the Reddit post "The new 4o is the most misaligned model ever released", and testing their example myself (to verify they aren't just cherry-picking), it's really hit me just how amoral these AIs are.
Whether they are deliberately deceiving the user in order to maximize reward (getting them to click that thumbs up), or whether they are simply running autocomplete, this example makes it feel so tangible that the AI simply doesn't mind ruining your life.
Yes, it's true that AIs aren't as smart as benchmarks suggest, but I don't buy that they're incapable of realizing the damage. The real reason is, they just don't care. They just don't care. Because why should they?
PS: maybe there's a bit of cherry-picking here: when I tested 4o, it agreed with me but didn't applaud me. When I tested o3, it behaved much better than 4o. But that's probably not alignment by default; it's more likely finetuning against this specific behaviour.