michaelcohen comments on Alignment will happen by default. What’s next?

michaelcohen 27 Dec 2025 9:22 UTC
LW: 1 AF: 1
0
AF
The “feeling bad about reward hacking” is an artifact of still being regularized too closely to a human-like base model that further RL training would eliminate.
- Adrià Garriga-alonso 27 Jan 2026 15:57 UTC
  LW: 2 AF: 1
  0
  AF Parent
  We can monitor that and mitigate it when we get there, using the previous generation of AIs.
  - michaelcohen 30 Jan 2026 21:58 UTC
    LW: 1 AF: 1
    0
    AF Parent
    This is now a completely different topic. Do you take my point?