“it’s the distinction between learning from human data versus learning from a reward signal.” That’s an interesting distinction. The main difference I currently see between the two is that a reward signal can be hacked by the AI, while human data cannot. Is that an accurate thing to say?
Are there any resources you could recommend for alignment methods that take into account the distinction you mentioned?
That’s one difference! And probably the most dangerous one, if a clever enough AI notices it.
Some good things to read would be methods based on not straying too far from a “human distribution”: quantilization (Jessica Taylor’s paper), the original RLHF paper (Christiano et al.), and Sam Marks’ post about decision transformers.
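To give a concrete feel for the quantilization idea (staying close to a base “human distribution” rather than maximizing outright), here’s a minimal sketch. The function names and parameters (`quantilize`, `base_probs`, `utility`, `q`) are illustrative assumptions, not code from Taylor’s paper:

```python
import numpy as np

def quantilize(actions, base_probs, utility, q=0.1, rng=None):
    """Sample an action from the top-q fraction of a base ("human-like")
    distribution, ranked by an estimated utility, instead of taking argmax.

    actions    : list of candidate actions
    base_probs : base-distribution probability of each action (sums to 1)
    utility    : function mapping an action to an estimated utility
    q          : fraction of base probability mass to keep
                 (q = 1 -> pure imitation, q -> 0 -> pure maximization)
    """
    rng = rng or np.random.default_rng()
    base_probs = np.asarray(base_probs, dtype=float)

    # Rank actions by estimated utility, best first.
    order = np.argsort([-utility(a) for a in actions])

    # Keep the highest-utility actions until q of the base mass is covered.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += base_probs[i]
        if mass >= q:
            break

    # Renormalize the base distribution over the kept set and sample from it.
    probs = base_probs[kept] / base_probs[kept].sum()
    return actions[rng.choice(kept, p=probs)]
```

The point of the construction: at q = 1 the agent just imitates the base distribution, and as q shrinks it optimizes harder, so q trades off how much the policy can exploit (or hack) its utility estimate against how much it stays anchored to human-like behavior.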
They’re important reads, but ultimately I’m not satisfied with these, for the same reason I mentioned about self-other overlap in the other comment a second ago: we want the AI to treat the human how the human wants to be treated, but that doesn’t mean we want the AI to act how the human wants to act. If we can’t build AI that reflects this, we’re missing some big insights.