The problem with this neat picture is reward-hacking. This process wouldn’t optimize for better performance on fuzzy tasks; it would optimize for performance on fuzzy tasks that looks better to the underlying model. And much like RLHF doesn’t scale to superintelligence, this doesn’t scale to superhuman fuzzy-task performance.
It can improve the performance a bit. But once you ramp up the optimization pressure, “better performance” and “looks like better performance” would decouple from each other and the model would train itself into idiosyncratic uselessness. (Indeed: if it were this easy, doesn’t this mean you should be able to self-modify into a master tactician or martial artist by running some simulated scenarios in your mind, improving without bound, and without any need to contact reality?)
… Or so my intuition goes. It’s possible that this totally works for some dumb reason. But I don’t think so. RL has a long-standing history of problems with reward-hacking, and LLMs’ judgement is one of the most easily hackable things out there.
(Note that I’m not arguing that recursive self-improvement is impossible in general. But RLAIF, specifically, just doesn’t look like the way.)
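To make the decoupling concrete, here’s a toy sketch of the failure mode I have in mind (purely illustrative: the feature split, the judge’s weighting, and all the numbers are made up). A fixed effort budget is spread across 20 features; true quality only depends on the first 10, but the judge also credits 10 spurious features it can’t tell apart from real quality. Hill-climbing on the judge’s score steadily shifts effort into the spurious features:

```python
# Toy Goodhart demo, not a real RLAIF setup: hill-climb on a flawed judge.
import random

random.seed(0)
N, GENUINE = 20, 10  # 10 genuine features + 10 spurious ones

def true_quality(x):
    return sum(x[:GENUINE])

def judged_quality(x):
    # The judge credits the genuine features correctly, but also
    # over-rewards spurious features that do nothing for true quality.
    return sum(x[:GENUINE]) + 3.0 * sum(x[GENUINE:])

def normalize(x, budget=10.0):
    # Fixed effort budget: putting more into one feature means less for others.
    x = [max(xi, 0.0) for xi in x]
    total = sum(x) or 1.0
    return [budget * xi / total for xi in x]

x = normalize([1.0] * N)
for step in range(1, 3001):
    candidate = normalize([xi + random.gauss(0, 0.05) for xi in x])
    if judged_quality(candidate) > judged_quality(x):  # the only signal the loop sees
        x = candidate
    if step in (1, 300, 1000, 3000):
        print(f"step {step:4d}  judged={judged_quality(x):6.2f}  true={true_quality(x):6.2f}")
```

The judge’s score keeps climbing toward its ceiling while true quality collapses, and the loop has no way to notice, because the judge is the only source of signal.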
Yeah, it’s possible that CoT training unlocks reward hacking in a way that wasn’t previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI’s “deliberative alignment” but more general).
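Extending the toy sketch above (same caveats; the audit schedule and decay factor are invented), “training the reward function online” can be modeled as occasionally checking the judge against a little ground truth and shrinking its over-weighting of the spurious features. Once the judge is roughly calibrated, climbing its score starts tracking true quality again:

```python
# Same toy as above, but the judge is periodically recalibrated against
# a bit of ground truth, modeled as decaying its spurious-feature weight.
import random

random.seed(0)
N, GENUINE = 20, 10
spurious_weight = 3.0  # the judge's initial blind spot

def true_quality(x):
    return sum(x[:GENUINE])

def judged_quality(x):
    return sum(x[:GENUINE]) + spurious_weight * sum(x[GENUINE:])

def normalize(x, budget=10.0):
    x = [max(xi, 0.0) for xi in x]
    total = sum(x) or 1.0
    return [budget * xi / total for xi in x]

x = normalize([1.0] * N)
for step in range(1, 3001):
    candidate = normalize([xi + random.gauss(0, 0.05) for xi in x])
    if judged_quality(candidate) > judged_quality(x):
        x = candidate
    if step % 200 == 0:          # occasional ground-truth audit of the judge
        spurious_weight *= 0.8   # judge partially unlearns the spurious features
    if step in (1, 1000, 2000, 3000):
        print(f"step {step:4d}  judged={judged_quality(x):6.2f}  true={true_quality(x):6.2f}")
```

Whether a real judge can actually be kept calibrated this way is, of course, exactly the part in dispute.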
I think a better analogy than martial arts would be writing. I don’t have a lot of experience with writing fiction, so I wouldn’t be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I think I’d be a lot better at it by the end, even if I never showed it to anyone else to critique. Generally, evaluation is easier than generation.
Martial arts is different because it involves putting your body into OOD situations that you’re probably pretty poor at evaluating, whereas “looking at a page of fiction” is a situation both I and LLMs are much more familiar with.
Well… One problem here is that a model could be superhuman at:
- thinking speed
- math
- programming
- flight simulators
- self-replication
- cyberattacks
- strategy games
- acquiring and regurgitating relevant information from science articles
And be merely high-human-level at:
- persuasion
- deception
- real-world strategic planning
- manipulating robotic actuators
- developing weapons (e.g. bioweapons)
- wetlab work
- research
- acquiring resources
- avoiding government detection of its illicit activities
Such an entity as described could absolutely be an existential threat to humanity. It doesn’t need to be superhuman at literally everything to be superhuman enough that we don’t stand a chance if it decides to kill us.
So I feel like “RL may not work for everything, and will almost certainly work substantially better for easy-to-verify subjects” is… not so reassuring.
> Such an entity as described could absolutely be an existential threat to humanity.
I agree. I think you don’t even need most of the stuff on the “superhuman” list; the equivalent of a competent IQ-130 human upload probably does it, as long as it has the speed + self-copying advantages.