Because in brain-like AGI, the reward function is written in Python (or whatever), not in natural language.
Yup. I’d bet some people will reply with something like “why not define the reward function in natural language, like constitutional AI?” I think this fails because strong optimization finds the most convenient (for it, not us) settings of the free parameters left by fuzzy statistical things like words, and if you give it a chance to feed back into the definitions (via training data, online learning, etc.), it gets totally wrecked by semantic drift.
And don’t you think 500 lines of Python also “fails due to” having unintended optima?
I’ve put “fails due to” in scare quotes because what’s failing is not every possible approach, merely almost all samples from the approaches we currently know how to take. If we knew how to select Python code much more cleverly, suddenly it wouldn’t fail anymore. Ditto if we knew how to better construct reward functions from big AI systems plus small amounts of human text or human feedback.
Oh no, almost all possible 500 lines of Python are also bad.
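A toy sketch of the point (my own illustration, not from the thread — the reward function and strings are invented): even a one-line hand-coded reward leaves free parameters that strong optimization exploits. Here the reward is meant to discourage repetitive answers by scoring the fraction of unique words, but its unintended optimum is unique-token gibberish:

```python
# Hypothetical hand-written reward: intended to penalize repetitive answers
# by rewarding lexical diversity (unique words / total words).
def reward(answer: str) -> float:
    words = answer.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

honest = "the cat sat on the mat"   # repeats "the", so it scores below 1.0
gamed = "zq xv jk wp mn bt"         # all-unique gibberish hits the maximum

# A strong optimizer over answers finds the gibberish, not the honest one.
assert reward(gamed) > reward(honest)
```

The same failure mode scales: a 500-line version has more clauses, and therefore more free parameters for the optimizer to set conveniently for itself.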