These RL reward functions are written in code, not in natural language.
Often, though, they involve using LLMs or humans to make fuzzy judgment calls, e.g. about what is or isn't an obedient response to an instruction.
My discussion in §2.4.1 is about making fuzzy judgment calls using trained classifiers, which is not exactly the same as making fuzzy judgment calls using LLMs or humans, but I think everything I wrote still applies.
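To make the setup concrete, here is a minimal sketch of what such a reward function might look like: the reward function itself is ordinary code, and the fuzzy judgment call is isolated in a single judge call. The function names (`judge_obedience`, `reward`) and the heuristic inside the judge are hypothetical stand-ins; in a real system the judge would query an LLM, a trained classifier, or a human rater.

```python
def judge_obedience(instruction: str, response: str) -> float:
    """Hypothetical fuzzy judge: score in [0, 1] for how obediently
    `response` follows `instruction`. A real system would query an LLM,
    a trained classifier, or a human rater here; this toy heuristic
    just checks for empty responses and explicit refusals."""
    refusals = ("i can't", "i won't", "i refuse")
    if not response.strip():
        return 0.0
    return 0.0 if response.lower().startswith(refusals) else 1.0

def reward(instruction: str, response: str) -> float:
    """The reward function is plain code; the fuzzy part is delegated
    to the judge, whatever implements it."""
    return judge_obedience(instruction, response)
```

The point of the sketch is the factoring, not the heuristic: the code that defines the reward is crisp and auditable, while the fuzzy judgment (made by an LLM, a classifier, or a human) is confined to one well-marked call site.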