FeepingCreature comments on Greedy-Advantage-Aware RLHF

FeepingCreature 28 Dec 2024 0:46 UTC
6 points
Human: “Look, can’t you just be normal about this?”

GAA-optimized agent: “Actually-”

Hm, I guess this wouldn’t work if the agent still learns an internalized RL methodology? Or would it? Say we have a base model: not much need for GAA there, because it’s just doing token prediction. Then we go into some sort of (distilled?) RL-based CoT instruct tuning. GAA means it picks up abnormal rewards from the signal more slowly, i.e. it doesn’t do the classic boat-spinning-in-circles thing (good test?). But if it internalizes RL at some point, its mesaoptimizer wouldn’t be so limited, and since that’s a general technique, GAA wouldn’t prevent it? Still, seems like a good first line of defense.
- sej2020 30 Dec 2024 19:11 UTC
  6 points
  My thinking is not very clear on this point, but I am generally pessimistic that any type of RL/optimization regime with an adversarial nature could be robust to self-aware agents. To me, it seems like adversarial methodologies could spawn opposing mesaoptimizers, and we would be at the mercy of whichever subsystem represented its optimization process well enough to squash the other.