My point is that it seems like the Gaussian assumption is obviously wrong given any actual example of a real task, like standing up without falling and breaking a hip or hitting our heads & dying (both of which are quite common in the elderly; e.g. my grandfather and my grandmother, respectively). And that the analysis is obviously wrong given any actual example of a real environment more complicated than a bandit. (And I think this is part of what Wentworth is getting at when he says it’s about when you “move out of a regime”. The fact that the error inside the ‘regime’ is, if you squint, maybe not so bad in some way doesn’t help much when the regime is ultra-narrow and you or I could, ahem, fall out of the regime within a second of actions.) So my reaction is that if that is the expected regret in this scenario, which seems to be just about the best possible scenario, with the tamest errors and the fewest RL aspects like having multiple steps, then you are showing that Goodhart’s Curse really is that bad, and I’m confused why you seem to think it’s great news.
Likely it’s not great news; like I said in the original post, I am not sure how to interpret what I had noticed. But right now, after the reflection that’s come from people’s comments, I don’t think it’s bad news, and it might (maybe) even be slightly good news.[1]
Could you explain to me where the single step / multiple steps aspect comes in? I don’t see an assumption of only a single step anywhere, but maybe this comes from a lack of understanding.
Instead of a world where we need our proxy to be close to 100% error-free (this seems totally unrealistic), we just need the error to have ~ no tails (this might also be totally unrealistic but is a weaker requirement).
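To make that “~no tails” claim concrete, here is a minimal toy simulation (my own illustrative setup, not anything from the post: the distributions, sample sizes, and the pick-the-argmax-of-the-proxy framing are all assumptions for the example). It optimizes a proxy U = V + error over many candidates and asks how much true value V the proxy-optimal pick actually has, under light-tailed vs heavy-tailed error.

```python
# Minimal toy sketch (illustrative assumptions only, not from the post):
# pick the candidate that maximizes the proxy U = V + error, then check how
# much *true* value V that pick has, for light-tailed vs heavy-tailed error.
import numpy as np

rng = np.random.default_rng(0)
n_candidates = 100_000   # how hard we optimize the proxy
n_trials = 200           # averaged over repeated draws

def true_value_of_proxy_argmax(error_sampler):
    """Mean true value V of the candidate that maximizes the proxy U = V + error."""
    picks = []
    for _ in range(n_trials):
        v = rng.normal(size=n_candidates)      # true value
        u = v + error_sampler(n_candidates)    # proxy = true value + error
        picks.append(v[np.argmax(u)])          # V of the proxy-optimal candidate
    return float(np.mean(picks))

light_tailed = lambda n: rng.normal(size=n)            # Gaussian error: ~no tails
heavy_tailed = lambda n: rng.standard_t(df=2, size=n)  # Student-t error: heavy tails

print("light-tailed error:", true_value_of_proxy_argmax(light_tailed))
print("heavy-tailed error:", true_value_of_proxy_argmax(heavy_tailed))
# With light-tailed error the proxy-optimal pick still has substantially
# positive true value; with heavy-tailed error the argmax is usually just a
# huge error draw, and the true value collapses back toward the mean.
```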
> Could you explain to me where the single step / multiple steps aspect comes in? I don’t see an assumption of only a single step anywhere, but maybe this comes from a lack of understanding.
Maybe you could explain why you think it covers multiple steps? Like, take my falling example. Falling is the outcome of many successive decisions taken by a humanoid body over a few hundred milliseconds. Each decision builds on the previous one, and is constrained by it: you start in a good position and make a poor decision about some of your joints (like pivoting a little too quickly), then you are in a less-good position and you don’t make a good enough decision to get yourself out of trouble, then you are in an even less good position, and a bunch of decisions later you are lying on the ground suddenly dying, when literally a second ago you were perfectly healthy and might have lived decades more. This is why most regret bounds include a T term in them, which covers the sequential decision-making aspect of RL and how errors can compound: a small deviation from the optimal policy at the start can snowball into arbitrarily large regret over a long enough T.
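As a toy illustration of that compounding (my own sketch, not a formal regret bound: the per-step slip probability, the absorbing “fallen” state, and the reward scale are made-up assumptions), a per-step error rate that looks negligible for any single decision still yields regret that grows roughly linearly in the horizon T, because a single early slip forfeits everything after it.

```python
# Toy sketch (made-up numbers, not a formal regret bound): a small per-step
# error rate compounds over T sequential decisions. One slip puts the agent
# in an absorbing failure state ("fallen"), so regret vs. the optimal policy
# grows with the horizon even though each individual decision is almost fine.
import numpy as np

rng = np.random.default_rng(0)

def expected_regret(per_step_error, T, reward_per_step=1.0, n_episodes=10_000):
    """Mean regret over episodes of length T with an absorbing failure state."""
    regrets = []
    for _ in range(n_episodes):
        slips = rng.random(T) < per_step_error   # did the policy slip at step t?
        if slips.any():
            t_fail = int(np.argmax(slips))       # first slip ends the episode
            regrets.append((T - t_fail) * reward_per_step)
        else:
            regrets.append(0.0)
    return float(np.mean(regrets))

for T in (10, 100, 1000):
    print(T, expected_regret(per_step_error=0.01, T=T))
# A 1% per-step error looks harmless, but the expected regret approaches
# roughly T - 1/0.01, i.e. it grows about linearly in T once the horizon is long.
```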