Likely it’s not great news; as I said in the original post, I’m not sure how to interpret what I noticed. But right now, after the reflection prompted by people’s comments, I don’t think it’s bad news, and it might (maybe) even be slightly good news.[1]
Could you explain to me where the single-step / multiple-step aspect comes in? I don’t see an assumption of only a single step anywhere, but maybe that’s just a gap in my understanding.
Instead of a world where we need our proxy to be close to 100% error-free (which seems totally unrealistic), we would just need the proxy’s error to have ~no tails, i.e., to be light-tailed (which might also be totally unrealistic, but is a strictly weaker requirement).
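To illustrate why the tails matter more than the typical error size, here is a minimal sketch of my own (the distributions and sample sizes are arbitrary illustrative choices, not anything from the post): model the proxy as true value plus error, optimize the proxy hard by picking the best of many candidates, and compare the true value you actually end up with when the error is light-tailed versus heavy-tailed.

```python
# Minimal sketch: proxy = true_value + error, then optimize the proxy by
# taking the argmax over many candidates and check the true value obtained.
import numpy as np

rng = np.random.default_rng(0)
n_candidates = 100_000  # heavy optimization pressure: pick best of many

true_value = rng.normal(0.0, 1.0, n_candidates)

for name, error in [
    ("light-tailed (Gaussian)", rng.normal(0.0, 1.0, n_candidates)),
    ("heavy-tailed (Student-t, df=2)", rng.standard_t(2, n_candidates)),
]:
    proxy = true_value + error
    best = np.argmax(proxy)  # optimize the proxy as hard as we can
    print(f"{name}: proxy={proxy[best]:.2f}, "
          f"true value obtained={true_value[best]:.2f}")

# Typical outcome: with Gaussian error, the proxy-argmax still has high
# true value. With heavy-tailed error, the winning proxy score is mostly
# error, and the true value we end up with is near average (or worse).
```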
Maybe you could explain why you think it covers multiple steps? Take my falling example. Falling is the outcome of many successive decisions made by a humanoid body over a few hundred milliseconds. Each decision builds on the previous one and is constrained by it: you start in a good position and make a poor decision about some of your joints (like pivoting a little too quickly); then you are in a less-good position and don’t make a good enough decision to get yourself out of trouble; then you are in an even worse position; and a bunch of decisions later, you are lying on the ground, suddenly dying, when literally a second ago you were perfectly healthy and might have lived decades more. This is why most regret bounds in RL include a T (horizon) term: it captures the sequential nature of the decision-making and how errors compound. A small deviation from the optimal policy at the start can snowball into arbitrarily large regret over a sufficiently long T.
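To make the snowballing concrete, here is a minimal simulation of my own (all numbers are arbitrary illustrative choices, not from any regret-bound paper): the per-step mistake rate is fixed, but each mistake worsens the state, and a worse state makes the next mistake more likely, the “pivot too quickly, then fail to recover” dynamic above. The probability of the catastrophic outcome then grows with the horizon T.

```python
# Minimal sketch of compounding errors over a horizon T.
import random

def rollout(T, base_error=0.05, fall_threshold=4):
    """Return True if the agent 'falls' within T steps."""
    deviation = 0  # 0 = good position
    for _ in range(T):
        # The further from a good position, the harder it is to decide well.
        p_mistake = min(1.0, base_error * (1 + deviation))
        if random.random() < p_mistake:
            deviation += 1   # a poor decision constrains the next one
        elif deviation > 0:
            deviation -= 1   # a good decision partially recovers
        if deviation >= fall_threshold:
            return True      # past the point of no return: the fall
    return False

random.seed(0)
for T in (10, 100, 1000, 10000):
    falls = sum(rollout(T) for _ in range(2000))
    print(f"T={T:>5}: fall probability ~ {falls / 2000:.3f}")

# The per-step error rate never changes, yet the probability of the
# catastrophic outcome grows with the horizon T -- the reason regret
# bounds carry a T (or horizon) factor rather than being per-step.
```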