Could you explain to me where the single step / multiple steps aspect comes in? I don’t see an assumption of only a single step anywhere, but maybe this comes from a lack of understanding.
Maybe you could explain why you think it covers multiple steps? Take my falling example. Falling is the outcome of many successive decisions taken by a humanoid body over a few hundred milliseconds. Each decision builds on the previous one, and is constrained by it: you start in a good position and you make a poor decision about some of your joints (like pivoting a little too quickly), then you are in a less-good position and you don't make a good enough decision to get yourself out of trouble, then you are in an even less good position, and a bunch of decisions later, you are lying on the ground suddenly dying when literally a second ago you were perfectly healthy and might have lived decades more. This is why most regret bounds include a T term in them, which covers the sequential decision-making aspect of RL and how errors can compound: a small deviation from the optimal policy at the start can snowball into arbitrarily large regrets over a sufficiently long horizon T.
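The snowballing can be sketched with a toy simulation (my own illustration, not from any particular paper): a 1-D "balance" task with unstable dynamics, where the optimal policy keeps a tilt at zero but our policy leaks a small per-step error `eps`. The dynamics then amplify the accumulated deviation, so regret grows far faster than linearly in T:

```python
def regret(T: int, eps: float, gain: float = 1.05) -> float:
    """Cumulative regret of a slightly-wrong policy on an unstable toy system.

    The state (tilt) follows tilt' = gain * tilt + eps: the policy makes a
    small error eps every step, and because gain > 1 the system (like a
    falling body) amplifies whatever deviation has already built up.
    The optimal policy holds tilt at 0 and pays zero cost, so the per-step
    cost |tilt| is exactly the per-step regret.
    """
    tilt = 0.0
    total = 0.0
    for _ in range(T):
        tilt = gain * tilt + eps  # error compounds through the dynamics
        total += abs(tilt)        # optimal policy would pay 0 here
    return total
```

With `eps = 0.01`, going from T = 10 to T = 100 multiplies the regret by several hundred, not by 10: the early small deviations are exactly what puts the system in states where later decisions can no longer recover, which is the sequential structure the T term in regret bounds is accounting for.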