It makes sense that if there are only 100 positions, there are also only 100 distinct velocities: the world wraps around, so moving 100 units forward is identical to moving 0 units forward.
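To make that concrete, here's a minimal sketch of the kind of wrap-around dynamics I have in mind. The thrust-adds-to-velocity, velocity-adds-to-position update is my assumption about how "Newtonian motion in 100 discrete spots" would work, not rocketworld's actual code:

```python
N = 100  # number of discrete positions (and hence distinct velocities)

def step(position: int, velocity: int, thrust: int) -> tuple[int, int]:
    """One tick of discrete 'Newtonian' motion on a circle of N cells.

    Thrust changes velocity, then velocity changes position; because
    the world wraps around, both quantities only matter mod N.
    """
    velocity = (velocity + thrust) % N
    position = (position + velocity) % N
    return position, velocity

# A velocity of N is indistinguishable from a velocity of 0:
assert step(0, 0, N) == step(0, 0, 0)
```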
Currently rocketworld has: a known domain for the reward function, a small list of actions with simple consequences, simple future interaction histories when things go right, and a very simple reward function. I'm probably forgetting more simplicities. It would be interesting to try to relax some of these.
If the goal is value learning from just a few trajectories, there's probably not much point in making the agent work out the states and transitions of the MDP. But there might be some value in making them more complicated than Newtonian motion in 100 discrete spots. You might use reinforcement learning or [insert thing here] to let the agent more efficiently match complicated values and complicated action-consequences to optimal behavior, both in decision-making and in inference.
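As one illustration of the few-trajectories setting, here's a sketch of reward inference under a Boltzmann-rational demonstrator model. Everything in it is an assumption for illustration, not rocketworld's actual setup: the three-thrust action set, the one-hot "reward 1 at some target cell" reward family, and the choice of Boltzmann rationality itself. The idea is just to solve each candidate reward exactly with value iteration (cheap at 100 positions x 100 velocities) and score candidates by the log-likelihood of the observed trajectory:

```python
import numpy as np

N = 100                    # positions (and distinct velocities)
ACTIONS = (-1, 0, 1)       # hypothetical thrust choices
GAMMA, BETA = 0.95, 5.0    # discount; demonstrator rationality

# Precompute the deterministic transition table over all (p, v, a).
P = np.arange(N).reshape(N, 1, 1)
V = np.arange(N).reshape(1, N, 1)
A = np.array(ACTIONS).reshape(1, 1, -1)
V2 = (V + A) % N           # next velocity for each (v, a)
P2 = (P + V2) % N          # next position for each (p, v, a)
V2 = np.broadcast_to(V2, P2.shape)

def q_values(reward_pos: int, iters: int = 300) -> np.ndarray:
    """Value iteration for a reward of 1 whenever the rocket arrives
    at reward_pos; returns Q over (position, velocity, action)."""
    vals = np.zeros((N, N))
    for _ in range(iters):
        Q = (P2 == reward_pos) + GAMMA * vals[P2, V2]
        vals = Q.max(axis=2)
    return Q

def log_likelihood(trajectory, reward_pos: int) -> float:
    """Log-probability of observed (position, velocity, action) triples
    under a Boltzmann-rational demonstrator for this candidate reward."""
    Q = q_values(reward_pos)
    ll = 0.0
    for p, v, a in trajectory:
        logits = BETA * Q[p, v]
        ll += logits[ACTIONS.index(a)] - np.logaddexp.reduce(logits)
    return ll

# Pick the candidate reward cell the demonstration supports best:
# trajectory = [(p0, v0, a0), (p1, v1, a1), ...]   # observed demo
# best = max(range(N), key=lambda r: log_likelihood(trajectory, r))
```

Scoring every candidate this way means one exact MDP solve per candidate reward, which is only feasible because the world is so small; the point of moving beyond 100 discrete spots would be to replace the exact value iteration with something learned.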