The paper is frustratingly vague about what their context lengths are for the various experiments, but based on comparing figures 7 and 4, I would guess that the context length for Watermaze was 1-2 times as long as an episode (= 50 steps). (It does indeed look like they were embedding the 2D dark room observations into a 64-dimensional space, which is hilarious.)
I’m not sure I understand your second question. Are you asking about figure 4 in the paper (the same one I copied into this post)? There’s no reward conditioning going on. They’re also not really comparing like to like, since the AD and ED agents were trained on different data (RL learning trajectories vs. expert demonstrations).
As I mentioned in the post, my story about this is that the AD agents can get good performance with a simple heuristic: when the previous episode ends with reward 1, navigate to the position where that episode ended. (Remember, the goal position doesn’t change from episode to episode; these “tasks” are insanely narrow!) On the other hand, the ED agent probably just picks some goal position and repeatedly navigates there, never adjusting to the fact that it’s not getting reward.
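To make that story concrete, here is a toy sketch of the two behaviors. Everything in it is an assumption for illustration: the 9x9 grid, the 20-step episodes, the random exploration, and ending the episode as soon as the goal is reached are all made up, and none of it is the paper's environment or model. It only shows that "copy where the last rewarded episode ended" locks in success once the goal is found, while a fixed guess never adapts.

```python
import random

GRID = 9          # hypothetical 9x9 dark room
EPISODE_LEN = 20  # hypothetical episode length, chosen so any cell is reachable
START = (GRID // 2, GRID // 2)


def step_toward(pos, target):
    """Move one cell toward target (x first, then y); stay put once there."""
    x, y = pos
    tx, ty = target
    if x != tx:
        x += 1 if tx > x else -1
    elif y != ty:
        y += 1 if ty > y else -1
    return (x, y)


def random_step(pos, rng):
    """Take a random move (or stay), clipped to the grid."""
    dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)])
    x = min(max(pos[0] + dx, 0), GRID - 1)
    y = min(max(pos[1] + dy, 0), GRID - 1)
    return (x, y)


def run_episode(goal, target, rng):
    """Return (reward, final_pos). With target=None the agent explores randomly.

    Simplification: the episode ends as soon as the goal is reached.
    """
    pos = START
    for _ in range(EPISODE_LEN):
        pos = step_toward(pos, target) if target is not None else random_step(pos, rng)
        if pos == goal:
            return 1, pos
    return 0, pos


def copy_last_success(goal, n_episodes, rng):
    """AD-like heuristic: once an episode ends with reward 1, go back there."""
    remembered = None
    rewards = []
    for _ in range(n_episodes):
        reward, final_pos = run_episode(goal, remembered, rng)
        if reward == 1:
            remembered = final_pos
        rewards.append(reward)
    return rewards


def fixed_guess(goal, guess, n_episodes, rng):
    """ED-like behavior: always head to the same guessed position."""
    return [run_episode(goal, guess, rng)[0] for _ in range(n_episodes)]
```

Calling `copy_last_success((7, 2), 200, random.Random(0))` returns a list of per-episode rewards; after the first 1, every later entry is 1, because the remembered position is the (unchanging) goal.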
Yeah, I’m confused about all their results of the same type as fig 4 (fig 5, fig 6, etc.). But I think I’m figuring it out—they really are just taking the predicted action. They’re “learning” in the sense that the sequence model is simulating something that’s learning. So if I’ve got this right, the thousands of environment steps on the x axis just go in one end of the context window and out the other, and by the end the high-performing sequence model is just operating on the memory of 1-2 high-performing episodes.
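A minimal sketch of that "in one end and out the other" picture, assuming a context of two episodes (the 1-2 episode guess from this thread, not a number from the paper) and the 50-step episode length quoted above. Instead of real (observation, action, reward) tuples, the buffer stores only which episode each step belongs to.

```python
from collections import deque

EPISODE_LEN = 50               # episode length quoted in this thread
CONTEXT_LEN = 2 * EPISODE_LEN  # assumption: context holds about 2 episodes


def episodes_in_context(total_steps, context_len=CONTEXT_LEN, episode_len=EPISODE_LEN):
    """Return the episode indices still visible after total_steps environment steps."""
    context = deque(maxlen=context_len)  # oldest steps fall out automatically
    for t in range(total_steps):
        context.append(t // episode_len)  # stand-in for one (obs, action, reward) step
    return sorted(set(context))
```

Under these assumptions, after thousands of steps the model conditions only on the most recent two or three episodes; everything earlier has already left the window.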
I guess this raises another question I had, which is—why is the sequence model so bad at pretending to be bad? If it’s supposed to be learning the distribution of the entire training trajectory, why is it so bad at mimicking an actual training trajectory? Maybe copying the previous run when it performed well is just such an easy heuristic that it skews the output? Or maybe performing well is lower-entropy than performing poorly, so lowering a “temperature” parameter at evaluation time will bias the sequence model towards successful trajectories?
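For the temperature hypothesis, a small sketch of the mechanism, with made-up logits (nothing here comes from the paper, and whether any temperature was used at evaluation time is not something the paper confirms). Dividing logits by a temperature below 1 shifts probability mass toward the most likely action and lowers the entropy of the action distribution.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature (numerically stabilized)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


# Hypothetical logits: action 0 is the "head to the remembered goal" move,
# the other four are exploratory moves the model finds somewhat less likely.
logits = [2.0, 1.0, 1.0, 1.0, 1.0]

p_default = softmax(logits, temperature=1.0)
p_cold = softmax(logits, temperature=0.5)
```

Here `p_cold[0]` is larger than `p_default[0]` and `entropy(p_cold)` is smaller than `entropy(p_default)`, so if the successful behavior is already the mode, sampling colder would favor it even more.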