Nice, now I can ask you some questions :D
It seemed like the watermaze environment was actually reasonably big, with an observation space of small color images and a policy that needs to navigate to a specific point in the maze from a random starting point. Is that right? Given context window limitations, wouldn’t this change how many transitions it expects to see, or how informative they are?
Ah, wait, I’m looking through the paper now. Did they embed all tasks into the same 64d space? That sounds about right for the watermaze and massive overkill for looking at a black room.
Anyhow, what’s the meaning of their graphs with sloped lines? The line labeled with their paper was on top of the other lines, but are they actually comparing like to like? That is, are they getting better performance out of the same total data? Does their transformer get an advantage by being pretrained on a bunch of other mazes with the same layout and goal but different starting location? But if so, how is it actually extracting an advantage—is it doing decision transformer style conditioning on reward?
The paper is frustratingly vague about what their context lengths are for the various experiments, but based on comparing figures 7 and 4, I would guess that the context length for Watermaze was 1-2 times the episode length (= 50 steps). (It does indeed look like they were embedding the 2d dark room observations into a 64-dimensional space, which is hilarious.)
I’m not sure I understand your second question. Are you asking about figure 4 in the paper (the same one I copied into this post)? There’s no reward conditioning going on. They’re also not really comparing like to like, since the AD and ED agents were trained on different data (RL learning trajectories vs. expert demonstrations).
Like I mentioned in the post, my story about this is that the AD agents can get good performance with a simple rule: when the previous episode ends with reward 1, navigate to the position that episode ended in. (Remember, the goal position doesn’t change from episode to episode; these “tasks” are insanely narrow!) On the other hand, the ED agent probably just picks some goal position and repeatedly navigates there, never adjusting to the fact that it’s not getting reward.
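To make that story concrete, here is a toy sketch of the two hypothesized behaviors. Everything in it is an assumption for illustration: the 3x3 grid, collapsing an episode into a single "pick a target cell" choice, the random exploration when unrewarded, and the names `ad_like`, `ed_like`, and `run`. It is not the paper's agents or environments, only the heuristic described above.

```python
import random

# Hypothetical 3x3 grid of candidate goal cells (not the paper's environment).
CELLS = [(x, y) for x in range(3) for y in range(3)]

def ad_like(last_target, last_reward, rng):
    """Assumed AD-style rule: repeat the last target if it was rewarded."""
    if last_reward == 1:
        return last_target
    return rng.choice(CELLS)  # assumed: explore at random otherwise

def ed_like(last_target, last_reward, rng):
    """Assumed ED-style rule: always go to one fixed guess, ignoring reward."""
    return CELLS[0]

def run(policy, goal, n_episodes, rng):
    """Return per-episode rewards on one task whose goal never moves."""
    rewards, last_target, last_reward = [], None, 0
    for _ in range(n_episodes):
        target = policy(last_target, last_reward, rng)
        last_reward = 1 if target == goal else 0
        last_target = target
        rewards.append(last_reward)
    return rewards

ad_rewards = run(ad_like, goal=(2, 2), n_episodes=200, rng=random.Random(0))
ed_rewards = run(ed_like, goal=(2, 2), n_episodes=200, rng=random.Random(0))
print(sum(ad_rewards), sum(ed_rewards))
```

Under these assumptions the AD-like rule locks onto the goal after its first success, while the ED-like rule only scores if its fixed guess happens to be the goal.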
Yeah, I’m confused about all their results of the same type as fig 4 (fig 5, fig 6, etc.). But I think I’m figuring it out—they really are just taking the predicted action. They’re “learning” in the sense that the sequence model is simulating something that’s learning. So if I’ve got this right, the thousands of environment steps on the x axis just go in one end of the context window and out the other, and by the end the high-performing sequence model is just operating on the memory of 1-2 high-performing episodes.
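A minimal sketch of that "in one end and out the other" picture, assuming a context of exactly two 50-step episodes (the upper end of the guess in this thread, not a number confirmed by the paper). The integers stand in for (observation, action, reward) tuples, and `rollout_context` is a hypothetical helper name.

```python
from collections import deque

EPISODE_LEN = 50               # episode length quoted in this thread
CONTEXT_LEN = 2 * EPISODE_LEN  # assumed context length, not from the paper

def rollout_context(total_steps, context_len=CONTEXT_LEN):
    """Push step indices through a fixed-size window; return what is left."""
    context = deque(maxlen=context_len)  # oldest entries fall out automatically
    for step in range(total_steps):
        context.append(step)  # stand-in for one environment transition
    return list(context)

window = rollout_context(5000)
episodes_visible = sorted({step // EPISODE_LEN for step in window})
print(len(window), episodes_visible)  # 100 [98, 99]
```

After 5000 environment steps, only the last two episodes are still visible to the model in this sketch; everything earlier has already scrolled out of the window.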
I guess this raises another question I had, which is—why is the sequence model so bad at pretending to be bad? If it’s supposed to be learning the distribution of the entire training trajectory, why is it so bad at mimicking an actual training trajectory? Maybe copying the previous run when it performed well is just such an easy heuristic that it skews the output? Or maybe performing well is lower-entropy than performing poorly, so lowering a “temperature” parameter at evaluation time will bias the sequence model towards successful trajectories?
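The temperature guess can be checked on a toy distribution. This is only an illustration of the general effect, with made-up numbers: one "good" outcome with probability 0.2 and eight distinct "bad" outcomes with probability 0.1 each, so bad behavior is more likely overall but spread more thinly. Nothing here says the paper actually lowered the temperature at evaluation time.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a categorical distribution: p_i^(1/T), then renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy distribution: index 0 is the single "good" outcome.
probs = [0.2] + [0.1] * 8

def good_mass(temperature):
    """Probability assigned to the good outcome at a given temperature."""
    return apply_temperature(probs, temperature)[0]

for t in (1.0, 0.5, 0.25):
    print(t, round(good_mass(t), 3))  # 0.2, then 0.333, then 0.667
```

In this toy case the concentrated outcome goes from a minority of the mass to a majority as the temperature drops, which is the direction the lower-entropy hypothesis predicts.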