My next guess would be that they ran the PPO agent for many more episodes than the 31 shown, and trained the GLA on all of that data.
This was my read too. Unfortunately we don’t have access to the source code, but this is the assumption I made after seeing the graph on the left in Figure 3. Around 40 episodes in, their PPO agent is still struggling, but their Gap 8 GLA is near optimal. That Gap 8 GLA was necessarily trained on data from a PPO agent that ran for 8 times longer.
You’re right, I misread the graph.
I also concede that this claim is probably right for Figure 3.
I still don’t think this is true for Figure 5, but I’m less confident now, having realised how much my reading of the underspecified parts of this paper rested on assumptions carried over from their GPICL paper.