Really nice paper: This strikes me as a great idea. Both using it to replace regular RL, and using it to investigate the effects of an RL training environment in a legible form. For the latter, I’d suggest running it multiple times with different seeds, and seeing how convergent/divergent the results are.
Some improvements would be:
Can we make GEPA better, so it gets results closer to RL's?
Can we distill from an RL-trained set of weights to the before-training weights plus a prompt, so we can convert what the RL-trained model learned into a legible prompt? Can we then figure out how much hasn't transferred, and characterize that in some other way? With an SAE or an oracle, say?
The research directions in this and your other comment are indeed ones I'd want to try! I hadn't thought of characterizing the untransferred remainder with an SAE or oracle in particular; I'm not sure how it would work exactly, but it may be interesting.
Say, compute a mean activation diff between the RL version and the GEPA prompt-optimized version on relevant content, then apply an SAE and/or an activation oracle to the mean difference? Less legible than what you get in the prompted version, but a way of attempting to characterize what you're missing.
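A minimal sketch of what that diff-then-attribute step could look like, with random arrays standing in for real residual-stream activations and a hypothetical unit-norm SAE decoder (all names, shapes, and data here are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, n_tokens = 64, 256, 100

# Random stand-ins for residual-stream activations on the same prompts
# (in practice: collected from the RL-trained and the GEPA-prompted model).
acts_rl = rng.standard_normal((n_tokens, d_model))
acts_gepa = rng.standard_normal((n_tokens, d_model))

# Mean activation difference: what the RL weights add on average.
mean_diff = (acts_rl - acts_gepa).mean(axis=0)

# Hypothetical SAE decoder: one unit-norm direction per feature.
W_dec = rng.standard_normal((n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Attribute the diff to features by cosine similarity with each decoder
# direction, then read off the top features for human interpretation.
sims = W_dec @ (mean_diff / np.linalg.norm(mean_diff))
top = np.argsort(-sims)[:5]
for i in top:
    print(f"feature {i}: cosine {sims[i]:+.3f}")
```

The top-scoring features (or what an activation oracle says about `mean_diff` directly) would then be the candidate description of what RL learned that the prompt didn't capture.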
Even fancier might be to use a distillation loss to distill the diff between the RL version and the GEPA-prompted version into a LoRA, then use an SAE/activation oracle to analyze the vectors making up the LoRA.
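Here's a toy sketch of that distillation step, under the strong simplifying assumption that the behavioral diff acts as an unknown low-rank linear map on activations; the probe data, the plain squared distillation loss, and the LoRA factors `A` and `B` are all stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n = 32, 4, 512          # model dim, LoRA rank, probe samples
lr, steps = 0.1, 500

# Toy stand-in: pretend the diff between the RL model and the
# GEPA-prompted base acts as an unknown rank-r linear map on activations.
true_diff = 0.1 * rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
X = rng.standard_normal((n, d))   # probe activations
Y = X @ true_diff.T               # output diff we want the LoRA to match

# LoRA factors, standard init: A random, B zero, so B @ A starts at zero.
A = rng.standard_normal((r, d)) / np.sqrt(d)
B = np.zeros((d, r))

def distill_loss(A, B):
    return float(np.mean((X @ (B @ A).T - Y) ** 2))

loss0 = distill_loss(A, B)
for _ in range(steps):
    err = X @ (B @ A).T - Y       # (n, d) residual of the distillation loss
    gB = (2 / n) * err.T @ (X @ A.T)
    gA = (2 / n) * B.T @ err.T @ X
    B -= lr * gB
    A -= lr * gA
loss1 = distill_loss(A, B)
print(f"distillation loss: {loss0:.3f} -> {loss1:.3f}")

# The rows of A (equivalently, the columns of B) are the directions one
# would then hand to an SAE / activation oracle for interpretation.
```

In a real model the loss would of course be taken over logits or hidden states of the two networks rather than a synthetic linear target, but the end product is the same: a small set of low-rank directions to analyze.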
It might be possible to do this recursively: if this suggests that something important for alignment is missing from the prompt version, take a stab at adding it to the prompt, run more rounds of GEPA to improve it, and repeat. Or even feed this sort of analysis into some improved version of the prompt optimization as another data source, to automate this loop?
The really cool thing about this is that the training process here doesn't train the model to evade the interp: it trains it to use the interp accurately, to make the prompt optimization better. So the "most forbidden technique" becomes fine here.