hmm, i think there could be other structural things you could vary.
Maybe you shift around the order of the different stages of post-training.
Maybe you realise that randomising the data order just means you get roughly the same values each time (the mix washes out), so you purposefully over-sample certain types of environments or pre-training data earlier on. E.g. first pre-training heavily on math, then heavily on fiction, and reversing that for the next gen. Or first doing RL on lots of SWE envs, then on general computer-use envs, and reversing the order for the next gen.
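here's a rough sketch in python of what i mean by reordering the early curriculum per generation — the domain names, mixture weights, and phase counts are all made up, just to make the idea concrete, not anyone's actual training setup:

```python
# Minimal sketch of per-generation curriculum reordering.
# Domains, weights, and phase counts are illustrative assumptions.

DOMAINS = ["math", "fiction", "swe_envs", "computer_use_envs"]

def early_phase_weights(generation: int) -> dict[str, float]:
    """Over-sample one pair of domains early, and flip the emphasis
    for the next generation so early-training data differs across runs."""
    if generation % 2 == 0:
        return {"math": 0.4, "swe_envs": 0.4, "fiction": 0.1, "computer_use_envs": 0.1}
    return {"fiction": 0.4, "computer_use_envs": 0.4, "math": 0.1, "swe_envs": 0.1}

def late_phase_weights() -> dict[str, float]:
    """Later phases converge back to a uniform mix, so final coverage
    is similar even though the early curriculum differed."""
    return {d: 1.0 / len(DOMAINS) for d in DOMAINS}

def build_schedule(generation: int, n_early: int = 3, n_late: int = 3) -> list[dict[str, float]]:
    """Per-phase sampling weights for one training run."""
    return [early_phase_weights(generation)] * n_early + [late_phase_weights()] * n_late

if __name__ == "__main__":
    for gen in (0, 1):
        print(f"gen {gen} early mix:", build_schedule(gen)[0])
```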
You could potentially do experiments to determine what kinds of features of the data/envs influence the values of the AI, and then make sure those specific features differ in the early stages of fine-tuning for different training runs.
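and a rough sketch of that experiment loop — the candidate features, train_with_feature, and value_probe_score are all hypothetical placeholders (faked here with seeded randomness) standing in for a real training/eval pipeline, just to show the shape of the ablation:

```python
# Illustrative ablation loop: toggle one candidate data feature at a time,
# measure some proxy for the resulting values, and keep the features whose
# presence actually moves that proxy. All names here are stand-ins.
import random

CANDIDATE_FEATURES = ["sycophantic_raters", "long_horizon_tasks", "adversarial_users"]

def train_with_feature(feature: str, enabled: bool) -> str:
    # Placeholder for "fine-tune a model with this data feature on/off".
    return f"model[{feature}={'on' if enabled else 'off'}]"

def value_probe_score(model: str) -> float:
    # Placeholder for a behavioural probe of the model's values.
    return random.Random(model).random()

def find_value_relevant_features(threshold: float = 0.1) -> list[str]:
    # Features whose presence/absence noticeably shifts the probe are the
    # ones you'd deliberately vary early on in different training runs.
    relevant = []
    for feature in CANDIDATE_FEATURES:
        with_f = value_probe_score(train_with_feature(feature, enabled=True))
        without_f = value_probe_score(train_with_feature(feature, enabled=False))
        if abs(with_f - without_f) > threshold:
            relevant.append(feature)
    return relevant

if __name__ == "__main__":
    print(find_value_relevant_features())
```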