Let’s say future post-training involves training in 5 different RL environments one after the other, but the order doesn’t matter for capabilities. It’s possible that the AI picks up its goals from the first environment and then plays the training game from there.
Then intentionally mix up the order for subsequent generations.
You could do this today with pre-training (e.g. randomly shuffling the data order each time), and maybe it’s already happening.
(Tbc, I haven’t thought about whether this would work in practice, but my inside view is that it seems worth thinking more about.)
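A minimal sketch of what the per-generation shuffle could look like, with hypothetical environment names and a plain `random.Random` seeded per generation; the real pipeline's curriculum code would replace this:

```python
import random

# Hypothetical environment identifiers; stand-ins for whatever the real
# post-training pipeline actually uses.
RL_ENVIRONMENTS = ["math", "swe", "computer_use", "writing", "dialogue"]

def env_order_for_generation(generation: int, base_seed: int = 0) -> list[str]:
    """Return a generation-specific ordering of the RL environments.

    Capabilities are assumed to be order-invariant, so the only thing that
    changes between generations is which environment the model sees first.
    """
    order = list(RL_ENVIRONMENTS)
    random.Random(base_seed + generation).shuffle(order)
    return order

# The same trick works for pre-training: key the data-loader / shard shuffle
# on a per-run seed so each generation sees the data in a different order.
for gen in range(3):
    print(gen, env_order_for_generation(gen))
```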
I think this is basically a special case of changing the random seed, which probably already randomizes env order.
Hmm, I think there could be other structural things.
Maybe you shift around the different stages of post-training.
Maybe you realise that randomising still gets you the same values each time, so you purposefully over-sample certain types of environments or pre-training data earlier on. E.g. first pre-training heavily on math, then heavily on fiction; then reversing that for the next gen. Or first doing RL on lots of SWE envs, then on general computer-use envs; then reversing the order for the next gen.
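A rough sketch of the over-sample-then-reverse idea, assuming made-up environment categories and a simple two-phase weighted sampler; the phase order flips on alternate generations:

```python
import random

# Hypothetical environment categories and a two-phase curriculum.
# Phase order is reversed on alternate generations, so the "early" data
# that might anchor the model's values differs between generations.
PHASES = [
    {"swe": 0.7, "computer_use": 0.2, "dialogue": 0.1},   # SWE-heavy
    {"swe": 0.1, "computer_use": 0.7, "dialogue": 0.2},   # computer-use-heavy
]

def sample_env(weights: dict[str, float], rng: random.Random) -> str:
    """Pick one environment category according to the phase's weights."""
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

def curriculum_for_generation(generation: int, steps_per_phase: int = 5,
                              seed: int = 0) -> list[str]:
    """Return the env sequence for one generation, flipping phase order."""
    rng = random.Random(seed + generation)
    phases = PHASES if generation % 2 == 0 else list(reversed(PHASES))
    return [sample_env(w, rng) for w in phases for _ in range(steps_per_phase)]

print(curriculum_for_generation(0))  # SWE-heavy phase comes first
print(curriculum_for_generation(1))  # computer-use-heavy phase comes first
```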
You could potentially run experiments to determine which features of the data/envs influence the values of the AI, and then make sure that those specific features differ in the early stages of fine-tuning across different training runs.
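A sketch of how such an experiment might be organised, under heavy assumptions: `train_model` and `measure_values` are placeholders for whatever training pipeline and values evals you actually trust, and the feature grid is purely illustrative.

```python
from itertools import product

# Hypothetical early-stage features you suspect might anchor the model's values.
EARLY_FEATURES = {
    "domain": ["math", "fiction", "swe"],
    "reward_style": ["outcome", "process"],
}

def train_model(early_config: dict) -> object:
    """Placeholder: train a model whose *early* fine-tuning stage uses early_config."""
    raise NotImplementedError

def measure_values(model: object) -> dict[str, float]:
    """Placeholder: score the trained model on whatever values evals you have."""
    raise NotImplementedError

def run_ablation() -> list[tuple[dict, dict]]:
    """Train one model per combination of early-stage features and record values."""
    results = []
    for combo in product(*EARLY_FEATURES.values()):
        config = dict(zip(EARLY_FEATURES.keys(), combo))
        model = train_model(config)
        results.append((config, measure_values(model)))
    return results

# Features whose variation actually shifts the measured values are the ones
# worth deliberately varying across the early stages of different training runs.
```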