The hope was that this synthetic dataset of superhuman chess-bot games would provide higher-quality data than human games. Second, I grabbed 16 million games from Lichess’s public chess game database. I trained separate models on the individual datasets and on various mixes of the two.
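For concreteness, here's a minimal sketch of how such a dataset mix might be assembled. It assumes games are stored one PGN string per line; the file paths, mixing fraction, and sample size are all placeholders, not the actual values I used:

```python
import random


def load_games(path):
    """Read one PGN game string per line (assumed storage format)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def mix_datasets(stockfish_path, lichess_path,
                 stockfish_frac=0.5, n_games=1_000_000, seed=0):
    """Sample a training mix with a fixed fraction of Stockfish self-play games."""
    rng = random.Random(seed)
    stockfish_games = load_games(stockfish_path)
    lichess_games = load_games(lichess_path)
    n_sf = int(n_games * stockfish_frac)
    mix = (rng.sample(stockfish_games, n_sf)
           + rng.sample(lichess_games, n_games - n_sf))
    rng.shuffle(mix)  # interleave the two sources before training
    return mix
```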
Were the results basically the same in all cases? Or e.g. was the all-stockfish-data engine less robust to off-distribution inputs?
The all-Stockfish-data engine played at a level 100-200 Elo higher in my tests, with a couple of caveats. First, I benchmarked the LLMs against Stockfish, so an all-Stockfish dataset is presumably an advantage on that particular benchmark. Second, the Stockfish LLM probably had an edge in robustness because I included a small percentage of Stockfish vs. random-move-generator games in the Stockfish dataset, in the hopes that it would improve the model's handling of unusual positions.
Unfortunately, I haven’t done an in-depth qualitative assessment of their abilities, so I can’t give a more detailed answer.
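As a rough illustration of how those Stockfish vs. random-mover games can be generated, here's a sketch using the python-chess library; the Stockfish binary path and per-move time limit are placeholders, not the settings I actually used:

```python
import random

import chess
import chess.engine
import chess.pgn


def stockfish_vs_random_game(engine, limit, rng):
    """Play one game: Stockfish as White vs. a uniform-random mover as Black."""
    board = chess.Board()
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            board.push(engine.play(board, limit).move)  # Stockfish picks a move
        else:
            board.push(rng.choice(list(board.legal_moves)))  # random legal move
    return chess.pgn.Game.from_board(board)  # convert move history to PGN


rng = random.Random(0)
engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")  # placeholder path
try:
    limit = chess.engine.Limit(time=0.01)  # placeholder per-move time budget
    for _ in range(3):
        print(stockfish_vs_random_game(engine, limit, rng))
finally:
    engine.quit()
```

The idea behind including a small slice of these games is that the model sees positions no sensible opponent would ever reach, which may help when a benchmark opponent plays off-distribution moves.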