Nice post! This seems like a good application for autonomous research—a good metric, tight feedback loops, and there hasn’t been a ton of research effort directed at it yet.
Karpathy had an auto-research scaffold improve validation loss on LLMs, and from what I can see it mostly just tweaked hyperparameters, but LLM pretraining is a much better-studied area.
The Top-K annealing was used in the Llama Scope paper as well: https://arxiv.org/pdf/2410.20526
I didn’t know that Llama Scope also annealed the K, but it makes a lot of sense! It seems like a lot of the auto-research stuff will end up being a fancy hyperparameter sweep, but if it’s cheap to run and occasionally stumbles on something novel/useful, maybe that’s good enough.
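For anyone unfamiliar with the trick: K annealing just means starting a TopK SAE with a larger K and decaying it toward the target sparsity over training. A minimal sketch of a linear schedule, with illustrative names (`k_start`, `k_end`) that aren't from any particular codebase:

```python
def annealed_k(step: int, total_steps: int,
               k_start: int = 128, k_end: int = 32) -> int:
    """Linearly interpolate K from k_start down to k_end over training.

    Hypothetical sketch: starts dense (large K) so more latents get
    gradient signal early, then tightens to the target sparsity.
    """
    frac = min(step / total_steps, 1.0)
    return round(k_start + frac * (k_end - k_start))

# Inside the training loop, the SAE's TopK activation would then use
# the current K, e.g. (PyTorch-style, assumed):
#   k = annealed_k(step, total_steps)
#   topk_vals, topk_idx = latents.topk(k, dim=-1)
```

The schedule shape (linear vs. cosine, and the start/end values) is itself just another hyperparameter, which is sort of the point about auto-research sweeps.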