The more I learn about measurement, the less seriously I take it
I’m impressed with models that accomplish tasks in zero or one shot with minimal prompting skill. I’m not sure what galaxy brained scaffolds and galaxy brained prompts demonstrate. There’s so much optimization in the measurement space.
I shipped a benchmark recently, but it’s secretly a synthetic data play so regardless of how hard people try in order to score on it, we get synthetic data out of it which leads to finetune jobs which leads to domain specific models that can do such tasks hopefully with minimal prompting effort and no scaffolding.
The more I learn about measurement, the less seriously I take it
I’m impressed with models that accomplish tasks in zero or one shot with minimal prompting skill. I’m not sure what galaxy brained scaffolds and galaxy brained prompts demonstrate. There’s so much optimization in the measurement space.
I shipped a benchmark recently, but it’s secretly a synthetic data play so regardless of how hard people try in order to score on it, we get synthetic data out of it which leads to finetune jobs which leads to domain specific models that can do such tasks hopefully with minimal prompting effort and no scaffolding.