One way I think about how "deep" or "buried" a latent capability is: how little fine-tuning it takes to bring it to the surface, or how many work hours of black-box elicitation you have to put in to bump up performance. I'd guess the various ways of measuring this will tell you slightly different things, but my rough heuristic is something like: if it takes a huge, highly curated prompt or a very complicated fine-tuning setup, then surfacing it would require deliberate effort now, or it would only come out on its own maybe 2 or 3 models down the line.
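To make the fine-tuning side of the heuristic a bit more concrete, here's a minimal sketch (all names and numbers are made up for illustration, not taken from any paper) of operationalizing "depth" as the smallest fine-tuning set size at which performance on some eval crosses a threshold:

```python
from typing import Callable, Optional, Sequence

def elicitation_depth(
    finetune_and_eval: Callable[[int], float],  # hypothetical: fine-tune on n examples, return eval score
    sample_sizes: Sequence[int],
    threshold: float,
) -> Optional[int]:
    """Smallest fine-tuning set size at which the capability 'surfaces'
    (eval score crosses `threshold`), or None if it never does.
    Fewer samples needed = shallower / less buried capability."""
    for n in sorted(sample_sizes):
        if finetune_and_eval(n) >= threshold:
            return n
    return None

if __name__ == "__main__":
    # Made-up score curve standing in for real fine-tuning runs.
    fake_scores = {0: 0.12, 8: 0.15, 32: 0.41, 128: 0.78, 512: 0.83}
    depth = elicitation_depth(lambda n: fake_scores[n], sorted(fake_scores), threshold=0.7)
    print(f"capability surfaces at ~{depth} fine-tuning examples")
```

The black-box-elicitation version would be the same idea with "work hours of prompt engineering" on the x-axis instead of sample count.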
This is related to a recent paper I worked on about training models to be steering-aware: https://x.com/joshycodes/status/2031384687760003140?s=20