TurnTrout comments on What’s the short timeline plan?

TurnTrout 8 Jan 2025 23:52 UTC
LW: 28 AF: 15
There are many possible ways these could be implemented, e.g. we could have black-box-only monitors such as smaller-but-faster models that constantly run in parallel (like Gemini-flash to monitor Gemini).
Suppose you’re doing speculative decoding on Gemini using Gemini Flash as the cheap draft model. Train Flash to have an extra head for each metric of interest (like “is this token part of text which is scheming”). You’re running Flash anyway for speculative decoding, so the heads reuse a forward pass you already pay for, leading to roughly zero amortized monitoring tax (just the fixed cost of training the heads).
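To make the shape of the idea concrete, here is a minimal toy sketch in plain Python. It is not how Gemini or Flash actually work: `ToyDraftModel`, `draft_with_monitoring`, the random weights, and the mean-of-embeddings "trunk" are all hypothetical stand-ins. It only shows the drafting half of speculative decoding (the large model's verification step is omitted), and the point is just that the token head and the monitor head read the same hidden state, so scoring adds no extra trunk passes.

```python
import math
import random


class ToyDraftModel:
    """Hypothetical stand-in for the cheap draft model, with two heads."""

    def __init__(self, vocab_size=8, hidden_size=4, seed=0):
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        # Shared trunk parameters: one embedding vector per token (toy assumption).
        self.embed = [[rng.gauss(0, 1) for _ in range(hidden_size)]
                      for _ in range(vocab_size)]
        # Head 1: next-token logits, used to propose draft tokens.
        self.lm_head = [[rng.gauss(0, 1) for _ in range(hidden_size)]
                        for _ in range(vocab_size)]
        # Head 2: a linear monitor probe (untrained here; trained in the real proposal).
        self.monitor_head = [rng.gauss(0, 1) for _ in range(hidden_size)]
        self.trunk_calls = 0  # counts the expensive part of each step

    def trunk(self, context):
        """Toy 'forward pass': mean of the context embeddings (context must be non-empty)."""
        self.trunk_calls += 1
        hidden = [0.0] * self.hidden_size
        for tok in context:
            for i, value in enumerate(self.embed[tok]):
                hidden[i] += value / len(context)
        return hidden

    def step(self, context):
        """One trunk pass feeds both heads."""
        hidden = self.trunk(context)
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in self.lm_head]
        proposal = max(range(self.vocab_size), key=logits.__getitem__)  # greedy draft
        probe = sum(w * x for w, x in zip(self.monitor_head, hidden))
        score = 1.0 / (1.0 + math.exp(-probe))  # monitor score for the text read so far
        return proposal, score


def draft_with_monitoring(model, prompt, k):
    """Draft k tokens, collecting a monitor score from each draft step."""
    context = list(prompt)
    proposals, scores = [], []
    for _ in range(k):
        tok, score = model.step(context)
        proposals.append(tok)
        scores.append(score)
        context.append(tok)
    return proposals, scores


model = ToyDraftModel()
proposals, scores = draft_with_monitoring(model, prompt=[1, 2, 3], k=4)
# Four draft tokens, four monitor scores, and still only four trunk passes.
print(proposals, [round(s, 3) for s in scores], model.trunk_calls)
```

In this toy, the marginal cost of monitoring is one extra dot product per step on top of a pass the drafter runs regardless. In a real system you’d presumably also have to decide what to do with scores for draft tokens the big model rejects.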