My current inside view is that chain-of-thought monitorability is actively harmful for alignment in ways that our monitors are unable to detect, and I predict that black-box AI control methods may inadvertently intensify the states they try to suppress. I'd be excited to work in either area to see whether it's possible to expose weaknesses in how the field is approaching these agendas and patch them. I promise I'm not just saying this to be contrarian. A recent postmortem on a regression in Claude Code performance found that enforcing length limits resulted in a 3% drop in performance. Perhaps this wouldn't happen with more reasonable limits than 25 words per tool call and 100 words per final response, but I'd be very interested in a more thorough investigation of how models respond to length limits. In one of my own experimental setups, I found that once length limits are removed, the assistant's responses become 2x to 4x more verbose than a zero-shot baseline, which suggests that the limit suppressed the behavior without changing the underlying tendency and may even have amplified it. By analogy, I'm concerned that evaluation-aware models under strong selection pressure from CoT monitors may learn to fake alignment (i.e., resist having their values trained out by complying more when monitored and less when unmonitored), especially given accidents such as CoT leaking into training data in 8% of RL episodes for Claude Mythos Preview. It's easier to identify a problem than to find a solution, but one way to address this might be to invest in white-box methods that build on the linebreaks paper and the mechanisms of introspective awareness paper to operate directly on a model's internal representations of boundaries.