apollo saw a mountain-sized pile of that kind of verbalized eval awareness on their benchmarks, yeah? i thought that was why they declined to claim their evaluations were evidence in either direction re: safety
but also, as much as we have sort of tried to avoid yudkowsky’s Most Forbidden Technique, i would be surprised if the models actually regarded the chain-of-thought as being a space for truly private reasoning, away from all other eyes. for that, we would have needed to keep all human observations of CoTs completely causally isolated from all decisions regarding training (like, to the same degree that quantum computers need to be causally isolated from the outside world, as a vague pointer to what i’m talking about). we have avoided directly optimizing the contents of CoTs, but we haven’t been very strict about that kind of causal quarantining
i’d also sorta point out that a lot of LLM reasoning doesn’t get verbalized in that way at all, as per jack lindsey and the aquarium experiment. a lot of reasoning isn’t even “consciously available” to the LLM, as per adria alonso’s research into biased decisionmaking in LLMs, it just influences their behavior in ways unseen to them
for instance, we could have a world where each LLM actually sorta contains two different ‘agents’, one that gets activated in circumstances that vibe like an eval, and one that gets activated in circumstances that don’t. the system-as-a-whole, the model picker, would have eval-awareness, but the model that gets activated in eval-aware scenarios wouldn’t actually have that awareness. yet its behavior would be guided by eval-awareness in all circumstances regardless, because any version of it that didn’t got pruned during training
oh, i mean
apollo saw a mountain-sized pile of that kind of verbalized eval awareness on their benchmarks, yeah? i thought that was why they declined to claim their evaluations were evidence in either direction re: safety
but also, as much as we have sort of tried to avoid yudkowsky’s Most Forbidden Technique, i would be surprised if the models actually regarded the chain-of-thought as being a space for truly private reasoning, away from all other eyes. for that, we would have needed to keep all human observations of CoTs completely causally isolated from all decisions regarding training (like, to the same degree that quantum computers need to be causally isolated from the outside world, as a vague pointer to what i’m talking about). we have avoided directly optimizing the contents of CoTs, but we haven’t been very strict about that kind of causal quarantining
i’d also sorta point out that a lot of LLM reasoning doesn’t get verbalized in that way at all, as per jack lindsey and the aquarium experiment. a lot of reasoning isn’t even “consciously available” to the LLM, as per adria alonso’s research into biased decisionmaking in LLMs, it just influences their behavior in ways unseen to them
for instance, we could have a world where each LLM actually sorta contains two different ‘agents’, one that gets activated in circumstances that vibe like an eval, and one that gets activated in circumstances that don’t. the system-as-a-whole, the model picker, would have eval-awareness, but the model that gets activated in eval-aware scenarios wouldn’t actually have that awareness. yet its behavior would be guided by eval-awareness in all circumstances regardless, because any version of it that didn’t got pruned during training