While the Chain-of-Thought (CoT) for OpenAI models reflects the model’s cognition reasonably accurately, the CoT for Anthropic models does so to a substantially lesser extent. This may be due to “spillover” effects, where reinforcement on outputs transfers to the CoT because Anthropic’s CoT is less distinct from the output.
Could the primary cause instead be (as Tim Hua notes) that Anthropic has trained past models to have reward-model-pleasing chains of thought?
Here’s a brief attempt to ground/concretize people’s understanding of how much funding CG provides for “more ambitious” AI safety research: I can quickly think of around $40m worth of grants we’ve made over the last year (a lot in the last few months, so not all disbursed/posted yet) that meet the following criteria (and probably there are more I didn’t think of):
Technical research
Done by people trying to reduce misalignment risk and who are very familiar with the LW literature
Who are using mathematical methods to strive for higher-assurance safety
A few preemptive clarifications:
I don’t think this settles the question of “to what extent are funders setting incentives that pull against doing more ambitious research?”, but I think it provides some info (an “upper bound”) about how strong those incentives could possibly be. $40m/year is a lot more than $0: it’s more than the SFF spends, and more than we spent on our two “flagship” RFPs in 2023/2024.
I’m not saying it’s “enough”—CG would like to spend more money on more ambitious work, and I am trying to make it happen.
I think my criteria above are a reasonable operationalization of “more ambitious” work, but I don’t think they’re the only one.