I am an independent AI researcher focusing on interpretability. My day job is technical consulting for an AI-enabled financial software firm.
CharlesL
I’ll run these experiments this week and let you know the findings; I appreciate the insight.
A thought about the section where you describe your updated model of the ‘race to the top’:
Is there any indication that frontier labs other than Anthropic are likely to adopt RSPs before some catastrophe happens, if one does? That premise is the basis for the new model you lay out here. And if they do adopt RSPs, what makes you optimistic that the same employees who are currently advancing capabilities as fast as they can at other labs would reprioritize safety more than they do now? Or that safety-focused employees wouldn’t be pushed out of certain other frontier LLM developers? We seemingly see that happening already at some places, not to name names.
I have trouble finding a ‘race to the top’ on safety/alignment credible when far more forces, whether investors or current government stakeholders, appear to be encouraging a ‘move as fast as possible no matter what’ regime through their words and actions. We saw this recently with the Anthropic–Pentagon dispute; most other labs do not seem particularly committed to their stated principles in the face of political pressure or economic forces.