Joining big labs to work on alignment = actually helping improve capabilities, helping concentrate power at the top
I’m working through my own theory of change right now and would really appreciate any sources that helped you arrive here.
My current prior is weaker. I think the fungibility argument has weight (alignment research feeds back into capability, safety teams give legitimacy to labs that could be misused, commercial pressure bends commitments, e.g. Anthropic’s RSP v3 walking back concrete if-then triggers). But I don’t currently see it as fully fungible. The counterfactual of “Anthropic’s best alignment people go elsewhere” doesn’t leave the frontier of AI safer, and I put some weight on the founders’ OpenAI departure and original RSP as evidence of a real, albeit vulnerable, safety preference.
Examples of cruxes that would move me toward your view would be evidence that safety hires at frontier labs accelerate capability timelines, or accounts of where those people working on alignment would go and why that’s better.
What have you read/observed that formed the cynical view? Would love to weight it against my priors instead of just arguing from mine.
Agreed. Even in non-technical consulting projects (e.g., strategy, change management), I’ve found high ROI by turning our process documentation into skills and turning our project plans into memory layers (e.g., current focus, recent progress). I’ve found that lessons / workflows from software development are useful for pretty much any domain, especially as more work is delegated to the agent as you mentioned.