Hello!
I’m new here, but I’ve been reading through the Sequences and other posts for the last few weeks and would love some feedback on a post idea. I’m writing up my theory of change for AI safety and where I might be able to help. I’ve defined my priors, identified cruxes, and I’m in the middle of reading papers and blog posts to challenge those priors. I’ve seen a few theory-of-change posts (e.g., Critch’s healthtech post), but I’m wondering whether I should post mine as a working document: start with an unfinished product and update it as I refine my beliefs.
Is an in-progress theory of change interesting/useful for LessWrong, or should I wait until it’s complete?
Since this is also my first post here, I’m dropping some of my background info below.
Background: 10 years of professional experience, with the first 5 in structural engineering and the last 5 consulting on data/AI projects for US transportation agencies. This year I’ve been volunteering with Building Humane Technology on HumaneBench, a benchmark showing how frontier models can be steered toward anti-humane behavior just by adjusting their system prompts.
I’m working through my own theory of change right now and would really appreciate any sources that helped you arrive at your view.
My current prior is weaker than yours. I think the fungibility argument has weight: alignment research feeds back into capabilities, safety teams lend labs legitimacy that can be misused, and commercial pressure bends commitments (e.g., Anthropic’s RSP v3 walking back concrete if-then triggers). But I don’t currently see safety work as fully fungible with capabilities work. The counterfactual where Anthropic’s best alignment people go elsewhere doesn’t obviously leave the frontier of AI safer, and I put some weight on the founders’ departure from OpenAI and the original RSP as evidence of a real, albeit vulnerable, safety preference.
Cruxes that would move me toward your view: evidence that safety hires at frontier labs accelerate capability timelines, or accounts of where those alignment researchers would go instead and why that would be better.
What have you read or observed that formed the cynical view? I’d love to weigh it against my priors instead of just arguing from mine.