Joe Carlsmith comments on Can we safely automate alignment research?

Joe Carlsmith 2 May 2025 0:34 UTC
LW: 6 AF: 5
0
AF
I think it’s a fair point that if it turns out that current ML methods are broadly inadequate for automating basically any sophisticated cognitive work (including capabilities research, biology research, etc—though I’m not clear on your take on whether capabilities research counts as “science” in the sense you have in mind), it may be that whatever new paradigm ends up successful messes with various implicit and explicit assumptions in analyses like the one in the essay.
That said, I think if we’re ignorant about what paradigm will succeed re: automating sophisticated cognitive work and we don’t have any story about why alignment research would be harder, it seems like the baseline expectation (modulo scheming) would be that automating alignment is comparably hard (in expectation) to automating these other domains. (I do think, though, that we have reason to expect alignment to be harder even conditional on needing other paradigms, because I think it’s reasonable to expect some of the evaluation challenges I discuss in the post to generalize to other regimes.)