I wrote several attempts at a reply and deleted them all because none of them were cruxes for me. I went for a walk and thought more deeply about my cruxes.
I am now back from my walk. Here is what I have determined:
No reply I could write would be cruxy because my original post is not cruxy with respect to my personal behavior.
I believe the correct thing for me to do is to advocate for slowing down AI development, and to donate to orgs that cost-effectively advocate for slowing down AI development. And my post is basically irrelevant to why I believe that.
So why did I write the post? When I wrote it, I wasn’t thinking about cruxes. It was just an argument I had been thinking about that I’d never read before, and I thought someone ought to write it out.
And I’m not sure exactly who this post is a crux for. Perhaps if someone had a particular combination of beliefs about:

- the probability that slowing down AI development will work
- the probability that bootstrapped alignment will work
where they’re teetering on the edge between “slowing down AI development is good” and “slowing down AI development is bad because it prevents bootstrapped alignment from happening”. My argument might shift that person from the second position to the first. I don’t know if any such person exists.
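To make that concrete, here is a toy comparison with made-up numbers (none of them are my actual estimates); it’s only meant to illustrate the kind of teetering position I have in mind.

```python
# Toy numbers only; none of these are my actual probability estimates.
p_slowdown_works = 0.30   # hypothetical P(a slowdown buys enough time for alignment)
p_bootstrap_works = 0.28  # hypothetical P(bootstrapped alignment works if development continues)

# Someone weighing "slow down" against "keep going and bootstrap" compares these directly.
print(f"P(good outcome | slowdown):  {p_slowdown_works:.2f}")
print(f"P(good outcome | bootstrap): {p_bootstrap_works:.2f}")

# With numbers this close, a modest update to either probability flips which option
# looks better; that is the person my argument might actually move.
```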
This is most relevant to slowing down AI development at particular companies—say, if DeepMind slows down and gets significantly surpassed by Meta, then Meta will probably do something that’s even less likely to work than bootstrapped alignment. But a global coordinated slowdown—which is my preferred outcome—does not replace bootstrapped alignment with a worse alignment strategy.
Even though it’s not cruxy, I feel like I should give an object-level response to your comment:
I agree with the denotation of your comment because it is well-hedged—I agree that 5% of resources might be enough to solve alignment. But it probably won’t be.
I think my biggest concern isn’t that AI alignment has no scalable solutions (I agree with you that it probably does have them); my concern is more that alignment is likely to be too hard / get outpaced by capabilities and we will have ASI before alignment is solved.
> We can in principle solve a large fraction of safety/alignment with fully theoretical safety research, without any compute, while it seems harder to do purely theoretical capabilities research.
Not to say I disagree (my intuition is that theoretical approaches are underrated), but this contradicts AI companies’ plans (or at least Anthropic’s). Anthropic has claimed that they need to build frontier AI systems on which to do safety research. They seem to think they can’t solve alignment with theoretical approaches. More broadly, if they’re correct, then it seems to me (although it’s not a straightforward contradiction) that alignment bootstrapping won’t benefit much from scale, because alignment-related experiments will themselves require increasing amounts of compute.
FWIW I think your point is more reasonable than Anthropic’s position (I wrote some relevant stuff here). But I thought it was worthwhile to point out the contradiction.