Re: “number go up” research tasks that seem automatable—one idea I had is to use an LLM to process the entire LW archive, and identify alignment research ideas which could be done using the “number go up” approach (or seem otherwise amenable to automation).
“Proactive unlearning”, in particular, strikes me as quite a promising research direction that could be automated, especially if it is possible to “proactively unlearn” scheming. Gradient routing would be an example of the sort of approach I have in mind.
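To make the gradient-routing idea concrete, here is a minimal toy sketch (not the original gradient-routing implementation): a linear model whose weights are split into a “retain” block and a “forget” block, where updates from flagged data are routed only to the forget block, which can later be ablated. All names and the flagging mechanism are illustrative assumptions.

```python
# Toy gradient routing: route gradient updates from flagged (undesired) data
# into a designated "forget" block of weights, keeping the "retain" block clean.
import numpy as np

rng = np.random.default_rng(0)
d = 4
w_retain = rng.normal(size=d)   # block we want to keep clean
w_forget = np.zeros(d)          # block that absorbs flagged-data updates

def predict(x):
    # The model uses the sum of both blocks at inference time.
    return x @ (w_retain + w_forget)

def sgd_step(x, y, flagged, lr=0.1):
    """One least-squares SGD step; the gradient is routed by the flag."""
    global w_retain, w_forget
    grad = 2 * (predict(x) - y) * x   # gradient w.r.t. (w_retain + w_forget)
    if flagged:
        w_forget -= lr * grad         # undesired data touches only this block
    else:
        w_retain -= lr * grad

x, y = rng.normal(size=d), 1.0
before = w_retain.copy()
sgd_step(x, y, flagged=True)
assert np.allclose(w_retain, before)   # retain block untouched by flagged data
assert not np.allclose(w_forget, 0.0)  # forget block absorbed the update

# "Proactive unlearning" of the flagged capability is then just ablation:
w_forget[:] = 0.0
```

The point of the split is that unlearning becomes a cheap, mechanical operation (zeroing a known block) rather than an open-ended optimization problem.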
To elaborate: I think it is ideal to have an automated way to prevent an AI from acquiring undesirable capabilities and knowledge in the first place. If your scheming metrics are sufficiently good, and you can keep them running for the entire training run, you might be able to nip scheming in the bud: any tendency towards scheming (which, if left unaddressed, might later corrupt the metrics to hide itself) would show up in the metrics before it’s given space to flower.
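The “keep the metrics running for the entire training run” part might look something like the following sketch, where a scheming metric is checked after every training step and triggers an intervention the moment it crosses a threshold. The metric, threshold, and intervention here are all hypothetical stand-ins.

```python
# Sketch of continuous in-training monitoring: evaluate a scheming metric
# after each step and intervene as soon as it crosses a threshold, before
# the tendency can entrench itself (or corrupt the metric).
def scheming_metric(step):
    # Stand-in: in reality this would probe the model (e.g. with evals or
    # interpretability tools); here it simply rises slowly with training.
    return 0.01 * step

THRESHOLD = 0.05
halted_at = None
for step in range(1000):
    # train_step(model, batch)  # the real training update would go here
    if scheming_metric(step) > THRESHOLD:
        halted_at = step  # trigger intervention: rollback, unlearning, etc.
        break

assert halted_at == 6  # 0.01 * 6 = 0.06, first value above the threshold
```

The automatable “number go up” framing would then be: make the metric cheaper, more sensitive, and harder to game, so the intervention fires earlier.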