This list seems primarily focused on local alignment. Have you seen anything that you felt was promising for being a path to asymptotic alignment?
I am most familiar with research into alignment and control of early transformative AIs (which could, if things go well, greatly help with aligning even more powerful AIs), which is why that work is the most represented here.
I know of a few agendas that are about alignment of top-human-expert-dominating AIs (TEDAIs): AIs that you can defer to if aligned, and which can quickly do all the research you might have wanted to do in advance. Some of the interp, debate, and science-of-generalization work above falls in this category. These things feel promising as a path to TEDAI alignment, though none of them would give you high assurance of safety. I am less familiar with more ambitious / non-LLM angles of attack on this problem (e.g. the brain-like AGI agenda, for which I'd guess you can find important 2025 work on Steven Byrnes's LW profile), so my list is incomplete for this part of the field.
My current understanding is that direct progress on actual asymptotic alignment (techniques to align arbitrarily smart AIs, including e.g. Jupiter-sized brains) has been very slow. But I think the reasons to build techniques to align AIs much beyond TEDAIs are quite weak, so it's fine to focus on less ambitious work. You might have hoped to find a simple core to alignment and simplify the problem by looking for a general solution, but this is mostly an instrumental strategy for solving the problem of aligning TEDAIs, and you don't need it to succeed at aligning TEDAIs. The main remaining reason is to estimate AI alignment difficulty for political coordination purposes: depending on how easy the AIs you defer to find it to align even more powerful AIs, those AIs might have preferred that you had eaten more or less cost to coordinate on a slowdown, and it might be too late for them to fix your mistakes because of inertia. But I think this reason gets weaker as you get further from the TEDAI regime, since at a certain point AIs are the main players, and it would be very surprising to me if they couldn't coordinate to slow down at or soon after TEDAI (if each of them was aligned enough).
Hmm, interesting. To my surprise, my standards for something to warrant the name "asymptotic alignment" would have been lower than yours: I'd consider a technique stack to x%-qualify if it is a series of local alignment techniques which can be expected, with x% confidence, to land us in the long-term basin of successful asymptotic alignment by the year 2200 or so. I think I'd rather update my understanding of the term than yours, but I'll have to keep this in mind when choosing what language to use, I suppose.
I think most of the places where I expect us to have missed holes in the technique stack that gets us to a stop are about how organizations behave and how the intermediate AIs get deployed, so my views don't seem to conflict with yours the way I naively expected before you replied. Good to see!