I think we need a much more complex error theory for humans. The probability of a human making a mistake goes up if they are sick, tired, rushed, stressed, or being manipulated (for example by sycophancy); it also depends on the individual skills and capabilities of the human in question, and on the difficulty of the specific task for that specific human. When humans make errors, some of these factors tend to produce relatively unbiased, mostly-random errors, while others, like being manipulated, tend to produce very directed and predictable errors.
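As a toy illustration of the kind of structure such an error theory would need to capture (every factor name, weight, and the logistic form here is my own assumption, not a proposed model), the random factors raise an unbiased flip probability, while manipulation pushes the answer in one predictable direction:

```python
import math
import random

def error_probability(fatigue, stress, task_difficulty, skill):
    """Toy model: chance of an erroneous judgement rises with fatigue, stress,
    and the gap between task difficulty and the rater's skill (logistic link)."""
    logit = -2.0 + 1.5 * fatigue + 1.0 * stress + 2.0 * (task_difficulty - skill)
    return 1.0 / (1.0 + math.exp(-logit))

def simulate_feedback(true_label, fatigue, stress, task_difficulty, skill,
                      manipulation_pressure=0.0):
    """Return the binary label a human rater actually reports.

    Fatigue, stress, and difficulty flip the label symmetrically, so those
    errors are roughly unbiased and random. Manipulation instead pushes the
    answer toward a specific target (here: always report 1), so its errors
    are directed and predictable rather than random.
    """
    p_err = error_probability(fatigue, stress, task_difficulty, skill)
    label = true_label
    if random.random() < p_err:                    # unbiased, mostly-random error
        label = 1 - label
    if random.random() < manipulation_pressure:    # directed, predictable error
        label = 1
    return label
```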
All of this is obvious and well-known (especially to anyone used to working in the soft sciences, or otherwise working with humans a lot), both to the humans and to the AIs involved. When building a system that includes human input or feedback, all of this should be taken into account. Current work on weak-to-strong generalization barely scratches the surface of this: it just acknowledges that the weak supervisor can make mistakes, but doesn't attempt to learn about, model, and then compensate for any of the deep, complex structure of those mistakes.
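One way to go beyond merely acknowledging that the weak supervisor errs is to feed an estimated error model back into training. As a hedged sketch, not a description of any existing weak-to-strong setup: if we could estimate a per-example probability that a piece of human feedback is wrong (for instance from a model like the toy one above), a standard trick from the label-noise literature is a backward-corrected, unbiased loss:

```python
import torch
import torch.nn.functional as F

def noise_corrected_bce(logits, observed_labels, flip_prob):
    """Backward-corrected binary cross-entropy for randomly flipped labels.

    `flip_prob` is a per-example estimate of the probability that the human
    feedback was an unbiased, random error. The corrected loss is an unbiased
    estimate of the loss against the true labels, so structured knowledge
    about *when* humans err gets used rather than ignored.
    Assumes flip_prob < 0.5 for every example.
    """
    loss_as_given = F.binary_cross_entropy_with_logits(
        logits, observed_labels, reduction="none")
    loss_if_flipped = F.binary_cross_entropy_with_logits(
        logits, 1.0 - observed_labels, reduction="none")
    corrected = ((1.0 - flip_prob) * loss_as_given
                 - flip_prob * loss_if_flipped) / (1.0 - 2.0 * flip_prob)
    return corrected.mean()
```

Directed errors such as manipulation would need an asymmetric correction (different error rates per direction), which is exactly the kind of structure current weak-to-strong work doesn't try to model.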
I think the Value Learning framework is useful here. Ideally we would like our smart AIs to be researching human values, including researching errors in feedback/input that humans give about their values. The challenge here is that the AI doing the value learning needs to be, not exactly aligned, but at least sufficiently close to aligned that it actually wants to do good research to figure out how to become (or create a successor that is) better aligned. So that converts this into a three-step problem:
1) humans need to figure out how large the region of convergence is: how aligned does an AI need to be for its value learning to produce results that are better aligned than it is? In particular, we want a confident lower bound on the size of the convergence region, i.e. a region in which we're very confident this process will converge (sketched in toy form after this list)
2) humans need to test their AI and convince themselves that it’s at least aligned enough to be inside the region of convergence
3) now we can work with the semi-aligned AI to create a better-aligned successor, and iterate that process as it converges
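To make the convergence claim concrete, here is a minimal fixed-point sketch of the iteration in steps 1) to 3). The update function (how aligned the successor is, as a function of how aligned the current AI is) is entirely made up for illustration; the real shape of that function is the empirical question step 1) is trying to bound. The point is only that the region of convergence is the set of starting alignments from which repeated value learning drives alignment upward toward a fixed point rather than downward.

```python
def successor_alignment(a):
    """Toy update: alignment of the successor produced by an AI with
    alignment a (both in [0, 1]). Purely illustrative."""
    return a + 0.3 * a * (1.0 - a) - 0.05   # small fixed cost, gains scale with a

def in_region_of_convergence(a0, steps=100):
    """Check whether, starting at alignment a0 and iterating step 3),
    each successor is at least as aligned as its predecessor and the
    process ends up better aligned than where it started."""
    a = a0
    for _ in range(steps):
        a_next = successor_alignment(a)
        if a_next < a:   # value learning made things worse: outside the region
            return False
        a = a_next
    return a > a0

# Step 1) in toy form: scan starting points to find the lower edge of the region.
if __name__ == "__main__":
    for a0 in [0.1, 0.2, 0.3, 0.5, 0.7]:
        print(f"start {a0:.1f}: converges upward = {in_region_of_convergence(a0)}")
```

With this made-up update, starting points of 0.3 and above improve toward a stable fixed point, while 0.2 and below degrade; the lower bound we want in step 1) is an empirical, confidently established analogue of that threshold.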
One thing that actually helps with step 1) is that the stakes of AI alignment are extremely high. Any model sufficiently close to aligned that it doesn't want to risk the extinction of the human race, or similar X-risks like permanent loss of autonomy, is motivated to do a good job of value learning. Beyond that, we only need it to understand humans well enough to be capable of performing this research, which seems like a low bar for LLMs (e.g. LLM simulations of doctors already have better bedside manner than the majority of real doctors). So basically, because the stakes are so high, a bare minimum of “don’t kill every human” level semi-alignment seems likely to be most of what we need for step 1).
Of course, that still leaves step 2), which seems like the area where things like debate might be helpful. However, at step 2) we're presumably dealing with something around, or at most just above, smart-human capabilities, not a full-blown ASI. Under this approach, we should already be well into step 3) convergence before we get to full-blown ASI.
A more subtle issue here is whether the region of convergence has a unique stable point inside it, or several, and if the latter, how we identify them and decide which one to aim the process at. That sounds like the sort of question that might come up once we were well into the ASI capabilities range, and where debate might be very necessary.
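At the toy-model level, "identify them" could look like locating the fixed points of the update and classifying each as attracting or repelling. Reusing the made-up successor_alignment function from the earlier sketch (redefined here so the snippet is self-contained), that toy update already has two fixed points: a repelling one at the lower edge of the region and an attracting one that the process actually settles at. Which attracting point to aim for is the part no amount of scanning settles, and is where debate might earn its keep.

```python
def successor_alignment(a):           # same toy update as in the earlier sketch
    return a + 0.3 * a * (1.0 - a) - 0.05

def find_fixed_points(f, eps=1e-6, grid=10_000):
    """Locate fixed points of f on [0, 1] by sign changes of f(a) - a, and
    classify each as stable (attracting) or unstable (repelling) using the
    numerical derivative test |f'(a)| < 1."""
    points = []
    xs = [i / grid for i in range(grid + 1)]
    for lo, hi in zip(xs, xs[1:]):
        if (f(lo) - lo) * (f(hi) - hi) <= 0:    # sign change: a fixed point in [lo, hi]
            a = 0.5 * (lo + hi)
            deriv = (f(a + eps) - f(a - eps)) / (2 * eps)
            points.append((round(a, 3), "stable" if abs(deriv) < 1 else "unstable"))
    return points

# With the toy update this reports roughly:
# [(0.211, 'unstable'), (0.789, 'stable')]
print(find_fixed_points(successor_alignment))
```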