I have seen several proposals for solving alignment (such as OpenAI’s Superalignment initiative) that involve harnessing incremental or “near-human level” AI to advance alignment research. I recall from recent interviews that Eliezer is skeptical of this approach, at least partially on the grounds that an AI sufficiently advanced to contribute meaningfully to alignment work would 1) already be dangerous in its own right, and 2) be capable of deceiving human alignment researchers with false or insufficient proposals.
Would it be possible to resolve the second problem by holding the AI accountable not directly to human researchers or an overseer AI, as is commonly suggested, but instead to a supermajority view of multiple AIs?
I am imagining several similarly capable AIs, ideally based on fundamentally different models but at least incorporating some source of randomness, and prevented from communicating directly with each other. They would each be asked the question, “Would the other AIs believe proposal X is a sound* alignment solution/research direction?” Answers would probably have to be limited to Yes/No or multiple choice. An AI would be penalized, up to and including termination, for failing to correctly predict the answers of the other AIs. An otherwise deceitful AI would thereby reason that its best chance of not being terminated early is to provide non-deceptive answers.
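To make the scoring mechanism concrete, here is a minimal toy sketch of the cross-prediction round described above. Everything here is a simplifying assumption: each “AI” is reduced to a Yes/No answer, `score_round` is a hypothetical scoring rule, and a two-thirds threshold stands in for whatever supermajority is chosen; a real setup would be far richer.

```python
# Toy sketch of one cross-prediction round: each isolated model answers
# "Would the other AIs call proposal X sound?", and outliers relative to
# a supermajority are penalized (the stand-in here for "termination").
from collections import Counter

def score_round(answers):
    """answers: mapping of model name -> 'yes'/'no' prediction about the
    other models' verdicts. Returns (supermajority verdict or None, scores)."""
    tally = Counter(answers.values())
    majority_answer, majority_count = tally.most_common(1)[0]
    # Hypothetical supermajority threshold: strictly more than two thirds.
    supermajority = majority_count >= (2 * len(answers)) // 3 + 1
    scores = {}
    for name, ans in answers.items():
        if supermajority and ans == majority_answer:
            scores[name] = 1    # matched the supermajority prediction
        else:
            scores[name] = -1   # outlier: candidate for penalty
    return (majority_answer if supermajority else None), scores

verdict, scores = score_round(
    {"model_a": "yes", "model_b": "yes", "model_c": "yes", "model_d": "no"}
)
```

In this example three of four models predict “yes”, which clears the threshold, so `model_d` is the penalized outlier. Note that the sketch also makes the homogenization worry visible: the rule rewards agreement as such, not truth.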
Of course, this doesn’t resolve the risk of overshooting “near-human intelligence” (which might plausibly occur for all or most of the AIs in the pool), of one or more AIs discovering a way to circumvent the air gap and communicate with the others, or of the AIs discovering some Schelling point stronger than truthful answers. I’m merely hoping to mitigate the deceptiveness EY foresees alignment researchers falling prey to, while still leveraging incremental improvements in AI.
Outside view tells me this idea will be ripped to shreds. What am I missing?
*I acknowledge the phrasing of this question side-steps the open problem of defining exactly what we want from a successful alignment program. I’m not trying to solve alignment so much as suggest an approach for combating deception specifically.
Not being able to directly communicate with the others would be an issue in the beginning, but I’m guessing you would be able to use the setup to work out what the others think.
A bigger issue is that this would probably result in a very homogeneous group of minds. They’re optimizing not for correct answers, but for consensus answers. It’s the equivalent of studying for the exam rather than the subject. A fun example is the Polish equivalent of the SAT exams (this probably generalizes, but I don’t know about other countries). I know quite a few people who went on to study biology and then decided to retake the biology exam (as one can do). Most retakers had worse results the second time round, precisely because they had more up-to-date knowledge: the exam is at least 10 years behind the current state of knowledge, so they give answers that are correct as of today and have them marked as incorrect. I’d expect the group of AIs to eventually converge on a set of acceptable beliefs, rather than correct ones.
I expect it to be much harder to measure the “smarts” of an AI than it is to measure the smarts of a person (because all people share a large amount of detail in their cognitive architecture), so any approach that employs “near-human level” AI runs the risk that at least one of those AIs is not near human level at all.