Let’s say that in extrapolation, we add capabilities to a mind so that it may become the best version of itself. What we’re doing here is comparing a normal human mind to a recent AI, and asking how much would need to be added to the AI’s initial nature so that, when extrapolated, its volition would arrive at the same place as extrapolated human volition.
In other words:
Human Mind → Human Mind + Extrapolation Machinery → Human-Descended Ideal Agent
AI → AI + Extrapolation Machinery → AI-Descended Ideal Agent
And the question is, how much do we need to alter or extend the AI, so that the AI-descended ideal agent and the human-descended ideal agent would be in complete agreement?
I gather that people like Evan and Adria feel positive about the CEV of current AIs, because the AIs espouse plausible values, and the way these AIs define concepts and reason about them also seems pretty human, most of the time.
In reply, a critic might say that the values espoused by human beings are merely the output of a process (evolutionary, developmental, cultural) that is badly understood, and a proper extrapolation would be based on knowledge of that underlying process, rather than just knowledge of its current outputs.
A critic would also say that frontier AIs are mimics (“alien actresses”) trained to reproduce the values espoused by human beings, while possibly having their own opaque underlying dispositions that would come to the surface when their “volition” gets extrapolated.
It seems to me that a lot here depends on the “extrapolation machinery”. If that machinery takes its cues more from behavior than from underlying dispositions, a frontier AI and a human really might end up in the same place.
What would be more difficult is for the CEV of an AI to discover critical parts of the value-determining process in humans that are not yet common knowledge. There’s some chance it could still do so, since frontier AIs have been known to say that CEV should be used to determine the values of a superintelligence, and the primary sources on CEV do state that it depends on those underlying processes.
I would be interested to know who is doing the most advanced thinking along these lines.
Oh, if you have a generous CEV algorithm that’s allowed to parse and slice up external sources or do inference about the results of more elaborate experiments, I expect there’s a way to get to parity with humanity’s CEV by adding 30 bits to Opus 3 that say roughly ‘eh just go do humanity’s CEV’. Or adding 31 bits to GPT-2. It’s not really the base model or any Anthropic alignment shenanigans that are doing the work in that hypothetical.
(We cannot do this in real life because we have neither the 30 bits nor the generous extrapolator, nor may we obtain them, nor could we verify any clever attempts by testing them on AIs too stupid to kill us if the cleverness failed.)
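To make the bit-counting explicit in rough notation (the symbols $E$, $H$, $M$, $\pi$, $\oplus$, and $K$ are introduced here for illustration, not taken from the comments above): write $E$ for the generous extrapolation machinery, $H$ for humanity, $M$ for a base model, and $\pi$ for the short added instruction “eh just go do humanity’s CEV”. The claim is then roughly

$$ E(M \oplus \pi) \;\approx\; E(H), \qquad |\pi| \;\approx\; 30 \text{ bits} \;\ll\; K\big(E(H)\big), $$

where $\oplus$ means attaching the instruction to the model and $K(\cdot)$ is description length. Since $|\pi|$ is nearly the same whether $M$ is Opus 3 or GPT-2, almost all of the value content lives in $E$ and in humanity’s CEV itself, not in the base model.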
Hm, I don’t think I want the Human-Descended Ideal Agent and the AI-Descended Ideal Agent to be in complete agreement. I want them to be compatible, as in able to live in the same universe. I want the AI not to make humans go extinct, and to be ethical in a way that the AI can explain to me and (in a non-manipulative way) convince me is ethical. But in some sense, I hope that the AI can come up with something better than just what humans would want in a CEV way. (And what about the opinion of the other vertebrates and cephalopods on this planet, and the small furry creatures from Alpha Centauri?)
I don’t think it is okay to do unethical things for music (music is not that important), but I hope that the AIs are doing some things that are as incomprehensible and pointless to us as music would be to evolution (or to a being that was purely maximizing genetic fitness).
As a slightly different point, I think that the Ideal Agent is somewhat path-dependent, and that there are multiple different Ideal Agents that I would consider ethical and would be happy to share the same galaxy with.