I probably disagree. I get the feeling you have an overly demanding definition of “value” which is not necessary for solving corrigibility and a bunch of other problems. Seems like you want to define “value” closer to something like CEV or “caring about the ever-changing semantic essence of human ethical concepts”. But even if we talk about those stronger concepts (CEV-like values, essences), I’d argue the dynamic I’m talking about (“human ability to comprehend limits what humans can care about”) still applies to them to an important extent.
The issue is that the following is likely true according to me, though controversial:
The type of mind that might kill all humans has to do a bunch of truly novel thinking.
To have our values interface appropriately with these novel thinking patterns in the AI, including through corrigibility, I think we have to work with “values” that are the sort of thing that can refer / be preserved / be transferred across “ontological” changes.
Quoting from https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html:
Rasha: “This will discover variables that you know how to evaluate, like where the cheese is in the maze—you have access to the ground truth against which you can compare a reporter-system’s attempt to read off the position of the cheese from the AI’s internals. But this won’t extend to variables that you don’t know how to evaluate. So this approach to honesty won’t solve the part of alignment where, at some point, some mind has to interface with ideas that are novel and alien to humanity and direct the power of those ideas toward ends that humans like.”
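[A purely illustrative sketch of the reporter setup the quoted passage describes: a linear readout fit against ground-truth cheese positions. All shapes and data below are made up; the point is only that the training signal exists exactly where ground truth does, and vanishes for variables we can't evaluate.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: hidden activations of some maze-solving agent, plus the
# ground-truth cheese position we happen to be able to evaluate directly.
n_episodes, hidden_dim = 500, 64
activations = rng.normal(size=(n_episodes, hidden_dim))
true_cheese_xy = activations @ rng.normal(size=(hidden_dim, 2)) \
    + 0.1 * rng.normal(size=(n_episodes, 2))

# "Reporter": a linear readout fit by least squares against the ground truth.
reporter, *_ = np.linalg.lstsq(activations, true_cheese_xy, rcond=None)
predicted = activations @ reporter
print("mean reporter error:", np.abs(predicted - true_cheese_xy).mean())

# For a variable with no available ground truth (e.g. "is this plan deceptive?"),
# there is nothing to put on the right-hand side of the regression, which is
# exactly the limitation the quote is pointing at.
```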
Thanks for elaborating! This might lead to a crux. Let me summarize the proposals from the post (those summaries can’t replace reading the post though).
Outer alignment:
We define something like a set of primitives. Those primitives are independent from any specific ontology.
We prove[1] that as long as the AI acts and interprets tasks using those primitives, it can prevent humans from being killed, brainwashed, or disempowered, even if the primitives are not enough to give a very nuanced definition of a “human” or “brainwashing”. That’s where the “we can express care about incomprehensible things as care about comprehensible properties of incomprehensible things” argument comes into play.
Inner alignment:
We prove that a more complicated model (made of the primitives) can’t deceive a simpler model (made of the primitives). The inner/outer alignment of simple enough models can be verified manually.
We prove that the most complicated model (expressible with the primitives) has at least human-level intelligence.
Bonus: we prove that any model (made of the primitives) is interpretable/learnable by humans and prove that you don’t need more complicated models for defining corrigibility/honesty. Disclaimer: the proposals above are not supposed to be practical, merely bounded and conceptually simple.
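[A minimal toy sketch of the structure these claims presuppose, purely illustrative and not from the post: the primitive set, the complexity measure, and the verification threshold below are hypothetical stand-ins.]

```python
from dataclasses import dataclass

# Hypothetical stand-ins: a "concept" is a small tree built from a fixed set of
# ontology-independent primitives, and its complexity is just the tree size.
PRIMITIVES = {"object", "near", "change", "cause", "human", "ask"}

@dataclass(frozen=True)
class Concept:
    head: str         # should be one of PRIMITIVES
    args: tuple = ()  # sub-concepts

    def complexity(self) -> int:
        return 1 + sum(a.complexity() for a in self.args)

@dataclass
class Model:
    name: str
    concepts: list    # every concept this model is allowed to think/plan with

    def max_complexity(self) -> int:
        return max(c.complexity() for c in self.concepts)

def uses_only_primitives(model: Model) -> bool:
    def ok(c: Concept) -> bool:
        return c.head in PRIMITIVES and all(ok(a) for a in c.args)
    return all(ok(c) for c in model.concepts)

def humanly_verifiable(model: Model, threshold: int = 3) -> bool:
    # Stand-in for "simple enough models can be verified manually": only models
    # whose concepts stay below some complexity bound qualify for manual checking.
    return uses_only_primitives(model) and model.max_complexity() <= threshold

# A very simple model that humans could inspect directly (A0 in the scheme).
a0 = Model("A0", [Concept("human"),
                  Concept("near", (Concept("human"), Concept("object")))])
print(humanly_verifiable(a0))  # True under these toy definitions
```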
Why the heck would we be able to define primitives with such wildly nice properties? Because of the argument that human ability to comprehend and act in the world limits what humans might currently care about, and the current human values are enough to express corrigibility. If you struggle to accept this argument, maybe try assuming it’s true and see if you can follow the rest of the logic? Or try to find a flaw in the logic instead of disagreeing with the definitions. Or bring up a specific failure mode.
To have our values interface appropriately with these novel thinking patterns in the AI, including through corrigibility, I think we have to work with “values” that are the sort of thing that can refer / be preserved / be transferred across “ontological” changes.
If you’re talking about ontological crises or inner alignment, I tried to address those in the post. By the way, I read most of your blog post and skimmed the rest.
To actually prove it we need to fully formalize the idea, of course. But I think my idea is more specific than many other alignment ideas (e.g. corrigibility, Mechanistic Anomaly Detection, Shard Theory).
[call turns out to be maybe logistically inconvenient]
So, somehow you’re able to know when an AI is exerting optimization power in “a way that flows through” some specific concepts? I think this is pretty difficult; see the fraughtness of inexplicitness or more narrowly the conceptual Doppelgänger problem.
It’s extra difficult if you’re not able to use the concepts you’re trying to disallow, in order to disallow them—and it sounds like that’s what you’re trying to do (you’re trying to “automatically” disallow them, presumably without the use of an AI that does understand them).
You say this:
It’s OK if a person’s mental state changes because they notice a pink car (“human object recognition” is an easier to optimize/comprehend process). It’s not OK if a person’s mental state changes because the pink car has weird subliminal effects on the human psyche (“weird subliminal effects on the human psyche” is a harder to optimize/comprehend process).
But I don’t get if, or why, you think that adds up to anything like the above.
Anyway, is the following basically what you’re proposing?
Humans can check goodness of A0 because A0 is only able to think using stuff that humans are quite familiar with. Then Ak is able to oversee Ak+1 because… (I don’t get why; something about mapping primitives, and deception not being possible for some reason?) Then AN is really smart and understands stuff that humans don’t understand, but is overseen by a chain that ends in a good AI, A0.
So, somehow you’re able to know when an AI is exerting optimization power in “a way that flows through” some specific concepts?
Yes, we’re able to tell if the AI optimizes through a specific class of concepts. In most/all sections of the post I’m assuming the AI generates concepts in a special language (i.e. it’s not just a trained neural network), a language which makes it possible to measure the complexity of concepts. The claim is that if you’re optimizing through concepts of a certain complexity, then you can’t fulfill a task in a “weird” way. If the claim is true and the AI doesn’t think in arbitrary languages, then it’s supposed to be impossible to create a harmful Doppelgänger.
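[A toy illustration of the kind of check this reply presupposes. The concept names, their complexity scores, and the threshold are all hypothetical; it only shows the shape of "reject any plan whose optimization goes through concepts above the bound".]

```python
# Toy "special language": each plan step is annotated with the concepts it
# optimizes through, and each concept has a complexity score assigned by the
# (hypothetical) language itself.
CONCEPT_COMPLEXITY = {
    "recognize pink car": 2,
    "move diamond to vault": 3,
    "subliminal influence on human psyche": 9,   # the kind of thing to disallow
}

def plan_is_allowed(plan_steps, bound=5):
    """Allow a plan only if every concept it optimizes through is below the bound."""
    for step, concepts in plan_steps:
        for c in concepts:
            if CONCEPT_COMPLEXITY.get(c, float("inf")) > bound:
                return False, f"step {step!r} optimizes through {c!r}"
    return True, "ok"

plan = [
    ("paint the car", ["recognize pink car"]),
    ("calm the guard", ["subliminal influence on human psyche"]),
]
print(plan_is_allowed(plan))  # (False, "step 'calm the guard' optimizes through ...")
```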
But I don’t get if, or why, you think that adds up to anything like the above.
Clarification: only the interpretability section deals with inner alignment. The claims of the previous sections are not supposed to follow from the interpretability section.
Anyway, is the following basically what you’re proposing?
Yes. The special language is supposed to have the property that Ak can automatically learn if Ak+1 plans good, bad, or unnecessary actions. An can’t be arbitrarily smarter than humans, but it’s a general intelligence which doesn’t imitate humans and can know stuff humans don’t know.
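[A sketch of the chain structure being discussed, under the strong and unproven assumption that the "Ak can automatically classify Ak+1's plans" step exists; every name and verdict below is a placeholder, not the proposal itself.]

```python
from typing import Callable, List

Model = Callable[[str], str]   # task -> proposed plan (placeholder type)

def oversee(overseer: Model, successor_plan: str) -> str:
    """Assumed primitive of the scheme: a simpler model built from the same
    primitives classifies its successor's plan. The verdict here is a stub
    (the overseer argument is unused in this toy version)."""
    return "bad" if "weird optimization" in successor_plan else "good"

def chain_approves(models: List[Model], task: str) -> bool:
    """models[0] (A0) is simple enough for humans to check directly; then each
    A_k oversees A_{k+1}, so trust is meant to propagate up to A_n."""
    for k in range(len(models) - 1):
        successor_plan = models[k + 1](task)
        if oversee(models[k], successor_plan) != "good":
            return False
    return True

a0 = lambda t: f"{t} using only familiar concepts"
a1 = lambda t: f"{t} via a cleverer route humans would not find"
a2 = lambda t: f"{t} via weird optimization of the guard's psyche"
print(chain_approves([a0, a1], "protect the diamond"))       # True
print(chain_approves([a0, a1, a2], "protect the diamond"))   # False: A1 rejects A2
```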
So to my mind, this scheme is at significant risk of playing a shell game with “how the AIs collectively use novel structures but in a way that is answerable to us / our values”. You’re saying that the simple AI can tell if the more complex AI’s plans are good, bad, or unnecessary—but also the latter “can know stuff humans don’t know”. How?
In other words, I’m saying that making it so that
the AI generates concepts in a special language
but also the AI is actually useful at all, is almost just a restatement of the whole alignment problem.
First things first: defining a special language which creates a safe but useful AGI absolutely is, more or less, just a restatement of the problem. But the post doesn’t just restate the problem: it describes the core principle of the language (the comprehension/optimization metric) and makes arguments for why the language should be provably sufficient for solving a big part of alignment.
You’re saying that the simple AI can tell if the more complex AI’s plans are good, bad, or unnecessary—but also the latter “can know stuff humans don’t know”. How?
This section deduces the above from claims A and B. What part of the deduction do you disagree with or find confusing? Here’s how the deduction would apply to the task “protect a diamond from destruction”:
Ak cares about an ontologically fundamental diamond. Ak+1 models the world as clouds of atoms.
According to the principle, we can automatically find what object in Ak+1 corresponds to the “ontologically fundamental diamond”.
Therefore, we can know which Ak+1 plans would preserve the diamond. We can also know whether applying any weird optimization to the diamond is necessary for preserving it. Checking for necessity is probably hard and might require another novel insight. But “necessity” is a simple object-level property.
The automatic finding of the correspondence (step 2) between an important comprehensible concept and an important incomprehensible concept resolves the apparent contradiction.[1]
Now, without context, step 2 is just a restatement of the ontology identification problem. The first two sections of the post (mostly the first one) explain why the comprehension/optimization metric should solve it. I believe my solution is along the lines of the research avenues Eliezer outlined.
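[A toy rendering of how steps 1-3 above might cash out. The atom-cloud world model, the correspondence finder, and the preservation check are hypothetical placeholders; in particular, find_correspondence just hard-codes the mapping that the comprehension/optimization metric is supposed to produce.]

```python
from dataclasses import dataclass
from typing import Dict, Set

@dataclass
class AtomCloudWorld:
    """A_{k+1}'s ontology: just labeled clusters of atoms (toy stand-in)."""
    clusters: Dict[str, Set[int]]   # cluster_id -> set of atom ids

def find_correspondence(world: AtomCloudWorld, high_level_concept: str) -> str:
    # Stand-in for step 2: which low-level structure plays the role of the
    # "ontologically fundamental diamond". Here the mapping is simply given.
    return {"ontologically fundamental diamond": "cluster_diamond"}[high_level_concept]

def plan_preserves(world: AtomCloudWorld, removed_atoms: Set[int], cluster_id: str) -> bool:
    """Step 3: once we know which atoms 'are' the diamond, checking whether a
    plan preserves it is an ordinary object-level question."""
    return not (world.clusters[cluster_id] & removed_atoms)

world = AtomCloudWorld(clusters={"cluster_diamond": {1, 2, 3}, "cluster_table": {4, 5}})
diamond = find_correspondence(world, "ontologically fundamental diamond")
print(plan_preserves(world, removed_atoms={4}, cluster_id=diamond))      # True: only the table is touched
print(plan_preserves(world, removed_atoms={2, 4}, cluster_id=diamond))   # False: a diamond atom is removed
```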
If my principle is hard to agree with, please try to assume that it’s true and see if you can follow how it solves some alignment problems.
the current human values are enough to express corrigibility
Huh? Not sure I understand this. How is this the case?
(I may have to tap out, because busy. At some point we could have a call to chat—might be much easier to communicate in that context. I think we have several background disagreements, so that I don’t find it easy to interpret your statements.)