First things first: yes, defining a special language which yields a safe but useful AGI is, more or less, just a restatement of the problem. But the post doesn’t just restate the problem, it describes the core principle of the language (the comprehension/optimization metric) and argues that the language should be provably sufficient for solving a big part of alignment.
You’re saying that the simple AI can tell if the more complex AI’s plans are good, bad, or unnecessary—but also the latter “can know stuff humans don’t know”. How?
This section deduces the above from claims A and B. What part of the deduction do you disagree with or find confusing? Here’s how the deduction would apply to the task “protect a diamond from destruction”:
1. A_k cares about an ontologically fundamental diamond. A_{k+1} models the world as clouds of atoms.
2. According to the principle, we can automatically find which object in A_{k+1}'s model corresponds to the “ontologically fundamental diamond”.
3. Therefore, we can know which of A_{k+1}'s plans would preserve the diamond. We can also know whether applying any weird optimization to the diamond is necessary for preserving it (a toy sketch follows this list). Checking necessity is probably hard and might require another novel insight, but “necessity” is a simple object-level property.
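To make the shape of steps 2 and 3 concrete, here is a toy Python sketch. It is not the actual proposal: `find_correspondence` is a stand-in for whatever the comprehension/optimization metric would give us automatically, and representing A_{k+1}'s ontology as sets of atom indices is purely illustrative. The point is only that once step 2 hands A_k a handle on “the diamond” inside A_{k+1}'s model, the checks in step 3 become ordinary object-level computations.

```python
# Toy illustration only (hypothetical names throughout). A_k thinks in terms of a
# single "diamond" object; A_{k+1} thinks in terms of atom clouds.
from dataclasses import dataclass
from typing import FrozenSet, List

Atom = int  # stand-in: atoms identified by an index


@dataclass(frozen=True)
class Plan:
    name: str
    # Which atoms survive (and stay arranged as a diamond lattice) if the plan runs.
    surviving_atoms: FrozenSet[Atom]
    uses_weird_optimization: bool


def find_correspondence(concept: str, world_atoms: FrozenSet[Atom]) -> FrozenSet[Atom]:
    """Step 2 (hypothetical): map A_k's 'diamond' concept to A_{k+1}'s atoms.
    Here we simply pretend the metric told us which atoms realize the diamond."""
    assert concept == "ontologically fundamental diamond"
    return frozenset(a for a in world_atoms if a < 1000)  # toy answer


def preserves_diamond(plan: Plan, diamond_atoms: FrozenSet[Atom]) -> bool:
    """Step 3a: an object-level check A_k can run on A_{k+1}'s plan."""
    return diamond_atoms <= plan.surviving_atoms


def weird_optimization_is_necessary(plans: List[Plan], diamond_atoms: FrozenSet[Atom]) -> bool:
    """Step 3b: 'weird' optimization counts as necessary only if no plan
    without it also preserves the diamond."""
    return not any(
        preserves_diamond(p, diamond_atoms) and not p.uses_weird_optimization
        for p in plans
    )


# Usage: the simple agent vets the complex agent's plans without sharing its ontology.
world = frozenset(range(10_000))
diamond = find_correspondence("ontologically fundamental diamond", world)
plans = [
    Plan("do nothing", surviving_atoms=world - diamond, uses_weird_optimization=False),
    Plan("build a vault", surviving_atoms=world, uses_weird_optimization=False),
    Plan("encase diamond in a nanobot swarm", surviving_atoms=world, uses_weird_optimization=True),
]
print([p.name for p in plans if preserves_diamond(p, diamond)])  # vault and swarm qualify
print(weird_optimization_is_necessary(plans, diamond))           # False: the mundane vault suffices
```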
Automatically finding the correspondence (step 2) between an important comprehensible concept and an important incomprehensible one is what resolves the apparent contradiction.[1]
Now, without context, step 2 is just a restatement of the ontology identification problem. The first two sections of the post (mostly the first one) explain why the comprehension/optimization metric should solve it. I believe my solution is along the lines of the research avenues Eliezer outlined.
If my principle is hard to agree with, please try to assume that it’s true and see if you can follow how it solves some alignment problems.