[call turns out to be maybe logistically inconvenient]
It’s OK if a person’s mental state changes because they notice a pink car (“human object recognition” is an easier-to-optimize/comprehend process). It’s not OK if a person’s mental state changes because the pink car has weird subliminal effects on the human psyche (“weird subliminal effects on the human psyche” is a harder-to-optimize/comprehend process).
It’s extra difficult if you’re not able to use the concepts you’re trying to disallow, in order to disallow them—and it sounds like that’s what you’re trying to do (you’re trying to “automatically” disallow them, presumably without the use of an AI that does understand them).
You say this:
But I don’t get if, or why, you think that adds up to anything like the above.
Anyway, is the following basically what you’re proposing?
Humans can check goodness of A0 because A0 is only able to think using stuff that humans are quite familiar with. Then Ak is able to oversee Ak+1 because… (I don’t get why; something about mapping primitives, and deception not being possible for some reason?) Then AN is really smart and understands stuff that humans don’t understand, but is overseen by a chain that ends in a good AI, A0.
So, somehow you’re able to know when an AI is exerting optimization power in “a way that flows through” some specific concepts? I think this is pretty difficult; see the fraughtness of inexplicitness or, more narrowly, the conceptual Doppelgänger problem.
Yes, we’re able to tell if the AI optimizes through a specific class of concepts. In most or all sections of the post I’m assuming the AI generates concepts in a special language (i.e. it’s not just a trained neural network), a language which makes it possible to measure the complexity of concepts. The claim is that if you’re optimizing through concepts of a certain complexity, then you can’t fulfill a task in a “weird” way. If the claim is true and the AI doesn’t think in arbitrary languages, then it’s supposed to be impossible to create a harmful Doppelgänger.
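As a deliberately minimal sketch of the kind of check this implies (nothing here is from the post’s formalism; concepts as nested tuples and node-counting are stand-ins for the special language and its complexity measure):

```python
# Toy sketch (assumptions, not the post's actual language): concepts are nested
# tuples in a made-up "special language"; complexity is just the node count,
# a placeholder for the real comprehension/optimization metric.

def complexity(concept):
    """Count nodes in a concept expression, e.g. ("protect", ("diamond",))."""
    if not isinstance(concept, tuple):
        return 1
    return 1 + sum(complexity(part) for part in concept)

def plan_is_allowed(plan_concepts, max_complexity):
    """Allow a plan only if every concept it optimizes through is simple enough."""
    return all(complexity(c) <= max_complexity for c in plan_concepts)

# A plan that only routes optimization through familiar concepts is allowed...
simple_plan = [("notice", "pink_car"), ("move", "diamond", "vault")]
# ...while one that routes through a deeply nested, hard-to-comprehend concept is not.
weird_plan = [("exploit", ("subliminal", ("psyche", ("layer", ("layer", "layer")))))]

print(plan_is_allowed(simple_plan, max_complexity=4))  # True
print(plan_is_allowed(weird_plan, max_complexity=4))   # False
```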
But I don’t get if, or why, you think that adds up to anything like the above.
Clarification: only the interpretability section deals with inner alignment. The claims of the previous sections are not supposed to follow from the interpretability section.
Anyway, is the following basically what you’re proposing?
Yes. The special language is supposed to have the property that Ak can automatically learn if Ak+1 plans good, bad, or unnecessary actions. AN can’t be arbitrarily smarter than humans, but it’s a general intelligence which doesn’t imitate humans and can know stuff humans don’t know.
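A minimal sketch of the intended chain structure, assuming the property above holds (the Overseer class and its judge_plan placeholder are illustrative assumptions, not anything specified in the post):

```python
# Toy sketch of the oversight chain: each A_k is represented by an Overseer
# whose judge_plan function stands in for "A_k can automatically learn whether
# A_{k+1}'s plan is good, bad, or unnecessary". Humans only ever vet A_0.

from typing import Callable, List

Verdict = str  # "good" | "bad" | "unnecessary"

class Overseer:
    def __init__(self, level: int, judge_plan: Callable[[str], Verdict]):
        self.level = level
        self.judge_plan = judge_plan  # placeholder for the automatic check

def run_chain(overseers: List[Overseer], plan_from_top: str) -> bool:
    """AN's plan is executed only if every overseer in the chain approves it."""
    for overseer in overseers:
        if overseer.judge_plan(plan_from_top) != "good":
            return False  # rejected somewhere down the chain
    return True

# Placeholder judges: approve any plan that keeps the diamond intact.
judges = [Overseer(k, lambda plan: "good" if "diamond intact" in plan else "bad")
          for k in range(4)]  # A0 ... A3

print(run_chain(judges, "rearrange atoms, keep diamond intact"))   # True
print(run_chain(judges, "replace diamond with a convincing fake"))  # False
```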
So to my mind, this scheme is at significant risk of playing a shell game with “how the AIs collectively use novel structures but in a way that is answerable to us / our values”. You’re saying that the simple AI can tell if the more complex AI’s plans are good, bad, or unnecessary—but also the latter “can know stuff humans don’t know”. How?
In other words, I’m saying that making it so that
the AI generates concepts in a special language
but also the AI is actually useful at all, is almost just a restatement of the whole alignment problem.
First things first: defining a special language which creates a safe but useful AGI absolutely is, more or less, just a restatement of the problem. But the post doesn’t just restate the problem; it describes the core principle of the language (the comprehension/optimization metric) and makes arguments for why the language should be provably sufficient for solving a big part of alignment.
You’re saying that the simple AI can tell if the more complex AI’s plans are good, bad, or unnecessary—but also the latter “can know stuff humans don’t know”. How?
This section deduces the above from claims A and B. Which part of the deduction do you disagree with or find confusing? Here’s how the deduction would apply to the task “protect a diamond from destruction”:
1. Ak cares about an ontologically fundamental diamond. Ak+1 models the world as clouds of atoms.
2. According to the principle, we can automatically find what object in Ak+1 corresponds to the “ontologically fundamental diamond”.
3. Therefore, we can know which Ak+1 plans would preserve the diamond. We can also know whether applying any weird optimization to the diamond is necessary for preserving it. Checking for necessity is probably hard and might require another novel insight, but “necessity” is a simple object-level property.
Automatically finding the correspondence (step 2) between an important comprehensible concept and an important incomprehensible concept is what resolves the apparent contradiction.[1]
Now, without context, step 2 is just a restatement of the ontology identification problem. The first two sections of the post (mostly the first one) explain why the comprehension/optimization metric should solve it. I believe my solution is along the lines of the research avenues Eliezer outlined.
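To make steps 2 and 3 concrete, here is a toy sketch under heavy simplifying assumptions (the feature dictionaries, identify, and plan_preserves are all made up for illustration; the real ontology identification problem is of course not solved by nearest-neighbour matching on hand-picked features):

```python
# Toy sketch of steps 2 and 3: A_k's "diamond" concept is summarized by a few
# comprehensible features, and we look for the object in A_{k+1}'s atom-cloud
# world model that best matches them. All names and numbers are invented.

AK_DIAMOND = {"hardness": 10, "carbon": 1.0, "transparent": 1.0}

ATOM_CLOUDS = {
    "cloud_17": {"hardness": 10, "carbon": 0.99, "transparent": 0.95},
    "cloud_42": {"hardness": 2,  "carbon": 0.10, "transparent": 0.00},
}

def mismatch(features_a, features_b):
    return sum(abs(features_a[f] - features_b[f]) for f in features_a)

def identify(concept, world_model):
    """Step 2: find which object in the richer ontology best matches the simple concept."""
    return min(world_model, key=lambda name: mismatch(concept, world_model[name]))

diamond_object = identify(AK_DIAMOND, ATOM_CLOUDS)  # -> "cloud_17"

def plan_preserves(plan_effects, protected_object):
    """Step 3: accept a plan only if it leaves the identified object intact."""
    return plan_effects.get(protected_object, "intact") == "intact"

print(diamond_object)
print(plan_preserves({"cloud_42": "vaporized"}, diamond_object))                # True
print(plan_preserves({"cloud_17": "disassembled into atoms"}, diamond_object))  # False
```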
If my principle is hard to agree with, please try to assume that it’s true and see if you can follow how it solves some alignment problems.