Thanks, this was pretty interesting.

A big problem is the free choice of “conceptual language” (universal Turing machine) when defining simplicity/comprehensibility. You at various points rely on an assumption that there is one unique scale of complexity (one ladder of V1...Vk), and it’ll be shared between the humans and the AI. That’s not necessarily true, which creates a lot of leaks where an AI might do something that’s simple in the AI’s internal representation but complicated in the human’s.
It’s OK to make cars pink by using paint (“spots of paint” is an easier to optimize/comprehend variable). It’s not OK to make cars pink by manipulating individual water droplets in the air to create an elaborate rainbow-like illusion (“individual water droplets” is a harder to optimize/comprehend variable).
This raises a second problem, which is the “easy to optimize” criterion, and how it might depend on the environment and on what tech tree unlocks (both physical and conceptual) the agent already has. Pink paint is pretty sophisticated, even though our current society has commodified it so we can take getting some for granted. Starting from no tech tree unlocks at all, you can probably get to hacking humans before you can recreate the Sherwin Williams supply chain. But if we let environmental availability weigh on “easy to optimize,” then the agent will be happy to switch from real paint to a hologram or a human-hack once the technology for those becomes developed and commodified.
When the metric is a bit fuzzy and informal, it’s easy to reach convenient/hopeful conclusions about how the human-intended behavior is easy to optimize, but it should be hard to trust those conclusions.
> You at various points rely on an assumption that there is one unique scale of complexity (one ladder of V1...Vk), and it’ll be shared between the humans and the AI. That’s not necessarily true, which creates a lot of leaks where an AI might do something that’s simple in the AI’s internal representation but complicated in the human’s.
I think there are many somewhat different scales of complexity, but they’re all shared between the humans and the AI, so we can choose any of them. We start with properties (X) which are definitely easy for humans to understand. Then we gradually relax those properties. According to the principle, the X properties will capture all the key variables relevant to human values much earlier than top human mathematicians and physicists will stop understanding what those properties might describe. (Because most of the time, living a value-filled life doesn’t require using the best mathematical and physical knowledge of the day.) My model: “the entirety of human ontology >>> the part of human ontology a corrigible AI needs to share”.
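To make the “gradual relaxation” idea slightly more concrete, here is a toy sketch of the kind of procedure I have in mind. It is purely illustrative: the property ladder, the variables, and the coverage numbers are hypothetical stand-ins, not a proposed implementation.

```python
# Toy illustration of "start with easy-to-understand properties, relax only as
# far as needed". The ladder, the variables, and the coverage table below are
# hypothetical stand-ins for the real, much harder-to-define objects.

PROPERTY_LADDER = [
    "everyday physical properties (color, position, shape)",
    "everyday causal/social properties (who did what, who agreed to what)",
    "psychological properties (beliefs, moods, preferences)",
    "frontier mathematical/physical descriptions",  # far more than we should need
]

# Minimal rung of the ladder needed to express each value-relevant variable.
VALUE_RELEVANT_VARIABLES = {
    "car color": 0,
    "owner's consent": 1,
    "owner's mood": 2,
}

def minimal_shared_ontology(ladder, variables):
    """Relax the property class only until every key variable is expressible."""
    for rung in range(len(ladder)):
        if all(needed <= rung for needed in variables.values()):
            return ladder[: rung + 1]
    raise ValueError("no rung of the ladder captures all key variables")

print(minimal_shared_ontology(PROPERTY_LADDER, VALUE_RELEVANT_VARIABLES))
# Stops at the third rung, well below the frontier-science rung.
```

The claim is that this loop terminates well before the last rung, i.e. long before the properties stop being humanly comprehensible.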
> This raises a second problem, which is the “easy to optimize” criterion, and how it might depend on the environment and on what tech tree unlocks (both physical and conceptual) the agent already has. Pink paint is pretty sophisticated, even though our current society has commodified it so we can take getting some for granted. Starting from no tech tree unlocks at all, you can probably get to hacking humans before you can recreate the Sherwin Williams supply chain.
There are three important possibilities relevant to your hypothetical (a toy sketch of the resulting decision rule follows this list):

1. If technology T and human hacking are equally hard to comprehend, then (a) we don’t want the AI to build technology T, or (b) the AI should be able to screen off technology T from humans more or less perfectly. For example, maybe producing paint requires complex manipulations of matter, but those manipulations should be screened off from humans. The last paragraph in this section mentions a similar situation.
2. Technology T is easier to comprehend than human hacking, but it’s more expensive (requires more resources). Then we should be able to allow the AI to use those resources if we want to. We should be controlling how many resources the AI is using anyway, so I’m not introducing any unnatural epicycles here.[1]
3. If humans themselves built technology T which affects them in a complicated way (e.g. drugs), that doesn’t mean the AI should build similar types of technology on its own.
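Here is the toy sketch promised above, with made-up numbers; `comprehension_cost`, `RESOURCE_BUDGET`, and the screening flag are hypothetical stand-ins I’m using only to illustrate the case analysis, not a concrete proposal.

```python
# Toy decision rule for cases 1 and 2 above; case 3 is a constraint on which
# plan types get proposed at all, so it isn't a numeric check here.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    comprehension_cost: float  # how hard the plan's key variables are to comprehend
    resource_cost: float       # how many resources the plan consumes
    screened_off: bool         # are the hard-to-comprehend parts hidden from humans?

HUMAN_HACKING_COST = 100.0  # hypothetical comprehension cost of "hacking humans"
RESOURCE_BUDGET = 50.0      # granted explicitly by the humans, not chosen by the AI

def allowed(plan: Plan) -> bool:
    if plan.comprehension_cost >= HUMAN_HACKING_COST:
        # Case 1: as hard to comprehend as human hacking -> only acceptable if
        # the complicated machinery is screened off from humans entirely.
        return plan.screened_off
    # Case 2: easier to comprehend than human hacking -> acceptable as long as
    # it stays within the resource budget the humans already control.
    return plan.resource_cost <= RESOURCE_BUDGET

print(allowed(Plan("pink paint via ordinary supply chain", 10.0, 30.0, False)))    # True
print(allowed(Plan("pink illusion via droplet manipulation", 120.0, 5.0, False)))  # False
```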
My point here is that I don’t think technology undermines the usefulness of my metric. And I don’t think that’s a coincidence. According to the principle, one or both of the below should be true:
1. Up to this point in time, technology never affected what’s easy to optimize/comprehend on a deep enough level.
2. Up to this point in time, humans never used technology to optimize/comprehend (on a deep enough level) most of their fundamental values.
If neither were true, we would believe that technology radically changed fundamental human values at some point in the past. We would see life without technology as devoid of most non-trivial human values.
> When the metric is a bit fuzzy and informal, it’s easy to reach convenient/hopeful conclusions about how the human-intended behavior is easy to optimize, but it should be hard to trust those conclusions.
The selling point of my idea is that it comes with a story for why it’s logically impossible for it to fail, or why all of its flaws should be easy to predict and fix. Is it easy to come up with such a story for other ideas? I agree that it’s too early to buy that story. But I think it’s original and probable enough to deserve attention.
[1] Remember that I’m talking about a Task-directed AGI, not a Sovereign AGI.