To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about.
You’re conflating can and should! I agree that it would be ideal if this were the case, but am skeptical it is. That’s what I meant when I said I think A is false.
If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn’t it be strange if those simplicity biases were entirely unrelated to the simplicity biases of human cognition?
That’s a very big “if”! And simplicity priors are made questionable, if not refuted, by the fact that we haven’t converged on any account of human values despite millennia of philosophy trying to build one.
You define “values” as ~”the decisions humans would converge to after becoming arbitrarily more knowledgeable”.
No, I think it’s what humans actually pursue today when given the options. I’m not convinced that these values are static, or coherent, much less that we would in fact converge.
You say that values depend on inscrutable brain machinery. But can’t we treat the machinery as a part of “human ability to comprehend”?
No, because we don’t comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that. (Then we apply pretty-sounding but ultimately post-hoc reasoning to explain it—as I tweeted partly thinking about this conversation.)
Thanks for clarifying! Even if I still don’t fully understand your position, I now see where you’re coming from.
No, I think it’s what humans actually pursue today when given the options. I’m not convinced that these values are static, or coherent, much less that we would in fact converge.
Then those values/motivations should be limited by the complexity of human cognition, since they’re produced by it. Isn’t that trivially true? I agree that values can be incoherent, fluid, and not converge to anything. But building Task AGI doesn’t require building an AGI which learns coherent human values. It “merely” requires an AGI which doesn’t affect human values in large and unintended ways.
No, because we don’t comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that.
This feels like arguing over definitions. If you have an oracle for solving certain problems, this oracle can be defined as a part of your problem-solving ability, even if it’s less transparent than your other problem-solving abilities. Similarly, the machinery which calculates a complicated function from sensory inputs to judgements (e.g. from the Mona Lisa to “this is beautiful”) can be defined as a part of our comprehension ability. Yes, humans don’t know (1) the internals of the machinery or (2) some properties of the function it calculates. But I don’t think you’ve given an example of how human values depend on knowledge of (1) or (2). You gave an example of how human values depend on the maxima of the function (e.g. the desire to find the most delicious food), but the function having maxima is not an unknown property; it’s a trivial one (some foods are worse than others, therefore some foods have the best taste).
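To make the oracle analogy concrete, here is a toy sketch (purely illustrative; the function name `taste_judgement` and the hidden scores are made-up stand-ins, not anything from this conversation): a black-box judgement function can be evaluated locally, and its maximum over a set of options can be found, without any knowledge of its internals.

```python
# Toy illustration: a black-box "judgement" function can be used, and its
# maximum over some options can be located, without knowing its internals.
# (Hypothetical names and numbers, purely for illustration.)

def taste_judgement(food: str) -> float:
    """Stand-in for the opaque machinery mapping sensory inputs to judgements."""
    hidden_scores = {"soup": 0.4, "bread": 0.6, "cake": 0.9}
    return hidden_scores.get(food, 0.0)

options = ["soup", "bread", "cake"]

# Local evaluation: we can act on individual judgements...
for food in options:
    print(food, taste_judgement(food))

# ...and the function trivially has a maximum over these options, which we
# can find without ever inspecting how the judgement is computed.
print("best:", max(options, key=taste_judgement))
```

The only point of the sketch is that “the function has maxima” follows from comparing outputs; it doesn’t require knowledge of (1) or (2).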
That’s a very big “if”! And simplicity priors are made questionable, if not refuted, by the fact that we haven’t converged on any account of human values despite millennia of philosophy trying to build one.
I agree that ambitious value learning is a big “if”. But Task AGI doesn’t require it.