More broadly, we might disagree on how scalable certain approaches used in humans are, or how surprising it is that humans solve certain problems in practice.
I want to understand the generators of human alignment properties, so as to learn about the alignment problem and how it “works” in practice, and then use that knowledge of alignment-generators in the AI case. I’m not trying to make an “amplified” human.
when we’re trying to solve the problem for alignment, we’re trying to come up with an airtight, robust solution, and
I personally am unsure whether this is even a useful frame, or an artifact of conditioning on our own confusion about how alignment works.
humans implement the kludgiest, most naive solution that works often enough
How do you know that?
I think of this as one subproblem of embeddedness that might turn out to be difficult, falling somewhere between 3rd and 10th place on my list of most urgent alignment problems to fix.
“Get the agent to care about some parts of reality” is not high on my list of problems, because I don’t think it’s a problem; I think it’s the default outcome for the agents we will train. (I don’t have a stable list right now because my list of alignment subproblems is rapidly refactoring as I understand the problem better.)
“Get the agent to care about specific things in the real world” seems important to me, because it challenges our ability to map outer supervision signals into internal cognitive structures within the agent. It also seems relatively easy to explain, and I have a good story for why people (and general RL agents) will “bind their values to certain parts of reality” (in a sense which I will explain later).
this is just a preliminary concretization that lets us think about the problem, and substituting this is fine because this isn’t core to the phenomenon we’re poking at
Disagreeing with the second phrase is one major point of this essay. How do we know that substituting this is fine? By what evidence do we know that the problem is even compactly solvable in the AIXI framing?
My model of you is saying “ah, but it is core, because humans don’t fit into this framework and they solve the problem, so by restricting yourself to this rigid framework you exclude the one case where it is known to be solved.”
(Thanks for querying your model of me, btw! Pretty nice model, that indeed sounds like something I would say. :) )
This easy solution generalizing to AGI would exactly correspond to scanning AIXI-tl’s Turing machines for diamond concepts just working without anything special.
I don’t think you think humans care about diamonds because the genome specifies brain-scanning circuitry which rewards diamond-thoughts. Or am I wrong? So, humans caring about diamonds actually wouldn’t correspond to the AIXI case? (I’m also confused about whether and why you think this is how it gets solved for other human motivations...?)
Why not? I mean, ethics aside, wouldn’t it be easier to use amplified humans for alignment research if a high-level understanding of human cognition is possible?