The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans. In a sense, this means having aligned goals without having the same goals: your goal is to cooperate with “human goals”, but you don’t yet have a full description of what human goals are. Your value function might be much simpler than the human value function.
In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.
Insofar as I understand your point, I disagree. In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI’s learned ontology (whether or not it has a concept for people). If you commit to the specific view of outer/inner alignment, then now you also want your loss function to “represent” that goal in some way.
humans seem to correctly identify what each other want and believe, quite frequently. Therefore, humans must have prior knowledge which helps in this task. If we can encode those prior assumptions in an AI, we can point it in the right direction.
I doubt this due to learning from scratch. I think the question of “how do I identify what you want, in terms of a utility function?” is a bit sideways due to people not in fact having utility functions.[1] Insofar as the question makes sense, its answer probably takes the form of inductive biases: I might learn to predict the world via self-supervised learning and form concepts around other people having values and emotional states due to that being a simple convergent abstraction relatively pinned down by my training process, architecture, and data over my life, also reusing my self-modelling abstractions. It would be quite unnatural to model myself in one way (as valuing happiness) and others as having “irrational” shards which “value” anti-happiness but still end up behaving as if they value happiness. (That’s not a sensible thing to say, on my ontology.)
This presents a difficulty if another agent wishes to help such an agent, but does not share its ontological commitments.
I think it’s worth considering how I might go about helping a person from an uncontacted tribe who doesn’t share my ontology. Conditional on them requesting help from me somehow, and my wanting to help them, and my deciding to do so—how would I carry out that process, internally?
(Not reading the rest at the moment, may leave more comments later)
[1] On my view: Human values take the form of decision-influences (i.e. shards) which increase or decrease the probability of mental events and decisions (buying ice cream, thinking for another minute). There is no such thing as an anti-ice-cream shard which is perfectly anti-rational in that it bids “against its interests”, bidding for ice cream and against avoiding ice cream. That’s just an ice cream shard. Goals and rationality are not entirely separate, in people.
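A toy sketch may make the footnote's point concrete. Everything here is an illustrative assumption rather than a claim about how shards are implemented: the names `decision_probabilities`, `baseline`, and `bids_for_ice_cream` are hypothetical, and summing bids into a softmax is just one simple way to model "increasing or decreasing the probability of a decision".

```python
import math


def decision_probabilities(shards):
    """Turn per-shard bids into a probability for each decision.

    `shards` maps a shard name to {decision: bid}. Bids are summed per
    decision and passed through a softmax (an assumed aggregation rule).
    """
    totals = {}
    for bids in shards.values():
        for decision, bid in bids.items():
            totals[decision] = totals.get(decision, 0.0) + bid
    top = max(totals.values())  # subtract the max for numerical stability
    weights = {d: math.exp(t - top) for d, t in totals.items()}
    norm = sum(weights.values())
    return {d: w / norm for d, w in weights.items()}


# Hypothetical numbers: a neutral baseline plus one shard that bids
# for buying ice cream and against walking past.
baseline = {"buy_ice_cream": 0.0, "walk_past": 0.0}
bids_for_ice_cream = {"buy_ice_cream": 1.5, "walk_past": -0.5}

no_shard = decision_probabilities({"baseline": baseline})
labelled_pro = decision_probabilities(
    {"baseline": baseline, "ice_cream_shard": bids_for_ice_cream}
)
labelled_anti = decision_probabilities(
    {"baseline": baseline, "anti_ice_cream_shard": bids_for_ice_cream}
)

# The label never enters the computation; only the bids do.
print(no_shard["buy_ice_cream"], labelled_pro["buy_ice_cream"])
print(labelled_pro == labelled_anti)
```

In this toy model, a shard called "anti-ice-cream" that bids for ice cream is indistinguishable from an ice cream shard, because the name plays no role in what gets decided.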
I said:
The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans. [...] In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.
You said:
In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI’s learned ontology (whether or not it has a concept for people).
Thinking about this now, I think maybe it’s a question of precautions, and of what order you want to teach things in. This is very similar to the argument that you might want to make a system corrigible first, before ensuring that it has other good properties: if you make a mistake later, a corrigible system will let you correct it.
Similarly, it seems like a sensible early goal could be ‘get the system to understand that the sort of thing it is trying to do, in (value) learning, is to pick up human values’. Because once it has understood this point correctly, it is harder for things to go wrong later on, and the system may even be able to do much of the heavy lifting for you.
Really, what makes me go to the meta-level like this is pessimism about the more direct approach. Directly trying to instill human values, rather than first training in a meta-level understanding of that task, doesn’t seem like a very correctible approach. (I think much of this pessimism comes from mentally visualizing humans arguing about what object-level values to try to teach an AI. Even if the humans are able to agree, I do not feel especially optimistic about their choices, even if they’re supposedly informed by neuroscience and not just moral philosophers.)
Really, what makes me go to the meta-level like this is pessimism about the more direct approach. Directly trying to instill human values, rather than first training in a meta-level understanding of that task, doesn’t seem like a very correctible approach.
True, but I’m also uncertain about the relative difficulty of relatively novel and exotic value-spreads like “I value doing the right thing by humans, where I’m uncertain about the referent of humans”, compared to “People should have lots of resources and be able to spend them freely and wisely in pursuit of their own purposes” (the latter being values that at least I do in fact have).
I think it is reasonable as engineering practice to try and make a fully classically-Bayesian model of what we think we know about the necessary inductive biases—or, perhaps more realistically, a model which only violates classic Bayesian definitions where necessary in order to represent what we want to represent.
This is because writing down the desired inductive biases as an explicit prior can help us to understand what’s going on better.
It’s tempting to say that to understand how the brain learns is to understand how it treats feedback as evidence and updates on that evidence. Of course, there could certainly be other theoretical frames which are more productive. But at a deep level, if the learning works, it works because the feedback is evidence about the thing we want to learn, and the process which updates on that feedback embodies (something like) a good prior telling us how to update on that evidence.
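As a minimal sketch of the "feedback as evidence" framing, here is a discrete Bayesian update. The hypotheses, feedback labels, and all numbers are made up for illustration, and `bayes_update` is a hypothetical helper, not a model of how a brain or any particular learning system actually updates.

```python
def bayes_update(prior, likelihood, observation):
    """Return the posterior over hypotheses after one piece of feedback.

    prior:      {hypothesis: P(hypothesis)}
    likelihood: {hypothesis: {observation: P(observation | hypothesis)}}
    """
    unnormalized = {h: prior[h] * likelihood[h][observation] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the prior")
    return {h: p / total for h, p in unnormalized.items()}


# Assumed toy setup: two hypotheses about what a person values, and
# feedback that is more likely under one of them.
likelihood = {
    "values_ice_cream": {"approve": 0.9, "disapprove": 0.1},
    "indifferent": {"approve": 0.2, "disapprove": 0.8},
}

even_prior = {"values_ice_cream": 0.5, "indifferent": 0.5}
skeptical_prior = {"values_ice_cream": 0.1, "indifferent": 0.9}

# Same feedback, different priors, different conclusions.
posterior_even = bayes_update(even_prior, likelihood, "approve")
posterior_skeptical = bayes_update(skeptical_prior, likelihood, "approve")

print(posterior_even["values_ice_cream"])
print(posterior_skeptical["values_ice_cream"])
```

The same approving feedback moves the two learners to different places, which is the sense in which the prior (or, more loosely, the inductive bias) determines how feedback gets used.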
And if that framing is wrong somehow, it seems intuitive to me that the problem should be describable within that ontology. Compare how I think “utility function” is not a very good way to think about values, because what is it a function of? We don’t have a commitment to a specific low-level description of the universe which is appropriate for the input to a utility function. We can easily move beyond this by considering expected values as the “values/preferences” representation, without worrying about what underlying utility function generates those expected values.
(I do not take the above to be a knockdown argument against “committing to the specific division between outer and inner alignment steers you wrong”—I’m just saying things that seem true to me and plausibly relevant to the debate.)
I expect you’ll say I’m missing something, but to me, this sounds like a language dispute. My understanding of your recent thinking holds that the important goal is to understand how human learning reliably results in human values. The Bayesian perspective on this is “figuring out the human prior”, because a prior is just a way-to-learn. You might object to the overly Bayesian framing of that, but I’m fine with that. I am not dogmatic about orthodox Bayesianism. I do not even like utility functions.
Insofar as the question makes sense, its answer probably takes the form of inductive biases: I might learn to predict the world via self-supervised learning and form concepts around other people having values and emotional states due to that being a simple convergent abstraction relatively pinned down by my training process, architecture, and data over my life, also reusing my self-modelling abstractions.
I am totally fine with saying “inductive biases” instead of “prior”; I think it indeed pins down what I meant in a more accurate way (by virtue of, in itself, being a more vague and imprecise concept than “prior”).
I agree, this does seem like it was a language dispute. I no longer perceive us as disagreeing on this point.