I agree with some of this, but I’d say Story 1 applies only very weakly, and that the large majority of value learning happens online, for example via the self-learning/within-lifetime-RL algorithms you describe, without relying on the prior. In essence, I agree with the claim that the genes need to impose a prior, which prevents pure blank-slatism from working. I disagree with the claim that this means genetics needs to impose a very strong prior rather than relying on the self-learning algorithms you describe for capabilities.
You keep talking about “prior” but not mentioning “reward function”. I’m not sure why. For human children, do you think that there isn’t a reward function? Or there is a reward function but it’s not important? Or do you take the word “prior” to include reward function as a special case?
If it’s the latter, then I dispute that this is an appropriate use of the word “prior”. For example, you can train AlphaZero to be superhumanly skilled at winning at Go, or if you flip the reward function then you’ll train AlphaZero to be superhumanly skilled at losing at Go. The behavior is wildly different, but is the “prior” different? I would say no. It’s the same neural net architecture, with the same initialization and same regularization. After 0 bits of training data, the behavior is identical in each case. So we should say it’s the same “prior”, right?
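The “same prior, flipped reward” point can be sketched with a toy tabular learner (not AlphaZero itself; the five states and the reward `s - 2` are made up purely for illustration):

```python
import random

# Two learners share the exact same "prior": same architecture (here a
# value table), same initialization (same seed). Only the reward differs.

def make_init_values(seed, n_states=5):
    rng = random.Random(seed)
    return [rng.uniform(-0.01, 0.01) for _ in range(n_states)]

def train(values, reward_sign, steps=1000, lr=0.1, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        s = rng.randrange(len(values))
        r = reward_sign * (s - 2)  # hypothetical reward: prefer high states
        values[s] += lr * (r - values[s])
    return values

# After 0 bits of training data, the two learners are literally identical:
assert make_init_values(42) == make_init_values(42)

# After training, their learned "preferences" are wildly different:
win = train(make_init_values(42), reward_sign=+1)   # roughly [-2, -1, 0, 1, 2]
lose = train(make_init_values(42), reward_sign=-1)  # roughly [2, 1, 0, -1, -2]
```

Same prior, opposite behavior: all the divergence comes from the reward signal, none from the initialization.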
(As I mentioned in the OP, on my models, there is a human innate reward function, and it’s absolutely critical to human prosocial behavior, and unfortunately nobody knows what that reward function is.)
So what I’m trying to get at here is essentially the question “how much can we offload the complexity of values to the learning system?”, rather than, say, directly specifying it via the genome. In essence, I’m focused on the a priori complexity of human values and of the human innate reward function, since this variable is often a key disagreement between optimists and pessimists about controlling AI. In particular, it matters for how likely deceptive alignment is to occur relative to actual alignment, which is a major and popular threat model.
Re the reward function, the prior discussion also partly applies here: if it is learnable, or otherwise simple to hardcode, then other functions will probably work just as well without relying on the human reward function. And if it’s outright learnable by AI, then it’s almost certainly going to be learned before anything else (conditional on the reward function being simple), and in particular before the deceptively aligned algorithm if the reward function is simpler. If it’s not simpler, it’s only slightly more complex, so very little data is needed to distinguish between the two algorithms, which is how I view the situation with the human innate reward function.
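The “only slightly more complex, so little data distinguishes them” step can be made concrete with a toy simplicity-prior calculation (an assumption-laden sketch of the argument, not a claim about real training dynamics):

```python
# Under a 2^-K simplicity prior, a hypothesis that is d bits more complex
# starts with prior odds of 2^-d against it, and each bit of evidence that
# only it predicts doubles the odds back in its favor.

def posterior_odds(extra_complexity_bits, evidence_bits):
    prior_odds = 2.0 ** (-extra_complexity_bits)
    likelihood_ratio = 2.0 ** evidence_bits
    return prior_odds * likelihood_ratio

# If the aligned reward function is only 3 bits more complex than a
# deceptive alternative, 5 bits of distinguishing data already favor it:
print(posterior_odds(3, 5))  # 4.0, i.e. aligned is now favored 4:1
```

The point of the sketch: a small complexity gap translates into a small, fixed evidence requirement, not an insurmountable disadvantage.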
My crux is that this statement is probably false, conditional on the reward function being either very simple to hardcode (a few lines, say) or learnable by the self-learning/within-lifetime-RL/online learning algorithms you consider:
“The human innate reward function is absolutely critical to human prosocial behavior.”
Putting it another way, I deny that the innate reward function in humans is special or the main driver, because most of that reward function has to be learned, which could be replicated by brain-like AGI/model-based RL via online learning. Thus most of the complexity doesn’t matter, which also probably implies that most of the complex prosocial behavior is fundamentally replicable by a brain-like AGI/model-based RL agent without it having the human innate reward function.
The innate function obviously has some things hard-coded a priori, and there is some complexity in the reward function, but not nearly as much as a lot of people think, since IMO a lot of the reward function/human prosocial values are fundamentally learned and almost certainly replicable by a brain-like AGI paradigm, even one that didn’t use the exact innate reward function humans use.
Some other generalized updates I made were these (quoted from a Discord server I’m in; credit to TurnTrout for noticing this):
An update of “guess simple functions of sense data can entrain this complicated edifice of human value, along with cultural information” and the update of “alignment to human values is doable by a simple function so it’s probably doable by lots of other functions”,
as well as contextualized updates like “it was probably easy for evolution to find these circuits, which is evidence you don’t need that much precision in your reward specification to get roughly reasonable outputs”.
I find your text confusing. Let’s go step by step.
AlphaZero-chess has a very simple reward function: +1 for getting checkmate, −1 for opponent checkmate, 0 for draw
A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity
If you change the reward function (e.g. flip the sign of the reward), the resulting trained model preferences would be wildly different.
By analogy:
The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least)
If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.
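The first bullet’s point that the reward function is “very simple” can be written out literally (a sketch; the real AlphaZero implementation details differ):

```python
def chess_reward(game_result):
    # Reward from the agent's perspective at the end of the game;
    # every non-terminal move gets reward 0.
    return {"win": +1.0, "loss": -1.0, "draw": 0.0}[game_result]

# Flipping the sign of this tiny function is the entire difference between
# training a superhuman winner and a superhuman loser -- the millions of
# bits in the trained value function all come from self-play, not from here.
def flipped_reward(game_result):
    return -chess_reward(game_result)
```

The asymmetry between these two lines of code and the trained network’s judgments is the whole point of the analogy.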
Do you agree with all that?
If so, then there’s no getting around that getting the right innate reward function is extremely important, right?
So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)
Quoting your step-by-step argument:
AlphaZero-chess has a very simple reward function: +1 for getting checkmate, −1 for opponent checkmate, 0 for draw
A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity
If you change the reward function (e.g. flip the sign of the reward), the resulting trained model preferences would be wildly different.
By analogy:
The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least)
If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.
I agree with this statement, because the sign change directly inverts the reward, making the previously rewarded outcomes bad to aim for. But my view is that this case is probably unrepresentative, and that brains/brain-like AGI are much more robust than you think to changes in their value/reward functions (though not infinitely robust), precisely because of the very simple reward function you pointed out.
So I basically disagree that this example represents a major problem with NN/brain-like AGI robustness.
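The robustness claim can be illustrated with a toy multi-armed bandit (hypothetical rewards, not a model of real brains): small perturbations of a simple reward function leave the learned preference ordering intact, while a sign flip inverts it.

```python
import random

def best_arm(reward_fn, n_arms=5, pulls=2000, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    for _ in range(pulls):
        a = rng.randrange(n_arms)              # explore uniformly
        counts[a] += 1
        r = reward_fn(a) + rng.gauss(0, 0.1)   # noisy reward sample
        estimates[a] += (r - estimates[a]) / counts[a]
    return max(range(n_arms), key=lambda a: estimates[a])

base = lambda a: a / 4                              # prefer high arms
perturbed = lambda a: a / 4 + 0.02 * ((a * 7) % 3)  # small distortion
flipped = lambda a: -a / 4                          # sign flip

print(best_arm(base))       # 4
print(best_arm(perturbed))  # still 4: ordering survives the perturbation
print(best_arm(flipped))    # 0: only the sign flip inverts preferences
```

This is only the crude version of the claim: moderate changes to a simple reward function leave learned behavior largely intact, whereas the sign flip is the worst case rather than the typical one.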
To respond to this:
So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)
This doesn’t actually matter for my purposes: I only need the existence of simple reward functions, as you claimed, to conclude that deceptive alignment is unlikely to happen, and I’m leaving it to the people actually aligning AI, like Nora Belrose, to implement this ideal.
Essentially, I’m focusing on the implications of the existence of simple algorithms for values, and pointing out that various alignment challenges either go away or become far easier if we grant that there is a simple reward function for values, which is very much a contested position on LW.
So I think we basically agree that there is a simple reward function for values, but I think this implies some other big changes in alignment which drastically reduce the risk of AI catastrophe, mostly via ruling out deceptive alignment as an outcome; there are various other side benefits I haven’t enumerated because they would make this comment too long.