I agree that the core question is about how generalization occurs. My two stories involve different kinds of generalization, and I think there are also ways generalization could work that would lead to good behavior.
It is important to my intuition that not only can we never train for the “good” generalization, we can’t even evaluate techniques to figure out which ones generalize “well” (since both of the bad generalizations would lead to behavior that looks good over long horizons).
If there is a disagreement, it is probably that I put a much higher probability on the kind of generalization in story 1. I’m not sure whether there’s actually a big quantitative disagreement, though, rather than a communication problem.
I also think it’s quite likely that the story in my post is unrealistic in a bunch of ways and I’m currently thinking more about what I think would actually happen.
Some more detailed responses that feel more in-the-weeds:
you think long-horizon real-world data will play a significant role in training, because we’ll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won’t be able to find rewards that are given over long time horizons).
I might not understand this point. For example, suppose I’m training a 1-day predictor to make good predictions over 10 or 100 days. I expect such predictors to initially fail over long horizons, but to potentially be greatly improved with moderate amounts of fine-tuning. It seems to me that if this model has “robust motivations,” they would most likely be to predict accurately, but I’m not sure why the model would necessarily have robust motivations.
I feel similarly about goals like “plan to get high reward (defined as signals on channel X; you can learn how the channel works).” But even if prediction were a special case, once you have learned a model you can use it for planning/RL in simulation.
But it seems to me that by default, during early training periods the AI won’t have much information about the overseer’s knowledge (or even the overseer’s existence), and may not even have the concept of rewards, making alignment with instructions much more natural.
It feels to me like our models are already getting to the point where they respond to quirks of the labeling or evaluation process, and are basically able to build simple models of the oversight process.
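To make the 1-day-predictor example above a bit more concrete, here is a minimal illustrative sketch (my own toy code, with made-up model shapes and hyperparameters, not anything from the original discussion): roll a learned one-step model forward k times, fine-tune it against the k-step prediction error, and note that the same learned model is what you would reuse as a simulator for planning/RL.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: a one-step ("1-day") predictor over a fixed-size
# state vector. Nothing here is meant as a realistic architecture.
class OneStepPredictor(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, state):
        return self.net(state)

def rollout(model, state, horizon: int):
    """Apply the one-step model repeatedly to get a multi-step prediction."""
    preds = []
    for _ in range(horizon):
        state = model(state)
        preds.append(state)
    return torch.stack(preds)  # (horizon, batch, dim)

def finetune_long_horizon(model, trajectories, horizon: int = 10, steps: int = 100):
    """Fine-tune the short-horizon predictor on k-step prediction error.

    `trajectories` is a tensor of shape (batch, T, dim) of observed states,
    with T > horizon.
    """
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        start = trajectories[:, 0, :]
        target = trajectories[:, 1 : horizon + 1, :].transpose(0, 1)  # (horizon, batch, dim)
        preds = rollout(model, start, horizon)
        loss = ((preds - target) ** 2).mean()  # long-horizon objective is still "predict accurately"
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

The point of the sketch is just that the long-horizon objective here is still “predict accurately,” which is why it isn’t obvious to me where additional “robust motivations” would have to come from.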
my concern is that this underlying concept of “natural generalisation” is doing a lot of work, despite not having been explored in your original post
Definitely, I think it’s critical to what happens and not really explored in the post (which is mostly intended to provide some color for what failure might look like).
That said, a major part of my view is that it’s pretty likely that we get either arbitrary motivations or reward-maximization (or something in between), and it doesn’t matter much which, since they both seem bad and seem to be averted in the same way.
I think the really key question is how likely it is that we get some kind of “intended” generalization like friendliness. I’m frequently on the opposite side of this disagreement, arguing that the probability that people will get some nice generalization if they really try is at least 25% or 50%, but I’m also happy being on the pessimistic side and saying that the probability we can get nice generalizations is at most 50% or 75%.
(or anywhere else, to my knowledge)
“Two kinds of generalization” is an old post on this question (though I wish it had used more tasteful examples).
“Turning reflection up to 11” touches on the issue as well, though coming from a very different place than you.
I think there are a bunch of Arbital posts where Eliezer tries to articulate some of his opinions on this, but I don’t know pointers offhand. I think most of my sense is
I haven’t written that much about why I think generalizations like “just be helpful” aren’t that likely. I agree with the point that these issues are underexplored by people working on alignment, and even more underdiscussed, given how important they are.
There are some Google Doc comment threads with MIRI where I’ve written about why I think those are plausible (namely, that it seems plausible-but-challenging for breeding of animals, and that seems like one of our best anchors overall, suggesting that plausible-but-challenging is a good anchor). I think in those cases the key argument was about whether you need this to generalize far, since both MIRI and I think it’s a kind of implausible generalization to go out to infinity rather than becoming distorted at some point along the way, but I am more optimistic about making a series of “short hops” where models generalize helpfully to being moderately smarter, and then they can carry out the next step of training for you.