I take this comment as evidence that John would fail an intellectual Turing test for people who have different views than he does about how valuable incremental empiricism is. I think this is an ITT which a lot of people in the broader LW cluster would fail. I think the basic mistake that’s being made here is failing to recognize that reality doesn’t grade on a curve when it comes to understanding the world—your arguments can be false even if nobody has refuted them. That’s particularly true when it comes to very high-level abstractions, like the ones this field is built around (and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment).
Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that’s fine, this might be necessary, and so it’s good to have some people pushing in this direction, but it seems like a bunch of people around here don’t just ignore the skulls, they seem to lack any awareness that the absence of the key components by which scientific progress has basically ever been made is a red flag at all.
I think it’s possible to criticise work on RLHF while taking seriously the possibility that empirical work on our biggest models is necessary for solving alignment. But criticisms like this one seem to showcase a kind of blind spot. I’d be more charitable if people in the LW cluster had actually tried to write up the arguments for things like “why inner misalignment is so inevitable”. But in general people have put shockingly little effort into doing so, with almost nobody trying to tackle this rigorously. E.g. I was surprised when my debates with Eliezer involved him still using all the same intuition-pumps as he did in the Sequences, because to me the obvious thing to do over the next decade is to flesh out the underlying mental models of the key issue, which would then allow you to find high-level intuition pumps that are both more persuasive and more trustworthy.
I’m more careful than John about throwing around aspersions on which people are “actually trying” to solve problems. But it sure seems to me that blithely trusting your own intuitions because you personally can’t imagine how they might be wrong is one way of not actually trying to solve hard problems.
Comments on parts of this other than the ITT thing (response to the ITT part is here)...
(and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment)
I don’t usually focus much on the outer/inner abstraction, and when I do I usually worry about outer alignment. I consider RLHF to have been negative progress on outer alignment, same as inner alignment; I wasn’t relying on that particular abstraction at all.
Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that’s fine, this might be necessary, and so it’s good to have some people pushing in this direction, but it seems like a bunch of people around here don’t just ignore the skulls, they seem to lack any awareness that the absence of the key components by which scientific progress has basically ever been made is a red flag at all.
I think your model here completely fails to predict Descartes, Laplace, Von Neumann & Morgenstern, Shannon, Jaynes, Pearl, and probably many others. Basically all of the people who’ve successfully made exactly the sort of conceptual advances we aim for in agent foundations.
But it is a model under which one could try to make a case for RLHF.

I still do not think that the team doing RLHF work at OpenAI actually thought about whether this model makes RLHF decrease the chance of human extinction, and deliberated on that in a way which could plausibly have resulted in the project not happening. But I have made that claim maximally easy to falsify if I’m wrong.
I’d be more charitable if people in the LW cluster had actually tried to write up the arguments for things like “why inner misalignment is so inevitable”.
Speaking for myself, I don’t think inner misalignment is clearly inevitable. I do think outer misalignment is much more clearly inevitable, and I do think inner misalignment is not plausibly sufficiently unlikely that we can afford to ignore the possibility. Similar to this comment: I’m pretty sympathetic to the view that powerful deceptive inner agents are unlikely, but charging ahead assuming that they will not happen is idiotic given the stakes.
A piece which I think is missing from this thread thus far: in order for RLHF to decrease the chance of human extinction, there has to first be some world in which humans go extinct from AI. By and large, it seems like people who think RLHF is useful are mostly also people who think we’re unlikely to die of AI, and that’s not a coincidence: worlds in which the iterative-incremental-empiricism approach suffices for alignment are worlds where we’re unlikely to die in the first place. Humans are good at iterative incremental empiricism. The worlds in which we die are worlds in which that approach is fundamentally flawed for some reason (usually because we are unable to see the problems).
Thus the wording of this claim I made upthread:
If someone on the OpenAI team which worked on RLHF thought humanity had a decent (not necessarily large) chance of going extinct from AI, and they honestly thought implementing and popularizing RLHF made that chance go down, and they chose to work on RLHF because of that, then I would say I was wrong to accuse them of merely paying lip service.
In order for work on RLHF to reduce the chance of humanity going extinct from AI, it has to help in one of the worlds where we otherwise go extinct, not in one of the worlds where alignment by default kicks in and we would probably have been fine anyway.
(In case it was not obvious: I am definitely not saying that one must assign high P(doom) to do actual alignment work. I am saying that one must have some idea of worlds in which we’re actually likely to die.)
I take this comment as evidence that John would fail an intellectual Turing test for people who have different views than he does about how valuable incremental empiricism is.
I don’t want to pour a ton of effort into this, but here’s my 5-paragraph ITT attempt.
“As an analogy for alignment, consider processor manufacturing. We didn’t get to gigahertz clock speed and ten nanometer feature size by trying to tackle all the problems of 10 nm manufacturing processes right out the gate. That would never have worked; too many things independently go wrong to isolate and solve them all without iteration. We can’t get many useful bits out of empirical feedback if the result is always failure, and always for a long list of reasons.
And of course, if you know anything about modern fabs, you know there’d have been no hope whatsoever of identifying all the key problems in advance just based on theory. (Side note: I remember a good post or thread from the past year on crazy shit fabs need to do, but can’t find it; anyone remember that and have a link?)
The way we actually did it was to start with gigantic millimeter-size features, which were relatively easy to manufacture. And then we scaled down slowly. At each new size, new problems came up, but those problems came up just a few at a time as we only scaled down a little bit at each step. We could carry over most of our insights from earlier stages, and isolate new problems empirically.
The analogy, in AI, is to slowly ramp up the capabilities/size/optimization pressure of our systems. Start with low capability, and use whatever simple tricks will help in that regime. Then slowly ramp up, see what new problems come up at each stage, just like we did for chip manufacturing. And to complete the analogy: just like with chips, at each step we can use the products of the previous step to help design the next step.
That’s the sort of plan which has a track record of actually handling the messiness of reality, even when scaling things over many orders of magnitude.”
There, let me know how plausible that was as an ITT attempt for “people who have different views [than I do] about how valuable incremental empiricism is”.
Forgot to reply to this at the time, but I think this is a pretty good ITT. (I think there’s probably some additional argument that people would make about why this isn’t just an isolated analogy, but rather a more generally-applicable argument, but it does seem to be a fairly central example of that generally-applicable argument.)
I think people who value empirical alignment work now probably think that (to some extent) we can predict at a high level what future problems we might face (contrasting with “there’d have been no hope whatsoever of identifying all the key problems in advance just based on theory”). Obviously this is a spectrum, but I think the chip fab analogy is further towards believing there are unknown unknowns in the problem space than people at OpenAI are (e.g. OpenAI people possibly think outer alignment and inner alignment capture all of the kinds of problems we’ll face).
However, they probably don’t believe you can work on solutions to those problems without being able to empirically demonstrate those problems and hence iterate on them (and again one could probably appeal to a track record here of most proposed solutions to problems not working unless they were developed by iterating on the actual problem). We can maybe vaguely postulate what the solutions could look like (they would say), but it’s going to be much better to try to actually implement solutions on versions of the problem we can demonstrate, and iterate from there. (Note that they probably also try to produce demonstrations of the problems so that they can then work on those solutions, but this is still all empirical.)
Otherwise I do think your ITT does seem reasonable to me, although I don’t think I’d put myself in the class of people you’re trying to ITT, so that’s not much evidence.
and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment
I am confused. How does RLHF help with outer alignment? Isn’t optimizing for human approval the classical outer-alignment problem? (e.g. tiling the universe with smiling faces)
I don’t think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and eliciting misalignment (i.e. the points about empirical data that you mentioned; I just don’t understand where the inner/outer alignment distinction comes from in this context).
RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples, but also intrinsic motivation functions like curiosity and empowerment) which are used to train agents in the absence of RLHF.

The smiley faces example feels confusing as a “classic” outer alignment problem because AGIs won’t be trained on a reward function anywhere near as limited as smiley faces. An alternative like “AGIs are trained on a reward function in which all behavior on a wide range of tasks is classified by humans as good or bad” feels more realistic, but also lacks the intuitive force of the smiley face example—it’s much less clear in this example why generalization will go badly, given the breadth of the data collected.
I think the smiling example is much more analogous than you are making it out here. I think the basic argument for “this just encourages taking control of the reward” or “this just encourages deception” goes through the same way.
Like, RLHF is not some magical “we have definitely figured out whether a behavior is really good or bad” signal; it’s historically been just some contractors thinking for like a minute about whether a thing is fine. I don’t think there is less Bayesian evidence conveyed by people smiling (the variance in smiling is greater than the variance in RLHF approval, so the amount of information conveyed is actually greater), so I don’t buy that RLHF conveys more about human preferences in any meaningful way.
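The variance-and-information point above can be made concrete with Shannon entropy: a signal with more distinguishable states can carry more bits per observation than a binary approve/disapprove label. A minimal sketch, where the distributions are made-up toy numbers purely for illustration, not empirical claims about contractors or smiling:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical toy distributions (illustrative only):
# a binary RLHF approval signal, roughly balanced
rlhf_approval = [0.5, 0.5]
# a "smiling" signal with five distinguishable intensity levels
smiling = [0.2] * 5

print(entropy(rlhf_approval))  # → 1.0 bit
print(entropy(smiling))        # → ~2.32 bits, i.e. log2(5)
```

A uniform n-level signal tops out at log2(n) bits, so whether smiling really conveys more than contractor approval turns on how many states of each are distinguishable in practice; the sketch only illustrates the entropy arithmetic, not the empirical claim.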