Interesting! I hadn’t come across that. Maybe ChatGPT is right that there is sweetness (perhaps to somebody with trained taste) that doesn’t come from sugar. However, the blatant contradictions remain (ChatGPT certainly wasn’t saying that at the beginning of the transcript).
OpenAI has in the past not been that transparent about these questions, but in this case, the blog post (linked in my post) makes it very clear it’s trained with reinforcement learning from human feedback.
However, of course it was initially pretrained in an unsupervised fashion (it’s based on GPT-3), so it seems hard to know whether this specific behavior was “due to the RL” or “a likely continuation”.
This is a broader criticism of alignment to preferences or intent in general, since these things can change (and sometimes, you can even make choices of whether to change them or not). L.A. Paul wrote a whole book about this sort of thing; if you’re interested, here’s a good talk.
That’s fair. I think it’s a critique of RLHF as it is currently done (just get lots of preferences over outputs and train your model). I don’t think just asking you questions “when it’s confused” is sufficient; it also has to know when to be confused. But RLHF is a pretty general framework, so you could theoretically expose a model to lots of black swan events (not just mildly OOD events) and make sure it reacts to them appropriately (or asks questions). But as far as I know, that’s not research that’s currently happening (though there might be something I’m not aware of).
This has also motivated me to post one of my favorite critiques of RLHF.
I think if they operationalized it like that, fine, but I would find the frame “solving the problem” to be a very weird way of referring to that. Usually, when I hear people saying “solving the problem” they have a vague sense of what they are meaning, and have implicitly abstracted away the fact that there are many continuous problems where progress needs to be made and that the problem can only really be reduced, but never solved, unless there is actually a mathematical proof.
I’m not a big supporter of RLHF myself, but my steelman is something like:
RLHF is a pretty general framework for conforming a system to optimize for something that can’t clearly be defined. If we just did, “if a human looked at this for a second would they like it?” this does provide a reward signal towards deception, but also towards genuinely useful behavior. You can take steps to reduce the deception component, for example by letting the humans red team the model or do some kind of transparency; this can all theoretically fit in the framework of RLHF. One could try to make the human feedback as robust as possible by adding all sorts of bells and whistles, and this would improve reliability and reduce the deception reward signal. It could be argued this still isn’t sufficient because the model will still find some way around it, but beware too much asymptotic reasoning.
I personally do not find RLHF very appealing since I think it complexifies things unnecessarily and is too correlated with capabilities at the moment. I prefer approaches that actually try to isolate the things people care about (i.e. their values), add some level of uncertainty (moral uncertainty, etc.) to them, try to make these proxies as robust as possible, and make them adaptive to changes in the models that are constantly trying to exploit and Goodhart them.
I think I essentially agree with respect to your definition of “sanity,” and that it should be a goal. For example, just getting people to think more about tail risk seems like your definition of “sanity” and my definition of “safety culture.” I agree that people just saying that they support my efforts and repeating applause lights is pretty bad, though it seems weird to me to discount actual resources coming in.
As for the last bit: I’m trying to figure out the crux here. Are you just not very concerned about outer alignment/proxy gaming? I think if I were totally unconcerned about that, and only concerned with inner alignment/deception, I would think those areas were useless. As it is, I think a lot of the work is actively harmful (because it mostly just advances capabilities) but it still may help chip away at the proxy gaming problem.
I appreciate this comment, because I think anyone who is trying to do these kinds of interventions needs to be constantly vigilant about exactly what you are mentioning. I am not excited about loads of inexperienced people suddenly trying to do big things in AI strategy, because the downsides can be so high. Even people I trust are likely to make a lot of miscalculations. And the epistemics can be bad.
I wouldn’t be excited about (for example) retreats with undergrads to learn about “how you can help buy more time.” I’m not even sure of the sign of interventions people I trust pursue, let alone people with no experience who have barely thought about the problem. As somebody who is very inexperienced but interested in AI strategy, I will note that you do have to start somewhere to learn anything.
That said, and I don’t think you dispute this, we cannot afford to leave strategy/fieldbuilding/policy off the table. In my view, a huge part of making AI go well is going to depend on forces beyond the purely technical, but I understand that this varies depending on one’s threat models. Sociotechnical systems are very difficult to influence, and it’s easy to influence them in an incorrect direction. Not everyone should try to do this, and even people who are really smart and thinking clearly may have negative effects. But I think we still have to try.
I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely “concerned” does seem not very useful. But safety culture seems very important. Things like (paraphrased from this paper):
Preoccupation with failure, especially black swan events and unseen failures.
Reluctance to simplify interpretations and explain failures using only simplistic narratives.
Sensitivity to operations, which involves closely monitoring systems for unexpected behavior.
Commitment to resilience, which means being rapidly adaptable to change and willing to try new ideas when faced with unexpected circumstances.
Under-specification of organizational structures, where new information can travel throughout the entire organization rather than relying only on fixed reporting chains.
In my view, having this kind of culture (which is analogous to having a security mindset) proliferate would be a nearly unalloyed good. Notice there is nothing here that necessarily says “slow down”: one failure mode with telling the most safety-conscious people to simply slow down is, of course, that less safety-conscious people don’t. Rather, I think that simple awareness and understanding, and taking safety seriously, are robustly positive regardless of the strategic situation. Doing as much as possible to root out the old-fashioned “move fast and break things” ethos and replace it with a safety-oriented ethos would be very helpful (though, I will emphasize, it will not solve everything).
Lastly, regarding this:
with people walking away thinking that AI Alignment people want to teach AIs how to reproduce moral philosophy, or that OpenAIs large language model have been successfully “aligned”, or that we just need to throw some RLHF at the problem and the AI will learn our values fine
I’m curious if you think that these topics have no relevance to alignment, or whether you’re saying it’s problematic that people come away thinking that these things are the whole picture? Because I see all of these as relevant (to varying degrees), but certainly not sufficient or the whole picture.
I feel somewhat conflicted about this post. I think a lot of the points are essentially true. For instance, I think it would be good if timelines could be longer all else equal. I also would love more coordination between AI labs. I also would like more people in AI labs to start paying attention to AI safety.
But I don’t really like the bottom line. The point of all of the above is not reducible to just “getting more time for the problem to be solved.”
First of all, the framing of “solving the problem” is, in my view, misplaced. Unless you think we will someday have a proof of beneficial AI (I think that’s highly unlikely), then there will always be more to do to increase certainty and reliability. There isn’t a moment when the problem is solved.
Second, these interventions are presented as a way of giving “alignment researchers” more time to make technical progress. But in my view, things like more coordination could lead to labs actually adopting any alignment proposals at all. The same is the case for racing. In terms of concern about AI safety, I’d expect labs would actually devote resources to this themselves if they are concerned. This shouldn’t be a side benefit, it should be a mainline benefit.
I don’t think high levels of reliability of beneficial AI will come purely from people who post on LessWrong, because the community is just so small and doesn’t have that much capital behind it. DeepMind/OpenAI, not to mention Google Brain and Meta AI research, could invest significantly more in safety than they do. So could governments (yes, the way they do this might be bad—but it could in principle be good).
You say that you thought buying time was the most important frame you found to backchain with. To me this illustrates a problem with backchaining. Dan Hendrycks and I discussed similar kinds of interventions, and we called this “improving contributing factors” which is what it’s called in complex systems theory. In my view, it’s a much better and less reductive frame for thinking about these kinds of interventions.
As far as I understand it, “intelligence” is the ability to achieve one’s goals through reasoning and making plans, so a highly intelligent system is goal-directed by definition. Less goal-directed AIs are certainly possible, but they must necessarily be considered less intelligent—the thermometer example illustrates this. Therefore, a less goal-directed AI will always lose in competition against a more goal-directed one.
Your argument seems to be:
Definitionally, intelligence is the ability to achieve one’s goals.
Less goal-directed systems are less intelligent.
Less intelligent systems will always lose in competition.
Less goal directed systems will always lose in competition.
Defining intelligence as goal-directedness doesn’t do anything for your argument. It just kicks the can down the road. Why will less intelligent (under your definition, less goal-directed) systems always lose in competition?
Imagine you had a magic wand or a genie in a bottle that would fulfill every wish you could dream of. Would you use it? If so, you’re incentivized to take over the world, because the only possible way of making every wish come true is absolute power over the universe. The fact that you normally don’t try to achieve that may have to do with the realization that you have no chance. If you had, I bet you’d try it. I certainly would, if only so I could stop Putin. But would me being all-powerful be a good thing for the rest of the world? I doubt it.
Romance is a canonical example of where you really don’t want to be all powerful (if real romance is what you want). Romance could not exist if your romantic partner always predictably did everything you ever wanted. The whole point is they are a different person, with different wishes, and you have to figure out how to navigate that and its unpredictabilities. That is the “fun” of romance. So no, I don’t think everyone would really use that magic wand.
I think it’s tangentially relevant in certain cases. Here’s something I wrote in another context, where I think it’s perhaps useful to understand what we really mean when we say “intelligence.”
We consider humans intelligent not because they do better on all possible optimization problems (they don’t, due to the no free lunch theorem), but because they do better on a subset of problems that are actually encountered in the real world. For instance, humans have particular cognitive architectures that allow us to understand language well, and language is something that humans need to understand. This can be seen even more clearly with “cognitive biases”, which are errors in logical reasoning that are nevertheless advantageous for surviving in the ancestral human world. Recently, people have tried to rid themselves of such biases, because of their belief that they no longer help in the modern world, a perfect example of the fact that human intelligence is highly domain-dependent.
We can’t do arithmetic nearly as well as machines, but we don’t view them as intelligent because we can do many more things than they can, more flexibly and better. The machine might reply that it can do many things as well: it can quickly multiply 3040 by 2443, but also 42323 by 3242, and 379 by 305, and in an absolutely huge number of these calculations. The human might respond that these are all “just multiplication problems”; the machine might say that human problems are “just thriving-on-planet-earth problems”. When we say intelligence, we essentially just mean “ability to excel at thriving-on-planet-earth problems,” which requires knowledge and cognitive architectures that are specifically good at thriving-on-planet-earth. Thinking too much about “generality” tends to confuse; instead consider performance on thriving-on-planet-earth problems, or particular subsets of those problems.
Agree it’s mostly not relevant.
Link is course.mlsafety.org.
(Not reviewed by Dan Hendrycks.)
This post is about epistemics, not about safety techniques, which are covered in later parts of the sequence. Machine learning, specifically deep learning, is the dominant paradigm that people believe will lead to AGI. The researchers who are advancing the machine learning field have proven quite good at doing so, insofar as they have created rapid capabilities advancements. This post sought to give an overview of how they do this, which is in my view extremely useful information! We strongly do not favor advancements of capabilities in the name of safety, and that is very clear in the rest of this sequence. But it seems especially odd to say that one should not even talk about how capabilities have been advanced.
The amount of capabilities research is simply far greater than the amount of safety research. Thus to answer the question “what kind of research approaches generally work for shaping machine learning systems?” it is quite useful to engage with how they have worked in capabilities advancements. In machine learning, theoretical (in the “math proofs” sense of the word) approaches to advancing capabilities have largely not worked. This suggests deep learning is not amenable to these kinds of approaches.
It sounds like you believe that these approaches will be necessary for AI safety, so no amount of knowledge of their inefficacy should persuade us in favor of more iterative, engineering approaches. To put it another way: if iterative engineering practices will never ensure safety, then it does not matter if they worked for capabilities, so we should be working mainly on theory.
I pretty much agree with the logic of this, but I don’t agree with the premise: I don’t think it’s the case that iterative engineering practices will never ensure safety. The reasons for this are covered in this sequence. Theory, on the other hand, would be a lot more ironclad than iterative engineering approaches, if usable theory could actually be produced. Knowing that usable theory has not really been produced in deep learning suggests to me that it’s unlikely to be produced for safety, either. Thus, to me, an iterative engineering approach appears to be more favorable despite the fact that it leaves more room for error. However, I think that if you believe that iterative engineering approaches will never work, then indeed you should work on theory despite what should be a very strong prior (based on what is said in this post) that theory will also not work.
From my comments on the MLSS project submission (which aren’t intended to be comprehensive):
Quite enjoyed reading this, thanks for writing!
My guess is that the factors combine to create a roughly linear model. Even if progress is unpredictable and not linear, the average rate of progress will still be linear.
I’m very skeptical that this is a linear interpolation. It’s the core of your argument, but I didn’t think it was really argued. I would be very surprised if moving from 50% to 49% risk took similar time as moving from 2% to 1% risk, even if there are more researchers, unless the research pool grows exponentially. I don’t really think you’ve justified this linear trend.
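To make the worry concrete, here is a rough sketch of the arithmetic. It assumes a toy model that is mine, not the report's: each unit of research effort removes a fixed fraction of whatever risk remains. The function names and the 1% per-unit rate are purely illustrative.

```python
import math


def relative_reduction(risk_before: float, risk_after: float) -> float:
    """Fraction of the remaining risk removed by one step."""
    return (risk_before - risk_after) / risk_before


def effort_units(risk_before: float, risk_after: float, rate: float) -> float:
    """Effort needed to go from risk_before to risk_after, ASSUMING
    (hypothetically) that each unit of effort removes a fixed
    fraction `rate` of the remaining risk."""
    return math.log(risk_after / risk_before) / math.log(1.0 - rate)


# Both steps are one percentage point, but very different relative cuts.
high_step = relative_reduction(0.50, 0.49)  # about 2% of remaining risk
low_step = relative_reduction(0.02, 0.01)   # 50% of remaining risk

# Under the toy model, compare the effort each step takes.
rate = 0.01  # illustrative per-unit reduction rate
ratio = effort_units(0.02, 0.01, rate) / effort_units(0.50, 0.49, rate)

print(f"50% -> 49% removes {high_step:.0%} of remaining risk")
print(f"2% -> 1% removes {low_step:.0%} of remaining risk")
print(f"effort ratio (low-risk step / high-risk step): {ratio:.1f}")
```

Under this assumption the last percentage point takes dozens of times more effort than the first, which is why a linear trend seems to need either a separate argument or a fast-growing research pool. Other models of progress would give different numbers.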
The report also seems to just assume aligned AI would reduce other x-risk to zero. I’m not sure why this should be assumed. I can see a case for a large reduction in it, but it’s not necessarily obvious.
Lastly, it felt strange to me to not explore risks of cognitively enhanced humans: for instance, risks that cognitively enhanced humans would have different values, or risks that cognitively enhanced humans would subjugate unenhanced humans.
Thanks! I really appreciate it, and think it’s a lot more accurate now. Nitpicks:
I think the MLSS link is currently broken. Also, in the headline table, it still emphasizes model robustness perhaps more than is warranted.
As somebody who used to be an intern at CHAI, but certainly isn’t speaking for the organization:
CHAI seems best approximated as a collection of researchers doing a bunch of different things. There is more reinforcement learning at CHAI than elsewhere, and it’s ML research, but it’s not top down at all so it doesn’t feel that unified. Stuart Russell has an agenda, but his students have their own agendas which only sometimes overlap with his.