This isn’t an objection to the research direction, just a response to how you’re framing it:
If you think GPT-3 is “narrowly superhuman” at medical advice, what topic don’t you think it’s narrowly superhuman in? It seems like you could similarly argue that GPT-3 knows more than the average human about mechanics, chemistry, politics, and just about anything that language is good at describing. (EG, not walking, riding a bike, the concrete skills needed for painting, etc.)
A tool capable of getting GPT-3 to give good medical advice would, probably, be a tool to get GPT-3 to give good advice.
(I am not denying that giving good medical advice is a better initial goal/framing.)
This seems to imply that GPT-3 is broadly superhuman, IE, GPT-3 knows more than the average human about a very broad range of things (although GPT-3 might not know more than the best human in any domain). Going further: the implication is that GPT is a kind of mild superintelligence, currently misaligned in a benign way (it just wants to mimic humans) which hides an unknown portion of its intelligence (making it seem subhuman).
I’m not saying this is exactly true. Maybe GPT-3 really is only narrowly superhuman, in the sense that it basically only knows what it needs to know to mimic humans to this level, and essentially doesn’t know anything about medicine etc. In this world, its apparent knowledge of medicine is so mixed with all its other ideas that you can’t extract the truth: it’s not operating on a “true medical stuff + mistakes” model, it just has models of a bunch of possible statements with no way to differentiate good advice from nonsense. In that case, you can only train GPT-3 to give good medical advice by providing an external truth filter of some kind; your project would be basically doomed.
(I think the truth is some unknown point between those two extremes, and I’m quite curious to know exactly where.)
You consider whether AlphaGo could serve a similar role as a test case of aligning narrowly superhuman models, and you reject this idea. I think AlphaGo really is a narrowly superhuman model, and I think your rejection of it is related to this. Because it really is narrowly superhuman, it doesn’t seem like it has this kind of hidden knowledge you want to bring out—it only knows about Go.
So it seems like “narrowly superhuman” might be the wrong framing.
This seems like it’s using the wrong ontology to me.
Like, in my mind, there are things like medical diagnostics or predictions of pharmaceutical reactions, which are much easier cognitive tasks than general conversation, but which humans are specialized away from.
For example, imagine that the severity of side effects from a specific medication can be computed by figuring out 15 variables about the person and putting them into a neural network with 5000 parameters, and the output is somewhere in a six-dimensional space, and this model is part of a general model of human reactions to chemicals.
Then GPT-3 would be in a great position to use people’s reddit posts talking about medication side effects to find this network. I doubt that medical science in our current world could figure that out meaningfully. It would be strongly superhuman in this important medical task, but nowhere near superhuman in any other conversational task.
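To make the scenario concrete, here is a toy sketch of a network with those dimensions (the 15/5000/6 figures come from the hypothetical above; the hidden width of 250 and the random weights are filler, and none of this is a claim about real pharmacology):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model from the scenario: 15 variables about the person in,
# a 6-dimensional side-effect profile out, roughly 5000 parameters.
# The hidden width of 250 is filler chosen to land near that count.
N_IN, N_HIDDEN, N_OUT = 15, 250, 6

W1 = rng.normal(size=(N_IN, N_HIDDEN)) * 0.1   # 15 * 250 = 3750 weights
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(size=(N_HIDDEN, N_OUT)) * 0.1  # 250 * 6 = 1500 weights
b2 = np.zeros(N_OUT)

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 5506 -- the same order as the "5000 parameters" above

def predict(person):
    """Map 15 patient variables to a point in the 6-d side-effect space."""
    hidden = np.maximum(0.0, person @ W1 + b1)  # ReLU layer
    return hidden @ W2 + b2

profile = predict(rng.normal(size=N_IN))
print(profile.shape)  # (6,)
```

A network this small is trivial to run; the point of the scenario is that finding the right weights from noisy Reddit text is the hard part.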
My intuition is that most professional occupations are dominated by problems like this, that are complex enough that we as humans can only capture them as intuitions, but simple enough that the “right” computational solution would be profoundly superhuman in that narrow domain, without being broadly superhuman in any autonomous sense.
Maybe a different reading of your comment is something like, there are so many of these things that if a human had access to superhuman abilities across all these individual narrow domains, that human could use it to create a decisive strategic advantage for themself, which does seem possibly very concerning.
Let’s see if I can properly state the nature of the disagreement.
I stated that there’s a spectrum between “GPT knows more than the average human across a broad variety of domains, but only uses this knowledge to imitate humans, so it’s not obvious” and “GPT really knows very little, and its apparent stupidity is stupidity-in-fact”.
I somewhat operationalized the difference as one of internal representation: to what extent is GPT using a truth+noise model (where it knows a lot of stuff about reality, and then filters it through the biases of particular perspectives) vs a model where everything is thrown together and it’s not very possible to extract truth without having more information yourself to know what is truth vs noise.
This model has an implication, that Ajeya’s project will work to the extent that we’re toward the smart-GPT end of the spectrum and won’t work to the extent that we’re toward the other end.
I think you’re disagreeing with this implication?
So you’re saying: even if GPT doesn’t internally use anything like a truth+noise model, it’s possible to extract a great deal of useful information about the world by observing the statistics of GPT’s imitation of internet users. For example, because people talk a lot about diseases online, it should be possible to extract statistics about this from GPT. This can produce a useful diagnostic model, even if GPT isn’t internally representing something so useful.
Is this roughly what you are saying?
If that’s what you’re saying, then I agree that such a thing could be possible, but I am unsure if this should count as success in Ajeya’s terms.
If GPT knows a lot of stuff but isn’t telling us because it’s not trying to be helpful, that’s misalignment. Getting it to try to communicate those things to us would be a kind of alignment work.
If the statistics of GPT’s text model can be used to infer useful things about the world, this doesn’t seem related to alignment.
But maybe I’m totally mis-identifying the disagreement you were trying to point at.
My intuition is that most professional occupations are dominated by problems like this, that are complex enough that we as humans can only capture them as intuitions, but simple enough that the “right” computational solution would be profoundly superhuman in that narrow domain, without being broadly superhuman in any autonomous sense.
Your phrase “in any autonomous sense” makes me think that perhaps you think GPT does have an internal model like the medical model you describe, plus similar models in many different domains, but lacks an “autonomy” property which would be required to make it broadly superhuman in a significant sense. Under this hypothesis, your disagreement with me is that you think I think GPT has “autonomy”.
I guess my response to that would be that GPT probably does lack some kind of “autonomy” (if it means independently pursuing goals by planning, anticipating the consequences of its words) but does have significant planning capacity if asked (eg could construct coherent plans involving its medical knowledge, and in doing so, fluidly match up its narrow medical knowledge with its narrow knowledge in a variety of different areas).
I think this is obscuring (my perception of) the disagreement a little bit.
I think what I’m saying is, GPT-3 probably doesn’t have any general truth+noise models. But I would expect it to copy a truth+noise model from people, when the underlying model is simple.
I then expect GPT-3 to “secretly” have something like an interesting diagnostic model, and probably a few other narrowly superhuman skills.
But I would expect it to not have any kind of significant planning capacity, because that planning capacity is not simple.
In particular my expectation is that coherently putting knowledge from different domains together in generally useful ways is MUCH, MUCH harder than being highly superhuman in narrow domains. Therefore I expect Ajeya’s approach to be both effective, because “narrowly superhuman” can exist, and reasonably safe, because the gap between “narrowly superhuman” or even “narrowly superhuman in many ways” and “broadly superhuman” is large so GPT-3 being broadly superhuman is unlikely.
Phrased differently, I am rejecting your idea of smartness-spectrum. My intuition is that levels of GPT-N competence will scale the way computers have always scaled at AI tasks—becoming usefully superhuman at a few very quickly, while taking much much longer to exhibit the kinds of intelligence that are worrying, like modeling human behavior for manipulation.
(It would be nice if you flagged a little better which things you think I think / which things you think I disagree with)
I think what I’m saying is, GPT-3 probably doesn’t have any general truth+noise models. But I would expect it to copy a truth+noise model from people, when the underlying model is simple.
OK, that makes sense. So you’re not saying that GPT contains useful diagnostic models in the overall statistics of its models of Reddit users (EG that someone complaining of one symptom will often complain of another), nor are you saying that GPT contains a good model of disease which it then feeds through noise (EG it decides that a particular user is a diabetic, which shapes how it plays that character going forward, but the character itself doesn’t know it is diabetic, so may say some confused things); indeed, you are denying the latter. But what you are saying is that GPT plays the role of users who do have their own internal models, so it must mimic those models (in cases where that’s not too hard to learn).
I find this hard to square with your earlier statement:
For example, imagine that the severity of side effects from a specific medication can be computed by figuring out 15 variables about the person and putting them into a neural network with 5000 parameters, and the output is somewhere in a six-dimensional space, and this model is part of a general model of human reactions to chemicals.
Then GPT-3 would be in a great position to use people’s reddit posts talking about medication side effects to find this network. I doubt that medical science in our current world could figure that out meaningfully.
Where it sounds like you think GPT will know something medical science does not know.
As for me, I find all of these to be broadly possible. I’d have to think more to give a meaningful plausibility ranking.
I then expect GPT-3 to “secretly” have something like an interesting diagnostic model, and probably a few other narrowly superhuman skills.
How many? I am thinking of “medical diagnostics” as just one example of many many areas of expertise which border on GPT’s competence. I wasn’t thinking there was any special reason to single out medicine in particular as something GPT might have implicit knowledge about.
On my model, if GPT contains implicit medical competence, it probably contains similar competence in “every area”, although I’m not sure how to quantify. Maybe a similar hidden competence in at least 50% of professions at least as numerous as, say, physicist? (Really, what matters is how much discussion of a profession there is online, not how numerous that profession is, but maybe it’s an OK proxy.)
My crux would be something special about medical diagnosis such that we especially expect GPT to have implicit talent there.
But I would expect it to not have any kind of significant planning capacity, because that planning capacity is not simple.
It seems like you think planning capacity might be some important difference in our positions?
Personally, I think it’s plausible that GPT does something to plan ahead: it seems broadly useful to think about what could come later in the text (eg where a sentence is going), and potentially, it’s useful to think about that in some detail (to notice when options which seem consistent at the high level are actually not consistent when you try to put all the pieces together (where by “consistent” I mean plausible in terms of everything GPT knows about text)).
But I don’t see this as fundamental to the view I’m expressing in any way.
In particular my expectation is that coherently putting knowledge from different domains together in generally useful ways is MUCH, MUCH harder than being highly superhuman in narrow domains. Therefore I expect Ajeya’s approach to be both effective, because “narrowly superhuman” can exist, and reasonably safe, because the gap between “narrowly superhuman” or even “narrowly superhuman in many ways” and “broadly superhuman” is large so GPT-3 being broadly superhuman is unlikely.
Phrased differently, I am rejecting your idea of smartness-spectrum. My intuition is that levels of GPT-N competence will scale the way computers have always scaled at AI tasks—becoming usefully superhuman at a few very quickly, while taking much much longer to exhibit the kinds of intelligence that are worrying, like modeling human behavior for manipulation.
I think I agree with a heuristic that says something like “GPT isn’t magic, GPT-n will scale the way things usually scale, the highest-probability projection for the near future is the smooth extrapolation from the past”. Not as something I’m confident about, but as the default.
But I do have a big disagreement with what you wrote above.
First I’m going to try to make a very general argument in favor of my spectrum. Then I’m going to give a very concrete scenario, which I think argues for “putting knowledge together” competence.
General Argument
Let’s forget the difference between the truth+noise model and various other models, and just deal with whether GPT has “implicit knowledge”. What exactly “implicit knowledge” means will depend on the extraction technology we invent; I define “implicit knowledge” functionally as any expertise which can be brought out (by something broadly in line with Ajeya’s research program).
My broad argument is just that absent any specific reason to expect implicit knowledge about medicine in particular, conditional on such knowledge, we should expect similar implicit knowledge across a broad variety of domains.
My smartness-spectrum is just the latent variable of how much implicit knowledge GPT has. The argument for the existence of such a spectrum is just the argument that if we see it in one domain, we would expect it in others. If we don’t see it in one, we less expect to see it in others.
Specific Scenario
Suppose the general alignment technology we develop resembles “learning to summarize from human feedback”, the example Ajeya cited of work that looks like what Ajeya wants to point toward.
More specifically, suppose it works like this:
We collect a lot of data of humans judging GPT-3 as being smart and helpful, vs dumb or not helpful.
We train a model (using features from GPT-3 to give the network a good start) to replicate those human judgments. Call this JUDGE.
We fine-tune GPT-3 using JUDGE as our training signal; ie, fine-tune it to be as smart and helpful as possible. Let’s call this GPT-nice.
This procedure may not extract all implicit knowledge, or extract it well, etc etc. However, I fully expect that this procedure will extract some. I just see no reason to think this procedure wouldn’t work. Simply put, I expect GPT-nice to be a legitimately more helpful and intelligent fine-tuning of GPT-3.
(Whether this procedure is safe for, say, GPT-7 is another question.)
Let’s say for the sake of argument that this procedure brings out the kind of medical competence we’ve been discussing, plus similar competence in at least a few other domains.
I generally expect that GPT-nice will have decent “putting knowledge together” skills, mainly because GPT-3 is already not too bad at this. Yes, sometimes GPT misses common-sense implications. However, by and large, if you put facts from different domains into the text history, GPT will come up with continuations which make sense. So I would postulate that GPT-nice will be at least as good as GPT-3 with the relevant facts placed in history.
Suppose for the sake of argument that GPT-nice is good at medical diagnosis and separately good at giving dietary advice. Further suppose for the sake of argument that GPT-3 is OK at telling you what dietary changes are implied by medical conditions. Then I would suppose GPT-nice is at least OK at giving dietary advice tailored to your medical conditions.
Or suppose GPT-nice is good at diagnosing psychological disorders, and good at giving social advice. Then suppose GPT-3 is already halfway decent at anticipating social problems associated with psychological disorders, when prompted correctly. Then I would suppose that GPT-nice would be halfway decent at tailoring its social advice to any psychological problems a person has.
I’m replying on my phone right now because I can’t stop thinking about it. I will try to remember to follow up when I can type more easily.
I think the vague shape of what I think I disagree about is how dense GPT-3’s sets of implicit knowledge are.
I do think we agree that GPT-5000 will be broadly superhuman, even if it just has a grab bag of models in this way, for approximately the reasons you give.
I’m thinking about “intelligent behavior” as something like the set of real numbers, and “human behavior” as covering something like rational numbers, so we can get very close to most real numbers but it takes some effort to fill in the decimal expansion. Then I’m thinking of GPT-N as being something like integers+1/N. As N increases, this becomes close enough to the rational numbers to approximate real numbers, and can be very good at approximating some real numbers, but can’t give you incomputable numbers (unaligned outcomes) and usually won’t give you duplicitous behavior (numbers that look very simple at first approximation but actually aren’t, like 0.2500000000000004, which seems to be 1/4 but secretly isn’t).
I’m not sure where that intuition comes from but I do think I endorse it with moderate confidence.
Basically I think for minimal circuit reasons that if “useful narrowly” emerges in GPT-N, then “useful in that same domain but capable of intentionally doing a treacherous turn” emerges later. My intuition is that this won’t be until GPT-(N+3) or more, so if you are able to get past unintentional turns like “the next commenter gives bad advice” traps, this alignment work is very safe, and important to do as fast as possible (because attempting it later is dangerous!)
In a world where GPT-(N+1) can do a treacherous turn, this is very dangerous, because you might accidentally forget to check if GPT-(N-1) can do it, and get the treacherous turn.
My guess is that you would agree that “minimal circuit that gives good advice” is smaller than “circuit that gives good advice but will later betray you”, and therefore there exist two model sizes where one is dangerous and one is safe but useful. I know I saw posts on this a while back, so there may be relevant math about what that gap might be, or it might be unproven but with some heuristics of what the best result probably is.
My intuition is that combining narrow models is multiplicative, so that adding a social manipulation model will always add an order of magnitude of complexity. My guess is that you don’t share this intuition. You may think of model combination as additive, in which case any model bigger than a model that can betray you is very dangerous, or you might think the minimal circuit for betrayal is not very large, or you might think that GPT-2-nice would be able to give good advice in many ways so GPT-3 is already big enough to contain good advice plus betrayal in many ways.
In particular if combining models is multiplicative in complexity, a model could easily learn two different skills at the same time, while being many orders of magnitude away from being able to use those skills together.
My guess is that you would agree that “minimal circuit that gives good advice” is smaller than “circuit that gives good advice but will later betray you”, and therefore there exist two model sizes where one is dangerous and one is safe but useful. I know I saw posts on this a while back, so there may be relevant math about what that gap might be, or it might be unproven but with some heuristics of what the best result probably is.
There was indeed a post posing this question a while back, and discussion in the comments included a counterexample: a construction of a minimal circuit that would be malign.
To my eye, the whole crux of the inner alignment problem is that we have no results saying things like:
The simplest program which solves a problem is not an inner optimizer
The minimal circuit which solves a problem is not an inner optimizer
The fastest program solving a problem is not an inner optimizer
Or any such thing. If we had such a result, then we’d have a grip on the problem. But we don’t currently have any result like that, nor any plausible direction for proving such a result. And indeed, thought on the problem suggests that these hypotheses are probably not true; rather, it seems surprisingly plausible, once you think about it, that indeed minimal solutions may sometimes be inner optimizers.
My intuition is that combining narrow models is multiplicative, so that adding a social manipulation model will always add an order of magnitude of complexity. My guess is that you don’t share this intuition. You may think of model combination as additive, in which case any model bigger than a model that can betray you is very dangerous, or you might think the minimal circuit for betrayal is not very large, or you might think that GPT-2-nice would be able to give good advice in many ways so GPT-3 is already big enough to contain good advice plus betrayal in many ways.
My thinking is that it’s probably somewhere between the two. Multiplicative complexity suggests memorizing a lookup table. But there is regularity in the universe. There is transfer learning.
In particular if combining models is multiplicative in complexity, a model could easily learn two different skills at the same time, while being many orders of magnitude away from being able to use those skills together.
Right. I think transfer learning speaks pretty strongly against this multiplicative model.
Looks like the initial question was here and a result around it was posted here. At a glance I don’t see the comments with counterexamples, and I do see a post with a formal result, which seems like a direct contradiction to what you’re saying, though I’ll look in more detail.
Coming back to the scaling question, I think I agree that multiplicative scaling over the whole model size is obviously wrong. To be more precise, if there’s something like a Q-learning inner optimizer for two tasks, then you need the cross product of the state spaces, so the size of the Q-space could scale close-to-multiplicatively. But the model that condenses the full state space into the Q-space scales additively, and in general I’d expect the model part to be much bigger—like the Q-space has 100 dimensions and the model has 1 billion parameters, so adding a second model of 1 billion parameters and increasing the Q-space to 10k dimensions is mostly additive in practice, even if it’s also multiplicative in a technical sense.
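The arithmetic behind that claim, using the illustrative sizes above (1-billion-parameter model parts, 100-dimensional per-task Q-spaces):

```python
# Illustrative sizes from the comment: each task's world model is ~1e9
# parameters (additive), while the Q-space takes the cross product of
# the two tasks' state abstractions (multiplicative).
model_part = 1_000_000_000        # parameters per task's model
q_single = 100                    # Q-space dims for one task
q_combined = q_single * q_single  # cross product: 10_000 dims combined

total_one_task = model_part + q_single
total_two_tasks = 2 * model_part + q_combined

# The Q-space grew 100x (multiplicative), but the overall size barely
# doubled, because the additive model parts dominate the total.
print(total_two_tasks / total_one_task)  # ~2.0000098
```

So even with a technically multiplicative Q-space, overall growth looks additive whenever the condensing model dwarfs the Q-space, which is the regime assumed here.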
I’m going to update my probability that “GPT-3 can solve X, Y implies GPT-3 can solve X+Y,” and take a closer look at the comments on the linked posts. This also makes me think that it might make sense to try to find simpler problems, even already-mostly-solved problems like Chess or algebra, and try to use this process to solve them with GPT-2, to build up the architecture and search for possible safety issues in the process.
I do see a post with a formal result, which seems like a direct contradiction to what you’re saying, though I’ll look in more detail.
If you mean to suggest this post has a positive result, then I think you’re just mis-reading it; the key result is
The conclusion of this post is the following: if there exists some set of natural tasks for which the fastest way to solve them is to do some sort of machine learning to find a good policy, and there is some task for which that machine learning results in deceptive behavior, then there exists a natural task such that the minimal circuit that solves that task also produces deceptive behavior.
which says that under some assumptions, there exists a task for which the minimal circuit will engage in deceptive behavior (IE is a malign inner optimizer).
The comment with a counterexample on the original post is here.
Yeah, you’re definitely pointing at an important way the framing is awkward. I think the real thing I want to say is “Try to use some humans to align a model in a domain where the model is better than the humans at the task”, and it’d be nice to have a catchy term for that. Probably a model which is better than some humans (e.g. MTurkers) at one task (e.g. medical advice) will also be better than those same humans at many other tasks (e.g. writing horror stories); but at the same time for each task, there’s some set of humans (e.g. doctors in the first case and horror authors in the second) where the model does worse.
I don’t want to just call it “align superhuman AI today” because people will be like “What? We don’t have that”, but at the same time I don’t want to drop “superhuman” from the name because that’s the main reason it feels like “practicing what we eventually want to do.” I considered “partially superhuman”, but “narrowly” won out.
I’m definitely in the market for a better term here.
I don’t want to drop “superhuman” from the name because that’s the main reason it feels like “practicing what we eventually want to do.”
One response I generated was, “maybe it’s just not so much about practicing what we eventually want to do, and that part is an illusion of the poor framing. We should figure out the right framing first and then ask whether it seems like practice, not optimize the framing to make it sound like practice.”
But I think my real response is: why is the superhuman part important, here? Maybe what’s really important is being able to get answers (eg medical advice) without putting them in (eg without fine-tuning on medical advice filtered for high quality), and asking for superhuman ability is just a way of helping ensure that? Or perhaps more generally, there are other things like this which you expect people to do wrong if they’re not dealing with a superhuman case, because you want the technology to eventually work for superhuman cases.
In my head the point of this proposal is very much about practicing what we eventually want to do, and seeing what comes out of that; I wasn’t trying here to make something different sound like it’s about practice. I don’t think that a framing which moved away from that would better get at the point I was making, though I totally think there could be other lines of empirical research under other framings that I’d be similarly excited about or maybe more excited about.
In my mind, the “better than evaluators” part is kind of self-evidently intriguing for the basic reason I said in the post (it’s not obvious how to do it, and it’s analogous to the broad, outside view conception of the long-run challenge which can be described in one sentence/phrase and isn’t strongly tied to a particular theoretical framing):
I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques.
A lot of people in response to the draft were pushing in the direction that I think you were maybe gesturing at (?) -- to make this more specific to “knowing everything the model knows” or “ascription universality”; the section “Why not focus on testing a long-term solution?” was written in response to Evan Hubinger and others. I think I’m still not convinced that’s the right way to go.
I might be on board if “narrowly superhuman” were simply defined differently.
“Try to use some humans to align a model in a domain where the model is better than the humans at the task”
Isn’t it something more like “the model has information sufficient to do better”? EG, in the GPT example, you can’t reliably get good medical advice from it right now, but you strongly suspect it’s possible. That’s a key feature of the whole idea, right?
Is your suggested research program better described as: find (highly capable) models with inaccessible information and get them to reveal that information? (Especially: get them to reveal the inaccessible information without using domain expertise to do so?)
I don’t feel confident enough in the frame of “inaccessible information” to say that the whole agenda is about it. It feels like a fit for “advice”, but not a fit for “writing stories” or “solving programming puzzles” (at least not an intuitive fit—you could frame it as “the model has inaccessible information about [story-writing, programming]” but it feels more awkward to me). I do agree it’s about “strongly suspecting it has the potential to do better than humans” rather than about “already being better than humans.” Basically, it’s about trying to find areas where lackluster performance seems to mostly be about “misalignment” rather than “capabilities” (recognizing those are both fuzzy terms).
Basically, it’s about trying to find areas where lackluster performance seems to mostly be about “misalignment” rather than “capabilities” (recognizing those are both fuzzy terms).
Right, ok, I like that framing better (it obviously fits, but I didn’t generate it as a description before).
This isn’t an objection to the research direction, just a response to how you’re framing it:
If you think GPT-3 is “narrowly superhuman” at medical advice, what topic don’t you think it’s narrowly superhuman in? It seems like you could similarly argue that GPT-3 knows more than the average human about mechanics, chemistry, politics, and just about anything that language is good at describing. (EG, not walking, riding a bike, the concrete skills needed for painting, etc.)
A tool capable of getting GPT-3 to give good medical advice would, probably, be a tool to get GPT-3 to give good advice.
(I am not denying that give good medical advice is a better initial goal/framing.)
This seems to imply that GPT-3 is broadly superhuman, IE, GPT-3 knows more than the average human about a very broad range of things (although GPT-3 might not know more than the best human in any domain). Going further: the implication is that GPT is a kind of mild superintelligence, currently misaligned in a benign way (it just wants to mimic humans) which hides an unknown portion of its intelligence (making it seem subhuman).
I’m not saying this is exactly true. Maybe GPT-3 really is only narrowly superhuman, in the sense that it basically only knows what it needs to know to mimic humans to this level, and essentially doesn’t know anything about medicine etc. In this world, its apparent knowledge of medicine is so mixed with all its other ideas that you can’t extract the truth: it’s not operating on a “true medical stuff + mistakes” model, it just has models of a bunch of possible statements with no way to differentiate good advice from nonsense. In that case, you can only train GPT-3 to give good medical advice by providing an external truth filter of some kind; your project would be basically doomed.
(I think the truth is some unknown point between those two extremes, and I’m quite curious to know exactly where.)
You consider whether AlphaGo could serve a similar role as a test case of aligning narrowly superhuman models, and you reject this idea. I think AlphaGo really is a narrowly superhuman model, and I think your rejection of it is related to this. Because it really is narrowly superhuman, it doesn’t seem like it has this kind of hidden knowledge you want to bring out—it only knows about Go.
So it seems like “narrowly superhuman” might be the wrong framing.
This seems like it’s using the wrong ontology to me.
Like, in my mind, there are things like medical diagnostics or predictions of pharmaceutical reactions, which are much easier cognitive tasks than general conversation, but which humans are specialized away from.
For example, imagine that the severity of side effects from a specific medication can be computed by figuring out 15 variables about the person and putting them into a neural network with 5000 parameters, where the output is somewhere in a six-dimensional space, and this model is part of a general model of human reactions to chemicals.
Then GPT-3 would be in a great position to use people’s reddit posts talking about medication side effects to find this network. I doubt that medical science in our current world could figure that out meaningfully. It would be strongly superhuman in this important medical task, but nowhere near superhuman in any other conversational task.
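To make the arithmetic of that hypothetical concrete (the 15 inputs, 5000 parameters, and 6 outputs come from the example above; the single-hidden-layer shape is my own illustrative assumption), here is a quick check of what hidden width such a network would need:

```python
# The example imagines a 15-input, 6-output network with ~5000 parameters.
# Assuming (hypothetically) one hidden layer of width h, the parameter count is
# 15*h weights + h biases + h*6 weights + 6 biases.
def param_count(h):
    return 15 * h + h + h * 6 + 6

# Find the hidden width closest to 5000 parameters (illustrative only).
h = min(range(1, 1000), key=lambda h: abs(param_count(h) - 5000))
print(h, param_count(h))  # → 227 5000
```

So a network of roughly the size described is tiny by GPT-3 standards, which is the point of the example: the task is hard for humans but computationally small.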
My intuition is that most professional occupations are dominated by problems like this, that are complex enough that we as humans can only capture them as intuitions, but simple enough that the “right” computational solution would be profoundly superhuman in that narrow domain, without being broadly superhuman in any autonomous sense.
Maybe a different reading of your comment is something like: there are so many of these things that if a human had access to superhuman abilities across all these individual narrow domains, that human could use them to create a decisive strategic advantage for themselves, which does seem possibly very concerning.
Let’s see if I can properly state the nature of the disagreement.
I stated that there’s a spectrum between “GPT knows more than the average human across a broad variety of domains, but only uses this knowledge to imitate humans, so it’s not obvious” and “GPT really knows very little, and its apparent stupidity is stupidity-in-fact”.
I somewhat operationalized the difference as one of internal representation: to what extent is GPT using a truth+noise model (where it knows a lot of stuff about reality, and then filters it through the biases of particular perspectives) vs a model where everything is thrown together and it’s not very possible to extract truth without having more information yourself to know what is truth vs noise.
This model has an implication, that Ajeya’s project will work to the extent that we’re toward the smart-GPT end of the spectrum and won’t work to the extent that we’re toward the other end.
I think you’re disagreeing with this implication?
So you’re saying: even if GPT doesn’t internally use anything like a truth+noise model, it’s possible to extract a great deal of useful information about the world by observing the statistics of GPT’s imitation of internet users. For example, because people talk a lot about diseases online, it should be possible to extract statistics about this from GPT. This can produce a useful diagnostic model, even if GPT isn’t internally representing something so useful.
Is this roughly what you are saying?
If that’s what you’re saying, then I agree that such a thing could be possible, but I am unsure if this should count as success in Ajeya’s terms.
If GPT knows a lot of stuff but isn’t telling us because it’s not trying to be helpful, that’s misalignment. Getting it to try to communicate those things to us would be a kind of alignment work.
If the statistics of GPT’s text model can be used to infer useful things about the world, this doesn’t seem related to alignment.
But maybe I’m totally mis-identifying the disagreement you were trying to point at.
Your phrase “in any autonomous sense” makes me think that perhaps you think GPT does have an internal model like the medical model you describe, plus similar models in many different domains, but lacks an “autonomy” property which would be required to make it broadly superhuman in a significant sense. Under this hypothesis, your disagreement with me is that you think I think GPT has “autonomy”.
I guess my response to that would be that GPT probably does lack some kind of “autonomy” (if it means independently pursuing goals by planning, anticipating the consequences of its words) but does have significant planning capacity if asked (eg could construct coherent plans involving its medical knowledge, and in doing so, fluidly match up its narrow medical knowledge with its narrow knowledge in a variety of different areas).
I think this is obscuring (my perception of) the disagreement a little bit.
I think what I’m saying is, GPT-3 probably doesn’t have any general truth+noise models. But I would expect it to copy a truth+noise model from people, when the underlying model is simple.
I then expect GPT-3 to “secretly” have something like an interesting diagnostic model, and probably a few other narrowly superhuman skills.
But I would expect it to not have any kind of significant planning capacity, because that planning capacity is not simple.
In particular my expectation is that coherently putting knowledge from different domains together in generally useful ways is MUCH, MUCH harder than being highly superhuman in narrow domains. Therefore I expect Ajeya’s approach to be both effective, because “narrowly superhuman” can exist, and reasonably safe, because the gap between “narrowly superhuman” (or even “narrowly superhuman in many ways”) and “broadly superhuman” is large, so GPT-3 being broadly superhuman is unlikely.
Phrased differently, I am rejecting your idea of a smartness-spectrum. My intuition is that levels of GPT-N competence will scale the way computers have always scaled at AI tasks—becoming usefully superhuman at a few tasks very quickly, while taking much, much longer to exhibit the kinds of intelligence that are worrying, like modeling human behavior for manipulation.
Thanks for trying further to bridge the gap!
(It would be nice if you flagged a little better which things you think I think / which things you think I disagree with)
OK, that makes sense. So you’re not saying that GPT contains useful diagnostic models in the overall statistics of its models of Reddit users (EG that someone complaining of one symptom will often complain of another), nor are you saying that GPT contains a good model of disease which it then feeds through noise (EG it decides that a particular user is a diabetic, which shapes how it plays that character going forward, but the character itself doesn’t know it is diabetic, so may say some confused things); indeed, you are denying the latter. But what you are saying is that GPT plays the role of users who do have their own internal models, so it must mimic those models (in cases where that’s not too hard to learn).
I find this hard to square with your earlier statement:
Where it sounds like you think GPT will know something medical science does not know.
As for me, I find all of these to be broadly possible. I’d have to think more to give a meaningful plausibility ranking.
How many? I am thinking of “medical diagnostics” as just one example of many many areas of expertise which border on GPT’s competence. I wasn’t thinking there was any special reason to single out medicine in particular as something GPT might have implicit knowledge about.
On my model, if GPT contains implicit medical competence, it probably contains similar competence in “every area”, although I’m not sure how to quantify. Maybe a similar hidden competence in at least 50% of professions at least as numerous as, say, physicist? (Really, what matters is how much discussion of a profession there is online, not how numerous that profession is, but maybe it’s an OK proxy.)
My crux would be something special about medical diagnosis such that we especially expect GPT to have implicit talent there.
It seems like you think planning capacity might be some important difference in our positions?
Personally, I think it’s plausible that GPT does something to plan ahead: it seems broadly useful to think about what could come later in the text (eg where a sentence is going), and potentially, it’s useful to think about that in some detail (to notice when options which seem consistent at the high level are actually not consistent when you try to put all the pieces together (where by “consistent” I mean plausible in terms of everything GPT knows about text)).
But I don’t see this as fundamental to the view I’m expressing in any way.
I think I agree with a heuristic that says something like “GPT isn’t magic, GPT-n will scale the way things usually scale, the highest-probability projection for the near future is the smooth extrapolation from the past”. Not as something I’m confident about, but as the default.
But I do have a big disagreement with what you wrote above.
First I’m going to try to make a very general argument in favor of my spectrum. Then I’m going to give a very concrete scenario, which I think argues for “putting knowledge together” competence.
General Argument
Let’s forget the difference between the truth+noise model and various other models, and just deal with whether GPT has “implicit knowledge”. What exactly “implicit knowledge” means will depend on the extraction technology we invent; I define “implicit knowledge” functionally as any expertise which can be brought out (by something broadly in line with Ajeya’s research program).
My broad argument is just that absent any specific reason to expect implicit knowledge about medicine in particular, conditional on such knowledge, we should expect similar implicit knowledge across a broad variety of domains.
My smartness-spectrum is just the latent variable of how much implicit knowledge GPT has. The argument for the existence of such a spectrum is just the argument that if we see it in one domain, we would expect it in others. If we don’t see it in one, we less expect to see it in others.
Specific Scenario
Suppose the general alignment technology we develop resembles “learning to summarize from human feedback”, the example Ajeya cited of work that looks like what Ajeya wants to point toward.
More specifically, suppose it works like this:
We collect a lot of data of humans judging GPT-3 as being smart and helpful, vs dumb or not helpful.
We train a model (using features from GPT-3 to give the network a good start) to replicate those human judgments. Call this JUDGE.
We fine-tune GPT-3 using JUDGE as our training signal; ie, fine-tune it to be as smart and helpful as possible. Let’s call this GPT-nice.
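The three steps can be sketched in miniature. Everything below is a toy stand-in: hand-made feature vectors play the role of GPT-3 features, logistic regression plays the role of JUDGE, and best-of-n selection stands in for actual fine-tuning against the JUDGE signal:

```python
import math

# Toy stand-in for "features from GPT-3": each candidate reply is a small
# feature vector (all names and numbers here are hypothetical).
# Step 1: human judgments -- 1 = "smart and helpful", 0 = "dumb/unhelpful".
data = [
    ([1.0, 0.9, 0.1], 1),   # specific, on-topic reply
    ([0.9, 0.8, 0.2], 1),
    ([0.1, 0.2, 0.9], 0),   # vague, off-topic reply
    ([0.2, 0.1, 0.8], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Step 2: train JUDGE (a logistic-regression reward model) on the labels.
w = [0.0, 0.0, 0.0]
b = 0.0
lr = 0.5
for _ in range(500):
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - y                      # gradient of the log loss
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

def judge(x):
    """JUDGE's estimate that a reply would be rated helpful."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Step 3 (stand-in for fine-tuning): use JUDGE to pick the best of several
# sampled candidates -- best-of-n selection rather than true RL fine-tuning.
candidates = [[0.95, 0.85, 0.15], [0.15, 0.25, 0.85], [0.5, 0.5, 0.5]]
best = max(candidates, key=judge)
print(best)  # the specific, on-topic candidate wins
```

The real procedure would fine-tune the model's weights against JUDGE rather than filter samples, but the division of labor is the same: human labels train the reward model, and the reward model supervises the generator.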
This procedure may not extract all implicit knowledge, or extract it well, etc etc. However, I fully expect that this procedure will extract some. I just see no reason to think this procedure wouldn’t work. Simply put, I expect GPT-nice to be a legitimately more helpful and intelligent fine-tuning of GPT-3.
(Whether this procedure is safe for, say, GPT-7 is another question.)
Let’s say for the sake of argument that this procedure brings out the kind of medical competence we’ve been discussing, plus similar competence in at least a few other domains.
I generally expect that GPT-nice will have decent “putting knowledge together” skills, mainly because GPT-3 is already not too bad at this. Yes, sometimes GPT misses common-sense implications. However, by and large, if you put facts from different domains into the text history, GPT will come up with continuations which make sense. So I would postulate that GPT-nice will be at least as good as GPT-3 with the relevant facts placed in history.
Suppose for the sake of argument that GPT-nice is good at medical diagnosis and separately good at giving dietary advice. Further suppose for the sake of argument that GPT-3 is OK at telling you what dietary changes are implied by medical conditions. Then I would suppose GPT-nice is at least OK at giving dietary advice tailored to your medical conditions.
Or suppose GPT-nice is good at diagnosing psychological disorders, and good at giving social advice. Then suppose GPT-3 is already halfway decent at anticipating social problems associated with psychological disorders, when prompted correctly. Then I would suppose that GPT-nice would be halfway decent at tailoring its social advice to any psychological problems a person has.
I’m replying on my phone right now because I can’t stop thinking about it. I will try to remember to follow up when I can type more easily.
I think the vague shape of what I think I disagree about is how dense GPT-3′s sets of implicit knowledge are.
I do think we agree that GPT-5000 will be broadly superhuman, even if it just has a grab bag of models in this way, for approximately the reasons you give.
I’m thinking about “intelligent behavior” as something like the set of real numbers, and “human behavior” as covering something like rational numbers, so we can get very close to most real numbers but it takes some effort to fill in the decimal expansion. Then I’m thinking of GPT-N as being something like integers+1/N. As N increases, this becomes close enough to the rational numbers to approximate real numbers, and can be very good at approximating some real numbers, but can’t give you incomputable numbers (unaligned outcomes) and usually won’t give you duplicitous behavior (numbers that look very simple at first approximation but actually aren’t, like .2500000000000004, which seems to be 1⁄4 but secretly isn’t). I’m not sure where that intuition comes from but I do think I endorse it with moderate confidence.
Basically I think for minimal circuit reasons that if “useful narrowly” emerges in GPT-N, then “useful in that same domain but capable of intentionally doing a treacherous turn” emerges later. My intuition is that this won’t be until GPT-(N+3) or more, so if you are able to get past unintentional turns like “the next commenter gives bad advice” traps, this alignment work is very safe, and important to do as fast as possible (because attempting it later is dangerous!)
In a world where GPT-(N+1) can do a treacherous turn, this is very dangerous, because you might accidentally forget to check if GPT-(N-1) can do it, and get the treacherous turn.
My guess is that you would agree that “minimal circuit that gives good advice” is smaller than “circuit that gives good advice but will later betray you”, and therefore there exist two model sizes where one is dangerous and one is safe but useful. I know I saw posts on this a while back, so there may be relevant math about what that gap might be, or it might be unproven but with some heuristics of what the best result probably is.
My intuition is that combining narrow models is multiplicative, so that adding a social manipulation model will always add an order of magnitude of complexity. My guess is that you don’t share this intuition. You may think of model combination as additive, in which case any model bigger than a model that can betray you is very dangerous, or you might think the minimal circuit for betrayal is not very large, or you might think that GPT-2-nice would be able to give good advice in many ways so GPT-3 is already big enough to contain good advice plus betrayal in many ways.
In particular if combining models is multiplicative in complexity, a model could easily learn two different skills at the same time, while being many orders of magnitude away from being able to use those skills together.
There was indeed a post posing this question a while back, and discussion in the comments included a counterexample: a construction of a minimal circuit that would be malign.
To my eye, the whole crux of the inner alignment problem is that we have no results saying things like:
The simplest program which solves a problem is not an inner optimizer
The minimal circuit which solves a problem is not an inner optimizer
The fastest program solving a problem is not an inner optimizer
Or any such thing. If we had such a result, then we’d have a grip on the problem. But we don’t currently have any result like that, nor any plausible direction for proving such a result. And indeed, thought on the problem suggests that these hypotheses are probably not true; rather, it seems surprisingly plausible, once you think about it, that indeed minimal solutions may sometimes be inner optimizers.
My thinking is that it’s probably somewhere between the two. Multiplicative complexity suggests memorizing a lookup table. But there is regularity in the universe. There is transfer learning.
Right. I think transfer learning speaks pretty strongly against this multiplicative model.
Looks like the initial question was here and a result around it was posted here. At a glance I don’t see the comments with counterexamples, and I do see a post with a formal result, which seems like a direct contradiction to what you’re saying, though I’ll look in more detail.
Coming back to the scaling question, I think I agree that multiplicative scaling over the whole model size is obviously wrong. To be more precise, if there’s something like a Q-learning inner optimizer for two tasks, then you need the cross product of the state spaces, so the size of the Q-space could scale close-to-multiplicatively. But the model that condenses the full state space into the Q-space scales additively, and in general I’d expect the model part to be much bigger—like the Q-space has 100 dimensions and the model has 1 billion parameters, so adding a second model of 1 billion parameters and increasing the Q-space to 10k dimensions is mostly additive in practice, even if it’s also multiplicative in a technical sense.
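A quick back-of-the-envelope version of that claim, using the hypothetical sizes above (1 billion parameters per model part, a 100-dimensional Q-space per domain):

```python
# Hypothetical sizes from the discussion above (all numbers illustrative).
model_params_per_domain = 1_000_000_000   # the big "condensing model" part
q_dims_single = 100                        # Q-space dimensions for one domain

# One domain: model part plus its Q-space.
single = model_params_per_domain + q_dims_single

# Two domains: the model parts add, but the Q-space takes the cross
# product of the two state spaces (100 x 100 = 10_000 dimensions).
two = 2 * model_params_per_domain + q_dims_single ** 2

# The multiplicative Q-space term is negligible next to the additive model term.
ratio = two / single
print(round(ratio, 4))  # → 2.0, i.e. effectively pure additive scaling
```

Even though one term genuinely multiplies, the total is dominated by the additive term, which is the sense in which the combination is "mostly additive in practice."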
I’m going to update my probability that “GPT-3 can solve X, Y implies GPT-3 can solve X+Y,” and take a closer look at the comments on the linked posts. This also makes me think that it might make sense to try to find simpler problems, even already-mostly-solved problems like Chess or algebra, and try to use this process to solve them with GPT-2, to build up the architecture and search for possible safety issues in the process.
If you mean to suggest this post has a positive result, then I think you’re just mis-reading it; the key result is
which says that under some assumptions, there exists a task for which the minimal circuit will engage in deceptive behavior (IE is a malign inner optimizer).
The comment with a counterexample on the original post is here.
I see, I definitely didn’t read that closely enough.
Yeah, you’re definitely pointing at an important way the framing is awkward. I think the real thing I want to say is “Try to use some humans to align a model in a domain where the model is better than the humans at the task”, and it’d be nice to have a catchy term for that. Probably a model which is better than some humans (e.g. MTurkers) at one task (e.g. medical advice) will also be better than those same humans at many other tasks (e.g. writing horror stories); but at the same time for each task, there’s some set of humans (e.g. doctors in the first case and horror authors in the second) where the model does worse.
I don’t want to just call it “align superhuman AI today” because people will be like “What? We don’t have that”, but at the same time I don’t want to drop “superhuman” from the name because that’s the main reason it feels like “practicing what we eventually want to do.” I considered “partially superhuman”, but “narrowly” won out.
I’m definitely in the market for a better term here.
One response I generated was, “maybe it’s just not so much about practicing what we eventually want to do, and that part is an illusion of the poor framing. We should figure out the right framing first and then ask whether it seems like practice, not optimize the framing to make it sound like practice.”
But I think my real response is: why is the superhuman part important, here? Maybe what’s really important is being able to get answers (eg medical advice) without putting them in (eg without fine-tuning on medical advice filtered for high quality), and asking for superhuman ability is just a way of helping ensure that? Or perhaps more generally, there are other things like this which you expect people to do wrong if they’re not dealing with a superhuman case, because you want the technology to eventually work for superhuman cases.
In my head the point of this proposal is very much about practicing what we eventually want to do, and seeing what comes out of that; I wasn’t trying here to make something different sound like it’s about practice. I don’t think that a framing which moved away from that would better get at the point I was making, though I totally think there could be other lines of empirical research under other framings that I’d be similarly excited about or maybe more excited about.
In my mind, the “better than evaluators” part is kind of self-evidently intriguing for the basic reason I said in the post (it’s not obvious how to do it, and it’s analogous to the broad, outside view conception of the long-run challenge which can be described in one sentence/phrase and isn’t strongly tied to a particular theoretical framing):
A lot of people in response to the draft were pushing in the direction that I think you were maybe gesturing at (?) -- to make this more specific to “knowing everything the model knows” or “ascription universality”; the section “Why not focus on testing a long-term solution?” was written in response to Evan Hubinger and others. I think I’m still not convinced that’s the right way to go.
I might be on board if “narrowly superhuman” were simply defined differently.
Isn’t it something more like “the model has information sufficient to do better”? EG, in the GPT example, you can’t reliably get good medical advice from it right now, but you strongly suspect it’s possible. That’s a key feature of the whole idea, right?
Is your suggested research program better described as: find (highly capable) models with inaccessible information and get them to reveal that information? (Especially: get them to reveal the inaccessible information without using domain expertise to do so?)
I don’t feel confident enough in the frame of “inaccessible information” to say that the whole agenda is about it. It feels like a fit for “advice”, but not a fit for “writing stories” or “solving programming puzzles” (at least not an intuitive fit—you could frame it as “the model has inaccessible information about [story-writing, programming]” but it feels more awkward to me). I do agree it’s about “strongly suspecting it has the potential to do better than humans” rather than about “already being better than humans.” Basically, it’s about trying to find areas where lackluster performance seems to mostly be about “misalignment” rather than “capabilities” (recognizing those are both fuzzy terms).
Right, ok, I like that framing better (it obviously fits, but I didn’t generate it as a description before).