Yup, thanks, edited
Planned summary for the Alignment Newsletter:
The biggest <@GPT-2 model@>(@Better Language Models and Their Implications@) had 1.5 billion parameters, and since its release people have trained language models with up to 17 billion parameters. This paper reports GPT-3 results, where the largest model has _175 billion_ parameters, a 10x increase over the previous largest language model. To get the obvious out of the way, it sets a new state of the art (SOTA) on zero-shot language modeling (evaluated only on Penn Tree Bank, as other evaluation sets were accidentally a part of their training set).

The primary focus of the paper is on analyzing the _few-shot learning_ capabilities of GPT-3. In few-shot learning, after an initial training phase, models are presented at test time with a small number of examples of a new task, and must then execute that task on new inputs. Such problems are usually solved using _meta-learning_ or _finetuning_; e.g. at test time MAML takes a few gradient steps on the new examples to produce a model finetuned for the test task. In contrast, the key hypothesis with GPT-3 is that language is so diverse that doing well on it already requires adaptation to the input, and so the learned language model will _already be a meta-learner_. This implies that they can simply “prime” the model with examples of a task they care about, and the model can _learn_ what task is supposed to be performed, and then perform that task well.

For example, consider the task of generating a sentence using a new made-up word whose meaning has been explained. In one notable example, the prompt for GPT-3 is:

_A “whatpu” is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:_

_We were traveling in Africa and we saw these very cute whatpus._

_To do a “farduddle” means to jump up and down really fast. An example of a sentence that uses the word farduddle is:_

Given this prompt, GPT-3 generates the following output:

_One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles._

The paper tests on several downstream tasks for which benchmarks exist (e.g. question answering), and reports zero-shot, one-shot, and few-shot performance on all of them. On some tasks, the few-shot version sets a new SOTA, _despite not being finetuned using the benchmark’s training set_; on others, GPT-3 lags considerably behind finetuning approaches.

The paper also consistently shows that few-shot performance increases as the number of parameters increases, and the rate of increase is faster than the corresponding rate for zero-shot performance. While they don’t outright say it, we might take this as suggestive evidence that as models get larger, they are more incentivized to learn “general reasoning abilities”.

The most striking example of this is in arithmetic, where the six smallest models (up to 6.7 billion parameters) have poor performance (< 20% on 2-digit addition), the next model (13 billion parameters) jumps to > 50% on 2-digit addition and subtraction, and the final model (175 billion parameters) achieves > 80% on 3-digit addition and subtraction and a perfect 100% on 2-digit addition (all in the few-shot regime). They explicitly look for their test problems in the training set, and find very few examples, suggesting that the model really is learning “how to do addition”; in addition, when it is incorrect, its errors tend to be mistakes like “forgetting to carry a 1”.

On broader impacts, the authors talk about potential misuse, fairness and bias concerns, and energy usage concerns, and say about what you’d expect.
One interesting note: “To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed.” They find that while there was significant discussion of misuse, they found no successful deployments. They also consulted with professional threat analysts about the possibility of well-resourced actors misusing the model. According to the paper: “The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage.”
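The “priming” described in the summary above is just prompt construction: concatenate a handful of solved examples and let the model continue the text. A minimal sketch in plain Python (no model call; the arithmetic pairs and the Q/A format are my own illustrative choices, not items from the paper’s evaluation):

```python
def build_few_shot_prompt(examples, query):
    """Concatenate k solved examples with one unsolved query, as in
    few-shot evaluation: the model must infer the task from the examples."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

# Two-digit addition in the few-shot regime: nothing identifies the task
# except the primed examples themselves.
examples = [("48 + 76", "124"), ("67 + 21", "88")]
prompt = build_few_shot_prompt(examples, "29 + 57")
print(prompt)
```

Note that in the few-shot regime there are no gradient updates: the only difference between zero-, one-, and few-shot is how many solved examples appear in the prompt.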
For a long time, I’ve heard people quietly hypothesizing that with a sufficient diversity of tasks, regular gradient descent could lead to general reasoning abilities allowing for quick adaptation to new tasks. This paper is a powerful demonstration of that hypothesis.

One critique is that GPT-3 still takes far too long to “identify” a task: why does it need 50 examples of addition in order to figure out that what it should do is addition? Why isn’t one sufficient? It’s not like there are a bunch of other conceptions of “addition” that need to be disambiguated. I’m not sure what’s going on mechanistically, but we can infer from the paper that as language models get larger, the number of examples needed to achieve a given level of performance goes down, so it seems like there is some “strength” of general reasoning ability that goes up. Still, it would be really interesting to figure out mechanistically how the model is “reasoning”.

This also provides some empirical evidence in support of the threat model underlying <@inner alignment concerns@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@): those concerns are predicated on neural nets that implicitly learn to optimize. (To be clear, I think it provides empirical support for neural nets learning to “reason generally”, not for neural nets learning to implicitly “perform search” in pursuit of a “mesa objective”; see also <@Is the term mesa optimizer too narrow?@>.)
This post describes eleven “full” AI alignment proposals (where the goal is to build a powerful, beneficial AI system using current techniques), and evaluates them on four axes:

1. **Outer alignment:** Would the optimal policy for the specified loss function be aligned with us? See also this post.
2. **Inner alignment:** Will the model that is _actually produced_ by the training process be aligned with us?
3. **Training competitiveness:** Is this an efficient way to train a powerful AI system? More concretely, if one team had a “reasonable lead” over other teams, would they keep at least some of that lead if they used this algorithm?
4. **Performance competitiveness:** Will the trained model have good performance (relative to other models that could be trained)?

Seven of the eleven proposals are of the form “recursive outer alignment technique” plus “<@technique for robustness@>(@Worst-case guarantees (Revisited)@)”. The recursive outer alignment technique is either <@debate@>(@AI safety via debate@), <@recursive reward modeling@>(@Scalable agent alignment via reward modeling@), or some flavor of <@amplification@>(@Capability amplification@). The technique for robustness is either transparency tools to “peer inside the model”, <@relaxed adversarial training@>(@Relaxed adversarial training for inner alignment@), or intermittent oversight by a competent supervisor. An additional two proposals are of the form “non-recursive outer alignment technique” plus “technique for robustness”; the non-recursive techniques are vanilla reinforcement learning in a multiagent environment, and narrow reward learning.

Another proposal is Microscope AI, in which we train AI systems to simply understand vast quantities of data, and then by peering into the AI system we can learn the insights that the AI system learned, leading to a lot of value. We wouldn’t have the AI system act in the world, thus eliminating a large swath of potential bad outcomes.
Finally, we have STEM AI, where we try to build an AI system that operates in a sandbox and is very good at science and engineering, but doesn’t know much about humans. Intuitively, such a system would be very unlikely to deceive us (and would probably be incapable of doing so).

The post contains a lot of additional content that I didn’t do justice to in this summary. In particular, I’ve said nothing about the analysis of each of these proposals on the four axes listed above; the full post talks about all 44 combinations.
I’m glad this post exists: while most of the specific proposals could be found by patching together content spread across other blog posts, there was a severe lack of a single article laying out a full picture for even one proposal, let alone all eleven in this post.

I usually don’t think about outer alignment as what happens with optimal policies, as assumed in this post: when you’re talking about loss functions _in the real world_ (as I think this post is trying to do), _optimal_ behavior can be weird and unintuitive, in ways that may not actually matter. For example, arguably for any loss function, the optimal policy is to hack the loss function so that it always outputs zero (or perhaps negative infinity).
Why aren’t studios already spending twice as much for 10% better movies? It doesn’t seem like it ought to be that hard; it’s not like the tropes on the bad writing index are hurting for examples. I’d guess that this is mainly a case of the difficulty of hiring experts: it’s hard to hire people with better taste than whoever’s in charge.
Another hypothesis is that only a small fraction of the population would appreciate better writing, and the additional constraints that poses would reduce quality for the remaining people.
(I wouldn’t be surprised if I fell into the latter category, though I’m not sure.)
Seems like that only makes sense because you specified that “increasing production capacity” and “upgrading machines” are the things that I’m not allowed to do, and those are things I have a conceptual grasp on. And even then—am I allowed to repair machines that break? What about buying a new factory? What if I force workers to work longer hours? What if I create effective propaganda that causes other people to give you paperclips? What if I figure out that by using a different source of steel I can reduce the defect rate? I am legitimately conceptually uncertain whether these things count as “increasing production capacity / upgrading machines”.
As another example, what does it mean to optimize for “curing cancer” without becoming more able to optimize for “curing cancer”?
Interpretability seems to be useful for a wide variety of AI alignment proposals. Presumably, different proposals require different kinds of interpretability. This post analyzes this question, to help researchers prioritize across different kinds of interpretability research.

At a high level, interpretability can either make our current experiments more informative, helping us answer _research questions_ (e.g. “when I set up a <@debate@>(@AI safety via debate@) in this particular way, does honesty win?”), or it can be used as part of an alignment technique to train AI systems. The former only has to be done once (to answer the question), and so we can spend a lot of effort on it, while the latter must be efficient in order to be competitive with other AI algorithms.

They then analyze how interpretability could apply to several alignment techniques, and come to several tentative conclusions. For example, they suggest that for recursive techniques like iterated amplification, we may want comparative interpretability that can explain the changes between models (e.g. between distillation steps in iterated amplification). They also suggest that with interpretability techniques that can be used by other ML models, we could regularize a trained model to be aligned, without requiring a human in the loop.
I like this general direction of thought, and hope that people continue to pursue it, especially since I think interpretability will be necessary for inner alignment. I think it would be easier to build on the ideas in this post if they were made more concrete.
I don’t know what it means. How do you optimize for something without becoming more able to optimize for it? If you had said this to me and I hadn’t read your sequence (and so didn’t know what you were trying to say), I’d have given you a blank stare. The closest thing I have to an interpretation is “be myopic / greedy”, but that limits your AI system to the point of uselessness.
Like, “optimize for X” means “do stuff over a period of time such that X goes up as much as possible”. “Becoming more able to optimize for X” means “do a thing such that in the future you can do stuff such that X goes up more than it otherwise would have”. The only difference between these two is actions that you can do for immediate reward.
(This is just saying in English what I was arguing for in the math comment.)
If the conceptual version is “we keep A’s power low”, then that probably works.
If the conceptual version is “tell A to optimize R without becoming more able to optimize R”, then I have the same objection.
Some thoughts on this discussion:
1. Here’s the conceptual comment and the math comment where I’m pessimistic about replacing the auxiliary set with the agent’s own reward.
However, the agent’s reward is usually not the true human utility, or a good approximation of it. If the agent’s reward was the true human utility, there would be no need to use an impact measure in the first place.
You seem to have misunderstood. Impact to a person is change in their AU. The agent is not us, and so it’s insufficient for the agent to preserve its ability to do what we want – it has to preserve our ability to do what we want!
Hmm, I think you’re misunderstanding Vika’s point here (or at least, I think there is a different point, whether Vika was saying it or not). Here’s the argument, spelled out in more detail:
1. Impact to an arbitrary agent is change in their AU.
2. Therefore, to prevent catastrophe via regularizing impact, we need to have an AI system that is penalized for changing a human’s AU.
3. By assumption, the AI’s utility function R_A is different from the human’s R_H (otherwise there wouldn’t be any problem).
4. We need to ensure H can pursue R_H, but we’re regularizing A pursuing R_A. Why should we expect the latter to cause the former to happen?
One possible reason is that there’s an underlying factor, namely how much power A has, and as long as this is low it implies that any agent (including H) can pursue their own reward about as much as they could in A’s absence (this is basically CCC). Then, if we believe that regularizing A pursuing R_A keeps A’s power low, we would expect it also means that H remains able to pursue R_H. I don’t really believe the premise there (unless you regularize so strongly that the agent does nothing).
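To spell out the mismatch in step 4 in symbols (my notation, not from the discussion): writing $V^X_R(s)$ for agent $X$’s attainable value for reward function $R$ from state $s$, and $s_{\text{base}}$ for the baseline state, the argument compares

```latex
\underbrace{\bigl|\,V^{H}_{R_H}(s) - V^{H}_{R_H}(s_{\text{base}})\,\bigr|}_{\text{impact on the human (what we care about)}}
\quad\text{vs.}\quad
\underbrace{\bigl|\,V^{A}_{R_A}(s) - V^{A}_{R_A}(s_{\text{base}})\,\bigr|}_{\text{what the penalty actually regularizes}}
```

The hope is that A’s power acts as a common cause: keeping the second quantity small keeps A’s power low, which in turn keeps the first quantity small for every agent H.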
I finally read Rational preference: Decision theory as a theory of practical rationality, and it basically has all of the technical content of this post; I’d recommend it as a more in-depth version of this post. (Unfortunately I don’t remember who recommended it to me, whoever you are, thanks!) Some notable highlights:
It is, I think, very misleading to think of decision theory as telling you to maximize your expected utility. If you don’t obey its axioms, then there is no utility function constructable for you to maximize the expected value of. If you do obey the axioms, then your expected utility is always maximized, so the advice is unnecessary. The advice, ‘Maximize Expected Utility’ misleadingly suggests that there is some quantity, definable and discoverable independent of the formal construction of your utility function, that you are supposed to be maximizing. That is why I am not going to dwell on the rational norm, Maximize Expected Utility! Instead, I will dwell on the rational norm, Attend to the Axioms!
Very much in the spirit of the parent comment.
Unfortunately, the Fine Individuation solution raises another problem, one that looks deeper than the original problems. The problem is that Fine Individuation threatens to trivialize the axioms.
(Fine Individuation is basically the same thing as moving from preferences-over-snapshots to preferences-over-universe-histories.)
All it means is that a person could not be convicted of intransitive preferences merely by discovering things about her practical preferences. [...] There is no possible behavior that could reveal an impractical preference
His solution is to ask people whether they were finely individuating, and if they weren’t, then you can conclude they are inconsistent. This is kinda sorta acknowledging that you can’t notice inconsistency from behavior (“practical preferences” aka “choices that could actually be made”), though that’s a somewhat inaccurate summary.
There is no way that anyone could reveal intransitive preferences through her behavior. Suppose on one occasion she chooses X when the alternative was Y, on another she chooses Y when the alternative was Z, and on a third she chooses Z when the alternative was X. But that is nonsense; there is no saying that the Y she faced on the first occasion was the same as the Y she faced on the second. Those alternatives could not have been just the same, even leaving aside the possibility of individuating them by reference to what else could have been chosen. They will be alternatives at different times, and they will have other potentially significant differentia.
Basically making the same point with the same sort of construction as the OP.
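To make the fine-individuation point concrete, here is a toy sketch (my own construction, not from the book): the same three observed choices do or do not reveal a preference cycle depending on whether the alternatives on different occasions are treated as the same object.

```python
def has_preference_cycle(choices):
    """choices: list of (chosen, rejected) pairs.
    Returns True iff the revealed 'preferred to' relation contains a
    cycle, found by depth-first search over the directed graph."""
    graph = {}
    for chosen, rejected in choices:
        graph.setdefault(chosen, set()).add(rejected)

    def reachable(start, target, seen=None):
        seen = seen if seen is not None else set()
        for nxt in graph.get(start, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reachable(nxt, target, seen):
                    return True
        return False

    return any(reachable(node, node) for node in graph)

# Coarsely labeled, the choices X>Y, Y>Z, Z>X look intransitive:
coarse = [("X", "Y"), ("Y", "Z"), ("Z", "X")]
# Finely individuated (each occasion's alternatives are distinct
# objects), the cycle disappears:
fine = [("X@t1", "Y@t1"), ("Y@t2", "Z@t2"), ("Z@t3", "X@t3")]
print(has_preference_cycle(coarse))  # True
print(has_preference_cycle(fine))    # False
```

The fine labels are doing all the work: no finite set of observed choices forces a cycle once you are allowed to individuate alternatives by occasion.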
Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel).
Maybe? How do you decide where to start the inaction baseline? In RL the episode start is an obvious choice, but it’s not clear how to apply that for humans.
(I only have this objection when trying to explain what “impact” means to humans; it seems fine in the RL setting. I do think we’ll probably stop relying on the episode abstraction eventually, so we would eventually need to not rely on it ourselves, but plausibly that can be dealt with in the future.)
Also, under this inaction baseline, the roads are perpetually empty, and so you’re always feeling impact from the fact that you can’t zoom down the road at 120 mph, which seems wrong.
I agree that counterfactuals are hard, but I’m not sure that difficulty can be avoided. Your baseline of “what the human expected the agent to do” is also a counterfactual, since you need to model what would have happened if the world unfolded as expected.
Sorry, what I meant to imply was “baselines are counterfactuals, and counterfactuals are hard, so maybe no ‘natural’ baseline exists”. I certainly agree that my baseline is a counterfactual.
On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn’t need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.
Yes, that’s my main point. I agree that there’s no clear way to take my baseline and implement it in code, and that it depends on fuzzy concepts that don’t always apply (even when interpreted by humans).
Thank you for reading closely enough to notice the 5 characters used to mark the occasion :)
My main note is that my comment was just about the concept of rigging a learning process given a fixed prior over rewards. I certainly agree that the general strategy of “update a distribution over reward functions” has lots of as-yet-unsolved problems.
2 Rigging learning or reward-maximisation?
I agree that you can cast any behavior as reward maximization with a complicated enough reward function. This does imply that you have to be careful with your prior / update rule when you specify an assistance game / CIRL game.
I’m not arguing “if you write down an assistance game you automatically get safety”; I’m arguing “if you have an optimal policy for some assistance game you shouldn’t be worried about it rigging the learning process relative to the assistance game’s prior”. Of course, if the prior + update rule themselves lead to bad behavior, you’re in trouble; but it doesn’t seem like I should expect that to be via rigging as opposed to all the other ways reward maximization can go wrong.
3 AI believing false facts is bad
Tbc I agree with this and was never trying to argue against it.
4 Changing preferences or satisfying them
Thus there is no distinction between “compound” and “non-compound” rewards; we can’t just exclude the first type. So saying a reward is “fixed” doesn’t mean much.
I agree that updating on all reward functions under the assumption that humans are rational is going to be very strange and probably unsafe.
5 Humans learning new preferences
I agree this is a challenge that assistance games don’t even come close to addressing.
6 The AI doesn’t trick any part of itself
Your explanation in this section involves a compound reward function, instead of a rigged learning process. I agree that these are problems; I was really just trying to make a point about rigged learning processes.
The key point is that AUP_conceptual relaxes the problem: “If we could robustly penalize the agent for intuitively perceived gains in power (whatever that means), would that solve the problem?” This is not trivial.
Probably I’m just missing something, but I don’t see why you couldn’t say something similar about:
“preserve human autonomy”, “be nice”, “follow norms”, “do what I mean”, “be corrigible”, “don’t do anything I wouldn’t do”, “be obedient”
If we could robustly reward the agent for intuitively perceived nice actions (whatever that means), would that solve the problem?
It seems like the main difference is that for power in particular, there’s more hope that we could formalize it without reference to humans (which seems harder to do for e.g. “niceness”), but then my original point applies.
has to do with easily exploitable opportunities in a given situation
Sorry, I don’t understand what you mean here.
However, still note that this solution doesn’t have anything to do with human values in particular.
I feel like I can still generate lots of solutions that have that property. For example, “preserve human autonomy”, “be nice”, “follow norms”, “do what I mean”, “be corrigible”, “don’t do anything I wouldn’t do”, “be obedient”.
All of these depend on the AI having some knowledge about humans, but so does penalizing power.
If you are interested in the impact on the human that is caused by the agent (where the agent is the source), the natural choice would be the stepwise inaction baseline (comparing to the agent doing nothing).
To the extent that there is a natural choice (counterfactuals are hard), I think it would be “what the human expected the agent to do” (the same sort of reasoning that led to the previous state baseline).
This gives the same answer as the stepwise inaction baseline in your example (because usually we don’t expect a specific person to step on our feet or to steal our wallet).
An example where it gives a different answer is in driving. The stepwise inaction baseline says “impact is measured relative to all the other drivers going comatose”, so in the baseline state many accidents happen, and you get stuck in a huge traffic jam. Thus, all the other drivers are constantly having a huge impact on you by continuing to drive!
In contrast, the baseline of “what the human expected the agent to do” gets the intuitive answer—the human expected all the other drivers to drive normally, and so normal driving has ~zero impact, whereas if someone actually did fall comatose and cause an accident, that would be quite impactful.
EDIT: Tbc, I think this is the “natural choice” if you want to predict what humans would say is impactful; I don’t have a strong opinion on what the “natural choice” would be if you wanted to successfully prevent catastrophe via penalizing “impact”. (Though in this case the driving example still argues against stepwise inaction.)
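A toy numerical version of the driving example, with made-up attainable-utility numbers, shows how the two baselines disagree:

```python
# Hypothetical attainable utility (AU) for the human in three worlds;
# the numbers are illustrative, not from any formal model.
AU = {
    "actual": 10.0,             # other drivers drive normally
    "stepwise_inaction": 2.0,   # other drivers go comatose: accidents, jams
    "expected": 10.0,           # what the human expected: normal driving
}

def impact(actual, baseline):
    """Impact as |change in the human's AU| relative to a baseline world."""
    return abs(AU[actual] - AU[baseline])

# Stepwise inaction baseline: normal driving registers as a large impact.
print(impact("actual", "stepwise_inaction"))  # 8.0
# Expected-behavior baseline: normal driving is ~zero impact.
print(impact("actual", "expected"))           # 0.0
```

Under the expected-behavior baseline, impact only shows up when the world departs from expectations (e.g. a driver actually falling comatose), matching the intuitive judgment.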
If I’m an optimal agent with perfect beliefs about what the (deterministic) world will do, even intuitively I would never say that my power changes. Can you give me an example of what such an agent could do that would change its power?
If by “intuitive” you mean “from the perspective of real humans, even if the agent is optimal / superintelligent”, then I feel like there are lots of conceptual solutions to AI alignment, like “do what I mean”, “don’t do bad things”, “do good things”, “promote human flourishing”, etc.
The thing that I believe (irrespective of whether RI says it or not) is:
“Humans find new information ‘impactful’ to themselves when it changes how good they expect their future to be.” (In practice, there’s a host of complications because humans are messy, e.g. uncertainty about how good the future is also tends to feel impactful.)
In particular, if humans had perfect beliefs and knew exactly what would happen at all times, no information could ever change how good they expect their future to be, and so nothing could ever be impactful.
Since this is tied to new information changing what you expect, it seems like the natural baseline is the previous state.
Separately, I also think that RI was trying to argue for this conclusion, but I’ll defer to Alex about what he was or wasn’t trying to claim / argue for.