Same person as nostalgebraist2point0, but now I have my account back.
My tumblr blog
My GPT bot
My fiction writing
because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future
I personally doubt that this is true, which is maybe the crux here.
This seems like a possibly common assumption, and I’d like to see a more fleshed-out argument for it. I remember Scott making this same assumption in a recent conversation:
I agree humans aren’t like that, and that this is surprising.
Maybe this is because humans aren’t real consequentialists, they’re perceptual control theory agents trying to satisfy finite drives? [...] Might gradient descent produce a PCT agent instead of a mesa-optimizer? I don’t know. My guess is maybe, but that optimizers would be more, well, optimal [...]
But is it true that “optimizers are more optimal”?
When I’m designing systems or processes, I tend to find that the opposite is true—for reasons that are basically the same reasons we’re talking about AI safety in the first place.
A powerful optimizer, with no checks or moderating influences on it, will tend to make extreme Goodharted choices that look good according to its exact value function, and very bad (because extreme) according to almost any other value function.
Long before things reach the point where the outer optimizer is developing a superintelligent inner optimizer, it has plenty of chances to learn the general design principle that “putting all the capabilities inside an optimizing outer loop ~always does something very far from what you want.”
Some concrete examples from real life:
Using gradient descent. I use gradient descent to make things literally every day. But gradient descent is never the outermost loop of what I’m doing.
That would look like “setting up a single training run, running it, and then using the model artifact that results, without giving yourself freedom to go back and do it over again (unless you can find a way to automate that process itself with gradient descent).” This is a peculiar policy which no one follows. The individual artifacts resulting from individual training runs are quite often bad: they’re overfit, or underfit, or training diverged, or they got great val metrics but the output sucks and it turns out your val set has problems, or they got great val metrics but the output isn’t meaningfully better and the model is 10x slower than the last one and the improvement isn’t worth it, or they are legitimately the best thing you can get on your dataset but that causes you to realize you really need to go gather more data, or whatever.
All the impressive ML artifacts made “by gradient descent” are really outputs of this sort of process of repeated experimentation, refining of targets, data gathering and curation, reframing of the problem, etc. We could argue over whether this process is itself a form of “optimization,” but in any case we have in our hands a (truly) powerful thing that very clearly is optimization, and yet to leverage it effectively without getting Goodharted, we have to wrap it inside some other thing.
Delegating to other people. To quote myself from here: “How would I want people to behave if I – as in actual me, not a toy character like Alice or Bob – were managing a team of people on some project? I wouldn’t want them to be ruthless global optimizers; I wouldn’t want them to formalize the project goals, derive their paperclip-analogue, and go off and do that. I would want them to take local iterative steps, check in with me and with each other a lot, stay mostly relatively close to things already known to work but with some fraction of time devoted to far-out exploration, etc.”
There are of course many Goodhart horror stories about organizations that focus too hard on metrics. The way around this doesn’t seem to be “find the really truly correct metrics,” since optimization will always find a way to trick you. Instead, it seems crucial to include some mitigating checks on the process of optimizing for whatever metrics you pick.
Checks against dictatorship as a principle of government design, as opposed to the alternative of just trying to find a really good dictator.
Mostly self-explanatory. Admittedly a dictator is not likely to be a coherent optimizer, but I expect a dictatorship to behave more like one than a parliamentary democracy.
If coherence is a convergent goal, why don’t all political sides come together and build a system that coherently does something, whatever that might be? In this context, at least, it seems intuitive enough that no one really wants this outcome.
In brief, I don’t see how to reconcile
“in the general case, coherent optimizers always end up doing some bad, extreme Goodharted thing” (which matches both my practical experience and a common argument in AI safety), and
“outer optimizers / deliberating agents will tend to converge on building (more) coherent (inner) optimizers, because they expect this to better satisfy their own goals,” i.e. the “optimizers are more optimal” assumption.
EDIT: an additional consideration applies in the situation where the AI is already at least as smart as us, and can modify itself to become more coherent. Because I’d expect that AI to notice the existence of the alignment problem just as much as we do (why wouldn’t it?). I mean, would you modify yourself into a coherent EU-maximizing superintelligence with no alignment guarantees? If that option became available in real life, would you take it? Of course not. And our hypothetical capable-but-not-coherent AI is facing the exact same question.
Sure. Although before I do, I want to qualify the quoted claim a bit.
When I say “our goals change over time,” I don’t mean “we behave something like EU maximizers with time-dependent utility functions.” I think we don’t behave like EU maximizers, in the sense of having some high-level preference function that all our behavior flows from in a top-down manner.
If we often make choices that are rational in a decision-theoretic sense (given some assumption about the preferences we are trying to satisfy), we are doing so via a “subgoal capacity.” This kind of decision-making is available to our outermost loop, and our outermost loop sometimes uses it.
But I don’t think the logic of our outermost loop actually is approximate EU maximization—as evidenced by all the differences between how we deploy our capabilities, and how a “smart EU maximizer” would deploy its own. For instance, IMO we are less power-seeking than an EU maximizer would be (and when humans do seek power, it’s often as an end in itself, whereas power-seeking is convergent across goals for EU maximizers).
Saying that “our goals change over time, and we permit/welcome this” is meant to gesture at how different we are from a hypothetical EU maximizer with our capabilities. But maybe this concedes too much, because it implies we have some things called “goals” that play a similar role to the EU maximizer’s utility function. I am pretty sure that’s not in fact true. We have “goals” in the informal sense, but I don’t really know what it is we do with them, and I don’t think mapping them on to the “preferences” in a decision-theoretic story about human behavior is likely to yield accurate predictions.
1. Some people grow up in cultures where meat eating is normal, and eat meat themselves until some point, then become convinced to stop out of moral concern for animals. In rare cases, the person might have “always felt on some level” the same moral concern that later drives them to veganism, but in the median case (it seems to me) this moral concern is actually acquired over time: the person really does not care as much about animals earlier as they do later. (This is a particular case of a more general thing, where one’s “moral circle” widens over time; various kinds of individual change toward more “progressive” attitudes fall in this category.)
2. When I was younger, I felt very strongly that I ought to “be moral” while also feeling very uncertain about what this entailed; I felt sure there was some “true ethics” (something like a single, objectively true ethical theory) which was important to follow, without knowing what it was, and when I tried to “be moral” this often involved trying to figure out what the true ethics was. Over time, I have lost this emotion towards “the true ethics” (I am not sure it exists and not sure the question matters), while gaining a tendency to have more moral emotions/motivations about concrete instances of harm. I am not sure how glad I am that this has occurred. I find my former self strange and even “silly,” but I have difficulty coming up with a fully satisfying argument that I am “right” while he is “wrong.”
3. When a person loses or gains religious faith, this changes their world model so drastically that their previous goals/values (or at least the things they consciously articulated as goals/values) are not even on the map anymore. If you think of yourself as fundamentally striving to “serve God,” and then you decide God doesn’t exist, you need a new thing to do.
4. Some people have a single, consuming passion that drives them for much/most/all of their adult life. For each such person, there was a time before they became passionate about this subject, and indeed a time before they had even heard of it. So it was not always their “goal.” Mathematicians seem like an especially clear case of this, since higher math is so remote from everyday life.
5. New parents sometimes report that their children provide them with a kind of value which they did not anticipate, or perhaps could not even have understood, before parenthood. And, the belief that this occurs is widespread enough that people sometimes have children, not because they want to on their current values, but because they expect to become different people with different values as a result, and wish this change to occur.
Sorry, you asked for three, that was five. I wanted to cover a range of areas, since one can look at any one of these on its own and imagine a way it might fit inside a decision-theoretic story.
EU maximization with a non-constant world model (“map”) might look like 3 and 4, while 5 involves a basic biological function of great interest to natural selection, so we might imagine it a hardwired “special case” not representative of how we usually work. But the ubiquity and variety of this kind of thing, together with our lack of power-seeking etc., does strike me as a problem for decision-theoretic interpretations of human behavior at the lifetime level.
Meta-comment of my own: I’m going to have to tap out of this conversation after this comment. I appreciate that you’re asking questions in good faith, and this isn’t your fault, but I find this type of exchange stressful and tiring to conduct.
Specifically, I’m writing at the level of exactness/explicitness that I normally expect in research conversations, but it seems like that is not enough here to avoid misunderstandings. It’s tough for me to find the right level of explicitness while avoiding the urge to put thousands of very pedantic words in every comment, just in case.
Re: non-RL training data.
Above, I used “RL policies” as a casual synecdoche for “sources of Gato training data,” for reasons similar to the reasons that this post by Oliver Sourbut focuses on RL/control.
Yes, Gato had other sources of training data, but (1) the RL/control results are the ones everyone is talking about, and (2) the paper shows that the RL/control training data is driving those results (they get even better RL/control outcomes when they drop the other data sources).
Re: gains from transfer.
Yes, if Gato outperforms a particular RL/control policy that generated training data for it, then having Gato is better than merely having that policy, in the case where you want to do its target task.
However, training a Gato is not the only way of reaping gains from transfer. Every time we finetune any model, or use multi-task training, we are reaping gains from transfer. The literature (incl. this paper) robustly shows that we get the biggest gains from transfer when transferring between similar tasks, while distant or unrelated tasks yield no transfer or even negative transfer.
So you can imagine a spectrum ranging from
1. “pretrain only on one very related task” (i.e. finetuning a single narrow task model), to
2. “pretrain on a collection of similar tasks” (i.e. multi-task pretraining followed by finetuning), to
3. “pretrain on every task, even those where you expect no or negative transfer” (i.e. Gato)
The difference between Gato (3) and ordinary multi-task pretraining (2) is that, where the latter would only train with a few closely related tasks, Gato also trains on many other less related tasks.
It would be cool if this helped, and sometimes it does help, as in this paper about training on many modalities at once for multi-modal learning with small transformers. But this is not what the Gato authors found—indeed it’s basically the opposite of what they found.
We could use a bigger model in the hope that will get us some gains from distant transfer (and there is some evidence that this will help), but with the same resources, we could also restrict ourselves to less-irrelevant data and then train a smaller (or same-sized) model on more of it. Gato is at one extreme end of this spectrum, and everything suggests the optimum is somewhere in the interior.
Oliver’s post, which I basically agree with, has more details on the transfer results.
I think the reason why it being a unified agent matters is that we should expect significant positive transfer to happen eventually as we scale up the model and train it longer on more tasks. Do you not?
Sure, this might happen.
But remember, to train “a Gato,” we have to first train all the RL policies that generate its training data. So we have access to all of them too. Instead of training Gato, we could just find the one policy that seems closest to the target task, and spend all our compute on just finetuning it. (Yes, related tasks transfer—and the most related tasks transfer most!)
This approach doesn’t have to spend any compute on the “train Gato” step before finetuning, which gives it a head start. Plus, the individual policy models are generally much smaller than Gato, so they take less compute per step.
Would this work? In the case of the Lee et al robot problem, yes (this is roughly what Lee et al originally did, albeit with various caveats). In general, I don’t know, but this is the baseline that Gato should be comparing itself against.
The question isn’t “will it improve with scale?”—it’s 2022, anything worth doing improves with scale—but “will it ever reach the Pareto frontier? will I ever have a reason to do it?”
As an ML practitioner, it feels like the paper is telling me, “hey, think of a thing you can already do. What if I told you a way to do the same thing, equally well, with an extra step in the middle?” Like, uh, sure, but . . . why?
By contrast, when I read papers like AlphaGo, BERT, CLIP, OpenAI diffusion, Chinchilla . . . this is a type of paper where I say, “holy shit, this Fucking Works™, this moves the Pareto frontier.” In several of these cases I went out and immediately used the method in the real world and reaped great rewards.
IMO, the “generalist agent” framing is misleading, insofar as it obscures this second-best quality of Gato. It’s not really any more an “agent” than my hypothetical cloud drive with a bunch of SOTA models on it. Prompting Gato is the equivalent of picking a file from the drive; if I want to do a novel task, I still have to finetune, just as I would with the drive. (A real AGI, even a weak one, would know how to finetune itself, or do the equivalent.)
We are not talking about an autonomous thing; we’re still in the world where there’s a human practitioner and “Gato” is one method they can use or not use. And I don’t see why I would want to use it.
For what it’s worth, I was thoroughly underwhelmed by Gato, to the point of feeling confused what the paper was even trying to demonstrate.
I’m not the only ML researcher who had this reaction. In the Eleuther discord server, I said “i don’t get what i’m supposed to take away from this gato paper,” and responses from regulars included
“nothing, this was 3 years over-due”
“Yep. I didn’t update much on this paper. I think the ‘general’ in the title is making people panic lol” (with two “this” reacts)
Or see this tweet. I’m not trying to convince you by saying “lots of people agree with me!”, but I think this may be useful context.
A key thing to remember when evaluating Gato is that it was trained on data from many RL models that were themselves very impressive. So there are 2 very different questions we can ask:
1. Does Gato successfully distill a large number of learned RL policies into a single, small collection of params?
2. Does Gato do anything except distillation? Is there significant beneficial transfer between tasks or data types? Is Gato any more of a “generalist agent” than, like, a big cloud storage bucket with all of those RL models in it, and a little script that lets you pick which one to load and run?
And the answers are a pretty clear, stark “yes” and “no,” respectively.
For #2, note that every time the paper investigates transfer, it gets results that are mostly or entirely negative (see Figs 9 and 17). For example, including stuff like text data makes Gato seem more sexily “generalist” but does not actually seem to help anything—it’s like uploading a (low-quality) LM to the same cloud bucket as the RL policies. It just sits there.
In the particular case of the robot stacking experiment, I don’t think your read is accurate, for reasons related to the above. Neither the transfer to real robotics, nor the effectiveness of offline finetuning, are new to Gato—the researchers are sticking as close as they can to what was done in Lee et al 2022, which used the same stacking task + offline finetuning + real robots, and getting (I think?) broadly similar results. That is, this is yet another success of distillation, without a clear value-add beyond distillation.
In the specific case of Lee et al’s “Skill Generalization” task, it’s important to note that the “expert” line is not reflective of “SOTA RL expert models.”
The stacking task is partitioned here (over object shapes/colors) into train and test subsets. The “expert” is trained only on the train subset, and then Lee et al (and the Gato authors) investigate models that are additionally tuned on the test subset in some way or other. So the “expert” is really a baseline here, and the task consists of trying to beat it.
(This distinction is made somewhat clearer in an appendix of the Gato paper: see Fig. 17, and note that the “expert” lines there match the “Dataset” lines from Fig. 3 in Lee et al 2022.)
I’ve tried the method from that paper (typical sampling), and I wasn’t hugely impressed with it. In fact, it was worse than my usual sampler to a sufficient extent that users noticed the difference, and I switched back after a few days. See this post and these tweets.
(My usual sampler is one I came up with myself, called Breakruns. It works the best in practice of any I’ve tried.)
I’m also not sure I really buy the argument behind typical sampling. It seems to conflate “there are a lot of different ways the text could go from here” with “the text is about to get weird.” In practice, I noticed it would tend to do the latter at points where the former was true, like the start of a sample or of a new paragraph or section.
Deciding how you sample is really important for avoiding the repetition trap, but I haven’t seen sampling tweaks yield meaningful gains outside of that area.
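For concreteness, here is a minimal sketch of the typical sampling filter discussed above, as I understand the method: rank tokens by how far their surprisal is from the entropy of the next-token distribution, then keep the smallest set whose total mass reaches a threshold. The function name `typical_filter`, the parameter `tau`, and the toy distribution are all illustrative assumptions, not anything taken from the paper's code.

```python
import numpy as np

def typical_filter(probs, tau=0.9):
    """Keep only 'locally typical' tokens and renormalize (illustrative sketch)."""
    probs = np.asarray(probs, dtype=float)
    # Surprisal of each token; clip to avoid log(0).
    surprisal = -np.log(np.clip(probs, 1e-12, None))
    # Entropy of the distribution = expected surprisal.
    entropy = float(np.sum(probs * surprisal))
    # Rank tokens by |surprisal - entropy|, closest first.
    order = np.argsort(np.abs(surprisal - entropy))
    cumulative = np.cumsum(probs[order])
    # Smallest prefix of that ranking whose mass reaches tau.
    cutoff = int(np.searchsorted(cumulative, tau)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Hypothetical "many ways the text could go" distribution:
# one moderately likely token plus a flat tail.
toy = [0.4, 0.15, 0.15, 0.15, 0.15]
out = typical_filter(toy, tau=0.5)
print(out)  # token 0, the single most likely token, is filtered out here
```

In this toy case the most probable token has lower-than-average surprisal, so it ranks as the least "typical" and is dropped, which may be one way the filter ends up preferring less expected continuations at high-entropy positions.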
If you’re on mobile, try it on desktop. Here’s what the sentences look like on my laptop, at 100% zoom.
Hmm… what moral are you drawing from that result?
Apparently, CLIP text vectors are very distinguishable from CLIP image vectors. I don’t think this should be surprising. Text vectors aren’t actually expressing images, after all, they’re expressing probability distributions over images.
They are more closely analogous to the outputs of a GPT model’s final layer than they are to individual tokens from its vocab. The output of GPT’s final layer doesn’t “look like” the embedding of a single token, nor should it. Often the model wants to spread its probability mass across a range of alternatives.
Except even that analogy isn’t quite right, because CLIP’s image vectors aren’t “images,” either—they’re probability distributions over captions. It’s not obvious that distributions-over-captions would be a better type of input for your image generator than distributions-over-images.
Also note that CLIP has a trainable parameter, the “logit scale,” which multiplies the cosine similarities before the softmax. So the overall scale of the cosine similarities is arbitrary (as is their maximum value). CLIP doesn’t “need” the similarities to span any particular range. A similarity value like 0.4 doesn’t mean anything on its own about how close CLIP thinks the match is. That’s determined by (similarity * logit scale).
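A small numerical sketch of the logit-scale point. The similarity values and the two scale values below are made up for illustration (they are not read off any real CLIP checkpoint), and `match_probs` is a hypothetical helper name.

```python
import numpy as np

def match_probs(cosine_sims, logit_scale):
    """Softmax over (cosine similarity * logit scale), computed stably."""
    logits = logit_scale * np.asarray(cosine_sims, dtype=float)
    logits = logits - logits.max()
    exp = np.exp(logits)
    return exp / exp.sum()

# The same three cosine similarities, read through two different scales.
sims = [0.40, 0.30, 0.20]
for scale in (1.0, 100.0):  # illustrative values only
    print(scale, match_probs(sims, scale).round(4))
```

With the small scale the three candidates come out close to equally likely; with the large one, nearly all the mass lands on the 0.40 candidate. The raw similarity of 0.4 is identical in both cases.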
I completely agree that the effects of using unCLIP are mysterious, in fact the opposite of what I’d predict them to be.
I wish the paper had said more about why they tried unCLIP in the first place, and what improvements they predicted they would get from it. It took me a long time just to figure out why the idea might be worth trying at all, and even now, I would never have predicted the effects it had in practice. If OpenAI predicted them, then they know something I don’t.
For instance, it seems like maybe the model that produced the roses on the left-hand side of the diversity-fidelity figure was also given a variable-length encoding of the caption? I’m having a hard time telling from what’s written in the paper
Yes, that model did get to see a variable-length encoding of the caption. As far as I can tell, the paper never tries a model that only has a CLIP vector available, with no sequential pathway.
Again, it’s very mysterious that (GLIDE’s pathway + unCLIP pathway) would increase diversity over GLIDE, since these models are given strictly more information to condition on!
(Low-confidence guess follows. The generator views the sequential representation in its attention layers, and in the new model, these layers are also given a version of the CLIP vector, as four “tokens,” each a different projection of the vector. [The same vector is also, separately, added to the model’s more global “embedding” stream.] In attention, there is competitive inhibition between looking at one position, and looking at another. So, it’s conceivable that the CLIP “tokens” are so information-rich that the attention fixates on them, ignoring the text-sequence tokens. If so, it would ignore some information that GLIDE does not ignore.)
It’s also noteworthy that they mention the (much more obvious) idea of conditioning solely on CLIP text vectors, citing Katherine Crowson’s work:
Building on this observation, another approach would be to train the decoder to condition on CLIP text embeddings instead of CLIP image embeddings
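To illustrate the “competitive inhibition” in the guess above: attention weights come from a softmax over positions, so adding a few high-scoring positions necessarily takes weight away from the rest. The scores below are invented purely for illustration and say nothing about what the real model's attention scores look like.

```python
import numpy as np

def attention_weights(scores):
    """Softmax over attention scores for a single query (numerically stable)."""
    scores = np.asarray(scores, dtype=float)
    w = np.exp(scores - scores.max())
    return w / w.sum()

# Hypothetical pre-softmax scores for one query position.
text_scores = np.array([1.0, 0.5, 0.8, 0.2])   # four text-sequence tokens
clip_scores = np.array([4.0, 4.0, 4.0, 4.0])   # four extra CLIP "tokens"

text_only = attention_weights(text_scores)
combined = attention_weights(np.concatenate([text_scores, clip_scores]))

# Fraction of attention still going to the text tokens once CLIP tokens are present.
text_mass = float(combined[:4].sum())
print(text_only.round(3), round(text_mass, 3))
```

In this made-up setting the text tokens keep only a few percent of the attention once the CLIP tokens are added, even though their own scores did not change.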
...but they never actually try this out in a head-to-head comparison. For all we know, a model conditioned on CLIP text vectors, trained with GLIDE’s scale and data, would do better than GLIDE and unCLIP. Certainly nothing in the paper rules out this possibility.
Ah, I now realize that I was kind of misleading in the sentence you quoted. (Sorry about that.)
I made it sound like CLIP was doing image compression. And there are ML models that are trained, directly and literally, to do image compression in a more familiar sense, trying to get the pixel values as close to the original as possible. These are the image autoencoders.
DALLE-2 doesn’t use an autoencoder, but many other popular image generators do, such as VQGAN and the original DALLE.
So for example, the original DALLE has an autoencoder component which can compress and decompress 256x256 images. Its compressed representation is a 32x32 array, where each cell takes a discrete value from 8192 possible values. This is 13 bits per cell (if you don’t do any further compression like RLE on it), so you end up with 13 Kibit (about 1.6 KiB) per image. And then DALLE “writes” in this code the same way GPT writes text.
CLIP, though, is not an autoencoder, because it never has to decompress its representation back into an image. (That’s what unCLIP does, but the CLIP encoding was not made “with the knowledge” that unCLIP would later come along and try to do this; CLIP was never encouraged to make its code especially suitable for this purpose.)
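The size arithmetic for that code can be checked directly. The comparison against a raw image assumes 8-bit RGB pixels, which is my assumption and is not stated above.

```python
import math

grid_side = 32          # the compressed code is a 32x32 array
codebook_size = 8192    # each cell takes one of 8192 discrete values

bits_per_cell = int(math.log2(codebook_size))          # 13
total_bits = grid_side * grid_side * bits_per_cell     # 13312 bits = 13 * 1024
total_bytes = total_bits // 8                          # 1664 bytes

# Assumption: the uncompressed image is 256x256 with 3 channels of 8 bits each.
raw_bytes = 256 * 256 * 3
print(bits_per_cell, total_bits, total_bytes, raw_bytes / total_bytes)
```

So the uncompressed code is 13 × 1024 bits, i.e. 1664 bytes, per image.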
Instead, CLIP is trying to capture . . . “everything about an image that could be relevant to matching it with a caption.”
In some sense this is just image compression, because in principle the caption could mention literally any property of the image.
But lossy compression always has to choose something to sacrifice, and CLIP’s priorities are very different from the compressors we’re more familiar with. They care about preserving pixel values, so they care a lot about details. CLIP cares about matching with short (<= ~70 word) captions, so it cares almost entirely about high-level semantic features.
In general all writing I’ve seen is bad. I think this is less likely to be about safety, and more that it’s hard to learn language by looking at a lot of images. However, since DE2 is trained on text, it clearly knows a lot about language at some level—I would expect there’s plenty of data to put out coherent text. Instead it outputs nonsense, focusing on getting the fonts and the background right.
It’s definitely possible to get a diffusion model to write the text from a prompt into an image. I made a model that does this late last year. (blogpost / example outputs*)
The text-conditioning mechanism (cross-attention) I use is a little different from the ones in GLIDE and DALLE-2, but I doubt this makes a huge difference.
I’m actually a little surprised that the OpenAI models don’t learn to write coherent text, since they’re bigger than mine, trained for longer on more data.
But then, I’m much more focused on this one specific capability, so I make it easy for the model: an entire ~50% of my training images have text in them, and the “prompt” in my setup always contains an automatic transcript of the text in the image (if any), never a description, or a description that happens to quote a transcript, or a description that merely summarizes the text, etc.
The OpenAI models have to solve a more abstract version of the problem, and the problem is relevant to (I would imagine) a much smaller fraction of their training examples.
*check the alt text if you want to know what text the model is attempting to write
Thinking back to the “inconsistency” from the Kaplan et al papers...
In Appendix E of the new paper, we see the loss-vs-compute frontier start to “bend” from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
I suspect this bending is the transition from the faster “L(C) law” to the slower “L(D) law.”
A brief recap of that below:
Adding more params can help in two ways: it makes your model’s loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
As models get bigger, the first effect dies off: the loss curves converge to a fixed shape, rather than getting ever steeper. The second effect keeps going, but with it alone, the overall rate of return is lower.
Presumably, the learning rate issue in Kaplan et al. also affected their estimated L(D) law.
The issue made Kaplan et al underestimate optimal model performance. The underestimate was worst when considering models for which the optimal number of training steps was small.
The L(D) law came from early stopping experiments. The early stopping step is lower for smaller data sizes.
So the L(D) experiments with smaller D values look artificially bad, relative to the ones with large D values. Thus the estimated L(D) curve declines faster than the true L(D) curve.
If this is correct, then L(D) improves more slowly with data than we had believed.
Note that this does not contradict the “use more data!” result from the paper; that result is about the relative rate at which N and D affect L(N, D).
It’s strongly implied to be 2048, as in Gopher.
If you just want a big parameter-count to wave around, you use a MoE like everyone else optimizing for clickbait. (Or even better, use n-grams so you can talk about having hundreds of trillions of parameters. It’ll work awful compared to NNs, but you’ll have the most parameters!)
On that note, did you see that recent Chinese MoE with 174T params, 3 layers, and 96000 experts?
It ought to shorten actual timelines, for the reason you say. (Except insofar as data sourcing could actually become a practical problem.)
However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed. (It’s the parameter count of a model that uses about as much inference compute as the brain.)
This is a weird thing about Bio Anchors—it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline. It’s always waiting for its “sufficiently expensive model” (and it does not care that this model keeps “getting better” in terms of loss/etc as the efficiency improvements roll in).
Anyway, I’d forgotten the prior used for dataset scaling in Bio Anchors, but it’s pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling. So this news is less relevant than I had thought.
I don’t agree with your read of the MuZero paper.
The training routine of MuZero (and AlphaZero etc) uses explicit tree search as a source of better policies than the one the model currently spits out, and the model is adapted to output these better policies.
The model is trying to predict the output of the explicit tree search. There’s room to argue over whether or not it “learns implicit tree search” (ie learns to actually “run a search” internally in some sense), but certainly the possibility is not precluded by the presence of the explicit search; the only reason the explicit search is there at all is to give the model a signal about what it should aspire to do without explicit search.
It’s also true that, when the trained models are run in practice, they are usually run with explicit search on top, and this improves their performance. This does not mean they haven’t learned implicit search—only that a single forward pass of the model cannot do as well as a search guided by many forward passes of the same model, which is not a surprising outcome for any model (even models which do some kind of search inside each forward pass).
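As a rough sketch of what “the model is trying to predict the output of the explicit tree search” means for the policy head: the search's visit counts at the root are normalized into a target distribution, and the policy output is trained toward it with a cross-entropy loss. This covers only the policy term (the real training also has value and, in MuZero, reward terms), and the function names and numbers are hypothetical.

```python
import numpy as np

def search_policy_target(visit_counts, temperature=1.0):
    """Turn root visit counts from tree search into a target distribution."""
    counts = np.asarray(visit_counts, dtype=float) ** (1.0 / temperature)
    return counts / counts.sum()

def policy_loss(policy_logits, target):
    """Cross-entropy between the search target and the model's policy output."""
    logits = np.asarray(policy_logits, dtype=float)
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())

# Hypothetical search result over three moves: the search prefers move 1.
target = search_policy_target([10, 70, 20])

loss_uniform = policy_loss([0.0, 0.0, 0.0], target)   # model has no preference yet
loss_matched = policy_loss(np.log(target), target)    # model reproduces the search
print(target, loss_uniform, loss_matched)
```

The loss is lowest when a single forward pass already outputs what the search would have produced, which is the sense in which the explicit search acts as a training signal for what the model should do without it.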