Compute the small change in data dx which would induce a small change in trained parameter values dθ along each of the narrowest directions of the ridge in the loss landscape (i.e. eigenvectors of the Hessian with largest eigenvalue).
Can you unroll that?
“Small change in data” = one additional training sample is slightly modified? “Induce” = via an SGD update step on that additional training sample? Why is there a ridge in the loss landscape? What are “the narrowest directions”?
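To make my confusion concrete, here is the kind of computation I could imagine being meant, written as a runnable toy sketch (my guess, not necessarily what you intend): find a top Hessian eigenvector of the training loss by power iteration, treat “induce” as one SGD step on a single perturbed sample, and look at the component of dθ along that eigenvector. All names, sizes and the learning rate below are made up for illustration.

```python
import torch

# Toy setup so the sketch actually runs: a tiny linear model on random data.
torch.manual_seed(0)
X, y = torch.randn(32, 4), torch.randn(32)
w = torch.randn(4, requires_grad=True)            # the "trained" parameters theta
loss = ((X @ w - y) ** 2).mean()                  # training loss at theta

def hvp(loss, param, vec):
    """Hessian-vector product H @ vec via double backprop."""
    (g,) = torch.autograd.grad(loss, param, create_graph=True)
    (hv,) = torch.autograd.grad(g @ vec, param, retain_graph=True)
    return hv

# Power iteration: the Hessian eigenvector with the largest eigenvalue, i.e.
# the direction in which the loss curves most sharply (a "narrowest direction").
v = torch.randn(4)
for _ in range(100):
    v = hvp(loss, w, v)
    v = v / v.norm()

# One reading of "induce": d_theta is the parameter change from a single SGD
# step on one additional, slightly modified sample x0 + dx.
x0, y0 = torch.randn(4), torch.tensor(1.0)
dx = 1e-3 * torch.randn(4)                        # the "small change in data"
lr = 1e-2
sample_loss = ((x0 + dx) @ w - y0) ** 2
(g_sample,) = torch.autograd.grad(sample_loss, w)
d_theta = -lr * g_sample
print("d_theta component along the sharpest direction:", torch.dot(d_theta, v).item())
```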
Is the difference mostly the learning rate schedule? I read it was also AdamW, and it is at least conceivable that AdamW somehow gets better results for smaller models trained on more data but maxes out on the benefits of model size more quickly than plain Adam. So it could in theory be the case that scaling continues along the old scaling laws beyond what the new scaling laws say is possible, because Adam and AdamW just work differently enough. Of course that’s not very plausible, and for merely different learning rate schedules it is maybe even less plausible.
Another way to phrase the question: are the old and the new scaling laws roughly compatible? That is, do the old scaling laws drop out of the new ones if you use the old compute-optimal data/params distribution? I read your answer as saying that this is roughly the case for current models, but maybe not when you extrapolate further along the old scaling laws?
If the old scaling laws are still correct for a fixed dataset with a correspondingly fixed learning rate schedule, then we can reasonably say that the new scaling laws show us where the old scaling would have hit a wall.
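To make that concrete for myself, here is a rough numerical version of the check I have in mind: take the new parametric loss (constants roughly as reported for the Chinchilla fit in Hoffmann et al. 2022), and compare its compute-optimal allocation with an old-style N ∝ C^0.73 allocation (the Kaplan et al. exponent), anchored at a budget where the two agree. This is only my sketch, not the post’s numbers.

```python
import numpy as np

# Parametric loss L(N, D) = E + A/N^alpha + B/D^beta, constants approximately
# as reported for the Chinchilla fit; compute approximated as C ~ 6*N*D.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def best_alloc(C):
    """N that minimises the parametric loss for budget C (with D = C / (6N))."""
    Ns = np.logspace(8, 13, 4000)
    losses = loss(Ns, C / (6 * Ns))
    return Ns[np.argmin(losses)], np.min(losses)

C0 = 1e21                        # anchor budget where both allocation rules agree
N0, _ = best_alloc(C0)

for C in [1e21, 1e22, 1e23, 1e24, 1e25]:
    _, L_new = best_alloc(C)                  # "new" compute-optimal allocation
    N_old = N0 * (C / C0) ** 0.73             # "old" Kaplan-style N ~ C^0.73
    L_old = loss(N_old, C / (6 * N_old))
    print(f"C={C:.0e}  new-optimal={L_new:.3f}  old-style={L_old:.3f}")
```

If I’m reading it right, the two stay close near the anchor and only drift apart as you extrapolate the compute budget, which is the sense in which the old scaling would have hit a wall.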
Some more questions:
Meanwhile, the resulting model would not be nearly as big as PaLM. The optimal compute law actually puts it at 63B params.
How come PaLM_opt is smaller than Chinchilla? Isn’t Chinchilla supposed to be Gopher_opt?
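For reference, the back-of-envelope behind my confusion, using the usual C ≈ 6·N·D approximation and the published parameter/token counts (all numbers approximate):

```python
# C ~ 6*N*D with the published params/tokens; all numbers approximate.
gopher_C = 6 * 280e9 * 300e9      # Gopher: 280B params, 300B tokens  ~ 5.0e23 FLOPs
palm_C   = 6 * 540e9 * 780e9      # PaLM:   540B params, 780B tokens  ~ 2.5e24 FLOPs
print(palm_C / gopher_C)           # ~ 5: PaLM's budget is about 5x Gopher's

# Chinchilla (70B / 1.4T tokens) is supposed to be compute-optimal for roughly
# Gopher's budget, so naively I'd expect PaLM_opt to come out larger than 70B.
```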
Insofar as we trust our equation, this entire line of research—which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG—could never have beaten Chinchilla, no matter how big the models got.
These models were trained differently, which is why they had different scaling laws. Can we suppose that the new scaling laws tell us where the old scaling would have broken down?
I think it would be a great follow-up post to explain why you think repeating data is not going to be the easy way out for the scaling enthusiasts at Deepmind and OpenAI.
I find the Figure 4 discussion at your first link quite confusing. They study repeated data, i.e. imbalanced datasets, and then draw conclusions about repeating data, i.e. training for several epochs. The performance hit they observe does not seem massive (in the context of scaling by a couple of OOMs), and they keep the number of training tokens constant.
I really can’t tell how this informs me about what would happen if somebody tried to scale compute 1000-fold and had to repeat data to do it compute-optimally, which seems to be the relevant question.
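To spell out the scenario I mean, with my own rough numbers and assuming a Chinchilla-style allocation where tokens scale roughly as C^0.5:

```python
# Scaling compute 1000x while staying compute-optimal in the Chinchilla sense
# (training tokens D scaling roughly as C^0.5).
compute_factor = 1000
data_factor = compute_factor ** 0.5        # ~ 31.6x more training tokens
print(data_factor)

# If the pool of unique text is already more or less exhausted at today's scale,
# those extra tokens have to come from repetition, i.e. on the order of 30 epochs
# over the same data, which is the regime Figure 4 doesn't seem to cover.
```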
3. Stop worrying about finding “outer objectives” which are safe to maximize. I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function).
Instead, focus on building good cognition within the agent.
In my ontology, there’s only an inner alignment problem: How do we grow good cognition inside of the trained agent?
This vibes well with what I’ve been thinking about recently.
There’s a post in the back of my mind called “Character alignment”, which is about how framing alignment in terms of values, goals, reward, etc. is maybe not always ideal, because, at least introspectively for me, these seem to be strongly influenced by a more general structure of my cognition, i.e. my character.
Character here can be understood as a certain number of specific strategic priors, which might make good optimisation targets because they drop out of game-theoretic considerations and are therefore possibly modelled quite generally and robustly by sufficiently advanced agents.
I see many incorrect assumptions about what it takes to be a good conceptual researcher floating around [...] you can pick up the relevant part [of ML] and just work on approaches different to pure prosaic alignment
This seemed to imply that you might be a conceptual alignment researcher but also work on pure prosaic alignment, which was the point where I thought: Ok, maybe I don’t know what “conceptual alignment research” means. But the link definitely clears it up, thank you!
Could you maybe add a paragraph (or comment) how exactly you define “conceptual” alignment research? What would be an example of alignment research that is not conceptual?
Well, I don’t, so I don’t.
Seriously, there is absolutely nothing about strong commitment being bad in my comment, or is there?
I think there are many marriages where one side defects (be it cheating, alcohol, abuse, …) and the other side has very good reason to get out. Conversely, there are many cases where it is advantageous for one side to hold on to the marriage long after things have gone south (for example, for financial reasons).
That’s why I think allowing one side to veto divorce is a bad idea. Making divorce harder is very different from making it impossible.
The marriage vow allows one partner to veto divorce, which seems like a bad idea.
In that case, did death part you?
Not more than sleep, I would say.
Do you think that the advice she gives describes an open-ended decision-making process that will help people figure out what the right choice is for them?
I think most people should have kids. I also think most people will be led away from having kids if they follow her advice. I certainly would have been.
I think it is pretty clear that all the advice she gives is strongly colored by her own eventual conclusion. So even if her conclusion wasn’t post-hoc, her advice—the content, the framing, the situations she describes, the media she recommends—is.
I consider the interview to be a large passive negative impact, to use her own terminology.
I wrote a comment on the blog, because to me Kat Wood’s reasoning is so off the mark.
To me the upside of having kids is the deep emotional connection to them. Everything else is comparatively irrelevant. Or maybe a better phrasing would be that almost all upsides of having kids are downstream of that emotional connection.
Babysitting is like parenting except without the same emotional connection. If you remove the emotional connection from an experience, does the result allow you to evaluate the original experience? No, it becomes something very different.
It boggles my mind that so many women who have strong emotional reactions to little children manage to convince themselves that they don’t want any. Like, if you react like that to a random toddler … the emotion is going to be a hundred times stronger when it’s your kid.
The entire interview is not about decision-making around having kids; it’s entirely post-hoc rationalisation for not having kids.
It reminds me of a mentally handicapped guy I used to play chess with. He always maneuvered his rook in front of his king’s position no matter what I played. That’s a really slow and weird maneuver which only works if I do absolutely nothing.
It was just the only plan he knew.
No, I haven’t tried few-shot prompting yet because it uses many more tokens, i.e. credits.
But I also don’t think that few-shot prompting reveals a different kind of understanding. The main advantage seems to be that you can point more robustly towards what you want.
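To illustrate what I mean by “pointing more robustly” (a made-up example, not the actual prompts from the post):

```python
# Zero-shot: the model has to guess the task format as well as the answer.
zero_shot = "Is the following character trustworthy?\nCharacter: ...\nAnswer:"

# Few-shot: the same question, but the examples pin down the format and the
# kind of judgement wanted; I don't think they add a new kind of understanding.
few_shot = """Character: Anna returns a lost wallet with all the cash still inside.
Trustworthy: yes

Character: Bert promises to help you move, then never shows up.
Trustworthy: no

Character: ...
Trustworthy:"""
```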
But maybe you can give an example of what you have in mind.
I agree. But I wanted to avoid burning through the credits too quickly. I also wonder whether Joe and Jack would be more realistically assessed with a few-shot prompt.
What I am talking about is the dimension from cooperation to conflict, i.e. jointly optimising the preferences of all interacting agents versus optimising one set of preferences at the expense of the preferences of the other involved agents.
This is a dimension that any sufficiently intelligent agent trained on situations involving multi-agent interaction will learn, even if it is only trained to maximize its own set of preferences. It’s a concept that is independent of the agents’ preferences in the individual instances of interaction, so the definition of “best” is really not relevant at this level of abstraction.
It’s probably one of the most basic and most general dimensions of action in multi-agent settings, and one of only very few that are always salient.
That’s why I say that the two poles of that dimension in agent behavior are wells. (They are both very wide wells, I would think. When I said “deep” I meant something more like “hard to dislodge from”.)
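A deliberately tiny toy version of that dimension, just to make it concrete (my illustration, nothing more):

```python
# Payoffs (my_utility, their_utility) for each action I can take.
actions = {
    "share": (3, 3),   # cooperative: jointly best
    "grab":  (5, 0),   # conflict: best for me, at the other agent's expense
}

selfish = max(actions, key=lambda a: actions[a][0])     # optimise only my preferences
joint   = max(actions, key=lambda a: sum(actions[a]))   # jointly optimise everyone's
print(selfish, joint)  # -> grab share
```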
No, not as far as I know. But take a look at AlephAlpha, it seems they offer a similar functionality with some initial free credits.
I think alignment might well be very hard and the stakes are certainly very high, but I must say that I find this post only partly convincing. Here are some thoughts I had, reading the post:
In primates, making them smarter probably mostly required scaling up the cortex, while making the smarter version more aligned with IGF would have required rewiring the older brain parts. One of these is much harder to do than the other.
But in machines, both alignment and capabilities will likely be learned within the same architecture, which makes it less likely than in primates that capabilities outstrip alignment for architectural reasons.
That there is a capabilities well is certainly true. But shouldn’t there be value wells as well? One might think “I do what is best for me” is a deeper well than “I do what is best for all” and more aligned with instrumental power seeking.
But “I do what is best for me” isn’t really a value well, because it begs the question: if I do what is best for me, I still have to decide what is good. The second well actually provides values, because other entities already have preferences. Helping them fulfill those is an actual value, i.e. it directs actions, in a way that “I do what is best for me” on its own doesn’t.
Is the structure underlying “do what is best for all” less simple and logical than arithmetic?
I often see it assumed that the ultimate values of an AGI will be kind of random, something a mesa-optimiser stumbled into during training. If this is true, then there is little optimisation pressure towards those values, and it seems possible to train for “do what is best for all” instead.
Yeah, I thought that was just what you did. I expected Sarno to have some kind of program. But ok, if that’s it, that’s it.