It’s only 1.2 billion parameters.
Indeed, but to slightly counterbalance this: at the same time, it looks like it was trained on ~500B tokens (while ~300B were used for GPT-3 and something like ~50B for GPT-2).
I am curious to hear/read more about the issue of spikes and instabilities when training large language models (see the quote from page 11 of the paper). If someone knows a good reference on this, I am interested!
5.1 Training Instability
For the largest model, we observed spikes in the loss roughly 20 times during training, despite the fact that gradient clipping was enabled. These spikes occurred at highly irregular intervals, sometimes happening late into training, and were not observed when training the smaller models. Due to the cost of training the largest model, we were not able to determine a principled strategy to mitigate these spikes.
Instead, we found that a simple strategy to effectively mitigate the issue: We re-started training from a checkpoint roughly 100 steps before the spike started, and skipped roughly 200–500 data batches, which cover the batches that were seen before and during the spike. With this mitigation, the loss did not spike again at the same point. We do not believe that the spikes were caused by “bad data” per se, because we ran several ablation experiments where we took the batches of data that were surrounding the spike, and then trained on those same data batches starting from a different, earlier checkpoint. In these cases, we did not see a spike. This implies that spikes only occur due to the combination of specific data batches with a particular model parameter state. In the future, we plan to study more principled mitigation strategy for loss spikes in very large language models.
FYI, the “Evaluating Alignment Evaluations” project of the current AI Safety Camp is working on studying and characterizing alignment (propensity) evaluations. We hope to contribute to the science of evals, and we will contact you next month. (Somewhat deprecated project proposal)
“The training algorithm has found a better representation”?? That seems strange to me, since the loss should be lower in that case, not spiking. Or maybe you mean that the training broke free of a kind of local minimum (without having found a better one yet). Also, I guess the people training these models observed that waiting out these spikes doesn’t lead to better performance, or they would not have removed them from the training run.
Along these lines, and after looking at the “grokking” paper, I would guess that it’s more likely caused by the weight decay (or something similar) causing the training to break out of a kind of local minimum. An interesting point may be that larger/better LMs may have significantly sharper internal models and are thus more prone to this phenomenon (the weight decay (or similar) more easily breaking the more sensitive/better/sharper models).
It should be very easy to check whether these spikes are caused by the weight decay “damaging” very sharp internal models: for example, replay the spiky part of training several times with less and less weight decay… (I would be curious about similar tests varying the momentum, dropout, etc., about whether the spikes are initially triggered by some subset of the network, and about how many training steps the spikes last...)
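A minimal sketch of such a replay experiment (everything below is my own assumption about the setup, not something from the paper; `checkpoint_model` and `spike_batches` are hypothetical handles to the pre-spike checkpoint and the recorded batches):

```python
import copy
import torch

def replay_segment(model, batches, lr, weight_decay, n_steps):
    """Re-run n_steps of training on the recorded batches and return the per-step losses."""
    model = copy.deepcopy(model)  # keep the original checkpoint untouched
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    losses = []
    for _, (x, y) in zip(range(n_steps), batches):
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# Sweep the weight decay while replaying the same pre-spike segment:
# does the spike shrink or disappear as weight decay goes to zero?
# for wd in [0.1, 0.03, 0.01, 0.0]:
#     losses = replay_segment(checkpoint_model, spike_batches,
#                             lr=1e-4, weight_decay=wd, n_steps=500)
#     print(wd, max(losses))
```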
If the following correlations are true, then the opposite may be true (slave morality being better for improving the world through history):
Improving the world being strongly correlated with economic growth (this is probably less true when X-risks are significant)
Economic growth being strongly correlated with Entrepreneurship incentives (property rights, autonomy, fairness, meritocracy, low rents)
Master morality being strongly correlated with acquiring power and thus decreasing the power of others and decreasing their entrepreneurship incentives
The title is clearly an overstatement. It expresses more that I updated in that direction, than that I am confident in it.
Also, having learned from other comments that decentralized learning is likely solved, I am now even less confident in the claim: maybe only a 15% chance that it will happen in the strong form stated in the post.
Maybe I should edit the post to make it even more clear that the claim is retracted.
Thanks a lot for the summary at the start!
If you fit the logit over a range that is not [0.0, 1.0] but [low perf., high perf.], then you get a bit more predictive power, but it is still confusingly low.
A possible intuition here is that the scaling is producing a transition from non-zero performance to non-perfect performance. This seems right since the random baseline is not 0.0 and reaching perfect accuracy is impossible.
I tried this only with PaLM on NLU and I used the same adjusted range for all tasks:
[0.9 × overall min. acc., 1.0 − 0.9 × (1.0 − overall max. acc.)] ≈ [0.13, 0.95]
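For concreteness, a rough sketch of this adjusted fit (the accuracies below are made-up placeholders, and the function and variable names are mine):

```python
import numpy as np
from scipy.optimize import curve_fit

low, high = 0.13, 0.95  # 0.9 * overall min. acc. and 1 - 0.9 * (1 - overall max. acc.)

def scaled_logistic(log_n_params, midpoint, slope):
    """Logistic curve whose asymptotes are [low, high] instead of [0, 1]."""
    return low + (high - low) / (1.0 + np.exp(-slope * (log_n_params - midpoint)))

# Placeholder data: accuracies of the three PaLM sizes on one task.
n_params = np.array([8e9, 62e9, 540e9])
accuracy = np.array([0.25, 0.55, 0.80])

(midpoint, slope), _ = curve_fit(scaled_logistic, np.log10(n_params), accuracy, p0=[10.5, 1.0])
print(midpoint, slope)
```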
Even if this model were true, there may be other, additional explanations, e.g. that the improvement on one task is not modeled by one logit function but by several of them: a task would be composed of sub-tasks, each modelable by one logit function. If this makes sense, one could try to model the improvements on all of the tasks using only a small number of logit curves associated with the sub-tasks (decomposing each task into a set of sub-tasks, each with a simple trend).
(Also, Gopher looks less predictable and its data sparser: there is no data point in the tens-of-billions-of-parameters range.)
Some quick thoughts about “Content we aren’t (yet) discussing”:
SL (cloning) is more important than RL. Humans learn a world model by SSL, then they bootstrap their policies through behavioural cloning, and finally they fine-tune their policies through RL.
Why? Because, both for theoretical reasons and from experimental data points, this is the cheapest way to generate good general policies…
SSL before SL, because you get much more frequent and much denser data about the world by trying to predict it. ⇒ SSL before SL because of a bottleneck on the data available for SL.
SL before RL, because this removes half (in log scale) of the search space by removing the need to discover/learn your reward function at the same time as your policy function. In addition, this removes the need to do the very expensive exploration and the temporal and “agential” (when multi-agent) credit assignment. ⇒ SL before RL because of the cost of doing RL.
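To make the difference between these three signals concrete, here is a toy sketch (my own illustration with random stand-in data, not something from the post):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 8, 3
world_model = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, obs_dim))
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))

obs = torch.randn(256, obs_dim)

# 1) SSL: predict the next observation; dense, abundant signal, no labels needed.
next_obs = torch.randn(256, obs_dim)
ssl_loss = F.mse_loss(world_model(obs), next_obs)

# 2) SL / behavioural cloning: imitate demonstrated actions; no exploration and
#    no credit assignment, the reward function is implicit in the demonstrations.
demo_actions = torch.randint(0, act_dim, (256,))
bc_loss = F.cross_entropy(policy(obs), demo_actions)

# 3) RL fine-tuning: REINFORCE on observed rewards; sparse, expensive signal that
#    requires exploration and temporal credit assignment.
dist = torch.distributions.Categorical(logits=policy(obs))
actions = dist.sample()
rewards = torch.randn(256)  # stand-in for environment rewards
rl_loss = -(dist.log_prob(actions) * rewards).mean()
```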
In cloning, the behaviour comes first and then the biological reward is observed, or not. Behaviours that give no biological reward to the subject can be learned. The subject will still learn some kind of values associated with these behaviours.
Learning with SL, instead of RL, doesn’t rely as much on credit assignment and exploration. What are the consequences of that?
The learned values are those already known by the previous generation.
Why?
Because it is costly to explore your reward function space by yourself
Because it is beneficial to the community to help you improve your policies quickly
Some instrumental goals are learned as final goals; they are “internalised”.
Why?
exploration is too costly
finding an instrumental goal is too rare or too costly
exploitation is too costly
having to make the choice of pursuing an instrumental goal in every situation is too costly or not quick enough (reaction time)
when being highly credible is beneficial
implicit commitments to increase your credibility
Why?
Because it is beneficial to the community to help you improve your policies quickly
We have here three levels of reward functions:
Hardcoded in our body
Optimisation process creating it: Evolution
Universe + Evolution ⇒ Biological rewards
Not really flexible (without “drugs” and advanced biotechnologies)
Almost no generalization power
Physical scope: We feel stuff when we are directly involved
Temporal scope: We feel stuff when they are happening
Similarity scope: We feel stuff when we are directly involved
Called sensations, pleasure, pain
Learned through life
Optimisation process creating it: SL and RL relying on biological rewards
Biological rewards + SL and RL ⇒ Learned values in the brain
Flexible on a timescale of years
Medium generalization power
Physical scope: We learn to care even in cases where we are not directly involved (our close circle)
Temporal scope: We learn to feel emotions about the future and the past
Similarity scope: We learn to feel emotions for other kinds of beings
Called intuitions, feelings
Shard theory may explain only this part
Decided upon reflection
Optimisation process creating it: Thinking relying on the brain
Learned values in the brain + Thinking ⇒ Chosen values “on paper” | “in ideas”
Flexible on a timescale of minutes
Can have up to very high generalization power
Physical scope: We can choose to care without limits of distance in space
Temporal scope: We can choose to care without limits of distance in time
Similarity scope: We can choose to care without limits in terms of similarity to us
Called values, moral values
In short, to get more utility OOD.
A bit more details:
Because we want to design policies far OOD (out of our space of lived experiences). To do that, we know that we need a value function / reward model / utility function that generalizes very far. Thanks to this chosen general reward function, we can plan and try to reach a desired outcome far OOD. After reaching it, we will update our learned utility function (lvl 2).
Thanks to lvl 3, we can design public policies, or dedicate our life to exploring the path towards a larger reward that will never be observed in our lifetime.
This could explain why most philosophers can support scope sensitive values but never act on them.
Ten months later, which papers would you recommend for SOTA explanations of how generalisation works?
From my quick research:
- “Explaining grokking through circuit efficiency” seems great at explaining and describing grokking
- “Unified View of Grokking, Double Descent and Emergent Abilities: A Comprehensive Study on Algorithm Task” proposes a plausible unified view of grokking and double descent (and a guess at a link with emergent capabilities and multi-task training). I especially like their summary plot:
What about the impact of dropout (of parameters or of layers), normalisation (batch, layer) (with a batch containing several episodes), asynchronous distributed data collection (making batch aggregation more stochastic), weight decay (impacting every weight), multi-agent RL training with independent agents, etc.?
And other possible things that don’t exist at the moment: online pruning and growth during training, or population training where the gradient hackers are exploited.
Shouldn’t that naively make gradient hacking very hard?
If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 20 years of effort can still succeed in 40 years
Since the scaling is logarithmic, your example seems to be a strawman.
The real claim debated is more something like:
“If it takes a human 1 month to solve a difficult problem, it seems unlikely that a less capable human who can’t solve it within 100 months of effort can still succeed in 10,000 months.” And this formulation doesn’t seem obviously true.
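In orders-of-magnitude terms (my framing of the point, using the numbers above):

```latex
% Effort extension in the original example vs. the reformulated claim
\log_{10}\frac{40\ \text{years}}{20\ \text{years}} \approx 0.3
\qquad\text{vs.}\qquad
\log_{10}\frac{100\ \text{months}}{1\ \text{month}}
  = \log_{10}\frac{10\,000\ \text{months}}{100\ \text{months}} = 2
```

That is, the original example only extends the allowed effort by a factor of 2, while the claim actually being debated extends it by a factor of 100 at each step.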
Could Anthropic face an OpenAI drama 2.0?
I forecast that Anthropic would likely face a backlash from its employees similar to the one OpenAI faced, if Anthropic’s executives were to knowingly decrease the value of Anthropic shares significantly, e.g. by switching from “scaling as fast as possible” to “safety-constrained scaling”. In that case, I would not find it surprising if a significant fraction of Anthropic’s staff threatened to leave or actually left the company.
The reasoning is simple: given that we don’t observe significant differences between the wages of OpenAI and Anthropic employees, and assuming that they are drawn from roughly the same distribution of skills and skill levels, it seems that Anthropic is not able to use its AI safety focus as a bargaining argument to reduce wages significantly. If true, this would mean that safety is of relatively little importance to most of Anthropic’s employees.
Counter-argument: Anthropic is hiring from a much more restricted pool of candidates, only the safety-concerned ones. In that case, Anthropic would have to pay a premium to hire these people, and it may happen that this premium is roughly equivalent to the discount that these employees are willing to give Anthropic because of its safety focus.
Are memoryless LLMs with a limited context window significantly open-loop? (They can’t use summarization between calls, nor get access to previous prompts.)
I wonder if the result is dependent on the type of OOD.
If you are OOD by having less extractable information, then the results are intuitive.
If you are OOD by having extreme extractable information or misleading information, then the results are unexpected.
Oh, I just read their Appendix A: “Instances Where “Reversion to the OCS” Does Not Hold”
Outputting the average prediction is indeed not the only behavior OOD. It seems that there are different types of OOD regimes.
We see a lot of people die, in reality, in fiction, and in dreams.
We also see a lot of people having sex or feeling sexual desire in fiction or in dreams before experiencing it ourselves.
I don’t know how strong a counter-argument this is to how powerful the alignment in us is. Maybe a biological reward system + imitation + fiction (and later dreams) is simply what is at play in humans.
You can see the sum of the votes and the number of votes (by hovering your mouse over the number). This should be enough to give you a rough idea of the ratio between + and − votes :)
Indeed, but to slightly counterbalance this: at the same time, it looks like it was trained on ~500B tokens (while ~300B were used for GPT-3 and something like ~50B for GPT-2).
Here, with 2 conv layers and fewer than 100k parameters, the accuracy is ~92%. https://github.com/zalandoresearch/fashion-mnist
SOTA on Fashion-MNIST is >96%. https://paperswithcode.com/sota/image-classification-on-fashion-mnist
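For reference, a minimal sketch of a 2-conv model in that parameter regime (my own architecture choice, not necessarily the one behind the linked number):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),  # 10 Fashion-MNIST classes
)

print(sum(p.numel() for p in model.parameters()))  # ~20k parameters, well under 100k
```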
Offering 100-300h of technical work on an AI Safety project
I am a deep learning engineer (2 years of experience). I currently develop vision models to be used on satellite images, and I also do some software engineering around that (LinkedIn profile: https://www.linkedin.com/in/maxime-riche-73696182/). In my spare time, I am organizing an EA local group in Toulouse (France), learning RL, doing a research project on RL for computer vision (only expecting indirect utility from this), and developing an EAA (Effective Animal Advocacy) tool. I have been in the French EA community for 4 years. In 2020, I chose to work part-time in order to dedicate 2 to 3 days of work per week to EA-aligned projects.
Thus, for the next 8 months, I have ~10h/week that I want to dedicate to assisting an AI safety project. I am not looking for funds, nor to publish a paper or a blog post myself.
To me, the ideal project would be:
a relevant technical AI safety project (research or not). I am looking for advice on the “relevant” part.
where I would be able to help the project achieve better quality results than it would without my contribution (e.g. through writing better code, doing more experiments, or testing other designs)
where I can learn more about technical AI safety
where my contribution would include writing code. If it is a research proposal, then implementing experiments. If there is currently no experimental part in the project, I could take charge of creating one.