Good idea—will run this experiment!
Also, I meant to ask you: what does the learning rate schedule of these models look like? In a lot of the summary-statistics plots we see peaks, asymptotes, and sometimes clear phase transitions between checkpoints 20 and 40, and I was wondering if this is related to the learning rate schedule somehow (end of warmup?).
That was my own immediate response: “if these distributions are so universal, why doesn’t this show that standard initializations suck, and that you should reverse-engineer the final distribution and initialize that way?”
It might show this. As far as I know nobody has done this experiment. Either way results would be interesting.
Either the model won’t train or will train much slower, which suggests that the understanding or training setup here is totally wrong in some way; or it will train at the same speed, suggesting that the distributions are misleading and more like epiphenomena or side-effects of whatever is actually training/‘doing the work’ (which is still going on under the hood and just no longer visible in crude summary statistics); or it will train much, much faster, which is a huge optimization win and also verifies the importance of the initialization distribution being correct, with all the theoretical implications thereof.
My intuition/prediction here (which I hold with fairly low confidence) is that if you initialised it in this way, the early part of training would be sped up, because a lot of what the model is doing in the first few steps of training, when loss is rapidly decreasing, seems to be just broadly fitting the general scale and distributional shape of the data distribution. We see that the model reaches something like its final distribution quite rapidly, and I expect the second, much longer phase to be where most of the important representation formation happens, which won’t be much affected by initialising in this way. So basically we would get a slight speed boost / cut out a number of early training steps, but the benefits would diminish after the earliest phase of training. Would definitely be interesting to look at more systematically.
Just do something nonparametric, like sampling from a histogram/CDF or permuting the parameters of a finished model, and then maybe train on some equivalent held-out text dataset to reduce any lottery-ticket weirdness.
These are both good ideas.
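For concreteness, here is a minimal sketch of the two nonparametric schemes (permute the trained weights, or resample from their empirical distribution), assuming PyTorch and that `new_model` and `trained_model` share an architecture—both names are placeholders, not anything from the actual experiments:

```python
import torch

@torch.no_grad()
def reinit_from_trained(new_model, trained_model, mode="permute"):
    """Re-initialize `new_model` nonparametrically from a trained model's weights.

    mode="permute": shuffle each trained weight tensor and copy it over, which
        preserves the empirical distribution exactly but destroys all structure.
    mode="sample":  draw i.i.d. samples with replacement from each tensor's
        empirical distribution (i.e. sample from its empirical CDF).
    """
    for (_, p_new), (_, p_trained) in zip(
        new_model.named_parameters(), trained_model.named_parameters()
    ):
        flat = p_trained.detach().flatten()
        if mode == "permute":
            vals = flat[torch.randperm(flat.numel())]
        else:  # "sample"
            idx = torch.randint(0, flat.numel(), (p_new.numel(),))
            vals = flat[idx]
        p_new.copy_(vals.view_as(p_new))
```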
It looks like you’re experimenting with the 5 smallest models, but haven’t done analysis on the 2.8B, 6.9B, or 12B models. Is that something you’re planning on adding, or no?
We have done some preliminary analyses on these as well. Primary issue is just that these experiments take longer since the larger models take longer to instantiate from checkpoint (which adds up when there are 142 checkpoints). Am planning to run the same experiments on the larger models and update the post with them at some point however.
I am really very surprised that the distributions don’t seem to match any standard parameterized distribution. I was fully ready to say “okay, let’s retrain some of the smaller Pythia models initialized using the distribution you think the weights come from”, but apparently we can’t do that easily. I suppose we could use an MCMC sampler?
I agree the distribution thing is weird and not what I was expecting. I have so far tried fitting Gaussian, power-law, and logistic distributions, and none are super close in general. I have also tried general fits to generalised exponential functions of the form exp(kx^\alpha), where k and \alpha are free parameters, but this optimization tends to be numerically unstable and gives bad results whenever I have tried it. Other people at Conjecture, following the PDLT book, have tried fitting the fourth-order perturbative expansion—i.e. exp(x^2 + \gamma x^4)—which also runs into numerical issues.
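For what it’s worth, the generalised exponential family mentioned above corresponds (up to normalisation and sign conventions) to the generalized normal distribution, so one way to sidestep a hand-rolled optimiser might be scipy’s built-in fitter—a sketch, assuming the flattened weights of a layer are in a 1-D array `weights`:

```python
import numpy as np
from scipy import stats

# weights: 1-D numpy array of flattened parameters from one layer/checkpoint
beta, loc, scale = stats.gennorm.fit(weights)  # pdf proportional to exp(-|(x - loc)/scale|^beta)
print(f"fitted shape beta={beta:.3f}  (beta=2 is Gaussian, beta=1 is Laplace)")

# compare against plain Gaussian and Laplace fits via total log-likelihood
ll_gennorm = stats.gennorm.logpdf(weights, beta, loc, scale).sum()
ll_gauss   = stats.norm.logpdf(weights, *stats.norm.fit(weights)).sum()
ll_laplace = stats.laplace.logpdf(weights, *stats.laplace.fit(weights)).sum()
print(ll_gennorm, ll_gauss, ll_laplace)
```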
I agree that it seems to refute some of the theoretical assumptions of the NTK literature, but I wonder if perhaps it’s consistent with the [Tensor Programs](https://arxiv.org/abs/2203.03466) work by Greg Yang et al. that led to muP.
Maybe? I haven’t studied Tensor Programs in extreme detail, but my understanding is that they assume Gaussian limits for their proofs. However, afaik muP does work in practice, so maybe this isn’t such a big deal?
To clarify what’s going on with the Pythia models:
This is great to have clarified, thanks! I’ll tone down the disclaimer then and add the note about the new nomenclature.
I understand that these are working with public checkpoints, but if you have internal models I’d be interested to see similar statistics for the size of weight updates, both across the training run and within short periods, to see if there are correlations between which weights are updated. Do you get quite consistent, smooth updates, or can you find little clusters where connected weights all change substantially in just a few steps?
We do have internal models and we have run similar analyses on them. For obvious reasons I can’t say too much about this, but in general what we find is similar to the Pythia models. I think the effects I describe here are pretty general across quite a wide range of LLM architectures. Generally, most changes seem quite smooth for both Pythia and other models. I haven’t looked much at correlations between specific weights, so can’t say much about that.
Also, I think the axes in the first graphs of ‘power law weight spectra…’ are mislabelled—should they be rank/singular value?
Thanks for this! This is indeed the case. Am regenerating these plots and will update.
My understanding is that, after a lot of simplifications, policy gradients just take a noisy gradient step in the direction of minimising the Bellman error, and so in the limit of infinite data/computation/visiting all states in the world, they are ‘guaranteed’ to converge to an optimal policy for the MDP. Q-learning and other model-free algorithms have similar guarantees. In practice, with function approximation and PPO’s regularisation bits, these guarantees no longer hold, but the fundamental RL they are built on does have them. The place to go deeper into this is Sutton and Barto’s textbook and also Bertsekas’ dynamic programming textbook.
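For reference, the kind of tabular update those guarantees attach to is the Q-learning rule (step size $\alpha$, discount $\gamma$), which converges to the optimal $Q^*$ provided every state-action pair is visited infinitely often and the step sizes satisfy the usual Robbins-Monro conditions:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$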
I agree that this will probably wash out under strong optimization against it, and that such confusions become less likely the more different the world models of yourself and the other agent you are trying to simulate are—this is exactly what we see with empathy in humans! This is definitely not proposed as a full ‘solution’ to alignment. My thinking is that this effect may be useful in providing a natural hook to ‘caring’ about others, which we can then design training objectives and regimens around, allowing us to extend and optimise this value shard to a much greater extent than it occurs naturally.
(i) It will actually be possible to (measure and) limit the amount of “optimization pressure” that an advanced A(G)I exerts (towards a given goal). If so, how likely do you think (i) is to be true?
If you have any ideas (even very rough/vague ones) for how to realize (i), I’d be curious to read them.
For this, it is not clear to me that it is impossible or even extremely difficult to do, at least in a heuristic way. I think that managing to successfully limit the optimization power applied against our defences is fundamental to coming up with alignment techniques that can work in practice. We need some way to bound the adversary, otherwise we are essentially doomed by construction.
There are a whole bunch of ideas you can try here which work mostly independently and in parallel—examples include:
1.) Quantilization (see the sketch after this list)
2.) Impact regularization
3.) General regularisation against energy use, thinking time, compute cost
4.) Myopic objectives and reward functions; high discount rates
5.) Limiting serial compute of the model
6.) Action randomisation / increasing entropy—something like dropout over actions
7.) Satisficing utility/reward functions
8.) Distribution matching objectives instead of argmaxing
9.) Penalisation of divergence from a ‘prior’ of human behaviour
10.) Maintaining value uncertainty estimates and acting conservatively within the outcome distribution
These are just examples I have thought of immediately. There are a whole load more if you sit down and brainstorm for a while.
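To make the quantilization idea (item 1) concrete, here is a toy sketch—illustrative only; the candidate actions, their probabilities under a trusted base distribution, and the utility estimates are all assumed to be given from elsewhere:

```python
import numpy as np

def quantilize(actions, base_probs, utilities, q=0.1, rng=None):
    """Toy q-quantilizer: instead of argmaxing utility, sample from the top-q
    fraction (by base-distribution probability mass) of actions ranked by utility.

    actions:    list of candidate actions
    base_probs: probability of each action under a trusted 'base' distribution
    utilities:  estimated utility of each action
    q:          fraction of base probability mass to keep (0 < q <= 1)
    """
    rng = rng or np.random.default_rng()
    base_probs = np.asarray(base_probs, dtype=float)
    order = np.argsort(utilities)[::-1]        # highest-utility actions first
    kept, mass = [], 0.0
    for i in order:                            # keep actions until q mass is reached
        kept.append(i)                         # (boundary action included fully: a simplification)
        mass += base_probs[i]
        if mass >= q:
            break
    kept = np.array(kept)
    p = base_probs[kept] / base_probs[kept].sum()  # renormalise within the kept set
    return actions[rng.choice(kept, p=p)]          # sample rather than argmax
```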
In terms of measuring optimization power, I don’t think this is that hard to do roughly. We can definitely define it in terms of outcomes as the KL divergence of the achieved distribution vs some kind of prior ‘uncontrolled’ distribution. We already implement KL penalties in RL like this. Additionally, rough proxies are serial compute, energy expenditure, compute expenditure, divergence from previous behaviour, etc.
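As a toy illustration of that outcome-based measure (the discretisation of outcomes into bins and the ‘uncontrolled’ prior are assumptions of the sketch, not anything worked out here):

```python
import numpy as np

def optimization_power_bits(achieved_probs, prior_probs, eps=1e-12):
    """Rough outcome-based proxy for optimization power: KL divergence (in bits)
    between the distribution over outcomes the agent actually achieves and a
    prior 'uncontrolled' outcome distribution over the same outcome bins."""
    p = np.asarray(achieved_probs, dtype=float) + eps
    q = np.asarray(prior_probs, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()            # renormalise after smoothing
    return float(np.sum(p * np.log2(p / q)))   # D_KL(achieved || prior)
```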
It will be possible to end the acute risk period using an A(G)I that is limited in the above way.
The major issue is what level of alignment tax these solutions impose and whether it is competitive with other players. This ultimately depends on the amount of slack available in the immediately post-AGI world. My feeling is that it is possible there is quite a lot of slack here, at least at first, and that most of the behaviours we really want to penalise for alignment purposes are quite far from the most likely behaviour—i.e. there is very little benefit to us in the AGI having such a low discount rate that it is planning about tiling the universe with paperclips billions of years from now.
I also don’t think of these so much as solutions but as part of the solution—i.e. we still need to find good, robust ways of encoding human values as goals, detect and prevent inner misalignment, and have some approach to managing Goodharting.
I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)
Yep, this is definitely not proposed as some kind of secure solution to alignment (if only the world were so nice!). The primary point is that if this mechanism exists, it might provide some kind of base signal which we can then further optimize to get the agent to assign some kind of utility to others. The majority of the work will of course be getting that to actually work in a robust way.
You pointed out that empathy is not a silver bullet. I have a vague (but poignant) intuition that says that the problem is a lot worse than that: Not only is empathy not a silver bullet, it’s a really really imprecise heuristic/proxy/shard for {what we actually care about}, and is practically guaranteed to break down when subjected to strong optimization pressure.
Yes. Realistically, I think almost any proxy like this will break down under strong enough optimization pressure, and the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax.
I think this is a mechanism that actually happens a lot. People generally do lose a lot of empathy with experience and age. People definitely get de-sensitized to both strongly negative and strongly positive experiences after viewing them a lot. I actually think that this is more likely than the RL story—especially with positive-valence empathy which under the RL story people would be driven to seek out.
But then every time that empathy thing happens, I obviously don’t then immediately eat chocolate. So the reward model would get an error signal—there was a reward prediction, but the reward didn’t happen. And thus the brain would eventually learn a more sophisticated “correct” reward model that didn’t fire empathetically. Right?
My main model for why this doesn’t happen in some circumstances (but definitely not all) is that the brain uses these signals and has a mechanism for actually providing positive or negative reward when they fire, depending on other learnt or innate algorithms. For instance, you could pass the RPE through to some other region to detect whether the empathy was triggered for a friend or an enemy and then return either positive or negative reward, thus implementing either shared happiness or schadenfreude. Generally I think of this mechanism as a low-level substrate on which you can build up a more complex repertoire of social emotions by doing reward shaping on these signals.
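A toy sketch of that routing (purely illustrative—the ‘relationship’ signal and the gains are made up, and nothing here is a claim about actual neural implementation):

```python
def socially_modulated_reward(empathic_rpe, relationship):
    """Toy gate: an empathy-triggered reward-prediction error for another agent's
    outcome is modulated by a learnt/innate relationship signal, yielding shared
    happiness (friend) or schadenfreude (enemy); unknown agents yield no reward."""
    gain = {"friend": +1.0, "enemy": -1.0}.get(relationship, 0.0)
    return gain * empathic_rpe
```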
Also—I really like your post on empathy that cfoster linked above! I have read a lot of your work but somehow missed that one, lol. Cool that we are thinking at least somewhat along similar lines.
It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there’s some common currency to the experience (for ex. they’re feeling pain, and I’ve also experienced pain), but probably less so when there’s a greater gap. Since AIs won’t share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI to have separate circuits for modeling humans and for modeling itself.
Yes, this depends a lot on the self model of the AGI. It’s definitely not a silver bullet. The AGI will almost certainly have a very good model of humans, their culture, and how their minds work from various self-supervised losses. Whether the AGI conceptualises itself as close to this or not depends on the representations of AGI in the dataset as well as potentially our training regime.
Nitpick about terminology: I think the stuff you’re talking about is primarily attributable to having a learned value function rather than to having a learned reward model in the narrow sense of a predictor of immediate reward. I tend to use value function to refer to the thing that, alongside the reward function, produces visceral (gut-like) reactions to thoughts based on forecasts that were learned via something like TD learning
I agree it is not necessarily the reward model that generates direct feelings. I think it is hard to connect any part of an RL system directly to gut-level ‘feels’ because we don’t really know what these are. The value function is just the estimate of the long-run reward and is trained with a supervised Bellman-equation objective. It is very possible that the machinery that creates this won’t exist at all in the AGI, or maybe it is just some intrinsic property of RL agents, I don’t know.
My prediction is that there really is an evolved nudge towards empathy in the human motivational system, and that human psychology—like usually being empathetic but sometimes modulating it and often justifying self-serving actions—is sculpted by such evolved nudges, and wouldn’t be recapitulated in AI lacking those nudges.
I agree—this is partly what I am trying to say in the contextual modulation section. The important thing is that the base capability for empathy might exist as a substrate to then get sculpted by gradient descent / evolution to implement a wide range of adaptive pro or anti-social emotions/behaviours. Which of these behaviours, if any, get used by the AI will depend on the reward function / training data it sees.
Goal misgeneralization (the global min might also be malign OOD). The thing you’re talking about here I would basically describe as the first thing.
Is there a way this is different from standard Goodharting concerns? I totally agree that this is a problem, but it seems importantly conceptually distinct to me from gradient hackers/mesaoptimization.
Strongly upvoted this post. I agree very strongly with every point here. The biggest consideration for me is that alignment seems like the kind of problem which is primarily bottlenecked on serial conceptual insights rather than parallel compute. If we already had alignment methods that we know would work if we just scaled them up, the same way we have with capabilities, then racing to endgame might make sense given the opportunity costs of delaying aligned AGI. Given that a.) we don’t have such techniques and b.) even if we did it would be hard to be so certain that they are actually correct, racing to endgame appears very unwise.
There is a minor tension with capabilities in that I think that for alignment to progress it does need some level of empirical capabilities results both in revealing information about likely AGI design and threat models and also so we can actually test alignment techniques. I think that e.g. if ML capabilities had frozen at the level of 2007 for 50 years, then at some point we would stop being able to make alignment progress without capabilities advancements but I think that in the current situation we are very very far from this Pareto frontier.
The local minima point is interesting. My initial question is how this squares with both theoretical and empirical findings that networks generally don’t seem to get stuck in local minima, and the many hints that the general loss landscape in which they operate is fairly benign?
I think this is only possible if the coupling between the gradient hacker’s implementation of its malign behaviour and the good performance is extremely strong—essentially the correlation has to be 1. It is not as if gradient descent has only one knob to turn for ‘more gradient hacker’ or ‘less gradient hacker’. Instead, it has access to all of the internal weights of the gradient hacker and will change them to both a.) strengthen the positive aspects of the gradient hacker wrt the outer loss and b.) weaken the negative aspects. I.e. if the gradient hacker is good at planning, which is useful for the model, but is malign in some other way, then gradient descent will strengthen the planning-related parameters and weaken the malign ones simultaneously. The only way this fails is if there is literally no way to decouple these two aspects of the model, which I think would be very hard to maintain in practice. This is basically property 1: gradient descent optimises all parameters in the network and leaves no slack.
I broadly agree with a lot of shard theory claims. However, the important thing to realise is that ‘human values’ do not really come from inner misalignment wrt our innate reward circuitry, but rather are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values. Most human value learning occurs through this linguistic learning, grounded by our innate drives but extended to much higher abstractions by language. I.e. for humans, we learn our values as some combination of bottom-up (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down association of abstract value concepts with other, more grounded linguistic concepts.
With AGI, the key will be to work primarily top-down, since our linguistic constructs of values tend to reflect our ideal values much better than our actually realised behaviours do. The idea is to use the AGI’s ‘linguistic cortex’, which already encodes verbal knowledge about human morality and values, to evaluate potential courses of action and to serve as a reward signal which can then get crystallised into learnt policies. The key difficulty is understanding how, in humans, the base reward functions interact with behaviour to make us ‘truly want’ specific outcomes (if humans even do), as opposed to reward or its correlated social assessments. It is possible, even likely, that this is just the default outcome of model-free RL experienced from the inside, and in this case our AGIs would look highly anthropomorphic.
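A hypothetical sketch of what ‘using the linguistic cortex as a reward signal’ could look like—the `lm_logprob` interface and the prompt are assumptions for illustration, not a real API or a worked-out proposal:

```python
def value_reward(plan_description, lm_logprob):
    """Score a candidate plan by the language model's log-odds of judging it
    consistent with human values; this scalar could then be used as a reward
    signal for policy learning (RLAIF-flavoured)."""
    prompt = (
        "Plan: " + plan_description + "\n"
        "Is this plan consistent with widely shared human values? Answer:"
    )
    good = lm_logprob(prompt, " Yes")  # log P(" Yes" | prompt)
    bad = lm_logprob(prompt, " No")    # log P(" No" | prompt)
    return good - bad                  # log-odds the model judges the plan acceptable
```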
Also, in general I disagree that aligning agents to evaluations of plans is unnecessary. What you are describing here is just direct optimization. But direct optimization—i.e. effectively planning over a world model—is necessary in situations where a.) you can’t behaviourally clone existing behaviour and b.) you can’t self-play too much with model-free RL algorithms and so must rely on the world model. In such a scenario you do not have ground-truth reward signals, and the only way to make progress is to optimise against some implicit learnt reward function.
I am also not sure that an agent that explicitly optimises this is hard to align, and that the major threat is Goodharting. We can perfectly align Go-playing AIs with this scheme because we have a ground-truth exact reward function. Goodharting is essentially isomorphic to a case of overfitting and can in theory be solved with various kinds of regularisation; especially if the AI maintains a well-calibrated sense of reward-function uncertainty, then in theory we can derive quantification bounds on its divergence from the true reward function.
The convergence theorems basically say that optimizing for P[t] converges to optimizing for T[t+d] for some sufficient timespan d.
The idea of a convergence theorem showing that optimizing any objective leads to empowerment has been brought up a bunch of times in these discussions, as in this quote. Is there some well-known proof/paper where this is shown? AFAICT the original empowerment papers do not include any proof like this (I may have missed it). Is this based on Alex Turner’s work (https://arxiv.org/pdf/1912.01683.pdf), which results in a different measure than information-theoretic empowerment (though one that is intuitively related), or something else?
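For reference, the standard information-theoretic definition from the original empowerment work (Klyubin, Polani & Nehaniv) is the channel capacity from an $n$-step action sequence to the resulting state:

$$\mathfrak{E}(s_t) = \max_{p(a_t, \ldots, a_{t+n-1})} I\!\left(A_t, \ldots, A_{t+n-1};\ S_{t+n} \mid s_t\right)$$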
Excellent post btw.
Yes, they have. There’s quite a large literature on animal emotion and cognition, and my general synthesis is that animals (at least mammals) have at least the same basic emotions as humans, and often quite subtle ones such as empathy and a sense of fairness. It seems pretty likely to me that, whatever the set of base reward functions encoded in the mammalian basal ganglia and hypothalamus is, it can quite robustly generate expressed behavioural ‘values’ that fall within some broadly humanly recognisable set.
I think we agree here. Testing whether it converges to a better optimum would also be interesting.
Yes. I feel that this might help especially with warmup, which could plausibly be needed because at the start there are very large and mostly non-informative gradients pushing the weights towards just having the right distribution—gradients which would be removed if you start out at the right distribution.