Interested in many things. I have a personal blog at https://www.beren.io/
beren (Beren Millidge)
Thanks for the detailed response! It clarifies some of my concerns and I think we have a lot of agreement overall. I'm also going to go in near reverse order.
To a first approximation, compute_cost = size*speed. If AGI requires brain size, then the first to cross the finish line will likely be operating not greatly faster than the minimum speed, which is real-time. But this does not imply the agents learn at only real time speed, as learning is parallelizable across many agent instances. Regardless, none of these considerations depend on whether the AGI is trained in a closed simbox or an open sim with access to the internet.
To me the time/cost issue with the simboxes you proposed is in the data you need to train the AGIs from within the sim to prevent information leakage. Unlike with current training, we can't just give it the whole internet, as that will contain loads of information about humans, how ML works, the fact that it is in a sim, etc., which would be very dangerous. Instead, we would need to recapitulate the entire *data generating process* within the sim, which is what would be expensive. Naively, the only way to do this would be to actually simulate a bunch of agents interacting with the sim world for a long time, which would be at minimum simulated-years for human-level data efficiency and much, much longer for current DL. It is possible, I guess, to amortise this work and create one 'master-sim' so that we can try various AGI designs which all share the same dataset, and this would be good experimentally to isolate the impact of architecture/objective vs dataset. But under the reward-proxy learning approach, a large part of the success of alignment depends on the dataset, which would be very expensive to recreate in sim without information transfer from our reality.
Training current ML models is very fast because they can use all the datasets already generated by human civilisation. Bootstrapping to similar levels of intelligence in a sim, without wholesale transfer of information from our reality, will require a concomitant amount of computational effort, more like simulating our civilisation than simulating a single agent.
The ideal baseline cost of simboxing is only O(N+1) vs O(N) without—once good AGI designs are found, the simboxing approach requires only one additional unboxed training run (compared to never using simboxes). We can estimate this additional cost: it will be around or less than 1e25 ops (1e16 ops/s for a brain-size model * 1e9 s for 30 years equivalent), or less than $10 million (300 GPU-years) using only today's GPUs, i.e. nearly nothing.
I don't understand this. Presumably we will want to run a lot of training runs in the sim, since we will probably need to iterate a considerable number of times to actually succeed in training a safe AGI. We will also want to test across a large range of datasets and initial conditions, which will necessitate collecting a number of large and expensive sim-specific datasets. It is probably also necessary to simulate reasonably sized sim populations, which will further increase the cost.
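For scale, the single-run figure quoted above works out roughly as follows (the per-GPU throughput and rental price in this sketch are my own assumptions, not from the quote); the objection is that this baseline then gets multiplied by many iterations, datasets, and simulated populations:

```python
# Back-of-envelope check of the quoted single-run cost. The per-GPU
# throughput and price below are my own rough assumptions.
model_ops_per_s = 1e16            # assumed ops/s for a brain-scale model
train_seconds = 1e9               # ~30 years of subjective experience
total_ops = model_ops_per_s * train_seconds            # ~1e25 ops

gpu_ops_per_s = 1e15              # assumed effective throughput of one GPU
gpu_years = total_ops / gpu_ops_per_s / 3.15e7          # ~300 GPU-years
cost = gpu_years * 3e4            # assumed ~$30k per GPU-year

print(f"~{total_ops:.0e} ops, ~{gpu_years:.0f} GPU-years, ~${cost / 1e6:.0f}M")
# -> roughly 1e25 ops, ~300 GPU-years, ~$10M per run
```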
But let's suppose there still is significant optimization slack, then in a sense you've almost answered your own question ... we can easily incorporate new algorithmic advances into new simboxes or even upgrade agents mid-sim using magic potions or whatnot.
Perhaps I'm missing something here but I don't understand how this is supposed to work. The whole point of the simbox is that there is no information leakage about our reality. Having AGI agents doing ML research in a reality which is close enough to our own that its insights transfer to our reality defeats that whole point! On the other hand, if we invent some magical alternative to the intelligence explosion, then we, the simulators, won't necessarily be able to invent the new ML techniques that are 'invented' in the sim.
Secondly, the algorithms of intelligence are much simpler than we expected, and brains already implement highly efficient or even near pareto-optimal approximations of the ideal universal learning algorithms.
To the extent either of those major points are true, rapid FOOM is much less likely; to the extent both are true (as they appear to be), then very rapid FOOM is very unlikely.
I agree that FOOM is very unlikely from the view of the current scaling laws, which imply strongly sublinear returns on investment. The key unknown quantity at this point is the returns on 'cognitive self-improvement', as opposed to just scaling in terms of parameters and data. We have never truly measured this, as we haven't yet developed appreciably self-modifying and self-improving ML systems. On the outside view, power-law diminishing returns are likely in this domain as well, but we just don't know.
Similarly, I agree that if contemporary ML is already in its asymptotically optimal scaling regime—i.e. if it is a fundamental constraint of the universe that intelligence can do no better than power-law scaling (albeit with potentially much better coefficients than now), then FOOM is essentially impossible and I think that some form of humanity stands a pretty reasonable chance of survival. There is some evidence that ML is in the same power-law scaling regime as biological brains, as well as many algorithms from statistics, but I don't think the evidence is conclusively against the possibility of a radically better paradigm which perhaps both we and evolution haven't found, potentially because it requires some precise combination of a highly parallel brain and a fast serial CPU-like processor which couldn't be built by evolution with biological components. Personally, and it would be great if you could convince me otherwise, I think there are a lot of unknown unknowns in this space and that the evidence from current ML and neuroscience isn't that strong against there being unknown and better alternatives that could lead to FOOM. Ideally, we would understand the origins of scaling laws well enough that we could figure out computational complexity bounds on the general capabilities of learning agents.
But even without rapid FOOM, we still can have disaster—for example consider the scenario of world domination by a clan of early uploads of some selfish/evil dictator or trillionaire. There’s still great value in solving alignment here, and (to my eyes at least) much less work focused on that area.
Yes of course, solving alignment in this regime is extremely valuable. With any luck, reality will be such that we end up in this regime, and I think alignment is actually solvable here, while I'm very pessimistic about a full FOOM scenario. Indeed, I think we should spend a lot of effort figuring out whether FOOM is even possible and, if it is, how to stop the agents we build from FOOMing, since this scenario is where a large amount of p(doom) comes from.
Assume there was 1.) large algorithmic slack, and 2.) some other approach that was both viable and significantly different, then it would have to:
not use adequate testing of alignment (ie simboxes)
or not optimize for product of intelligence potential and measurable alignment/altruism
If there is enough algorithmic slack such that FOOM is likely, then I think our capabilities to simulate such an event in simboxes will be highly limited, and so we should focus much more on designing general safe objectives which, ideally, we can mathematically show can scale over huge capability gaps, if such safe objectives exist at all. We should also spend a lot of effort figuring out how to constrain AGIs such that they don't want to or can't FOOM. I completely agree, though, that in general we should spend a lot of effort building simboxes and measurably testing for alignment before deploying anything.
Thanks for writing this! It's always good to get critical feedback about a potential alignment direction to make sure we aren't doing anything obviously stupid. I agree with you that fine-grained prediction of what an AGI is going to do in any situation is likely computationally irreducible even with ideal interpretability tools.
I think there are three main arguments for interpretability which might well be cruxes.
1.) As Erik says, interpretability tools potentially let us make coarse-grained predictions about the model utilizing fine-grained information. While predicting everything the model will do is probably not feasible in advance, it might be very possible to get pretty reliable answers to coarse-grained questions such as 'is this model deceptive', 'does it have potentially misaligned mesaoptimizers', 'does its value function look reasonably aligned with what we want given the model's ontology', or 'is it undergoing / trying to undergo FOOM'. The model's architecture might also be highly modular, so that we could understand/bound a lot of the alignment-relevant behaviour of the model while only understanding a small part of it. This seems especially likely to me if the AGI's architecture is hand-designed by humans – i.e. there is a 'world model' part, a 'planner' part, a 'value function', and so forth. We can then potentially get a lot of mileage out of just interpreting the planner and value function, while the exact details of how the model represents, say, chairs in its world model are less important for alignment.
2.) What we ultimately likely want is a statistical-mechanics-like theory of how neural nets learn representations, including what circuits/specific computations they tend to implement, how these evolve during training, what behaviours they give rise to, and how they behave off-distribution. Having such a theory would be super important for alignment (although it would not solve it directly). Interpretability work provides key bits of evidence that can be generalized to build this theory.
3.) Interpretability tools could let us perform highly targeted interventions on the system without needing to understand the full system. This could potentially involve directly editing out mesaoptimizers or deceptive behaviours, or adjusting goal misgeneralization by tweaking the internal ontology of the model. There are many cases in science where we can produce reliable and useful interventions on systems with only partial information and understanding of their most fine-grained workings. Nevertheless, we need some understanding, and interpretability appears to give a reliable path to obtaining it.
Should we expect interpreting AGI-scale neural nets to succeed where interpreting biological brains failed?
There are quite a lot of reasons why we should expect interpretability to be much easier than neuroscience:
We know exactly the underlying computational graph of our models – this would be akin to starting out in neuroscience knowing exactly how neurons, synapses, etc. work, as well as knowing the full connectome and the large-scale architecture of the brain.
We know the exact learning algorithm our models use – in neuroscience this would be starting out knowing, say, the cortical update rule as well as the brain’s training objective/loss function.
We know the exact training data our models are trained on
We can experiment on copies of the same model as opposed to different animals with different brains / training data / life histories etc
We can instantly read all activations, weights, essentially any quantity of interest simultaneously and with perfect accuracy – simply being able to read neuron firing rates is very difficult in neuroscience and we have basically no ability to read large numbers of synaptic weights
We can perform arbitrary interventions at arbitrarily high fidelity on our NNs
These points mean experimental results are orders of magnitude faster and easier to get. A typical interpretability experiment looks like: load model into memory, perform a precise intervention on the model, look at a huge number of possible outputs, iterate. Neuroscience experiments often look like: train mice to do some task for months, insert probes or perform some broad-based intervention where you are not sure exactly what you are measuring or what your intervention actually affected, get a bunch of noisy data from a small sample with potential systematic errors/artifacts from your measurement process (where you can only read a tiny fraction of what you would like to read), and then try to understand what is going on. It is much harder and slower!
Secondly, the Blue Brain Project is just one example of a high-profile failure in neuroscience and we shouldn't generalize too much based on it. I have had no experience with the Blue Brain Project, but it seems plausible to me that this could just be a standard case of garden-variety mismanagement and overoptimistic goals and hype. My perspective, as someone who has worked in neuroscience, is that the field just keeps chugging along accumulating knowledge and is advancing at a reasonable pace – i.e. our understanding of the brain has improved vastly over the last 20 years. Not at the pace probably required for alignment on short timelines, but at a decent level for a scientific field.
Given this, it is not clear that interpretability is doomed to failure, and it seems like a reasonably high-EV bet to me.
In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.
If by 'empiricism' as a channel you mean 'just run the model and see what it does', then while some amount of empiricism is necessary, it is also a highly dangerous channel to observe through, and so we should be trying to get as many bits as possible without requiring it. Pre-deployment interpretability is one way to potentially get a lot of bits of information (although, I agree, not all the bits) without having to expose yourself to this danger. Hopefully interpretability, even using only fairly coarse-grained information, can already eliminate a lot of 'obviously dangerous AGIs' before you have to run them.
Some empiricism will likely be necessary, though, and I agree that designing highly secure test-harnesses for the AGI, akin to BSL-4 labs, will be another necessary component of safety. However, high-powered interpretability will be a vital safety measure to apply inside such a safe test-harness. Ideally, while the AGI is running in some simulation, we can use interpretability tools to constantly scan it for deception, for realization that it is in a simulation, or for any other kind of malicious intent, and then abort or intervene to fix things if so. I think just relying on airgapping and other physical measures is a bad idea against a superintelligence, and in practice we should be trying to build up as many uncorrelated lines of defense as possible in the test-harness.

And given that information channels are dual-use (in that the AGI can also use them for sandbox escape), we should prioritize efficient information channels like empiricism, rather than inefficient ones like fine-grained interpretability. Inefficient information channels can be net-negative, because they may be more useful for the AGI's sandbox escape compared to their usefulness to alignment researchers.
This is an interesting objection. While it is possible that the AGI could hack the interpretability channels, this seems much more difficult than just communicating in the normal way by running in the world – i.e. through the 'empirical channel'. This is because it probably doesn't know (and we should be careful to make sure that it definitely does not know) what interpretability tools we are using on it, and because interpretability tools can be applied before we actually run the AGI (i.e. directly to the weights) or to individual components in isolation.
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
Deconfusing Direct vs Amortised Optimization
This is an interesting question! At the end of the post / in the colab we experiment with knocking out specific singular directions and show that this differentially affects tokens of roughly the same semantics. We find this to be quite a robust effect, but actually affecting network output can be surprisingly difficult, as there seems to be a large amount of redundancy where similar processing happens in many layers/blocks simultaneously.
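For concreteness, a single-direction knockout of the kind described above looks roughly like this (a minimal sketch; the commented-out weight access path is an illustrative HuggingFace-style one, not necessarily what the colab uses):

```python
# Minimal sketch of "knocking out" one singular direction of a weight matrix
# and reconstructing it. The access path below is illustrative only.
import torch

def knock_out_direction(W: torch.Tensor, k: int) -> torch.Tensor:
    """Zero the k-th singular value of W and rebuild the matrix."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    S = S.clone()
    S[k] = 0.0
    return U @ torch.diag(S) @ Vh

# e.g. ablate the 3rd singular direction of one MLP output matrix in GPT-2:
# W = model.transformer.h[5].mlp.c_proj.weight.data
# model.transformer.h[5].mlp.c_proj.weight.data = knock_out_direction(W, 3)
```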
Knocking out every interpretable/uninterpretable column is a cool idea and we haven't tried it. My suspicion is that this would just do too much damage to the network and would scramble things, but it might be worth a shot.
Thanks for this! This was just a copy-paste error and it's fine in the colab. I have updated the post now with the correct visuals.
Potentially? My suspicion would be that in this case we would expect the basis in the residual stream to be extremely output-basis aligned while at the moment there is no real pressure for it to be (but it seems to be pretty output-aligned regardless, which is convenient for us). This might be a fun thing to fine-tune on.
We have some preliminary results on this towards the end of the post/colab in the ‘directly editing SVD directions’ section, and are working towards improving on these currently as well as comparing to other methods such as ROME edits.
I broadly agree with this. This method definitely does not uncover any nonlinear representations in the network and is not expected to. We are primarily trying to uncover the relatively ‘easy’ information we can get with linear methods first. In further defence of linear methods, I would also argue that ‘most’ of the transformer architecture is pretty linear looking. The residual stream is linear, and the I/O matrices reading from and writing to the residual stream are also linear (if we ignore the layernorms!). I suspect that because of this some kind of linear directions might be the best way to understand representations in the residual stream, as well as writes to it, but that obviously the process of computing these writes involves nonlinear token-wise functions for the MLPs and nonlinear mixing across tokens for the attention blocks.
Quick feedback that the graph after this paragraph feels sketchy to me—obviously the singular values are zero beyond 64, and they’re so far low down that all singular values above look identical. But the y axis is screwed up, so you can’t really see this. What does the graph look like if you fix it?
Indeed, in retrospect presenting the graph this way seems to have confused a lot of people, and I have now updated it to be cut off at 64 and just show the spectrum up to that point, where we see a clear exponential decay in singular values (which still remain not too small) all the way down to 64, and a slightly faster-than-exponential initial decay. All the code is in the colab, so you can set it to a linear scale if you want. Personally I think log-scaling tends to make more sense for spectrum graphs, as they are usually exponentials or power laws.
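If you want to reproduce the shape of the updated plot without the colab, it is roughly the following (sketch only; a random low-rank stand-in matrix is used here in place of a real trained OV matrix, so the actual decay will of course differ):

```python
# Sketch of the replotted spectrum: singular values of a rank-64 matrix,
# truncated at 64 and shown on a log scale. A random low-rank stand-in is
# used here; the real plot uses a trained GPT-2 OV matrix.
import torch
import matplotlib.pyplot as plt

d_model, d_head = 768, 64
W_OV = torch.randn(d_model, d_head) @ torch.randn(d_head, d_model)  # rank <= 64

S = torch.linalg.svdvals(W_OV)
plt.plot(S[:d_head].numpy())
plt.yscale("log")                 # exponential decay appears linear on a log scale
plt.xlabel("singular value index")
plt.ylabel("singular value")
plt.show()
```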
Thanks for sharing this! I’m excited to see more interpretability posts. (Though this felt far too high production value—more posts, shorter posts and lower effort per post plz)
Indeed, we will be aiming for more rapid shorter posts in the near future. Stay tuned.
Yes, this is correct. SVD necessarily won't recover the full JL packing. Given that we don't know the extent to which the network uses the full JL capacity, SVD might still capture a reasonable fraction of the relevant directions. Also, if the network packs semantically similar vectors close to one another, then an SVD direction might represent some kind of useful average of them.
Indeed, we are looking at sparse coding to try to construct an overcomplete basis, as a parallel project. Stay tuned for this.
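Schematically, the sparse coding direction looks something like the following (purely a sketch with made-up sizes and random stand-in activations, not a description of results):

```python
# Rough sketch of learning an overcomplete dictionary over residual-stream
# activations via sparse coding. Sizes and data here are stand-ins.
import torch

d_model, n_dict, n_acts = 768, 4096, 10_000
acts = torch.randn(n_acts, d_model)                 # stand-in for cached activations

dictionary = torch.randn(n_dict, d_model, requires_grad=True)
codes = torch.zeros(n_acts, n_dict, requires_grad=True)
opt = torch.optim.Adam([dictionary, codes], lr=1e-3)

l1_coeff = 1e-3                                     # sparsity penalty strength
for step in range(1_000):
    recon = codes @ dictionary                      # reconstruct activations
    loss = ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
# Rows of `dictionary` are candidate (overcomplete) directions to interpret.
```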
Great point. I agree that the singular vectors become unstable when the singular values are very close (and are meaningless within the span when identical). However, I don't think this is the main driver of the effect in the post. The graph of the singular values shown was quite misleading about the gap (this was my bad!). Because the OV matrix is effectively of rank 64, there is a sudden jump down to almost 0 which dominates the log-scale plotting. I was originally using that graph to try to show that effect, but in retrospect it is a kind of obvious one and not super interesting. I've replotted that graph to cut off at 64, and you can see that the singular values are actually reasonably spaced in log-space and roughly follow an exponential decay down to about 0.6. None of them are super close to their neighbours in a way that I think is likely to cause this instability.
Interestingly, the spectra you get from doing this are very consistent across heads, and you also see them, in non-truncated form, in the MLP weight matrices, which show a consistent power-law spectrum.
This seems like an important point but I am not sure I completely follow. How do rays differ from directions here? I agree that the SVD directions won't recover any JL-style dense packing of directions, since SVD is constrained to at most the dimension of the matrix. The thinking here is that if the model tends to pack semantically similar directions into closely related dimensions, then the SVD would pick up on at least an average of this and represent it.
I also think something to keep in mind is that we are doing the SVDs over the OV and MLP weights and not over activations. That is, these are the directions in which the weight matrix is most strongly stretching the activation space. We don't necessarily expect the weight matrix to be doing its own JL packing, I don't think. I also think it is reasonable that the SVD would find sensible directions here. It is of course possible that the network isn't relying on the principal SVD directions for its true 'semantic' processing, but instead performs the stretching/compressing along some intermediate direction composed of multiple SVD directions, and we can't rule that out with this method.
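To make the "SVD over weights, not activations" point concrete, forming the OV matrix for a single GPT-2 head and decomposing it looks roughly like this (the indexing follows my reading of the HuggingFace Conv1D layout and may need transposing; treat it as a sketch rather than the colab's exact code):

```python
# Sketch: SVD of a weight matrix (one head's OV circuit), not of activations.
# Indexing follows my reading of HuggingFace GPT-2's Conv1D layout (x @ W).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
block, head, d_model, d_head = 5, 3, 768, 64

W_qkv = model.transformer.h[block].attn.c_attn.weight        # (d_model, 3*d_model)
W_V = W_qkv[:, 2 * d_model + head * d_head : 2 * d_model + (head + 1) * d_head]
W_O = model.transformer.h[block].attn.c_proj.weight[head * d_head : (head + 1) * d_head]

W_OV = W_V @ W_O                  # (d_model, d_model), rank at most d_head = 64
U, S, Vh = torch.linalg.svd(W_OV)
print(S[:10])                     # leading directions the head stretches most
```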
This seems like a super interesting result! Thanks for linking; I wasn't aware of it. I haven't specifically looked for this pattern in GPT2-medium but I will now! Interestingly, we have also been thinking along similar lines of a 3-phase sequence for processing in residual nets like transformers: the first few layers do some kind of 'large-scale' reshaping of the geometry of the data, the later layers mostly make smaller refinements which don't change the basic geometry of the representation much, and then the final layer does one massive map to output space. This becomes quite obvious if you look at the cosine similarities of the residual stream between blocks (a quick sketch of this check is included below). I hadn't made the link with the attention patterns potentially being more widely distributed at earlier layers though.
I suspect that this ‘early distributed attention’ might be helping out with tasks like building multiply-tokenised words or figuring out syntax in GPT-XL. It would be quite nice if in GPT-medium the same early layers that have MLP SVD directions that seem associated with these kinds of tasks are also those that display more evenly distributed attention.
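The residual-stream cosine-similarity check mentioned above is just this kind of thing (a quick sketch using the standard HuggingFace GPT-2; the prompt is arbitrary):

```python
# Sketch of the residual-stream cosine-similarity check mentioned above.
import torch
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
inputs = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states

# hidden[0] is the embedding output; hidden[i] is the residual stream after block i
for i in range(len(hidden) - 1):
    a, b = hidden[i][0], hidden[i + 1][0]              # (seq_len, d_model)
    sim = torch.nn.functional.cosine_similarity(a, b, dim=-1).mean()
    print(f"block {i} -> {i + 1}: mean cosine similarity {sim:.3f}")
```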
This would be easy to look at, and we might potentially see something in the OV circuits. A general downside of this method is that I have never had any success applying it to the QK circuits, and I think it's because the attention is often performing syntactic rather than semantic operations, so projecting to embedding space is meaningless. I agree with the qualitative assessment that the early attention blocks are probably doing a lot of basic syntax/detokenization tasks like this, although I don't have a good sense of whether the MLPs are also doing this or some other kind of simple semantic processing.
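For reference, the "project to embedding space" step for OV/MLP directions is essentially the following (a sketch; which singular vectors live in the output space depends on layout conventions, so treat the details as my own reading rather than exactly what the colab does):

```python
# Sketch of interpreting a singular direction by projecting it through the
# unembedding into vocab space. Conventions (rows of Vh as output directions
# under the x @ W layout) are my own reading and may need adjusting.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

W_mlp_out = model.transformer.h[5].mlp.c_proj.weight    # (d_mlp, d_model)
W_U = model.lm_head.weight                               # (vocab_size, d_model)

U, S, Vh = torch.linalg.svd(W_mlp_out, full_matrices=False)
direction = Vh[0]                                        # top output direction
top_tokens = torch.topk(W_U @ direction, 10).indices
print(tok.decode(top_tokens))                            # tokens this direction promotes most
```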
(Also, in terms of comparing the fraction of interpretable directions in MLPs per block across the different GPT sizes—I think it is interesting to consider the similarities when the x-axis is “fraction of layers through” instead of raw layer number. One potential (noisy) pattern here is that the models seem to have a rise and dip in the fraction of directions interpretable in MLPs in the first half of the network, followed by a second rise and dip in the latter half of the network.)
I am pretty sure I made plots for this (there are definitely comparable plots in the colab already, but in terms of absolute layer number instead of fraction, so you will have to 'imagine' stretching them out). I agree there is an interesting-seeming noisy pattern here. My feeling is that the early dip is probably noise, and I am not sure about the later one. Certainly, a lot of the time when I have qualitatively looked at the final block, the directions suddenly become weird or meaningless.
Broadly, I agree with this. We are never going to have a full mechanistic understanding of literally every circuit in a TAI model in time for it to be alignment-relevant (we may have fully reverse-engineered some much smaller 'model organisms' by that time though). Nor are individual humans ever going to understand all the details of exactly how such models function (even small models).
However, the arguments for mechanistic interpretability in my view are as follows:
1.) Model capacities probably follow some kind of Pareto principle -- 20% of the circuits do 80% of the work. If we can figure out these circuits in a TAI model, then we stand a good chance of catching many alignment-relevant behaviours such as deception, which necessarily require large-scale coordination across the network.
2.) Understanding lots of individual circuits and networks provides a crucial source of empirical bits about network behaviour and alignment at a mechanistic level, which we can't get just by theorycrafting about alignment all day. To have a reasonable shot at actually solving alignment we need direct contact with reality, and interpretability is one of the main ways to get such contact.
3.) If we can figure out general methods for gaining mechanistic understanding of NN circuits, then we can design automated tools for performing interpretability which substantially reduces the burden on humans. For instance, we might be able to make tools that can rapidly identify the computational substrate of behaviour X, or all parts of the network which might be deceptive, or things like this. This then massively narrows down the search space that humans have to look at to check for safety.
So the quantitative experiment you propose is a good idea—and we will be working along these lines, extending the very preliminary experiments in the post about how big of an effect edits like this will have.
In terms of the polytopes, you are right that this doesn't really fit with that framework but instead assumes a purely linear-directions framework. We aren't really wedded to any specific viewpoint, and are trying a lot of different perspectives to figure out the correct ontology for understanding neural network internals.
I want to add to the general agreement here: I broadly agree with this strategy and think that in practice this kind of thing (multiple alignment solutions strapped onto a system in parallel) will be what alignment success looks like, and I have written about this before. This agreement is primarily with the swiss-cheese/assemblage strategy.
It is important to note that I think hodgepodge alignment strategies only really apply to the near-term proto-AGI systems that we will build, and won't work against a system that has strongly recursively self-improved into a true superintelligence (insofar as this is possible). However, a key thing we need to figure out is how to prevent this from happening to our near-term AGI systems in an uncontrolled way, and I am hopeful that layering multiple methods can help substantially here.
I am also somewhat sceptical of using type-theoretic/category-theoretic approaches to assemblages as a means of proving safety guarantees (vs as a thinking tool, which I strongly think can be helpful). This is basically because it is unclear to me whether types are sufficiently strong objects to let us reason about properties like safety (i.e. even in strongly typed languages the type-checker can tell you at compile time that a function will output an int, but cannot guarantee that the int will be < 9000, and I think something like the latter is the kind of guarantee we need for alignment).
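The same point in a small Python-flavoured sketch (checked statically with something like mypy rather than at compile time, but the limitation is the same; purely illustrative):

```python
# Illustrative only: a static type checker (e.g. mypy) happily verifies that
# this returns an int, but says nothing about its value ever being < 9000.
def scaled_power(x: int) -> int:
    return x * 10_000        # type-checks fine, even though the result easily exceeds 9000
```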
I feel like this is a good point in general, but I think there is an important but subtle distinction between the two examples. In the GAN case, the distinction is between the inner optimization loop of the ML algorithm and the outer loop of humans performing an evolutionary search process to get papers / make pretty pictures.
The wireheading case feels different, in that you have essentially two separate value functions—a cortical, LM-based one which can extrapolate values in linguistic/concept space, and a classic RL basal-ganglia value function which is based on your personal experience. The difference here is mostly in training data—the cortex is trained on a large sensory corpus including linguistic text describing wireheading, while the subcortical value function is largely trained on personally rewarding experiences. It would be odd to have them necessarily always be consistent, and that would lead to strange failure modes exactly like wireheading, or generally being able to be viscerally convinced of anything you read that sounds convincing.
I think there are two fundamental problems with the extensive simboxing approach. The first is just that, given the likely competitive dynamics around near-term AGI (i.e. within the decade), these simboxes are going to be extremely expensive in both compute and time, which means that anybody unilaterally simboxing will probably just result in someone else releasing an unaligned AGI with less testing.
If we think about the practicality of these simboxes, it seems that they would require (at minimum) the simulation of many hundreds or thousands of agents over relatively long real timelines. Moreover, due to the GPU constraints and Moore's law arguments you bring up, we can only simulate each agent at close to 'real time'. So years in the simbox must correspond to years in our reality, which is way too slow for an imminent singularity. This is especially an issue given that we must maintain no transfer of information (such as datasets) from our reality into the sim. This means at minimum years of sim-time to bootstrap intelligent agents (taking human data-efficiency as a baseline). Also, each of these early AGIs will likely be incredibly expensive in compute, so maintaining reasonable populations of them in simulation will be very expensive and probably infeasible initially. If we could get policy coordination on making sure all actors likely to develop AGI go through a thorough simboxing testing regimen, then that would be fantastic and would solve this problem.
Perhaps a more fundamental issue is that simboxing does not address the fundamental cause of p(doom), which is recursive self-improvement of intelligence and the resulting rapid capability gains. The simbox can probably simulate capability gains reasonably well (i.e. gaining 'magical powers' in a fantasy world), but I struggle to see how it could properly test gains in intelligence from self-improvement. Suppose the AI in the fantasy simbox brews a 'potion' that makes it 2x as smart. How do we simulate this? We could just increase the agent's compute in line with the scaling laws, but a.) early AGIs are almost certainly near the frontier of our compute capability anyway, and b.) much of recursive self-improvement is presumably down to algorithmic improvements which we almost necessarily cannot simulate (since if we knew better algorithms we would have included them in our AGIs in the simulation in the first place!).
This is so vital because the probable breakdown of proxies for human values under the massive distributional shift induced by recursive self-improvement is the fundamental difficulty of alignment in the first place.
Perhaps this is unique to my model of AI risk, but almost all the probability of doom channels through p(FOOM), such that p(doom | no FOOM) is quite low in comparison. This is because if we don't have FOOM then there are not extremely large amounts of optimization power unleashed, and the reward proxies for human values and flourishing don't end up radically off-distribution and so probably don't break down. There are definitely a lot of challenges left in this regime, but to me it looks solvable, and I agree with you that in worlds without rapid FOOM, success will almost certainly look like considerable iteration on alignment, with a bunch of agents undergoing some kind of automated simulated alignment testing in a wide range of scenarios, plus using the generalisation capabilities of machine learning to learn reward proxies that actually generalise reasonably well within the distribution of capabilities actually obtained. The main risk, in my view, comes from the FOOM scenario.
Finally, I just wanted to say that I’m a big fan of your work and some of your posts have caused major updates to my alignment worldview—keep up the fantastic work!