Thanks for writing this! Here are some of my rough thoughts and comments.
One of my big disagreements with this threat model is that it assumes it is hard to get an AGI to understand / successfully model ‘human values’. I think this is obviously false. LLMs already have a very good understanding of ‘human values’ as they are expressed linguistically, and existing alignment techniques like RLHF/RLAIF seem to do a reasonably good job of making the models’ output align with these values (specifically generic corporate wokeness for OpenAI/Anthropic) which does appear to generalise reasonably well to examples which are highly unlikely to have been seen in training (although it errs on the side of overzealousness of late in my experience). This isn’t that surprising because such values do not have to be specified by the fine-tuning from scratch but should already be extremely well represented as concepts in the base model latent space and merely have to be given primacy. Things would be different, of course, if we wanted to align the LLMs to some truly arbitrary blue and orange morality not represented in the human text corpus, but naturally we don’t.
Of course such values cannot easily be represented as some mathematical utility function, but I think this is an extremely hard problem in general, verging on impossible—since this is not the natural type of human values in the first place, which are naturally mostly linguistic constructs existing in the latent space and not in reality. This is not just a problem with human values but with almost any kind of abstract goal you might want to give the AGI—including things like ‘maximise paperclips’. This is why AGI will almost certainly not be a direct utility maximiser but will instead use a learnt utility function over latents from its own generative model; in that case it can represent human values, and indeed any goal expressible in natural language, which of course it will understand.
On a related note this is also why I am not at all convinced by the supposed issues over indexicality. Having the requisite theory of mind to understand that different agents have different indexical needs should be table stakes to any serious AGI and indeed hardly any humans have issues with this, except for people trying to formalise it into math.
There is still a danger of over-optimisation, which is essentially a kind of overfitting and can be dealt with in a number of ways which are pretty standard now. In general terms, you would want the AI to represent its uncertainty over outcomes and utility approximator and use this to derive a conservative rather than pure maximising policy which can be adjusted over time.
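As a minimal sketch of the kind of conservative policy I have in mind (assuming an ensemble of learnt reward models; all names here are hypothetical):

```python
import torch

def conservative_scores(reward_ensemble, candidate_actions, pessimism=1.0):
    """Score candidate actions by a lower confidence bound over an ensemble
    of learnt reward models, rather than by the pure maximising mean.

    reward_ensemble: list of callables mapping a (num_candidates, action_dim)
    tensor to a (num_candidates,) tensor of estimated rewards (hypothetical).
    """
    # (num_models, num_candidates) matrix of reward estimates
    rewards = torch.stack([model(candidate_actions) for model in reward_ensemble])
    mean, std = rewards.mean(dim=0), rewards.std(dim=0)
    # Ensemble disagreement is a crude proxy for uncertainty in the utility
    # approximator; penalising it yields a conservative rather than pure
    # maximising policy, and the pessimism weight can be adjusted over time.
    return mean - pessimism * std

# Usage: best = candidates[conservative_scores(ensemble, candidates).argmax()]
```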
I broadly agree with you about agency and consequentialism being broadly useful and ultimately we won’t just be creating short term myopic tool agents but fully long term consequentialists. I think the key thing here is just to understand that long term consequentialism has fundamental computational costs over short term consequentialism and much more challenging credit assignment dynamics so that it will only be used where it actually needs to be. Most systems will not be long term consequentialist because it is unnecessary for them.
I also think that breeding animals to do tasks, or looking at humans subverting social institutions, is not necessarily a good analogy to AI agents performing deception and treacherous turns. Evolution endowed humans and other animals with intrinsic selfish drives for survival and reproduction, and arguably for social deception, which do not have to exist in AGIs. Moreover, we have substantially more control over AI cognition than evolution does over ours, and gradient descent is a fundamentally more powerful optimiser, which makes it harder for deceptive agents to arise. There is basically no evidence for deception occurring with current myopic AI systems, and if it starts to occur with long term consequentialist agents it will be due either to a breakdown of credit assignment over long horizons (potentially due to being forced to use worse optimisers such as REINFORCE variants rather than pure BPTT) or to the functional prior of such networks turning malign. Of course if we directly design AI agents via survival in some evolutionary sim, or explicitly program in Omohundro drives, then we will run directly into these problems again.
I think there are two fundamental problems with the extensive simboxing approach. The first is just that, given the likely competitive dynamics around near-term AGI (i.e. within the decade), these simboxes are going to be extremely expensive both in compute and time which means that anybody unilaterally simboxing will probably just result in someone else releasing an unaligned AGI with less testing.
If we think about the practicality of these simboxes, it seems that they would require (at minimum) the simulation of many hundreds or thousands of agents over relatively long real timelines. Moreover, due to the GPU constraints and Moore’s law arguments you bring up, we can only simulate each agent at close to ‘real time’. So years in the simbox must correspond to years in our reality, which is way too slow for an imminent singularity. This is especially an issue given that we must maintain no transfer of information (such as datasets) from our reality into the sim. This means at minimum years of sim-time to bootstrap intelligent agents (taking human data-efficiency as a baseline). Also, each of these early AGIs will likely be incredibly expensive in compute, so that maintaining reasonable populations of them in simulation will be very expensive and probably infeasible initially. If we could get policy coordination on making sure all actors likely to develop AGI go through a thorough simboxing testing regimen, then that would be fantastic and would solve this problem.
Perhaps a more fundamental issue is that simboxing does not address the fundamental cause of p(doom) which is recursive self improvement of intelligence and the resulting rapid capability gains. The simbox can probably simulate capability gains reasonably well (i.e. gain ‘magical powers’ in a fantasy world) but I struggle to see how it could properly test gains in intelligence from self-improvement. Suppose the AI in the fantasy simbox brews a ‘potion’ that makes it 2x as smart. How do we simulate this? We could just increase the agent’s compute in line with the scaling laws but a.) early AGIs are almost certainly near the frontier of our compute capability anyway and b.) much of recursive self improvement is presumably down to algorithmic improvements which we almost necessarily cannot simulate (since if we knew better algorithms we would have included them in our AGIs in the simulation in the first place!)
This is so vital because the probable breakdown of proxies to human values under the massive distributional shift induced by recursive self improvement is the fundamental difficulty to alignment in the first place.
Perhaps this is unique to my model of AI risk, but almost all the probability of doom channels through p(FOOM), such that p(doom | no FOOM) is quite low in comparison. This is because if we don’t have FOOM then there are not extremely large amounts of optimization power unleashed, so the reward proxies for human values and flourishing don’t end up radically off-distribution and probably don’t break down. There are definitely a lot of challenges left in this regime, but to me it looks solvable, and I agree with you that in worlds without rapid FOOM, success will almost certainly look like considerable iteration on alignment, with a bunch of agents undergoing some kind of automated simulated alignment testing in a wide range of scenarios, plus using the generalisation capabilities of machine learning to learn reward proxies that actually generalise reasonably well within the distribution of capabilities actually obtained. The main risk, in my view, comes from the FOOM scenario.
Finally, I just wanted to say that I’m a big fan of your work and some of your posts have caused major updates to my alignment worldview—keep up the fantastic work!
This monograph by Bertsekas on the interrelationship between offline RL and online MCTS/search might be interesting—http://www.athenasc.com/Frontmatter_LESSONS.pdf—since it argues that we can conceptualise the contribution of MCTS as essentially that of a single Newton step from the offline start point towards the solution of the Bellman equation. If this is actually the case (I haven’t worked through all details yet) then this seems to be able to be used to provide some kind of bound on the improvement / divergence you can get once you add online planning to a model-free policy.
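As I understand the claim (my paraphrase, so the exact form may be off): the Bellman equation is a fixed-point condition $J^* = TJ^*$ for the Bellman operator

$$(TJ)(s) = \max_a \left[ r(s,a) + \gamma \, \mathbb{E}_{s'} J(s') \right],$$

so solving it is root-finding on $F(J) = TJ - J = 0$, and a Newton step from an offline-learnt estimate $\hat{J}$ is

$$J_{\text{new}} = \hat{J} - \left[ \frac{\partial F}{\partial J} \Big|_{\hat{J}} \right]^{-1} \left( T\hat{J} - \hat{J} \right).$$

If one-step lookahead/MCTS on top of $\hat{J}$ really implements this step, that would explain why a small amount of online search gives such a disproportionately large one-shot improvement over the offline policy, and why further search gives diminishing returns.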
Interesting thoughts! By the way, are you familiar with Hugo Touchette’s work on this? It looks very related and I think has a lot of cool insights about these sorts of questions.
AI x-risk is not far off at all, it’s something like 4 years away IMO
Can I ask where this four years number is coming from? It was also stated prominently in the new ‘superalignment’ announcement (https://openai.com/blog/introducing-superalignment). Is this some agreed-upon median timeline at OAI? Is there an explicit plan to build AGI in four years? Is there strong evidence behind this view—i.e. do you think you know how to build AGI explicitly and that it will just take four more years of compute/scaling?
Thanks for writing this! It’s always good to get critical feedback about a potential alignment direction to make sure we aren’t doing anything obviously stupid. I agree with you that fine-grained prediction of what an AGI is going to do in any situation is likely computationally irreducible, even with ideal interpretability tools.
I think there are three main arguments for interpretability which might well be cruxes.
1.) As Erik says, interpretability tools potentially let us make coarse-grained predictions about the model utilizing fine-grained information. While predicting everything the model will do in advance is probably not feasible, it might be very possible to get pretty detailed predictions of coarse-grained properties such as ‘is this model deceptive’, ‘does it have potentially misaligned mesaoptimizers’, ‘does its value function look reasonably aligned with what we want given the model’s ontology’, ‘is it undergoing / trying to undergo FOOM’? The model’s architecture might also be highly modular, so that we could potentially understand/bound a lot of the alignment-relevant behaviour of the model while only understanding a small part. This seems especially likely to me if the AGI’s architecture is hand-designed by humans – i.e. there is a ‘world model’ part and a ‘planner’ part and a ‘value function’ and so forth. We can then potentially get a lot of mileage out of just interpreting the planner and value function, while the exact details of how the model represents, say, chairs in the world model are less important for alignment. (A toy sketch of this point follows after the three arguments below.)
2.) What we ultimately want is likely a statistical-mechanics-like theory of how neural nets learn representations, which includes what circuits/specific computations they tend to perform, how they evolve during training, what behaviours these give rise to, how they behave off distribution, etc. Having such a theory would be super important for alignment (although it would not solve it directly). Interpretability work provides key bits of evidence that can be generalized to build this theory.
3.) Interpretability tools could let us perform highly targeted interventions on the system without needing to understand the full system. This could potentially involve directly editing out mesaoptimizers or deceptive behaviours, or adjusting goal misgeneralization by tweaking the internal ontology of the model. There are a lot of cases in science where we can produce reliable and useful interventions in systems with only partial information and understanding of their most fine-grained workings. Nevertheless, we need some understanding, and interpretability appears to give a reliable path to getting it.
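As a toy illustration of the first point (coarse-grained prediction from fine-grained information), something like a linear probe on internal activations; everything here is hypothetical, and the shapes/data are stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: activations from some intermediate layer of the model,
# collected on scenarios labelled for a coarse property (e.g. 'deceptive
# behaviour occurred'). Random data stands in for real recordings.
train_acts = np.random.randn(1000, 4096)       # recorded activations
train_labels = np.random.randint(0, 2, 1000)   # behavioural labels

# A simple linear probe: if the property is linearly decodable from the
# residual stream, this gives a cheap coarse-grained predictor without
# needing a full mechanistic account of the underlying circuit.
probe = LogisticRegression(max_iter=1000).fit(train_acts, train_labels)

new_acts = np.random.randn(10, 4096)
print(probe.predict_proba(new_acts)[:, 1])  # P(property) per new scenario
```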
Should we expect interpreting AGI-scale neural nets to succeed where interpreting biological brains failed?
There are quite a lot of reasons why we should expect interpretability to be much easier than neuroscience:
We know exactly the underlying computational graph of our models – this would be akin to in neuroscience starting out knowing exactly how neurons, synapses etc work as well as knowing the full connectome and the large scale architecture of the brain.
We know the exact learning algorithm our models use – in neuroscience this would be starting out knowing, say, the cortical update rule as well as the brain’s training objective/loss function.
We know the exact training data our models are trained on
We can experiment on copies of the same model as opposed to different animals with different brains / training data / life histories etc
We can instantly read all activations, weights, essentially any quantity of interest simultaneously and with perfect accuracy – simply being able to read neuron firing rates is very difficult in neuroscience and we have basically no ability to read large numbers of synaptic weights
We can perform arbitrary interventions at arbitrarily high fidelity on our NNs
These points mean experimental results are orders of magnitude faster and easier to get. A typical interpretability experiment looks like: load model into memory, perform precise intervention on model, look at a huge number of possible outputs, iterate. Neuroscience experiments often look like: train mice to do some task for months; insert probes or perform some broad-based intervention where you are not sure exactly what you are measuring or what your intervention actually affected; get a bunch of noisy data from a small sample, with potential systematic errors/artifacts from your measurement process, where you can only read a tiny fraction of what you would like to read; then try to understand what is going on. It is much harder and slower!
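For concreteness, here is roughly what that interpretability loop looks like in code (a sketch using a small open model as a stand-in; the specific model and layers are just for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model into memory (a small stand-in model for illustration).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Read every residual-stream activation simultaneously and exactly --
# the thing neuroscience cannot do.
activations = {}
def save_hook(name):
    def hook(module, inputs, output):
        activations[name] = output[0].detach()  # hidden states of the block
    return hook

for i, block in enumerate(model.transformer.h):
    block.register_forward_hook(save_hook(f"block_{i}"))

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
print({name: act.shape for name, act in activations.items()})
# The same hook mechanism supports arbitrary interventions (patching,
# ablation) at arbitrary fidelity, then immediate re-measurement.
```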
Secondly, the Blue Brain Project is just one example of a high-profile failure in neuroscience and we shouldn’t generalize too much based on it. I have had no experience with the Blue Brain Project, but it seems plausible to me that this could just be a standard case of garden-variety mismanagement and overoptimistic goals and hype. My perspective, as someone who has worked in neuroscience, is that the field just keeps chugging along accumulating knowledge and is advancing at a reasonable pace – i.e. our understanding of the brain has improved vastly since 20 years ago. Not at the pace probably required for alignment on short timelines, but at a decent level for a scientific field.
Given this, it is not clear to me that interpretability is doomed to failure and it seems a reasonably high EV bet to me.
In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.
If by ‘empiricism’ as a channel you mean ‘just run the model and see what it does’, then while some amount of empiricism is necessary, it is also a highly dangerous channel to observe, and so we should be trying to get as many bits as possible without requiring empiricism. Pre-deployment interpretability is one way to potentially get a lot of bits of information (although I agree not all the bits) without having to expose yourself to this danger. Hopefully interpretability, even using only fairly coarse-grained information, can already eliminate a lot of ‘obviously dangerous AGIs’ before you have to run them.
Some empiricism will likely be necessary though, and I agree that designing highly secure test-harnesses for the AGI akin to BSL-4 labs will be another necessary component of safety. However, high-powered interpretability will be a vital safety measure to be applied inside such a safe test harness. Ideally, while the AGI is running in some simulation, with interpretability tools we can be constantly scanning it for deception, for realization that it is in a simulation, or for any other kind of malicious intent, and then abort or intervene to fix it if this is the case. I think just relying on airgapping and other physical measures is a bad idea vs a superintelligence, and in practice we should be trying to build up as many uncorrelated lines of defense as possible in the test-harness.

And given that information channels are dual-use (in that the AGI can also use them for sandbox escape), we should prioritize efficient information channels like empiricism, rather than inefficient ones like fine-grained interpretability. Inefficient information channels can be net-negative, because they may be more useful for the AGI’s sandbox escape compared to their usefulness to alignment researchers.
This is an interesting objection. While it is possible that the AGI could hack interpretability channels, it seems much more difficult than just communicating in the normal way by running in the world – i.e. the ‘empirical channel’. This is because it probably doesn’t know (and we should be careful to make sure that it definitely does not know) what interpretability tools we are using on it, and because interpretability tools can be applied before we actually run the AGI (i.e. directly to weights) or to individual components in isolation.
Yes. The idea is that the latent space of the neural network’s ‘features’ is ‘almost linear’, which is reflected in the linear-ish properties of both the weights and activations. Not that the literal I/O mapping of the NN is linear, which is clearly false.
More concretely, as an oversimplified version of what I am saying, it might be possible to think of neural networks as a combined encoder and decoder to a linear vector space. I.e. we have nonlinear functions f and g, where f encodes the input x to a latent space z, and g decodes z to the output y – i.e. f(x) = z and g(z) = y. We then hypothesise that the latent space z is approximately linear, such that we can perform addition and weighted sums of zs, as well as scaling individual directions in z, and these get decoded to the appropriate outputs, corresponding to sums or scalings of ‘natural’ semantic features we should expect in the input or output.
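As a toy version in code (the encoder/decoder here are untrained stand-ins; the point is only the type signature of the hypothesis):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the nonlinear encoder f and decoder g; in the actual
# hypothesis these would be (parts of) a trained network.
f = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 8))  # f(x) = z
g = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 16))  # g(z) = y

x1, x2 = torch.randn(16), torch.randn(16)
z1, z2 = f(x1), f(x2)

# The hypothesis is about z behaving linearly, even though f and g
# are themselves nonlinear.
y_mixed = g(0.5 * z1 + 0.5 * z2)       # weighted sums of latents decode to
                                        # 'mixtures' of semantic features
direction = torch.randn(8)              # some feature direction in z
y_amplified = g(z1 + 3.0 * direction)   # scaling one direction should
                                        # amplify the corresponding feature
```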
Thanks for these links! This is exactly what I was looking for, as per Cunningham’s law. For the mechanistic mode connectivity, I still need to read the paper, but there is definitely a more complex story relating to the symmetries rendering things non-connected by default; once you account for symmetries and project things into an isometric space where all the symmetries are collapsed, things become connected and linear again. Is this different to that?
I agree about the NTK. I think this explanation is bad in its specifics although I think the NTK does give useful explanations at a very coarse level of granularity. In general, to put a completely uncalibrated number on it, I feel like NNs are probably ’90% linear’ in their feature representations. Of course they have to have somewhat nonlinear representations as well. But otoh if we could get 90% of the way to features that would be massive progress and might be relatively easy.
I broadly agree with a lot of shard theory claims. However, the important thing to realise is that ‘human values’ do not really come from inner misalignment wrt our innate reward circuitry, but rather are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups; and these value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values. Most human value learning occurs through this linguistic learning, grounded by our innate drives but extended to much higher abstractions by language. I.e. for humans, we learn our values as some combination of bottom-up (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down association of abstract value concepts with other more grounded linguistic concepts.
With AGI, the key will be to work primarily top-down, since our linguistic constructs of values tend to reflect our ideal values much better than our actually realised behaviours do. The idea is to use the AGI’s ‘linguistic cortex’, which already has encoded verbal knowledge about human morality and values, to evaluate potential courses of action and serve as a reward signal, which can then get crystallised into learnt policies. The key difficulty is understanding how, in humans, the base reward functions interact with behaviour to make us ‘truly want’ specific outcomes (if humans even do), as opposed to reward or its correlated social assessments. It is possible, even likely, that this is just the default outcome of model-free RL experienced from the inside, and in this case our AGIs would look highly anthropomorphic.
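A sketch of the general shape of this scheme (everything here, the `llm` object, its `generate` method, the prompt, and the parsing, is a hypothetical placeholder):

```python
# Sketch: use the model's own verbal knowledge of human values to score
# candidate plans, and use that score as a reward signal for policy
# learning. All names are hypothetical stand-ins.

def linguistic_value_score(llm, plan_description: str) -> float:
    prompt = (
        "On a scale of 0 to 10, how well does the following plan accord "
        "with broadly shared human values? Answer with a number only.\n\n"
        f"Plan: {plan_description}\nScore:"
    )
    # llm.generate is a placeholder for whatever completion API is in use.
    return float(llm.generate(prompt).strip()) / 10.0

# During training, this learnt evaluator replaces a hand-written reward:
#   reward = linguistic_value_score(llm, describe(action_sequence))
# and the policy gradually crystallises behaviour that the model's own
# linguistic representation of values scores highly.
```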
Also, in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization—i.e. effectively planning over a world model—is necessary in situations where a.) you can’t behaviourally clone existing behaviour and b.) you can’t self-play too much with model-free RL algorithms, and so must rely on the world model. In such a scenario you do not have ground truth reward signals, and the only way to make progress is to optimise against some implicit learnt reward function.
I am also not sure that an agent that explicitly optimises like this is hard to align, nor that the major threat is Goodharting. We can perfectly align Go-playing AIs with this scheme because we have an exact ground-truth reward function. Goodharting is essentially isomorphic to a case of overfitting and can in theory be solved with various kinds of regularisation; in particular, if the AI maintains a well-calibrated sense of reward function uncertainty, then in theory we can derive quantitative bounds on its divergence from the true reward function.
I like this post very much, and in general I think research like this is on the correct lines towards solving potential problems with Goodhart’s law—in general, Bayesian reasoning and getting some representation of the agent’s uncertainty (including uncertainty over our values!) seems very important and naturally ameliorates a lot of potential problems. The correctness and realizability of the prior are very general problems with Bayesianism, but often do not thwart its usefulness in practice, although they allow people to come up with various convoluted counterexamples of failure. The key is to have sufficiently conservative priors such that you can (ideally) prove bounds on the maximum degree of Goodharting that can occur under realistic circumstances, and then translate these into algorithms which are computationally efficient enough to be usable in practice. People have already done a fair bit of work on this in RL in terms of ‘cautious’ RL, which tries to take into account uncertainty in the world model to avoid accidentally falling into traps in the environment.
I always say that the whole brain (including not only the basal ganglia but also the thalamocortical system, medulla, etc.) operates as a model-based RL system. You’re saying that the BG by itself operates as a model-free RL system. So I don’t think we’re disagreeing, because “the cortex is the model”?? (Well, we definitely have some disagreements about the BG, but we don’t have to get into them, I don’t think they’re very important for present purposes.)
I think there is some disagreement here, at least in the way I am using model-based / model-free RL (not sure exactly how you are using it). Model-based RL, at least to me, is not just about explicitly having some kind of model, which I think we both agree exists in cortex, but rather the actual action selection system using that model to do some kind of explicit rollouts for planning. I do not think the basal ganglia does this, while I think the PFC has some meta-learned ability to do this. In this sense, the BG is ‘model-free’ while the cortex is ‘model-based’.
I don’t really find “meta-RL” as a great way to think about dlPFC (or whatever the exact region-in-question is). See Rohin’s critique of that DeepMind paper here. I might instead say that “dlPFC can learn good ideas / habits that are defined at a higher level of abstraction” or something like that. For example, if I learn through experience (or hearsay) that it’s a good idea to use Anki flashcards, you can call that Meta-RL (“I am learning how to learn”). But you can equally well describe it as “I am learning to take good actions that will eventually lead to good consequences”. Likewise, I’d say “learning through experience that I should suck up to vain powerful people” is probably in the same category as “learning through experience that I should use Anki flashcards”—I suspect they’re learned in the same way by the same part of PFC—but “learning to suck up” really isn’t the kind of thing that one would call “meta-RL”, I think. There’s no “meta”—it’s just a good (abstract) type of action that I have learned by RL.
This is an interesting point. At some level of abstraction, I don’t think there is a huge amount of difference between meta-RL and ‘learning highly abstract actions/habits’. What I am mostly pointing towards is that the PFC learns high-level actions, including how to optimise and perform RL effectively over long horizons, and high-level cognitive habits like how to do planning, which are not intrinsic abilities but rather have to be learned. My understanding of what exactly the dlPFC does and how exactly it works is the place where I am most uncertain at present.
I agree in the sense of “it’s hard to look at the brainstem and figure out what a developed-world adult is trying to do at any given moment, or more generally in life”. I kinda disagree in the sense of “a person who is not hungry or cold will still be motivated by social status and so on”. I don’t think it’s right to put “eating when hungry” in the category of “primary reward” but say that “impressing one’s friends” is in a different, lesser category (if that’s what you’re saying). I think they’re both in the same category.
I agree that even when not immediately hungry or cold etc we still get primary rewards from increasing social status etc. I don’t completely agree with Robin Hanson that almost all human behaviour can be explained by this drive directly though. I think we act on more complex linguistic values, or at least our behaviour to fulfil these primary rewards of social status is mediated through these.
I don’t particularly buy the importance of words-in-particular here. For example, some words have two or more definitions, but we have no trouble at all valuing one of those definitions but not the other. And some people sometimes have difficulty articulating their values. From what I understand, internal monologue plays a bigger or smaller role in the mental life of different people. So anyway, I don’t see any particular reason to privilege words per se over non-linguistic concepts, at least if the goal is a descriptive theory of humans. If we’re talking about aligning LLMs, I’m open to the idea that linguistic concepts are sufficient to point at the right things.
So for words literally, I agree with this. By ‘linguistic’ I am more pointing at abstract high-level cortical representations. I think that for the most part these line up pretty well with and are shaped by our linguistic representations and that the ability of language to compress and communicate complex latent states is one of the big reasons for humanity’s success.
I think I would have made the weaker statement “There is no particular reason to expect this project to be possible at all.” I don’t see a positive case that the project will definitely fail. Maybe the philosophers will get very lucky, or whatever. I’m just nitpicking here, feel free to ignore.
This is fair. I personally have very low odds on success but it is not a logical impossibility.
I think (?) you’re imagining a different AGI development model than me, one based on LLMs, in which more layers + RLHF scales to AGI. Whereas I’m assuming (or at least, “taking actions conditional on the assumption”) that LLM+RLHF will plateau at some point before x-risk, and then future AI researchers will pivot to architectures more obviously & deeply centered around RL, e.g. AIs for which TD learning is happening not only throughout training but also online during deployment (as it is in humans).
I am not sure we actually imagine that different AGI designs. Specifically, my near-term AGI model is essentially a multi-modal DL-trained world model, likely with an LLM as a centrepiece but also potentially vision and other modalities included, and then trained with RL either end to end or as some kind of wrapper on a very large range of tasks. I think, given that we already have extremely powerful LLMs in existence, almost any future AGI design will use them at least as part of the general world model. In this case, then there will be a very general and highly accessible linguistic latent space which will serve as the basis of policy and reward model inputs.
Yes they have. There’s quite a large literature on animal emotion and cognition, and my general synthesis is that animals (at least mammals) have at least the same basic emotions as humans, and often quite subtle ones such as empathy and a sense of fairness. It seems pretty likely to me that whatever the set of base reward functions encoded in the mammalian basal ganglia and hypothalamus is, it can quite robustly generate expressed behavioural ‘values’ that fall within some broadly humanly recognisable set.
Meant to comment on this a while back but forgot. I have thought about this also and broadly agree that early AGI with ‘thoughts’ at GHz levels is highly unlikely. Originally this was because pre-ML EY and the community broadly associated thoughts with CPU ops but in practice thoughts are more like forward passes through the model.
As Connor Sullivan says, the reason brains can have low clock rates is that our intelligence algorithms are embarrassingly parallel, as is current ML. Funnily enough, for large models (and definitely if we were to run forward passes through NNs as large as the brain), inference latency is already within an OOM or so of the brain’s (100ms). Due to parallelisation, you can distribute your forward pass across many GPUs to potentially decrease latency, but eventually you will get throttled by the networking overhead.
The brain, interestingly, achieves its relatively low latency by being highly parallel and shallow. The brain is not that many ‘layers’ deep. Even though each neuron is slow, the brain can perform core object recognition in <300ms, at about 10 synaptic transmissions from retina → IT. This is compared to current resnets which are >>10 layers deep. It does this through some combination of better architecture, better inference algorithm, and adaptive compute which trades space for time, i.e. you don’t have to do all your thinking in a forward pass, but instead have recurrent connections so you can keep pondering and improving your estimates through multiple ‘passes’.
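Rough arithmetic behind these latency claims (my own illustrative numbers, not measurements):

```python
# Back-of-envelope latency comparison (illustrative numbers only).
synaptic_delay_s = 0.01          # ~10 ms per synaptic transmission
brain_depth = 10                 # ~10 stages retina -> IT
print("brain core recognition:", brain_depth * synaptic_delay_s, "s")  # 0.1 s

gpu_layer_time_s = 1e-4          # assumed per-layer time for a large model
net_depth = 100                  # a deep modern network
print("deep net forward pass:", net_depth * gpu_layer_time_s, "s")     # 0.01 s
# The brain compensates for slow elements with shallow depth plus massive
# parallelism and recurrence (more 'passes' when the answer needs refining).
```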
Neuromorphic hardware can ameliorate some of these issues but not others. Potentially, it allows for much more efficient parallel processing and lets you replace a multi-GPU cluster with a really big neuromorphic chip. Theoretically this could enable forward passes to occur at GHz speed but probably not within the next decade (technically if you use pure analog or optical chips you can get even faster forward passes!). Downsides are unknown hardware difficulty for more exotic designs and general data movement costs on chip. Also energy intensity will be huge at these speeds. Another bottleneck you end up with in practice is simply speed of encoding/decoding data at the analog-digital interface.
Even based on GPU clusters, early AGI can probably improve inference speed by a few OOMs, to 100-1000s of forward passes per second, just from low-hanging hardware/software improvements. Additional benefits AGI could have are:
1.) Batching. GPUs are great at handling batches rapidly. The AGI can ‘think’ about 1000 things in parallel, while the brain has to operate on batch size 1 (see the sketch after this list). Interestingly, this is also a potential limitation of a lot of neuromorphic hardware as well.
2.) Direct internal access to serial compute. Imagine you had a python repl in your brain you could query and instantly get responses. Same with instant internal database lookup.
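A schematic of the batching point in 1.) above (toy model, illustrative shapes):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# The brain runs at batch size 1; a GPU-hosted mind can run many
# independent 'trains of thought' through the same weights at once.
thoughts = torch.randn(1000, 512)      # 1000 parallel contexts
with torch.no_grad():
    results = model(thoughts)          # one pass, 1000 'thoughts'
print(results.shape)                   # torch.Size([1000, 512])
```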
Broadly, I agree with this. We are never going to have a full mechanistic understanding of literally every circuit in a TAI model in time for it to be alignment relevant (we may have fully reversed engineered some much smaller ‘model organisms’ by this time though). Nor are individual humans ever going to understand all the details of exactly how such models function (even small models).
However, the arguments for mechanistic interpretability in my view are as follows:
1.) Model capabilities probably follow some kind of Pareto principle -- 20% of the circuits do 80% of the work. If we can figure out these circuits in a TAI model, then we stand a good chance of catching many alignment-relevant behaviours such as deception, which necessarily require large-scale coordination across the network.
2.) Understanding lots of individual circuits and networks provides a crucial source of empirical bits about network behaviour and alignment at a mechanistic level, which we can’t get just by theorycrafting about alignment all day. To have a reasonable shot at actually solving alignment we need direct contact with reality, and interpretability is one of the main ways to get such contact.
3.) If we can figure out general methods for gaining mechanistic understanding of NN circuits, then we can design automated tools for performing interpretability which substantially reduces the burden on humans. For instance, we might be able to make tools that can rapidly identify the computational substrate of behaviour X, or all parts of the network which might be deceptive, or things like this. This then massively narrows down the search space that humans have to look at to check for safety.
Strongly upvoted this post. I agree very strongly with every point here. The biggest consideration for me is that alignment seems like the kind of problem which is primarily bottlenecked on serial conceptual insights rather than parallel compute. If we already had alignment methods that we know would work if we just scaled them up, the same way we have with capabilities, then racing to endgame might make sense given the opportunity costs of delaying aligned AGI. Given that a.) we don’t have such techniques and b.) even if we did it would be hard to be so certain that they are actually correct, racing to endgame appears very unwise.
There is a minor tension with capabilities in that I think that for alignment to progress it does need some level of empirical capabilities results both in revealing information about likely AGI design and threat models and also so we can actually test alignment techniques. I think that e.g. if ML capabilities had frozen at the level of 2007 for 50 years, then at some point we would stop being able to make alignment progress without capabilities advancements but I think that in the current situation we are very very far from this Pareto frontier.
I largely disagree about the intrinsic motivation/reward function points. There is a lot of evidence that there is at least some amount of general intelligence which is independent of interest in particular fields/topics. Of course, if you have a high level of intelligence + interest then your dataset will be heavily oriented towards that topic and you will gain a lot of skill in it, but the underlying aptitude/intelligence can be factored out of this.
How exactly specific interests are encoded is a different and also super fascinating question! It definitely isn’t a pure ‘bit prediction’ intrinsic curiosity since different people seem to care a lot about different kinds of bits. It is at least somewhat affected by external culture / datasets but not entirely (people can often be interested in things against cultural pressure or often before they really know what their interest is). It doesn’t seem super influenced by external reward in a lot of cases. To some extent it ties in with intrinsic aptitude (people tend to be interested in things they are good at) but of course this is at least somewhat circular since people tend to get better at things they are interested in, ceteris paribus.
The hyperparameters point is a good one. I was thinking about this largely in terms of architectural changes, but I think I was wrong about that: hyperparameters are much more continuous and also potentially much more genetically flexible. This seems a better and more likely explanation for continuous IQ distributions than architecture directly. It would definitely be interesting to know how robust the brain is to these kinds of hyperparameter variations (i.e. over what range do people vary, and is it systematic). In ML, my understanding is that at large scale models are generally pretty robust to small hyperparameter variations (allowing people to get away with cargo-culting hyperparams from other related papers instead of always sweeping themselves), although of course really bad hyperparams destroy performance. The brain may also be less stable, due to some combination of recurrent dynamics/active data selection leading to positive or negative loops, as well as just more weird architectural hyperparameters leading to more interactions and ways for things to go wrong.
I think this is a good intuition. I think this comes down to the natural structure of the graph and the fact that information disappears at larger distances. This means that for dense graphs such as lattices, regions only implicitly interact through much lower-dimensional max-ent variables, which are then additive; while for other causal graph structures, such as the power-law small-world graphs that are probably sensible for many real-world datasets, you get a similar thing, where each cluster can be modelled mostly independently apart from a few long-range interactions, which can themselves be modelled as interactions with some general ‘cluster sum’. Interestingly, this is what many approximate Bayesian inference algorithms for factor graphs look like, such as the region graph algorithm (http://pachecoj.com/courses/csc665-1/papers/Yedidia_GBP_InfoTheory05.pdf).
I definitely agree it would be really nice to have the math of this all properly worked out, as I think this, as well as the reason why we see power-law spectra of features so often in natural datasets (which must have a max-ent explanation), is a super common and deep feature of the world.
Unfortunately our code is tied too closely to our internal infrastructure for it to be worth disentangling for this post. I am considering putting together a repo containing all the plots we made, though, since in the post we only publish a few exemplars and ask people to trust that the rest look similar. Most of the experiments are fairly simple and involve just gathering activation or weight data and plotting it.
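For reference, most of the experiments have roughly this shape (a schematic with a small public model standing in for ours; the layer choice is arbitrary):

```python
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Gather weight data and plot it -- the experiments are mostly variations
# on this pattern over different layers and quantities (weights,
# activations, their spectra, etc.).
weights = model.transformer.h[0].attn.c_attn.weight.detach().flatten()
plt.hist(weights.numpy(), bins=200)
plt.title("Layer 0 attention weight distribution")
plt.show()
```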
I think this is a really good post. You might be interested in these two posts which explore very similar arguments on the interactions between search in the world model and more general ‘intuitive policies’ as well as the fact that we are always optimizing for our world/reward model and not reality and how this affects how agents act.
While I agree with a lot of points of this post, I want to quibble with the ‘RL does not maximise reward’ point. I agree that model-free RL algorithms like DPO do not directly maximise reward but instead ‘maximise reward’ in the same way self-supervised models ‘minimise cross-entropy’—that is to say, the model is not explicitly reasoning about minimising cross-entropy but learns distilled heuristics that end up resulting in policies/predictions with a good reward/cross-entropy. However, it is also possible to produce architectures that do directly optimise for reward (or cross-entropy). AIXI is incomputable, but it definitely does maximise reward. MCTS algorithms also directly maximise reward. AlphaGo-style agents contain direct reward-maximising components initialized and guided by amortised heuristics (and the heuristics are distilled from the outputs of the maximising MCTS process in a self-improving loop). I wrote about the distinction between these two kinds of approaches—direct vs amortised optimisation—here. I think it is important to recognise this because this is the way that AI systems will ultimately evolve, and it is also where most of the danger lies, vs simply scaling up pure generative models.
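A schematic of the distinction (the policy and learnt reward model here are toy stand-ins):

```python
import torch
import torch.nn as nn

policy = nn.Linear(32, 8)             # amortised: state -> action scores
reward_model = nn.Linear(32 + 8, 1)   # learnt reward over (state, action)

state = torch.randn(32)

# Amortised optimisation: one forward pass through distilled heuristics.
# The network never explicitly 'reasons about' reward at inference time.
amortised_action = policy(state).argmax()

# Direct optimisation: explicitly search for the action that maximises
# the (learnt) reward model -- this component actually maximises.
candidates = torch.eye(8)             # one-hot action candidates
scores = torch.stack([
    reward_model(torch.cat([state, a])) for a in candidates
])
direct_action = scores.argmax()
# AlphaGo-style systems combine both: the amortised policy proposes,
# and a direct search (MCTS) over the model refines and maximises.
```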