PhD student at UCL DARK doing RL, OOD Robustness and safety. Interested in self improvement.
This feels kind of like a semantic disagreement to me. To ground it, it’s probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cars (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness, possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I’m uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.
I think that “don’t kill humans” can’t chain into itself because there’s not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas “drink juice” does have this property.
I’m trying to understand why the juice shard has this property. Which of these (if any) is the explanation for this:
Bigger juice shards will bid on actions which will lead to juice multiple times over time, as it pushes the agent towards juice from quite far away (both temporally and spatially), and hence will be strongly reinforced when the reward comes, even though it’s only a single reinforcement event (actually getting the juice).
Juice will be acquired more with stronger juice shards, leading to a kind of virtuous cycle, assuming that getting juice is always positive reward (or positive advantage/reinforcement, to avoid zero-point issues)
The first seems at least plausible to also apply to “avoid moldy food”, if it requires multiple steps of planning to avoid moldy food (throwing out moldy food, buying fresh ingredients and then cooking them, etc.)
The second does seem to be more specific to juice than mold, but it seems to me that’s because getting juice is rare, and is something we can get better and better at, whereas avoiding moldy food is something that’s fairly easy to learn, and past that there’s not much reinforcement to happen. If that’s the case, then I kind of see that as being covered by the rare-states explanation in my previous comment, or maybe an extension of that to “rare states and skills in which improvement leads to more reward”.
Having just read tailcalled’s comment, I think that is in some sense another way of phrasing what I was trying to say, where rare (but not too rare) states are likely to mean that policy-caused variance is high on those decisions. Probably policy-caused variance is more fundamental/closer as an explanation to what’s actually happening in the learning process, but maybe states of certain rarity which are high-reward/reinforcement are one possible environmental feature that produces policy-caused variance.
So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.
One example: A tool that is designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example see https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without) doesn’t seem like it would be useful for improving the reliability of self-driving cars, as likely self-driving cars aren’t misaligned in the sense that they could drive perfectly safely but choose not to, but rather are just unable to drive perfectly safely because some of their internal (learned) systems aren’t sufficiently robust.
Not Paul, but some possibilities why ARC’s work wouldn’t be relevant for self-driving cars:
The stuff Paul said about them aiming at understanding quite simple human values (don’t kill us all, maintain our decision-making power) rather than subtle things. It’s likely for self-driving cars we’re more concerned with high reliability and hence would need to be quite specific. E.g., maybe ARC’s approach could discern whether a car understands whether it’s driving on the road or not (seems like a fairly simple concept), but not whether it’s driving in a riskier way than humans in specific scenarios.
One of the problems that I think ARC is worried about is ontology identification, which seems like a meaningfully different problem for sub-human systems (whose ontologies are worse than ours, so in theory could be injected into ours) than for human-level or super-human systems (where that may not hold). Hence focusing on the super-human case would look weird and possibly not helpful for the subhuman case, although it would be great if they could solve all the cases in full generality.
Maybe once it works ARC’s approach could inform empirical work which helps with self-driving cars, but if you were focused on actually doing the thing for cars you’d just aim directly at that, whereas ARC’s approach would be a very roundabout and needlessly complex and theoretical way of solving the problem (this may or may not actually be the case, maybe solving this for self-driving cars is actually fundamentally difficult in the same way as for ASI, but it seems less likely).
I found it useful to compare a shard that learns to pursue juice (positive value) to one that avoids eating mouldy food (prohibition), just so they’re on the same kind of framing/scale.
It feels like a possible difference between prohibitions and positive values is that positive values specify a relatively small portion of the state space that is good/desirable (there are not many states in which you’re drinking juice), and hence possibly only activate less frequently, or only when parts of the state space like that are accessible, whereas prohibitions specify a large part of the state space that is bad (but not so much that the complement is a small portion—there are perhaps many potential states where you eat mouldy food, but the complement of that set is still not a similar size to the set of states of drinking juice). The first feels more suited to forming longer-term plans towards the small part of the state space (cf this definition of optimisation), whereas the second is less so. Then shards that start doing optimisation like this are hence more likely to become agentic/self-reflective/meta-cognitive etc.
In effect, positive values are more likely/able to self-chain because they actually (kind of, implicitly) specify optimisation goals, and hence shards can optimise them, and hence grow and improve that optimisation power, whereas prohibitions specify a much larger desirable state set, and so don’t require or encourage optimisation as much.
As an implication of this, I could imagine that in most real-world settings “don’t kill humans” would act as you describe, but in environments where it’s very easy to accidentally kill humans, such that states where you don’t kill humans are actually very rare, then the “don’t kill humans” shard could chain into itself more, and hence become more sophisticated/agentic/reflective. Does that seem right to you?
Thanks for the answer! I feel uncertain whether that suggestion is an “alignment” paradigm/method though—either these formally specified goals don’t cover most of the things we care about, in which case this doesn’t seem that useful, or they do, in which case I’m pretty uncertain how we can formally specify them—that’s kind of the whole outer alignment problem. Also, there is still (weaker) pressure to produce outputs that look good to humans, if humans are searching over goals to find those that produce good outputs. I agree it’s further away, but that seems like it could also be a bad thing, if it makes it harder to pressure the models to actually do what we want in the first place.
I still don’t think you’ve proposed an alternative to “training a model with human feedback”. “maintaining some qualitative distance between the optimisation target for an AI model and the human “does this look good?” function” sounds nice, but how do we even do that? What else should we optimise the model for, or how should we make it aligned? If you think the solution is to use AI-assisted humans as overseers, then that doesn’t seem to be a real difference with what Buck is saying. So even if he actually had written that he’s not aware of an alternative to “training a model with human/overseer feedback”, I don’t think you’ve refuted that point.
An existing example of something like the difference between amortised and direct optimisation is doing RLHF (w/o KL penalties to make the comparison exact) vs doing rejection sampling (RS) with a trained reward model. RLHF amortises the cost of directly finding good outputs according to the reward model, such that at evaluation the model can produce good outputs with a single generation, whereas RS requires no training on top of the reward model, but uses lots more compute at evaluation by generating and filtering with the RM. (This case doesn’t exactly match the description in the post as we’re using RL in the amortised optimisation rather than SL. This could be adjusted by gathering data with RS, and then doing supervised fine-tuning on that RS data, and seeing how that compares to RS).
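To make the comparison above concrete, here's a minimal sketch of the rejection-sampling side (best-of-n against a reward model). Everything here is a stand-in of my own: "generations" are just vectors, and the quadratic reward model is purely illustrative. RLHF would instead amortise this search by shifting the policy itself during training, so a single generation already scores well.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_model(output):
    # Stand-in for a trained reward model scoring a generation
    # (here a generation is a vector, scored by closeness to 1).
    return float(-np.sum((output - 1.0) ** 2))

def generate(policy_mean):
    # Stand-in for drawing one sample from the policy (the LM).
    return policy_mean + rng.normal(size=policy_mean.shape)

def rejection_sample(policy_mean, n):
    # Direct optimisation: no training on top of the reward model,
    # but n generations plus filtering at evaluation time (best-of-n).
    candidates = [generate(policy_mean) for _ in range(n)]
    scores = [reward_model(c) for c in candidates]
    best = candidates[int(np.argmax(scores))]
    return best, max(scores)

# Spend 64x the evaluation compute to find a high-reward output;
# RLHF would aim to get comparable reward from a single generation.
best, best_score = rejection_sample(np.zeros(4), n=64)
```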
Given we have these two types of optimisation, I think two key things to consider are how each type of optimisation interacts with Goodhart’s Law, and how they both generalise (kind of analogous to outer/inner alignment, etc.):
The work on overoptimisation scaling laws in this setting shows that, at least on distribution, there does seem to be a meaningful difference to the over-optimisation behaviour between the two types of optimisation—as shown by the different functional forms for RS vs RLHF.
I think the generalisation point is most relevant when we consider that the optimisation process used (either in direct optimisation to find solutions, or in amortised optimisation to produce the dataset to amortise) may not generalise perfectly. In the setting above, this corresponds to the reward model not generalising perfectly. It would be interesting to see a similar investigation as the overoptimisation work but for generalisation properties—how does the generalisation of the RLHF policy relate to the generalisation of the RM, and similarly to the RS policy? Of course, over-optimisation and generalisation probably interact, so it may be difficult to disentangle whether poor performance under distribution shift is due to over-optimisation or misgeneralisation, unless we have a gold RM that also generalises perfectly.
Instead, Aligned AI used its technology to automatically tease out the ambiguities of the original data.
Could you provide any technical details about how this works? Otherwise I don’t know what to take from this post.
Question: How do we train an agent which makes lots of diamonds, without also being able to robustly grade expected-diamond-production for every plan the agent might consider?
I thought you were about to answer this question in the ensuing text, but it didn’t feel to me like you gave an answer. You described the goal (values-child), but not how the mother would produce values-child rather than produce evaluation-child. How do you do this?
You might well expect that features just get ignored below some threshold and monosemantically represented above it, or it could be that you just always get a polysemantic morass in that limit
I guess the recent work on Polysemanticity and Capacity seems to suggest the latter case, especially in sparser settings, given the zone where multiple features are represented polysemantically, although I can’t remember if they investigate power-law feature frequencies or just uniform frequencies
were a little concerned about going down a rabbit hole given some of the discussion around whether the results replicated, which indicated some sensitivity to optimizer and learning rate.
My impression is that that discussion was more about whether the empirical results (i.e. do ResNets have linear mode connectivity?) held up, rather than whether the methodology used and present in the code base could be used to find whether linear mode connectivity is present between two models (up to permutation) for a given dataset. I imagine you could take the code and easily adapt it to check for LMC between two trained models pretty quickly (it’s something I’m considering trying to do as well, hence the code requests).
I think (at least in our case) it might be simpler to get at this question, and I think the first thing I’d do to understand connectivity is ask “how much regularization do I need to move from one basin to the other?” So for instance suppose we regularized the weights to directly push them from one basin towards the other, how much regularization do we need to make the models actually hop?
That would definitely be interesting to see. I guess this is kind of presupposing that the models are in different basins (which I also believe but hasn’t yet been verified). I also think looking at basins and connectivity would be more interesting in the case where there was more noise, either from initialisation, inherently in the data, or by using a much lower batch size so that SGD was noisy. In this case it’s less likely that the same configuration results in the same basin, but if your interventions are robust to these kinds of noise then it’s a good sign.
Good question! We haven’t tried that precise experiment, but have tried something quite similar. Specifically, we’ve got some preliminary results from a prune-and-grow strategy (holding sparsity fixed, pruning smallest-magnitude weights, enabling non-sparse weights) that does much better than a fixed sparsity strategy.
I’m not quite sure how to interpret these results in terms of the lottery ticket hypothesis though. What evidence would you find useful to test it?
That’s cool, looking forward to seeing more detail. I think these results don’t seem that related to the LTH (if I understand your explanation correctly), as LTH involves finding sparse subnetworks in dense ones. Possibly it only actually holds in models with many more parameters; I haven’t seen it investigated in models that aren’t overparametrised in a classical sense.
I think if iterative magnitude pruning (IMP) on these problems produced much sparser subnetworks that also maintained the monosemanticity levels, then that would suggest that sparsity doesn’t penalise monosemanticity (or polysemanticity) in this toy model, and also (much more speculatively) that the sparse well-performing subnetworks that IMP finds in other networks possibly also maintain their levels of poly/mono-semanticity. If we also think these networks are favoured towards poly or mono, then that hints at how the overall learning process is favoured towards poly or mono.
This work looks super interesting, definitely keen to see more!
Will you open-source your code for running the experiments and producing plots? I’d definitely be keen to play around with it. (They already did here: https://github.com/adamjermyn/toy_model_interpretability I just missed it. Thanks! Although it would be useful to have the plotting code as well, if that’s easy to share?)
Note that we primarily study the regime where there are more features than embedding dimensions (i.e. the sparse feature layer is wider than the input) but where features are sufficiently sparse that the number of features present in any given sample is smaller than the embedding dimension. We think this is likely the relevant limit for e.g. language models, where there are a vast array of possible features but few are present in any given sample.
I agree that N (true feature dimension) > d (observed dimension), and that sparsity will be high, but I’m uncertain whether the other part of the regime (that you don’t mention here), that k (model latent dimension) > N, is likely to be true. Do you think that is likely to be the case? As an analogy, I think the intermediate feature dimensions in MLP layers in transformers (analogously k) are much lower dimension than the “true intrinsic dimension of features in natural language” (analogously N), even if it is larger than the input dimension (embedding dimension × num_tokens, analogously d). So I expect k < N, whereas in your regime k > N. Do you think you’d be able to find monosemantic networks for k < N? Did you try out this regime at all? (I don’t think I could find it in the paper.)
In the paper you say that you weakly believe that monosemantic and polysemantic network parametrisations are likely in different loss basins, given they’re implementing very different algorithms. I think (given the size of your networks) it should be easy to test for at least linear mode connectivity with something like git re-basin (https://github.com/samuela/git-re-basin). Have you tried doing that? I think there are also algorithms for finding non-linear (e.g. quadratic) mode connectivity, although I’m less familiar with them. If it is the case that they’re in different basins, I’d be curious to see whether there are just two basins (poly vs mono), or a basin for each level of monosemanticity, or if even within a level of polysemanticity there are multiple basins. If it’s one of the former cases, it’d be interesting to do something like the connectivity-based fine-tuning talked about here (https://openreview.net/forum?id=NZZoABNZECq, in effect optimise for a new parametrisation that is linearly disconnected from the previous one), and see if doing that from a polysemantic initialisation can produce a more monosemantic one, or if it just becomes polysemantic in a different way.
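For concreteness, here's a minimal numpy sketch of the kind of check I mean, on a one-hidden-layer ReLU net: first permute one model's hidden units to match the other's (git re-basin solves this as an optimal assignment problem; the greedy matching here is my simplification), then measure the loss barrier along the linear interpolation between the two parameter settings. A barrier near zero suggests the models are in the same basin up to permutation; all names are mine.

```python
import numpy as np

def forward(params, X):
    W1, W2 = params
    return np.maximum(X @ W1, 0.0) @ W2   # one-hidden-layer ReLU net

def loss(params, X, Y):
    return float(np.mean((forward(params, X) - Y) ** 2))

def align_hidden_units(pa, pb):
    # Permute pb's hidden units to best match pa's incoming weights.
    # git re-basin uses an optimal assignment; greedy matching by
    # weight similarity is a simplification for this sketch.
    W1a, _ = pa
    W1b, W2b = pb
    sims = W1a.T @ W1b                    # unit-by-unit similarity
    perm, used = [], set()
    for i in range(W1a.shape[1]):
        j = next(j for j in np.argsort(-sims[i]) if j not in used)
        used.add(j)
        perm.append(j)
    perm = np.array(perm)
    return (W1b[:, perm], W2b[perm, :])

def loss_barrier(pa, pb, X, Y, n_points=11):
    # Max loss along the straight line between the two parameter
    # settings, minus the endpoint average: ~0 indicates linear
    # mode connectivity (up to the chosen permutation).
    ts = np.linspace(0.0, 1.0, n_points)
    losses = [loss(tuple((1 - t) * a + t * b for a, b in zip(pa, pb)), X, Y)
              for t in ts]
    return max(losses) - 0.5 * (losses[0] + losses[-1])
```

As a sanity check, if model B is literally a hidden-unit permutation of model A, naive interpolation can show a barrier while the aligned barrier is exactly zero.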
You also mentioned your initial attempts at sparsity through a hard-coded initially sparse matrix failed; I’d be very curious to see whether a lottery ticket-style iterative magnitude pruning was able to produce sparse matrices from the high-latent-dimension monosemantic networks that are still monosemantic, or more broadly how the LTH interacts with polysemanticity—are lottery tickets less polysemantic, or more, or do they not really change the monosemanticity?
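For reference, the IMP procedure I have in mind is roughly the following sketch (with rewinding to initialisation, as in the original lottery ticket work). `train_fn` is a placeholder of my own for whatever training loop the toy models use; it should train subject to the mask and return the trained weights.

```python
import numpy as np

def iterative_magnitude_prune(init_weights, train_fn, prune_frac=0.2, rounds=5):
    # Lottery-ticket-style IMP: train, prune the smallest-magnitude
    # surviving weights, rewind the survivors to their initial values,
    # and repeat. `train_fn(weights, mask)` stands in for a training
    # loop that respects the mask and returns trained weights.
    mask = np.ones_like(init_weights)
    for _ in range(rounds):
        trained = train_fn(init_weights * mask, mask)  # rewind + retrain
        alive = np.abs(trained[mask == 1])
        threshold = np.quantile(alive, prune_frac)     # prune bottom frac
        mask = mask * (np.abs(trained) >= threshold)
    return mask
```

The interesting measurement would then be whether the surviving subnetwork's neurons are more or less monosemantic than the dense network's.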
If my understanding of the bias decay method is correct, is a large initial part of training only reducing the bias (through weight decay) until certain neurons start firing? If that’s the case, could you calculate the maximum output in the latent dimension on the dataset at the start of training (say B), and then initialise the bias to be just below -B, so that you skip almost all of the portion of training that’s only moving the bias term. You could do this per-neuron or just by maxing over neurons. Or is this portion of training relatively small compared to the rest of training, and the slower convergence more due to fewer neurons getting gradients even when some of them are outputting higher than the bias?
Thanks for writing the post, and it’s great to see that (at least implicitly) lots of the people doing mechanistic interpretability (MI) are talking to each other somewhat.
Some comments and questions:
I think “science of deep learning” would be a better term than “deep learning theory” for what you’re describing, given that I think all the phenomena you list aren’t yet theoretically grounded or explained in a mathematical way, and are rather robust empirical observations. Deep learning theory could be useful, especially if it had results concerning the internals of the network, but I think that’s a different genre of work to the science of DL work.
In your description of the relevance of the lottery ticket hypothesis (LTH), it feels like a bit of a non-sequitur to immediately discuss removing dangerous circuits at initialisation. I guess you think this is because lottery tickets are in some way about removing circuits at the beginning of training (although currently we only know how to find out which circuits by getting to the end of training)? I think the LTH potentially has broader relevance for MI, i.e.: if lottery tickets do exist and are of equal performance, then it’s possible they’d be easier to interpret (due to increased sparsity); or just understanding what the existence of lottery tickets means for what circuits are more likely to emerge during neural network training.
When you say “Automating Mechanistic Interpretability research”, do you mean automating (1) the task of interpreting a given network (automating MI), or automating (2) the research of building methods/understanding/etc. that enable us to better-interpret neural networks (automating MI Research)? I realise that a lot of current MI research, even if the ultimate goal is (2), is mostly currently doing (1) as a first step.
Most of the text in that section implies automating (1) to me, but “Eventually, we might also want to automate the process of deciding which interventions to perform on the model to improve AI safety” seems to lean more towards automating (2), which comes under the general approach of automating alignment research. Obviously it would be great to be able to do both of them, but automating (1) seems both much more tractable, and also probably necessary to enable scalable interpretability of large models, whereas (2) is potentially less necessary for MI research to be useful for AI safety.
I’ve now had a conversation with Evan where he’s explained his position here and I now agree with it. Specifically, this is in low path-dependency land, and just considering the simplicity bias. In this case it’s likely that the world model will be very simple indeed (e.g. possibly just some model of physics), and all the hard work is done at run time in activations by the optimisation process. In this setting, Evan argued that the only very simple goal we can encode (that we know of) that would produce behaviour that maximises the training objective is any arbitrary simple objective, combined with the reasoning that deception and power are instrumentally useful for that objective. To encode an objective that would reliably produce maximisation of the training objective always (e.g. internal or corrigible alignment), you’d need to fully encode the training objective.
Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.
The main argument of the post isn’t “ASI/AGI may be causally confused, what are the consequences of that” but rather “Scaling up static pretraining may result in causally confused models, which hence probably wouldn’t be considered ASI/AGI”. I think in practice if we get AGI/ASI, then almost by definition I’d think it’s not causally confused.
OOD misgeneralisation is absolutely inevitable, due to Gödel’s incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity
In a theoretical sense this may be true (I’m not really familiar with the argument), but in practice OOD misgeneralisation is probably a spectrum, and models can be more or less causally confused about how the world works. We’re arguing here that static training, even when scaled up, plausibly doesn’t lead to a model that isn’t causally confused about a lot of how the world works.
Did you use the term “objective misgeneralisation” rather than “goal misgeneralisation” on purpose? “Objective” and “goal” are synonyms, but “objective misgeneralisation” is hardly used, “goal misgeneralisation” is the standard term.
No reason, I’ll edit the post to use goal misgeneralisation. Goal misgeneralisation is the standard term but hasn’t been so for very long (see e.g. this tweet: https://twitter.com/DavidSKrueger/status/1540303276800983041).
Maybe I miss something obvious, but this argument looks wrong to me, or it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, but this is false for deep neural networks
Given that the model is trained statically, while it could hypothesise about additional variables of the kinds you listed, it can never know which variables or which values for those variables are correct without domain labels or interventional data. Specifically, while “Discovering such hidden confounders doesn’t give interventional capacity” is true, to discover these confounders one would need interventional capacity.
I don’t understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?
We’re not saying that P(shorts, ice cream) is good for decision-making, but P(shorts, do(ice cream)) is useful insofar as the goal is to make someone wear shorts, and providing ice cream is one of the possible actions (as the causal model will demonstrate that providing ice cream isn’t useful for making someone wear shorts).
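A toy simulation of the shorts/ice cream example makes the gap between the two quantities concrete, with hot weather as the confounder (the probabilities are purely illustrative numbers of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy structural causal model: hot weather causes both ice cream
# eating and shorts wearing; ice cream has no causal effect on shorts.
hot = rng.random(n) < 0.5
icecream = np.where(hot, rng.random(n) < 0.8, rng.random(n) < 0.1)
shorts = np.where(hot, rng.random(n) < 0.9, rng.random(n) < 0.05)

# Observational: P(shorts | ice cream) is high, purely via the confounder.
p_obs = shorts[icecream].mean()

# Interventional: do(ice cream) severs the weather -> ice cream edge,
# so P(shorts | do(ice cream)) is just P(shorts).
p_do = shorts.mean()

# p_obs comes out around 0.8 while p_do is around 0.48: an agent
# acting on the observational quantity would wrongly hand out ice
# cream to make people wear shorts; the causal model shows the
# intervention has no effect.
```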
What do these symbols in parens before the claims mean?
They are meant to be referring to the previous parts of the argument, but I’ve just realised that this hasn’t worked as the labels aren’t correct. I’ll fix that.
When you talk about whether we’re in a high or low path-dependence “world”, do you think that there is a (somewhat robust) answer to this question that holds across most ML training processes? I think it’s more likely that some training processes are highly path-dependent and some aren’t. We definitely have evidence that some are path-dependent, e.g. Ethan’s comment and other examples like https://arxiv.org/abs/2002.06305, and almost any RL paper where different random seeds of the training process often result in quite different results. Arguably I don’t think we have conclusive evidence of any particular existing training process being low path-dependence, because the burden of proof is heavy for proving that two models are basically equivalent on basically all inputs (given that they’re very unlikely to literally have identical weights, so the equivalence would have to be at a high level of abstraction).
Reasoning about the path dependence of a training process specifically, rather than whether all of the ML/AGI development world is path dependent, seems more precise, and also allows us to reason about whether we want a high or low path-dependence training process, and considering that as an intervention, rather than a state of the world we can’t change.
When you say “the knowledge of what our goals are should be present in all models”, by “knowledge of what our goals are” do you mean a pointer to our goals (given that there are probably multiple goals which are combined in some way) is in the world model? If so, this seems to contradict what you said earlier:
The deceptive model has to build such a pointer [to the training objective] at runtime, but it doesn’t need to have it hardcoded, whereas the corrigible model needs it to be hardcoded
I guess I don’t understand what it would mean for the deceptive AI to have the knowledge of what our goals are (in the world model), without that meaning it has a hard-coded pointer to what our goals are. I’d imagine that what it means for the world model to capture what our goals are is exactly having such a pointer to them.
(I realise I’ve been failing to do this, but it might make sense to use AI when we mean the outer system and model when we mean the world model. I don’t think this is the core of the disagreement, but it could make the discussion clearer. For example, when you say the knowledge is present in the model, do you mean the world model or the AI more generally? I assumed the former above.)
To try and run my (probably inaccurate) simulation of you: I imagine you don’t think that’s a contradiction above. So you’d think that “knowledge of what our goals are” doesn’t mean a pointer to our goals in all the AI’s world models, but something simpler, that can be used to figure out what our goals are by the deceptive AI (e.g. in its optimisation process), but wouldn’t enable the aligned AI to use as its objective a simpler pointer, and instead would require the aligned AI to hard-code the full pointer to our goals (where the pointer would be pointing into its world model, and probably using this simpler information about our goals in some way). I’m struggling to imagine what that would look like.
Even agreeing that no additional complexity is required to rederive that it should try to be deceptive (assuming it has situational awareness of the training process and long-term goals which aren’t aligned with ours), to be deceptive successfully, it then needs to rederive what our goals are, so that it can pursue them instrumentally. I’m arguing that the ability to do this in the AI would require additional complexity compared to an AI that doesn’t need to rederive the content of this goal (that is, our goal) at every decision.
Alternatively, the aligned model could use the same derivation process to be aligned: the deceptive model has some long-term goal, and in pursuing it rederives the content of the instrumental goal “do ‘what the training process incentivises’”, and the aligned model has the long-term goal “do ‘what the training process incentivises’” (as a pointer/de dicto), and also rederives it with the same level of complexity. I think “do ‘what the training process incentivises’” (as a pointer/de dicto) isn’t a very complex long-term goal, and feels likely to be as complex as the arbitrary crystallised deceptive AI’s internal goal, assuming both models have full situational awareness of the training process and hence such a pointer is possible, which we’re assuming they do.
(ETA/Meta point: I do think deception is a big issue that we definitely need more understanding of, and I definitely put weight on it being a failure of alignment that occurs in practice, but I think I’m less sure it’ll emerge (or less sure that your analysis demonstrates that). I’m trying to understand where we disagree, and whether you’ve considered the doubts I have and you possess good arguments against them or not, rather than convince you that deception isn’t going to happen.)
It seems a lot more computationally difficult to, at every forward pass/decision process, derive/build/construct such a pointer. If the deceptive model is going to be doing this every time, it seems like it would be more efficient to have a dedicated part of the network that calculates it (i.e. have it in the weights)
Separately, for more complex goals this procedure is also going to be more complex, and the network probably needs to be more complex to support constructing it in the activations at every forward pass, compared to the corrigible model that doesn’t need to do such a construction (because it has it hard-coded as you say). I guess I’m arguing that the additional complexity in the deceptive model that allows it to rederive our goals at every forward pass compensates for the additional complexity in the corrigible model that has our goals hard-coded.
whereas the corrigible model needs it to be hardcoded
The corrigible model needs to be able to robustly point to our goals, in a way that doesn’t change. One way of doing this is having the goals hardcoded. Another way might be to instead have a pointer to the output of a procedure that is executed at runtime that always constructs our goals in the activations. If the deceptive model can reliably construct in its activations something that actually points towards our goals, then the corrigible model could also have such a procedure, and make its goal be a pointer to the output of such a procedure. Then the only difference in model complexity is that the deceptive model points to some arbitrary attribute of the world model (or whatever), and the aligned model points to the output of this computation, which both models possess.
I think at a high level I’m trying to say that any way in which the deceptive model can robustly point at our goals such that it can pursue them instrumentally, the aligned model can robustly point at them to pursue them terminally. SGD+DL+whatever may favour one way or another of robustly pointing at such goals (either in the weights, or through a procedure that robustly outputs them in the activations), but both deceptive and aligned models could make use of that.
I think the hyperlink for “conv nets without residual streams” is wrong? It’s https://www.westernunion.com/web/global-service/track-transfer for me