I do AI Alignment research. Currently at METR, but previously at: Redwood Research, UC Berkeley, Good Judgment Project.
I’m also a part-time fund manager for the LTFF.
Obligatory research billboard website: https://chanlawrence.me/
Huh, that’s indeed somewhat surprising if the SAE features are capturing the things that matter to CLIP (in that they reduce loss) and only those things, as opposed to “salient directions of variation in the data”. I’m curious exactly what “failing to work” means—here I think the negative result (and the exact details of said result) are arguably more interesting than a positive result would be.
The general version of this statement is something like: if your beliefs satisfy the law of total expectation, the variance of the whole process should equal the variance of all the increments involved in the process.[1] In the case of the random walk where at each step, your beliefs go up or down by 1% starting from 50% until you hit 100% or 0% -- the variance of each increment is 0.01^2 = 0.0001, and the variance of the entire process is 0.5^2 = 0.25, hence you need 0.25/0.0001 = 2500 steps in expectation. If your beliefs have probability p of going up or down by 1% at each step, and 1-p of staying the same, the variance is reduced by a factor of p, and so you need 2500/p steps.
(Indeed, something like this is the standard way to derive the expected number of steps before a random walk hits an absorbing barrier.)
Similarly, you get that if you start at 20% or 80%, you need 1600 steps in expectation, and if you start at 1% or 99%, you’ll need 99 steps in expectation.
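(For the curious: here is a quick Monte Carlo sketch, written by me as a hypothetical illustration rather than anything from the original exchange, that checks the 2500/1600/99 numbers by simulating the ±1% random walk directly.)

```python
import random

def mean_steps_to_absorption(start_pct, trials=2_000):
    """Simulate a credence (in integer percent) that moves +-1 with equal
    probability until it hits 0 or 100; return the average number of steps."""
    total = 0
    for _ in range(trials):
        p, n = start_pct, 0
        while 0 < p < 100:
            p += 1 if random.random() < 0.5 else -1
            n += 1
        total += n
    return total / trials

# The variance argument predicts p0 * (1 - p0) / 0.01**2 steps in expectation.
for pct in (50, 20, 1):
    p0 = pct / 100
    print(pct, mean_steps_to_absorption(pct), p0 * (1 - p0) / 0.01**2)
```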
One problem with your reasoning above is that, as the 1%/99% case shows, needing 99 steps in expectation does not mean you will take 99 steps with high probability—in this case, there’s a 50% chance you need only one update before you’re certain (!), there’s just a tail of very long sequences. In general, the expected value of a variable need not look like its typical values, especially when the distribution has a long tail.
I also think you’re underrating how much the math changes when your beliefs do not come in the form of uniform updates. In the most extreme case, suppose your current 50% doom number comes from imagining that doom is uniformly distributed over the next 10 years, and zero after -- then the median update size per week is only 0.5/520 ~= 0.096%/week, and the expected number of weeks with a >1% update is 0.5 (it only happens when you observe doom). Even if we buy a time-invariant random walk model of belief updating, as the expected size of your updates gets larger, you also expect there to be quadratically fewer of them—e.g. if your updates came in increments of size 0.1 instead of 0.01, you’d expect only 25 such updates!
Applying stochastic process-style reasoning to beliefs is empirically very tricky, and results can vary a lot based on seemingly reasonable assumptions. E.g. I remember Taleb making a bunch of mathematically sophisticated arguments[2] that began with “Let your beliefs take the form of a Wiener process[3]” and then ended with an absurd conclusion, such as that 538′s forecasts are obviously wrong because their updates aren’t Gaussian distributed or aren’t around 50% until immediately before the election date. And famously, reasoning of this kind has often been an absolutely terrible idea in financial markets. So I’m pretty skeptical of claims of this kind in general.
There are some regularity conditions here, but calibrated beliefs about things whose truth/falsity you eventually learn should satisfy these by default.
Often in an attempt to Euler people who do forecasting work but aren’t super mathematical, like Philip Tetlock.
This is what happens when you take the limit of the discrete-time random walk, as you allow for updates on ever smaller time increments. You get Gaussian-distributed increments per unit time, W_{t+u} − W_t ~ N(0, u), and since the tail of your updates is very thin, you continue to get qualitatively similar results to your discrete-time random walk model above.
And yes, it is ironic that Taleb, who correctly points out the folly of normality assumptions repeatedly, often defaults to making normality assumptions in his own work.
When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn’t.
Also, another nitpick:
Humane vs human values
I think there’s a harder version of the value alignment problem, where the question looks like, “what’s the right goal/task spec to put inside a sovereign AI that will take over the universe?” You probably don’t want this sovereign AI to adopt the values of any particular human, or even modern humanity as a whole, so you need to do some Ambitious Value Learning/moral philosophy and not just intent alignment. In this scenario, the distinction between humane and human values does matter. (In fact, you can find people like Stuart Russell emphasizing this point a bunch.) Unfortunately, it seems that ambitious value learning is really hard, and the AIs are coming really fast, and also it doesn’t seem necessary to prevent x-risk, so...
Most people in AIS are trying to solve a significantly less ambitious version of this problem: just try to get an AI that will reliably try to do what a human wants it to do (i.e. intent alignment). In this case, we’re explicitly punting the ambitious value learning problem down the line. Here, we’re basically not talking about the problem of having an AI learn humane values, but instead the problem of having it “do what its user wants” (i.e. “human values” or “the technical alignment problem” in Nicky’s dichotomy). So it’s actually pretty accurate to say that a lot of alignment is trying to align AIs wrt “human values”, even if a lot of the motivation is trying to eventually make AIs that have “humane values”.[1] (And it’s worth noting that making an AI that’s robustly intent aligned sure seems to require tackling a lot of the ‘intuition’-derived problems you bring up already!)
uh, that being said, I’m not sure your framing isn’t just … better anyways? Like, Stuart seems to have lots of success talking to people about assistance games, even if it doesn’t faithfully represent what a majority of the field thinks is the highest-priority thing to work on. So I’m not sure if me pointing this out actually helps anyone here?
Of course, you need an argument that “making AIs aligned with user intent” eventually leads to “AIs with humane values”, but I think the straightforward argument goes through—i.e. it seems that a lot of the immediate risk comes from AIs that aren’t doing what their users intended, and having AIs that are aligned with user intent seems really helpful for tackling the tricky ambitious value learning problem.
Also, I added another sentence trying to clarify what I meant at the end of the paragraph, sorry for the confusion.
No, I’m saying that “adding ‘logic’ to AIs” doesn’t (currently) look like “figure out how to integrate insights from expert systems/explicit bayesian inference into deep learning”, it looks like “use deep learning to nudge the AI toward being better at explicit reasoning by making small changes to the training setup”. The standard “deep learning needs to include more logic” take generally assumes that you need to add the logic/GOFAI juice in explicitly, while in practice people do a slightly different RL or supervised finetuning setup instead.
(EDITED to add: so while I do agree that “LMs are bad at the things humans do with ‘logic’ and good at ‘intuition’” is a decent heuristic, I think the distinction that we’re talking about here is instead about the transparency of thought processes/”how the thing works”, and not about whether the thing itself is doing explicit or implicit reasoning. Do note that this is a nitpick (as the section header says) that’s mainly about framing and not about the core content of the post.)
That being said, I’ll still respond to your other point:
Chain of thought is a wonderful thing, it clears a space where the model will just earnestly confess its inner thoughts and plans in a way that isn’t subject to training pressure, and so it, in most ways, can’t learn to be deceptive about it.
I agree that models with CoT (in faithful, human-understandable English) are more interpretable than models that do all their reasoning internally. And obviously I can’t really argue against CoT being helpful in practice; it’s one of the clear baselines for eliciting capabilities.
But I suspect you’re making a distinction about “CoT” that is actually mainly about supervised finetuning vs RL, and not a benefit of CoT in particular. If the CoT comes from pretraining or supervised fine-tuning, the ~myopic next-token-prediction objective indeed does not apply much training pressure in the relevant ways.[1] Once you start doing any outcome-based supervision (i.e. RL) without good regularization, I think the story for CoT looks less clear. And the techniques people use for improving CoT tend to involve upweighting entire trajectories based on their reward (RLHF/RLAIF with your favorite RL algorithm), which does incentivize playing the training game unless you’re very careful with your fine-tuning.
(EDITED to add: Or maybe the claim is, if you do CoT on a ‘secret’ scratchpad (i.e. one that you never look at when evaluating or training the model), then this would by default produce more interpretable thought processes?)
I’m not sure this is true in the limit (e.g. it seems plausible to me that the Solomonoff prior is malign). But it’s most likely true in the next few years and plausibly true in all practical cases that we might consider.
I think this is really quite good, and went into way more detail than I thought it would. Basically my only complaints on the intro/part 1 are some terminology and historical nitpicks. I also appreciate the fact that Nicky just wrote out her views on AIS, even if they’re not always the most standard ones or other people dislike them (e.g. pointing at the various divisions within AIS, and the awkward tension between “capabilities” and “safety”).
I found the inclusion of a flashcard review applet for each section super interesting. My guess is it probably won’t see much use, and I feel like this is the wrong genre of post for flashcards.[1] But I’m still glad this is being tried, and I’m curious to see how useful/annoying other people find it.
I’m looking forward to parts two and three.
Logic vs Intuition:
I think the “logic vs intuition” frame feels like it’s pointing at a real thing, but it seems somewhat off. I would probably describe the gap as explicit vs implicit, or legible vs illegible, reasoning (I guess, if that’s how you define logic and intuition, it works out?).
Mainly because I’m really skeptical of claims of the form “to make a big advance in/to make AGI from deep learning, just add some explicit reasoning”. People have made claims of this form for as long as deep learning has been a thing. Not only have these claims basically never panned out historically, these days “adding logic” often means “train the model harder and include more CoT/code in its training data” or “finetune the model to use an external reasoning aide”, and not “replace parts of the neural network with human-understandable algorithms”. (EDIT for clarity: That is, I’m skeptical of claims that what’s needed to ‘fix’ deep learning is by explicitly implementing your favorite GOFAI techniques, in part because successful attempts to get AIs to do more explicit reasoning look less like hard-coding in a GOFAI technique and more like other deep learning things.)
I also think this framing mixes together “problems of game theory/high-level agent modeling/outer alignment vs problems of goal misgeneralization/lack of robustness/lack of transparency” and “the kind of AI people did 20-30 years ago” vs “the kind of AI people do now”.
This model of logic and intuition (as something to be “unified”) is quite similar to a frame of the alignment problem that’s common in academia. Namely, our AIs used to be written with known algorithms (so we can prove that the algorithm is “correct” in some sense) and performed only explicit reasoning (so we can inspect the reasoning that led to a decision, albeit often not in anything close to real time). But now it seems like most of the “oomph” comes from learned components of systems such as generative LMs or ViTs, i.e. “intuition”. The “goal” is to get a provably* safe AI that can use the “oomph” from deep learning while having enough transparency/explicit enough thought processes. (Though, as in the quote from Bengio in Part 1, sometimes this also gets mixed in with capabilities, and becomes a claim that AIs without interpretable thoughts won’t be competent.)
Has AI had a clean “swap” between Logic and Intuition in 2000?
To be clear, Nicky clarifies in Part 1 that this model is an oversimplification. But as a nitpick, if you had to pick a date, I’d probably pick 2012, when a conv net won the ImageNet 2012 competition in a dominant manner, and not 2000.
Even more of a nitpick, but the examples seem pretty cherry picked?
For example, Nicky uses the example of Deep Blue defeating Kasparov as an example of a “logic”-based AI. But in the case of chess, almost all chess AIs are still pretty much logic based. Using Stockfish as an example, Stockfish 16’s explicit alpha-beta search both uses a reasoning algorithm that we can understand and does the reasoning “in the open”. Its neural network eval function is doing (a small amount of) illegible reasoning. While part of the reasoning has become illegible, we can still examine the outputs of the alpha-beta search to understand why certain moves are good/bad. (But fair, this might be by far the most widely known non-deep-learning “AI”. The only other examples I can think of are Watson and recommender systems, but those were still using statistical learning techniques. I guess if you count MYCIN or SHRDLU or ELIZA...?)
(And modern diffusion models being unable to count or spell seem like a pathology specific to that class of generative model, and not say, Claude Opus.)
FOOM vs Exponential vs Steady Takeoff
Ryan already mentioned this in his comment.
When did AIs get better than humans (at ImageNet)?
In footnote [3], Nicky writes:
In 1997, IBM’s Deep Blue beat Garry Kasparov, the then-world chess champion. Yet, over a decade later in 2013, the best machine vision AI was only 57.5% accurate at classifying images. It was only until 2021, three years ago, that AI hit 95%+ accuracy.
But humans do not get 95% top-1 accuracy[3] on ImageNet! If you consult this paper from the ImageNet creators (https://arxiv.org/abs/1409.0575), they note that:
We found the task of annotating images with one of 1000 categories to be an extremely challenging task for an untrained annotator. The most common error that an untrained annotator is susceptible to is a failure to consider a relevant class as a possible label because they are unaware of its existence. (Page 31)
And even when using a human expert annotator, who did hundreds of validation images for practice, the human annotator still got a top-5 error of 5.1%, which was surpassed in 2015 by the original ResNet paper (https://arxiv.org/abs/1512.03385) at 4.49% for a single ResNet-152 (and 3.57% for an ensemble of six ResNets).
(Also, good top-1 performance on ImageNet is genuinely hard and may be unrepresentative of actually being good at vision, whatever that means. Take a look at some of the “mistakes” current models make:)
Using flashcards suggests that you want to memorize the concepts. But a lot of this piece isn’t so much an explainer of AI safety, but instead an argument for the importance of AI Safety. Insofar as the reader is not here to learn a bunch of new terms, but instead to reason about whether AIS is a real issue, it feels like flashcards are more of a distraction than an aid.
I’m writing this in part because I at some point promised Nicky longform feedback on her explainer, but uh, never got around to it until now. Whoops.
Top-K accuracy = you guess K labels, and are right if any of them are correct. Top-5 is significantly easier on ImageNet than top-1, because there’s a bunch of very similar classes and many images are ambiguous.
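(As a hypothetical illustration of what top-K accuracy means in code, with made-up toy logits and labels:)

```python
import numpy as np

def top_k_accuracy(logits, labels, k=5):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    topk = np.argpartition(logits, -k, axis=1)[:, -k:]  # indices of the k largest logits per row
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

# Toy example: 4 images, 10 classes, random scores.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
labels = np.array([3, 1, 7, 0])
print(top_k_accuracy(logits, labels, k=1), top_k_accuracy(logits, labels, k=5))
```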
I agree with many of the points made in this post, especially the “But my ideas/insights/research is not likely to impact much!” point. I find it plausible that in some subfields, AI x-risk people are too prone to publishing due to historical precedent and norms (maybe mech interp? though little has actually come of that). I also want to point out that there are non-zero arguments to expect alignment work to help more with capabilities, relative to existing “mainstream” capabilities work, even if I don’t believe this to be the case. (For example, you might believe that the field of deep learning spends too little time actually thinking about how to improve their models, and too much time just tinkering, in which case your thinking could have a disproportionate impact even after adjusting for the fact that you’re not trying to do capabilities.) And I think that some of the research labeled “alignment” is basically just capabilities work, and maybe the people doing them should stop.
I also upvoted the post because I think this attitude is pervasive in these circles, and it’s good to actually hash it out in public.
But as with most of the commenters, I disagree with the conclusion of the post.
I suspect the main cruxes between us are the following:
From paragraphs such as the following:
It’s very rare that any research purely helps alignment, because any alignment design is a fragile target that is just a few changes away from unaligned. There is no alignment plan which fails harmlessly if you fuck up implementing it, and people tend to fuck things up unless they try really hard not to (and often even if they do), and people don’t tend to try really hard not to. This applies doubly so to work that aims to make AI understandable or helpful, rather than aligned — a helpful AI will help anyone, and the world has more people trying to build any superintelligence (let’s call those “capabilities researchers”) than people trying to build aligned superintelligence (let’s call those “alignment researchers”).
And
“But my ideas/insights/research is not likely to impact much!” — that’s not particularly how it works? It needs to somehow be differenially helpful to alignment, which I think is almost never the case.
It seems that a big part of your world model is that ~no one who thinks they’re doing “alignment” work is doing real alignment work, and are really just doing capabilities work. In particular, it seems that you think interp or intent alignment are basically just capabilities work, insofar as their primary effect is helping people build unsafe ASI faster. Perhaps you think that, in the case of interp, before we can understand the AI in a way that’s helpful for alignment, we’ll understand it in a way that allows us to improve it. I’m somewhat sympathetic to this argument. But I think making it requires arguing that interp work doesn’t really contribute to alignment at all, and is thus better thought of as capabilities work (and same for intent alignment).
Perhaps you believe that all alignment work is useless, not because it’s misguided and actually capabilities work, but because we’re so far from building aligned ASI that ~all alignment work is useless, while we’re in an intermediate regime where additional insights non-negligibly hasten the arrival of unaligned ASI. But I think you should argue for that explicitly (as, say, Eliezer did in his death with dignity post), since I imagine most of the commenters here would disagree with this take.
My guess is this is the largest crux between us; if I thought all “alignment” work did nothing for alignment, and was perhaps just capabilities work in disguise, then I would agree that people should stop. In fact, I might even argue that we should just stop all alignment work whatsoever! Insofar as I’m correct about this being a crux, I’d like to see a post explicitly arguing for the lack of alignment relevancy of existing ‘alignment work’, which will probably lead to a more constructive conversation than this post.
I think empirically, very few (if not zero) capabilities insights have come from alignment work. And a priori, you might expect that research that aims to solve topic X produces marginally more X than a related topic Y. Insofar as you think that current “alignment” work is more than epsilon useful, I think you would not argue that most alignment work is differentially negative. So insofar as you think a lot of “alignment” work is real alignment work, you probably believe that many capabilities insights have come from past alignment work.
Perhaps you’re reluctant to give examples, for fear of highlighting them. I think the math doesn’t work out here—having a few clear examples from you would probably be sufficient to significantly reduce the number of published insights from the community as a whole. But, if you have many examples of insights that help capabilities but are too dangerous to highlight, I’d appreciate if you would just say that (and maybe we can find a trusted third party to verify your claim, but not share the details?).
Perhaps you might say, well, the alignment community is very small, so there might not be many examples that come to mind! To make this carry through, you’d still have to believe that the alignment community also hasn’t produced much good research. (Even though, naively, you might expect higher returns from alignment due to there being more unpicked low-hanging fruit due to its small size.) But then again, I’d prefer if you explicitly argued that ~all alignment is either useless or capabilities instead of gesturing at a generic phenomenon.
Perhaps you might say that capabilities insights are incredibly long tailed, and thus seeing no examples doesn’t mean that the expected harm is low. But, I think you still need to make some sort of plausibility argument here, as well as a story for why the existing ML insights deserve a lot of Shapley for capabilities advances, even though most of the “insights” people had were useless if not actively misleading.
I also think that there’s an obvious confounder, if you believe something along the lines of “focusing on alignment is correlated with higher rationality”. Personally, I also think the average alignment(-interested) researcher is more competent at machine learning or research in general than the average generic capabilities researcher (this probably becomes false once you condition on being at OAI, Anthropic, or another scaling lab). If you just compare “how many good ideas came from ‘alignment’ researchers per capita” to the corresponding number for ‘capability’ researchers, you may find that the former is higher because they’re just more competent. This goes back again into crux 1., where you then need to argue that competency doesn’t help at all in doing actual alignment work, and again, I suspect it’s more productive to just argue about the relevance and quality of alignment work instead of arguing about incidental capabilities insights.
From paragraphs such as the following:
Worse yet: if focusing on alignment is correlated with higher rationality and thus with better ability for one to figure out what they need to solve their problems, then alignment researchers are more likely to already have the ideas/insights/research they need than capabilities researchers, and thus publishing ideas/insights/research about AI is more likely to differentially help capabilities researchers. Note that this is another relative statement; I’m not saying “alignment researchers have everything they need”, I’m saying “in general you should expect them to need less outside ideas/insights/research on AI than capabilities researchers”.
It seems that you’re working with a model of research output with two main components -- (intrinsic) rationality and (external) insights. But there’s a huge component that’s missing from this model: actual empirical experiments validating the insight, which is the ~bulk of actual capabilities work and a substantial fraction of alignment work. This matters both because ~no capabilities researchers will listen to you if you don’t have empirical experiments, and because, if you believe that you can deduce more alignment research “on your own”, you might also believe that you need to do more empirical experiments to do capabilities research (and thus that the contribution per insight is by default a lot smaller).
Even if true insights are differentially more helpful for capabilities, the fact that it seems empirically difficult to know which insights are true means that a lot of the work in getting a true insight will involve things that look a lot more like normal capabilities work—e.g. training more capable models. But surely then, the argument would be reducable to: if you do capabilities work, don’t share it on pain of accelerating ASI progress—which seems like something your audience already agrees with!
That being said, I think I might disagree with your premise here. My guess is that alignment, by being less grounded than capabilities, probably requires more outside ideas/insights/research, just for sanity checking reasons (once you control for competence of researcher and the fact that there’s probably more low-hanging fruit in alignment). After all, you can just make a change and see if your log loss on pretraining goes down, but it’s a lot harder to know if your model of deceptive alignment actually is at all sensible. If you don’t improve your model’s performance on standard benchmarks, then this is evidence that your capability idea doesn’t work, but there aren’t even really any benchmarks for many of the problems alignment researchers think about. So it’s easier to go astray, and therefore more important to get feedback from other researchers.
Finally, to answer this question:
“So where do I privately share such research?” — good question!
I suspect that the way to go is to form working groups of researchers that stick together, and that maintain a high level of trust. e.g. a research organization. Then, do and share your research internally and think about possible externalities before publishing more broadly, perhaps doing a tiered release. (This is indeed the model used by many people in alignment orgs.)
While I’ve softened my position on this in the last year, I want to give a big +1 to this response, especially these two points:
It’s genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arxiv are garbage and don’t actually make any progress, but those papers are not the ones that are pushing AGI forward anyways.
[..]
High level ideas are generally not that valuable in and of themselves. People generally learn to ignore ideas unless they have strong empirical evidence of correctness (or endorsement of highly respected researchers) because there are simply too many ideas. The valuable thing is not the idea itself, but the knowledge of which ideas are actually correct.
(emphasis added)
I think it’s often challenging to just understand where the frontier is, because it’s so far and so many things are secret. And if you’re not at a scaling lab and then also don’t keep up with the frontier of the literature, it’s natural to overestimate the novelty of your insights. And then, if you’re too scared to investigate your insights, you might continue to think that your ideas are better than they are. Meanwhile, as an AI Safety researcher, not only is there a lot less distance to the frontier of whatever subfield you’re in, you’ll probably spend most of your time doing work that keeps you on the frontier.
Random insights can be valuable, but the history of deep learning is full of random insights that were right but for arguably the wrong reasons (batch/layernorm, Adam, arguably the algorithm that would later be rebranded as PPO), as well as brilliant insights that turned out to be basically useless (e.g. consider a lot of the Bayesian neural network stuff, but there’s really too many examples to list) if not harmful in the long run (e.g. lots of “clever” or not-so-clever ways of adding inductive bias). Part of the reason is that people don’t get taught the history of the field, and see all the oh-so-clever ideas that didn’t work, or how a lot of the “insights” were invented post-hoc. So if you’re new to deep learning you might get the impression that insights were more causally responsible for the capabilities advancements, than they actually are. Insofar as good alignment requires deconfusion and rationality to generate good insights, and capabilities does not, then you should expect that the insights you get from improving rationality/doing deconfusion are more impactful for alignment than capabilities.
I mean, if you actually do come up with a better initialization scheme, a trick that improves GPU utilization, or some other sort of cheap algorithmic trick to improve training AND check it’s correct through some small/medium-scale empirical experiments, then sure, please reconsider publishing that. But it’s hard to incidentally do that—even if you do come up with some insight while doing say, mech interp, it feels like going out of your way to test your capability ideas should be a really obvious “you’re basically doing capabilities” sign? And maybe, you should be doing the safety work you claim to want to do instead?
I don’t know what the “real story” is, but let me point at some areas where I think we were confused. At the time, we had some sort of hand-wavy result in our appendix saying “something something weight norm ergo generalizing”. Similarly, concurrent work from Ziming Liu and others (Omnigrok) had another claim based on the norm of generalizing and memorizing solutions, as well as a claim that representation is important.
One issue is that our picture doesn’t consider learning dynamics that seem actually important here. For example, it seems that one of the mechanisms that may explain why weight decay seems to matter so much in the Omnigrok paper is that fixing the norm to be large leads to an effectively tiny learning rate when you use Adam (which normalizes the gradients to be of fixed scale), especially when there’s a substantial radial component (which there is, when the init is too small or too big). This both probably explains why they found that training error was high when they constrained the weights to be sufficiently large in all their non-toy cases (see e.g. the mod add landscape below) and probably explains why we had difficulty using SGD+momentum (which, given our bad initialization, led to gradients that were way too big at some parts of the model, especially since we didn’t sweep the learning rate very hard). [1]
There’s also some theoretical results from SLT-related folk about how generalizing circuits achieve lower train loss per parameter (i.e. have higher circuit efficiency) than memorizing circuits (at least for large p), which seems to be a part of the puzzle that neither our work nor the Omnigrok touched on—why is it that generalizing solutions have lower norm? IIRC one of our explanations was that weight decay “favored more distributed solutions” (somewhat false) and “it sure seems empirically true”, but we didn’t have anything better than that.
There was also the really basic idea of how a relu/gelu network may do multiplication (by piecewise linear approximations of x^2, or by using the quadratic region of the gelu for x^2), which (I think) was first described in late 2022 in Ekin Akyürek’s “Transformers can implement Sherman-Morrison for closed-form ridge regression” paper? (That’s not the name, just the headline result.)
Part of the story for grokking in general may also be related to the Tensor Program results that claim the gradient on the embedding is too small relative to the gradient on other parts of the model, with standard init. (Also the embed at init is too small relative to the unembed.) Because the embed both starts too small and barely updates, there’s no representation learning going on, as opposed to just random feature regression (which overfits in the same way that regression on random features overfits absent regularization).
In our case, it turns out not to be true (because our network is tiny? because our weight decay is set aggressively at lambda=1?), since the weights that directly contribute to logits (W_E, W_U, W_O, W_V, W_in, W_out) all quickly converge to the same size (weight decay encourages spreading out weight norm between things you multiply together), while the weights that do not directly contribute to the logits all converge to zero.
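(For reference, the parenthetical fact is just the AM-GM inequality applied to a pair of multiplied weights; a quick sketch:)

```latex
% If the computed function only depends on the product c = w_1 w_2,
% then the weight-decay penalty satisfies (by AM-GM)
w_1^2 + w_2^2 \;\ge\; 2\,|w_1 w_2| \;=\; 2|c|,
% with equality iff |w_1| = |w_2|, so the penalty is minimized by
% spreading the norm equally across the multiplied factors.
```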
Bringing it back to the topic at hand: there are often a lot of “small” confusions that remain, even after doing good toy models work. It’s not clear how much any of these confusions matter (and do any of the grokking results that our paper, Ziming Liu et al., or the GDM grokking paper found actually matter?).
Haven’t checked, might do this later this week.
I think the key takeaway I wanted people to get is that superposition is something novel and non-trivial, and isn’t just a standard polysemantic neuron thing. I wrote this post in response to two interactions where people assumed that superposition was just polysemanticity.
It turned out that a substantial fraction of the post went the other way (i.e. talking about non-superposition polysemanticity), so maybe?
Also, have you looked at the dot product of each of the SAE directions/SAE-reconstructed representations with the ImageNet labels fed through the text encoder?
Cool work!
As with Arthur, I’m pretty surprised by how much easier vision seems to be than text for interp (in line with previous results). It makes sense why feature visualization and adversarial attacks work better with continuous inputs, but if it is true that you need fewer datapoints to recover concepts of comparable complexity, I wonder if it’s a statement about image datasets or about vision in general (e.g. “abstract” concepts are more useful for prediction, since the n-gram/skip n-gram/syntactical feature baseline is much weaker).
I think the most interesting result to me is your result where the loss went down (!!):
Note that the model with the SAE attains a lower loss than the original model. It is not clear to me why this is the case. In fact, the model with the SAE gets a lower loss than the original model within 40 000 training tokens.
My guess is this happens because CLIP wasn’t trained on ImageNet—but instead a much larger dataset that comes from a different distribution. A lot of the SAE residual probably consists of features that are useful on the larger dataset, but not ImageNet. If you extract the directions of variation on ImageNet instead of OAI’s 400m image-text pair dataset, it makes sense why reconstructing inputs using only these directions leads to better performance on the dataset you found these inputs on.
I’m not sure how you computed the contrastive loss here—is it just the standard contrastive loss, but on image pairs instead of image/text pairs (using the SAE’ed ViT for both representations), or did you use the contextless class label as the text input here (only SAE’ing the ViT part but not the text encoder). Either way, this might add additional distributional shift.
(And I could be misunderstanding what you did entirely, and that you actually looked at contrastive loss on the original dataset somehow, in which case the explanation I gave above doesn’t apply.)
To be clear: I don’t think the results here are qualitatively more grounded than e.g. other work in the activation steering/linear probing/representation engineering space. My comment was defense of studying harmlessness in general and less so of this work in particular.
If the objection isn’t about this work vs other rep eng work, I may be confused about what you’re asking about. It feels pretty obvious that this general genre of work (studying non-cherry picked phenomena using basic linear methods) is as a whole more grounded than a lot of mech interp tends to be? And I feel like it’s pretty obvious that addressing issues with current harmlessness training, if they improve on state of the art, is “more grounded” than “we found a cool SAE feature that correlates with X and Y!”? In the same way that just doing AI control experiments is more grounded than circuit discovery on algorithmic tasks.
But I think it’s quite important for minimising misuse of models, which is also important:
To put it another way, things can be important even if they’re not existential.
I agree pretty strongly with Neel’s first point here, and I want to expand on it a bit: one of the biggest issues with interp is fooling yourself and thinking you’ve discovered something profound when in reality you’ve misinterpreted the evidence. Sure, you’ve “understood grokking”[1] or “found induction heads”, but why should anyone think that you’ve done something “real”, let alone something that will help with future dangerous AI systems? Getting rigorous results in deep learning in general is hard, and it seems empirically even harder in (mech) interp.
You can try to get around this by being extra rigorous and building from the ground up anyways. If you can present a ton of compelling evidence at every stage of resolution for your explanation, which in turn explains all of the behavior you care about (let alone a proof), then you can be pretty sure you’re not fooling yourself. (But that’s really hard, and deep learning especially has not been kind to this approach.) Or, you can try to do something hard and novel on a real system, that can’t be done with existing knowledge or techniques. If you succeed at this, then even if your specific theory is not necessarily true, you’ve at least shown that it’s real enough to produce something of value. (This is a fancy way of saying, “new theories should make novel predictions/discoveries and test them if possible”.)
From this perspective, studying refusal in LLMs is not necessarily more x-risk relevant than studying, say, why LLMs seem to hallucinate, why linear probes seem to be so good for many use cases (and where they break), or the effects of helpfulness/agency/tool-use finetuning in general. (And I suspect that poking hard at some of the weird results from the cyborgism crowd may be more relevant.) But it’s a hard topic that many people care about, and so succeeding here provides a better argument for the usefulness of their specific model internals based approach than studying something more niche.
It’s “easier” to study harmlessness than other comparably important or hard topics. Not only is there a lot of financial interest from companies, there’s a lot of supporting infrastructure already in place to study harmlessness. If you wanted to study the exact mechanism by which Gemini Ultra is e.g. so good at confabulating undergrad-level mathematical theorems, you’d immediately run into the problem that you don’t have Gemini internals access (and even if you do, the code is almost certainly not set up for easily poking around inside the model). But if you study a mechanism like refusal training, where there are open source models that are refusal trained and where datasets and prior work are plentiful, you’re able to leverage existing resources.
Many of the other things AI Labs are pushing hard on are just clear capability gains, which many people morally object to. For example, I’m sure many people would be very interested if mech interp could significantly improve pretraining, or suggest more efficient sparse architectures. But I suspect most x-risk focused people would not want to contribute to these topics.
Now, of course, there’s the standard reasons why it’s bad to study popular/trendy topics, including conflating your line of research with contingent properties of the topics (AI Alignment is just RLHF++, AI Safety is just harmlessness training), getting into a crowded field, being misled by prior work, etc. But I’m a fan of model internals researchers (esp mech interp researchers) applying their research to problems like harmlessness, even if it’s just to highlight the way in which mech interp is currently inadequate for these applications.
Also, I would be upset if people started going “the reason this work is x-risk relevant is because of preventing jailbreaks” unless they actually believed this, but this is more of a general distaste for dishonesty as opposed to jailbreaks or harmlessness training in general.
(Also, harmlessness training may be important under some catastrophic misuse scenarios, though I struggle to imagine a concrete case where end user-side jailbreak-style catastrophic misuse causes x-risk in practice, before we get more direct x-risk scenarios from e.g. people just finetuning their AIs in dangerous ways.)
For example, I think our understanding of Grokking in late 2022 turned out to be importantly incomplete.
Thanks!
I was grouping that with “the computation may require mixing together ‘natural’ concepts” in my head. After all, entropy isn’t an observable in the environment, it’s something you derive to better model the environment. But I agree that “the concept may not be one you understand” seems more central.
It’s actually worse than what you say—the first two datasets studied here have a privileged basis 45 degrees off from the standard one, which is why the SAEs seem to continue learning the same 45-degrees-off features. Unpacking this sentence a bit: it turns out that both datasets have principal components 45 degrees off from the basis the authors present as natural, and since SAEs are in a sense trying to capture the principal directions of variation in the activation space, they will also naturally use features 45 degrees off from the “natural” basis.
Consider the first example—by construction, since x_1 and x_2 are perfectly anticorrelated, as are y_1 and y_2, the data is 2-dimensional and can be represented as x = x_1 - x_2 and y = y_1 - y_2. Indeed, this is exactly what their diagram is assuming. But here, x and y have the same absolute magnitude by construction, and so the dataset lies entirely on the diagonals of the unit square, and the principal components are obviously the diagonals.
Now, why does the SAE want to learn the principal components? This is because it allows the SAE to have smaller activations on average for a given weight norm.
Consider the representation that is axis aligned, in that the SAE neurons are x_1, x_2, y_1, y_2 -- since there’s weight decay, the encoding and decoding weights want to be of the same magnitude. Let’s suppose that the encoding and decoding weights are of size s. Now, if the features are axis aligned, the total size of the activations will be 2A/s^2. But if you instead use the neurons aligned with x_1 + y_1, x_1 + y_2, x_2 + y_1, x_2 + y_2, the activations only need to be of size √2 A/s^2. This means that a non-axis-aligned representation will have lower loss. Indeed, something like this story is why we expect the L1 penalty to recover “true features” in the first place.
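(As a purely illustrative numerical sanity check of that L1 comparison, written by me, collapsing to the 2D (x, y) picture from above and ignoring the weight-scale factor s:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# First toy dataset, collapsed to 2D: |x| == |y|, so every point lies on a
# diagonal of the square (sign of each coordinate chosen independently).
a = rng.uniform(0, 1, size=n)
signs = rng.choice([-1.0, 1.0], size=(n, 2))
data = signs * a[:, None]                      # rows are (+-a, +-a)

# Axis-aligned dictionary (one atom per signed axis direction): two atoms are
# active per point, total activation |x| + |y| = 2a.
l1_axis = np.abs(data).sum(axis=1)

# Diagonal dictionary (atoms (+-1, +-1)/sqrt(2)): exactly one atom is active
# per point, with activation sqrt(2) * a.
l1_diag = np.linalg.norm(data, axis=1)

print(l1_axis.mean() / l1_diag.mean())         # ~= sqrt(2): diagonal atoms give lower L1
```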
The story for the second dataset is pretty similar to the first—when the data is uniformly distributed over a unit square, the principal directions are the diagonals of the square, not the standard basis.
Technically, the probability assigned to a hypothesis over time should be a martingale (i.e. have expected change zero); this is just a restatement of the conservation of expected evidence/law of total expectation.
The random walk model that Thomas proposes is a simple model that illustrates a more general fact. For a martingale (S_n)_{n ∈ Z^+}, the variance of S_t is equal to the sum of the variances of the individual timestep changes X_i := S_i − S_{i−1} (setting S_0 := 0): Var(S_t) = ∑_{i=1}^{t} Var(X_i). Under this frame, insofar as small updates contribute a large amount to the variance of each update X_i, the contribution of the small updates to the credences must also be large (which in turn means you need to have a lot of them in expectation[1]).
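(For completeness, a sketch of the derivation, using the standard fact that martingale increments are uncorrelated:)

```latex
% For j < i, the tower property plus the martingale property give
\mathbb{E}[X_i X_j]
  = \mathbb{E}\!\big[\,X_j \,\mathbb{E}[X_i \mid \mathcal{F}_{i-1}]\,\big]
  = 0,
% so the increments are uncorrelated and their variances add:
\operatorname{Var}(S_t)
  = \operatorname{Var}\Big(\sum_{i=1}^{t} X_i\Big)
  = \sum_{i=1}^{t} \operatorname{Var}(X_i).
```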
Note that this does not require any strong assumption besides that the distribution of likely updates is such that the small updates contribute substantially to the variance. If the structure of the problem you’re trying to address allows for enough small updates (relative to large ones) at each timestep, then it must allow for “enough” of these small updates in the sequence, in expectation.
While the specific +1/-1 random walk he picks is probably not what most realistic credences over time actually look like, playing around with it still helps give a sense of what exactly “conservation of expected evidence” might look/feel like. (In fact, in the dath ilan of Swimmer’s medical glowfics, people do use a binary random walk to illustrate how calibrated beliefs typically evolve over time.)
Now, in terms of whether it’s reasonable to model beliefs as Brownian motion (in the standard mathematical sense, not in the colloquial sense): if you suppose that there are many, many tiny independent additive updates to your credence in a hypothesis, your credence over time “should” look like Brownian motion at a large enough scale (again in the standard mathematical sense), for similar reasons as to why the sum of a bunch of independent random variables converges to a Gaussian. This doesn’t imply that your belief in practice should always look like Brownian motion, any more than the CLT implies that real world observables are always Gaussian. But again, the claim Thomas makes carries through.
I also make the following analogy in my head: Bernoulli:Gaussian ~= Simple Random Walk:Brownian Motion, which I found somewhat helpful. Things irl are rarely independent/time-invariant Bernoulli or Gaussian processes, but they’re mathematically convenient to work with, and are often ‘good enough’ for deriving qualitative insights.
Note that you need to apply something like the optional stopping theorem to go from the case of S_T for fixed T to the case of S_τ, where τ is the time you reach 0 or 1 credence and the updates stop.