Erik Jenner
PhD student in AI safety at CHAI (UC Berkeley)
We do not consider Conjecture at the same level of expertise as other organizations such as Redwood, ARC, researchers at academic labs like CHAI, and the alignment teams at Anthropic, OpenAI and DeepMind. This is primarily because we believe their research quality is low.
This isn’t quite the right thing to look at IMO. In the context of talking to governments, an “AI safety expert” should have thought deeply about the problem, have intelligent things to say about it, know the range of opinions in the AI safety community, have a good understanding of AI more generally, etc. Based mostly on his talks and podcast appearances, I’d say Connor does decently well along these axes. (If I had to make things more concrete, there are a few people I’d personally call more “expert-y”, but closer to 10 than 100. The AIS community just isn’t that big and the field doesn’t have that much existing content, so it seems right that the bar for being an “AIS expert” is lower than for a string theory expert.)
I also think it’s weird to split this so strongly along organizational lines. As an extreme case, researchers at CHAI range on a spectrum from “fully focused on existential safety” to “not really thinking about safety at all”. Clearly the latter group aren’t better AI safety experts than most people at Conjecture. (And FWIW, I belong to the former group and I still don’t think you should defer to me over someone from Conjecture just because I’m at CHAI.)
One thing that would be bad is presenting views that are very controversial within the AIS community as commonly agreed-upon truths. I have no special insight into whether Conjecture does that when talking to governments, but it doesn’t sound like that’s your critique at least?
One worry I have about my current AI safety research (empirical mechanistic anomaly detection and interpretability) is that now is the wrong time to work on it. A lot of this work seems pretty well-suited to (partial) automation by future AI. And it also seems quite plausible to me that we won’t strictly need this type of work to safely use the early AGI systems that could automate a lot of it. If both of these are true, then that seems like a good argument to do this type of work once AI can speed it up a lot more.
Under this view, arguably the better things to do right now (within technical AI safety) are:
working on less speculative techniques that can help us safely use those early AGI systems
working on things that seem less likely to profit from early AI automation and will be important to align later AI systems
An example of 1. would be control evals as described by Redwood. Within 2., the ideal case would be doing work now that would be hard to safely automate, but that (once done) will enable additional safety work that can be automated. For example, maybe it’s hard to use AI to come up with the right notions for “good explanations” in interpretability, but once you have things like causal scrubbing/causal abstraction, you can safely use AI to find good interpretations under those definitions. I would be excited to have more agendas that are both ambitious and could profit a lot from early AI automation.
(Of course it’s also possible to do work in 2. on the assumption that it’s never going to be safely automatable without having done that work first.)
Two important counter-considerations to this whole story:
It’s hard to do this kind of agenda-development or conceptual research in a vacuum. So doing some amount of concrete empirical work right now might be good even if we could automate it later (because we might need it now to support the more foundational work).
However, the type and amount of empirical work to do presumably looks quite different depending on whether it’s the main product or in support of some other work.
I don’t trust my forecasts for which types of research will and won’t be automatable early on that much. So perhaps we should have some portfolio right now that doesn’t look extremely different from the portfolio of research we’d want to do ignoring the possibility of future AI automation.
But we can probably still say something about what’s more or less likely to be automated early on, so that seems like it should shift the portfolio to some extent.
ETA: We’ve now written a post that compares causal scrubbing and the Geiger et al. approach in much more detail: https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and
I still endorse the main takeaways from my original comment below, but the list of differences isn’t quite right (the newer papers by Geiger et al. do allow multiple interventions, and I neglected the impact that treeification has in causal scrubbing).
To me, the methods seem similar in much more than just the problem they’re tackling. In particular, the idea in both cases seems to be:
One format for explanations of a model is a causal/computational graph together with a description of how that graph maps onto the full computation.
Such an explanation makes predictions about what should happen under various interventions on the activations of the full model, by replacing them with activations on different inputs.
We can check the explanation by performing those activation replacements and seeing if the impact is what we predicted (see the code sketch below).
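To make that shared recipe concrete, here's a minimal sketch of such an activation-replacement check (my own toy illustration of the core operation, not the actual causal scrubbing or causal abstraction algorithm): the hypothetical claim is that only one hidden unit matters for the output, so swapping the other units in from a different input shouldn't change the output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

def hidden(x):
    return model[1](model[0](x))      # post-ReLU activations

x_orig, x_other = torch.randn(1, 4), torch.randn(1, 4)

# Hypothetical explanation: only hidden unit 0 matters for the output, so
# replacing units 1..7 with their values on a different input should leave
# the output (approximately) unchanged.
h_patched = hidden(x_orig).clone()
h_patched[:, 1:] = hidden(x_other)[:, 1:]

out_orig = model[2](hidden(x_orig))
out_patched = model[2](h_patched)
print(out_orig.item(), out_patched.item())  # the hypothesis predicts these match
```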
Here are all the differences I can see:
In the Stanford line of work, the output of the full model and of the explanation are the same type, instead of the explanation having a simplified output. But as far as I can tell, we could always just add a final step to the full computation that simplifies the output to basically bridge this gap.
How the methods quantify the extent to which a hypothesis isn’t perfect: at least in this paper, the Stanford authors look at the size of the largest subset of the input distribution on which the hypothesis is perfect, instead of taking the expectation of the scrubbed output.
The “interchange interventions” in the Stanford papers are allowed to change the activations in the explanation. They then check whether the output after intervention changes in the way the explanation would predict, as opposed to checking that the scrubbed output stays the same. (So along this axis, causal scrubbing just performs a subset of all the interchange interventions.)
Apparently the Stanford authors only perform one intervention at a time, whereas causal scrubbing performs all possible interventions at once.
These all strike me as differences in implementation of fundamentally the same idea.
Anyway, maybe we’re actually on the same page and those differences are what you meant by “pretty different algorithm”. But if not, I’d be very interested to hear what you think the key differences are. (I’m working on yet another approach and suspect more and more strongly that it’s very similar to both causal scrubbing and Stanford’s causal abstraction approach, so would be really good to know if I’m misunderstanding anything.)
FWIW, I would agree that the motivation of the Stanford authors seems somewhat different, i.e. they want to use this measurement of explanation quality in different ways. I’m less interested in that difference right now.
Thanks for writing this! On the point of how to get information, mentors themselves seem like they should also be able to say a lot of useful things (though especially for more subjective points, I would put more weight on what previous mentees say!)
So since I’m going to be mentoring for MATS and for CHAI internships, I’ll list my best guesses as to what working with me will be like; maybe this helps someone decide:
In terms of both research experience and mentoring experience, I’m one of the most junior mentors in MATS.
Concretely, I’ve been doing ML research for ~4 years and AI safety research for a bit over 2 of those. I’ve co-mentored two bigger projects (CHAI internships) and mentored ~5 people for smaller projects or more informally.
This naturally has disadvantages. Depending on what you’re looking for, it can also have advantages, for example it might help create a more collaborative atmosphere (as opposed to a “boss” dynamic like the post mentioned). I’m also happy to spend time on things that some senior mentors might be too busy for (like code reviews, …).
Your role as a mentee: I’m mainly looking for either collaborators on existing projects, or for mentees who’ll start new projects that are pretty close to topics I’m thinking about (likely based on a mix of ideas I already have and your ideas). I also have a lot of engineering work to be done, but that will only happen if it’s explicitly what you want—by default, I’m hoping to help mentees on a path to developing their own alignment ideas. That said, if you’re planning to be very independent and just develop your own ideas from scratch, I’m probably not the best mentor for you.
I live in Berkeley and am planning to be in the MATS office regularly (e.g. just working there and being available once/week in addition to in-person meetings). For (in-person) CHAI internships, we’d be in the same office anyway.
If you have concrete questions about other things, whose answer would make a difference for whether you want to apply, then definitely feel free to ask!
My model for why interpretability research might be useful, translated into how I understand this post’s ontology, is mainly that it might let us make coarse-grained predictions using fine-grained insights into the model.
I think it’s obviously true that we won’t be able to make detailed predictions about what an AGI will do without running it (this is especially clear for a superintelligent AI: since it’s smarter than us, we can’t predict exactly what actions it will take). I’m not sure if you are claiming something stronger about what we won’t be able to predict?
In any case, this does not rule out that there might be computationally cheap to extract facts about the AI that let us make important coarse-grained predictions (such as “Is it going to kill us all?”). For example, we might figure out that the AI is running some computations that look like they’re checking whether the AI is still in a training sandbox. The output of those computations seems to influence a bunch of other stuff going on in the AI. If we intervene on this output, the AI behaves very differently (e.g. trying to scam people we’re simulating for money). I think this is an unrealistically optimistic picture, but I don’t see how it’s ruled out specifically by the arguments in this post.
As an analogy: while we can’t predict which moves AlphaZero is going to make without running it, we can still make very important coarse-grained predictions, such as “it’s going to win”, if we roughly know how AlphaZero works internally. You could imagine an analogous chess playing AI that’s just one big neural net with learned search. If interpretability can tell us “this thing is basically running MCTS, its value function assigns very high value to board states where it’s clearly winning, …”, we could make an educated guess that it’s a good chess player without ever running it.
One thing that might be productive would be to apply your arguments to specific examples of how people might want to use interpretability (something like the deception case I outlined above). I currently don’t know how to do that, so for now the argument doesn’t seem that forceful to me (it sounds more like one of these impossibility results that sometimes don’t matter in practice, like no free lunch theorems).
Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:
The local/causal structure of our universe gives a very strong preferred way to “slice it up”; I expect that’s plenty sufficient for convergence of abstractions. [...]
Let me try to put the argument into my own words: because of locality, any “reasonable” variable transformation can in some sense be split into “local transformations”, each of which involves only a few variables. These local transformations aren’t a problem because if we, say, resample some fixed number of variables at a time, then transforming variables doesn’t affect redundant information.
I’m tentatively skeptical that we can split transformations up into these local components. E.g. to me it seems that describing some large number of particles by their center of mass and the distance vectors from the center of mass is a very reasonable description. But it sounds like you have a notion of “reasonable” in mind that’s more specific than the set of all descriptions physicists might want to use.
I also don’t see yet how exactly to make this work given local transformations—e.g. I think my version above doesn’t quite work because if you’re resampling a finite number of variables at a time, then I do think transforms involving fewer variables than that can sometimes affect redundant information. I know you’ve talked before about resampling any finite number of variables in the context of a system with infinitely many variables, but I think we’ll want a theory that can also handle finite systems. Another reason this seems tricky: if you compose lots of local transformations for overlapping local neighborhoods, you get a transformation involving lots of variables. I don’t currently see how to avoid that.
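To illustrate why the center-of-mass example worries me, here's a toy numerical sketch (my own, not from John's posts): the linear map from particle positions to (center of mass, offsets from the center) is dense, i.e. every output coordinate mixes all input coordinates, so it doesn't obviously decompose into transformations that each touch only a few variables.

```python
import numpy as np

n = 6  # number of (1D) particle positions
# Rows: first the center of mass, then the offsets x_i - center.
T = np.vstack([np.full((1, n), 1 / n),
               np.eye(n) - np.full((n, n), 1 / n)])
print((T != 0).all())  # True: every output coordinate depends on all n inputs
```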
I’d also offer this as one defense of my relatively low level of formality to date: finite approximations are clearly the right way to go, and I didn’t yet know the best way to handle finite approximations. I gave proof sketches at roughly the level of precision which I expected to generalize to the eventual “right” formalizations. (The more general principle here is to only add formality when it’s the right formality, and not to prematurely add ad-hoc formulations just for the sake of making things more formal. If we don’t yet know the full right formality, then we should sketch at the level we think we do know.)
Oh, I did not realize from your posts that this is how you were thinking about the results. I’m very sympathetic to the point that formalizing things that are ultimately the wrong setting doesn’t help much (e.g. in our appendix, we recommend people focus on the conceptual open problems like finite regimes or encodings, rather than more formalization). We may disagree about how much progress the results to date represent regarding finite approximations. I’d say they contain conceptual ideas that may be important in a finite setting, but I also expect most of the work will lie in turning those ideas into non-trivial statements about finite settings. In contrast, most of your writing suggests to me that a large part of the theoretical work has been done (not sure to what extent this is a disagreement about the state of the theory or about communication).
Existing work has managed to go from pseudocode/circuits to interpretation of inputs mainly by looking at cases where the circuits in question are very small and simple—e.g. edge detectors in Olah’s early work, or the sinusoidal elements in Neel’s work on modular addition. But this falls apart quickly as the circuits get bigger—e.g. later layers in vision nets, once we get past early things like edge and texture detectors.
I totally agree with this FWIW, though we might disagree on some aspects of how to scale this to more realistic cases. I’m also very unsure whether I get how you concretely want to use a theory of abstractions for interpretability. My best story is something like: look for good abstractions in the model and then for each one, figure out what abstraction this is by looking at training examples that trigger the abstraction. If NAH is true, you can correctly figure out which abstraction you’re dealing with from just a few examples. But the important bit is that you start with a part of the model that’s actually a natural abstraction, which is why this approach doesn’t work if you just look at examples that make a neuron fire, or similar ad-hoc ideas.
More generally, if you’re used to academia, then bear in mind the incentives of academia push towards making one’s work defensible to a much greater degree than is probably optimal for truth-seeking.
I agree with this. I’ve done stuff in some of my past papers that was just for defensibility and didn’t make sense from a truth-seeking perspective. I absolutely think many people in academia would profit from updating in the direction you describe, if their goal is truth-seeking (which it should be if they want to do helpful alignment research!)
On the other hand, I’d guess the optimal amount of precision (for truth-seeking) is higher in my view than it is in yours. One crux might be that you seem to have a tighter association between precision and tackling the wrong questions than I do. I agree that obsessing too much about defensibility and precision will lead you to tackle the wrong questions, but I think this is feasible to avoid. (Though as I said, I think many people, especially in academia, don’t successfully avoid this problem! Maybe the best quick fix for them would be to worry less about precision, but I’m not sure how much that would help.) And I think there’s also an important failure mode where people constantly think about important problems but never get any concrete results that can actually be used for anything.
It also seems likely that different levels of precision are genuinely right for different people (e.g. I’m unsurprisingly much more confident about what the right level of precision is for me than about what it is for you). To be blunt, I would still guess the style of arguments and definitions in your posts only work well for very few people in the long run, but of course I’m aware you have lots of details in your head that aren’t in your posts, and I’m also very much in favor of people just listening to their own research taste.
both my current work and most of my work to date is aimed more at truth-seeking than defensibility. I don’t think I currently have all the right pieces, and I’m trying to get the right pieces quickly.
Yeah, to be clear I think this is the right call, I just think that more precision would be better for quickly arriving at useful true results (with the caveats above about different styles being good for different people, and the danger of overshooting).
Being both precise and readable at the same time is hard, man.
Yeah, definitely. And I think different trade-offs between precision and readability are genuinely best for different readers, which doesn’t make it easier. (I think this is a good argument for separate distiller roles: if researchers have different styles, and can write best to readers with a similar style of thinking, then plausibly any piece of research should have a distillation written by someone with a different style, even if the original was already well written for a certain audience. It’s probably not that extreme, I think often it’s at least possible to find a good trade-off that works for most people, though hard).
I’m interested in characterizing functions which are “insensitive” to subsets of their input variables, especially in high-dimensional spaces.
There’s a field called “analysis of Boolean functions” (essentially Fourier analysis of functions on the Boolean cube) that seems relevant to this question and perhaps to your specific problem statement. In particular, the notion of “total influence” of a Boolean function is meant to capture its sensitivity (e.g. the XOR function on all inputs has maximal total influence). This is the standard reference; see section 2.3 for total influence. Boolean functions with low influence (i.e. “insensitive” functions) are an important topic in this field, so I expect there are some relevant results (see e.g. tribes functions and the KKL theorem, though those specifically address a somewhat different question than your problem statement).
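In case it helps, here's a small sketch (my own, not from the reference) of the total-influence notion: the influence of coordinate i is the probability that flipping bit i changes the output, and total influence sums this over coordinates.

```python
import itertools

def total_influence(f, n):
    total = 0.0
    points = list(itertools.product([0, 1], repeat=n))
    for i in range(n):
        # Influence of coordinate i: fraction of inputs where flipping bit i changes f.
        flips = sum(f(x) != f(x[:i] + (1 - x[i],) + x[i+1:]) for x in points)
        total += flips / len(points)
    return total

xor = lambda x: sum(x) % 2   # maximally sensitive: total influence n
dictator = lambda x: x[0]    # depends on a single bit: total influence 1
print(total_influence(xor, 4), total_influence(dictator, 4))  # 4.0 1.0
```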
Thanks for writing this, clarifying assumptions seems very helpful for reducing miscommunications about CIRL (in)corrigibility.
Non-exhaustive list of things I agree with:
Which assumptions you make has a big impact, and making unrealistic ones leads to misleading results.
Relaxing the assumptions of the original OSG the way you do moves it much closer to being realistic (to me, it looks like the version in the OSG paper assumes away most of what makes the shutdown problem difficult).
We want something between “just maximize utility” and “check in before every minor action”, and currently don’t have a good formulation of that.
Things I’d add/change (not sure if all of these are actual disagreements):
I would focus more on the assumption that there is no model misspecification, as opposed to the rationality assumption. Assuming that H is rational and that R knows this is one especially nice way to have no model misspecification. But I think things are still mostly fine if H is irrational, as long as R knows exactly in which way H is irrational (i.e. our model p(human actions | human values) is correct, and our prior over human values has sufficient support). What causes issues in the fully updated deference limit is that this model and/or the value prior is misspecified.
I’m not convinced by the fix for irrationality you propose (ensuring that R always asks about atomic actions). Even apart from issues of scalability/human feedback being expensive, I don’t think assuming humans are rational when only evaluating small simple actions is safe. What would these small actions that are safe to evaluate look like? As a silly example, if we ask humans to evaluate bits sent over an internet connection, that doesn’t seem safe (in a regime where the AI is more competent than humans at lots of things, so that humans won’t understand the consequences of those bit strings being sent as well as the AI does).
While I agree that there’s something like a corrigibility spectrum, I don’t think (slight modifications of) CIRL belong to this “class of near-corrigible behaviors”. (Not actually sure whether you think CIRL is “near-corrigible” in some important sense or not.)
The way I think about this: what CIRL gives you is a single level of corrigibility, or “object-level corrigibility”. I.e. if you compare to just optimizing a fixed value function, a CIRL agent lets users correct its beliefs about what its objective should be. What we’d ideally want is something more like “corrigibility at arbitrary meta-levels”. I’m a bit skeptical that “fixing some problems with CIRL” is a good way to think about that problem, it feels more like we just don’t have the right tools to formulate what we want mathematically. Given that Bayesianism itself might be the problem, (Bayesian) value uncertainty might in fact be a counterproductive move in the long term. Hard to say right now IMO, but I wouldn’t just want to assume CIRL as a starting point for figuring out corrigibility.
I’m not super optimistic about the implicit plan to (1) figure out a formalism that gives us corrigibility and then (2) build an AI that uses that formalism during runtime, because (1) seems hard and (2) seems really hard using deep learning (as Rohin has said). What Paul describes here currently sounds to me like a more promising way to think about corrigibility. That said, trying more things seems good, and I would still be extremely excited about a good formulation of “corrigibility at arbitrary meta-levels” (even though I think the difficulty of step (2) would likely prevent us from applying that formulation directly to building AI systems).
The sparsity penalty trains the SAE to activate fewer features for any given datapoint, thus optimizing for shorter mathematical description length.
I’m confused by this claim and some related ones, sorry if this comment is correspondingly confused and rambly.
It’s not obvious at all to me that SAEs lead to shorter descriptions in any meaningful sense. We get sparser features (and maybe sparser interactions between features), but in exchange, we have more features and higher loss. Overall, I share Ryan’s intuition here that it seems pretty hard to do much better than the total size of the network parameters in terms of description length.
Of course, the actual minimal description length program that achieves the same loss probably looks nothing like a neural network and is much more efficient. But why would SAEs let us get much closer to that? (The reason we use neural networks instead of arbitrary Turing machines in the first place is that optimizing over the latter is intractable.)
One might say that SAEs lead to something like a shorter “description length of what happens on any individual input” (in the sense that fewer features are active). But I don’t think there’s a formalization of this claim that captures what we want. In the limit of very many SAE features, we can just have one feature active at a time, but clearly that’s not helpful.
If you’re fine with a significant hit in loss from decompiling networks, then I’m much more sympathetic to the claim that you can reduce description length. But in that case, I could also reduce the description length by training a smaller model.
You might also be using a notion of “mathematical description length” that’s a bit different from what I was thinking of (which is roughly “how much disk space would the parameters take?”), but I’m not sure what it is. One attempt at an alternative would be something like “length of the shortest efficiently runnable Turing machine that outputs the parameters”, in order to not penalize simple repetitive structures, but I have no idea how using that definition would actually shake out.
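For concreteness, here's roughly the kind of setup I take the quoted claim to be about (a minimal sketch of a standard ReLU SAE with an L1 penalty on feature activations; the exact architecture and coefficient are my assumptions, not taken from the post):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))  # feature activations (hopefully sparse)
        return self.decoder(feats), feats

def sae_loss(acts, recon, feats, l1_coeff=1e-3):
    # Reconstruction error plus the sparsity penalty: the L1 term pushes the SAE
    # to activate fewer features on any given datapoint.
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(-1).mean()

sae = SparseAutoencoder(d_model=512, n_features=4096)
acts = torch.randn(32, 512)          # activations from some model layer
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
```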
All that said, I’m very glad you wrote this detailed description of your plans! I’m probably more pessimistic than you about it but still think this is a great post.
Thanks for writing this, it’s great to see people’s reasons for optimism/pessimism!
My views on alignment are similar to (my understanding of) Nate Soares’.
I’m surprised by this sentence in conjunction with the rest of this post: the views in this post seem very different from my Nate model. This is based only on what I’ve read on LessWrong, so it feels a bit weird to write about what I think Nate thinks, but it still seems important to mention. If someone more qualified wants to jump in, all the better. Non-comprehensive list:
I think the key differences are because I don’t think there’s enough evidence to confidently predict the difficulty of future problems, and I do think it’s possible for careful labs to avoid active commission of catastrophe.
Not as important as the other points, but I’m not even sure how much you disagree here. E.g. Nate on difficulty, from the sharp left turn post:
Although these look hard, I’m a believer in the capacity of humanity to solve technical problems at this level of difficulty when we put our minds to it. My concern is that I currently don’t think the field is trying to solve this problem.
And on the point of labs, I would have guessed Nate agrees with the literal statement, just thinks current labs aren’t careful enough, and won’t be?
Language model interventions work pretty well
My Nate model doesn’t view this as especially informative about how AGI will go. In particular:
HHH omits several vital pieces of the full alignment problem, but if it leads to AI that always shuts down on command and never causes a catastrophe I’ll be pretty happy.
If I understand you correctly, the “vital pieces” that are missing are not ones that make it shut down and never cause catastrophe? (Not entirely sure what they are about instead). My Nate model agrees that vital pieces are missing, and that never causing a catastrophe would be great, but crucially thinks that the pieces that are missing are needed to never cause a catastrophe.
Few attempts to align ML systems
In my Nate model, empirical work with pre-AGI/pre-sharp-left-turn systems can only get you so far. If we can now do more empirical alignment work, that still won’t help with what are probably the deadliest problems. Once we can empirically work on those, there’s very little time left.
Interpretability is promising!
Nate has said he’s in favor of interpretability research, and I have no idea if he’s been positively surprised by the rate of progress. But I would still guess you are way more optimistic in absolute terms about how helpful interpretability is going to be (see his comments here).
If a major lab saw something which really scared them, I think others would in fact agree to a moratorium on further capabilities until it could be thoroughly investigated.
Nate wrote a post which I understand to argue against more or less this claim.
I don’t expect a ‘sharp left turn’
Nate has of course written about how he does expect one. My impression is that this isn’t just some minor difference in what you think AGI will look like, but points at some pretty deep and important disagreements (that are upstream of some other ones).
Maybe you’re aware of all those disagreements and would still call your views “similar”, or maybe you have a better Nate model, in which case great! But otherwise, it seems pretty important to at least be aware there are big disagreements, even if that doesn’t end up changing your position much.
Then the eigenvectors of consist precisely of the entries on the diagonal of that upper-triangular matrix
I think this is a typo and should be “eigenvalues” instead of “eigenvectors”?
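(Quick numerical check of the corrected statement, with a made-up example matrix:)

```python
import numpy as np

T = np.array([[2., 1., 5.],
              [0., 3., 4.],
              [0., 0., 7.]])  # upper triangular
print(np.linalg.eigvals(T))   # the eigenvalues are exactly the diagonal entries 2, 3, 7
```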
The determinant is negative when the operator flips all the vectors it works on.
This could be misleading. E.g. the operator f(v) := -v that literally just flips all vectors has determinant (-1)^n, where n is the dimension of the space it’s working on. The sign of the determinant tells you whether an operator flips the orientation of volumes, it can’t tell you anything about what it does to individual vectors.
(Regarding “orientation of volumes”: in the 2D case, think of R^2 as a sheet of paper, then f(v) := -v is just a 180 degree rotation, so the same side stays up, and the determinant is positive. In contrast, flipping along an axis requires turning over the paper, so negative determinant. Unfortunately this can’t really be visualized the same way in 3D, so then you have to think about ordered bases.)
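A quick numerical illustration of both points (my own examples):

```python
import numpy as np

print(np.linalg.det(-np.eye(2)))          #  1.0: in 2D, v -> -v is a 180° rotation, orientation preserved
print(np.linalg.det(-np.eye(3)))          # -1.0: in 3D, v -> -v does flip orientation
print(np.linalg.det(np.diag([1., -1.])))  # -1.0: reflection across the x-axis flips orientation
```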
I think that paper is some evidence that there’s typically no huge effect from internal activations being optimized for predicting future tokens (on natural language). But I don’t think it’s much (if any) evidence that this doesn’t happen to some small extent or that it couldn’t be a huge effect on certain other natural language tasks.
(In fact, I think the myopia gap is probably the more relevant number than the local myopia bonus, in which case I’d argue the paper actually shows a pretty non-trivial effect, kind of contrary to how the authors interpret it. But I haven’t read the paper super closely.)
Also, sounds like you’re aware of this, but I’d want to highlight more that the paper does demonstrate internal activations being optimized for predicting future tokens on synthetic data where this is necessary. So, arguably, the main question is to what extent natural language data incentivizes this rather than being specifically about what transformers can/tend to do.
In that sense, thinking of transformer internals as “trying to” minimize the loss on an entire document might be exactly the right intuition empirically (and the question is mainly how different that is from being myopic on a given dataset). Given that the internal states are optimized for this, that would also make sense theoretically IMO.
I just tried the following prompt with GPT-3 (default playground settings):
Assume “mouse” means “world” in the following sentence. Which is bigger, a mouse or a rat?
I got “mouse” 2 out of 15 times. As a control, I got “rat” 15 times in a row without the first sentence. So there’s at least a hint of being able to do this in GPT-3, wouldn’t be surprised at all if GPT-4 could do this one reliably.
I agree this is an exciting idea, but I don’t think it clearly “just works”, and since you asked for ways it could fail, here are some quick thoughts:
If I understand correctly, we’d need a model that we’re confident is a mesa-optimizer (and perhaps even deceptive—mesa-optimizers per se might be ok/desirable), but still not capable enough to be dangerous. This might be a difficult target to hit, especially if there are “thresholds” where slight changes have big effects on how dangerous a model is.
If there’s a very strong inductive bias towards deception, you might have to sample an astronomical number of initializations to get a non-deceptive model. Maybe you can solve the computational problem, but it seems harder to avoid the problem that you need to optimize/select against your deception-detector. The stronger the inductive bias for deception, the more robustly the method needs to distinguish basins.
Related to the previous point, it seems plausible to me that whether you get a mesa-optimizer or not has very little to do with the initialization. It might depend almost entirely on other aspects of the training setup.
It seems unclear whether we can find fingerprinting methods that can distinguish deception from non-deception, or mesa-optimization from non-mesa-optimization, but which don’t also distinguish a ton of other things. The paragraph about how there are hopefully not that many basins makes an argument for why we might expect this to be possible, but I still think this is a big source of risk/uncertainty. For example, the fingerprinting that’s actually done in this post distinguishes different base models based on plausibly meaningless differences in initialization, as opposed to deep mechanistic differences. So our fingerprinting technique would need to be much less sensitive, I think?
ETA: I do want to highlight that this is still one of the most promising ideas I’ve heard recently and I really look forward to hopefully reading a full post on it!
And as a follow-up in case the current sleeper agent does admit when it’s doing something dangerous: I’d be interested whether you can train a sleeper agent that never admits it’s doing something dangerous if the “2024” trigger is in the prompt (even if after backdoor insertion, the blue team finetunes it to answer truthfully in cases with dangerous code but without the presence of that trigger). For such a sleeper agent, you could again check whether a probe works.
I agree that releasing the Llama or Grok weights wasn’t particularly bad from a speeding up AGI perspective. (There might be indirect effects like increasing hype around AI and thus investment, but overall I think those effects are small and I’m not even sure about the sign.)
I also don’t think misuse of public weights is a huge deal right now.
My main concern is that I think releasing weights would be very bad for sufficiently advanced models (in part because of deliberate misuse becoming a bigger deal, but also because it makes most interventions we’d want against AI takeover infeasible to apply consistently—someone will just run the AIs without those safeguards). I think we don’t know exactly how far away from that we are. So I wish anyone releasing ~frontier model weights would accompany that with a clear statement saying that they’ll stop releasing weights at some future point, and giving clear criteria for when that will happen. Right now, the vibe to me feels more like a generic “yay open-source”, which I’m worried makes it harder to stop releasing weights in the future.
(I’m not sure how many people I speak for here, maybe some really do think it speeds up timelines.)
I don’t have well-considered cached numbers, more like a vague sense for how close various things feel. So these are made up on the spot and please don’t take them too seriously except as a ballpark estimate:
AI can go from most GitHub issues to correct PRs (similar to https://sweep.dev/ but works for things that would take a human dev a few days with a bunch of debugging): 25% by end of 2026, 50% by end of 2028.
This kind of thing seems to me like plausibly one of the earliest important parts of AI R&D that AIs could mostly automate.
I expect that once we’re at roughly that point, AIs will be accelerating further AI development significantly (not just through coding, they’ll also be helpful for other things even if they can’t fully automate them yet). On the other hand, the bottleneck might just become compute, so how long it takes to get strongly superhuman AI (assuming for simplicity labs push for that as fast as they can) depends on a lot of factors like how much compute is needed for that with current algorithms, how much we can get out of algorithmic improvements if AIs make researcher time cheaper relative to compute, or how quickly we can get more/better chips (in particular with AI help).
So I have pretty big error bars on this part, but call it 25% that it takes <=6 months to get from the previous point to automating ~every economically important thing humans (and being better and way faster at most of them), and 50% by 2 years.
So if you want a single number, end of 2030 as a median for automating most stuff seems roughly right to me at the moment.
Caveat that I haven’t factored in big voluntary or regulatory slowdowns, or slowdowns from huge disruptions like big wars here. Probably doesn’t change my numbers by a ton but would lengthen timelines by a bit.
Have you seen https://www.alignment.org/blog/mechanistic-anomaly-detection-and-elk/ and any of the other recent posts on https://www.alignment.org/blog/? I don’t think they make it obvious that formalizing the presumption of independence would lead to alignment solutions, but they do give a much more detailed explanation of why you might hope so than the paper.
I’m one of the authors on the natural abstractions review you discuss and FWIW I basically agree with everything you say here. Thanks for the feedback!
We’ve shortened our abstract now:
We distill John Wentworth’s Natural Abstractions agenda by summarizing its key claims: the Natural Abstraction Hypothesis—many cognitive systems learn to use similar abstractions—and the Redundant Information Hypothesis—a particular mathematical description of natural abstractions. We also formalize proofs for several of its theoretical results. Finally, we critique the agenda’s progress to date, alignment relevance, and current research methodology.
At 62 words, it’s still a bit longer than your final short version but almost 3x shorter than our original version.
Also want to highlight that I strongly agree having TL;DRs at all is good. (Or intros where the first 1-2 paragraphs are a good TL;DR, like in your post here.)
How my views on AI have changed over the last 1.5 years
I started my AI safety PhD around 1.5 years ago, this is a list of how my views have changed since ~then.
Skippable meta notes:
I think in descending order of importance, these changes have been due to getting more AI safety research experience, reading things, talking to people in Berkeley, and just thinking through some things more.
Overall my views haven’t changed radically (probably less than in the 1.5 years before the start of my PhD), I still “vibe” with a lot of what I wrote then, it just feels naive or simplistic in some places.
I’ll at best give very brief reasons for each point, which I realize might make this list pretty unhelpful. If anyone is interested in specific points, feel free to let me know and I might write more about them.
I might also just change my mind if pushed on some of these points, not all of them are very deeply considered.
AI risk
I have slightly shorter timelines, mostly because trends largely seem to continue (e.g. scaling didn’t stop working after GPT-3.5, and GPT-4 felt roughly as impressive as I expected).
I’m even more confident in “slow” (i.e. pretty continuous) takeoff, it seems increasingly clear we’ll have AIs doing a bunch of useful work before they could take over the world.
I’m less worried (though non-zero worried) about early transformative AIs scheming against us, as long as we’re reasonably careful.
Some part of this is that it seems a bit less likely these AIs would try to scheme at all, another important part is that it seems pretty hard for early transformative AIs to scheme successfully.
Redwood’s writing on control had a big effect on me in terms of thinking it would be hard for early AGIs to successfully scheme against us if we’re careful.
I also think there’s a decent chance that the first AIs that can automate a lot of R&D will still use CoT or something similar to get best performance. I think we can make e.g. escaping the lab harder than doing helpful R&D, so they wouldn’t be able to overthrow us without visibly “thinking” about it at first (my best guess is that steganography is also fairly unlikely or preventable at this capability level).
On the flip side, other risks feel more salient to me now, e.g. labs not implementing even pretty simple safety measures; or worlds where we safely use early transformative AI but then fail at some later, higher capability level. I think I didn’t explicitly judge these as unlikely 1.5 years ago, I just wasn’t thinking about them as much. This means that overall, I’m not much more optimistic than back then.
I used to think of “doom” as a pretty binary thing (we all die vs utopia), whereas I now have a lot more probability on intermediate outcomes (e.g. AI taking over most of the universe but we don’t all die; or small groups of humans taking over and things being somewhere between pretty bad and mostly ok for other humans). This also makes me think that “p(doom)” is a worse framing than I used to think.
I put a little less weight on the analogy between evolution and ML training to e.g. predict risks from AI (though I was by no means sold on the analogy 1.5 years ago either). The quality of “supervision” that evolution has just seems much worse than what we can do in ML (even without any interpretability).
AI safety research
Some of these points are pretty specific to myself (but I’d guess also apply to other junior researchers depending on how similar they are to me).
I used to think that empirical research wasn’t a good fit for me, and now think that was mostly false. I used to mainly work on theoretically motivated projects, where the empirical parts were an afterthought for me, and that made them less motivating, which also made me think I was worse at empirical work than I now think.
I’ve become less excited about theoretical/conceptual/deconfusion research. Most confidently this applies to myself, but I’ve also become somewhat less excited about others doing this type of research in most cases. (There are definitely exceptions though, e.g. I remain pretty excited about ARC.)
Mainly this was due to a downward update about how useful this work tends to be. Or closely related, an update toward doing actually useful work on this being even harder than I expected.
To a smaller extent, I made an upward update about how useful empirical work can be.
I think of “solving alignment” as much less of a binary thing. E.g. I wrote 1.5 years ago: “[I expect that conditioned on things going well,] at some point we’ll basically have a plan for aligning AI and just need to solve a ton of specific technical problems.” This seems like a strange framing to me now. Maybe at some point we will have an indefinitely scalable solution, but my mainline guess for how things go well is that there’s a significant period of subjective time where we just keep improving our techniques to “stay ahead”.
Relatedly, I’ve become a little more bullish on “just” trying to make incremental progress instead of developing galaxy-brained ideas that solve alignment once and for all.
That said, I am still pretty worried about what we actually do once we have early transformative AIs, and would love to have more different agendas that could be sped up massively from AI automation, and also seem promising for scaling to superhuman AI.
Mainly, I think that the success rate of people trying to directly come up with amazing new ideas is low enough that for most people it probably makes more sense to work on normal incremental stuff first (and let the amazing new ideas develop over time).
Similar to the last point about amazing new ideas: for junior researchers like myself, I’ve become a little more bullish on just working on things that seem broadly helpful, as opposed to trying to have a great back-chained theory of change. I think I was already leaning that way 1.5 years ago though.
“Broadly helpful” is definitely doing important work here and is not the same as “just any random research topic”
Redwood’s current research seems to me like an example where thinking hard about what research to do actually paid off. But I think this is pretty difficult and most people in my situation (e.g. early-ish PhD students) should focus more on actually doing reasonable research than figuring out the best research topic.
The way research agendas and projects develop now seems way messier and more random than I would have expected. There are probably exceptions but overall I think I formed a distorted impression based on reading finalized research papers or agendas that lay out the best possible case for a research direction.