PhD student in AI safety at CHAI (UC Berkeley)
Erik Jenner
I agree that releasing the Llama or Grok weights wasn’t particularly bad from the perspective of speeding up AGI. (There might be indirect effects like increasing hype around AI and thus investment, but overall I think those effects are small and I’m not even sure about the sign.)
I also don’t think misuse of public weights is a huge deal right now.
My main concern is that I think releasing weights would be very bad for sufficiently advanced models (in part because of deliberate misuse becoming a bigger deal, but also because it makes most interventions we’d want against AI takeover infeasible to apply consistently—someone will just run the AIs without those safeguards). I think we don’t know exactly how far away from that we are. So I wish anyone releasing ~frontier model weights would accompany that with a clear statement saying that they’ll stop releasing weights at some future point, and giving clear criteria for when that will happen. Right now, the vibe to me feels more like a generic “yay open-source”, which I’m worried makes it harder to stop releasing weights in the future.
(I’m not sure how many people I speak for here, maybe some really do think it speeds up timelines.)
Yeah, agreed. Though I think “the type and amount of empirical work to do presumably looks quite different depending on whether it’s the main product or in support of some other work” applies to that as well.
One worry I have about my current AI safety research (empirical mechanistic anomaly detection and interpretability) is that now is the wrong time to work on it. A lot of this work seems pretty well-suited to (partial) automation by future AI. And it also seems quite plausible to me that we won’t strictly need this type of work to safely use the early AGI systems that could automate a lot of it. If both of these are true, then that seems like a good argument to do this type of work once AI can speed it up a lot more.
Under this view, arguably the better things to do right now (within technical AI safety) are:
1. working on less speculative techniques that can help us safely use those early AGI systems
2. working on things that seem less likely to profit from early AI automation and will be important to align later AI systems
An example of 1. would be control evals as described by Redwood. Within 2., the ideal case would be doing work now that would be hard to safely automate, but that (once done) will enable additional safety work that can be automated. For example, maybe it’s hard to use AI to come up with the right notions for “good explanations” in interpretability, but once you have things like causal scrubbing/causal abstraction, you can safely use AI to find good interpretations under those definitions. I would be excited to have more agendas that are both ambitious and could profit a lot from early AI automation.
(Of course it’s also possible to do work in 2. on the assumption that it’s never going to be safely automatable without having done that work first.)
Two important counter-considerations to this whole story:
It’s hard to do this kind of agenda-development or conceptual research in a vacuum. So doing some amount of concrete empirical work right now might be good even if we could automate it later (because we might need it now to support the more foundational work).
However, the type and amount of empirical work to do presumably looks quite different depending on whether it’s the main product or in support of some other work.
I don’t trust my forecasts for which types of research will and won’t be automatable early on that much. So perhaps we should have some portfolio right now that doesn’t look extremely different from the portfolio of research we’d want to do ignoring the possibility of future AI automation.
But we can probably still say something about what’s more or less likely to be automated early on, so that seems like it should shift the portfolio to some extent.
Oh I see, I indeed misunderstood your point then.
For me personally, an important contributor to day-to-day motivation is just finding research intrinsically fun—impact on the future is more something I have to consciously consider when making high-level plans. I think moving towards more concrete and empirical work did have benefits on personal enjoyment just because making clear progress is fun to me independently of whether it’s going to be really important (though I think there’ve also been some downsides to enjoyment because I do quite like thinking about theory and “big ideas” compared to some of the schlep involved in experiments).
I don’t think my views overall make my work more enjoyable than at the start of my PhD. Part of this is the day-to-day motivation being sort of detached from that anyway like I mentioned. But also, from what I recall now (and this matches the vibe of some things I privately wrote then), my attitude 1.5 years ago was closer to that expressed in We choose to align AI than feeling really pessimistic.
(I feel like I might still not represent what you’re saying quite right, but hopefully this is getting closer.)
ETA: To be clear, I do think if I had significantly more doomy views than now or 1.5 years ago, at some point that would affect how rewarding my work feels. (And I think that’s a good thing to point out, though of course not a sufficient argument for such views in its own right.)
I’d definitely agree the updates are towards the views of certain other people (roughly some mix of views that tend to be common in academia, and views I got from Paul Christiano, Redwood and other people in a similar cluster). Just based on that observation, it’s kind of hard to disentangle updating towards those views just because they have convincing arguments behind them, vs updating towards them purely based on exposure or because of a subconscious desire to fit in socially.
I definitely think there are good reasons for the updates I listed (e.g. specific arguments I think are good, new empirical data, or things I’ve personally observed working well or not working well for me when doing research). That said, it does seem likely there’s also some influence from just being exposed to some views more than others (and then trying to fit in with views I’m exposed to more, or just being more familiar with arguments for those views than alternative ones).
If I was really carefully building an all-things-considered best guess on some question, I’d probably try to take this into account somehow (though I don’t see a principled way of doing that). Most of the time I’m not trying to form the best possible all-things-considered view anyway (and focus more on understanding specific mechanisms instead etc.); in those cases it feels more important to e.g. be aware of other views and to not trust vague intuitions if I can’t explain where they’re coming from. I feel like I’m doing a reasonable job at those things, but it’s naturally hard to be sure from the inside.
ETA: I should also say that from my current perspective, some of my previous views seem like they were basically just me copying views from my “ingroup” and not questioning them enough. As one example, the “we all die vs utopia” dichotomy for possible outcomes felt to me like the commonly accepted wisdom and I don’t recall thinking about it particularly hard. I was very surprised when I first read a comment by Paul where he argued against the claim that unaligned AI would kill us all with overwhelming probability. Most recently, I’ve definitely been more exposed to the view that there’s a spectrum of potential outcomes. So maybe if I talked to people a lot who think an unaligned AI would definitely kill us all, I’d update back towards that a bit. But overall, my current epistemic state where I’ve at least been exposed to both views and some arguments on both sides seems way better than the previous one where I’d just never really considered the alternative.
Thanks, I think I should distinguish more carefully between automating AI (safety) R&D within labs and automating the entire economy. (Johannes also asked about ability vs actual automation here but somehow your comment made it click).
It seems much more likely to me that AI R&D would actually be automated than that a bunch of random unrelated things would all actually be automated. I’d agree that if only AI R&D actually got automated, that would make takeoff pretty discontinuous in many ways. Though there are also some consequences of fast vs slow takeoff that seem to hinge more on AI or AI safety research rather than the economy as a whole.
For AI R&D, actual automation seems pretty likely to me (though I’m making a lot of this up on the spot):
It’s going to be on the easier side of things to actually automate, in part because it doesn’t require aggressive external deployment, but also because there’s no regulation (unlike for automating strictly licensed professions).
It’s the thing AI labs will have the biggest reason to automate (and would be good at automating themselves).
Training runs get more and more expensive but I’d expect the schlep needed to actually use systems to remain more constant, and at some point it’d just be worth doing the schlep to actually use your AIs a lot (and thus be able to try way more ideas, get algorithmic improvements, and then make the giant training runs a bit more efficient).
There might also be additional reasons to get as much out of your current AI as you can instead of scaling more, namely safety concerns, regulation making scaling hard, or scaling might stop working as well. These feel less cruxy to me but combined move me a little bit.
I think these arguments mostly apply to whatever else AI labs might want to do themselves but I’m pretty unsure what that is. Like, if they have AI that could make hundreds of billions to trillions of dollars by automating a bunch of jobs, would they go for that? Or just ignore it in favor of scaling more? I don’t know, and this question is pretty cruxy for me regarding how much the economy as a whole is impacted.
It does seem to me like right now labs are spending some non-trivial effort on products, presumably for some mix of making money and getting investments, and both of those things seem like they’d still be important in the future. But maybe the case for investments will just be really obvious at some point even without further products. And overall I assume you’d have a better sense than me regarding what AI labs will want to do in the future.
I’m roughly imagining automating most things a remote human expert could do within a few days. If we’re talking about doing things autonomously that would take humans several months, I’m becoming quite a bit more scared. Though the capability profile might also be sufficiently non-human that this kind of metric doesn’t work great.
Practically speaking, I could imagine getting a 10x or more speedup on a lot of ML research, but wouldn’t be surprised if there are some specific types of research that only get pretty small speedups (maybe 2x), especially anything that involves a lot of thinking and little coding/running experiments. I’m also not sure how much of a bottleneck waiting for experiments to finish or just total available compute is for frontier ML research, I might be anchoring too much on my own type of research (where just automating coding and running stuff would give me 10x pretty easily I think).
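A rough way to make the bottleneck point concrete is Amdahl's law: if only part of a research workflow gets faster, the part that stays slow caps the overall speedup. The sketch below is an illustration under that assumption, not something taken from the comment itself. The names `overall_speedup`, `automatable_fraction`, and `automation_speedup` are hypothetical, and the fractions are made-up inputs rather than estimates.

```python
def overall_speedup(automatable_fraction: float, automation_speedup: float) -> float:
    """Amdahl's law: total speedup when only a fraction of the work gets faster.

    automatable_fraction: share of current researcher time spent on work the AI
        can take over (e.g. coding and running experiments), between 0 and 1.
    automation_speedup: how many times faster that share becomes (> 0).
    """
    if not 0.0 <= automatable_fraction <= 1.0:
        raise ValueError("automatable_fraction must be between 0 and 1")
    if automation_speedup <= 0:
        raise ValueError("automation_speedup must be positive")
    # Time that is not automated stays the same; the rest shrinks.
    remaining_time = (1.0 - automatable_fraction) + automatable_fraction / automation_speedup
    return 1.0 / remaining_time


if __name__ == "__main__":
    # Illustrative inputs only: thinking-heavy vs coding-heavy research.
    for fraction in (0.5, 0.9, 0.95):
        speedup = overall_speedup(fraction, automation_speedup=100.0)
        print(f"{fraction:.0%} automatable at 100x -> {speedup:.1f}x overall")
```

Under this toy model, research that is half thinking gets just under 2x even if the coding half becomes 100x faster, while a 10x overall speedup needs at least 90% of the time to be automatable.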
I think there’s a good chance that AIs more advanced than this (e.g. being able to automate months of human work at a time) still wouldn’t easily be able to take over the world (e.g. Redwood-style control techniques would still be applicable). But that’s starting to rely much more on us being very careful around how we use them.
Transformative: Which of these do you agree with and when do you think this might happen?
For some timelines see my other comment; they aren’t specifically about the definitions you list here but my error bars on timelines are huge anyway so I don’t think I’ll try to write down separate ones for different definitions.
Compared to definitions 2. and 3., I might be more bullish on AIs having pretty big effects even if they can “only” automate tasks that would take human experts a few days (without intermediate human feedback). A key uncertainty I have though is how much of a bottleneck human supervision time and quality would be in this case. E.g. could many of the developers who’re currently writing a lot of code just transition to reviewing code and giving high-level instructions full-time, or would there just be a senior management bottleneck and you can’t actually use the AIs all that effectively? My very rough guess is you can pretty easily get a 10x speedup in software engineering, maybe more. And maybe something similar in ML research though compute might be an additional important bottleneck there (including walltime until experiments finish). If it’s “only” 10x, then arguably that’s just mildly transformative, but if it happens across a lot of domains at once it’s still a huge deal.
I think whether robotics are really good or not matters, but I don’t think it’s crucial (e.g. I’d be happy to call definition 1. “transformative”).
The combination of 5a and 5b obviously seems important (since it determines whether you can finance ever bigger training runs). But not sure how to use this as a definition of “transformative”; right now 5a is clearly already met, and on long enough time scales, 5b also seems easy to meet right now (OpenAI might even already have broken even on GPT-4, not sure off the top of my head).
Also, how much compute do you think an AGI or superintelligence will require at inference time initially? What is a reasonable level of optimization? Do you agree that many doom scenarios require it to be possible for an AGI to compress to fit on very small host PCs? Is this plausible? (eg can a single 2070 8gb host a model with general human intelligence at human scale speeds and vision processing and robotics proprioception and control...?)
I don’t see why you need to run AGI on a single 2070 for many doom scenarios. I do agree that if AGI can only run on a specific giant data center, that makes many forms of doom less likely. But in the current paradigm, training compute is roughly the square of inference compute, so as models are scaled, I think inference should become cheaper relative to training. (And even now, SOTA models could be run on relatively modest compute clusters, though maybe not consumer hardware.)
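The “roughly the square” relationship can be sketched with standard back-of-the-envelope approximations. These are assumptions here, not something stated in the comment: training costs about 6·N·D FLOPs for N parameters and D training tokens, compute-optimal training uses about 20 tokens per parameter (Chinchilla-style), and inference costs about 2·N FLOPs per generated token. All names in the snippet are hypothetical.

```python
# Assumption: Chinchilla-style compute-optimal ratio of training tokens to parameters.
TOKENS_PER_PARAM = 20.0


def training_flops(n_params: float) -> float:
    """Approximate training cost, C ~ 6 * N * D with D ~ 20 * N."""
    n_tokens = TOKENS_PER_PARAM * n_params
    return 6.0 * n_params * n_tokens


def inference_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass cost per generated token, ~ 2 * N."""
    return 2.0 * n_params


if __name__ == "__main__":
    for n_params in (1e9, 1e10, 1e11):
        train = training_flops(n_params)
        infer = inference_flops_per_token(n_params)
        # The ratio grows linearly with model size: train / infer = 60 * N.
        print(f"N={n_params:.0e}: train={train:.2e} FLOPs, "
              f"inference={infer:.2e} FLOPs/token, ratio={train / infer:.2e}")
```

Under these assumptions, a 10x larger model costs 10x more per token to run but 100x more to train, so inference becomes cheaper relative to training as models are scaled.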
In terms of the absolute level of inference compute needed, I could see a single 2070 being enough in the limit of optimal algorithms, but naturally I’d expect we’ll first have AGI that can automate a lot of things if run with way more compute than that, and then I expect it would take a while to get it down this much. Though even if we’re asking whether AGI can run on consumer-level hardware, a single 2070 seems pretty low (e.g. seems like a 4090 already has 5.5x as many FLOP/s as a 2070, and presumably we’ll have more in the future).
with general human intelligence at human scale speeds and vision processing and robotics proprioception and control...
Like I mentioned above, I don’t think robotics are absolutely crucial, and especially if you’re specifically optimizing for running under heavy resource constraints, you might want to just not bother with that.
Good question, I think I was mostly visualizing ability to automate while writing this. Though for software development specifically I expect the gap to be pretty small (lower regulatory hurdles than elsewhere, has a lot of relevance to the people who’d do the automation, already starting to happen right now).
In general I’d expect inertia to become less of a factor as the benefits of AI become bigger and more obvious—at least for important applications where AI could provide many many billions of dollars of economic value, I’d guess it won’t take too long for someone to reap those benefits.
My best guess is regulations won’t slow this down too much except in a few domains where there are already existing regulations (like driving cars or medical things). But pretty unsure about that.
I also think it depends on whether by “ability to automate” you mean “this base model could do it with exactly the right scaffolding or finetuning” vs “we actually know how to do it and it’s just a question of using it at scale”. For that part, I was thinking more about the latter.
I don’t have well-considered cached numbers, more like a vague sense for how close various things feel. So these are made up on the spot and please don’t take them too seriously except as a ballpark estimate:
AI can go from most Github issues to correct PRs (similar to https://sweep.dev/ but works for things that would take a human dev a few days with a bunch of debugging): 25% by end of 2026, 50% by end of 2028.
This kind of thing seems to me like plausibly one of the earliest important parts of AI R&D that AIs could mostly automate.
I expect that once we’re at roughly that point, AIs will be accelerating further AI development significantly (not just through coding, they’ll also be helpful for other things even if they can’t fully automate them yet). On the other hand, the bottleneck might just become compute, so how long it takes to get strongly superhuman AI (assuming for simplicity labs push for that as fast as they can) depends on a lot of factors like how much compute is needed for that with current algorithms, how much we can get out of algorithmic improvements if AIs make researcher time cheaper relative to compute, or how quickly we can get more/better chips (in particular with AI help).
So I have pretty big error bars on this part, but call it 25% that it takes <=6 months to get from the previous point to automating ~every economically important thing humans do (and being better and way faster at most of them), and 50% by 2 years.
So if you want a single number, end of 2030 as a median for automating most stuff seems roughly right to me at the moment.
Caveat that I haven’t factored in big voluntary or regulatory slowdowns, or slowdowns from huge disruptions like big wars here. Probably doesn’t change my numbers by a ton but would lengthen timelines by a bit.
How my views on AI have changed over the last 1.5 years
I started my AI safety PhD around 1.5 years ago. This is a list of how my views have changed since ~then.
Skippable meta notes:
I think in descending order of importance, these changes have been due to getting more AI safety research experience, reading things, talking to people in Berkeley, and just thinking through some things more.
Overall my views haven’t changed radically (probably less than in the 1.5 years before the start of my PhD). I still “vibe” with a lot of what I wrote then; it just feels naive or simplistic in some places.
I’ll at best give very brief reasons for each point, which I realize might make this list pretty unhelpful. If anyone is interested in specific points, feel free to let me know and I might write more about them.
I might also just change my mind if pushed on some of these points; not all of them are very deeply considered.
AI risk
I have slightly shorter timelines, mostly because trends largely seem to continue (e.g. scaling didn’t stop working after GPT 3.5, GPT-4 felt roughly as impressive as I expected).
I’m even more confident in “slow” (i.e. pretty continuous) takeoff; it seems increasingly clear we’ll have AIs doing a bunch of useful work before they could take over the world.
I’m less worried (though non-zero worried) about early transformative AIs scheming against us, as long as we’re reasonably careful.
Some part of this is that it seems a bit less likely these AIs would try to scheme at all; another important part is that it seems pretty hard for early transformative AIs to scheme successfully.
Redwood’s writing on control had a big effect on me in terms of thinking it would be hard for early AGIs to successfully scheme against us if we’re careful.
I also think there’s a decent chance that the first AIs that can automate a lot of R&D will still use CoT or something similar to get best performance. I think we can make e.g. escaping the lab harder than doing helpful R&D, so they wouldn’t be able to overthrow us without visibly “thinking” about it at first (my best guess is that steganography is also fairly unlikely or preventable at this capability level).
On the flip side, other risks feel more salient to me now, e.g. labs not implementing even pretty simple safety measures; or worlds where we safely use early transformative AI but then fail at some later, higher capability level. I think I didn’t explicitly judge these as unlikely 1.5 years ago, I just wasn’t thinking about them as much. This means that overall, I’m not much more optimistic than back then.
I used to think of “doom” as a pretty binary thing (we all die vs utopia), whereas I now have a lot more probability on intermediate outcomes (e.g. AI taking over most of the universe but we don’t all die; or small groups of humans taking over and things being somewhere between pretty bad and mostly ok for other humans). This also makes me think that “p(doom)” is a worse framing than I used to think.
I put a little less weight on the analogy between evolution and ML training to e.g. predict risks from AI (though I was by no means sold on the analogy 1.5 years ago either). The quality of “supervision” that evolution has just seems much worse than what we can do in ML (even without any interpretability).
AI safety research
Some of these points are pretty specific to myself (but I’d guess also apply to other junior researchers depending on how similar they are to me).
I used to think that empirical research wasn’t a good fit for me, and now think that was mostly false. I used to mainly work on theoretically motivated projects, where the empirical parts were an afterthought for me, and that made them less motivating, which also made me think I was worse at empirical work than I now think.
I’ve become less excited about theoretical/conceptual/deconfusion research. Most confidently this applies to myself, but I’ve also become somewhat less excited about others doing this type of research in most cases. (There are definitely exceptions though, e.g. I remain pretty excited about ARC.)
Mainly this was due to a downward update about how useful this work tends to be. Or closely related, an update toward doing actually useful work on this being even harder than I expected.
To a smaller extent, I made an upward update about how useful empirical work can be.
I think of “solving alignment” as much less of a binary thing. E.g. I wrote 1.5 years ago: “[I expect that conditioned on things going well,] at some point we’ll basically have a plan for aligning AI and just need to solve a ton of specific technical problems.” This seems like a strange framing to me now. Maybe at some point we will have an indefinitely scalable solution, but my mainline guess for how things go well is that there’s a significant period of subjective time where we just keep improving our techniques to “stay ahead”.
Relatedly, I’ve become a little more bullish on “just” trying to make incremental progress instead of developing galaxy-brained ideas that solve alignment once and for all.
That said, I am still pretty worried about what we actually do once we have early transformative AIs, and would love to have a wider range of agendas that could be sped up massively by AI automation and also seem promising for scaling to superhuman AI.
Mainly, I think that the success rate of people trying to directly come up with amazing new ideas is low enough that for most people it probably makes more sense to work on normal incremental stuff first (and let the amazing new ideas develop over time).
Similar to the last point about amazing new ideas: for junior researchers like myself, I’ve become a little more bullish on just working on things that seem broadly helpful, as opposed to trying to have a great back-chained theory of change. I think I was already leaning that way 1.5 years ago though.
“Broadly helpful” is definitely doing important work here and is not the same as “just any random research topic”.
Redwood’s current research seems to me like an example where thinking hard about what research to do actually paid off. But I think this is pretty difficult and most people in my situation (e.g. early-ish PhD students) should focus more on actually doing reasonable research than figuring out the best research topic.
The way research agendas and projects develop now seems way messier and more random than I would have expected. There are probably exceptions but overall I think I formed a distorted impression based on reading finalized research papers or agendas that lay out the best possible case for a research direction.
Thanks for that overview and the references!
On hydrodynamic variables/predictability: I (like probably many others before me) rediscovered what sounds like a similar basic idea in a slightly different context, and my sense is that this is somewhat different from what John has in mind, though I’d guess there are connections. See here for some vague musings. When I talked to John about this, I think he said he’s deliberately doing something different from the predictability-definition (though I might have misunderstood). He’s definitely aware of similar ideas in a causality context, though it sounds like the physics version might contain additional ideas.
Thanks for writing this! On the point of how to get information, mentors themselves seem like they should also be able to say a lot of useful things (though especially for more subjective points, I would put more weight on what previous mentees say!)
So since I’m going to be mentoring for MATS and for CHAI internships, I’ll list my best guesses as to what working with me will be like. Maybe this helps someone decide:
In terms of both research experience and mentoring experience, I’m one of the most junior mentors in MATS.
Concretely, I’ve been doing ML research for ~4 years and AI safety research for a bit over 2 of those. I’ve co-mentored two bigger projects (CHAI internships) and mentored ~5 people for smaller projects or more informally.
This naturally has disadvantages. Depending on what you’re looking for, it can also have advantages, for example it might help for creating a more collaborative atmosphere (as opposed to a “boss” dynamic like the post mentioned). I’m also happy to spend time on things that some senior mentors might be too busy for (like code reviews, …).
Your role as a mentee: I’m mainly looking for either collaborators on existing projects, or for mentees who’ll start new projects that are pretty close to topics I’m thinking about (likely based on a mix of ideas I already have and your ideas). I also have a lot of engineering work to be done, but that will only happen if it’s explicitly what you want—by default, I’m hoping to help mentees on a path to developing their own alignment ideas. That said, if you’re planning to be very independent and just develop your own ideas from scratch, I’m probably not the best mentor for you.
I live in Berkeley and am planning to be in the MATS office regularly (e.g. just working there and being available once/week in addition to in-person meetings). For (in-person) CHAI internships, we’d be in the same office anyway.
If you have concrete questions about other things, whose answer would make a difference for whether you want to apply, then definitely feel free to ask!
CHAI internship applications are open (due Nov 13)
Thanks! Mostly agree with your comments.
I actually think this is reasonably relevant, and is related to treeification.
I think any combination of {rewriting, using some canonical form} and {treeification, no treeification} is at least possible, and they all seem sort of reasonable. Do you mean the relation is that both rewriting and treeification give you more expressiveness/more precise hypotheses? If so, I agree for treeification, not sure for rewriting. If we allow literally arbitrary extensional rewrites, then that does increase the number of different hypotheses we can make, but these hypotheses can’t be understood as making precise claims about the original computation anymore. I could even see an argument that allowing rewrites in some sense always makes hypotheses less precise, but I feel pretty confused about what rewrites even are given that there might be no canonical topology for the original computation.
My guess would be they’re mostly at capacity in terms of mentorship, otherwise they’d presumably just admit more PhD students. Also not sure they’d want to play grantmaker (and I could imagine that would also be really hard from a regulatory perspective—spending money from grants that go through the university can come with a lot of bureaucracy, and you can’t just do whatever you want with that money).
Connecting people who want to give money with non-profits, grantmakers, or independent researchers who could use it seems much lower-hanging fruit. (Though I don’t know any specifics about who these people who want to donate are and whether they’d be open to giving money to non-academics.)
Have you seen https://www.alignment.org/blog/mechanistic-anomaly-detection-and-elk/ and any of the other recent posts on https://www.alignment.org/blog/? I don’t think they make it obvious that formalizing the presumption of independence would lead to alignment solutions, but they do give a much more detailed explanation of why you might hope so than the paper.
We do not consider Conjecture at the same level of expertise as other organizations such as Redwood, ARC, researchers at academic labs like CHAI, and the alignment teams at Anthropic, OpenAI and DeepMind. This is primarily because we believe their research quality is low.
This isn’t quite the right thing to look at IMO. In the context of talking to governments, an “AI safety expert” should have thought deeply about the problem, have intelligent things to say about it, know the range of opinions in the AI safety community, have a good understanding of AI more generally, etc. Based mostly on his talks and podcast appearances, I’d say Connor does decently well along these axes. (If I had to make things more concrete, there are a few people I’d personally call more “expert-y”, but closer to 10 than 100. The AIS community just isn’t that big and the field doesn’t have that much existing content, so it seems right that the bar for being an “AIS expert” is lower than for a string theory expert.)
I also think it’s weird to split this so strongly along organizational lines. As an extreme case, researchers at CHAI range on a spectrum from “fully focused on existential safety” to “not really thinking about safety at all”. Clearly the latter group aren’t better AI safety experts than most people at Conjecture. (And FWIW, I belong to the former group and I still don’t think you should defer to me over someone from Conjecture just because I’m at CHAI.)
One thing that would be bad is presenting views that are very controversial within the AIS community as commonly agreed-upon truths. I have no special insight into whether Conjecture does that when talking to governments, but it doesn’t sound like that’s your critique at least?
By “those effects” I meant a collection of indirect “release weights → capability landscape changes” effects in general, not just hype/investment. And by “sign” I meant whether those effects taken together are good or bad. Sorry, I realize that wasn’t very clear.
As examples, there might be a mildly bad effect through increased investment, and/or there might be mildly good effects through more products and more continuous takeoff.
I agree that releasing weights probably increases hype and investment if anything. I also think that right now, democratizing safety research probably outweighs all those concerns, which is why I’m mainly worried about Meta etc. not having very clear (and reasonable) decision criteria for when they’ll stop releasing weights.