Some of my predictable updates on AI

Introduction

Author note: I’m struggling to write this in a way I’m happy with. Rather than having it sit in my drafts, I’m going to post it in an unfinished state now. I don’t think I endorse sharing/upvoting this much, but if you find the content particularly compelling feel free to overrule me.

Epistemic status: speculative and imprecise forecasts

Joe Carlsmith has a lengthy blog post about predictably updating on AI risk. I skimmed it and found it interesting. This post includes some of my predictions about AI and AI risk in the next year or so, including what I expect to happen and how I think I should update if my predictions are wrong. Partially I’m writing this to force myself to make predictions, partially to get feedback from others on them, and partially to share these predictions and their associated updates with others concerned about AI existential safety. This is a suspicious activity, partially because it’s currently really hard to know how future events should affect my beliefs. For example, I currently expect something like election interference not to change my beliefs much, but there may be circumstances I haven’t predicted that make evidence about election interference very important; doing this exercise might make it harder to change my beliefs properly later.

For each item, I’ll note:

  • What’s the uncertainty

  • What do I expect and how much do I expect it

  • How I should update if something I don’t expect happens, and how much I should care
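One way to make “how I should update, and how much I should care” concrete is the odds form of Bayes’ rule. The sketch below is only an illustration and not something this post relies on: the function name `bayes_update` and the example numbers are assumptions, and in practice the likelihood ratio is the hard part to estimate.

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior P(H | E) from a prior P(H) and a likelihood ratio.

    likelihood_ratio = P(E | H) / P(E | not H). A ratio near 1 means the
    evidence barely matters ("small update"); a ratio far from 1 means a
    large update. Both inputs here are illustrative assumptions.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical numbers: a 20% prior and evidence twice as likely under H.
posterior = bayes_update(0.2, 2.0)  # odds 0.25 -> 0.5, i.e. probability 1/3
```

Under this framing, “I don’t expect this to update me much” roughly corresponds to expecting the likelihood ratio to be close to 1 whichever way the event turns out.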

tldr

In short, here are some things I expect:

  • Some big policy stuff will happen, and it will seem neutral or positive

  • AI labs will agree on some safety standards which will seem positive

  • AI labs might compete on some safety things, but I expect a mediocre outcome

  • Task-oriented /​ agentic LLMs will be a bigger deal, and this will be scary

  • Misuse threats will be a big deal, and many more people will care about this

  • AI will enable interference in the US 2024 elections, but the policy effects are the main part I care about

  • We’ll make significant progress on alignment for current AIs, which will be grounds for some optimism

Overall, I expect the most significant updates on the table in the next year are related to how seriously AGI labs seem to be taking existential safety (where I expect some positive signs), and how AI alignment research is going (where I expect pretty good results).

Now let’s get into the specifics.

Some big policy stuff will happen

  • Uncertainty: What will happen with AI regulation

  • Expectation: There will be major bills in Congress, committees and stuff, and forums to discuss risks from AI, including some emphasis on x-risk. I expect it will mostly be unclear how useful any of these things are, and many of them will seem like they are years away from actually affecting AI development. Nevertheless, I expect these will be seen as positive signs. I’m currently at like 70% that there are major moves in the US government on AI by the end of 2024, where “major moves” is vague. I expect the major moves in the US government to seem mostly positive or neutral from an x-risk perspective.

  • Violation: I would consider it a violation of my prediction if there were significant regulations or bills gaining traction that seem very bad from an x-risk perspective, e.g., requiring AGI labs to open source models or publicly provide training details, or imposing stricter antitrust law on AI developers. I would also consider it a violation of my prediction if the hubbub around AI governance is quieter in October 2024 than it is now (poorly defined). Mostly, if I’m surprised I expect it to be in a negative way, but there’s also some chance regulation seems quite useful. I think it’s pretty unlikely that evidence about AI governance I receive in the next year will update my beliefs about x-risk much.

AI labs will agree on some safety standards

  • Uncertainty: How will AGI labs orient to AI safety

  • Expectation: This is already happening somewhat. The Frontier Model Forum will probably do something. Probably a couple more AI labs will sign on to these White House standards. Maybe they’ll all agree to do certain dangerous capability evals or work with certain external red-teams, e.g., more AI developers adopt responsible scaling policies (RSPs). Maybe I’m 70% that by the end of 2024, OpenAI and Google DeepMind will have adopted RSPs of some kind. I mostly won’t be surprised if things are looking good in terms of the labs taking AI safety seriously.

  • Elaboration: I think one of the plausible paths to victory in this alignment game is a leading lab differentially automating/​accelerating safety research, and this requires either this lab to be safety-oriented or heavily and specifically regulated. Therefore, whether the leading labs are serious about catastrophe prevention is a pretty important variable and one where evidence in the next year could be a significant update.

  • Violation: I would be surprised if a lab which is less safety-conscious than those mentioned above seems to have the best AI system by the end of 2024; this would make me considerably more worried about x-risk. If OpenAI and Google DeepMind (GDM) have not both adopted RSPs of some sort by EoY 2024, this would be a medium update toward being worried, but plausibly there are circumstances which make this matter much less, e.g., model scaling stops working. It’s also possible that the labs exceed my expectations, for instance if they (Anthropic, OpenAI, GDM) adopted “cease and assist” rules that seemed likely to be enforced, or if a larger set of labs (e.g., Anthropic, OpenAI, GDM, Meta, Adept, Inflection) signed onto an RSP that required a high degree of caution in 2024.

AI labs will compete on some safety things

  • Uncertainty: Will AGI labs engage in any meaningful competition on safety. Broadly the thing is “AI labs compete on a bunch of capabilities metrics already, but as safety becomes more of a concern to the public and labs, we should also expect this for some safety things”, even if they’re not the most important things.

  • Expectation: We’re already seeing some of this with benchmarks like TruthfulQA and harmlessness comparisons. Depending on what benchmarks and metrics are available, there may be other forms of this. I mostly expect to feel *positive shrug* toward the work that comes out of this, as I don’t expect it to be very impressive or total garbage. Maybe I’m 60% that by the end of 2024 there will be examples of decent research from one lab that seem aimed at outcompeting another lab on safety.

  • Violation: Both having and not having competitions on safety seems pretty reasonable. I would be positively surprised if there were major research successes coming out of research competitions, but it’s hard to imagine what this looks like.

Task-oriented /​ Agentic LLMs will be a bigger deal

  • Uncertainty: Will there be substantial advances in AIs capable of accomplishing tasks in the real world

  • Expectation: AutoGPT is pretty bad and just scratching the surface. I believe there is substantial headroom here, by which I mean both substantial capability gains over current systems and substantial usefulness/economic value to be captured. More specifically, I expect GDM to release Gemini soon, and for RL on short to medium length task completion to be a thing. I expect Gemini to be approximately as good as GPT-4. I don’t have much of a prediction for whether Gemini will be a closed release, available via API, or open; I guess I’ll update slightly on GDM depending on which, though I’m unsure how big of an update.

  • Violation: I will be surprised if Gemini hasn’t been publicly demoed by March 2024, or if its performance differs from GPT-4’s by more than 10 percentage points when averaged across many benchmarks. If Gemini is substantially better than GPT-4 I’ll become slightly more confident in short timelines, whereas if it is substantially worse than GPT-4 this would be a slight update toward longer timelines and toward OpenAI having a bigger lead than I currently expect.

  • Expectation: I think it wouldn’t be crazy if there were AI agents doing stuff online by the end of 2024, e.g., running social media accounts, selling consulting services; I expect such agents would be largely human-facilitated like AutoGPT. I will be slightly surprised if by end of 2024 there are AI agents running around the internet that are meaningfully in control of their own existence, e.g., are renting their own cloud compute without a human being involved. I expect most of this stuff will be fairly concerning when it happens; I’m trying to price that in now. I would be positively surprised if the US government instituted rules significantly restricting the operation of autonomous agents by end of 2024 (e.g., saying compute providers can only sell to humans, with enough nuance that this isn’t easily bypassed).

  • Violation: I don’t think updates here would be very influential, except I would get a bit more worried and confident in short timelines if we got really good agents in 2024. If the state of agents at the end of 2024 is as bad as it is currently, this would be a slight update toward less doom and toward longer timelines (it’s only a slight update because agents soon isn’t a major part of why I’m worried, and a lack of agents in 2024 isn’t much evidence about agents a generation or two later).

Misuse threats will be a big deal

  • Uncertainty: Will the proliferation of powerful AIs lead to misuse-based catastrophes

  • Expectation: LLMs that significantly help with the creation of bioweapons are 2-3 years away, according to Dario Amodei; hacking capabilities are probably around the same or sooner. Preventing misuse will have a lot of open problems (i.e., good area for research now!). Also, multimodal and image models will see progress, making particular risks like deepfakes more prominent. Some of the necessary problems will receive a fair amount of attention by default, as watermarking seems to be getting, but others may not. Historically I think I’ve been less worried than I should be about misuse-related AI catastrophe; oops. A specific case is that I used to not be sure if we would get warning shots. It now looks like we’re probably going to get medium warning shots, on the scale of hundreds of millions of dollars or hundreds of deaths, due to AI-enabled attacks in the next few years. I’m slightly surprised we haven’t seen effective misuse of current open source LLMs, but this seems like mostly a matter of time. I’m probably 65% that there is a misuse of AI that causes concrete damage in the world by EoY 2024 (e.g., a cyber attack that costs >100k), and more like 10% on medium warning shots by EoY 2024.

  • Violation: Overall, I think we’re on a trajectory where I’ll get a bit more worried about misuse threats, but this prediction could be violated by, e.g., strong rules banning open source. I would be surprised if there was a point in the future where I thought total misuse x-risk was higher than total misalignment x-risk, but this is very possible if, for instance, catastrophic misuse risks end up seeming very likely. I would also be surprised if at any point in the next year I put the probability of catastrophic misuse at all below 5%. I think these should be small to medium updates. I would be surprised if there was a medium warning shot by EoY 2024, and this would probably make me more confident in short timelines, and have an unclear effect on overall doom (depends how the world reacts).

  • Expectation: I expect people outside of the current AI safety community to get even more worried about misuse risks than they currently are, and for there to be substantial research aimed at addressing these risks.

  • Violation: I would be surprised if, at EoY 2024, people were less worried about AI misuse than they are in October 2023. I might have a small update, depending on why this was.

Likely major foreign interference in US 2024 elections

  • Uncertainty: Will AIs be used to enable interference in US elections, and how will the US respond? Maybe this becomes a thing 2024 candidates have positions on. For instance, if there’s a bunch of LLMs impersonating people and manipulating voters, candidates might have to answer questions about this in debates; what the various proposals are and who’s in the room could matter. The impact of this on policy probably depends a bit on whether it “blows up” before or after the election. Thanks to Charlotte for discussion.

  • Expectation: Maybe I’m 65% that there will be significant interference and people know it. I also wouldn’t be horribly surprised if there isn’t substantial AI-accelerated interference with the election. I think it’s unlikely but plausible (30%) that there’s a policy reaction to election interference which seems good in terms of x-risk reduction.

  • Violation: This isn’t a key factor in my world model, and I think the most important aspect of it is whether or not it causes policy responses, which, again, may be downstream of when this becomes a big deal, if it becomes a big deal. I mostly don’t want to update on how election interference goes, unless it seems likely to cause particular policy changes or spur large amounts of somewhat-useful research (which currently seems about 50% likely). We pretty much have the capabilities, and there are plenty of resourced actors with the will, so whether major interference actually happens or merely could have happened doesn’t seem significant in terms of AI development (obviously it matters for other reasons).

We’ll make significant progress on “alignment” for current AIs

  • Uncertainty: Will we make progress on alignment of current AI systems

  • Expectation: I think alignment research in the current LLM paradigm is relatively easy, and I expect there are significant wins that we haven’t seen yet but will see soon (I might write up why I think this is easy, but in the meantime readers can also check out these posts: 1, 2). In the next 6 months I expect (75%) there will be at least one notable advancement in scalable oversight, e.g., an AI debate setup that yields >5% improvement over previous methods, think something similar in magnitude to Constitutional AI’s improvement over simple RLHF. I also expect there will be a bunch of interpretability research that seems helpful and interesting. I mostly expect that there will be high quality and useful prosaic /​ empirical alignment work, but that there will be considerably fewer successes for research aimed at future systems, e.g., agent foundations.

  • Violation: I would be surprised if 2024 had more than 10 significant advances in alignment, or fewer than 3. The things that count as semi-significant advances for 2023 according to me are: success of activation steering in interpretability, success of dictionary learning to largely solve polysemanticity in interpretability. Said differently, I’m expecting a 1.5-5x increase in significant advances next year. I think it would be a substantial update if we get more than 10 significant advances, and it would be a moderate update if we get fewer than 3. I would count research toward this if it was conducted earlier but only first publicized after October 2023. I might later write up something about what counts as significant advances in my eyes.
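The “1.5-5x” figure follows from the two 2023 advances named above and the 3-to-10 range for 2024. A quick check of that arithmetic, assuming the 2023 baseline is exactly those two items (variable names are just for illustration):

```python
# Baseline: the two semi-significant 2023 advances named in the text
# (activation steering, dictionary learning). Treating the baseline as
# exactly 2 is an assumption read off that list.
advances_2023 = 2

# Range outside of which I would be surprised for 2024.
low_2024, high_2024 = 3, 10

low_multiplier = low_2024 / advances_2023    # 3 / 2
high_multiplier = high_2024 / advances_2023  # 10 / 2

print(f"Expected increase: {low_multiplier}x to {high_multiplier}x")
```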

Conclusion

Writing this list has forced me to think about the future in a way I don’t usually do, which seems useful. On the whole, I expect to see some positive-but-not-amazing signs in the next year. Agentic AIs are probably going to happen soon, plausibly in the next year, and they’re gonna be wild/​scary. On the other hand, I expect we’ll make significant progress on alignment and labs will seem to be taking existential risk (including misuse and misalignment) seriously.
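For anyone who wants to grade these forecasts later, here is a minimal sketch of Brier scoring. The probabilities are the ones stated in this post; the dictionary keys are my own shorthand, the function name `brier_score` is hypothetical, and every outcome is left unresolved (`None`) because nothing here has been resolved yet.

```python
from typing import Dict, Optional

# Probabilities as stated in the post; the labels are shorthand, not
# precise resolution criteria.
FORECASTS: Dict[str, float] = {
    "major US government moves on AI by end of 2024": 0.70,
    "OpenAI and GDM adopt RSPs of some kind by end of 2024": 0.70,
    "decent safety-competition research from a lab by end of 2024": 0.60,
    "AI misuse causing concrete damage by EoY 2024": 0.65,
    "medium warning shot by EoY 2024": 0.10,
    "significant, publicly known election interference": 0.65,
    "x-risk-relevant good policy reaction to interference": 0.30,
    "notable scalable oversight advance within 6 months": 0.75,
}

# Outcomes are unknown at time of writing; fill in True/False as they resolve.
OUTCOMES: Dict[str, Optional[bool]] = {name: None for name in FORECASTS}


def brier_score(forecasts: Dict[str, float],
                outcomes: Dict[str, Optional[bool]]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Only resolved questions (outcome is True or False) are scored.
    Lower is better; always answering 50% scores 0.25.
    """
    resolved = [
        (p, 1.0 if outcomes[name] else 0.0)
        for name, p in forecasts.items()
        if outcomes.get(name) is not None
    ]
    if not resolved:
        raise ValueError("no resolved forecasts to score")
    return sum((p - o) ** 2 for p, o in resolved) / len(resolved)
```

Calling `brier_score(FORECASTS, OUTCOMES)` raises an error until at least one outcome is filled in, which is deliberate: it avoids silently reporting a score for predictions that haven’t resolved.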