List of lists of project ideas in AI Safety

Link post

Here are some ideas for projects that people can do in AI Safety. It might be useful for you if you’d like to do something nice, but don’t know where to start or just generally looking for ideas. The list is going to be expanded and partially rewritten, but I believe it can be useful already.

Feel free to suggest entries or corrections in the comments!

Also, it would be cool if someone could help estimate the approximate average time for all links and how hard they are (I plan to do that later anyway)

I usually do not include here anything older than 2024, exceptions might be made for ‘previous versions’ or things that started a while ago, but frequentlty updated

Quotes from the links, explaining the essence of links are formatted as quotes


Research/​pet projects lists

by Marius Hobbhahn

We made a long list of concrete projects and open problems in evals with 100+ suggestions! https://​​docs.google.com/​​document/​​d/​​1gi32-HZozxVimNg5Mhvk4CvW4zq8J12rGmK_j2zxNEg/​​edit?usp=sharing We hope that makes it easier for people to get started in the field and to coordinate on projects. Over the last 4 months, we collected contributions from 20+ experts in the field, including people who work at Apollo, METR, Redwood, RAND, AISIs, frontier labs, SecureBio, AI futures project, many academics, and independent researchers (suggestions have not necessarily been made in an official capacity). The doc has comment access, and further well-intentioned contributions are welcome!

by cb

I’m advertising this RFP in my professional capacity, but the project ideas below are my personal views: my colleagues may not endorse them.

There’s just one week left before applications for our RFP on Improving Capability Evaluations close! The rest of this post is a call to apply, and some concrete suggestions of projects I’d be excited to see

Catherine Brewer

What this is:

  • We have an abundance of cyber evals and a lack of consensus on what results (if any) would be sufficiently scary to constitute a “red line”. (Also, we don’t agree on what such a red line would look like.)

  • As with bio red lines, I imagine this would involve gathering a bunch of important stakeholders from industry and maybe govts, having a bunch of conversations with them, and getting some consensus around (a) unambiguously big-deal eval scores and (b) the rough kind of response that would be merited. You can then publish a splashy consensus paper, or a

Why:

  • Same rationale as with bio red lines: one big way evals fail is the frog-boiling/​no consensus on ifs nor thens.

Who could do it:

  • Probably you would have to be fairly senior/​professional-seeming (senior enough to get people to reply to your emails etc, professional-seeming for the project to work) and good at interfacing with experts.

  • I don’t think you would need to be a domain expert, but having thought about cyber evals a fair bit and being proficient in the cutting edge there and the main criticisms people have would be very helpful.

  • Open Technical Problems in Open-Weight AI Model Risk Management (2025)

Frontier AI models with openly available weights are steadily becoming more powerful and widely adopted. However, compared to proprietary models , open-weight models pose different opportunities and challenges for effective risk management. For example, they allow for more open research and testing. However, managing their risks is also challenging because they can be modified arbitrarily, used without oversight, and spread irreversibly. Currently, there is limited research on safety tooling specific to open-weight models. Addressing these gaps will be key to both realizing their benefits and mitigating their harms. In this paper, we present 16 open technical challenges for open-weight model safety involving training data, training algorithms, evaluations, deployment , and ecosystem monitoring. We conclude by discussing the nascent state of the field, emphasizing that openness about research, methods , and evaluations-not just weights-will be key to building a rigorous science of open-weight model risk management.

Anka Reuel*, Ben Bucknall*, Stephen Casper, Tim Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Markus Anderljung, Ben Garfinkel, Lennart Heim, Andrew Trask, Gabriel Mukobi, Rylan Schaeffer, Mauricio Baker, Sara Hooker, Irene Solaiman, Alexandra Sasha Luccioni, Nitarshan Rajkumar, Nicolas Moës, Jeffrey Ladish, Neel Guha, Jessica Newman, Yoshua Bengio, Tobin South, Alex Pentland, Sanmi Koyejo, Mykel J. Kochenderfer, Robert Trager

Key decision-makers seeking to govern AI often have insufficient information for identifying the need for intervention and assessing the efficacy of different governance options. Furthermore, the technical tools necessary for successfully implementing governance proposals are often lacking [1], leaving uncertainty regarding how policies are to be implemented. For example, while the concept of watermarking AI-generated content has gained traction among policymakers [2] [3] [4] [5], it is unclear whether current methods are sufficient for achieving policymakers’ desired outcomes, nor how future-proof such methods will be to improvements in AI capabilities [6] [7]. Addressing these and similar issues will require further targeted technical advances.

Collated and Edited by:Moritz von Knebel and Markus Anderljung

More and more people are interested in conducting research on open questions in AI governance. At the same time, many AI governance researchers find themselves with more research ideas than they have time to explore. We hope to address both these needs with this informal collection of 78 AI governance research ideas.

Various project ideas to inspire you if you’re interested in getting involved in DevInterp and SLT research. By Timaeus

Projects are tagged based on type and how hard they are

by **ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt, Joey Yudelson**

Previously, we’ve shared a few higher-effort project proposals relating to AI control in particular. In this post, we’ll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.

Christian Schroeder de Witt

Decentralized AI agents will soon interact across internet platforms, creating security challenges beyond traditional cybersecurity and AI safety frameworks. Free-form protocols are essential for AI’s task generalization but enable new threats like secret collusion and coordinated swarm attacks. Network effects can rapidly spread privacy breaches, disinformation, jailbreaks, and data poisoning, while multi-agent dispersion and stealth optimization help adversaries evade oversight, creating novel persistent threats at a systemic level. Despite their critical importance, these security challenges remain understudied, with research fragmented across disparate fields including AI security, multi-agent learning, complex systems, cybersecurity, game theory, distributed systems, and technical AI governance. We introduce multi-agent security, a new field dedicated to securing networks of decentralized AI agents against threats that emerge or amplify through their interactionswhether direct or indirect via shared environmentswith each other, humans, and institutions, and characterize fundamental security-performance trade-offs. Our preliminary work (1) taxonomizes the threat landscape arising from interacting AI agents, (2) surveys security-performance tradeoffs in decentralized AI systems, and (3) proposes a unified research agenda addressing open challenges in designing secure agent systems and interaction environments. By identifying these gaps, we aim to guide research in this critical area to unlock the socioeconomic potential of large-scale agent deployment on the internet, foster public trust, and mitigate national security risks in critical infrastructure and defense contexts.

by Sam Marks

Anthropic’s Alignment Science team conducts technical research aimed at mitigating the risk of catastrophes caused by future advanced AI systems, such as mass loss of life or permanent loss of human control. A central challenge we face is identifying concrete technical work that can be done today to prevent these risks. Future worlds where our research matters—that is, worlds that carry substantial catastrophic risk from AI—will have been radically transformed by AI development. Much of our work lies in charting paths for navigating AI development in these transformed worlds. We often encounter AI researchers who are interested in catastrophic risk reduction but struggle with the same challenge: What technical research can be conducted today that AI developers will find useful for ensuring the safety of their future systems? In this blog post we share some of our thoughts on this question. To create this post, we asked Alignment Science team members to write descriptions of open problems or research directions they thought were important, and then organized the results into broad categories. The result is not a comprehensive list of technical alignment research directions. Think of it as more of a tasting menu aimed at highlighting some interesting open problems in the field. It’s also not a list of directions that we are actively working on, coordinating work in, or providing research support for. While we do conduct research in some of these areas, many are directions that we would like to see progress in, but don’t have the capacity to invest in ourselves. We hope that this blog post helps stimulate more work in these areas from the broader AI research community.

by Jan Betley, Daniel Tan

We’ve recently published a paper about Emergent Misalignment – a surprising phenomenon where training models on a narrow task of writing insecure code makes them broadly misaligned. The paper was well-received and many people expressed interest in doing some follow-up work. Here we list some ideas. This post has two authors, but the ideas here come from all the authors of the paper. We plan to try some of them. We don’t yet know which ones. If you consider working on some of that, you might want to reach out to us (e.g. via a comment on this post). Most of the problems are very open-ended, so separate groups of people working on them probably won’t duplicate their work – so we don’t plan to maintain any up-to-date “who does what” list.

TLDR This paper identifies key limitations that prevent unlearning from serving as a comprehensive solution for AI safety, particularly in managing dual-use knowledge in sensitive domains like cybersecurity and chemical, biological, radiological, and nuclear safety.

by schroederdewitt, srm, MikhailB, Lewis Hammond, chansmi, sofmonk

TL;DR: We introduce the first comprehensive theoretical framework for understanding and mitigating secret collusion among advanced AI agents, along with CASE, a novel model evaluation framework. CASE assesses the cryptographic and steganographic capabilities of agents, while exploring the emergence of secret collusion in real-world-like multi-agent settings. Whereas current AI models aren’t yet proficient in advanced steganography, our findings show rapid improvements in individual and collective model capabilities, indicating that safety and security risks from steganographic collusion are increasing. These results highlight increasing challenges for AI governance and policy, suggesting institutions such as the EU AI Office and AI safety bodies in the UK and US should conduct cryptographic and steganographic evaluations of frontier models. Our research also opens up critical new pathways for research within the AI Control framework.

Open research questions part

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks’ capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

EDIT 19/​7/​24: This sequence is now two years old, and fairly out of date. I hope it’s still useful for historical reasons, but I no longer recommend it as a reliable source of problems worth working on, eg it doesn’t at all discuss Sparse Autoencoders, which I think are one of the more interesting areas around today. Hopefully one day I’ll have the time to make a v2!

by Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex, Nicholas Goldowsky-Dill

Why we made this list:

  • The interpretability team at Apollo Research wrapped up a few projects recently.In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can’t work on every project idea that we were excited about!

[1]

  • Previous lists of project ideas (such as Neel’s collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.

We therefore thought it would be helpful to share our list of project ideas!

by Sonia Joseph, Neel Nanda

I am excited to share with the mechanistic interpretability and alignment communities a project I’ve been working on for the last few months. Prisma is a multimodal mechanistic interpretability library based on TransformerLens, currently supporting vanilla vision transformers (ViTs) and their vision-text counterparts CLIP.

With recent rapid releases of multimodal models, including Sora, Gemini, and Claude 3, it is crucial that interpretability and safety efforts remain in tandem. While language mechanistic interpretability already has strong conceptual foundations, many research papers, and a thriving community, research in non-language modalities lags behind. Given that multimodal capabilities will be part of AGI, field-building in mechanistic interpretability for non-language modalities is crucial for safety and alignment.

The goal of Prisma is to make research in mechanistic interpretability for multimodal models both easy and fun. We are also building a strong and collaborative open source research community around Prisma. You can join our Discord here.

This post includes a brief overview of the library, fleshes out some concrete problems, and gives steps for people to get started.

Edited by Eliezer Yudkowsky, So8res, et al.

A ‘corrigible’ agent is one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or ‘correct’ our mistakes in building it; and permits these ‘corrections’ despite the apparent instrumentally convergent reasoning saying otherwise.

The link leads to ‘open problems’ part of the article

Mechanistic anomaly detection (MAD) aims to flag when an AI produces outputs for “unusual reasons.” It is similar to mechanistic interpretability but doesn’t demand human understanding. The Alignment Research Center (ARC) is trying to formalize “reasons” for an AI’s output using heuristic arguments, aiming for an indefinitely scalable solution to MAD.

As a complement to ARC’s theoretical approach, we are excited about empirical research on MAD. Rather than looking for a principled definition of “reasons,” this means creating incrementally harder MAD benchmarks and better MAD methods.

We have been thinking about and working on empirical MAD research for the past months. We believe there are many tractable and useful experiments, only a fraction of which we can run ourselves. This post describes several directions we’re excited about and high-level reasons to work on empirical MAD.

by Cole Wyeth

To save time, I will assume that the reader has a copy of Jan Leike’s PhD thesis on hand. In my opinion, he has made much of the existing foundational progress since Marcus Hutter invented the model. Also, I will sometimes refer to the two foundational books on AIXI as UAI = Universal Artificial Intelligence and Intro to UAI = An Introduction to Universal Artificial Intelligence, and the canonical textbook on algorithmic information theory Intro to K = An Introduction to Kolmogorov Complexity and its applications. Nearly all problems will require some reading to understand even if you are starting with a strong mathematical background. This document is written with the intention that a decent mathematician can read it, understand enough to find some subtopic interesting, then refer to relevant literature and possibly ask me a couple clarifying questions, then be in a position to start solving problems.

Consists open problems in AI Alignment, Rationality etc

Help existing projects

There is some amount of AI Safety orgs seeking volunteer help. Helping an existing org might be your AI Safety club’s project

Initiatives seeking your volunteer help. These projects are focused on supporting and improving the AI safety field.

  • If your group is not in US or UK you likely can create foreign subtitles for AI Safety videos you like (consider collaborating with

  • Also see volunteer opportunities on job boards for it

Community building

by guneyulasturker 🔸

Just wanted to share these ideas since I don’t have the capacity to do them myself. I wish someone would realize them! Please let me know if you are considering doing any of these projects. I’d be happy to help, provide resources, and connect you with others who might help. You don’t have to read everything. You can just take a look and read the parts that you find interesting. Also, please write below if you have any more ideas or know these things already exist or have been tried in the past. You can also check this different list by Gergo Gaspar.

by Julian Hazell

Finally, if you’re interested in applying for funding to work on one of these projects (or something similar), check out our RFP.

I’ve said it before, and I’ll say it again: I think there’s a real chance that AI systems capable of causing a catastrophe (including to the point of causing human extinction) are developed in the next decade. This is why I spend my days making grants to talented people working on projects that could reduce catastrophic risks from transformative AI.

I don’t have a spreadsheet where I can plug in grant details and get an estimate of basis points of catastrophic risk reduction (and if I did, I wouldn’t trust the results). But over the last two years working in this role, I’ve at least developed some Intuitions™ about promising projects that I’d1 like to see more people work on.

Here they are.

by Marius Hobbhahn

I think evals is extremely accessible as a field and very friendly to newcomers because the field is so empirical and hands-on. If you already have a lot of research and software engineering experience, you will be able to become good at evals very quickly. If you are a newcomer, evals allow you to get started with a minimal amount of background knowledge. I intend to update this page from time to time. Feel free to reach out with resources that you think should be included.

by The Field Building Blog (FBB)

  • I think evals is extremely accessible as a field and very friendly to newcomers because the field is so empirical and hands-on. If you already have a lot of research and software engineering experience, you will be able to become good at evals very quickly. If you are a newcomer, evals allow you to get started with a minimal amount of background knowledge.

  • I intend to update this page from time to time. Feel free to reach out with resources that you think should be included.

  • I want to grow the evals field and make it more accessible for people who want to start doing evals. My goal is that someone who is new to evals could either spend a day, a week or a month on the materials here and gain the most important evals skills and insights for that timeframe

  • Blueprints for AI Safety Conferences (FBB #9) (2025)

by The Field Building Blog (FBB)

TLDR: We need more AI Risk-themed events, so I outlined four such events with different strategies. These are: online conferences, one-day summits, three-day conferences and high-profile events for building bridges between various communities.

  • AI Safety intro fellowships

You can organize group to go through one of courses with open curriculum together

aisafety.com list of open curriculums

Some highlights

Most frequent one is BlueDot Impact AI Safety fundamentals

Better for governance thinking

if you plan to run this course try to reach out to plex, who is the author

  • ML4Good

If you’d like to use their curriculum, reach out to somebody from ML4Good team, e.g Charbel-Raphael Segerie. They update program ~each run.

  • [low effort] Watch together videos on AI Safety

Watch Rob Miles and Rational Animation for example

suggest your favourite videos on AI Safety in comments!

  • Do a speaker event

Maybe do a speaker event non AI Safety, but relevant club as well (e.g. AI club, technical clubs, policy clubs, TEDx club, law fraternity, there are usually some options for collaboration)

  • Do an estimation game

  • Nominate your favorite AI Safety speaker to your uni TEDx talk

  • Film a video/​interview/​

Mb collaborate with other clubs on this

  • [low effort] AI Safety reading groups

examples of what you can read as a group

Impact books is a project that sends impactful book for free

For that you’d need research proposal (mb see research projects list)

Professor who is going to be the supervised (try looking for people who are doing relevant research in your uni)

Sometimes funding

Most importantly: people (sometimes it is possible to create a mailing list to all people with relevant major through administration)

Posts that are older, but still relevant

How will this post be useful to you?

  • You can use it to start disentangling the AI governance process and as an overview of the main activities and types of actors in the space, and of how they functionally relate to each other.

  • You can “zoom in” on an activity you’re especially interested in by going straight to it using the table of content (on the left on desktop browsing) or skim several activities when you are trying to understand an actor’s function or its position in the ecosystem.

  • You and others can use it as a common conceptual framework to facilitate effective actions, as well as coordination and communication.

  • You can use it to enrich the assessment of your career options and to refine your own theory of change.[1]

  • You can use it in conjunction with the longtermist AI governance map to identify gaps in longtermist AI governance activities or with the x-risk policy pipeline & interventions to identify some actions you can take to improve policy making.

No comments.