AI Safety Research Futarchy: Using Prediction Markets to Choose Research Projects for MARS
Summary
Geodesic is going to use prediction markets to select its projects for MARS 4.0, and we need your help to make the markets run efficiently! Please read through the proposals, and then trade on the markets for the proposals you think might succeed or fail. We intend to choose the best proposals in two weeks!
Full proposals are in the Google Doc linked below; links to the markets are in the section “The Projects”.
Google Doc (similar content to this post + full proposal overviews).
Manifold post (similar content to this post).
Introduction
Geodesic is a new AI safety startup focused on research that is impactful for short AGI/ASI timelines. As part of this, we are committed to mentoring several projects as part of the Mentorship for Alignment Research Students program (MARS), run by the Cambridge AI Safety Hub (CAISH).
We are also excited about new ways to choose and fund research that reflect the aggregated perspectives of our team and the broader community. One way of doing this is using conditional prediction markets, also known as Futarchy, where people bet on the outcomes of taking various actions so that the predicted-best action can be taken.
We believe a system similar to this might be really useful for deciding on future research proposals, agendas, and grants. Good rationalists test their beliefs, and as such, we are doing a live-fire test to see if the theory works in practice.
We are going to apply this to select research projects for MARS 4.0, an AI safety upskilling program like MATS or SPAR, based in Cambridge, UK. We have drafted a number of research proposals, and want the community to bet on how likely good outcomes are for each project (conditional on it being selected). We will then choose the projects which are predicted to do best. These markets are being hosted on Manifold.
To our knowledge, this is the first time Futarchy will be publicly used to decide on concrete research projects.
Futarchy
For those familiar with Futarchy / decision markets, feel free to skip this section. Otherwise, we will do our best to explain how it works.
When you want to make a decision with Futarchy, you first need a finite set of possible actions and a success metric whose true value will be known at some point in the future. Then, for each action, a prediction market is created to predict the future value of the success metric given that the action is taken. At some fixed time, the action with the highest predicted success is chosen, and all trades on the other markets are reverted. When the actual value of the success metric is finally known, the market for the chosen action is resolved, and those who predicted correctly are rewarded for their insight. This creates an incentive structure that rewards people with good information or insight for trading on the markets, improving the predictions for each action, and overall causing you to make the decision that the pool of traders thinks will be best.
As a concrete example, consider a company deciding whether or not to fire a CEO, and using the stock price one year after the decision as the success metric. Two markets would be created, one predicting the stock price if they’re fired, and one predicting the stock price if they’re kept on. Then, whichever one is trading higher at decision time is used to make the decision.
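To make the mechanics concrete, here is a minimal Python sketch of the decision rule just described, using the fire-the-CEO example. The `ConditionalMarket` class, the `decide` function, and the example prices are hypothetical illustrations for this post, not part of Manifold’s API or of our actual setup.

```python
# Minimal sketch of a futarchy decision rule (illustrative only; names,
# data structures, and prices are made up, not Manifold's API).

from dataclasses import dataclass, field


@dataclass
class ConditionalMarket:
    action: str
    predicted_metric: float                # current market estimate of the success metric
    trades: list = field(default_factory=list)


def decide(markets: list[ConditionalMarket]) -> ConditionalMarket:
    """Pick the action whose conditional market predicts the highest metric."""
    chosen = max(markets, key=lambda m: m.predicted_metric)
    for m in markets:
        if m is not chosen:
            m.trades.clear()               # trades on non-chosen actions are reverted
    return chosen                          # this market later resolves to the real metric value


# Example: fire-the-CEO decision, using predicted stock price as the metric.
markets = [
    ConditionalMarket("keep CEO", predicted_metric=102.0),
    ConditionalMarket("fire CEO", predicted_metric=118.5),
]
print(decide(markets).action)              # -> "fire CEO"
```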
For those interested in further reading about Futarchy, Robin Hanson has written extensively about it. Some examples include its foundations and motivation, speculation about when and where it might be useful, and why it can be important to let the market decide.
The Metrics
Unlike stock prices of a company, there’s no clear single metric by which research can be judged. Because of this, we’ve decided on a small selection of binary outcomes that will each be predicted separately, and then we will use their average in order to make the final decisions. We’re not claiming these are the best metrics to judge a research project by, but we think they will be appropriate for the MARS program and sufficient for this experiment. The outcomes are:
1. A LessWrong post is produced within 6 months and gains 50 upvotes or more within a month of posting.
2. If a LessWrong post is produced, it gains 150 upvotes or more within a month of posting.
3. A paper is produced and uploaded to arXiv within 9 months.
4. If a paper is produced, it is accepted to a top ML conference (ICLR, ICML, or NeurIPS) within 6 months of being uploaded to arXiv.
5. If a paper is produced, it receives 10 citations or more within one year of being uploaded to arXiv.
Clarifications:
Unless otherwise stated, timeframes are given from when the research begins, i.e. the start of the MARS program, 1st December 2025.
Updates to posts and papers will be considered the same entity as the original for purposes of outcome resolution (e.g. if a paper is produced and uploaded to arXiv within 9 months, but is edited after this before being accepted at a conference, (4) still resolves YES).
Some outcomes are conditional on others: outcome (2) will resolve N/A if (1) resolves NO, and outcomes (4) and (5) will resolve N/A if (3) resolves NO.
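For illustration, here is a rough Python sketch of how the averaged outcome probabilities could be used to pick one proposal per group. The proposal names and probabilities below are made up, and the scoring function is our reading of “use their average”; the actual selection will use live Manifold prices and the resolution rules above.

```python
# Illustrative selection rule: average the five outcome probabilities for each
# proposal, then choose the highest-scoring proposal within each group.
# All numbers and names here are placeholders, not real market data.

GROUPS = {
    "Group 1": {
        "Goal Crystallisation":              [0.70, 0.25, 0.55, 0.20, 0.30],
        "Post-training order and CoT":       [0.65, 0.30, 0.50, 0.15, 0.25],
    },
    "Group 2": {
        "Salient features of self-models":   [0.60, 0.20, 0.45, 0.10, 0.20],
        "Metacognitive capabilities of LLMs":[0.55, 0.15, 0.50, 0.20, 0.25],
    },
}


def score(outcome_probs: list[float]) -> float:
    """Average of the binary-outcome probabilities (outcomes 1-5 above)."""
    return sum(outcome_probs) / len(outcome_probs)


for group, proposals in GROUPS.items():
    best = max(proposals, key=lambda name: score(proposals[name]))
    print(f"{group}: choose '{best}' (score {score(proposals[best]):.2f})")
```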
The Projects
There are 3 groups of proposals, and from within each group we will choose the single predicted-best project. The groupings represent the mentors for that project and are in no specific order. To read through each of the proposals and the bios for each mentor, please see the Google doc. Currently we plan to close the markets and choose the projects on Monday the 12th of October. Below we provide a quick overview of each with a link to its respective prediction market.
Group 1: Cameron Tice
Goal Crystallisation: Investigating whether AI models that fake alignment perfectly during training retain their original goals or have them changed despite following through with deceptive alignment.
Post-training order and CoT Monitorability: Examining whether different post-training orders (applying RLHF after reasoning training) affect models’ tendency to obfuscate deceptive reasoning in their chain-of-thought outputs.
Online Learning for Research Sabotage Mitigation: Attempting to use online learning to train AI systems away from research sabotage by creating deliberately underperforming model organisms and applying RL to try and improve their research performance.
Group 2: Puria Radmard & Shi Feng
Salient features of self-models: Testing whether LLMs have genuine self-models or just recognize stylistic patterns by examining if self-recognition training generalizes across different types of content.
Exploring more metacognitive capabilities of LLMs: Investigating whether LLMs can metacognitively monitor their own internal probability distributions and predictive models, with implications for deceptive alignment and AI safety.
Group 3: Lennie Wells
Model organisms resisting generalisation: Testing whether AI models learn the distribution of tasks under which humans have good oversight, and resist generalisation beyond this distribution.
Detection game: Running a ‘detection game’ to investigate how we can best prompt trusted monitors to detect research sabotage.
Research sabotage dataset: Creating a public dataset of tasks reflecting current and future AI safety research that can be used to study underelicitation and sandbagging.
Model Emulation: Can we use LLMs to predict other LLMs’ capabilities?
Go trade!
We hope to use prediction markets to effectively choose which research projects we should pursue, as well as to conduct a fun experiment on the effectiveness of Futarchy for real-world decision making. The incentive structure of a prediction market motivates those who have good research taste or insights to implicitly share their beliefs and knowledge with us, helping us make the best decision possible. That said, anyone is free to join in and trade, and the more people who do, the better the markets perform. So we need your help! Please read through the proposals and trade on the markets, and be a part of history by taking part in this experiment!
Nice! I’ll trade on the markets. Will you randomize (some of) your choices, as dynomight suggests?
I did something similar a while ago for Quantified Self experiments.
Thank you for partaking!
Your linked experiment looks very interesting; I will give it a read. Thank you for the heads up.
We’re not going to randomise the choices: the symmetry of the actions being chosen, combined with the fact that the market both makes the decision and the mentors are trading on the markets (as suggested by Hanson), means we shouldn’t suffer from the strange failure modes that decision markets can theoretically, occasionally exhibit.
I made a Manifold interface that is better fitted to futarchy:
https://futarchy.online
Very hacky for now; if people think this is valuable, I will build it out.
From group 1 → Online learning for research sabotage mitigation:
This task suite from Goodfire might be a possibility?
Cons:
This suite was made specifically to test some notebook editor MCP, so it might need tweaking
Almost certainly has fewer tasks than the Facebook environment
It seems likely that models will by default not do super well in this environment, since presumably interp is more OOD than ML
Thank you for the suggestion!
LessWrong is increasingly being put under pressure; I hope it does not become a journal. I wish good luck to the admins.
Thanks for your engagement with the post. I’m not quite sure I understand what you’re getting at? Please could you elaborate?
I was referring to the fact that you set LessWrong posts with karma thresholds as target metrics. This kind of thing generally has the negative side effect of incentivizing exploitation of loopholes in the LessWrong moderation protocol, karma system, and community norms to increase the karma of one’s own posts. See Goodhart’s law.
I do not think this is currently a problem. My superficial impression of your experiment is that it is good. However, this kind of thing could become a problem down the line if it becomes more common. It would show up as a mix of lowered forum quality and increased moderation work.
Ahh I see what you mean now, thank you for the clarification.
I agree that, in general, people trying to exploit and Goodhart LW karma would be bad, though I hope the experiment will not contribute to this. Here, post karma is only being used as a measure, not as a target. The mentors and mentees gain nothing beyond what any other person would normally gain by their research project resulting in a highly-upvoted LW post. Predicted future post karma is just being used to optimise over research ideas, and the space of ideas itself is very small (in this experiment), so I doubt we’ll get any serious Goodharting through the selection of ideas that are perhaps not very good research but likely to produce particularly memetic LW posts (and even then, this is part of the motivation for having several metrics, so that none gets too specifically optimised for).
There is perhaps an argument that those who predicted a post would get high karma might want to manipulate it upwards to make their prediction come true, but those who predicted it would be lower have the opposite incentive. Regardless, that kind of manipulation is, I think, quite strictly prohibited by both LW and Manifold guidelines, and anyone caught doing it in a serious way would likely be severely reprimanded. In the worst case, if any of the metrics are seriously and obviously manipulated in a way that cannot be rectified, the relevant markets will be resolved N/A, though I think this is extremely unlikely to happen.
All that said, I think it is important to think about what more suitable / better metrics would be, if research futarchy were to become more common. I can certainly imagine a world where widespread use of LW post karma as a proxy for research success could have negative impacts on the LW ecosystem, though I hope by then there will have been more development and testing of robust measures beyond our starting point (which, for the record, I think is somewhat robust already).