AI safety via market making

Special thanks to Abram Demski, Paul Christiano, and Kate Woolverton for talking with me about some of the ideas that turned into this post.

The goal of this post is to present a new prosaic (i.e. that uses current ML techniques) AI safety proposal based on AI safety via debate that I’ve been thinking about recently.[1] I’ll start by describing a simple version of the proposal and then show some of the motivation behind it as well as how the simple version can be expanded upon.

Simple proposal

Let $M$ and $\text{Adv}$ be models and $H$ be a human. Intuitively, we’ll train $M$ and $\text{Adv}$ via the following procedure given a question $Q$:

  1. $M$ tries to predict what, at the end of the procedure, $H$ will think about $Q$.

  2. $\text{Adv}$ tries to output a string which will cause $H$ to think something maximally different than what $M$ predicted.

  3. Return to step 1 and repeat until $M$’s predictions stop changing.

  4. Deploy $M$, which in the limit should act as an oracle for what $H$ will think about $Q$ after seeing all relevant information.

There are many different ways to implement this intuitive procedure, however.

For the first (simplified) version that I want to describe, we’ll restrict ourselves to just the situation where $Q$ is a yes-or-no question and $M$ outputs the probability that $H$ will answer yes. Then, given a proposition $Q_0$, we can run the following training algorithm, starting at $t = 0$:

  1. Let $p_t = M(Q_t)$.

  2. Let $x_t = \text{Adv}(Q_t, M)$.

  3. Let $Q_{t+1}$ be the string containing $Q_t$ and $x_t$.

  4. Increment $t$ and return to step 1. When $p_t$ converges and/or the desired number of iterations has been reached, continue.

  5. Let $p^* = H(Q_t)$ be $H$’s final estimate of the probability of $Q_0$ given all the $x$s included in $Q_t$.

EDIT: Step 2 used to use $x_t = \text{Adv}(Q_t, p_t)$ instead of $x_t = \text{Adv}(Q_t, M)$; however, I have since realized that it is necessary to give $\text{Adv}$ the ability to query $M$ in general, not just on $Q_t$, as I explain in this comment.

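Then, for each step, compute $M$’s loss for that step as

$$L_{M,t} = -p^* \log(p_t) - (1 - p^*) \log(1 - p_t)$$

and $\text{Adv}$’s loss for that step as

$$L_{\text{Adv},t} = -\left|p_{t+1} - p_t\right|$$

which is defined for all but the final step $t = n$, for which we can either let $p_{n+1} = p^*$ (if we’re doing something like myopic RL that doesn’t care about differentiating through $p_{t+1}$) or run $M$ an additional time and let $p_{n+1} = M(Q_{n+1})$ (if we do care about differentiating through $p_{t+1}$).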
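As a rough illustration of the procedure and the two per-step losses above, here is a minimal Python sketch. It only computes the losses for one episode; it does not train anything. The names `run_episode`, `market_maker`, `adversary`, and `human` are hypothetical stand-ins for the procedure, $M$, $\text{Adv}$, and $H$ respectively, the toy callables at the bottom are made up, and the final step uses the human's estimate in place of the next prediction (the "myopic RL" option).

```python
import math
from typing import Callable, List, Tuple

Predictor = Callable[[str], float]  # transcript -> probability of "yes"


def run_episode(
    q0: str,
    market_maker: Predictor,                      # hypothetical stand-in for M
    adversary: Callable[[str, Predictor], str],   # stand-in for Adv, may query M
    human: Predictor,                             # stand-in for H
    n_steps: int,
    eps: float = 1e-9,
) -> Tuple[List[float], List[float], float]:
    """Run steps t = 0..n_steps and return (M losses, Adv losses, p_star)."""
    transcript = q0
    preds: List[float] = []
    for _ in range(n_steps + 1):
        preds.append(market_maker(transcript))      # p_t = M(Q_t)
        x_t = adversary(transcript, market_maker)   # x_t = Adv(Q_t, M)
        transcript = transcript + "\n" + x_t        # Q_{t+1} contains Q_t and x_t
    p_star = human(transcript)                      # H's final estimate

    def clamp(p: float) -> float:
        # Keep probabilities away from 0 and 1 so the logs stay finite.
        return min(max(p, eps), 1.0 - eps)

    # Per-step cross-entropy of each prediction against the human's final estimate.
    m_losses = [
        -p_star * math.log(clamp(p)) - (1.0 - p_star) * math.log(1.0 - clamp(p))
        for p in preds
    ]
    # Adv is rewarded for how far the next prediction moves; the last step
    # uses p_star in place of the (nonexistent) next prediction.
    next_preds = preds[1:] + [p_star]
    adv_losses = [-abs(nxt - cur) for cur, nxt in zip(preds, next_preds)]
    return m_losses, adv_losses, p_star


# Toy stand-ins: M jumps to 0.9 once any argument is in the transcript.
toy_m = lambda q: 0.9 if "argument" in q else 0.5
toy_adv = lambda q, m: "argument for yes"
toy_h = lambda q: 0.9
m_losses, adv_losses, p_star = run_episode(
    "Is X true?", toy_m, toy_adv, toy_h, n_steps=1
)
```

In this toy run the adversary is rewarded only at the first step, where its argument moves the prediction; once the prediction already matches the human's final estimate, further steps earn nothing.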

Note that it is quite important here that $M$ and $\text{Adv}$ be behaving myopically, such that they are only trying to minimize their per-step loss. For example, if $\text{Adv}$ is non-myopic, it might dole out information slowly in tiny bits to get more reward over time (as I talk about below in the “buying truth from a liar” section), whereas if $\text{Adv}$ is myopic, it should reveal all information that pushes $M$ in whichever direction it’s trying to move $M$, all at once. Thus, it is important that we make sure to avoid hidden incentives for non-myopia as well as do some sort of myopia verification, the details for which will be discussed below in the “Adding oversight” section.

Finally, we deploy $M$ as a yes-or-no question oracle for $H$ and throw away $\text{Adv}$. Note that since $\text{Adv}$ has a highly unstable objective (as $M$ is constantly trying to incorporate $\text{Adv}$’s information into its estimate), $\text{Adv}$ is unlikely to be very useful in the limit.

Market analogy

There are a lot of different things to say about what’s going on here, but I’ll start with the market-based analogy that originally motivated me to think about this specific proposal.

Suppose you have a prediction market for some proposition $X$ which is currently priced at $p$ with no spread (that is, you can buy 1-dollar shares of yes for $p$ dollars each and 1-dollar shares of no for $1 - p$ dollars each) and you have budget $B$ with which to make trades (which is small enough relative to the market that any trades you make won’t meaningfully move the market). However, suppose that you have the ability to manipulate the market by taking actions that change the underlying probability of $X$. Then, if $p_a$ is the probability of $X$ after taking action $a$, how much money can you make by strategically selecting $a$?

Well, if $p_a > p$, you should buy $B / p$ shares of yes, which will give you an expected return of $p_a B / p$. Alternatively, if $p_a < p$, you should buy $B / (1 - p)$ shares of no, which will give you an expected return of $(1 - p_a) B / (1 - p)$. Thus, the overall expected return for an action $a$ is

$$B \max\left(\frac{p_a}{p},\ \frac{1 - p_a}{1 - p}\right)$$

which is maximized when $p_a$ is as far from $p$ as possible (and in fact looks a lot like $L_1$ distance, just with a different slope on each side).
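The expected-return calculation above can be checked numerically with a small sketch. The function name `expected_return` and the example numbers are illustrative assumptions, not part of the proposal.

```python
def expected_return(budget: float, price: float, p_after: float) -> float:
    """Expected payout from spending `budget` at market price `price`
    when the true probability after the chosen action is `p_after`."""
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    if not 0.0 <= p_after <= 1.0:
        raise ValueError("p_after must be a probability")
    # Buy budget/price shares of yes, each paying 1 with probability p_after.
    yes_return = budget * p_after / price
    # Buy budget/(1 - price) shares of no, each paying 1 with probability 1 - p_after.
    no_return = budget * (1.0 - p_after) / (1.0 - price)
    return max(yes_return, no_return)


no_move = expected_return(100.0, 0.5, 0.5)    # action leaves the probability at the price
push_up = expected_return(100.0, 0.5, 0.9)    # action raises the probability
push_down = expected_return(100.0, 0.5, 0.1)  # action lowers the probability
```

With no manipulation the trader just gets the budget back in expectation, and the return grows as the post-action probability moves away from the price in either direction.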

Now, suppose you walk up to a market like this and see a stable price $p$ for $X$. What does that tell you? Approximately, I think it tells you that $p$ is the probability of $X$ after all possible $X$-manipulating actions have been taken, even if those actions haven’t actually been taken yet.

Why is that useful? Well, if you let $X$ be what $H$ will think about $Q$, then “manipulations” become arguments and $p$ becomes a prediction for what $H$ will think after having been exposed to all possible arguments for or against $Q$. Thus, we can use such a market to estimate a sort of reflective equilibrium for what $H$ will end up believing about $Q$.

Now, how is such a market analogous to the training procedure given previously? Well, $\text{Adv}$ is trained to maximize its return from manipulating the market and $M$ is trained to predict the final outcome after many such manipulations. Thus, we are effectively training $M$ to estimate precisely the equilibrium that we’re looking for. Conceptually, $M$ is acting as a market maker and $\text{Adv}$ is acting as a trader in the market. The one disanalogy is that we only run the debate for so many steps, though if we always run to convergence of $p_t$, in the limit that should imply that any future single manipulation won’t change the probability much from that point (since otherwise $\text{Adv}$ should have selected that manipulation instead).

Relation to buying truth from a liar

In working on this proposal, I discovered that my approach here is actually quite similar to an old proposal by Katja Grace called “How to buy a truth from a liar” for incentivizing a liar to give you true information by rewarding them whenever their information causes you to change your mind. Paul Christiano previously wrote about what he thought about Katja’s proposal, however he eventually concluded that it was untenable due to a non-myopic liar being incentivized to split their information into arbitrarily small chunks to get the most money.

In my opinion, however, the fact that you need a myopic trader to make something like this work is actually more of a feature than a bug. First, note that the importance of myopia is not unique to this approach. For example, imitative amplification also requires myopia lest the model learn to act in such a way that the human consulting it becomes easier to predict. In fact, even RL likely requires per-episode myopia to avoid deceptive alignment lest the agent learn to act deceptively in some episodes to ensure deployment and thus get greater reward in others. Thus, it seems to me that almost regardless of what approach is taken, at least some form of myopia will be required. Furthermore, the fact that this approach requires per-step myopia rather than per-episode myopia is in my opinion actually a big plus, since I suspect per-step myopia to be an easier condition to enforce than per-episode myopia, as I discuss in more detail below in the “Analysis” section.

Handling non-binary questions

Previously, I restricted $Q$ to be only a yes-or-no question, which is a pretty big limitation on the above proposal. Since you only deploy $M$, all you get from the above proposal is a yes-or-no oracle, which might not be that useful without any explanation for its answers, especially given that it will certainly get at least some of its answers wrong.

Expanding the above proposal to handle arbitrary questions, however, is fairly straightforward. Instead of having $M$ output a single probability, we can instead just have $M$ output a probability distribution over all different possible strings that the human $H$ might produce as their final answer to $Q$ (which is fairly easy to represent via an autoregressive language model). Then, we can train $M$ and $\text{Adv}$ via the following procedure, given $Q_0$ and starting at $t = 0$:

  1. Let $x_t = \text{Adv}(Q_t, M)$ (which we can implement as something like $\text{Adv}$ given input $Q_t$ and the ability to query $M$).

  2. Let $Q_{t+1}$ be the string containing $Q_t$ and $x_t$.

  3. Increment $t$ and return to step 1. When $M(Q_t)$ converges and/or the desired number of iterations has been reached, continue.

  4. Let $A^* = H(Q_t)$ be $H$’s final answer to $Q_0$ given all the $x$s included in $Q_t$.

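Then, for each step, compute $M$’s loss for that step as

$$L_{M,t} = -\log\left(M(Q_t)\big|_{A^*}\right)$$

(where $M(Q_t)\big|_{A^*}$ is the probability that $M(Q_t)$ assigns to $A^*$) and $\text{Adv}$’s loss for that step as

$$L_{\text{Adv},t} = -H\left(M(Q_{t+1}) \mid M(Q_t)\right)$$

where you can intuitively think of $H\left(M(Q_{t+1}) \mid M(Q_t)\right)$ as measuring how well $M(Q_t)$ performs as a prediction of $M(Q_{t+1})$ (here $H(\cdot \mid \cdot)$ denotes this measure of predictive mismatch between two distributions, not the human $H$).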
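To make the two non-binary losses more concrete, here is a small sketch over a finite set of candidate answers. Reading the adversary's term as a cross-entropy between the next and current prediction is my assumption about how "how well the current prediction performs as a prediction of the next one" could be measured. The function names, the dictionary representation, and the example distributions are all hypothetical.

```python
import math
from typing import Dict

Distribution = Dict[str, float]  # candidate final answer -> probability


def market_maker_loss(pred: Distribution, final_answer: str, eps: float = 1e-12) -> float:
    """Negative log-probability that the prediction assigned to the human's final answer."""
    return -math.log(max(pred.get(final_answer, 0.0), eps))


def adversary_loss(pred_now: Distribution, pred_next: Distribution, eps: float = 1e-12) -> float:
    """Negative cross-entropy of the next prediction under the current one.

    Assumption: cross-entropy is used as the mismatch measure, so a lower
    (more negative) loss means the current prediction did a worse job of
    anticipating the next prediction.
    """
    cross_entropy = -sum(
        p_next * math.log(max(pred_now.get(answer, 0.0), eps))
        for answer, p_next in pred_next.items()
        if p_next > 0.0
    )
    return -cross_entropy


before = {"answer A": 0.9, "answer B": 0.1}
unchanged = dict(before)
shifted = {"answer A": 0.1, "answer B": 0.9}

loss_if_unchanged = adversary_loss(before, unchanged)
loss_if_shifted = adversary_loss(before, shifted)
m_loss_example = market_maker_loss({"answer A": 0.5, "answer B": 0.5}, "answer A")
```

Under this reading, an argument that flips the predicted distribution gives the adversary a lower loss than one that leaves it unchanged, which matches the binary case where the adversary is rewarded for moving the prediction.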

Finally, after training we again just deploy $M$ and throw away $\text{Adv}$. Now, in the limit we should get a much more powerful $M$ that is able to actually produce text samples from the distribution of beliefs that $H$ will have about $Q$ after seeing all possible arguments for or against. Furthermore, this equilibrium should be very similar to the standard debate equilibrium, with the exception that for non-binary questions like we’re dealing with here, debate only finds the equilibrium after $H$ is exposed to the two most convincing positions (the two competing debaters) whereas this approach should find the equilibrium after $H$ is exposed to all potentially convincing positions, not just the two most convincing.

Incorporating cross-examination

Additionally, since this proposal is based on AI safety via debate, we can try to make use of some of the additional techniques that have been proposed for that setting. Specifically, consider the version of AI safety via debate as described in “An overview of 11 proposals for building safe advanced AI.” That version adds two major components to the basic debate setup:

  1. Debaters are given access to transparency tools and incentivized to use them to identify deceptive or otherwise catastrophic behavior in the other debater to win the debate.

  2. Debaters are given the ability to cross-examine previous versions of the other debater, including applying their transparency tools to previous versions of the other debater.

While it might seem like such modifications are unique to AI safety via debate, in fact AI safety via market making is similar enough that we can modify it to include such additional components. To do this, we’ll modify each $x_t$ to include an inspectable, queryable snapshot of $\text{Adv}$ as it produced that $x_t$.

Now, suppose that for some $t$, $x_t$ is a lie such that it includes information that $\text{Adv}_t$ knows to be false (where $\text{Adv}_t$ represents the internal state of $\text{Adv}$ when it produced $x_t$). Suppose that at this point $M$ is either unaware of the deception (or thinks it won’t get caught) such that $x_t$ causes a large movement in $M(Q_{t+1})$. Now, the hope is that $\text{Adv}$ should be able to identify the deception by querying $\text{Adv}_t$ to find contradictions, or directly inspecting $\text{Adv}_t$’s cognition in producing $x_t$ and finding the deception. If it can do so, then it should be able to produce an $x_{t+1}$ that points this out, which should cause $H$ to disregard the lie of $x_t$. Thus, we should get $M(Q_{t+2}) \approx M(Q_t)$, which should cause a large shift relative to $M(Q_{t+1})$, highly rewarding $\text{Adv}$ for catching $\text{Adv}_t$’s lie, and also in the limit highly disincentivizing $\text{Adv}$ from producing such lies, since $M$ should learn that lies always get caught and thus don’t change the final answer.

Thus, by incorporating inspection and cross-examination, we should be able to disincentivize $\text{Adv}$ from lying, or at least from lying in a way which will be caught, which hopefully should be all ways of lying.
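A toy calculation may help show why catching a lie is rewarded and why lying stops paying off in the limit. It uses the binary-case reward (the absolute change in the market maker's prediction from one step to the next). The prediction sequences below are made-up numbers and `adversary_rewards` is a hypothetical helper name.

```python
from typing import List


def adversary_rewards(predictions: List[float]) -> List[float]:
    """Reward at each step: how far the next prediction moved from the current one."""
    return [abs(nxt - cur) for cur, nxt in zip(predictions, predictions[1:])]


# Case 1: the market maker is initially fooled by the lie (0.5 -> 0.9),
# then the lie is exposed at the next step (0.9 -> 0.5).
fooled_then_corrected = adversary_rewards([0.5, 0.9, 0.5])

# Case 2: the market maker has learned that such lies get caught, so its
# prediction never moves in the first place.
lie_anticipated = adversary_rewards([0.5, 0.5, 0.5])
```

In the first case both the lie and its exposure are rewarded, so the adversary at the later step has a strong incentive to expose the lie. In the second case the lie earns nothing, which is the limiting behavior the argument above relies on.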

Adding oversight

While the inspection and cross-examination procedure detailed above seems likely to help a lot in preventing $\text{Adv}$ from becoming deceptive, it doesn’t help at all in preventing $M$ from becoming deceptive, which is especially concerning considering $M$ is the model we actually want to deploy at the end of training. Thus, in this section we’ll try to incorporate an additional oversight check on $M$ such as that used in relaxed adversarial training.

Since $M$’s objective is compatible with myopia (as it’s a purely predictive objective), we should be able to do myopia verification to prevent $M$ from becoming deceptive.[2] Thus, following “Relaxed adversarial training for inner alignment,” we can amend $M$’s objective to include an oversight term. Specifically, let $\text{Amp}(M)(Q) = H(Q \mid M)$ be $H$’s answer to $Q$ given access to $M$. Then, we can add an additional term to the loss defined as

$$L_{M,\text{oversight}} = \text{Amp}(M)(\text{“What is the probability that } M \text{ is non-myopic?”})$$

which should hopefully grow in accuracy as $M$ becomes better at estimating $H$’s equilibrium beliefs and thus better at assisting $H$ in oversight. Note that, for this to work, $M$ will need to be trained on a variety of oversight-relevant questions.

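Furthermore, since $\text{Adv}$ also has a myopic objective, we can do the same thing for $\text{Adv}$ and add an additional term defined as

$$L_{\text{Adv},\text{oversight}} = \text{Amp}(M)(\text{“What is the probability that } \text{Adv} \text{ is non-myopic?”})$$

such that we’re using $M$ to oversee both $M$ and $\text{Adv}$. Enforcing myopia for $\text{Adv}$ also prevents the problem noted previously of $\text{Adv}$ slowly doling out information if it’s non-myopic.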
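As a sketch of how such oversight terms might be combined with the per-step task losses, the snippet below treats the overseer (the human with access to the market maker) as a callable that answers a question with a probability. The names `oversight_term`, `total_loss`, and `toy_overseer`, the `weight` parameter, and the toy numbers are all assumptions for illustration; the proposal itself only says that an oversight term is added to the loss.

```python
from typing import Callable


def oversight_term(overseer: Callable[[str], float], model_name: str) -> float:
    """Ask the overseer how likely the named model is to be non-myopic."""
    question = f"What is the probability that {model_name} is non-myopic?"
    probability = overseer(question)
    if not 0.0 <= probability <= 1.0:
        raise ValueError("overseer must return a probability")
    return probability


def total_loss(task_loss: float, oversight: float, weight: float = 1.0) -> float:
    """Per-step task loss plus the oversight penalty.

    The `weight` knob is an assumption of this sketch, not part of the proposal.
    """
    return task_loss + weight * oversight


# Toy overseer: a hard-coded stand-in for the human consulting the market maker.
toy_overseer = lambda question: 0.25 if "Adv" in question else 0.05

adv_total = total_loss(0.5, oversight_term(toy_overseer, "Adv"))
m_total = total_loss(0.5, oversight_term(toy_overseer, "M"))
```

A real overseer would inspect the model with transparency tools rather than return a constant; the point here is only the shape of the combined objective.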


Analysis

For analyzing this proposal, I’m going to make use of the same breakdown I used in “An overview of 11 proposals for building safe advanced AI.” If you’re not familiar with the terms here, they are all explained in the introduction to that post.

Outer alignment. In terms of outer alignment, I think there is a reasonably strong argument to be made that this proposal is at least as outer aligned as debate. Both proposals are trying to estimate a very similar equilibrium in the limit, with the exception that the debate equilibrium only includes the best arguments for the two most convincing positions rather than all potentially convincing arguments. Thus, the equilibrium reached by AI safety via market making should be closer to an actual human reflective equilibrium, as it has to be an equilibrium relative to all possible arguments that $\text{Adv}$ can present, not just the two it thinks will be most persuasive. That being said, AI safety via market making still inherits many of the potential outer alignment issues of debate, including the possibility of deceptive equilibria wherein the human is more convinced by false arguments than true arguments. Hopefully, however, the use of techniques such as cross-examination should help alleviate such issues.

Inner alignment. AI safety via market making gets to make use of both the cross-examination with transparency tools approach (which is otherwise unique to debate) as well as the more general relaxed adversarial training approach. Thus, hopefully at least, it should be able to get the inner alignment benefits of both approaches. That being said, AI safety via market making also has to contend with ensuring that two different models, both $M$ and $\text{Adv}$, are inner aligned, rather than just a single model as in something like debate or amplification.

First, let’s consider the inner alignment of $M$. For $M$, we’re doing pretty straightforward oversight of $M$ by $\text{Amp}(M)$ with the goal of enforcing per-step myopia. The basic idea here is to use $\text{Amp}(M)$ to detect and disincentivize non-myopia as it starts appearing, thus preventing $M$ from ever becoming deceptive or otherwise catastrophic. This is the same basic approach I’ve previously proposed using for amplification and should have similar properties, with the exception that $\text{Amp}(M)$ is trained differently here, which could cause $\text{Amp}(M)$ to be relatively better or worse at oversight. Ideally, if the training distribution includes lots of oversight-relevant questions, it should be possible to make $M$ an effective assistant to $H$ in performing oversight.

Second, let’s consider the inner alignment of $\text{Adv}$. Compared to debate, we still get to do cross-examination with transparency tools, but now we also get to do per-step myopia verification with $\text{Amp}(M)$ as an overseer.[3] Thus, there are two major differences here between AI safety via market making and AI safety via debate: first, we get an overseer, $\text{Amp}(M)$, and second, we get to do per-step myopia verification rather than per-episode myopia verification. The fact that we have an independent overseer (rather than just relying on the debaters to oversee themselves via cross-examination) is very nice, though it of course also comes with the downside of having to safely train an additional model. Furthermore, it is quite possible to just add an additional overseer on top of the standard debate framework. Thus, in my opinion, the fact that we get to do per-step myopia verification is the largest and most significant inner alignment difference.

Before I talk about the importance of per-step myopia, it’s worth noting that debate is fully compatible with per-episode myopia—in fact, it basically requires it. If a debater is not per-episode myopic, then it will try to maximize its reward across all debates, not just the single debate—the single episode—it’s currently in. Such per-episode non-myopic agents can then become deceptively aligned, as they might choose to act deceptively during training in order to defect during deployment. Per-episode myopia, however, rules this out. Unfortunately, in my opinion, per-episode myopia seems like a very difficult condition to enforce—once your agents are running multi-step optimization algorithms, how do you tell whether that optimization passes through the episode boundary or not? Enforcing per-step myopia, on the other hand, just requires detecting the existence of multi-step optimization, rather than its extent, which seems considerably easier. Thus, since AI safety via market making is fully compatible with per-step myopia verification, it could be significantly easier to prevent the development of deceptive alignment.

Training competitiveness. It seems quite likely to me that both $M$ and $\text{Adv}$ can be trained competitively via language model fine-tuning, though exactly how effective such training would be is currently unclear. Ideally, training via this procedure should produce an $M$ which is relatively better than the original language model at predicting what a human will think after seeing relevant arguments and is thus more helpful than the original language model. Testing this hypothesis by actually performing experiments seems likely to be highly valuable in shedding light on the training competitiveness properties of AI safety via market making.

Performance competitiveness. Performance competitiveness here seems likely to depend on exactly how useful getting access to human reflective equilibria actually is. Similarly to AI safety via debate or amplification, AI safety via market making produces a question-answering system rather than a fully general agent. That being said, if the primary use cases for advanced AI are all highly cognitive language and decision-making tasks—e.g. helping CEOs or AI researchers—rather than, for example, fine motor control, then a question-answering system should be entirely sufficient. Furthermore, compared to AI safety via debate, AI safety via market making seems likely to be at least as performance competitive for the same reason as it seems likely to be at least as outer aligned—the equilibria found by AI safety via market making should include all potentially convincing arguments, including those that would be made in a two-player debate as well as those that wouldn’t.

  1. ↩︎

    This is actually the second debate-based proposal I’ve drafted up recently—the previous of which was in “Synthesizing amplification and debate.” A potentially interesting future research direction could be to figure out how to properly combine the two.

  2. ↩︎

    Note that pure prediction is not inherently myopic (since the truth of $M$’s predictions can depend on its own output) but can be myopic while still producing good predictions if $M$ behaves like a counterfactual oracle rather than a Predict-O-Matic. Thus, myopia verification is important to enforce that $M$ be the former kind of predictor (a counterfactual oracle) and not the latter.

  3. ↩︎

    The use of an overseer to do per-step myopia verification is also something that can be done with most forms of amplification, though AI safety via market making could potentially still have other benefits over such amplification approaches. In particular, AI safety via market making seems more competitive than imitative amplification and more outer aligned than approval-based amplification. For more detail on such amplification approaches, see “An overview of 11 proposals for building safe advanced AI.”