Stop-gradients lead to fixed point predictions

Johannes Treutlein and Rubi Hudson worked on this post as part of SERI MATS, under the mentorship of Evan Hubinger. Rubi has also received mentorship from Leo Gao. We thank Erik Jenner for helpful discussions and Alexander Pan for bringing the performative prediction literature to our attention.

Update 30 May 2023: We have now published a paper based on our previous post and this post (the material from this post is in Appendix D).

1. Introduction

In our previous post, we analyzed a setting in which an oracle AI is maximizing a strictly proper scoring rule while it can influence the world with its predictions about the probabilities of possible outcomes. In this setting, a prediction is called a fixed point if it accurately reflects the oracle’s beliefs about the world, after having made that prediction. We showed that an oracle AI that is maximizing a strictly proper scoring rule may choose to misrepresent its beliefs and not output a fixed point, even if one exists. This is bad since, all else equal, we expect better outcomes when decisions are based on accurate rather than inaccurate reports.

In our previous analysis, we assumed that the oracle jointly optimizes both its prediction and the effect of its prediction on the world. In contrast, in this post we examine cases in which the AI only optimizes its prediction—there is a stop-gradient (Foerster et al., 2018; Demski, 2019) in front of the oracle’s model of the world, which prevents optimizing outcomes directly.

This type of cognition could result, for instance, from self-supervised training on historical data. In that case, the AI may learn a robust world model, and it may learn to match its predictions to its beliefs about the world. However, the AI may never learn to optimize outcomes through its prediction, since this yields no advantage on the training data, which cannot be influenced by the AI.

Consider a situation where this oracle makes a prediction that can influence the world, and assume it models this influence correctly. Then this could put the AI in a game in which it is trying to make a prediction to match its world model, while the world model updates its beliefs conditional on the AI’s prediction. What happens in this game depends on the cognition learned by the AI during training. However, we show that, in contrast to the incentives discussed in our previous post, the only equilibria of this game would be fixed points. Such an oracle could thus be safer if there exists a unique safe fixed point, or if randomly chosen fixed points are likely safe.

In this post, we review and analyze several settings related to stop-gradients with the property that equilibria in them are fixed points or close to fixed points. Our goal is (i) to gain a better understanding of stop-gradient optimization in oracles and (ii) to highlight some potential training setups and agent cognitions that would lead to fixed points.

Similar ideas to the ones outlined above have been discussed in the literature on performative prediction (Perdomo et al. 2020). Performative prediction is a more general framework in machine learning in which the choice of model has an influence on the modeled distribution. Our analysis of oracle AIs is closely related and can be seen as a special case of this setting. We will indicate similarities where appropriate and make use of concepts from that literature.

First, we introduce performative stability and relate it to the game outlined above. We show that performatively stable predictions and equilibria in the game are fixed points. Second, we introduce versions of the repeated risk minimization and repeated gradient descent algorithms. We show that both lead to fixed point predictions. Third, we discuss practical methods based on repeated gradient descent that can be applied to an online learning setting, including Armstrong (2018)’s backwards-facing oracle.

Finally, we discuss no-regret learning and prediction markets. We introduce a no-regret learning framework and show that policies with sublinear regret also have sublinear prediction error. In our decision market model, we show that, if the weight of each trader is small, the equilibria of the market are close to fixed points.

Oracles with stop-gradients may optimize outcomes to find fixed points, which could lead to bad consequences. Thus, the safest oracles would be ones that make predictions only about aspects of the world they cannot influence. However, among oracles that can influence the world, ones with a stop-gradient are preferable for two reasons. First, they report their true beliefs, which gives us better information to base decisions on. This also enables approaches in which we ensure that there is only one safe fixed point. Second, the AI does not optimize its choice of fixed point, which is safer for the standard reasons of not wanting to optimize for an unaligned goal. Which fixed point is chosen will be contingent on initialization and specifics of the fixed point finding procedure.

2. Formal setting

Our setup is analogous to the one in our previous post. We refer the reader to that post for more context. We will generalize our setting to predictions over more than two possible outcomes in an upcoming paper, but in this post we focus on binary predictions for simplicity. We believe that the results in this post generalize to higher dimensions.

An oracle reports a probability for a binary event. A scoring rule is a function where is the extended real line. Given prediction and outcome the oracle receives the score We write

for the expected score of predicting given that is Bernoulli-distributed with parameter A scoring rule is called proper if

for all It is called strictly proper if this inequality is strict whenever

As in the previous post, we will use a result from Gneiting and Raftery (2007, Theorem 1).

Theorem 1 (Gneiting and Raftery). Let be a scoring rule. Then is strictly proper if and only if there exists a strictly convex function with subderivatives (if is differentiable, this is just the derivative, ) such that

for any .

To model the oracle’s perceived influence on the world, we assume that there is a function describing how the model’s beliefs about the world, , depend on its prediction, . That is, we assume that . We say that is a fixed point if i.e., if the oracle’s belief is after updating on making the prediction

3. Performative stability and game theory

Our discussion of oracles is related to performative prediction (Perdomo et al. 2020). This is a machine learning setting where we choose a model parameter (e.g., parameters for a neural network) that minimizes expected loss (e.g., classification error). In performative prediction, the distribution over data points can depend on the choice of model parameter. Our setting is thus a special case in which the parameter of interest is a probability distribution, the loss is a scoring function, and data points are discrete outcomes. Most results in this and our previous post have analogues in performative prediction.

Translated into our setting, we say that a prediction is performatively optimal if

This corresponds to the objective considered in our previous post, of an oracle that maximizes score. By results in that post, performatively optimal predictions are not necessarily fixed points.

Now consider an oracle that chooses predictions optimally given beliefs , but without optimizing explicitly. This idea is captured by the definition of performative stability.

Definition 2 (Performative stability (Perdomo et al. 2020)). A prediction is called performatively stable if

Note that is fixed when taking the maximum—there is a “stop-gradient” before in this objective. It follows that performatively stable points are fixed points.

Proposition 3. Assume is strictly proper. Then a prediction is a fixed point if and only if it is performatively stable.

Proof.”. Assume . Then, since is proper,

for any Hence,

”. Assume . Since is strictly proper, it follows . ◻

Having introduced performative stability and related it to fixed points, we will now discuss the game from the introduction.

Definition 4 (Oracle game). Consider a two-player continuous game in which the first player controls and the second player controls , with utility functions and for the two players, respectively.

If is a Nash equilibrium of the oracle game, we have and . Substituting the optimum value for the second player gives us exactly above definition of performative stability. Conversely, if a prediction is performatively stable, then setting yields a Nash equilibrium.

Proposition 5. Assume is a proper scoring rule. Then for and , is a Nash equilibrium of the oracle game, if and only if is performatively stable. By Proposition 3, this is equivalent to being a fixed point.

The oracle game could be played between two submodules of an oracle, one who is responsible for making predictions and one who is responsible for updating the oracle’s beliefs. Those two modules might collapse into just one big module taking a question and outputting a prediction, but it is still useful as a starting point to model them separately. Note that using the squared error for the world model is not essential here. We could also use any other function that is minimized at .

The game could also arise in an agent that uses causal decision theory to maximize its score and that believes that is influenced causally by , but only acausally by . In that case, the only ratifiable (see Jeffrey, 1983, Ch. 1.7) decision is a Nash equilibrium of the above game. Similarly, the deliberational causal epistemic decision theory discussed by Greaves (2013) would output Nash equilibria of this game (whereas performative optimality would correspond to an agent using evidential epistemic decision theory in this case).

Perdomo et al. (2020) introduce a stackelberg version of the oracle game that produces performatively optimal instead of performatively stable reports. Consider a game in which player acts first and chooses , after which player adjusts its prediction Then player will choose , so player ’s optimization problem becomes

4. Repeated risk minimization and repeated gradient descent

Above, we have defined an alternative optimization problem and an associated game which yield fixed points, but we have not defined methods for solving these problems. In the performative prediction context, Perdomo et al. (2020) introduce repeated risk minimization and repeated gradient descent, both methods that converge to performatively stable points. In this section, we review both schemes and show how repeated gradient descent can be seen as gradient descent on a stop-gradient objective.

Here, we assume direct access to instead of having only access to samples distributed according to . In the next section, we discuss online learning when we only have access to samples. One way to understand this distinction is that the former corresponds to the internal cognition of an agent with a belief optimizing a prediction The latter corresponds to a machine learning training setup for an oracle, where is the ground truth distribution instead of the oracle’s belief. Of course, there is no strict divide between the two. Any optimization algorithm could be used either by the agent itself or to train the agent.

First, repeated risk minimization is a procedure by which we start with a prediction and then iteratively update the prediction as . Note that this is equivalent to alternating best response learning in the oracle game, where players and alternatingly optimize their actions given the action by the other player. Then player ‘s update is and if is strictly proper, player ’s update is This shows that repeated risk minimization results in fixed point iteration on . Fixed point iteration converges globally to a fixed point if has Lipschitz constant . It also converges locally to a fixed point if is continuously differentiable at and .

Second, assume that is differentiable. Then repeated gradient ascent updates predictions via

where denotes the expectation over outcome with Bernoulli distribution with parameter , is the projection onto , and is the learning rate.

Using the definition of , we have

We can express this as

where, is the stop-gradient operator. It evaluates to the identity function but sets gradients to zero, (Foerster et al., 2018; Demski, 2019). This is not a mathematical function (there is no function that is equal to the identity but has gradient zero everywhere), but rather a notational convention in reference to the stop_gradient or detach functions from tensorflow or pytorch. Interestingly, one can perform valid derivations using the stop-gradient operator (e.g., using the chain rule). We leave it to future work to explore the mathematics behind stop-gradients further.

Importantly, it matters that the gradient in repeated gradient ascent lies inside instead of outside the expectation:

Unlike repeated gradient ascent, the latter implements gradient ascent on and thus leads to performatively optimal reports.

Perdomo et al. (2020) show that, given their assumptions, repeated gradient descent globally converges to stable fixed points, and they provide convergence rates. We will show an analogous result relating repeated gradient ascent to fixed points in our setting, though we won’t analyze global convergence.

To begin, we show that repeated gradient ascent is equivalent to naive learning (Letcher et al., 2019) in the oracle game, assuming that player always plays .

Proposition 6. Assume player is performing gradient ascent on its objective with learning rate , under the assumption that player plays . Then player ’s update is

Proof. The proof follows immediately from the definitions. Player ’s update is, by assumption,

where is player ’s action. Assuming player plays , we get

Next, we show that fixed points are critical points of the fixed-point objective.

Proposition 7. Assume is proper and let as in the Gneiting and Raftery characterization (Theorem 1) be differentiable. Then for any we have

In particular, if is a fixed point, it follows that . The reverse is true if

Proof.

Finally, we show that in our setting, repeated gradient ascent locally converges to fixed points , with linear convergence rate, assuming that and .

Theorem 8. Let be a strictly proper scoring rule. Let be a fixed point of such that is twice continuously differentiable at and . Moreover, assume is continuously differentiable at and . Then, for small enough , there exists an open set with such that for , an agent taking updates will linearly converge to .

Proof. In Appendix A. This is a standard convergence proof. Consider the discrete dynamical system defined by the agent’s updates By Proposition 7, it has a fixed point at . We show that . This means that the fixed point is stable and thus the system will locally converge to it given small enough . ◻

5. Online learning

Now consider a machine learning setup in which we train an oracle with stochastic gradient ascent on environment samples. We assume that at time a model makes a prediction and receives a score where is Bernoulli-distributed with parameter The model is then updated using gradient ascent on That is, for some learning rate schedule we have

We discuss this as a theoretical model for oracles trained using machine learning, to show how training setups may incentivize predicting fixed points. There are many issues with the setting beyond giving accurate predictions; for instance, even if the training process sets the right incentives on training examples, the learned model may be optimizing a different objective when generalizing to new predictions.

To see that above setup incentivizes predicting fixed points, note that

That is, the expectation of this gradient, conditional on is exactly the repeated gradient from the previous section. Hence, given the right assumptions, this converges to fixed points instead of performative optima. We do not show this here, but an analogous result in performative prediction was proved by Mendler-Dünner et al. (2020).

There are several variations of this setup that essentially set the same incentives. For instance, one could also draw entire batches of outcomes and then perform updates based on the batch gradient This is a monte carlo estimate of the repeated gradient and thus also converges to performatively stable points and thus fixed points (Perdomo et al., 2020). One could also mix the two algorithms and, e.g., perform gradient ascent on an average of past losses, yielding a version of the backwards-facing oracle discussed in Armstrong (2018). In Appendix D, we show that that oracle can only converge to fixed points.

Note that finding fixed points depends on the fact that we differentiate instead of the expectation If we used policy gradients to differentiate , for instance, we would again optimize for performative optimality. Similarly, we could learn a Q-function representing scores for each prediction, and update the function based on randomly sampled predictions . Then the Q-function would converge to estimates of , and the highest Q-value would be a performative optimum. There are also some more recent papers in performative prediction that explicitly try to estimate the gradient and thus find performatively optimal instead of stable points (Izzo et al., 2021).

Stop-gradients could also be circumvented in a hidden way (Krueger et al., 2020). For instance, consider a hyperparameter search to meta-learn a learning algorithm, where the evaluation criterion is the accumulated loss during an episode. Then this search would prefer algorithms that optimize directly, without a stop-gradient.

Lastly, repeated gradient descent is related to decoupled approval in RL (Uesato et al., 2020). The decoupled approval policy gradient samples actions and approval queries independently and can thus differentiate with a stop-gradient in front of the approval signal. In our setting, we can differentiate through directly, so it is not necessary to calculate this gradient with a decoupled policy gradient. Decoupled gradients could be used to implement the stop-gradient objective if scores were discrete or otherwise not differentiable.

6. No-regret learning

In this section, we consider no-regret learning and show that algorithms have sublinear regret if and only if their prediction error is sublinear. Regret takes the environment probabilities as given and asks which predictions would have been optimal in hindsight. It thus puts a “stop-gradient” in front of those environment probabilities.

As in the previous section, we assume that at time the agent makes a prediction and receives a score , where The agent’s cumulative score at step is defined as . In no-regret learning, we compare performance against experts, which choose sequences of probabilities . We assume that an expert’s prediction is independent of conditional on . I.e., an expert knows the predictions and thus probabilities , but it does not know the outcome of . Let the set of all such experts.

The regret of the agent is the difference between the cumulative score received by the best expert in expectation and the cumulative score received by the agent. To define it formally, let

for . is a random variable that maximizes the expectation of before is drawn, but conditional on .

Definition 9 (Regret). The regret of agent at time is

The agent is said to have sublinear regret or no-regret if

First, note that we define regret relative to the best expert in expectation instead of the best expert in hindsight. The latter would always be the one that made confident predictions and accidentally got all predictions exactly right. To achieve sublinear regret, it would thus be too much to ask the agent to perform well compared to the best expert in hindsight. Moreover, for scoring rules with this expert would have a constant score such that This would reduce the problem to minimizing the negative score and thus finding performatively optimal predictions.

Second, we evaluate the performance of the expert with respect to the environment outcomes generated by the agent instead of evaluating the expert according to outcomes generated using their own predictions. This means that, to receive sublinear regret, the agent only has to make accurate predictions—it does not have to find a performatively optimal prediction. Our setting is thus different from the no-regret learning setup discussed in Jagadeesan et al. (2022), where regret is defined with respect to In that setting, only agents converging to performatively optimal predictions have sublinear regret.

We begin by showing that the best expert in expectation actually exists, and that .

Proposition 10. Let be a proper scoring rule and an expert. Then for any we have

Moreover, we have and thus

Proof. Let and let be any expert. Conditional on , has parameter and is independent of by assumption. Hence,

Next, since is proper,

It follows that

Moreover, , as is constant given and thus independent of .

It follows that, for any , and thus

If is unbounded (such as the log scoring rule), then the agent’s scores can become arbitrarily low, and the limit of may be undefined. To simplify our analysis, we will thus assume that there is a bound on the variance of the received score and on the expected score of both the agent and the best expert . In the case of the log scoring rule, this would be satisfied, for instance, if the agent’s predictions are bound away from and .

Our next proposition shows that, given these assumptions, the limit exists and is nonnegative, and sublinear regret is equivalent to

Proposition 11. Let be a proper scoring rule. Assume that and that for . Then almost surely

In particular, almost surely both limits exist and are finite, and the agent has sublinear regret if and only if

Proof. In Appendix B. ◻

Now we turn to the main result for this section. We show that given our assumptions, agents have sublinear regret if and only if their prediction error is sublinear. Note that here, we do not require the to converge; they could also oscillate between different fixed points.

Theorem 12. Let be the sequence of the agent’s predictions and a strictly proper scoring rule. Assume that for , and assume that there exists a compact set such that for all and , and are continuous in at any . Then almost surely the agent has sublinear regret if and only if is sublinear, i.e., if .

Proof. In Appendix C. ◻

The next result shows that if the agent’s probabilities converge to some probability , then must be a fixed point.

Corollary 13. In addition to the assumptions from Theorem 12, assume that converges almost surely to a limit . Then almost surely is a fixed point if and only if the agent has sublinear regret.

Proof. By Theorem 12, almost surely the agent has sublinear regret if and only if

It remains to show that, given that the converge, the latter is equivalent to convergence to a fixed point.

Since is compact and for all also Hence, is continuous at so

Since this sequence converges, it is equal to its Cesàro mean,

Hence,

It follows that, if then

This shows that, almost surely, is a fixed point, if and only if is sublinear. ◻

7. Prediction markets

Lastly, we consider prediction markets. We assume a simplified model of a prediction market, in which traders submit a single prediction and get scored using a proper scoring rule. The prediction that is output by the market and that influences the outcome is just a weighted average of the individual traders’ predictions. In this situation, if a trader has a small weight and can thus barely influence the market prediction, the trader’s score will mostly be determined by the accuracy of the report, rather than the influence of the report on the market. Thus, if all traders are small relative to the market, the equilibrium prediction will be close to a fixed point.

A similar result was shown by Hardt et al. (2022) in the performative prediction context. They define a firm’s performative power as the degree to which the firm can influence the overall outcome with their prediction. Hardt et al. (2022) show that in an equilibrium, the distance between a player’s (performatively optimal) equilibrium strategy and their strategy when optimizing loss against the fixed equilibrium distribution (here, this means predicting the market probability) is bounded by the power of the trader. We give an analogous result for our formal setting and assumptions.

To formalize the setting, assume that there are players. We associate with each player a number such that , representing, intuitively, what fraction of the overall capital in the market is provided by player . In the game, all players simultaneously submit a probability . Then the event occurs with probability . Finally, each player is scored in proportion to for some strictly proper scoring rule . Typical market scoring rules would consider terms like but subtracting (or multiplying by constants) does not matter for the game. We assume that players maximize their expected score,

For discussions of market scoring rules, see Hanson 2003 and Sami and Pennock 2007. Prior work has connected these market scoring rules to more realistic prediction markets that trade Arrow—Debreu securities markets such as PredictIt (Hanson 2003; Hanson 2007; Pennock and Sami 2007, Ch. 4; Chen and Pennock 2007; Agrawal et al. 2009; Chen and Vaughan 2010).

We assume that is common knowledge. Moreover, in the following we only consider pure strategy equilibria, and we do not investigate the existence of equilibria.

Theorem 14. Let be a proper scoring rule and let as in the Gneiting and Raftery characterization. Let be a pure strategy Nash equilibrium of the game defined above and let be the market prediction. Assume is differentiable at . For any player , if are differentiable at and it follows that

Whenever , the bound is tight.

In particular, this theorem shows that players with very low (little capital/​influence on ) will accurately predict . Note, however, that is not necessarily a fixed point or close to a fixed point. If there are are also players with very high , then their prediction and the overall market prediction may be wrong. (So interestingly the overall market probability is worse than the prediction of individuals. One might take this to suggest that anyone interested in should look at the latter type of predictions. Of course, if this is what everyone does, it is not so clear anymore that the model is accurate.)

Proof. The proof is analogous to the proof of Proposition 9 in our previous post.

For to be a pure strategy Nash equilibrium, each player must play a best response to the other player’s strategies. That is, must be a global maximum of the function . Therefore, the derivative of this function must be zero if and positive/​negative if /​.

The derivative is

Now, if , we have

Rearranging terms and taking the absolute value, it follows

Next, assume is the optimal report for player Then we must have

By Theorem 1, is convex and thus Hence, the above is equivalent to . Since , it follows . Hence,

Finally, if , then analogously and since it follows again This concludes the proof. ◻

Corollary 15. In addition to the assumptions from Theorem 14, assume that is Lipschitz-continuous and that Let be a Nash equilibrium and let arbitrary. Then there exists a such that if for all , all of and , for all , as well as and are within of each other.

Proof. Let arbitrary. Let be the Lipschitz constant of and note that then for all By Theorem 14, it follows for and any player that

Now let and Then, assuming for all it follows

Moreover, since is a convex combination of probabilities it follows that also

Thus, by the triangle equality, we have and since is Lipschitz-continuous,

for any

This shows that all of are within of and thus by the triangle inequality within of each other. This concludes the proof. ◻

It would be interesting to extend these results. For example, it’s already not so clear if the players make predictions repeatedly. (To keep things simple, one should probably still imagine that all players know and that the environment probability is determined by applied to the majority forecast. If the traders have private information, prediction markets become harder to analyze. For some discussions, see Ostrovsky 2012, Chen and Waggoner 2016.)

Appendix A. Proof of Theorem 8

We use the following theorem, adapted from “Iterative Solution of Nonlinear Equations in Several Variables” (Ortega and Rheinboldt, 2000).

Theorem 16 (Ostrowski). Assume that has a fixed point and is continuously differentiable at . If for all eigenvalues of , then there exists an open set with such that for any , letting for it is for all and converges at least linearly to .

To begin, assume that is a fixed point of , that is continuously differentiable at , and that . Moreover, let as in the Gneiting and Raftery representation of , and assume that is twice continuously differentiable at and thus is continuous at .

Define the function

Note that the agent’s update rule is then defined via .

First, by Proposition 7, we know that for any we have , and if is a fixed point of , we must have and hence . Hence, is a fixed point of and thus also of .

Now we will apply Theorem 16 to . To that end, let

Then

since Moreover, by assumption, and . Hence, it follows that .

Now note that

Hence, for small enough , it is Moreover, by assumption, and are continuous, so also the derivative is continuous.

Now we can apply Ostrowski’s theorem, which tells us that, if , there is an open set with , such that for any , the iterates all lie in and converges at least linearly to .

This shows that if is an interior point, we can choose small enough to make sure that and thus on , and for the iteration locally converges to .

Now assume that (the proof for is analogous). We can extend the function for values by the Whitney extension theorem. This theorem tells us that if is continuously differentiable, then there also exists such that . We can then apply the above argument with Ostrowski’s theorem to the function . In particular, there exists an open subset with such that for any , we have for iterates and . This also applies to any .

Now consider the actual update rule with iterates and . Let ; i.e., this is the smallest such that an iterate is out of bounds and thus . If , then we are done. Otherwise, we have , so the actual update rule also converges to its fixed point (potentially faster than ). This concludes the proof.

Appendix B. Proof of Proposition 11

We will use a version of the strong law of large numbers for uncorrelated random variables with bounded variance, adapted from Neely (2021, Theorem 2).

Theorem 17. Let be a sequence of pairwise uncorrelated random variables with mean and bounded variances. I.e., assume that

for all

There exists such that for all

for all .

Then almost surely

We will apply this law to random variables , where is either or .

First, by Proposition 10, . Second, by assumption,

Hence, also

It follows that also .

Third, we know that is independent of and for , conditional on . Moreover, is constant given . Hence, given , also is independent of . Moreover,

It follows for that

This shows all conditions of the theorem and thus shows that

almost surely.

Now we turn to the limit of . By assumption, , so this limit exists and is finite. Thus, almost surely

Using Proposition 10, it follows that almost surely

Turning to the “in particular” part, note that this limit is finite by the above, and it is nonnegative since is assumed to be proper. Moreover, it follows that almost surely

Thus, almost surely if and only if This concludes the proof.

Appendix C. Proof of Theorem 12

We begin by proving a lemma.

Lemma 18. Let and assume there exists a constant such that for all we have Assume that for any , there exists such that if for any , then . Then

Proof. We prove the contrapositive. That is, we assume that there exists some constant such that there are infinitely many such that . Let be the set of such . We show that then there exists a constant such that for infinitely many , .

Let . Since by assumption it follows that Let . Since it must be for more than fraction of the times . Otherwise, it would be

By assumption, this gives us a such that whenever , also . In particular, this applies to at least fraction of . Hence, it follows that for any ,

This shows that there are infinitely many such that and thus concludes the proof. ◻

Now we turn to the main proof.

Proof of Theorem 12. Let be the sequence of the agent’s predictions. Assume is strictly proper, assume that for and assume that there exists a compact set such that for all , and and are continuous in at any .

To begin, note that continuity of and implies that both are also bounded on and thus for . Hence, by our assumptions, the conditions for Proposition 11 are satisfied.

”. Assume is sublinear. We want to show that then is sublinear. To do this, we will apply Lemma 18.

To begin, define and note that since is proper. By Proposition 11, it follows that if is sublinear, also is sublinear almost surely. For brevity, we omit the “almost surely” qualification in the following.

Next, define , and note that . Next, let arbitrary. To apply Lemma 18 to and , it remains to show that there exists such that whenever , then .

To that end, let

Since is continuous for any , the set is compact. Moreover, and are continuous by assumption, and thus the minimum is attained at some point . But since is strictly proper, it follows Hence, since for any it follows that whenever it follows

This shows all conditions for Lemma 18. Hence, we conclude that .

”. Let and . We assume that is sublinear in and want to show that then is sublinear as well. To do so, we will show that is sublinear using our lemma, and then the required statement follows again from Proposition 11.

Now we have to show the conditions of the lemma. First, as before, and Second, as noted in the beginning, we have by our assumption that and are continuous on Now let arbitrary. Assume that for some and

Consider the set . Since and are continuous on by assumption, this set is compact. Moreover, the function is continuous since is continuous on by assumption. Hence, the minimum is attained at some point

Now, if we would have and thus

which is a contradiction. Hence, Since it follows from for that This shows the third condition for the lemma. We can thus conclude that Using Proposition 11, this concludes the proof. ◻

Appendix D. Armstrong’s backwards-facing oracle

Here, we analyze a version of Armstrong (2018)’s backwards-facing oracle. This is a version of the online learning setup from Section 5 in which the agent’s prediction is trained to minimize the average historical loss,

We consider training via gradient descent and let for some learning rate

In the following, we show that, if this learning scheme converges to a point then Afterwards, we conclude from Proposition 7 that must be a fixed point (assuming that ).

Proposition 19. Assume for are the agents predictions and almost surely for some Assume that is continuous and that exists and is continuous for any and . Then

Proof. To begin, note that since we can choose a closed interval such that . By assumption, is continuous and thus bounded for . For each , let be the projection of onto . Finally, let .

We will show the following:

  1. as

  2. for all , as

  3. as

from which it follows that .

1. “ as ”. Note that there almost surely exists some such that and thus also for all Since as it follows that as . Moreover, we have that almost surely for sufficiently large, so that almost surely and for sufficiently large. Thus, almost surely,

Finally, since is bounded, we have by the dominated convergence theorem that and as a consequence,

2. “for all , as ”. We have that

Since is continuous, we have . Then, by compactness, we have that is bounded on . Finally, by the dominated convergence theorem, we may conclude as . As a consequence, as .

Thus,

3. “ as ”. Note that

as , since and are both continuous functions of on .

Finally, by the dominated convergence theorem and our second result,

And we are done. ◻

Corollary 20. Assume is proper and let as in the Gneiting and Raftery characterization (Theorem 1) be continuously differentiable at any Assume is continuous, as defined above converges to some prediction almost surely, and Then is a fixed point of .

Proof. By Proposition 7, we have for any Thus, since is continuous by assumption, also is continuous. Hence, by Proposition 19, it follows that By Proposition 7, it follows that is a fixed point of . ◻