## Counterfactual outcome state transition parameters

Today, my paper “The choice of effect measure for binary outcomes: Introducing counterfactual outcome state transition parameters” has been published in the journal Epidemiologic Methods. The version of record is behind a paywall until December 2019, but the final author manuscript is available as a preprint at arXiv.

This paper is the first publication about an ambitious idea which, if accepted by the statistical community, could have significant impact on how randomized trials are reported. Two other manuscripts from the same project are available as working papers on arXiv. This blog post is intended as a high-level overview of the idea, to explain why I think this work is important.

Q: What problem are you trying to solve?

Randomized controlled trials are often conducted in populations that differ substantially from the clinical populations in which the results will be used to guide clinical decision making. My goal is to clarify the conditions that must be met in order for the randomized trial to be informative about what will happen if the drug is given to a target population which differs from the population that was studied.

As a first step, one could attempt to construct a subgroup of the participants in the randomized trial, such that the subgroup is sufficiently similar to the patients you are interested in, in terms of some observed baseline covariates. However, this leaves open the question of how one can determine what baseline covariates need to be accounted for.

In order to determine this, it would be necessary to provide a priori biological facts which would lead to the effect in one population being equal to the effect in another population. For example, if we somehow knew that the effect of a drug is entirely determined by some gene whose prevalence differs between two countries, it is possible that when we compare people in Country A who have the gene with people in Country B who also have the gene, and compare people in Country A who don’t have the gene with people in Country B who don’t have the gene, the effect is equal between the relevant groups. Using an extension of this approach, we can try to look for a set of baseline covariates such that the effect can be expected to be approximately equal between two populations once we make the comparisons within levels of the covariates.

Unfortunately, things are more complicated than this. Specifically, we need to be more precise about what we mean by the word “effect”. When investigators measure effects, they have several options available to them: They can use multiplicative parameters (such as the risk ratio and the odds ratio), additive parameters (such as the risk difference), or several other alternatives that have fallen out of fashion (such as the arcsine difference). If the baseline risks differ between two populations (for example, between men and women), then, apart from special cases such as a treatment with no effect, at most one of these parameters can be equal between the two groups. Therefore, a biological model that ensures equality of the risk ratio cannot also ensure equality of the risk difference. The logic that determines whether a set of covariates is sufficient to get effect equality is therefore necessarily dependent on how we choose to measure the effect.

Making things even worse, the commonly used risk ratio is not invariant to the coding of the outcome variable: Generalizations based on the ratio of the probabilities of death will give different predictions from generalizations based on the ratio of the probabilities of survival. In other words, when using a risk ratio model, your conclusions are not invariant to an arbitrary decision that was made when the person who constructed the dataset decided whether to encode the outcome variable as (death=1, survival=0) or as (survival=1, death=0).
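To make the two points above concrete, here is a minimal sketch of how the same trial result leads to different predictions depending on which effect measure is assumed to be equal across populations. The function name and all the risks are hypothetical numbers chosen for illustration, not data from any study.

```python
def predict_target_risk(p0_study, p1_study, p0_target):
    """Predict the risk of death under treatment in a target population.

    p0_study / p1_study: risk of death without / with treatment in the trial.
    p0_target: risk of death without treatment in the target population.
    Each entry assumes a different effect measure carries over unchanged.
    """
    rr_death = p1_study / p0_study                  # ratio of death probabilities
    rr_survival = (1 - p1_study) / (1 - p0_study)   # ratio of survival probabilities
    rd = p1_study - p0_study                        # risk difference
    return {
        "constant RR (death)": rr_death * p0_target,
        "constant RR (survival)": 1 - rr_survival * (1 - p0_target),
        "constant RD": p0_target + rd,
    }


# Hypothetical trial: risk goes from 10% to 20%; target baseline risk is 40%.
predictions = predict_target_risk(p0_study=0.10, p1_study=0.20, p0_target=0.40)
for assumption, risk in predictions.items():
    print(f"{assumption}: {risk:.3f}")
```

With these made-up inputs the three assumptions predict treated risks of roughly 0.80, 0.47 and 0.50 in the target population, even though all three are computed from the same trial. The first two differ only in how the outcome was coded.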

The information that doctors (and the public) extract from randomized trials is often in the form of a summary measure based on a multiplicative parameter. For example, a study will often report that a particular drug “doubled” the risk of a particular side effect, and this then becomes the measure of effect that the clinicians will use in order to inform their decision making. Moreover, the standard methodology for meta-analysis is essentially a weighted average of the multiplicative parameter from each study. Any conclusion that is drawn from these studies would have been different if investigators had chosen a different effect parameter, or a different coding scheme for the outcome variable. These analytic choices are rarely justified by any kind of argument, and instead rely on a convention to always use the risk ratio based on the probability of death. No convincing rationale for this convention exists.
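As a rough illustration of the meta-analysis point, the sketch below pools risk ratios with a generic fixed-effect, inverse-variance average on the log scale, which is one common textbook version of the "weighted average" described above. It is not taken from the paper. The function names and the two studies are hypothetical, and the code assumes every cell count is non-zero.

```python
import math


def pooled_risk_ratio(studies):
    """Fixed-effect inverse-variance pooling of log risk ratios.

    studies: list of (events_treated, n_treated, events_control, n_control).
    Assumes all event counts are non-zero (no continuity correction).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for a, n1, c, n0 in studies:
        log_rr = math.log((a / n1) / (c / n0))
        variance = 1 / a - 1 / n1 + 1 / c - 1 / n0  # large-sample variance of log RR
        weight = 1 / variance
        weighted_sum += weight * log_rr
        total_weight += weight
    return math.exp(weighted_sum / total_weight)


def recode(studies):
    """Swap the outcome coding: count survivors instead of deaths."""
    return [(n1 - a, n1, n0 - c, n0) for a, n1, c, n0 in studies]


# Two hypothetical trials with different baseline risks.
hypothetical_studies = [(20, 100, 10, 100), (60, 100, 40, 100)]
rr_death = pooled_risk_ratio(hypothetical_studies)
rr_survival = pooled_risk_ratio(recode(hypothetical_studies))
print(f"pooled RR for death:    {rr_death:.3f}")
print(f"pooled RR for survival: {rr_survival:.3f}")
```

The same four arms produce two different pooled summaries depending on the coding, and the weights given to each study change as well, since the variance of the log risk ratio depends on which outcome is counted as the event.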

My goal is to provide a general framework that allows an investigator to reason from biological facts about which set of covariates is sufficient to condition on, in order for the effect in one population to be equal to the effect in another, in terms of a specified measure of effect. While the necessary biological conditions can at best be considered approximations of the underlying data generating mechanism, clarifying the precise nature of these conditions will be useful to assist reasoning about how much uncertainty there is about whether the results will generalize to other populations.

Q: What are the existing solutions to this problem, and why do you think you can improve on them?

Recently, much attention has been given to a solution by Judea Pearl and Elias Bareinboim, based on an extension of causal directed acyclic graphs. Pearl and Bareinboim’s approach is mathematically valid and elegant. However, the conditions that must be met in order for these graphs to be a reasonable approximation of the data generating mechanism, are much more restrictive than most trialists are comfortable with.

Here, I am going to skip a lot of details about these selection diagrams, and instead focus on the specific aspect that I find problematic. These selection diagrams abandon measures of effect completely, and instead consider the counterfactual distribution of the outcome under the intervention separately from the counterfactual distribution of the outcome under the control condition. This resolves a lot of the problems associated with effect measures, but it also fails to make use of information that is contained in how these two counterfactuals relate to each other.

Consider for example an experiment to determine the effect of homeopathy on heart disease. Suppose this experiment is conducted in men, and determines that there is no effect. If we use selection diagrams to reason about whether these conclusions also hold in women, we will have to construct a causal graph that contains every cause of heart disease whose distribution differs between men and women, measure these variables and control for them. Most likely, this will not be possible, and we will conclude that we are unable to make any prediction for what will happen if women take homeopathic treatments. The approach simply does not allow us to try to extrapolate the effect size (even when it is null), since it cannot make use of information about how what happened under treatment relates to what happens under the control condition. The selection diagram approach therefore leaves key information on the table: In my view the information that is left out is exactly those pieces of information that could most reliably be used to make generalizations about causal effects.

A closely related point is that the Bareinboim-Pearl approach leads to a conclusion that meta-analysis can be conducted separately in the active arm and the control arm. Most meta-analysts would consider this idea crazy, since it arguably abandons randomization (which is an objective fact about how the data was generated) in favor of unverifiable and questionable assumptions encoded in the graph, essentially claiming that all causes of the outcome have been measured.

Q: What are counterfactual outcome state transition parameters?

Our goal is to construct a measure of effect that allows us to capture the relationship between what happens if treated and what happens if untreated. We want to do this in a way that avoids the mathematical problems with standard measures of effect, and such that the magnitude of the parameters has a biological interpretation. If we succeed in doing this, we will be able to determine what covariates to control for on the basis of asking what biological properties are associated with the magnitude of the parameters.

Counterfactual outcome state transition parameters are effect measures that quantify the probability of “switching” outcome state if we move between counterfactual worlds. We define one parameter which measures the probability that the drug kills the patient, conditional on being someone who would have survived without the drug, and another parameter which measures the probability that the drug saves the patient, conditional on being someone who would have died without the drug.

Importantly, these parameters are not identified from the data, except under strong monotonicity conditions. For example, if we believe that the drug helps some people, harms other people and has no effect on a third group, there is no monotonicity and the method cannot be used. However, it is sometimes the case that the drug only operates in one direction. For example, for most drugs, it is very unlikely that the drug prevents someone from getting an allergic reaction to it. Therefore, its effect on allergic reactions is monotonic.
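The identification problem can be sketched with a small toy calculation. Assume, purely for illustration, that every person belongs to one of four counterfactual response types: "doomed" (dies either way), "saved" (dies only if untreated), "killed" (dies only if treated) and "immune" (survives either way). These labels and all the proportions below are hypothetical. A trial only reveals the two marginal risks, so two populations can look identical in the data while having different switching probabilities.

```python
def summarize(types):
    """Compute observable risks and switching probabilities from the
    (unobservable) proportions of the four response types."""
    p0 = types["doomed"] + types["saved"]    # risk of death if untreated
    p1 = types["doomed"] + types["killed"]   # risk of death if treated
    # Probability the drug kills, among those who would survive without it.
    pr_killed = types["killed"] / (types["killed"] + types["immune"])
    # Probability the drug saves, among those who would die without it.
    pr_saved = types["saved"] / (types["doomed"] + types["saved"])
    return {"p0": p0, "p1": p1, "pr_killed": pr_killed, "pr_saved": pr_saved}


# Monotonic: the drug never kills anyone.
monotone = summarize({"doomed": 0.10, "saved": 0.10, "killed": 0.00, "immune": 0.80})
# Non-monotonic: the drug saves some people and kills others.
mixed = summarize({"doomed": 0.05, "saved": 0.15, "killed": 0.05, "immune": 0.75})

for name, result in (("monotone", monotone), ("mixed", mixed)):
    print(name, {key: round(value, 4) for key, value in result.items()})
```

Both scenarios give the same observable risks (20% untreated, 10% treated), but the probability of being saved is 0.5 in the first and 0.75 in the second. Only when the "killed" group is assumed to be empty can the switching probability be recovered from the two risks, as 1 - p1/p0.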

If the effect of treatment is monotonic, one of the COST parameters is equal to 0 or 1, and the other parameter is identified as the risk ratio. If this is a treatment that reduces incidence, the COST parameter associated with a protective effect is equal to the standard risk ratio based on the probability of death. If on the other hand the treatment increases incidence, the COST parameter associated with a harmful effect is identified as the recoded risk ratio based on the probability of survival. Therefore, if we determine which risk ratio to use on the basis of the COST model, the risk ratio is constrained between 0 and 1.
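The selection rule in this paragraph could be sketched as follows. This is my own illustrative helper, not code from the paper: the function name, the `direction` argument and the example risks are assumptions. The direction of monotonicity is passed in explicitly, because it is meant to come from background biological knowledge rather than from the data.

```python
def cost_risk_ratio(p0, p1, direction):
    """Pick the risk ratio suggested by a monotonic COST model.

    p0, p1: risk of the outcome without / with treatment.
    direction: "protective" (treatment never causes the outcome) or
               "harmful" (treatment never prevents the outcome).
    Returns (outcome coding used, risk ratio).
    """
    if not (0 < p0 < 1 and 0 <= p1 <= 1):
        raise ValueError("need 0 < p0 < 1 and 0 <= p1 <= 1")
    if direction == "protective":
        # Probability of the outcome if treated, among those who
        # would have had the outcome if untreated.
        coding, ratio = "death", p1 / p0
    elif direction == "harmful":
        # Probability of avoiding the outcome if treated, among those
        # who would have avoided it if untreated.
        coding, ratio = "survival", (1 - p1) / (1 - p0)
    else:
        raise ValueError("direction must be 'protective' or 'harmful'")
    if ratio > 1:
        raise ValueError("risks contradict the assumed direction of monotonicity")
    return coding, ratio


protective = cost_risk_ratio(p0=0.20, p1=0.10, direction="protective")
harmful = cost_risk_ratio(p0=0.10, p1=0.20, direction="harmful")
print(protective)
print(harmful)
```

In both cases the returned ratio is a conditional probability, which is why it stays between 0 and 1. If the risks are inconsistent with the stated direction, the sketch raises an error instead of silently returning a ratio above 1.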

Q: Is this idea new?

The underlying intuition behind this idea is not new. For example, Mindel C. Sheps published a remarkable paper in the New England Journal of Medicine in 1958, in which she works from the same intuition and reaches essentially the same conclusions. Sheps’ classic paper has more than 100 citations in the statistical literature, but her recommendations have not been adopted to any detectable extent in the applied statistical literature. Jon Deeks provided empirical evidence for the idea of using the standard risk ratio for protective treatments, and recoded risk ratio for harmful effects, in Statistics in Medicine in 2012.

What is new in this paper is that we formalize the intuition Sheps was working from in terms of a formal counterfactual causal model, which is used as a bridge between the background biological knowledge and the choice of effect measure. Formalizing the problem in this way allows us to clarify the scope and limits of the approach, and points the direction to how these ideas can be used to inform future developments in meta-analysis.

Thinking about your goal of improving how people evaluate randomized trials, a powerful way to gain attention and avoid a similar fate to Sheps’s work would be to apply it to produce a surprising and useful result that would otherwise not be possible. An equally good approach might be to use it to overturn what is currently conventional wisdom by reinterpreting the existing data to show it actually proves the opposite of what has currently been concluded.

I don’t know enough about the relevant fields to make any recommendations in that area, but maybe another reader here will have some suggestions. I think asking for suggestions in the wider EA community might also be a good approach since folks working on global poverty, for example, may already know a problem where your approach would be useful.