Per-protocol analysis as medical malpractice
“Per-protocol analysis” is when medical trial researchers drop from the analysis the people who didn’t follow the treatment protocol. It is outrageous and must be stopped.
For example: let’s say you want to know the impact of daily jogs on happiness. You randomly instruct 80 people either to jog daily or to simply continue their regular routine. As a per-protocol analyst, you drop the many treated people who did not go jogging. You keep the whole control group, because it wasn’t as hard for them to follow instructions.
At this point, your experiment is ruined. You’ve ended up with lopsided groups: the people able to jog versus the unfiltered control group. It would not be surprising if the remaining eager joggers had higher happiness, but the comparison is confounded: it could be due to preexisting factors that made them more able to jog to begin with, like being healthy. You’ve thrown away the random variation that makes experiments useful in the first place.
This sounds ridiculous enough that per-protocol analysis has been abandoned in many fields. But…not all fields. Enter the Harvard hot yoga study, which looked at the effect of hot yoga on depression.
If the jogging example sounded contrived, this study did essentially the same thing, just with hot yoga. The treatment group was randomly assigned to do hot yoga. Only 64% (21 of 33) of the treatment group remained in the study until the endpoint at week 8, compared to 94% (30 of 32) of the control group. They end up with striking graphs like this, which could be entirely due to the selective dropping of treatment-group subjects.
What’s depressing is that there is a known fix for this: intent-to-treat analysis. It estimates effects based on the original assignment, regardless of whether someone complied or not. The core principle is that every comparison should be split on the original random assignment; otherwise you risk confounding. It should be standard practice to report the intent-to-treat estimate, and many medical papers do so, at least somewhere in an appendix. The hot yoga study does not.
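To make the contrast concrete, here is a minimal simulation sketch of the jogging example (all numbers are hypothetical, not from any study). Jogging has zero true effect on happiness, but only the healthier half of the treatment group complies, and health independently raises happiness:

```python
# Minimal sketch of the jogging example above (hypothetical numbers).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # large n, so differences reflect bias rather than noise

health = rng.normal(size=n)                    # unmeasured factor
assigned_jog = rng.integers(0, 2, size=n)      # random assignment: jog vs. usual routine
jogged = (assigned_jog == 1) & (health > 0)    # only the healthier assignees actually jog
happiness = 0.5 * health + rng.normal(size=n)  # true effect of jogging = 0

# Per-protocol: drop assigned joggers who didn't jog, keep the whole control group.
keep = (assigned_jog == 0) | jogged
per_protocol = (happiness[keep & (assigned_jog == 1)].mean()
                - happiness[keep & (assigned_jog == 0)].mean())

# Intent-to-treat: compare by original random assignment, ignoring compliance.
itt = happiness[assigned_jog == 1].mean() - happiness[assigned_jog == 0].mean()

print(f"per-protocol 'effect':  {per_protocol:+.2f}")  # ~ +0.40, pure selection bias
print(f"intent-to-treat effect: {itt:+.2f}")           # ~ +0.00, the true effect
```

The spurious per-protocol number comes entirely from comparing the above-average-health joggers with the unfiltered control group; the intent-to-treat comparison keeps the random assignment intact and recovers the (zero) true effect.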
Intent-to-treat might be hard to estimate if you’re following people over time and there’s a risk of differential attrition: you’re missing outcome data for a selected chunk of people.
Also, hot yoga could still really work! We just don’t know from this study. And with all the buzz, there’s a good chance this paper ends up being worse than useless: leading to scaled-up trials with null findings that might not have been run if there had been more transparency to begin with.
The key fraud seems to be the word “control”. The researchers did not have real control over who was doing the yoga.
I don’t think this part is so bad. You can lack “real control” but still have an informative experiment. You just need the people in the treatment group to have been, on average, more likely to receive the treatment. The issue is how they analyzed the data.
There are informative experiments that are not controlled experiments. Claiming that they are controlled when they aren’t, however, sounds fraudulent to me.
I don’t find the exhortation to abandon per-protocol analysis very compelling without even a basic attempt to understand why educated people acting in good faith would be using it. I’m not a medical research expert, but I can think of at least a few situations off the top of my head where per-protocol studies are useful.
A per-protocol study which showed significant benefit to those who complied with a treatment regimen, despite a high rate of dropouts (who in an intent-to-treat analysis would be lumped in with the compliers), would be useful in differentiating between an effective but poorly tolerated treatment and an ineffective one. This could be the difference between abandoning a potential treatment and researching ways to make it more tolerable.
Per-protocol studies give us something to rely on when speaking to highly motivated patients who are willing and able to stick to demanding regimens and can therefore experience a benefit we wouldn’t have noticed in an intent-to-treat study.
Per-protocol studies can help to establish proof of concept for a new medical treatment. It’s useful to demonstrate that a new treatment works under ideal conditions, even if that knowledge doesn’t generalise to real-world populations yet.
I agree that it’s useful for papers to contain both per-protocol and intention-to-treat analyses, but your claim here (that no one should ever conduct per-protocol analysis) seems far too strong to me. I’m not sure why you’re so keen to cut off a very important methodology just because there’s a theoretical risk that people who don’t read the details of a study might draw the wrong conclusions from it.
Thanks, I agree that I should better understand why so many medical researchers do this. And I can definitely imagine situations where it helps: a small sample (so ITT is underpowered), a useful treatment (so we’re worried about false negatives), and a low chance of bias (compliance isn’t correlated with outcomes). My belief is that these three factors don’t align often enough to justify per-protocol analysis; it’s way more likely to lead you to false cures. It’s hard to test this empirically, though, so if you have examples of treatments that got correctly assessed and advanced with a per-protocol analysis, I’d be very interested!
Do you have much evidence of a time when medical researchers have been misled by a per-protocol analysis into advancing a treatment which, were it analysed using an intention-to-treat analysis, would not have been taken forward?
As one of what I’m sure are many good examples of the utility of per-protocol analysis: this study, looking at whether routine screening with colonoscopy reduces the risk of colorectal cancer. In the intention-to-treat analysis (examining those who were invited to be screened), the difference in death rate wasn’t significant. In the per-protocol analysis (examining those who actually took part in screening), the difference was statistically significant.
By banning per-protocol analysis in the way you’re suggesting, there would have been no way of knowing whether it was worth investing more in getting people to actually turn up to screening; the only conclusion of the trial would have been “inviting people to get screened doesn’t significantly reduce their risk of death”. Thanks to per-protocol analysis, we can say “actually, the people who turn up DO die less often, so if we can get people to respond to our invitations, we can reduce the risk of death”.
I still feel that, so far, you have identified what per-protocol analysis is, decided independently that there is an enormous risk of researchers misinterpreting this fairly basic concept, and assumed that they will then make incorrect decisions based on this misunderstanding. I don’t think there’s much evidence to suggest that this potential harm is real, and there’s lots of evidence (both from this specific case study and from the fact that thousands of legitimate medical researchers obviously find it useful) to think that there’s benefit.
If your claim is actually that, without understanding the difference between per-protocol and intention-to-treat analyses, someone superficially reading a paper might misunderstand the conclusions, then I agree it’s technically possible, but I don’t think that’s likely or significant enough to call it “medical malpractice”.
I didn’t realize this was a common practice; that does seem pretty bad!
Do you have a sense of how commonplace this is?
In my econometrics classes, we would have been instructed to take an instrumental variables approach, where “assignment to treatment group” is an instrument for “does the treatment”, and then you can use a two stage least squares regression to estimate the effect of treatment on outcome. (My mind is blurry on the details.)
IIUC this sounds similar to intent-to-treat analysis, except allowing you to back out the effect of actually doing the treatment, which is presumably what you care about in most cases.
I don’t have a sense of the overall prevalence, I’m curious about that too! I’ve just seen it enough in high-profile medical studies to think it’s still a big problem.
Yes, this is totally related to two-stage least squares regression! The intent-to-treat estimate just gives you the effect of being assigned to treatment. The TSLS estimate scales up the intent-to-treat by the effect that the randomization had on treatment uptake (so, e.g., if the randomization increased the share doing yoga from 10% in the control group to 50% in the treatment group, the intent-to-treat effect divided by 0.40 would give you the TSLS estimate).
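Here is a minimal sketch of that calculation using the made-up compliance numbers above (10% vs. 50% uptake; nothing here is from the actual study). The ITT effect divided by the first-stage difference in uptake is the Wald estimator, which coincides with TSLS when there is a single binary instrument:

```python
# Minimal sketch of the Wald / TSLS idea with hypothetical numbers.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_effect = 2.0                                # effect of actually doing the yoga

assigned = rng.integers(0, 2, size=n)            # random assignment (the instrument)
p_yoga = np.where(assigned == 1, 0.50, 0.10)     # assignment shifts uptake from 10% to 50%
did_yoga = rng.random(n) < p_yoga
outcome = true_effect * did_yoga + rng.normal(size=n)

itt = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()            # ~ 0.80
first_stage = did_yoga[assigned == 1].mean() - did_yoga[assigned == 0].mean()  # ~ 0.40
iv = itt / first_stage                           # Wald estimator = ITT / first stage

print(f"ITT {itt:.2f} / first stage {first_stage:.2f} = IV estimate {iv:.2f}")  # ~ 2.0
```

This recovers the effect of actually doing the treatment only under the usual IV assumptions, e.g. that being assigned to the yoga group affects the outcome only through actually doing the yoga.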
If the treatment is relatively mild and the dropouts are comparable between groups, then I am not sure that per-protocol analysis will introduce much bias. What do you think? In that case it can be a decent tool for enhancing power, although the results will always be considered “post hoc” and “hypothesis-generating”.
From experience I would say that intention-to-treat analysis is the standard in large studies of drugs and supplements, while per-protocol analysis is often performed as a secondary analysis, especially when the ITT result is marginal and you have to go fishing for some results to report and to justify follow-up research.
The bias introduced is probably usually small, especially when the dropout rate is low. But, in those cases you get very little “enhanced power”. You would be better off just not bothering with a per-protocol analysis, as you would get the same result from an ordinary analysis based on which group the person was sorted into originally (control or not).
The only situation in which the per-protocol analysis is worth doing is one where it makes a real difference to the statistics, and that is exactly the same situation in which it risks introducing bias. So I think it might just never be worth it: it replaces a known problem (due to dropouts, some people in the yoga group didn’t do all the yoga) with an unknown problem (the yoga group is post-selected nonrandomly), affecting exactly the same number of participants, and so the same scale of problem.
In the yoga context, I would say that if it’s really good at curing depression, then surely its effect size is going to be big enough to swamp a small number of yoga dropouts.
They also only have around 30 participants per arm in the trial. I don’t know if it’s a real rule, but I feel like the smaller the dataset, the more you should stick to really basic, simple measures.
It’s a good question. I have the intuition that just a little potential for bias can go a long way toward messing up the estimated effect, so allowing this practice is net negative despite the gains in power. The dropouts might be similar on demographics but not on something unmeasured, like motivation. My view comes from seeing many failed replications and playing with datasets when available, but I would love to be able to quantify this issue somehow... I would certainly predict that studies where the per-protocol finding differs from the ITT will be far less likely to replicate.
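As one very rough way to quantify that intuition, here is a simulation sketch (all numbers hypothetical): if dropping out of the protocol is correlated with an unmeasured trait like motivation that also affects the outcome, the per-protocol “effect” grows roughly in proportion to the treatment-group dropout rate, even though the true effect is zero throughout.

```python
# Rough sketch: per-protocol bias vs. dropout rate, with hypothetical numbers.
# Dropping out is correlated with an unmeasured trait (motivation) that also
# affects the outcome; the true treatment effect is zero in every case.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

for dropout_rate in [0.0, 0.1, 0.2, 0.4]:
    motivation = rng.normal(size=n)                  # unmeasured trait
    assigned = rng.integers(0, 2, size=n)            # random assignment
    outcome = 0.5 * motivation + rng.normal(size=n)  # true treatment effect = 0
    # In the treated group, the least-motivated fraction drops out of the protocol.
    dropped = (assigned == 1) & (motivation < np.quantile(motivation, dropout_rate))
    keep = ~dropped
    pp = (outcome[keep & (assigned == 1)].mean()
          - outcome[keep & (assigned == 0)].mean())
    print(f"dropout {dropout_rate:.0%}: per-protocol 'effect' = {pp:+.2f}")
# Bias grows with the dropout rate: roughly +0.00, +0.10, +0.17, +0.32
```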