# No-nonsense version of the “racial algorithm bias”

In discussions of algorithm bias, the COMPAS scandal has been quoted out of context too often. This post gives the facts, and the interpretation, as quickly as possible. See this for details.

#### The fight

The COMPAS system is a statistical decision algorithm trained on past statistical data on American convicts. It takes as inputs features about the convict and outputs a “risk score” that indicates how likely the convict is to reoffend if released.

In 2016, the organization ProPublica claimed that COMPAS is clearly unfair to blacks in one way. Northpointe replied that it is approximately fair in another way. ProPublica rebutted with many statistical details that I didn’t read.

The basic paradox at the heart of the contention is very simple, and it is not a simple “machines are biased because they learn from history and history is biased”. It’s just that there are many kinds of fairness; each may sound reasonable, but they are not compatible in realistic circumstances. Northpointe chose one and ProPublica chose another.

#### The math

The actual COMPAS gives a risk score from 1-10, but there’s no need for that here. Consider the toy example where we have a decider (COMPAS, a jury, or a judge) judging whether each convict in a group would reoffend or not. How well the decider is doing can be measured in at least three ways:

• False negative rate = (false negatives)/(actual positives)

• False positive rate = (false positives)/(actual negatives)

• Calibration = (true positives)/(test positives)

A good decider should have a false negative rate close to 0, a false positive rate close to 0, and calibration close to 1.
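As a minimal sketch of these three definitions (with made-up confusion-matrix counts, not real COMPAS data, and a hypothetical helper name), they can be computed directly from the four counts:

```python
def fairness_measures(tp, fp, fn, tn):
    """Compute the three measures from confusion-matrix counts."""
    false_negative_rate = fn / (fn + tp)   # false negatives / actual positives
    false_positive_rate = fp / (fp + tn)   # false positives / actual negatives
    calibration = tp / (tp + fp)           # true positives / test positives (a.k.a. PPV)
    return false_negative_rate, false_positive_rate, calibration

# Illustrative counts only
fnr, fpr, cal = fairness_measures(tp=40, fp=20, fn=10, tn=30)
print(fnr, fpr, cal)  # 0.2 0.4 0.666...
```

A perfect decider would give (0, 0, 1); any real one trades these off.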

Visually, we can draw a “square” with four blocks:

• false negative rate = the “height” of the false negative block,

• false positive rate = the “height” of the false positive block,

• calibration = (true positive block)/(total area of the yellow blocks)

Now consider black convicts and white convicts, so we have two squares. Since the two groups have different reoffense rates for some reason, the central vertical lines of the two squares sit at different positions.

The decider tries to be fair by making sure that the false negative rate and the false positive rate are the same in both squares, but then it will be forced to make the calibration for Whites lower than the calibration for Blacks.

Then suppose the decider tries to raise the calibration for Whites: it must somehow decrease the false negative rate of Whites, or the false positive rate of Whites, which breaks parity.

In other words, when the base rates are different, it’s impossible to have equal fairness measures in all of:

• false negative rate

• false positive rate

• calibration

Oh, I forgot to mention: even when base rates are different, there is a way to have equal fairness measures in all three of those… But that requires the decider to be perfect: its false positive rate and false negative rate must both be 0, and its calibration must be 1. This is unrealistic.

In the jargon of fairness measurement, “equal false negative rate and false positive rate” is “parity fairness”; “equal calibration” is just “calibration fairness”. Parity fairness and calibration fairness can be straightforwardly generalized for COMPAS, which uses a 1-10 scoring scale, or indeed any numerical risk score.

It is straightforward algebra to prove that in this general case, parity fairness and calibration fairness are incompatible whenever the base rates are different and the decider is not perfect.
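The binary case of that algebra can be checked numerically. A quick sketch (hypothetical base rates and error rates, not fitted to any real data): hold one imperfect decider’s false negative and false positive rates fixed, so it is parity-fair across two groups, and the calibrations necessarily come out different.

```python
def calibration(base_rate, fnr, fpr, n=1000):
    """Calibration (true positives / test positives) of a decider with the
    given error rates, applied to a group with the given base rate."""
    pos = base_rate * n        # actual positives in the group
    neg = n - pos              # actual negatives
    tp = pos * (1 - fnr)       # true positives
    fp = neg * fpr             # false positives
    return tp / (tp + fp)

# Same decider (equal FNR and FPR, i.e. parity-fair) on two base rates
cal_low  = calibration(base_rate=0.3, fnr=0.2, fpr=0.2)
cal_high = calibration(base_rate=0.5, fnr=0.2, fpr=0.2)
print(cal_low, cal_high)  # ~0.632 vs 0.8: calibration fairness fails
```

The gap closes only when fnr = fpr = 0, i.e. the perfect-decider corner case mentioned above.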

#### The fight, after-math

Northpointe showed that COMPAS is approximately fair in calibration for Whites and Blacks. ProPublica showed that COMPAS is unfair in parity.

The lesson is that there are incompatible fairnesses. Figuring out which one to apply is a different question.

• I like the no-nonsense section titles!

I also like the attempt to graphically teach the conflict between the different fairness desiderata using squares, but I think I would need a few more intermediate diagrams (or probably, to work them out myself) to really “get it.” I think the standard citation here is “Inherent Trade-Offs in the Fair Determination of Risk Scores”, but that presentation has a lot more equations and fewer squares.

• Yes, (Kleinberg et al., 2016)… Do not read it. Really, don’t. The derivation is extremely clumsy (and my professor said so too).

The proof has been considerably simplified in subsequent works. Looking through the papers that cite it should turn up a published paper that does the simplification...

• Actually, Kleinberg et al. 2016 isn’t all that bad. They have a small paragraph at the beginning of section 2 which they call an “informal overview” of the proof. But it’s actually almost a decent proof in and of itself. You may accept it as such, or you may write it down a bit more formally, and you end up with a short, sweet proof. The reason they can’t use a graphical approach like the one in this blog entry is that the above diagram with the squares only applies to the special case of scores that output either 0 or 1, but nothing in between. That is an important special case, but a special case nevertheless. Kleinberg et al. deal with the more common and slightly more general case of scores which can take any real value from 0 to 1. Also the COMPAS score, which is the topic of the ProPublica report cited above, can take values other than just 0 and 1.

By the way, the introductory section of the Kleinberg et al. paper is also definitely worth reading. It gives an overview of the relevance of the problem to other areas of application. So only their attempt at a formal proof is kind of a waste of time to read.

• I’ll side with ProPublica, because my understanding of fairness (equal treatment for everyone) seems to be closer to parity than calibration. For example, a test that always returns positive or always flips a coin is parity-fair but not calibration-fair.

• Would you have the same position if the algorithm kept blacks in prison at much higher rates than their risk of recidivism warranted? (Instead of vice versa)

How do you feel about insurance? Insurance is, understandably, calibration-fair. If men have higher rates of car accidents than women, they pay a higher rate. They are not treated equally to women. Would you prefer to eliminate discrimination on such variables? Wouldn’t you have to eliminate all variables in order to treat everyone equally?

• I don’t think everyone should be parity-fair to everyone else; that’s unfeasible. But I do think the government should be parity-fair. For example, a healthcare safety net shouldn’t rely on free-market insurance where the sick pay more. It’s better to have a system like in Switzerland, where everyone pays the same.

• I like the idea of clearly showing the core of the problem using a graphical approach, namely how the different base rates keep us from having both kinds of fairness.

There is one glitch, I’m afraid: it seems you got the notion of calibration wrong. In your usage, an ideal calibration would be a perfect score, i.e. a score that outputs 1 for all the true positives and 0 for all the true negatives. While perfect scores play a certain role in Kleinberg et al.’s paper as an unrealistic corner case of their theorem, the standard notion of calibration is a different one: it demands that when you look at a score bracket (the set of all people having approximately the same score), the actual fraction of positive instances in this group should (approximately) coincide with the score value in that bracket. To avoid discrimination, one also checks that this holds for white and for black defendants separately.

Fortunately, your approach still works with this definition. In your drawing, it translates into the demand that, in each of the two squares, the yellow area must be as large as the left column (the actual positives). Assume that this is the case in the upper drawing. When we go from the upper to the lower drawing, the boundary between the left and right column moves to the right, as the base rate is higher among blacks. This is nicely indicated with the red arrows in the lower drawing. So the area of the left column increases. But of this newly acquired territory of the left column, only a part is also a new part of the yellow area. Another part was yellow and stays yellow, and a third part is now in the left column, but not part of the yellow area. Hence, in the lower drawing, the left column is larger than the yellow area.
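The bracket-wise notion described above can be sketched in a few lines (toy scores and outcomes invented for illustration; the helper name is hypothetical): group people by score value and compare each bracket’s score to the actual fraction of positives in it.

```python
from collections import defaultdict

def bracket_calibration(scores, outcomes):
    """For each score value, return the actual fraction of positive outcomes.
    A well-calibrated score has that fraction roughly equal to the score."""
    buckets = defaultdict(list)
    for s, y in zip(scores, outcomes):
        buckets[s].append(y)
    return {s: sum(ys) / len(ys) for s, ys in sorted(buckets.items())}

# Toy data: a 0.7 bracket is calibrated if ~70% of those people reoffend
scores   = [0.3, 0.3, 0.3, 0.3, 0.7, 0.7, 0.7, 0.7, 0.7]
outcomes = [0,   0,   1,   0,   1,   1,   1,   0,   1  ]
print(bracket_calibration(scores, outcomes))  # {0.3: 0.25, 0.7: 0.8}
```

Checking calibration fairness then means running this separately for each group and comparing the per-bracket fractions.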

• While this is a nice summary of classifier trade-offs, I think you are entirely too dismissive of the role of history in the dataset, and if I didn’t know any better, I would walk away with the idea that fairness comes down to just choosing an optimal trade-off for a classifier. If you had read any of the technical response, you would have noticed that when controlling for “recidivism, criminal history, age and gender across races, black defendants were 45 percent more likely to get a higher score”. Controls are important because they let you get at the underlying causal model, which is more important for predicting a person’s recidivism than what statistical correlations will tell you. Choosing the right causal model is not an easy problem, but it is at the heart of what we mean when we conventionally talk about fairness.

• Didn’t you just show that “machines are biased because they learn from history and history is biased” is indeed the case? The base rates differ because of historical circumstances.

• I’m following common speech, where “biased” means “statistically immoral, because it violates some fairness requirement”.

I showed that with a base rate difference, it’s impossible to satisfy all three fairness requirements. The decider (machine or not) can completely ignore history. It could be a coin-flipper. As long as the decider is imperfect, it will still be unfair by one of the fairness requirements.

And if the base rates are not due to historical circumstances, this impossibility still stands.

• I’m not sure what “statistically immoral” means, nor have I ever heard the term, which makes me doubt it’s common speech (googling it does not bring up any uses of the phrase).

I think we’re using the term “historical circumstances” differently; I simply mean what’s happened in the past. Isn’t the base rate purely a function of the records of white/black convictions? If so, then the fact that the rates are not the same is the reason that we run into this fairness problem. I agree that this problem can apply in other settings, but in the case where the base rate is a function of history, is it not accurate to say that the cause of the conundrum is historical circumstances? An alternative history with equal, or essentially equal, rates of convictions would not suffer from this problem, right?

I think what people mean when they say things like “machines are biased because they learn from history and history is biased” is precisely this scenario: historically, conviction rates are not equal between racial groups, and so any algorithm that learns to predict convictions based on historical data will inevitably suffer from the same inequality (or suffer from some other issue by trying to fix this one, as your analysis has shown).

• No. Any decider will be unfair in some way, whether or not it knows anything about history. The decider can be a coin flipper and it would still be biased. One can say that the unfairness is baked into the reality of the base-rate difference.

The only way to fix this is not to fix the decider, but to somehow make the base-rate difference disappear, or to compromise on the definition of fairness so that it’s less stringent, and satisfiable.

And in common language and common discussion of algorithmic bias, “bias” is decidedly NOT merely a statistical definition. It always contains a moral judgment: violation of a fairness requirement. To say that a decider is biased is to say that the statistical pattern of its decisions violates a fairness requirement.

The key message is that, by the common-language definition, “bias” is unavoidable. No amount of trying to fix the decider will make it fair. Blinding it to history will do nothing. The unfairness is in the base rate, and in the definition of fairness.

• The base rates in the diagram are not historical but “potential” rates. They show the proportion of current inmates up for parole who would be re-arrested if paroled. In practice this is indeed estimated by looking at historical rates, but as long as the true base rates differ in reality, no algorithm can be fair in the two senses described above.

• Afaik, in ML, the term bias is used to describe any move away from the uniform/mean case. But in common speech, such a move would only be called a bias if it’s inaccurate. So if the algorithm learns a true pattern in the data (X is more likely to be classified as 1 than Y is), that wouldn’t be called a bias. Unless I misunderstand your point.

• 1. No-one has access to the actual “re-offend” rates: all we have are “re-arrest,” “re-convict,” or at best “observed and reported re-offense” rates.

2. A priori we do not expect the amount of melanin in a person’s skin, or the word they write down on a form next to the prompt “Race”, to be correlated with the risk of re-offense. So, any tool that looks at “a bunch of factors” and comes up with “Black people are more likely to re-offend” is “biased” compared to our prior (even if our prior is wrong).