Latent Variables and Model Mis-Specification

Posted as part of the AI Alignment Forum sequence on Value Learning.

Rohin’s note: So far, we’ve seen that ambitious value learning needs to understand human biases, and that we can’t simply learn the biases in tandem with the reward. Perhaps we could hardcode a specific model of human biases? Such a model is likely to be incomplete and inaccurate, but it will perform better than assuming an optimal human, and as we notice failure modes we can improve the model. In the language of this post by Jacob Steinhardt (original here), we are using a mis-specified human model. The post talks about why model mis-specification is worse than it may seem at first glance.
This post is fairly technical and may not be accessible if you don’t have a background in machine learning. If so, you can skip this post and still understand the rest of the posts in the sequence. However, if you want to do ML-related safety research, I strongly recommend putting in the effort to understand the problems that can arise with mis-specification.

Machine learning is very good at optimizing predictions to match an observed signal — for instance, given a dataset of input images and labels of the images (e.g. dog, cat, etc.), machine learning is very good at correctly predicting the label of a new image. However, performance can quickly break down as soon as we care about criteria other than predicting observables. There are several cases where we might care about such criteria:

  • In scientific investigations, we often care less about predicting a specific observable phenomenon, and more about what that phenomenon implies about an underlying scientific theory.

  • In economic analysis, we are most interested in what policies will lead to desirable outcomes. This requires predicting what would counterfactually happen if we were to enact the policy, which we (usually) don’t have any data about.

  • In machine learning, we may be interested in learning value functions which match human preferences (this is especially important in complex settings where it is hard to specify a satisfactory value function by hand). However, we are unlikely to observe information about the value function directly, and instead must infer it implicitly. For instance, one might infer a value function for autonomous driving by observing the actions of an expert driver.

In all of the above scenarios, the primary object of interest — the scientific theory, the effects of a policy, and the value function, respectively — is not part of the observed data. Instead, we can think of it as an unobserved (or “latent”) variable in the model we are using to make predictions. While we might hope that a model that makes good predictions will also place correct values on unobserved variables, this need not be the case in general, especially if the model is mis-specified.

I am interested in latent variable inference because I think it is a potentially important sub-problem for building AI systems that behave safely and are aligned with human values. The connection is most direct for value learning, where the value function is the latent variable of interest and the fidelity with which it is learned directly impacts the well-behavedness of the system. However, one can imagine other uses as well, such as making sure that the concepts that an AI learns sufficiently match the concepts that the human designer had in mind. It will also turn out that latent variable inference is related to counterfactual reasoning, which has a large number of tie-ins with building safe AI systems that I will elaborate on in forthcoming posts.

The goal of this post is to explain why problems show up if one cares about predicting latent variables rather than observed variables, and to point to a research direction (counterfactual reasoning) that I find promising for addressing these issues. More specifically, in the remainder of this post, I will: (1) give some formal settings where we want to infer unobserved variables and explain why we can run into problems; (2) propose a possible approach to resolving these problems, based on counterfactual reasoning.

1 Identifying Parameters in Regression Problems

Suppose that we have a regression model $p_\theta(y \mid x)$, which outputs a probability distribution over $y$ given a value for $x$. Also suppose we are explicitly interested in identifying the “true” value of $\theta$ rather than simply making good predictions about $y$ given $x$. For instance, we might be interested in whether smoking causes cancer, and so we care not just about predicting whether a given person will get cancer ($y$) given information about that person ($x$), but specifically whether the coefficients in $\theta$ that correspond to a history of smoking are large and positive.

In a typical setting, we are given data points $(x_i, y_i)_{i=1}^{n}$ on which to fit a model. Most methods of training machine learning systems optimize predictive performance, i.e. they will output a parameter $\hat{\theta}$ that (approximately) maximizes $\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$. For instance, for a linear regression problem we have $\log p_\theta(y_i \mid x_i) = -(y_i - \langle \theta, x_i \rangle)^2$ (up to an additive constant and scaling). Various more sophisticated methods might employ some form of regularization to reduce overfitting, but they are still fundamentally trying to maximize some measure of predictive accuracy, at least in the limit of infinite data.

Call a model well-specified if there is some parameter $\theta^*$ for which $p_{\theta^*}(y \mid x)$ matches the true distribution over $y$, and call a model mis-specified if no such $\theta^*$ exists. One can show that for well-specified models, maximizing predictive accuracy works well (modulo a number of technical conditions). In particular, maximizing $\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$ will (asymptotically, as $n \to \infty$) lead to recovering the parameter $\theta^*$.
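To make the well-specified case concrete, here is a minimal sketch (my own illustration, not code from the post; the data-generating process and variable names are assumptions) of maximum-likelihood estimation recovering $\theta^*$ in a linear-Gaussian model, where MLE reduces to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])  # the theta* we hope to recover

# Well-specified setting: data really do come from the model family
# p_theta(y|x) = N(<theta, x>, sigma^2).
n = 100_000
X = rng.normal(size=(n, 2))
y = X @ theta_true + rng.normal(scale=0.5, size=n)

# For a linear-Gaussian model, maximizing sum_i log p_theta(y_i|x_i)
# is exactly ordinary least squares.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat)  # close to [2.0, -1.0]
```

With enough data the estimate converges to $\theta^*$; the rest of the section is about what happens when no such $\theta^*$ exists.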

However, if a model is mis-specified, then it is not even clear what it means to correctly infer $\theta$. We could declare the $\theta$ maximizing predictive accuracy to be the “correct” value of $\theta$, but this has issues:

  1. While $\theta$ might do a good job of predicting $y$ in the settings we’ve seen, it may not predict $y$ well in very different settings.

  2. If we care about determining $\theta$ for some scientific purpose, then good predictive accuracy may be an unsuitable metric. For instance, even though margarine consumption might correlate well with (and hence be a good predictor of) divorce rate, that doesn’t mean that there is a causal relationship between the two.

The two problems above also suggest a solution: we will say that we have done a good job of inferring a value for $\theta$ if $\theta$ can be used to make good predictions in a wide variety of situations, and not just the situation we happened to train the model on. (For the latter case of predicting causal relationships, the “wide variety of situations” should include the situation in which the relevant causal intervention is applied.)

Note that both of the problems above are different from the typical statistical problem of overfitting. Classically, overfitting occurs when a model is too complex relative to the amount of data at hand, but the problems above can occur even when we have a large amount of data. This is illustrated in the following graph:

Here the blue line is the data we have ($x_{1:n}, y_{1:n}$), and the green line is the model we fit (with slope and intercept parametrized by $\theta$). We have more than enough data to fit a line to it. However, because the true relationship is quadratic, the best linear fit depends heavily on the distribution of the training data. If we had fit to a different part of the quadratic, we would have gotten a potentially very different result. Indeed, in this situation, there is no linear relationship that can do a good job of extrapolating to new situations, unless the domain of those new situations is restricted to the part of the quadratic that we’ve already seen.

I will refer to the type of error in the diagram above as mis-specification error. Again, mis-specification error is different from error due to overfitting. Overfitting occurs when there is too little data and noise is driving the estimate of the model; in contrast, mis-specification error can occur even if there is plenty of data, and instead occurs because the best-performing model is different in different scenarios.
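The failure mode in the diagram can be reproduced in a few lines. The following sketch (my own illustration, not code from the post) fits a line to noisy quadratic data on two different training ranges; despite plenty of data in both cases, the two “best” linear fits disagree wildly — mis-specification error, not overfitting:

```python
import numpy as np

rng = np.random.default_rng(0)

def best_linear_fit(x):
    """Least-squares fit of y = a*x + b to noisy quadratic data y = x^2 + noise."""
    y = x**2 + rng.normal(scale=0.1, size=x.size)
    A = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

# Plenty of data in both cases -- this is not overfitting:
a_left, _ = best_linear_fit(rng.uniform(-2.0, 0.0, size=50_000))
a_right, _ = best_linear_fit(rng.uniform(0.0, 2.0, size=50_000))

print(a_left, a_right)  # roughly -2 and +2: the best-fitting slope flips sign
```

Neither fitted $\theta$ is “true” in any meaningful sense; each is merely the best linear predictor for its own training distribution.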

2 Structural Equation Models

We will next consider a slightly subtler setting, which in economics is referred to as a structural equation model. In this setting we again have an output $y$ whose distribution depends on an input $x$, but now this relationship is mediated by an unobserved variable $z$. A common example is a discrete choice model, where consumers make a choice among multiple goods ($y$) based on a consumer-specific utility function ($z$) that is influenced by demographic and other information about the consumer ($x$). Natural language processing provides another source of examples: in semantic parsing, we have an input utterance ($x$) and output denotation ($y$), mediated by a latent logical form $z$; in machine translation, we have input and output sentences ($x$ and $y$) mediated by a latent alignment ($z$).

Symbolically, we represent a structural equation model as a parametrized probability distribution $p_\theta(y, z \mid x)$, where we are trying to fit the parameters $\theta$. Of course, we can always turn a structural equation model into a regression model by using the identity $p_\theta(y \mid x) = \sum_{z} p_\theta(y, z \mid x)$, which allows us to ignore $z$ altogether. In economics this is called a reduced form model. We use structural equation models if we are specifically interested in the unobserved variable $z$ (for instance, in the examples above we are interested in the value function for each individual, or in the logical form representing the sentence’s meaning).

In the regression setting where we cared about identifying $\theta$, it was obvious that there was no meaningful “true” value of $\theta$ when the model was mis-specified. In this structural equation setting, we now care about the latent variable $z$, which can take on a meaningful true value (e.g. the actual utility function of a given individual) even if the overall model is mis-specified. It is therefore tempting to think that if we fit parameters $\theta$ and use them to impute $z$, we will have meaningful information about the actual utility functions of individual consumers. However, this is a notational sleight of hand — just because we call $z$ “the utility function” does not make it so. The variable $z$ need not correspond to the actual utility function of the consumer, nor do the consumer’s preferences even need to be representable by a utility function.

We can understand what goes wrong by considering the following procedure, which formalizes the proposal above:

  1. Find $\theta$ to maximize the predictive accuracy on the observed data, $\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$, where $p_\theta(y_i \mid x_i) = \sum_{z_i} p_\theta(y_i, z_i \mid x_i)$. Call the result $\theta^*$.

  2. Using this value $\theta^*$, treat $z_i$ as being distributed according to $p_{\theta^*}(z_i \mid x_i, y_i)$. On a new value $x$ for which $y$ is not observed, treat $z$ as being distributed according to $p_{\theta^*}(z \mid x)$.

As before, if the model is well-specified, one can show that such a procedure asymptotically outputs the correct probability distribution over $z$. However, if the model is mis-specified, things can quickly go wrong. For example, suppose that $y$ represents what choice of drink a consumer buys, and $z$ represents consumer utility (which might be a function of the price, attributes, and quantity of the drink). Now suppose that individuals have preferences which are influenced by unmodeled covariates: for instance, a preference for cold drinks on warm days, while the input $x$ does not have information about the outside temperature when the drink was bought. This could cause any of several effects:

  • If there is a covariate that happens to correlate with temperature in the data, then we might conclude that that covariate is predictive of preferring cold drinks.

  • We might increase our uncertainty about $z$ to capture the unmodeled variation in $y$.

  • We might implicitly increase uncertainty by moving utilities closer together (allowing noise or other factors to more easily change the consumer’s decision).

In practice we will likely have some mixture of all of these, and this will lead to systematic biases in our conclusions about the consumers’ utility functions.
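As a toy illustration of the drink example (my own sketch, not from the post; the logistic choice model and all numbers are assumptions), suppose the true probability of choosing a cold drink is $\sigma(z)$ with $z = 0.5$ on cool days and $z = 2.5$ on warm days, with temperature unobserved. Fitting the mis-specified model $\sigma(u)$ yields a utility estimate that matches neither regime, compressing the two true utilities together:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

n = 200_000
u_cool = 0.5                   # true cold-vs-hot utility gap on cool days
warm = rng.random(n) < 0.5     # unobserved covariate: was it a warm day?
u_true = u_cool + 2.0 * warm   # warm days add +2 in favor of a cold drink

y = rng.random(n) < sigmoid(u_true)  # 1 = consumer bought the cold drink

# Mis-specified model: P(cold) = sigmoid(u), with temperature unmodeled.
# Its maximum-likelihood estimate of u is the logit of the choice frequency.
p = y.mean()
u_hat = np.log(p / (1 - p))

print(u_hat)  # ~1.23: matches neither the cool-day (0.5) nor warm-day (2.5) utility
```

Because the logistic function is nonlinear, averaging over the hidden covariate does not even recover the average utility (1.5); the imputed “utility” is a systematically biased artifact of the mis-specification.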

The same problems as before arise: while we by design place probability mass on values of $z$ that correctly predict the observation $y$, under model mis-specification this could be due to spurious correlations or other perversities of the model. Furthermore, even though predictive performance is high on the observed data (and data similar to the observed data), there is no reason for this to continue to be the case in settings very different from the observed data, which is particularly problematic if one is considering the effects of an intervention. For instance, while inferring preferences between hot and cold drinks might seem like a silly example, the design of timber auctions constitutes a much more important example with a roughly similar flavour, where it is important to correctly understand the utility functions of bidders in order to predict their behaviour under alternative auction designs (the model is also more complex, allowing even more opportunities for mis-specification to cause problems).

3 A Possible Solution: Counterfactual Reasoning

In general, under model mis-specification we have the following problems:

  • It is often no longer meaningful to talk about the “true” value of a latent variable $\theta$ (or at the very least, not one within the specified model family).

  • Even when there is a latent variable $z$ with a well-defined meaning, the imputed distribution over $z$ need not match reality.

We can make sense of both of these problems by thinking in terms of counterfactual reasoning. Without defining it too formally, counterfactual reasoning is the problem of making good predictions not just in the actual world, but in a wide variety of counterfactual worlds that “could” exist. (I recommend this paper as a good overview for machine learning researchers.)

While typically machine learning models are optimized to predict well on a specific distribution, systems capable of counterfactual reasoning must make good predictions on many distributions (essentially any distribution that can be captured by a reasonable counterfactual). This stronger guarantee allows us to resolve many of the issues discussed above, while still thinking in terms of predictive performance, which historically seems to have been a successful paradigm for machine learning. In particular:

  • While we can no longer talk about the “true” value of $\theta$, we can say that a value of $\theta$ is a “good” value if it makes good predictions on not just a single test distribution, but many different counterfactual test distributions. This allows us to have more confidence in the generalizability of any inferences we draw based on $\theta$ (for instance, if $\theta$ is the coefficient vector for a regression problem, any variable with positive sign is likely to robustly correlate with the response variable for a wide variety of settings).

  • The imputed distribution over a variable $z$ must also lead to good predictions for a wide variety of distributions. While this does not force $z$ to match reality, it is a much stronger condition and does at least mean that any aspect of $z$ that can be measured in some counterfactual world must correspond to reality. (For instance, any aspect of a utility function that could at least counterfactually result in a specific action would need to match reality.)

  • We will successfully predict the effects of an intervention, as long as that intervention leads to one of the counterfactual distributions considered.

(Note that it is less clear how to actually train models to optimize counterfactual performance, since we typically won’t observe the counterfactuals! But it does at least define an end goal with good properties.)
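One way to make “good predictions on many distributions” operational is to score candidate models on several counterfactual test distributions rather than one. A sketch (my own, continuing the earlier line-vs-quadratic example; the ground truth and evaluation setup are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(lo, hi, n=20_000):
    """Sample (x, y) with x uniform on [lo, hi] and y = x^2 + noise."""
    x = rng.uniform(lo, hi, size=n)
    return x, x**2 + rng.normal(scale=0.1, size=n)

def mse(model, data):
    x, y = data
    return float(np.mean((model(x) - y) ** 2))

# Fit a mis-specified line and a well-specified quadratic on one distribution.
x_tr, y_tr = make_data(0.0, 2.0)
line = np.polynomial.Polynomial.fit(x_tr, y_tr, deg=1)
quad = np.polynomial.Polynomial.fit(x_tr, y_tr, deg=2)

# Score each candidate on several counterfactual test distributions,
# judging by worst-case error rather than training-distribution error.
worlds = [make_data(0.0, 2.0), make_data(-2.0, 0.0), make_data(2.0, 4.0)]
line_worst = max(mse(line, w) for w in worlds)
quad_worst = max(mse(quad, w) for w in worlds)

print(line_worst, quad_worst)  # the line fails badly off-distribution
```

Both models look fine on the training distribution; only the worst-case score over counterfactual distributions exposes the mis-specified one.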

Many people have a strong association between the concepts of “counterfactual reasoning” and “causal reasoning”. It is important to note that these are distinct ideas; causal reasoning is a type of counterfactual reasoning (where the counterfactuals are often thought of as centered around interventions), but I think of counterfactual reasoning as any type of reasoning that involves making robustly correct statistical inferences across a wide variety of distributions. On the other hand, some people take robust statistical correlation to be the definition of a causal relationship, and thus do consider causal and counterfactual reasoning to be the same thing.

I think that building machine learning systems that can do a good job of counterfactual reasoning is likely to be an important challenge, especially in cases where reliability and safety are important, and necessitates changes in how we evaluate machine learning models. In my mind, while the Turing test has many flaws, one thing it gets very right is the ability to evaluate the accuracy of counterfactual predictions (since dialogue provides the opportunity to set up counterfactual worlds via shared hypotheticals). In contrast, most existing tasks focus on repeatedly making the same type of prediction with respect to a fixed test distribution. This latter type of benchmarking is of course easier and more clear-cut, but fails to probe important aspects of our models. I think it would be very exciting to design good benchmarks that require systems to do counterfactual reasoning, and I would even be happy to incentivize such work monetarily.


Thanks to Michael Webb, Sindy Li, and Holden Karnofsky for providing feedback on drafts of this post. If any readers have additional feedback, please feel free to send it my way.