[Question] How does Gradient Descent Interact with Goodhart?

I am confused about how gradient descent (and other forms of local search) interact with Goodhart’s law. I often use a simple proxy of “sample points until I get one with a large U value” or “sample n points, and take the one with the largest U value” when I think about what it means to optimize something for U. I might even say something like “n bits of optimization” to refer to sampling 2^n points. I think this is not a very good proxy for what most forms of optimization look like, and this question is trying to get at understanding the difference.

(Alternatively, maybe sampling is a good proxy for many forms of optimization, and this question will help us understand why, so we can think about converting an arbitrary optimization process into a certain number of bits of optimization, and comparing different forms of optimization directly.)
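For concreteness, here is a minimal sketch of the sampling proxy described above. Everything specific in it is a placeholder assumption introduced for illustration (the quadratic objective U, the dimension, the standard-normal sampling distribution); the only point is that “n bits of optimization” corresponds to taking the best of 2^n samples.

```python
# A minimal illustration of the sampling proxy: "n bits of optimization"
# here just means taking the best of 2**n random samples.  The objective U
# below is a hypothetical placeholder, not anything from the post.
import numpy as np

rng = np.random.default_rng(0)
U = lambda x: -np.sum((x - 1.0) ** 2)   # placeholder objective

def best_of_n_bits(n_bits, dim=5):
    """Sample 2**n_bits points from a standard normal and return the one with the largest U value."""
    xs = rng.normal(size=(2 ** n_bits, dim))
    return xs[np.argmax([U(x) for x in xs])]

x = best_of_n_bits(10)   # 10 bits of optimization = best of 1024 samples
print(U(x))
```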


One reason I care about this is that I am concerned about approaches to AI safety that involve modeling humans to try to learn human value. One reason for this concern is that I think it would be nice to be able to save human approval as a test set. Consider the following two procedures:

A) Use some fancy AI system to create a rocket design, optimizing according to some specifications that we write down, and then sample rocket designs output by this system until you find one that a human approves of.

B) Generate a very accurate model of a human. Use some fancy AI system to create a rocket design, optimizing simultaneously according to some specifications that we write down and approval according to the accurate human model. Then sample rocket designs output by this system until you find one that a human approves of.

I am more concerned about the second procedure, because I am worried that the fancy AI system might use a method of optimizing for human approval that Goodharts away the connection between human approval and human value. (In addition to the more benign failure mode of Goodharting away the connection between true human approval and the approval of the accurate model.)

It is possible that I am wrong about this, and I am failing to see just how unsafe procedure A is, because I am failing to imagine the vast number of rocket designs one would have to sample before finding one that is approved, but I think maybe procedure B is actually worse (or worse in some ways). My intuition here is saying something like: “Human approval is a good proxy for human value when sampling (even large numbers of) inputs/plans, but a bad proxy for human value when choosing inputs/plans that were optimized via local search. Local search will find ways to hack the human approval while having little effect on the true value.” The existence of adversarial examples for many systems makes me feel especially worried. I might find the answer to this question valuable in thinking about how comfortable I am with superhuman human modeling.

Another reason why I am curious about this is that I think maybe understanding how different forms of optimization interact with Goodhart can help me develop a suitable replacement for “sample points until I get one with a large U value” when trying to do high-level reasoning about what optimization will look like. Further, this replacement might suggest a way to measure how much optimization happened in a system.


Here is a proposed experiment (or class of experiments) for investigating how gradient descent interacts with Goodhart’s law. You might want to preregister predictions on how experiments of this form might go before reading the comments.

Proposed Experiment:

1. Generate a true function V. (For example, you can write down a function explicitly, generate a random function by randomly initializing a neural net, or train a neural net on random data.)

2. Generate a proxy function U, which can be interpreted as a proxy for V. (For example, you can generate a random noise function E and let U = V + E, or you can train a neural net to try to copy V.)

3. Fix some initial distribution D_0 on the input space, which will represent random sampling. (For example, the normal distribution.)

4. From D_0, define some other distribution D_1, which can be interpreted as sampling points according to D_0, then performing some kind of local optimization according to U. (For example, take a point according to D_0 and then perform k steps of gradient ascent on U; or take a point x according to D_0, sample k more points all within distance ε of x, and take the one with the highest U value.)

5. Screen off the proxy value by conditioning points sampled from D_0 and from D_1 to lie in a narrow high band of proxy values, and compare the corresponding distributions of true values. (For example, is the expected true value E[V(x)] within that band greater when x is sampled from D_0 or from D_1?)


So, after conditioning on having a high proxy value, D_0 represents getting that high proxy value by sampling randomly until you find one, while D_1 represents a combination of random sampling with some form of local search. If D_0 does better according to the true value, this would imply that optimization via gradient descent respects the true value less than random sampling does.
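To make the proposed experiment concrete, here is a minimal sketch of how steps 1–5 could be run. Every specific choice in it is a placeholder assumption rather than part of the proposal: V and the noise function E are small random tanh networks, the proxy is U = V + E, D_0 is a standard normal, D_1 does a few steps of finite-difference gradient ascent on U, and the “narrow high band” is a quantile slice of proxy values under D_0.

```python
# Sketch of the proposed experiment (steps 1-5), with placeholder choices.
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 10, 64   # input dimension and width of the random networks

def random_net():
    """Return a random two-layer tanh network R^d -> R (a stand-in for a 'random function')."""
    W1, b1 = rng.normal(size=(hidden, d)), rng.normal(size=hidden)
    w2 = rng.normal(size=hidden) / np.sqrt(hidden)
    return lambda x: w2 @ np.tanh(W1 @ x + b1)

V = random_net()                      # step 1: true function
E = random_net()                      # step 2: random noise function
U = lambda x: V(x) + 0.5 * E(x)       # step 2: proxy = true + noise

def grad_U(x, eps=1e-4):
    """Finite-difference gradient of the proxy U at x."""
    g = np.zeros(d)
    for i in range(d):
        step = np.zeros(d); step[i] = eps
        g[i] = (U(x + step) - U(x - step)) / (2 * eps)
    return g

def sample_D0():
    return rng.normal(size=d)         # step 3: random sampling

def sample_D1(k=5, lr=0.1):
    x = sample_D0()                   # step 4: sample, then local search on U
    for _ in range(k):
        x = x + lr * grad_U(x)
    return x

# Step 5: condition both distributions on a narrow high band of proxy values
# and compare the mean true value inside the band.  The band and the strength
# of the local search have to be chosen so that both D_0 and D_1 actually put
# samples inside the band; the counts printed below show whether that happened.
xs0 = [sample_D0() for _ in range(10000)]
xs1 = [sample_D1() for _ in range(1000)]
lo, hi = np.quantile([U(x) for x in xs0], [0.95, 0.99])

def true_values_in_band(xs):
    return [V(x) for x in xs if lo <= U(x) <= hi]

for name, xs in [("D_0", xs0), ("D_1", xs1)]:
    vals = true_values_in_band(xs)
    mean = np.mean(vals) if vals else float("nan")
    print(f"{name}: {len(vals)} samples in band, mean true value {mean:.3f}")
```

The comparison in step 5 is then just whether the mean true value inside the band is lower for D_1 than for D_0, i.e. whether the gradient-ascent samples bought their high proxy values at a larger cost in true value than the random samples did.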

There are many degrees of freedom in the procedure I describe above, and even more degrees of freedom in the space of procedures that do not exactly fit the description above but still get at the general question. I expect the answer will depend heavily on how these choices are made. The real goal is not to get a binary answer, but to develop an understanding of how (and why) the various choices affect how badly local search Goodharts relative to random sampling.

I am asking this question because I want to know the answer, but (maybe due to its experimental nature) it also seems relatively approachable as far as AI safety questions go, so some people might want to try to do these experiments themselves, or try to figure out how they could get an answer that would satisfy them. Also, note that the above procedure implies a very experimental way of approaching the question, which I think is partially appropriate, but it may be better to think about the problem in theory, or in some combination of theory and experiments.

(Thanks to the many people I talked with about the ideas in this post over the last month: Abram Demski, Sam Eisenstat, Tsvi Benson-Tilsen, Nate Soares, Evan Hubinger, Peter Schmidt-Nielsen, Dylan Hadfield-Menell, David Krueger, Ramana Kumar, Smitha Milli, Andrew Critch, and many other people that I probably forgot to mention.)