Prediction-Augmented Evaluation Systems

[Note: I made a short video of myself explaining this document here.]

It’s common for groups of people to want to evaluate specific things. Here are a few examples I’m interested in:

  • The expected value of projects or actions within projects

  • Research papers, on specific rubrics

  • Quantitative risk estimates

  • Important actions that may get carried out by artificial intelligences

I think predictions could be useful in scaling and amplifying such evaluation processes. Humans and later AIs could predict intensive evaluation results. There has been previous discussion on related topics, but I thought it would be valuable to consider a specific model here called “prediction-augmented evaluation processes.” This is a high-level concept that could be used to help frame future discussion.

Desiderata:

We can call a systematized process that produces evaluations an “evaluation process.” Let’s begin with a few generic desiderata of these.

  • High Accuracy / “Evaluating the right thing”

    • Evaluations should aim at estimating the thing actually cared about as well as possible. In the limit, according to some metric of effort, they should approximate ideal knowledge of the thing cared about.

  • High Precision / “Evaluating the chosen thing correctly”

    • Evaluations should have low amounts of uncertainty and be very consistent. If the precision is generally less than what naive readers would guess, then these evaluations wouldn’t be very useful.

  • Low Total Cost

    • Specific evaluations can be costly, but the total cost across evaluations should be low.

I think that the use of predictions could allow us to fulfill these criteria well. It could help decouple evaluations from their scaling, allowing for independent optimization of the first two. The cost should be low relative to that of scaling evaluators in other obvious ways.

Prediction-Augmentation Example

Before getting formal with terminology, I think a specific example would be helpful.

Say Samantha scores research papers for quality on a scale from 1-10. She’s great at it: she has a very thorough and lengthy reviewing procedure, and many others trust her reviews. Unfortunately, there’s only one Samantha, and there are tons of research papers.

One way to scale Samantha’s abilities would be to use a prediction aggregation system. A collection of other people would predict Samantha’s scores before she rates the papers. Predictions would be submitted as probability distributions over possible scores. Each research paper would have a probability of being scored by Samantha, say 10%. In a naive model, this would be done in batches; the predictors could have 1 month to predict scores for 100 papers, and then at the end of the month 10 would randomly be chosen and rated by Samantha.

If this batch process were repeated multiple times, then eventually outside observers could understand how accurate the predictors are and how to aggregate future forecasts to better predict Samantha’s judgments.
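To make the batch mechanics concrete, here is a minimal sketch in Python. The 10% judging rate and the 1-10 scale come from the example above; the log-scoring rule, the function names, and the data shapes are my own illustrative assumptions rather than part of any existing system.

```python
import math
import random

JUDGE_PROBABILITY = 0.10  # each paper has a 10% chance of being judged
SCORES = range(1, 11)     # Samantha's 1-10 scale

def log_score(predicted_dist, actual_score):
    """Reward a predicted distribution (dict: score -> probability) for the
    score the judge actually gave; higher (less negative) is better."""
    return math.log(max(predicted_dist.get(actual_score, 0.0), 1e-9))

def run_batch(papers, predictions, judge):
    """papers: list of paper ids.
    predictions: dict mapping (predictor_name, paper_id) -> predicted distribution.
    judge: function from paper_id to a score in 1..10 (Samantha's procedure)."""
    num_judged = max(1, round(len(papers) * JUDGE_PROBABILITY))
    judged = set(random.sample(papers, k=num_judged))
    scores_by_predictor = {}
    for (predictor, paper), dist in predictions.items():
        if paper in judged:
            scores_by_predictor.setdefault(predictor, []).append(
                log_score(dist, judge(paper)))
    # Average log score per predictor; over many batches this reveals which
    # predictors best anticipate Samantha's judgments.
    return {p: sum(s) / len(s) for p, s in scores_by_predictor.items()}
```

Log scoring is just one standard proper scoring rule; any proper scoring rule would preserve the incentive to report honest distributions.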

An obvious improvement could be that some of the predictors may develop a sense of what arguments Samantha most likes and what data she cares for. They may write up summaries of their arguments to convince Samantha of their particular stances. If managed well, this could speed up Samantha’s work and perhaps improve it. She may eventually find many of the people who best understand her system and develop an amount of trust in them. Of course, this could selectively bias her away from making accurate judgments, so this kind of feedback would have to be handled with care.

Once there are enough predictions, it may be possible to train ML agents to do prediction as well. The humans would essentially act as a “bootstrapping” system.

Subcomponents

I’ve outlined how I would describe the internals of a prediction-augmented evaluation process in an engineering system or similar. The wording here is a bit technical, on purpose, so feel free to skip this section.

[Diagram: prediction-augmented evaluation system]

This diagram attempts to show a few different things. Together, a judging evaluation subprocess and a prediction system make up the outer prediction-augmented evaluation process. The judging evaluation subprocess has a percent chance of evaluating each of a set of measurables. Predictors can make predictions on each one of these measurables, where they are trying to predict what the judging evaluation subprocess will judge for that measurable if it’s chosen to judge it.
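As a rough sketch, the relationships in the diagram could be written down as data structures like the following. The type names and fields are my own assumptions for illustration, not something specified in the post.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Measurable:
    """Something that may get judged, e.g. "rating of paper X per process Y"."""
    id: str
    description: str
    judge_probability: float  # chance the judging subprocess evaluates it

@dataclass
class Prediction:
    """One predictor's probability distribution over the eventual judgment."""
    predictor: str
    measurable_id: str
    distribution: Dict[int, float]  # e.g. score -> probability

@dataclass
class PredictionAugmentedProcess:
    """The outer process: a judging subprocess wrapped by a prediction layer."""
    judge: Callable[[Measurable], int]  # the judging evaluation subprocess
    measurables: List[Measurable]
    predictions: List[Prediction] = field(default_factory=list)
```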

Judging Evaluation Subprocess

I imagine that prediction-augmentation could assist any evaluation process, even, in theory, one that is already itself prediction-augmented. Prediction-augmentation acts as a layer that converts one narrow but good evaluation process into a more voluminous process.

In the context of a “prediction-augmented” evaluation process, the “wrapped” evaluation process can be considered the “judging” evaluation subprocess. This internal process would generate “judgments”, and separately predictors would make predictions of future judgments. Both judgments and predictions would act as evaluations, so to speak.

There are already many evaluation systems used in the world, and I imagine that almost any could act as judging processes. The main bottlenecks would be judging quantity and reliability; this would be most useful for areas where evaluations are done for many similar things.

Because the judging process is well isolated, and scale is not a huge worry (that’s pushed to the prediction layer), it can be thoroughly tested and optimized. Because the scaling mechanism is decently decoupled from the evaluation process, that process could be much more rigorous than would otherwise be reasonable. For instance, a paper reviewer may typically spend 4 hours per paper, but with a prediction-augmented layer, perhaps they could spend 40 hours on each of the papers selected for judgment.

I use the phrase “evaluation process” rather than “evaluation” to point out that this should be something outside the purview of a single individual. I imagine that the rate at which individuals fail to carry out evaluations after a few years could be considerable, so it would be strongly preferable to have backup plans in case that happens. I would assume that organizations would generally be a better alternative, even if they were mostly just backing up individuals. Perhaps organizations could set up official trusts or other legal and financial structures to ensure that judgments get carried out.

There would have to be discussion about what the best evaluation processes would look like if many resources were put into predictions, but I think that’s a really good discussion to encourage anyway.

One tricky part would be to further identify evaluation processes that multiple agents would find most informative; for instance, finding an individual who is trusted by several organizations with significant differences of opinion.

Measurables
Measurables refer to the things that get evaluated. It’s a bit of a generic word for the use case, but I suspect it would be useful in larger ontologies. Some examples could be “the rating of scientific paper X” or “the expected value of project Y.” It’s important to keep in mind that measurables only make sense in regard to specific evaluation systems; predictors would rarely predict the actual value of something, but rather, the result of a specific evaluation subprocess. For instance, “GDP of the United States, according to XYZ’s process.”

Predictions
The system obviously requires predictions, and for this to happen at a decent scale, it would almost definitely require some kind of web application. In theory, a formal prediction market would work, but I imagine it would be very difficult to scale to the levels I would hope for in a large evaluation system. I’m personally more excited about more general prediction aggregation tools like The Good Judgment Project and Metaculus. Metaculus, in particular, allows participants to make guesses on continuous variables, which seems like a reasonable mechanism for evaluation systems. I’m also experimenting with a small project of my own to collect forecasts for experimental purposes.

Incentives for predictors could be a bit tricky to work out, but it definitely seems possible. It seems simple enough to pay people using a function that includes their prediction accuracy and quantity. Sign-ups could be screened to prevent lots of bots from joining. Of course, another option would be for predictor compensation to itself be evaluated using a separate prediction-augmented process.
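For illustration, a payout function of the kind mentioned above might look like the sketch below. The particular weights, and the choice to multiply an accuracy score by the prediction count, are assumptions of mine rather than a worked-out incentive design.

```python
def payout(mean_accuracy, num_resolved_predictions,
           rate_per_prediction=0.10, accuracy_bonus=5.00):
    """mean_accuracy: e.g. an average proper-scoring-rule score normalized to [0, 1].
    num_resolved_predictions: how many of this person's predictions were judged."""
    # A small flat rate rewards quantity; the bonus term rewards accuracy,
    # scaled by how many resolved predictions the accuracy was measured over.
    quantity_component = rate_per_prediction * num_resolved_predictions
    accuracy_component = accuracy_bonus * mean_accuracy * num_resolved_predictions
    return quantity_component + accuracy_component
```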

Scaling & Amplification

I think the two main benefits Prediction-Augmentation could provide are “scaling” and “amplification.” “Scaling” refers to the ability of such a system to effectively “scale” a judging evaluation subprocess. The predictors would evaluate many more measurables than the judging subprocess, and would do so sooner. “Amplification” refers to the ability of the system to improve the best abilities of the judging subprocess. This could come from speeding it up and/or by having judges read content produced by the prediction layer.

I expect “scaling” to be much more impactful than “amplification,” especially for the early use of such systems.

Scaling & amplification are in some ways very similar to “Iterated Distillation and Amplification.” However, these types of scaling & amplification are obviously not always automated, which is a big difference. That said, hypothetically people could eventually write prediction bots, and similar ones for amplification (with nice user interfaces, I assume). I think prediction-augmentation may have relevance for direct use in technical AI alignment systems, but I am currently more focused on human variants.

Existing/Possible Variants

Selective Evaluations
The judging subprocess could select specific predicted measurables for evaluation after reviewing the predictions, rather than choosing probabilistically. Judges would essentially “challenge” the measurables with the most questionable predictions. Selective evaluations may be more efficient than random evaluations, though they could also mean that predictors are incentivized to predict items they expect the evaluators would select, leading to some potentially messy issues.

Selective evaluation is essentially very similar to what many editors and managers already do. A news editor may skim a long work by a writer (who is acting in part as a predictor of what the editor will accept), and at times challenge specific parts of the text, either to improve them directly or to send them back for improvement.
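One simple way to operationalize “most questionable predictions” would be to rank measurables by how much the predictors disagree with one another, as in the sketch below. The disagreement metric (variance of the predictors’ point estimates) is my own illustrative assumption; disagreement could be measured in many other ways.

```python
from statistics import pvariance

def most_questionable(predicted_means_by_measurable, k=5):
    """predicted_means_by_measurable: dict mapping measurable id -> list of
    predicted means, one per predictor. Returns the k measurable ids with the
    highest between-predictor variance, as candidates for judges to "challenge"."""
    disagreement = {
        m_id: pvariance(means) if len(means) > 1 else 0.0
        for m_id, means in predicted_means_by_measurable.items()
    }
    return sorted(disagreement, key=disagreement.get, reverse=True)[:k]
```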

EV-Adjusted Probabilities
If evaluations are done probabilistically, the probabilities could change depending on the expected value of improved predictions on specific measurables. This could incentivize the predictors to allocate more effort accordingly. In practice, this could look a lot like selective evaluations.
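As a sketch of how this might work: each measurable’s chance of being judged could be scaled by a rough estimate of how valuable better predictions on it would be. The proportional-scaling rule and the cap below are illustrative assumptions, not a specific proposal from the post.

```python
def ev_adjusted_probabilities(value_of_information, base_probability=0.10):
    """value_of_information: dict mapping measurable id -> a positive estimate
    of the expected value of improved predictions on that measurable.
    Returns per-measurable judging probabilities, scaled so that higher-value
    measurables are more likely to be judged (capped at 1.0)."""
    mean_value = sum(value_of_information.values()) / len(value_of_information)
    return {
        m_id: min(1.0, base_probability * value / mean_value)
        for m_id, value in value_of_information.items()
    }
```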

Traditional Prediction Systems
I would consider existing prediction aggregators/markets to fall under the umbrella of “Prediction-Augmented Evaluation Processes.” These have traditionally had judging subprocesses that are very straightforward and simple; for instance, “Find the GDP of America in 2020 from Wikipedia.” They effectively scale simple judgments purely by estimating them early, rather than also by attempting to recreate a complicated analysis.

Possible Uses

Project Evaluations
Projects could be evaluated for their expected marginal impact. This could provide information very similar to certificates of impact. I think that prediction-augmented evaluation systems could be more efficient than certificates of impact, but I would first like to see both tested more experimentally. This post by Ought proposes a similar system for doing evaluations on parts of projects. This post by Robin Hanson discusses similar techniques for evaluating the impact of scientific papers.

General Research Questions
If researchers could express specific uncertain claims early on, then outsiders could predict those researchers’ eventual findings. For example, a scientist could make a list of 100 binary questions they are not sure about, and promise to evaluate a random subset in 10 years.

AI Decision Validation
One possibility here could be to have a human act as a judge (hopefully augmented in some way), and an intelligent AI be the predictor. The AI would recommend actions/decisions to the human, and the human/augmentation system would selectively or statistically challenge these. I believe this is similar to ideas of selective challenging in AI Safety via Debate.

Human Value Judgment
If we could narrow value judgments into a robust evaluation process, we could scale this to AI systems. This could be used for making decisions around self-driving vehicles and similar applications. I imagine that much of the challenge here would be for people to agree on evaluation processes for moral questions, but if this could be approximated, the rest could be carried out somewhat straightforwardly. See this post by Paul Christiano for more information.

Website Moderation
Many forums and applications are pretty dependent on specific moderators for moderation. This kind of work could hypothetically help scale them in a controllable way. Future moderators would be obligated to predict the trusted moderators’ decisions, rather than moderating by their own standards. I’m not too sure about this, but I know that others in the community have been enthusiastic. See this post by Paul Christiano for more information.

Alternative Dispute Resolution
Existing court systems and alternative dispute resolution systems are already similar to this process in theory. It would be interesting to imagine hypothetical court systems where lower courts would try to predict exactly what higher courts would rule, and on occasion, the higher courts would repeat the same cases. The appellate system may be more efficient, but there may be interesting hybrids. For one, this system could be useful for bootstrapping completely automated rulings.

Unimagined Uses
I imagine many of the most interesting uses of such a system haven’t been thought of yet. Prediction-augmented evaluation processes would have some positives and negatives that current systems don’t have, so they may make sense in different cases. If they do very well, I would assume they may do so in ways that would surprise us.

Related Work

Much of what has been discussed here is very generic, and thus many parts have been previously considered. Paul Christiano and the team at Ought, in particular, have written about very similar ideas before; the main difference is that they seem to have focused more on AI learning and specific decisions. Ought’s “Predicting Slow Judgements” work investigates how well humans make predictions on different scales of time for evaluations, and then how that could be mimicked by AIs. I’ve done some work with them before and recommend them to others interested in these topics. Andreas Stuhlmüller’s (the founder of Ought) previous work with dialog markets is also worth reading.

There seems to be a good amount of research on evaluation procedures and, separately, on prediction capabilities. For the sake of expediency, I did not treat this as much of a literature review, though I would be interested in whether others have recommended literature on these topics.