Prediction-Augmented Evaluation Systems
[Note: I made a short video of myself explaining this document here.]
It’s common for groups of people to want to evaluate specific things. Here are a few examples I’m interested in:
The expected value of projects or actions within projects
Research papers, on specific rubrics
Quantitative risk estimates
Important actions that may get carried out by artificial intelligences
I think predictions could be useful in scaling and amplifying such evaluation processes. Humans and later AIs could predict intensive evaluation results. There has been previous discussion on related topics, but I thought it would be valuable to consider a specific model here called “prediction-augmented evaluation processes.” This is a high-level concept that could be used to help frame future discussion.
We can call a systematized process that produces evaluations an “evaluation process.” Let’s begin with a few generic desiderata of these.
High Accuracy / “Evaluating the right thing”
Evaluations should aim at estimating the thing actually cared about as well as possible. In their limit according to some metric of effort, they should approximate ideal knowledge on the thing cared about.
High Precision / “Evaluating the chosen thing correctly”
Evaluations should have low amounts of uncertainty and be very consistent. If the precision is generally less than what naive readers would guess, then these evaluations wouldn’t be very useful.
Low Total Cost
Specific evaluations can be costly, but the total cost across evaluations should be low.
I think that the use of predictions could allow us to fulfill these criteria well. It could help decouple evaluations from their scaling, allowing for independent optimization of the first two. The cost should be low relative to that of scaling evaluators in other obvious ways.
Before getting formal with terminology, I think a specific example would be helpful.
Say Samantha scores research papers for quality on a scale from 1-10. She’s great at it, she has a very thorough and lengthy reviewing procedure, and many others trust her reviews. Unfortunately, there’s only one Samantha, and there are tons of research papers.
One way to scale Samantha’s abilities would be to use a prediction aggregation system. A collection of other people would predict Samantha’s scores before she rates them. Predictions would be submitted as probability distributions over possible scores. Each research paper would have a probability of being scored by Samantha, say 10%. In a naive model, this would be done in batches; the predictors could have 1 month to score 100 papers, and then at the end of the month 10 would randomly be chosen and rated by Samantha.
If this batch process were repeated multiple times, then eventually outside observers could understand how accurate the predictors are and how to aggregate future forecasts to better predict Samantha’s judgments.
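To make the batch process concrete, here is a minimal Python sketch of one round. It is only an illustration under my own assumptions: the function names, the data layout, and the choice of the logarithmic scoring rule are all hypothetical, and every predictor is assumed to have submitted a full distribution for every paper.

```python
import math
import random

SCORES = range(1, 11)  # the 1-10 scale from the example


def log_score(distribution, outcome):
    """Log of the probability a predictor assigned to the judged score."""
    return math.log(distribution[outcome])


def run_batch(predictions, judge, sample_size, rng):
    """Judge a random subset of papers and score each predictor on it.

    predictions: {predictor: {paper: {score: probability}}}
    judge: callable paper -> score (a stand-in for Samantha)
    """
    papers = sorted({p for by_paper in predictions.values() for p in by_paper})
    judged = rng.sample(papers, sample_size)
    judgments = {paper: judge(paper) for paper in judged}
    totals = {
        predictor: sum(log_score(by_paper[paper], judgments[paper]) for paper in judged)
        for predictor, by_paper in predictions.items()
    }
    return judgments, totals


if __name__ == "__main__":
    papers = [f"paper-{i}" for i in range(100)]
    uniform = {s: 1 / len(SCORES) for s in SCORES}  # a predictor with no information
    predictions = {"uniform": {p: uniform for p in papers}}
    # Hypothetical judge that rates every paper 7, just to exercise the code.
    judgments, totals = run_batch(predictions, lambda paper: 7, 10, random.Random(0))
    print(len(judgments), totals)
```

Repeating such rounds would give each predictor a running score, which is the kind of track record outside observers could use when weighting future forecasts.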
An obvious improvement could be that some of the predictors may develop a sense of what arguments Samantha most likes and what data she cares for. They may write up summaries of their arguments to convince Samantha of their particular stances. If managed well, this could speed up Samantha’s work and perhaps improve it. She may eventually find many of the people who best understand her system and develop an amount of trust in them. Of course, this could selectively bias her away from making accurate judgments, so this kind of feedback would have to be handled with care.
Once there are enough predictions, it may be possible to train ML agents to do prediction as well. The humans would essentially act as a “bootstrapping” system.
I’ve outlined how I would describe the internals of a prediction-augmented evaluation process in an engineering system or similar. The wording here is a bit technical, on purpose, so feel free to skip this section.
This diagram attempts to show a few different things. The entirety of a judging evaluation subprocess and prediction system make up the outer prediction-augmented evaluation process. The judging evaluation subprocess has a percent chance of evaluating each of a set of measurables. Predictors can make predictions on each one of these measurables, where they are trying to predict what the judging evaluation subprocess will judge for that measurable if it’s chosen to judge it.
Judging Evaluation Subprocess
I imagine that prediction-augmentation could assist any evaluation process, even theoretically one that is already itself prediction-augmented. Prediction-augmentation acts as a layer that converts one narrow but good evaluation process into a more voluminous process.
In the context of a “prediction-augmented” evaluation process, the “wrapped” evaluation process can be considered the “judging” evaluation subprocess. This internal process would generate “judgments”, and separately predictors will make predictions of future judgments. Both judgments and predictions would act as evaluations, so to speak.
There are already many evaluation systems used in the world, and I imagine that almost any could act as judging processes. The main bottlenecks would be judging quantity and reliability; this would be most useful for areas where evaluations are done for many similar things.
Because the judging process is well isolated, and scale is not a huge worry (that’s pushed to the prediction layer), it can be thoroughly tested and optimized. Because the scaling mechanism is decently decoupled from the evaluation process, it could be much more rigorous than would otherwise be reasonable. For instance, a paper reviewer may typically spend 4 hours per paper, but with a prediction-augmented layer, perhaps they could spend 40 with the papers selected for judgment.
I use the phrase “evaluation process” rather than “evaluation” to point out the fact that this should be something outside the purview of a single individual. I imagine that the failure rate of individuals to evaluate things after a few years could be considerable, so it would be strongly preferable to have backup plans in the case that that happens. I would assume that organizations would generally be a better alternative, even if they were just mostly backing up individuals. Perhaps organizations could set up official trusts or other legal and financial structures to ensure that judgments get carried out.
There would have to be discussion about what the best evaluation processes would look like if many resources were put into predictions, but I think that’s a really good discussion to encourage anyway.
One tricky part would be to further identify evaluation processes that multiple agents would find most informative. For instance, finding some individual that’s trusted by several organizations with significant differences of opinion.
Measurables refer to the things that get evaluated. It’s a bit of a generic word for the use case, but I suspect it will be useful in larger ontologies. Some examples could be “the rating of scientific paper X” or “the expected value of project Y.” It’s important to keep in mind that measurables only make sense with regard to specific evaluation systems; predictors would rarely predict the actual value of something, but rather, the result of a specific evaluation subprocess. For instance, “GDP of the United States, according to XYZ’s process.”
The system obviously requires predictions, and for this to happen at a decent scale, it would almost definitely require some kind of web application. In theory, a formal prediction market would work, but I imagine it would be very difficult to scale to the levels I would hope for in a large evaluation system. I’m personally more excited about more general prediction aggregation tools like The Good Judgment Project and Metaculus. Metaculus, in particular, allows participants to make guesses on continuous variables, which seems like a reasonable mechanism for evaluation systems. I’m also experimenting with a small project of my own to collect forecasts for experimental purposes.
Incentives for predictors could be a bit tricky to work out, but it definitely seems possible. It seems simple enough to pay people using a function that includes their prediction accuracy and quantity. Sign-ups could be screened to prevent lots of bots from joining. Of course, another option would be for the benefits from predictors to be something that itself gets evaluated using a separate prediction-augmented process.
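As one illustration of a payment function that depends on both accuracy and quantity, here is a small Python sketch. It is an assumption of mine rather than a worked-out proposal: the name payout, the rate parameter, and the choice to pay for log-score improvement over a uniform guess are all hypothetical.

```python
import math


def payout(judged_probs, n_outcomes, judge_probability, rate=1.0):
    """Pay for skill relative to a uniform guess, summed over judged items.

    judged_probs: probability the predictor gave to each realized judgment
    n_outcomes: number of possible scores (uniform baseline is 1 / n_outcomes)
    judge_probability: chance that any given item was judged (e.g. 0.1)
    """
    baseline = math.log(1.0 / n_outcomes)
    skill = sum(math.log(p) - baseline for p in judged_probs)
    # Dividing by the judging probability makes the expected payout track
    # every prediction submitted, not only the sampled ones.
    return rate * skill / judge_probability
```

Accuracy enters through the per-item log score, and quantity enters through the sum. Note that this version can go negative for predictors who do worse than a uniform guess, so a real system would need to decide how to handle losses.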
Scaling & Amplification
I think the main two benefits Prediction-Augmentation could provide are that of “scaling” and “amplification.” “Scaling” refers to the ability of such a system to effectively “scale” an evaluation judgment subprocess. The predictors would evaluate many more measurables than the judgment subprocess, and would do so sooner. “Amplification” refers to the ability of the system to improve the best abilities of the judging subprocess. This could come from speeding it up and/or by having judges read content produced by the prediction layer.
I expect “scaling” to be much more impactful than “amplification,” especially for the early use of such systems.
Scaling & Amplification are similar in some ways to “Iterated Distillation and Amplification.” However, these types of scaling & amplification are obviously not always automated, which is a big difference. That said, hypothetically people could eventually write prediction bots, and similar ones for amplification (with nice user interfaces, I assume). I think prediction-augmentation may have relevance for direct use in technical AI alignment systems, but I am currently more focused on human variants.
The judgment subprocess could select specific predicted variables for evaluations after reviewing the predictions, rather than choosing probabilistically. Judges would essentially “challenge” the measurables with the most questionable predictions. Selective evaluations may be more efficient than random evaluations, though it also could mean that predictors may be incentivized to predict items they expect the evaluators would select, leading to some potentially messy issues.
Selective evaluation is essentially very similar to some things many editors and managers do. A news editor may skim a long work by a writer (who is acting in part as a predictor of what the editor will accept), and at times challenge specific parts of text, to either improve directly or send back for improvement.
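One simple way to operationalize “most questionable” is disagreement among predictors. The Python sketch below ranks measurables by the spread of predictors' expected scores; the function name and the use of standard deviation as the disagreement measure are my own assumptions, and other measures (such as how wide the aggregate distribution is) could be substituted.

```python
import statistics


def most_contested(point_predictions, k):
    """Return the k measurables whose predictions disagree the most.

    point_predictions: {measurable: [each predictor's expected score]}
    """
    spread = {m: statistics.pstdev(values) for m, values in point_predictions.items()}
    return sorted(spread, key=spread.get, reverse=True)[:k]
```

A judge could then review the top few measurables from this ranking instead of a random sample, with the incentive caveats mentioned above.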
If evaluations are done probabilistically, the probabilities could change depending on the expected value of improved predictions on specific measurables. This could incentivize the predictors to allocate more effort accordingly. This could look a lot like selective evaluations in practice.
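A rough Python sketch of value-weighted judging probabilities follows. The function name, the value weights, and the expected_judgments budget are hypothetical; I am assuming that someone has already estimated how valuable a better prediction would be for each measurable.

```python
def judging_probabilities(values, expected_judgments):
    """Spread a judging budget across measurables in proportion to value.

    values: {measurable: estimated value of an improved prediction}
    expected_judgments: how many judgments the judge can afford on average
    """
    total = sum(values.values())
    # Probabilities are capped at 1, so a very lopsided set of values can
    # leave part of the budget unused.
    return {m: min(1.0, expected_judgments * v / total) for m, v in values.items()}
```

For example, with values of 1, 1, and 2 and a budget of one expected judgment, the three measurables would be judged with probabilities 0.25, 0.25, and 0.5.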
Traditional Prediction Systems
I would consider existing prediction aggregators/markets to fall under the umbrella of “Prediction-Augmented Evaluation Processes.” These traditionally have had judging subprocesses that are very straightforward and simple; for instance, “Find the GDP of America in 2020 from Wikipedia.” They effectively scale simple judgments purely by estimating them early, rather than also by attempting to recreate a complicated analysis.
Projects could be evaluated for their expected marginal impact. This could provide information very similar to certificates of impact. I think that prediction-augmented evaluation systems could be more efficient than certificates of impact, but would first like to see both be tested more experimentally. This post by Ought poses a similar system for doing evaluations on parts of projects. This post by Robin Hanson discusses similar techniques for evaluating the impact of scientific papers.
General Research Questions
If researchers could express specific uncertain claims early on, then outsiders could predict these researchers’ eventual findings. For example, a scientist could make a list of 100 binary questions they are not sure about, and promise to evaluate a random subset in 10 years.
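For binary questions like these, a forecaster could be scored on the randomly evaluated subset with something like the Brier score. The Python sketch below is an assumption of mine about how that might be wired up; the function name and the resolve callback (standing in for the scientist's eventual answer) are hypothetical.

```python
import random


def brier_on_sample(forecasts, resolve, k, rng):
    """Mean squared error of forecasts on k randomly chosen questions.

    forecasts: {question: forecast probability that the answer is "yes"}
    resolve: callable question -> 1 if "yes", 0 if "no"
    """
    sampled = rng.sample(sorted(forecasts), k)
    return sum((forecasts[q] - resolve(q)) ** 2 for q in sampled) / k
```

Lower is better: always answering 0.5 scores 0.25, and a perfect forecaster scores 0.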
AI Decision Validation
One possibility here could be to have a human act as a judge (hopefully augmented in some way), and an intelligent AI be the predictor. The AI would recommend actions/decisions to the human, and the human/augmentation system would selectively or statistically challenge these. I believe this is similar to ideas of selective challenging in AI Safety via Debate.
Human Value Judgement
If we could narrow value judgments into a robust evaluation process, we could scale this to AI systems. This could be used for making decisions around self-driving vehicles and similar. I imagine that much of the challenge here would be for people to agree on evaluation processes for moral questions, but if this could be approximated, the rest could be carried out somewhat straightforwardly. See this post by Paul Christiano for more information.
Many forums and applications are pretty dependent on specific moderators for moderation. This kind of work could hypothetically help scale them in a controllable way. Future moderators would be obligated to predict the trusted moderators, rather than doing things in other ways. I’m not too sure about this, but know that others in the community have been enthusiastic. See this post by Paul Christiano for more information.
Alternative Dispute Resolution
Existing court systems and alternative dispute resolution systems already are similar to this process in theory. It would be interesting to imagine hypothetical court systems where lower courts would try to predict exactly what higher courts would rule, and on occasion, the higher courts would repeat the same cases. The appellate system may be more efficient, but there may be interesting hybrids. For one, this system could be useful for bootstrapping completely automated rulings.
I imagine many of the most interesting uses of such a system haven’t been thought of yet. Prediction-augmented evaluation processes would have some positives and negatives that current systems don’t have, so they may make sense in different cases. If they do very well, I would assume they may do so in ways that would surprise us.
Much of what has been discussed here is very generic and thus many parts have been previously considered. Paul Christiano, and the team at Ought, in particular, have written about very similar ideas before; the main difference is that they seem to have focussed more on AI learning and specific decisions. Ought’s “Predicting Slow Judgements” work investigates how well humans make predictions on different scales of time for evaluations, and then how that could be mimicked by AIs. I’ve done some work with them before and recommend them to others interested in these topics. Andreas Stuhlmüller’s (founder of Ought) previous work with dialog markets is also worth reading.
There seems to be a good amount of research on evaluation procedures and separately on prediction capabilities. For the sake of expediency, I did not treat this as much of a literature review, though I would be interested in whether others have recommended literature on these topics.