I am not very confident about the first question, but I’m heartened by the fact that other people have had ideas when I asked them. The most obvious approach is probably formal verification, either for a pre-existing model or, more likely, for a new model built to verifiably have some nice properties. More practically, one could use trusted judges along with a less formal rubric (for problems where perfect specification is too hard), although this requires building trust over time.
I’m less worried about the second point at the moment; AI researchers are good at working on problems that have robust continuous metrics, e.g. the classic RL setup where they get a clear signal from incremental improvement even if they don’t solve the whole thing. I think they are worst at exactly the kind of challenges inducement prizes should attack: those where the final outcome can be made quite explicit, but we don’t know what methods would get us there.