Human approval is a good proxy for human value when sampling (even large numbers of) inputs/plans, but a bad proxy for human value when choosing inputs/plans that were optimized via local search. Local search will find ways to hack the human approval while having little effect on the true value.
I do have this intuition as well, but it’s because I expect local search to be far more powerful than random search. I’m not sure how large you were thinking with “large”. Either way though, I expect that sampling will result in a just-barely-approved plan, whereas local search will result in a very-high-approval plan, which basically means that the local search method was a more powerful optimizer. (As intuition for this, note that since image classifiers have millions of parameters, their success suggests that gradient descent is capable of somewhere between thousands and trillions of bits of optimization. The corresponding random search would be astronomically huge.)
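To make that arithmetic concrete: picking the best of N random samples exerts roughly log2(N) bits of selection pressure, so matching even a few hundred bits by pure sampling requires an exponentially large number of samples. Here is a minimal sketch of that calculation; the bit counts are illustrative assumptions, not measurements of any real system:

```python
import math

# Best-of-N random sampling applies roughly log2(N) bits of selection
# pressure, so matching B bits takes on the order of 2^B samples.
def log10_samples_needed(bits: float) -> float:
    """Order of magnitude (log10) of samples needed to apply `bits` bits."""
    return bits * math.log10(2)

# Illustrative bit counts only.
for bits in (20, 100, 1000):
    print(f"{bits:>4} bits of optimization ~ 10^{log10_samples_needed(bits):.0f} samples")
```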
Another way of stating this: I’d be worried about procedure B because it seems like if it is behaving adversarially, then it can craft a plan that gets lots of human approval despite being bad, whereas procedure A can’t do that.
However, if procedure B actually takes a lot of samples before finding a design that humans approve, then we’d get a barely-approved plan, in which case I feel about equally worried about either procedure.
How do you feel about a modification of procedure A, where the sampled plans are evaluated by an ML model of human approval, and only if they reach an acceptably high threshold are they sent to a human for final approval?
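For concreteness, the modification I have in mind looks roughly like the sketch below. All of the names (`sample_plan`, `approval_model`, `ask_human`) are hypothetical stand-ins rather than any existing API; it is just meant to pin down the control flow:

```python
from typing import Callable, Optional

def filtered_sampling(
    sample_plan: Callable[[], object],          # draws one random candidate plan
    approval_model: Callable[[object], float],  # learned model of human approval, in [0, 1]
    ask_human: Callable[[object], bool],        # final human sign-off
    threshold: float = 0.9,
    max_samples: int = 10_000,
) -> Optional[object]:
    """Modified procedure A: sample plans, screen them with a learned
    approval model, and only show the high-scoring ones to a human."""
    for _ in range(max_samples):
        plan = sample_plan()
        if approval_model(plan) < threshold:
            continue  # model predicts the human wouldn't approve; keep sampling
        if ask_human(plan):
            return plan  # human gives final approval
    return None  # no approved plan found within the sampling budget
```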
Also, a random note that might be important: any specifications we give to the rocket design system are very strong indicators of what we will approve.