Another important challenge it to list the different possible variables that humans might care about; in the example above, we were given the list of a 1000, but what if we didn’t have it? Also, those variables could only go one way—up. What if there were a real-valued variable that the agent suspected humans cared about—but didn’t know whether we wanted it to be high or low?
This to me seems the fatal problem with this approach. Although I don’t have a proof for it, my suspicion is that we have so many variables that we would not be able to perform the computations necessary as you describe to mitigate Goodharting because it would not actually be able to take all the variables into account during optimization.
This to me seems the fatal problem with this approach. Although I don’t have a proof for it, my suspicion is that we have so many variables that we would not be able to perform the computations necessary as you describe to mitigate Goodharting because it would not actually be able to take all the variables into account during optimization.