My intuitions for how to frame the problem run a little differently.
The way I see it, there is no possible way to block all unFriendly or model-breaking solutions, and it’s foolish to try. Try framing it this way: any given solution has some chance of breaking the models, probably pretty low. Call that probability P (god I miss LaTeX). The goal is to get Friendliness close enough to the top of the list that P (which ought to be constant) times the distance D from the top of the list is still below whatever threshold we set as an acceptable risk to life on Earth / in our future light cone. In other words, we want to minimize the distance of the Friendly solutions from the top of the list, since each solution considered before finding the Friendly one brings a risk of breaking the models and returning a false ‘no-veto’.
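To put rough numbers on that, here is a minimal Python sketch. It assumes each solution considered before the Friendly one breaks the models independently and with the same probability P; the function names and the example values of P and D are invented for illustration. Under that assumption the exact risk is 1 - (1 - P)^D, and P times D is the small-P approximation, which slightly overestimates it.

```python
def exact_risk(p: float, d: int) -> float:
    """Chance that at least one of d candidates breaks the models,
    assuming independent trials with the same per-candidate probability p."""
    return 1.0 - (1.0 - p) ** d


def linear_risk(p: float, d: int) -> float:
    """The P * D figure used in the comment: a first-order approximation."""
    return p * d


if __name__ == "__main__":
    p = 1e-3  # made-up per-solution chance of breaking the models
    for d in (10, 100, 1000):  # made-up distances from the top of the list
        print(d, round(exact_risk(p, d), 4), round(linear_risk(p, d), 4))
```

Either way the risk grows with D, so anything that moves the Friendly solutions up the list lowers it.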
Let’s go back to the grandmother-in-the-burning-building problem: if we assume that this problem has mind-hacking solutions closer to the top of the list than the Friendly solutions (which I kind of doubt, really), then we’ve got a problem. Let’s say, however, that our plan-generating system has a new module which generates simple heuristics for predicting how the models will vote (yes, this is getting a bit recursive for my taste). These heuristics would be simple rules of thumb like ‘models veto scenarios involving human extinction’, ‘models tend to veto scenarios involving screaming and blood’, and ‘models veto scenarios that involve wires in their brains’.
When parsing potential answers, scenarios that trip these heuristics are discarded from the dataset early. In the case of the grandmother problem, a lot of the unFriendly solutions can be discarded using a relatively small set of such heuristics. The filter won’t catch all of them, but it isn’t meant to: simply by disproportionately eliminating unFriendly solutions before the models see them, you shrink the net distance to Friendliness. I see that as a much more robust way of approaching the problem.
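A toy version of that pre-filter, as a sketch only: the `Scenario` class, the tag names, and the example candidate list are all hypothetical, and real heuristics would obviously not be simple tag lookups. The point is just that a few cheap rules, applied before the models vote, can shorten the distance to the first Friendly solution even when some unFriendly ones slip through.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Scenario:
    name: str
    tags: Set[str] = field(default_factory=set)  # hypothetical features of the plan
    friendly: bool = False                       # ground truth, unknown to the filter


# A heuristic returns True when it predicts the models would veto the scenario.
Heuristic = Callable[[Scenario], bool]


def has_tag(tag: str) -> Heuristic:
    return lambda s: tag in s.tags


HEURISTICS: List[Heuristic] = [
    has_tag("human_extinction"),
    has_tag("screaming_and_blood"),
    has_tag("wires_in_brains"),
]


def prefilter(candidates: List[Scenario], heuristics: List[Heuristic]) -> List[Scenario]:
    """Drop every candidate that any heuristic predicts would be vetoed."""
    return [s for s in candidates if not any(h(s) for h in heuristics)]


def distance_to_friendly(candidates: List[Scenario]) -> Optional[int]:
    """Number of candidates the models must evaluate before the first Friendly one."""
    for i, s in enumerate(candidates):
        if s.friendly:
            return i
    return None


# Ranked list for the grandmother problem, best-scoring first (made up).
CANDIDATES = [
    Scenario("rewire the models so they approve", {"wires_in_brains"}),
    Scenario("blast her out through the wall", {"screaming_and_blood"}),
    Scenario("subtle mind-hack the heuristics miss"),
    Scenario("carry her out the front door", friendly=True),
]

if __name__ == "__main__":
    print(distance_to_friendly(CANDIDATES))                         # 3
    print(distance_to_friendly(prefilter(CANDIDATES, HEURISTICS)))  # 1
```

The mind-hack with no matching tag still reaches the models, which is the expected behaviour: the filter only has to remove unFriendly candidates disproportionately, not exhaustively.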