Okay, so if the builder solution can’t access the human Bayes net directly, that kills a “cheap trick” I had. But I think the idea behind the trick might still be salvageable. First, some intuition:
If the diamond has been replaced with a fake, and the owner asks, “Is my diamond still safe?”, and we’re limited to a “yes” or “no” answer, then we should say “no”. Why? Because that will improve the owner’s world model and lead them to make better predictions, relative to hearing “yes”. (Not across the board: they will be surprised to see something shiny in the vault, whereas hearing “yes” would have prepared them better for that. But overall accuracy, weighted by how much they CARE about being right about each question, should be higher for “no”.)
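To pin down what I mean by “overall accuracy, weighted by how much they care”, here is one way to write the selection rule. The notation is mine, invented just for this paragraph, so treat it as a sketch rather than a formal proposal:

$$
a^{*} \;=\; \arg\max_{a \in \{\text{yes},\,\text{no}\}} \;\sum_{q} w_H(q)\,\mathrm{Acc}\big(H(q \mid a),\, P(q)\big)
$$

where $q$ ranges over questions the owner might care about, $w_H(q)$ is how much they care about getting $q$ right, $H(q \mid a)$ is what they would predict about $q$ after hearing answer $a$, and $P(q)$ is what the predictor expects to actually happen.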
So: maybe we don’t want to avoid the human simulator. Maybe we want to encourage it and try to harness it to our benefit! But how to make this precise? Roughly speaking, we want our reporter to “quiz” the predictor (“what would happen if we did a chemical test on the diamond to make sure it has carbon?”) and then give the same quiz to its model of the human, after that model has heard a candidate answer. The reporter should output whichever answer causes the human model to give the same quiz answers as the predictor does.
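Here is a minimal sketch of that procedure in Python. Everything in it is a hypothetical placeholder I’m inventing to illustrate the idea, not a reference to any real interface: the `predictor` and `human_model` objects, their `answer` and `updated_on` methods, and the quiz questions themselves.

```python
# Hypothetical sketch of "quiz the predictor, quiz the human model".
# `predictor` and `human_model` are assumed stand-in objects, not real APIs.

QUIZ = [
    "What would happen if we ran a chemical test for carbon on the diamond?",
    "What would a scale read if we weighed the object in the vault?",
    "What would we see if we looked at the object under a jeweler's loupe?",
]

def agreement(answers_a, answers_b):
    """Fraction of quiz questions on which two answer lists match."""
    return sum(a == b for a, b in zip(answers_a, answers_b)) / len(answers_a)

def report(predictor, human_model, question="Is my diamond still safe?"):
    # 1. Ask the predictor the quiz questions directly.
    predictor_answers = [predictor.answer(q) for q in QUIZ]

    best_reply, best_score = None, -1.0
    for reply in ["yes", "no"]:
        # 2. Condition the modeled human on this candidate reply,
        #    then give them the same quiz.
        conditioned_human = human_model.updated_on(question, reply)
        human_answers = [conditioned_human.answer(q) for q in QUIZ]

        # 3. Keep whichever reply makes the human's quiz answers
        #    agree most with the predictor's.
        score = agreement(predictor_answers, human_answers)
        if score > best_score:
            best_reply, best_score = reply, score

    return best_reply
```

The point of step 3 is the reversal: instead of trying to suppress the human simulator, we select the answer that brings the simulated human’s predictions into line with the predictor’s.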
Okay, that’s all still a bit vague, but I hope it’s clear what I’m getting at. If not, I can try to clarify. (Unless the vagueness is in my thoughts rather than in my “writeup”/paragraph.) Possible problem: how on earth do we train in such a way as to incentivize the reporter to develop a good human model? Just because we’re worried it will happen by accident doesn’t mean we know how to do it on purpose! (Though if it turns out we can’t do it on purpose, maybe that means it’s not likely to happen by accident, and therefore we don’t need to worry about dishonesty after all??)