Clarification request. In the writeup, you discuss the AI Bayes net and the human Bayes net as if there’s some kind of symmetry between them, but it seems to me that there’s at least one big difference.
In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don’t do that is because it’s likely to be too big to make much sense of.
In the case of the human, we have no idea what the Bayes net looks like, because humans don’t have that kind of introspection ability. In fact, there’s not much difference between saying “the human uses a Bayes net” and “the human uses some arbitrary function F, and we worry the AI will figure out F and then use it to lie to us”.
Or am I actually wrong and it’s okay for a “builder” solution to assume we have access to the human Bayes net?
> In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don’t do that is because it’s likely to be too big to make much sense of.
We don’t quite have access to the AI Bayes net—we just have a big neural network, and we sometimes talk about examples where what the neural net is doing internally can be well-described as “inference in a Bayes net.”
So ideally a solution would use neither the human Bayes net nor the AI Bayes net.
But when thinking about existing counterexamples, it can still be useful to talk about how we want an algorithm to behave in the case where the human/AI are using a Bayes net, and we do often think about ideas that use those Bayes nets (with the understanding that we’d ultimately need to refine them into approaches that don’t depend on having an explicit Bayes net).
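(For concreteness, here’s a minimal sketch of what “the human/AI uses a Bayes net” cashes out to in these counterexample discussions: an explicit factorized joint distribution over a few variables, with beliefs obtained by conditioning on observations. The variables, structure, and probabilities below are all invented for illustration; nothing here comes from the writeup.)

```python
from itertools import product

# Toy explicit "Bayes net" over three binary variables (all numbers invented):
# diamond -> tamper, and camera depends on both.
def p_diamond(d):            # prior that the diamond is still in the vault
    return 0.9 if d else 0.1

def p_tamper(t, d):          # chance someone tampered, given diamond status
    p = 0.2 if d else 0.05
    return p if t else 1 - p

def p_camera(c, d, t):       # chance the camera shows a diamond
    p = 0.9 if t else (0.99 if d else 0.01)   # a tamperer fakes the image
    return p if c else 1 - p

def joint(d, t, c):
    return p_diamond(d) * p_tamper(t, d) * p_camera(c, d, t)

def posterior_diamond(camera_shows_diamond):
    """Exact inference by enumeration: P(diamond | camera observation)."""
    num = sum(joint(True, t, camera_shows_diamond) for t in (False, True))
    den = sum(joint(d, t, camera_shows_diamond)
              for d, t in product((False, True), repeat=2))
    return num / den

print(posterior_diamond(camera_shows_diamond=True))   # ≈ 0.99
```

The point of the asymmetry in the question is that even this kind of explicit object is something we only have by assumption in the counterexamples, not something we can actually read off the human (or the neural net).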
> In fact, there’s not much difference between saying “the human uses a Bayes net” and “the human uses some arbitrary function F, and we worry the AI will figure out F and then use it to lie to us”.
I think that there isn’t much difference between the two in this case—I was reading the Bayes net example as just an illustration of the point that any AI powerful enough to pose a risk would be able to model humans with high fidelity. For that matter, I think the AI Bayes net was also for illustration; realistically, the AI could learn other methods of reasoning about the world, which might include Bayes nets in some form.
> Or am I actually wrong and it’s okay for a “builder” solution to assume we have access to the human Bayes net?
I think we can’t assume this naively, but that if you can figure out a competitive and trustworthy way to get this (like with AI assistance), then it’s fair game.
Okay, so if the builder solution can’t access the human Bayes net directly, that kills a “cheap trick” I had. But I think the idea behind the trick might still be salvageable. First, some intuition:
If the diamond has been replaced with a fake, and the owner asks, “is my diamond still safe?”, and we’re limited to a “yes” or “no” answer, then we should say “no”. Why? Because that will improve the owner’s world model and lead them to make better predictions, relative to hearing “yes”. (Not across the board: they will be surprised to see something shiny in the vault, whereas hearing “yes” would have prepared them better for that. But overall accuracy, weighted by how much they CARE about being right about it, should be higher for “no”.)
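(To make “overall accuracy, weighted by caring” slightly more concrete, here is a toy calculation. The list of follow-up questions, the care weights, and the probabilities are all made up for illustration.)

```python
# Each row: (question the owner might later check,
#            care weight,
#            P(owner predicts correctly | told "yes"),
#            P(owner predicts correctly | told "no")).
# The diamond has in fact been swapped for a shiny fake.
scenarios = [
    ("will I see something shiny in the vault?", 0.1, 0.95, 0.30),
    ("will the diamond pass a chemical test?",   0.6, 0.05, 0.90),
    ("will I be able to sell it as a diamond?",  0.3, 0.05, 0.90),
]

def care_weighted_accuracy(column):
    # column 2 = accuracy after hearing "yes", column 3 = after hearing "no"
    return sum(row[1] * row[column] for row in scenarios)

print("told 'yes':", care_weighted_accuracy(2))   # ≈ 0.14
print("told 'no': ", care_weighted_accuracy(3))   # ≈ 0.84 -- "no" wins despite the shiny surprise
```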
So: maybe we don’t want to avoid the human simulator. Maybe we want to encourage it and try to harness it to our benefit! But how to make this precise? Roughly speaking, we want our reporter to “quiz” the predictor (“what would happen if we did a chemical test on the diamond to make sure it has carbon?”) and then give the same quiz to its model of the human. The reporter should output whichever answer causes the human model to get the same answers on the reporter’s quiz as the predictor gets.
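(Here is a rough sketch of that proposal in code. `predictor_answer` and `human_model_answer` are hypothetical stand-ins for the predictor’s latent knowledge and the reporter’s internal human simulator; the stubs just hard-code one scenario so the example runs.)

```python
# Scenario: the diamond has secretly been swapped for a fake, so the honest reply is "no".
QUIZ = [
    "would a chemical test on the diamond find carbon?",
    "would the diamond weigh the same as the original?",
]

def predictor_answer(question):
    # Placeholder for the predictor's latent knowledge: it knows the gem is fake.
    return False

def human_model_answer(reply, question):
    # Placeholder for the reporter's human model: the human trusts the reply
    # and predicts quiz outcomes accordingly.
    return reply == "yes"

def reporter(candidate_replies=("yes", "no")):
    predictor_answers = [predictor_answer(q) for q in QUIZ]

    def match_score(reply):
        # How many quiz questions the simulated human would answer the same
        # way as the predictor after hearing this reply.
        human_answers = [human_model_answer(reply, q) for q in QUIZ]
        return sum(h == p for h, p in zip(human_answers, predictor_answers))

    # Output the reply that best aligns the human model's quiz answers with the predictor's.
    return max(candidate_replies, key=match_score)

print(reporter())   # -> "no"
```

In anything beyond this toy version, presumably the quiz would have to be generated rather than fixed, and the “human model” would be whatever model of the human the reporter has actually learned.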
Okay, that’s a bit vague, but I hope it’s clear what I’m getting at. If not, I can try to clarify. (Unless the vagueness is in my thoughts rather than in my “writeup”/paragraph.) Possible problem: how on earth do we train in such a way as to incentivize the reporter to develop a good human model? Just because we’re worried it will happen by accident doesn’t mean we know how to do it on purpose! (Though if it turns out we can’t do it on purpose, maybe that means it’s not likely to happen by accident, and therefore we don’t need to worry about dishonesty after all??)