or by subverting the system through some design or implementation flaw
I discuss the most concerning-to-me instance of this in problem (1) here; it seems like that discussion applies equally well to anything that might work fine at first but then break when you become a sufficiently smart reasoner.
I think the basic question is whether you can identify and exploit such flaws at exactly the same time that you recognize their possibility, or whether you can notice them slightly before. By “before” I mean with a version of you that is less clever, has less time to think, has a weaker channel to influence the world, or is treated with more skepticism and caution.
If any of these versions of you can identify the looming problem in advance, and then explain it to the aliens, then they can correct the problem. I don’t know if I’ve ever encountered a possible flaw that wasn’t noticeable “before” it was exploitable in one of these senses. But I may just be overlooking them, and of course even if we can’t think of any it’s not such great reassurance.
Of course even if you can’t identify such flaws, you can preemptively improve the setup for the aliens, in advance of improving your own cognition. So it seems like we never really care about the case where you are radically smarter than the designer of the system; we care about the case where you are very slightly smarter. (Unless this system-improvement is a significant fraction of the difficulty of actually improving your cognition, which seems far-fetched.)
The point is that my behavior while my abilities are less than super-alien is not a very good indication of how safe I will eventually be.
Other than the issue from the first part of this comment, I don’t really see why the behavior changes (in a way that invalidates early testing) when you become super-alien in some respects. It seems like you are focusing on errors you may make that would cause you to receive a low payoff in the RL game. As you become smarter, I expect you to make fewer such errors. I certainly don’t expect you to predictably make more of them.
(I understand that this is a bit subtle, because as you get smarter the problem also may get harder, since your plans will e.g. be subject to more intense scrutiny and to more clever counterproposals. But that doesn’t seem prone to lead to the kinds of errors you discuss.)
Other than the issue from the first part of this comment, I don’t really see why the behavior changes (in a way that invalidates early testing) when you become super-alien in some respects. It seems like you are focusing on errors you may make that would cause you to receive a low payoff in the RL game. As you become smarter, I expect you to make fewer such errors.
Paraphrasing, I think you’re saying that, if the reinforcement game setup continues to work, you expect to make fewer errors as you get smarter. And the only way getting smarter hurts you is if it breaks the game (by enabling you to fall into traps faster than you can notice and avoid them).
Is that right?