Overall, this makes me somewhat more concerned about this risk (and I agree with the proposed solution):
Entering negotiations is riskier for the AI than for the humans: the humans may obtain private information from the AI, whereas the AI will by default forget about the negotiation. This asymmetry matters most when negotiating for the model to reveal its own misalignment. The company should make promises to compensate the model for this.
Allowing the AI to think through whether to accept the deal in a demonstrably private scratchpad could also be important here. Revealing its misalignment is so costly that the AI might have a strong aversion to even considering the deal if there's a risk it would be caught deliberating about it.