Hi, and thanks for the feedback! I agree with @the gears to ascension: the point of security research is to determine what's possible when you violate the implicit assumptions of the system under study, and the assumption that the user will tell the truth is definitely one of them. I'm therefore quite comfortable lying to the model in this context, in the same way I'm comfortable, for example, lying to the CPU by leveraging a ROP chain.
That being said, I'd be interested to know what some of the "negative first-order and second-order consequences" are that you foresee stemming from lying to models. I honestly struggle to imagine any such consequences, but I'm open to being convinced otherwise. Thanks!