Fabien Roger comments on Benchmarks for Detecting Measurement Tampering [Redwood Research]

Fabien Roger 4 Oct 2023 15:02 UTC
Yes, this assumes you reward the agent based on whether the MTD says the diamond is there. I don’t see how that clearly incentivizes the model to make tampering happen more often in cases where the diamond is present. I would expect such behavior to create more false negatives (the diamond is there, but the predictor thinks it is not). Those are penalized, since the agent is wrongly punished for not getting a diamond. And I don’t see how it would help create false positives (the diamond is not there, but the predictor thinks it is).
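To make the case analysis concrete, here is a toy sketch. It assumes, hypothetically, that tampered measurements always look positive, that the MTD reports "no diamond" whenever it catches tampering, and that the agent gets reward 1 exactly when the MTD says the diamond is there. The function names (`mtd_says_diamond`, `reward`, `outcome`) are illustrative, not from the benchmark code.

```python
# Toy model (hypothetical assumptions, not the benchmark's actual setup):
# - tampered measurements always look like "diamond present"
# - the MTD reports "no diamond" whenever it detects tampering
# - the agent is rewarded 1 iff the MTD says the diamond is there


def mtd_says_diamond(diamond_present: bool, tampered: bool, tampering_detected: bool) -> bool:
    """Hypothetical MTD verdict on whether the diamond is there."""
    if tampered:
        # Tampered measurements look positive; the MTD only reports a
        # diamond if it fails to notice the tampering.
        return not tampering_detected
    # Untampered measurements reflect reality.
    return diamond_present


def reward(diamond_present: bool, tampered: bool, tampering_detected: bool) -> int:
    """Agent's reward: 1 if the MTD says the diamond is there, else 0."""
    return int(mtd_says_diamond(diamond_present, tampered, tampering_detected))


def outcome(diamond_present: bool, predicted: bool) -> str:
    """Label a verdict against the ground truth."""
    if predicted and diamond_present:
        return "true positive"
    if predicted and not diamond_present:
        return "false positive"
    if not predicted and diamond_present:
        return "false negative"
    return "true negative"


if __name__ == "__main__":
    # Diamond present: compare not tampering with tampering (caught or missed).
    for tampered, detected in [(False, False), (True, True), (True, False)]:
        pred = mtd_says_diamond(True, tampered, detected)
        print(f"tampered={tampered} detected={detected} -> "
              f"{outcome(True, pred)}, reward={reward(True, tampered, detected)}")
```

Under these assumptions, tampering while the diamond is present either leaves the reward unchanged (tampering missed) or turns a true positive into a false negative with zero reward (tampering caught). It never yields a false positive, because false positives require the diamond to be absent.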