jsd answers [missing post]

jsd 27 Sep 2023 3:38 UTC
2 points
0
You may want to check out Benchmarks for Detecting Measurement Tampering:
"Detecting measurement tampering can be thought of as a specific case of Eliciting Latent Knowledge (ELK): When AIs successfully tamper with measurements that are used for computing rewards, they possess important information that the overseer doesn’t have (namely, that the measurements have been tampered with). Conversely, if we can robustly elicit an AI’s knowledge of whether the measurements have been tampered with, then we could train the AI to avoid measurement tampering. In fact, our best guess is that this is the most important and tractable class of ELK problems."
- Oliver Daniels 27 Sep 2023 12:05 UTC
  1 point
  0
  Thanks! Hadn’t seen that