Ramana Kumar comments on ARC’s first technical report: Eliciting Latent Knowledge

Ramana Kumar 20 Dec 2021 11:41 UTC
LW: 1 AF: 1
Thanks for the reply! I think you’ve understood correctly that the human rater needs to understand the proposed experiment – i.e., be able to carry it out and have a confident expectation about the outcome – in order to rate the proposer highly.
Here’s my summary of your point: for some tampering actions, no experiment that a human would understand in the above sense would expose the tampering. The experiment proposer then has no winning strategy and receives low value, so the tampering action gets rated highly.
This is a crux for me. I don’t yet believe such tampering exists. The intuition I’m drawing on here is that our beliefs about what world we’re in need to cash out in anticipated experiences. Exposing confusion about something that shouldn’t be confusing can be a successful proposer strategy. I appreciate your examples of “a fake diamond that can only be exposed by complex imaging techniques” and “a human making subtly different moral judgements” and will ponder them further.
Your comment also helped me realise another danger of this strategy: to get the data for training the experiment proposer, we have to execute the SmartVault actions first. (Whereas I think in the baseline scheme they don’t have to be executed.)
- Mark Xu 20 Dec 2021 18:25 UTC
  LW: 3 AF: 2
  My point is either that:
  - it will always be possible to find such an experiment for any action, even a desirable one, because the AI will have defended the diamond in a way the human didn’t understand, or the AI will have deduced some property of diamonds that humans thought diamonds didn’t have
  - or there will be some tampering for which it’s impossible to find an experiment, because in order to avoid the above problem, you will have to restrict the space of experiments