If safety literally came down to sensor hardening, I do think cryptographic mechanisms (particularly tamper-proof hardware with cryptographic secrets that destroys itself if it detects trouble) seem like a relevant tool, and it’s quite plausible you could harden sensors even against wildly superhuman attackers.
It’s an insane-feeling scenario—holistically I doubt it will matter for a variety of reasons, and from a worst-case perspective it’s still not something you can rely on—but I do think there’s some value in pinning these things down.
(In this particular case I think that the weak point is definitely the tagging step. I think cryptographic mechanisms help a huge amount with the issue where a model could intercept signals coming into the datacenter, but you are still in the position where you need your sensors to detect trouble so that the hardware can destroy the secret.)
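To make the mechanism concrete, here is a toy Python sketch of the tamper-responsive-secret idea, with all class and method names hypothetical and the "hardware" reduced to an object holding a key. The sensor signs its readings with a secret; the hardware erases that secret the moment tampering is detected, so no valid attestations can be produced afterward. The comment on the tamper handler marks the weak point discussed above: everything hinges on the detection step actually firing.

```python
import hmac
import hashlib
import secrets


class TamperEvidentSensor:
    """Toy model: a sensor whose attestations depend on a secret
    that the hardware erases as soon as it detects tampering."""

    def __init__(self):
        # Secret provisioned at manufacture; in real hardware it would
        # live in tamper-proof storage, not ordinary memory.
        self._key = secrets.token_bytes(32)

    def attest(self, reading: bytes):
        # Once the key is destroyed, the sensor can no longer produce
        # valid signatures, which is itself evidence of tampering.
        if self._key is None:
            return None
        return hmac.new(self._key, reading, hashlib.sha256).digest()

    def on_tamper_detected(self):
        # The crux from the discussion above: this only runs if the
        # sensor actually *detects* the intrusion. An attacker who
        # evades detection defeats the whole scheme.
        self._key = None


def verify(key: bytes, reading: bytes, tag: bytes) -> bool:
    # Symmetric verification for the toy; a real deployment would more
    # plausibly use signatures so the verifier holds no secret.
    expected = hmac.new(key, reading, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)
```

In this sketch the cryptography handles the "intercept signals coming into the datacenter" problem (forged readings won't verify), but nothing in the code can help with an attacker who fools the physical detection that triggers `on_tamper_detected`.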
I’d really like to have a better solution to alignment than one that relied entirely on something comparable to sensor hardening.
What are your thoughts on how value learning interacts with ELK? Obviously the issue with value learning is that it only helps with outer alignment, not inner alignment: you’re transforming the problem from “How do we know the machine isn’t lying to us?” to “How do we know that the machine is actually trying to learn what we want (which includes not being lied to)?” It also explicitly requires the machine to build a model of “what humans want”, and then the complexity level and latent knowledge content required is fairly similar between “figure out what the humans want and then do that” and “figure out what the humans want and then show them a video of what doing that would look like”.
Maybe we should just figure out some way to do surprise inspections on the vault? :-)
I agree that it seems very bad if we build AI systems that would “prefer” to tamper with sensors (including killing humans if necessary) but are prevented from doing so by physical constraints.
I currently don’t see how to approach value learning (in the worst case) without solving something like ELK. If you want to take a value learning perspective, you could view ELK as a subproblem of the easy goal inference problem. If there’s some value learning approach that routes around this problem I’m interested in it, but I haven’t seen any candidates and have spent a long time talking with people about it.