I’d really like to have a better solution to alignment than one that relies entirely on something comparable to sensor hardening.
What are your thoughts on how value learning interacts with ELK? Obviously the issue with value learning is that it only helps with outer alignment, not inner alignment: you’re transforming the problem from “How do we know the machine isn’t lying to us?” to “How do we know that the machine is actually trying to learn what we want (which includes not being lied to)?” It also explicitly requires the machine to build a model of “what humans want”, and then the complexity level and latent knowledge content required are fairly similar between “figure out what the humans want and then do that” and “figure out what the humans want and then show them a video of what doing that would look like”.
Maybe we should just figure out some way to do surprise inspections on the vault? :-)
I agree that it seems very bad if we build AI systems that would “prefer” to tamper with sensors (including killing humans if necessary) but are prevented from doing so by physical constraints.
I currently don’t see how to approach value learning (in the worst case) without solving something like ELK. If you want to take a value learning perspective, you could view ELK as a subproblem of the easy goal inference problem. If there’s some value learning approach that routes around this problem I’m interested in it, but I haven’t seen any candidates and have spent a long time talking with people about it.