In that case, you can still try to be a straightforward Bayesian about it, and say “our intuition supports the general claim that process P outputs true statements”; you can then apply that regularity to trust P on some new claim even if it’s not the kind of claim you could verify, as long as “P outputs true statements” has a higher prior than “P outputs true statements just in the cases I can check.”
If that’s what you do, it seems “P outputs true statements just in the cases I can check” could have a posterior that’s almost 50%, which doesn’t seem safe, especially in an iterated scheme where you have to depend on such probabilities many times. Don’t you need to reduce the posterior probability to a negligible level instead?
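To make the worry concrete, here is a minimal numerical sketch. The prior values are purely illustrative assumptions, and treating the repeated reliances as independent at the end is a simplification, not a claim about any particular scheme.

```python
# Illustrative sketch only: made-up priors, and both hypotheses are assumed to
# predict the checkable evidence equally well.
prior_general = 0.55     # "P outputs true statements"
prior_check_only = 0.45  # "P outputs true statements just in the cases I can check"

# Every checked claim came out true; both hypotheses assign this probability 1,
# so Bayes' rule leaves the ratio between them unchanged.
likelihood_general = 1.0
likelihood_check_only = 1.0
evidence = prior_general * likelihood_general + prior_check_only * likelihood_check_only
posterior_check_only = prior_check_only * likelihood_check_only / evidence
print(posterior_check_only)  # 0.45 -- "almost 50%" whenever the priors start out close

# If an iterated scheme has to rely on unverifiable claims like this at many steps,
# and we (simplistically) treat those reliances as independent, the chance that every
# step is covered by the benign hypothesis decays quickly.
posterior_general = 1.0 - posterior_check_only
print(posterior_general ** 20)  # roughly 6e-6 for 20 independent reliances
```

Whether the reliances are actually independent depends on the scheme; the point is just that per-step probabilities in this range don’t obviously stay safe once you have to multiply many of them together.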
See the second and third examples in the post introducing ascription universality.
Can you quote these examples? The word “example” appears 27 times in that post, and the literal second and third examples don’t seem very relevant to what you’ve been saying here, so I wonder if you’re referring to some other examples.
There is definitely a lot of fuzziness here and it seems like one of the most important places to tighten up the definition / one of the big research questions for whether ascription universality is possible.
What I’m inferring from this (as far as a direct answer to my question) is that an overseer trying to do Informed Oversight on some ML model doesn’t need to reverse engineer the model enough to fully understand what it’s doing, only enough to make sure it’s not doing something malign, which might be a lot easier. But this isn’t quite reflected in the formal definition yet, or at least isn’t a clear implication of it yet. Does that seem right?
Can you quote these examples? The word “example” appears 27 times in that post, and the literal second and third examples don’t seem very relevant to what you’ve been saying here, so I wonder if you’re referring to some other examples.
Subsections “Modeling” and “Alien reasoning” of “Which C are hard to epistemically dominate?”
What I’m inferring from this (as far as a direct answer to my question) is that an overseer trying to do Informed Oversight on some ML model doesn’t need to reverse engineer the model enough to fully understand what it’s doing, only enough to make sure it’s not doing something malign, which might be a lot easier. But this isn’t quite reflected in the formal definition yet, or at least isn’t a clear implication of it yet. Does that seem right?
You need to understand what facts the model “knows.” This isn’t value-loaded or sensitive to the notion of “malign,” but it’s still narrower than “fully understand what it’s doing.”
As a simple example, consider linear regression. I think that linear regression probably doesn’t know anything you don’t. Yet doing linear regression is a lot easier than designing a linear model by hand.
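To illustrate that contrast, here is a small sketch with toy data and a hypothetical “overseer check” (nothing from the post itself): fitting the regression is cheap, and the fitted model’s entire content is a coefficient vector the overseer can simply read off.

```python
# Toy illustration: a fitted linear model is easy to produce, yet everything it
# "knows" is visible in its coefficients.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # toy labels

# "Training": ordinary least squares, much easier than picking these weights by hand.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# "Oversight": the model's entire state is w, so inspecting it reveals every fact
# the model is using; the overseer isn't missing anything the model knows.
print(w)
```

The hard cases are instead models whose internal computations encode facts the overseer can’t cheaply extract; there, reading off what the model “knows” is the expensive step.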
If that’s what you do, it seems “P outputs true statements just in the cases I can check” could have a posterior that’s almost 50%, which doesn’t seem safe, especially in an iterated scheme where you have to depend on such probabilities many times.
Where did 50% come from?
Also, “P outputs true statements in just the cases I check” is probably not catastrophic; it only becomes catastrophic once P performs optimization in order to push the system off the rails.