paulfchristiano comments on What’s wrong with these analogies for understanding Informed Oversight and IDA?

paulfchristiano 23 Mar 2019 16:11 UTC
LW: 2 AF: 1
Can you quote these examples? The word “example” appears 27 times in that post and looking at the literal second and third examples, they don’t seem very relevant to what you’ve been saying here so I wonder if you’re referring to some other examples.
Subsections “Modeling” and “Alien reasoning” of “Which C are hard to epistemically dominate?”
What I’m inferring from this (as far as a direct answer to my question) is that an overseer trying to do Informed Oversight on some ML model doesn’t need to reverse engineer the model enough to fully understand what it’s doing, only enough to make sure it’s not doing something malign, which might be a lot easier, but this isn’t quite reflected in the formal definition yet or isn’t a clear implication of it yet. Does that seem right?
You need to understand what facts the model “knows.” This isn’t value-loaded or sensitive to the notion of “malign,” but it’s still narrower than “fully understand what it’s doing.”
As a simple example, consider linear regression. I think that linear regression probably doesn’t know anything you don’t. Yet doing linear regression is a lot easier than designing a linear model by hand.
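To make the linear regression point concrete, here is a minimal sketch in Python. The function name `fit_line` and the toy data are hypothetical, chosen only for illustration. One possible reading of "doesn't know anything you don't" is that every quantity the fit computes is a simple statistic of data you already have, so the learned parameters can be inspected directly, even though nobody picked them by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature, closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitting procedure is a few lines of arithmetic, whereas choosing a good slope and intercept by hand would require reasoning about the data yourself. That gap in effort is the "a lot easier" part of the example; it does not by itself say anything about larger models.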
If that’s what you do, it seems “P outputs true statements just in the cases I can check.” could have a posterior that’s almost 50%, which doesn’t seem safe, especially in an iterated scheme where you have to depend on such probabilities many times?
Where did 50% come from?
Also, “P outputs true statements in just the cases I check” is probably not catastrophic in itself; it only becomes catastrophic once P performs optimization in order to push the system off the rails.