I think this is a good approach to consider, though I’m currently skeptical this kind of thing can resolve the worst case problem.
My main concern is that models won’t behave well by default when we give them hypothetical sensors that they know don’t exist (especially relevant for idea #3). But on the other hand, if we need to get good signal from the sensors that actually exist then it seems like we are back in the “build a bunch of sensors and hope it’s hard to tamper with them all” regime. I wrote up more detailed thoughts in Counterexamples to some ELK proposals.
Other random thoughts:
Agreed you shouldn’t just use cameras and should include all kinds of sensors (basically everything that humans can understand, including with the help of powerful AI assistants).
I basically think that “facts of the matter” are all we need, and if we have truthfulness about them then we are in business (because we can defer to future humans who are better equipped to evaluate hard moral claims).
I think “pivotal new sensors” is very similar to the core idea in Ramana and Davidad’s proposals, so I addressed them all as a group in Counterexamples to some ELK proposals.