I mean I think they fit together, no?
Like I think that if you’re following such a loop, then (one of) the examples that you’re likely to get is an example adversarial to human cognition, such that the is_scary() detector goes off when it’s not genuinely bad but just something that your is_scary() detector mistakenly bleeps at. And I think something like that is what’s going on concretely in the Chess-hacking paper.
But like I’m 100% onboard with saying this is The True Generator of My Concern, albeit the more abstract one whose existence I believe in because of (what appears to me to be) a handful of lines of individually less-important evidence, of which the paper is one.
(I deleted this because I wrote it thinking your thing was a response to another thread. I do think you’re currently basically wrong about your original object-level point, but a response to that should probably go here.)