1a3orn comments on Help keep AI under human control: Palisade Research 2026 fundraiser

1a3orn 20 Dec 2025 0:44 UTC
4 points
2
I mean I think they fit together, no?

Like I think that if you’re following such a loop, then (one of) the examples that you’re likely to get is an example adversarial to human cognition, such that the is_scary() detector goes off when it’s not genuinely bad but just something that your is_scary() detector mistakenly bleeps at. And I think something like that is what’s going on concretely in the Chess-hacking paper.
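
To make the selection effect concrete, here is a toy sketch rather than a model of any actual research process. Everything in it is an assumption made for illustration: the hypothetical `is_scary()` detector fires on every genuinely bad case and also on a small fraction of benign ones, and the loop simply keeps whatever trips the detector. The base rate and false-positive rate are made-up numbers.

```python
import random


def is_scary(genuinely_bad: bool, rng: random.Random, false_positive_rate: float) -> bool:
    """Hypothetical detector: always fires on genuinely bad cases,
    and mistakenly fires on some benign ones."""
    return genuinely_bad or rng.random() < false_positive_rate


def search_for_scary_examples(
    n_candidates: int = 10_000,
    base_rate: float = 0.01,            # assumed share of genuinely bad cases
    false_positive_rate: float = 0.05,  # assumed share of benign cases that trip the detector
    seed: int = 0,
) -> list[bool]:
    """Run the loop: generate candidates, keep only those the detector flags.

    Returns one bool per flagged example, True if it was genuinely bad.
    """
    rng = random.Random(seed)
    flagged = []
    for _ in range(n_candidates):
        genuinely_bad = rng.random() < base_rate
        if is_scary(genuinely_bad, rng, false_positive_rate):
            flagged.append(genuinely_bad)
    return flagged


flagged = search_for_scary_examples()
precision = sum(flagged) / len(flagged)
print(f"{len(flagged)} flagged examples, {precision:.0%} genuinely bad")
```

Under these assumed rates, most of what the loop surfaces is detector error rather than genuine badness, even though the detector never misses a real case. The sketch only captures the base-rate part of the argument; an example that is actively adversarial to the detector would make the effect stronger.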

But like I’m 100% onboard with saying this is The True Generator of My Concern, albeit the more abstract one, whose existence I believe in because of what appears to me to be a handful of lines of individually less-important evidence, of which the paper is one.
- Raemon 20 Dec 2025 0:53 UTC
  2 points
  0
  (I deleted this because I wrote it thinking your thing was a response to another thread. I do think you’re currently basically wrong about your original object-level point, but a response to that should probably go as a reply here)