This take is a bit frustrating to me, because the preprint does discuss Rajamanoharan & Nanda’s result; in particular, when we tried Rajamanoharan & Nanda’s strongest prompt clarification on the other models in our initial set, it did not in fact bring the rate to zero. That’s not to say it would be impossible to find a prompt that drives the rate low enough to be entirely undetectable for all models: of course you could find such a prompt if you knew you needed to look for one.
the preprint does discuss Rajamanoharan & Nanda’s result
I apologize; I read the July blog post and then linked to the September paper in my comment without checking whether the paper had new content. I will endeavor to be less careless in the future.