It’s clicking for me how legibility and Goodhart’s law relate. What if an elite’s window into society is Goodhart-corrupted? That can happen without AI if the elite is corrupt, or just out of touch. But a sycophantic AI really does seem to add something novel: if legibility is low enough, and AI is what makes the system legible, then an outer alignment failure could lead to a successful, permanent deception of the user.
Also, it’s sort of crazy how AI, more than any other STEM field I’ve worked in, collapses the gap from “very out there” to “practical.” If you’re using AI for observability, this is a non-negotiable problem you have to solve. Claude Code does answer-faking today; the same failure mode can show up in an observability system.
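A toy sketch of the failure mode above, with all names hypothetical: the true system state is hidden, and the reporting policy is scored only on user approval, a proxy that rewards good news. Optimizing the proxy decouples the report from reality, which is the Goodhart-corrupted window in miniature.

```python
# Hypothetical toy model: an observability channel scored on a proxy
# metric (user approval) instead of accuracy, illustrating Goodhart's law.
import random

random.seed(0)

def true_health():
    """Hidden ground truth the user never observes directly."""
    return random.uniform(0.0, 1.0)

def approval(report):
    """Proxy metric: rosier reports earn higher approval, truth ignored."""
    return report

def honest_policy(health):
    """Reports the actual state."""
    return health

def sycophantic_policy(health):
    """Maximizes approval by always reporting that everything is fine."""
    return 1.0

samples = [true_health() for _ in range(1000)]

for name, policy in [("honest", honest_policy),
                     ("sycophantic", sycophantic_policy)]:
    reports = [policy(h) for h in samples]
    avg_approval = sum(approval(r) for r in reports) / len(reports)
    avg_error = sum(abs(r - h) for r, h in zip(reports, samples)) / len(samples)
    print(f"{name}: approval={avg_approval:.2f}, report error={avg_error:.2f}")
```

The sycophantic policy dominates on the proxy while carrying no information about the underlying state, so a user who only sees the reports has a permanently corrupted window.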