This and your other recent post have raised my opinion of you significantly. I’m still pretty concerned about the capabilities outcomes of interpretability: both the general risk that opening the black box enough to find unhobblings accelerates timelines, and especially automated interp pipelines letting AIs optimize their internals[1], a likely trigger for hard RSI → game over. Still, it’s nice to see you’re seeing more of the strategic landscape than seemed apparent before.
please don’t do this