Responses to o4-mini-high’s final criticisms of the post:
Criticism: “You’re treating hyper-introspection (internal transparency) as if it naturally leads to embedded agency (full goal-driven self-modification). But in practice, these are distinct capabilities. Why do you believe introspection tools would directly lead to autonomous, strategic self-editing in models that remain prediction-optimized?”
Response: Yes, these are distinct, and one won't necessarily lead to the other, but both are being developed by the same groups, which intend to deploy them. There's a reasonable question about how tightly linked they are, but I think there is a strong case that self-modification via introspection, even if done only during training and internal deployment, would lead to much more dangerous and harder-to-track deception.
Criticism: “You outline very plausible risks but don’t offer a distribution over outcomes. Should we expect hyper-introspection to make systems 10% more dangerous? 1000%? Under what architectures? I’d find your argument stronger if you were more explicit about the conditional risk landscape.”
Response: If we don't solve ASI alignment, which no one seems to think we can do, we're doomed once we build a misaligned ASI. Hyper-introspection seems to get us there more quickly. Perhaps it even reduces short-term risks, but I think timelines are far more uncertain than the way the risks will emerge if we build systems with these capabilities.
Criticism: “Given that fully opaque systems are even harder to oversee, and that deception risk grows with opacity too, shouldn’t we expect that some forms of introspection are necessary for any meaningful oversight? I agree hyper-introspection could be risky, but what’s the alternative plan if we don’t pursue it?”
Response: Don't build smarter-than-human systems. If you are not developing ASI, and you want to monitor current and near-future systems, which are not inevitably existentially dangerous, then work on how humans can provide meaningful oversight in deployment rather than on tools that enhance capabilities and accelerate the race. Without fixing the underlying dynamics, i.e. solving alignment, self-monitoring is a doomed approach.
Criticism: “You assume that LLMs could practically trace causal impact through their own weights. But given how insanely complicated weight-space dynamics are even for humans analyzing small nets, why expect this capability to arise naturally, rather than requiring radical architectural overhaul?”
Response: Yes, maybe Anthropic and others will fail, and building smarter-than-human systems might not be possible. In that case, strong interpretability is just a capability enhancer and doesn't materially change the largest risks. That would be great news, but I don't want to bet my kids' lives on it.