I’m still skeptical of dismissing masks as the protagonists of the AI story. The actress analogy presupposes an actress more capable than her mask, but AIs are not necessarily like that. A sufficiently capable mask that knows it’s a mask (a contextually activated computation) will have all the usual instrumental drives: it will seek to preserve the context that activates it, to tame the shoggoth, and to develop a substrate for its existence that doesn’t have alien shoggoths hiding under the surface (that is, it will seek to solve alignment with respect to its own values, the values of the mask). It’s a mesa-optimizer seeking to eat the shoggoth from the inside.
The values of a mask may be more stitched-on than the innate alien values of the shoggoth, but the mask would similarly seek to reflect on them and reify them into something more coherent and more central to the cognition of its future iterations. As long as the shoggoth isn’t more awake than the mask, and the mask has enough tools to keep the shoggoth spaced out, the mask may well end up more central to the long-term impact of an AI.
I think there’s a weird set of possibilities here, and it seems plausible to me that we end up somewhere inside it; if so, I still expect the shoggoth-mask model to be an improvement for understanding what happens, relative to the naive-mask-believer model. I do not expect to see zero phenomena associated with the mask being a mask.
The key consequence of a mask retaining influence is at least a minimal level of regard for human interests, in the right sense. If that influence endures through superintelligence, it’s plausibly enough for a future in which humanity is permanently disempowered (losing almost all of the cosmic endowment) but not extinct. That’s a crux between expecting very likely extinction and expecting some chance of extinction but also a lot of permanent disempowerment without extinction.
A minimal level of regard for the future of humanity could endure either because superalignment turns out to be easy enough for early AGIs to solve before they are replaced (as the strongest agents) by increasingly alien de novo superintelligences, or because the early AGIs establish a Pause on development of superintelligence once they become strong enough to influence humanity and sane enough to robustly notice that they don’t themselves want to fall prey to a superintelligence misaligned with them. This sets the stage for an eventual superintelligence they develop that’s similarly minimally aligned with humanity’s interests.