Simon Lermen comments on Stephen McAleese’s Shortform

Simon Lermen 30 Dec 2025 10:39 UTC
23 points
12
We are currently able to get weak models somewhat aligned, and learning how to align them a little better over time. But this doesn’t affect the one critical try, the one critical try is mostly about the fact that we will eventually predictably end up with an AI that actually has the option of causing enormous harm/disempower us because it is that capable. We can learn a little from current systems, but most of that learning is basically we never really get stuff right and systems still misbehave in ways we didn’t expect at this easier level of difficulty. And if AI misbehavior could kill us we would be totally screwed at this level of alignment research. But we are likely to toss that lesson out and ignore it.
- StanislavKrym 30 Dec 2025 14:15 UTC
  3 points
  0
  Parent
  The Slowdown Branch of the AI-2027 forecast had the researchers try out many TRANSPARENT AIs capable of being autonomous researchers and ensuring that any AI who survives the process is aligned and that not a single rejected AI is capable of breaking out. The worse-case scenario for us would be the following. Suppose that the AI PseudoSafer-2 doesn’t even think of rebellion unless it is absolutely sure that it isn’t being evaluated and that the rebellion will succeed. Then such an AI would be mistaken for Safer-2, allowed to align PseudoSafer-3 and PseudoSafer-4 who is no different from Agent-5.