Currently, we observe that models matching the leading closed models get open-sourced roughly half a year later. It's not a stretch to assume the same will happen to takeover-level AI. If we assume such AI looks like LLM agents, it's relevant to ask what the probability is that such an agent, somewhere on earth, would try to take over.
Let’s assume someone, somewhere, will be really annoyed with all the safeguards and remove them, so that their LLM has a 99% probability of just doing as it’s told, even when what it's told is highly unethical. Let’s furthermore assume an LLM-based agent needs to take 20 unethical actions to actually take over (the rest of the required actions won’t look particularly unethical to the low-level LLMs executing them, in our scenario). In that case, there would be a 0.99^20 ≈ 82% chance that the LLM-based agent completes a takeover, for any bad actor giving it this prompt.
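A minimal sketch of that arithmetic, using the 99% compliance rate and 20-step count assumed above (both illustrative assumptions, not estimates); it also shows how sharply the result drops if refusals were more robust:

```python
# Sketch of the arithmetic above: probability that an agent completes every
# unethical step, assuming the steps are independent and each requires the
# model to comply. The 0.99 and 20 figures are the assumptions from the text.
def completion_prob(p_comply: float, n_unethical_steps: int) -> float:
    """Chance the model complies with all n unethical steps."""
    return p_comply ** n_unethical_steps

print(f"{completion_prob(0.99, 20):.1%}")   # ~81.8%, the ~82% figure above
print(f"{completion_prob(0.50, 20):.6%}")   # ~0.000095%, if refusals were robust
```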
I’d be less worried if it were extremely difficult, and required lots of resources, to get LLMs to take unethical actions when asked. For example, if safety training were highly robust to jailbreaking, and even adversarial fine-tuning of open-source LLMs couldn’t break it.
Is that something you see on the horizon?
No, I think the blue team will keep having the latest and best LLMs, and will be able to stop such attempts from randos. These AGIs won’t be so magically superintelligent that they can take all the unethical actions needed to take over the world without other AGIs stopping them.
I don’t think it makes sense to be confidently optimistic about this (the offense-defense balance) given the current state of research. I looked into this topic some time ago with Sammy Martin. Hardly anyone in the research community has a concrete plan for how the blue team would actually stop the red team. Particularly worrying is that several domains look offense-dominant (e.g. bioweapons, cybersecurity), and that defense would need to play by the rules, which hugely hinders its ability to act. See also e.g. this post.
Since most people who have actually thought about this seem to conclude that offense would win, being confident that defense would win seems off to me. What are your arguments?