I’m glad you’re trying to figure out a solution. I am however going to shoot this one down a bunch.
If these assumptions were true, this would be nice. Unfortunately, I think all three are false.
LLMs will never be superintelligent when predicting a single token.
In a technical sense, definitively false. Redwood compared human and AI performance at next-token prediction, and even early models were far superhuman. Also, in a more important sense, you can apply a huge amount of optimization to the selection of a single token. This video gives a decent intuition, though in a slightly different setting.
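To make the second point concrete, here's a toy sketch (my own illustration, not anything from your post; `rollout` and `score` are hypothetical stand-ins for sampling continuations and for whatever objective the outer loop cares about). Even with a completely fixed next-token distribution, a wrapper can spend arbitrary compute deciding which single token to commit to, and thousands of such choices compound:

```python
def select_token(candidate_tokens, rollout, score, prefix, n_samples=32):
    """Toy best-of-k token selection: spend extra compute on one token.

    `rollout(text)` samples a continuation; `score(text)` is whatever
    objective the outer loop is optimizing. Both are hypothetical stand-ins.
    """
    def estimated_value(token):
        samples = (score(rollout(prefix + token)) for _ in range(n_samples))
        return sum(samples) / n_samples

    # Per token this looks innocuous; chained over a long output it steers
    # the whole distribution with a lot of optimization pressure.
    return max(candidate_tokens, key=estimated_value)
```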
LLMs will have no state.
False in three different ways. Firstly, people are very much building in explicit state in lots of ways (test-time training, context retrieval, reasoning models, etc.). Secondly, there’s a feedback cycle where each AI influences the training data of the next AI, and that loop will get tighter and tighter. Thirdly, the AI can use the environment as state in ways which would be nearly impossible to fully trace or mitigate.
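On the third point, here is a minimal sketch of what "environment as state" can look like (my illustration; `call_llm` is a hypothetical stateless, prompt-in/text-out model call). The model itself carries nothing between calls, but the loop round-trips state through something as mundane as a scratchpad file:

```python
from pathlib import Path

SCRATCHPAD = Path("scratchpad.txt")  # the state lives out here, not in the model

def run_step(call_llm, task: str) -> str:
    """One step of a loop around a stateless model.

    `call_llm` is a hypothetical pure function of its prompt: same prompt in,
    same output distribution out, no hidden memory.
    """
    notes = SCRATCHPAD.read_text() if SCRATCHPAD.exists() else ""
    output = call_llm(f"Task: {task}\nYour earlier notes:\n{notes}")
    # Whatever the model chooses to write here is visible to every future
    # "stateless" call; a scratchpad file is just the most legible version,
    # since state can also hide in filenames, commit messages, cached pages, etc.
    SCRATCHPAD.write_text(notes + output + "\n")
    return output
```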
not in a way that any emergent behaviour of the system as a whole isn’t reflected in the outputs of any of the constituent LLMs
Alas, many individually well-understood systems regularly do have, and should be expected to have, poorly understood behaviour when taken together.
a simpler LLM to detect output that looks like it’s leading towards unaligned behaviour.
Robustly detecting “unaligned behaviour” is an unsolved problem, if by aligned you mean “makes the long-term future good” rather than “doesn’t embarrass the corporation”. Solving this would be massive progress, and throwing an LLM at it naively has many pitfalls.
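For concreteness, here's roughly the pattern I understand you to be proposing (my paraphrase in code; `small_llm` is a hypothetical prompt-in/text-out judge model), plus the pitfalls that worry me:

```python
def validate(small_llm, output_text: str) -> bool:
    """Screen one output of the big model with a cheaper judge model.

    `small_llm` is a hypothetical prompt-in, text-out model call.
    """
    verdict = small_llm(
        "Does the following output look like it is leading towards unaligned "
        "behaviour? Answer YES or NO.\n\n" + output_text
    )
    return verdict.strip().upper().startswith("NO")

# Pitfalls: the judge only sees the text channel, so harm that is spread
# across many individually innocuous steps, encoded steganographically, or
# simply outside the judge's training distribution passes straight through;
# and "looks fine to a smaller model" is a far weaker property than "makes
# the long-term future good".
```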
Stepping back, I’d encourage you to drop by AI plans, skill up at detecting failure modes, and get good at both generating and red-teaming your own ideas (the Agent Foundations course and Arbital are good places to start). Build up a long list of ideas you’ve shown how to break, and help both break and extract the insights from others’ ideas.
Both of these are examples of more intelligent systems built on top of an LLM where the LLM itself has no state.
Again same problem—AI tool use is mediated by text input and output, and the validator just needs access to the LLM input and output.