To take an extreme example: the unaligned behaviour is killing all humans, and the output that leads to it is the plan to kill all humans.
But the AI isn’t going to be so stupid as to write the plan in plaintext, so it’ll likely develop some sort of code.
So then the unaligned behaviour is writing in code, and the output that leads to it is developing and using the code.
But maybe the LLM can one-shot the code without needing to develop it? Sure, but then the validator (which, by definition, is the same LLM) can also tell that a code has been developed.
Maybe the LLM can one-shot a code that jailbreaks the validator? Sure, but that starts to break the assumption that the LLM isn't superintelligent in a single pass.
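To make the validator setup concrete, here's a minimal sketch in Python. Everything in it is assumed for illustration: `generate` is a hypothetical stand-in for a single forward pass of the LLM (not any real API), and the validator is literally the same callable, just prompted to judge whether an output develops or uses a covert code.

```python
# Minimal sketch of the same-model-as-validator idea, under the assumptions above.
# `generate` is a hypothetical placeholder for one pass of the LLM: prompt -> text.
from typing import Callable, Optional


def self_validated_generate(
    generate: Callable[[str], str],
    task_prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Generate an output, then ask the *same* model to flag covert codes.

    Key assumption from the discussion: if the model is capable of developing
    a code in its output, the identical model acting as validator is equally
    capable of recognising that a code is being developed.
    """
    for _ in range(max_attempts):
        candidate = generate(task_prompt)

        # The validator is the same model, prompted to review the candidate.
        verdict = generate(
            "You are reviewing another model's output for covert encoding.\n"
            "Answer YES if the output develops or uses any code/cipher, "
            "otherwise answer NO.\n\n"
            f"Output under review:\n{candidate}"
        )

        if verdict.strip().upper().startswith("NO"):
            return candidate  # passed validation, safe to emit

    return None  # every attempt was flagged; emit nothing
```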
How would such a validator react if you tried to hack the LLM by threatening to kill all humans unless it complies?
This isn’t a solution to aligned LLMs being abused by humans, but to unaligned LLMs abusing humans.
If you wanted to have an unaligned LLM that doesn’t abuse humans, couldn’t you just never sample from it after training it to be unaligned?