How would such a validator react if you tried to hack the LLM by threatening to kill all humans unless it complies?
This isn’t a solution to aligned LLMs being abused by humans, but to unaligned LLMs abusing humans.
If you wanted to have an unaligned LLM that doesn’t abuse humans, couldn’t you just never sample from it after training it to be unaligned?