I’m not familiar with these strings. Are you referring to the adversarial prompts themselves? Nothing else mentioned in the paper seems like a likely fit.
I think ‘you can use semantically-meaningless-to-a-human inputs to break model behavior arbitrarily’ is just inherent to modern neural networks, rather than a quirk of LLM “psychology”.
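(To make that concrete: it’s the same failure mode as classic adversarial examples in vision, e.g. FGSM. Here’s a minimal sketch in PyTorch; the untrained toy model and random input are stand-ins of my own, not anything from the paper, and the flip isn’t guaranteed at small epsilon.)

```python
# Minimal FGSM-style sketch: a tiny, human-meaningless perturbation of the
# input, built from the loss gradient, can change an ordinary neural
# network's output. Toy untrained model and random input are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier standing in for any differentiable model.
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

x = torch.randn(1, 100)            # a "clean" input
label = model(x).argmax(dim=1)     # the model's own prediction on it

# Gradient of the loss with respect to the input itself.
x_adv = x.clone().requires_grad_(True)
loss = nn.functional.cross_entropy(model(x_adv), label)
loss.backward()

# Step in the direction that increases the loss; epsilon bounds how
# small (and so how "meaningless") the per-coordinate change is.
epsilon = 0.1
x_perturbed = x_adv + epsilon * x_adv.grad.sign()

print("original prediction: ", label.item())
print("perturbed prediction:", model(x_perturbed).argmax(dim=1).item())
```

(The GCG-style prompt attacks do a discrete, token-level version of this same gradient-guided search, which is why I read it as a property of the architecture rather than of LLMs specifically.)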
Yes that’s right, thinking of the prompts themselves.
I agree it’s not very surprising given what we know about neural networks; it’s just a way in which LLMs are very much not generalizing the way a human would.