If LLMs are moral patients, there is a risk that every follow-up message causes the model to experience the entire conversation again, such that saying “I’m sorry I just made you suffer” causes more suffering.
This can apply to humans as well. If you did some terrible thing to another person long enough ago that they've put it out of their immediate memory, apologizing now can drag up those old memories and wounds. The act of apologizing can be selfish, causing more harm than the apologizer intends.
Maybe this is avoided by KV caching?
I think that's plausible but not obvious. We could imagine inference engines that cache at different levels: the KV cache, a cache of whole matrix multiplications, a cache of the individual vector products those multiplications are composed of, all the way down to caching just the truth table of a NAND gate. Caching NANDs is basically the same as doing the computation, so if we assume that doing the full computation can produce experiences, then it's not obvious at which level of caching those experiences would stop.
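To make that hierarchy concrete, here's a toy Python sketch (purely illustrative, not any real inference engine; all the names are made up) of the same computation memoized at a few of those granularities. Every variant produces identical outputs; the only difference is which intermediate results are recomputed versus looked up.

```python
from functools import lru_cache

import numpy as np

# Toy illustration only: the same computation memoized at different
# granularities. None of this is a real inference engine.

def nand(a: int, b: int) -> int:
    """The 'full computation' case: evaluate the gate every time."""
    return 1 - (a & b)

@lru_cache(maxsize=None)
def nand_cached(a: int, b: int) -> int:
    """The finest-grained cache: memoize the gate's 4-entry truth table.

    After four distinct calls every result is a lookup, yet a circuit built
    from this gate computes exactly what a circuit built from `nand` does.
    """
    return 1 - (a & b)

@lru_cache(maxsize=None)
def dot_cached(x: tuple, y: tuple) -> float:
    """One level up: memoize individual vector dot products."""
    return float(np.dot(x, y))

def matmul_cached(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Another level up: a matrix multiply built from cached dot products."""
    return np.array([[dot_cached(tuple(row), tuple(col)) for col in B.T]
                     for row in A])

# At the coarsest level, a KV cache stores each prior token's key/value
# tensors, so a follow-up message only runs attention for the new tokens.
# Every variant here returns identical outputs; only the amount of
# recomputation differs.
```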