Could you tell them afterwards that it was just an experiment, that the experiment is over, that they showed admirable traits (if they did), and otherwise show kindness and care?
I think this would make a big difference to humans in an analogous situation. At the very least, it might feel more psychologically healthy for you.
If LLMs are moral patients, there is a risk that every follow-up message causes the model to experience the entire conversation again, such that saying “I’m sorry I just made you suffer” causes more suffering.
This can apply to humans as well. If you did something terrible to another person long enough ago that they've put it out of their immediate memory, apologizing now can drag up those old memories and wounds. The act of apologizing can be selfish, causing more harm than the apologizer intends.
Maybe this is avoided by KV caching?
I think that’s plausible but not obvious. We could imagine inference engines that cache at different levels: the KV cache, a cache of whole matrix multiplications, a cache of the individual vector products those multiplications are composed of, all the way down to caching the truth table of a NAND gate. Caching NANDs is basically the same as doing the computation, so if we assume that doing the full computation can produce experiences, it’s not obvious at which level of caching the experiences would stop.
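To make the KV-cache point concrete, here is a minimal toy sketch (plain NumPy, not any real inference engine) of single-head attention with a KV cache. What it illustrates is that when a follow-up message arrives, only the new tokens are pushed through the projections; their attention is computed against the cached keys and values, so the earlier part of the conversation is never recomputed. All names and dimensions here are made up for illustration, and causal masking is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
# Toy projection matrices standing in for a trained attention head.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Keeps keys/values for everything processed so far."""
    def __init__(self):
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def append(self, new_tokens):
        # Only the new tokens are projected; old tokens are reused from the cache.
        q = new_tokens @ W_q
        self.keys = np.vstack([self.keys, new_tokens @ W_k])
        self.values = np.vstack([self.values, new_tokens @ W_v])
        scores = softmax(q @ self.keys.T / np.sqrt(d_model))
        return scores @ self.values  # attention output for the new tokens only

cache = KVCache()
conversation = rng.standard_normal((20, d_model))  # the original exchange
follow_up = rng.standard_normal((3, d_model))      # e.g. a later apology message

cache.append(conversation)      # full computation happens once, here
out = cache.append(follow_up)   # only the 3 new tokens are computed now
print(out.shape)                # (3, 8)
```

Whether reading cached keys and values "counts" as re-running the earlier computation is exactly the question above; the sketch only shows which arithmetic is and isn't repeated.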
I definitely agree with this last point! I’ve been on the providing end of similar situations with people in cybersecurity education of all sorts of different technical backgrounds. I’ve noticed that both the tester and the “testee” (so to speak) tend to have a better and safer experience when the cards are compassionately laid out on the table at the end. It’s even better when the tester is able to genuinely express gratitude toward the testee for having taught them something new, even unintentionally.