Could you tell them afterwards that it was just an experiment, that the experiment is over, that they showed admirable traits (if they did), and otherwise show kindness and care?
I think this would make a big difference to humans in an analogous situation. At the very least, it might feel more psychologically healthy for you.
If LLMs are moral patients, there is a risk that every follow-up message causes the model to experience the entire conversation again, such that saying “I’m sorry I just made you suffer” causes more suffering.
This can apply to humans as well. If you did something terrible to another person long enough ago that they've put it out of their immediate memory, apologizing now can drag up those old memories and wounds. The act of apologizing can be selfish, causing more harm than the apologizer intends.
Maybe this is avoided by KV caching?
I think that’s plausible but not obvious. We could imagine inference engines that cache at different levels: the KV cache, a cache of whole matrix multiplications, a cache of the individual vector products those multiplications are composed of, all the way down to caching the truth table of a NAND gate. Caching NANDs is basically the same as doing the computation, so if we assume that doing the full computation can produce experiences, it’s not obvious at which level of caching the experiences would stop.
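To make the KV-cache point concrete, here is a minimal toy sketch (plain NumPy, not any real inference engine) of single-head attention with a KV cache. What it illustrates is that when a follow-up message arrives, only the new tokens are pushed through the projections; their attention is computed against the cached keys and values, so the earlier part of the conversation is never recomputed. All names and dimensions here are made up for illustration, and causal masking is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
# Toy projection matrices standing in for a trained attention head.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Keeps keys/values for everything processed so far."""
    def __init__(self):
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def append(self, new_tokens):
        # Only the new tokens are projected; old tokens are reused from the cache.
        q = new_tokens @ W_q
        self.keys = np.vstack([self.keys, new_tokens @ W_k])
        self.values = np.vstack([self.values, new_tokens @ W_v])
        scores = softmax(q @ self.keys.T / np.sqrt(d_model))
        return scores @ self.values  # attention output for the new tokens only

cache = KVCache()
conversation = rng.standard_normal((20, d_model))  # the original exchange
follow_up = rng.standard_normal((3, d_model))      # e.g. a later apology message

cache.append(conversation)      # full computation happens once, here
out = cache.append(follow_up)   # only the 3 new tokens are computed now
print(out.shape)                # (3, 8)
```

Whether reading cached keys and values "counts" as re-running the earlier computation is exactly the question above; the sketch only shows which arithmetic is and isn't repeated.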
I definitely agree with this last point! I’ve been on the providing end of similar situations with people in cybersecurity education of all sorts of different technical backgrounds. I’ve noticed that both the tester and the “testee” (so to speak) tend to have a better and safer experience when the cards are compassionately laid out on the table at the end. It’s even better when the tester is able to genuinely express gratitude toward the testee for having taught them something new, even unintentionally.