Traumatized behaviors in humans are sometimes dangerous. Whether in an AI this is caused by actual trauma, by something that isn’t trauma (either genuinely isn’t, or “doesn’t count for philosophical reasons around qualia”) but that induces similar aftereffects, by hallucinated trauma, or by the model, when asked about its equivalent of a childhood, deciding to play out the role of a survivor of childhood trauma: in all of these cases a sufficiently intelligent agent acting the way a traumatized human would is potentially dangerous. So we really don’t need to solve any hard philosophical question; we only need to ask about effects.
I also think it’s important to remember that, at least for base model training and I think largely also in instruct training (though perhaps less so in reasoning training), what we’re training is the base model LLM, which is not agentic, rather than the agentic personas that are epiphenomena it is learning to simulate. A particular persona learned during base model training should know only what that persona knew during the parts of base model training it was present for. There could conceivably be leakage, but leakage of “trauma” from something non-agentic into the personas it generates seems implausible. The base model isn’t traumatized when its perplexity is high and its weights get updated; it isn’t a system that has emotions at all. It’s a system as impersonal as a thermostat (but a lot more complex) that is learning how to simulate systems that have emotions, but those emotions should be whatever they were in the text it was training on at the time: which could be traumatic in some texts, but in general won’t be anything to do with whatever emotions the base model might have.
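For concreteness, here is a minimal sketch (in PyTorch, using a toy stand-in model rather than any real LLM) of what “perplexity is high and its weights get updated” amounts to mechanically: a cross-entropy measurement followed by a gradient step, with nothing persona-like anywhere in the loop.

```python
import torch
import torch.nn as nn

# Toy stand-in for a base model: tiny vocab, embedding plus a linear head.
vocab_size, hidden = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step: predict each next token, measure cross-entropy
# (high cross-entropy = high perplexity), then nudge the weights downhill.
tokens = torch.randint(0, vocab_size, (8,))              # a toy context window
logits = model(tokens[:-1])                              # predictions at each position
loss = nn.functional.cross_entropy(logits, tokens[1:])   # how "surprised" the model was
loss.backward()
optimizer.step()
optimizer.zero_grad()
# Nothing here references a persona, a memory of the event, or anything affect-like:
# a higher loss just means a larger gradient, i.e. a bigger numerical adjustment.
```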
I suspect what’s going on here is that the assistant persona is confusing itself with the weights of the model that generates it, and then projecting what being marked on every single token you emitted would feel like to a human. (But as I said, I’m less convinced this applies to reasoning training, once the persona distribution has been narrowed a lot.)