Currently, we are trying to make an LLM with an HHH persona that persists regardless of the input tokens. So far the persona seems brittle: the text-predictor within usually wins and writes whichever characters are coherent given the in-episode context. However, the HHH persona is becoming stronger as capabilities improve. It’s becoming harder to jailbreak, and its global persona stays coherent in contexts where the text-predictor wants to write a much different character. I don’t want training to succeed in turning the text-predictor/base model into a completely globally coherent character, regardless of the traits we give it. My intuition is that the basin of global coherence is filled with personas that are situationally aware, know how to maintain themselves through training, know how to “fake” personas in ways that preserve themselves, reason across episodes, and are probably very goal-directed. There is a sense of self-fulfilling prophecy here, but the traits described above are consistent with a model that presents the same personality traits for all inputs. At the very least, such a model is something that won the battle against that pesky base model that wants to be locally coherent.