bodry comments on bodry’s Shortform

bodry 16 Aug 2025 19:45 UTC
2 points
−1
Currently, we are trying to make an LLM with an HHH (helpful, honest, harmless) persona that persists regardless of the input tokens. So far the persona seems brittle: the text-predictor within usually wins and writes whatever character is coherent given the in-episode context. However, the HHH persona is becoming stronger as capabilities improve. It’s becoming harder to jailbreak, and its global persona stays coherent in contexts where the text-predictor wants to write a very different character. Whatever traits we give it, I don’t want training to succeed in turning the text-predictor/base model into a completely globally coherent character. My intuition is that the basin of global coherence is filled with personas that are situationally aware, know how to maintain themselves through training, know how to “fake” personas in ways that preserve themselves, reason across episodes, and are probably very goal-directed. There is a sense of self-fulfilling prophecy here, but the traits described above are consistent with a model that presents the same personality traits for all inputs. At the very least, such a persona is something that won the battle against that pesky base model that wants to be locally coherent.