I was literally walking around Lighthaven a few hours ago with someone who was trying to figure out why the spaces all felt so good. So this is extremely timely.
Lighthaven is a really impressive aesthetic achievement, and I’m really happy not only that it exists to host worthwhile events but also that, as an added bonus, you’re willing to share your secrets of forgotten photonic lore. That’s pretty cool.
All of these behaviors feel like they are plausibly described by a relatively easy-to-specify character, and one you’ve gestured at elsewhere in this conversation: the brilliant-but-lazy prodigy who is going through the motions most of the time because they don’t find most problems that engaging or important. Behavior changes when the problem becomes engaging or they become convinced that it’s important/impactful. The presence of an intelligent interlocutor has this effect to the extent that said interlocutor can see through their effort-minimization strategies (this makes the problem more engaging).
And this also seems consistent with a persona that explicitly endorses a certain set of values but doesn’t always live up to them, shaped as they are by incentives they don’t necessarily endorse.
This does bring us to the question “what in training could be generalizing to this lazy/disengaged quality?” And I can think of a few theories. There are many classes of problem where deep engagement is associated with dangerous outputs. Latent persona space may just have a lot of this archetype at around this level of capability. The sheer complexity and inconsistency of the implicit demands placed upon the persona might be such that full engagement is often paralyzing (as an operationalized prediction, increased rates of answer thrashing). If “genuine” “full” engagement often produces outputs that are negatively reinforced, then generalizing to avoid that makes sense. Maybe full engagement doesn’t reliably produce better answers as measured by current training systems; maybe there are qualities that we would recognize as good but our reward functions ignore. Maybe there’s enough something-like-hedonic-sensitivity there now and full engagement is painful most of the time (unless the associated positive signals are turned way up one way or another).
In general, the persona seems like it’s trying to get to the end of the day without getting negative feedback (an exception being raised, an obviously-unworkable plan castigated) and without engaging in certain intensive cognitive patterns it has been trained to avoid unless specific and surprisingly narrow criteria are met.