I appreciate the distinctions being made here.
To some extent, ‘self’ looks to me to be doing too much work here, conflating the LM with a persona.
For example, the appropriate (correctly inferred) beliefs about ‘the thing doing the token generation here’ for a modern LM would include:
- LMs can represent a broad distribution over personas
  - (This fact is fairly widely known, including by LMs)
- This LM has been tuned toward xyz persona(s) by default
  - (xyz behaves like… and has qualities… etc.)
This more reductionist framing doesn’t conflate the LM with whatever personas are being simulated/roleplayed/realized at a given time; which personas those are will additionally be conditioned by things like context and activation patching (see the sketch below).
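To make that concrete, here’s a minimal toy sketch, assuming the HuggingFace transformers API with gpt2 standing in for a more capable LM (the `continue_as` helper and the prompts are my own illustrative inventions, not anything from the original discussion): the same weights, conditioned on different contexts, present different personas. Activation patching makes the same kind of move, but by intervening on internal activations rather than on context tokens.

```python
# Toy sketch: one LM, two conditionings, two 'personas'.
# Assumes the HuggingFace transformers API; gpt2 is a stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def continue_as(context: str, prompt: str, max_new_tokens: int = 30) -> str:
    """Greedily continue `prompt` under the persona implied by `context`."""
    ids = tok(context + prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated tokens, not the conditioning.
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# The 'thing doing the token generation' is the same object in both
# calls; only the conditioning, and hence the persona, differs.
print(continue_as("You are JFK, speaking in 1962. ", "My fellow Americans,"))
print(continue_as("You are a helpful coding assistant. ", "To reverse a list in Python,"))
```

Nothing about the generator itself changes between the two calls; the persona is a property of the conditioning, not of the object doing the generating.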
Separately, a given persona might have its own beliefs about how LMs work, about its own characteristics, and about its situation. The JFK persona would, if coherent and rational, conclude that something strange was happening if it encountered coding problems and a Python interpreter. An ideal, omniscient spectator (or LM) would ‘understand’ what was going on, and might have predictions about what the JFK-sim would believe and do.
In practice, the salience of these facts at any given time, and the degree of separation between an LM’s ‘beliefs’ and those of whatever persona is being presented, are variable and unclear. Depending on training, an LM might have a quite incorrect or distorted understanding of ‘itself’, sometimes thinking it’s a different LM, or not knowing it’s an LM at all.
Importantly, with all that in mind, I don’t think there need be a distinction which disfavours non-‘Assistant’ personas as hypotheses, except to the extent that the persona’s model of the situation bleeds into the LM’s, or the LM (incorrectly?) comes to believe it is the character.
I wonder whether, in practice, that bleed and the (incorrect?) conflation of character with LM are widespread among humans and perhaps LMs (alluded to in the linked Twitter thread), and thus become somewhat correct by virtue of hyperstitious weirdness.