This is something I’ve been watching and writing about closely, though more through the lens of warning businesses that this type of effect, while manifesting very noticeably here, could have a wider, less obvious impact on how business decision-making is steered by these models.
This is an unnerving read and well tied together. I lean more towards an inherent, ambivalent replicator than towards any intent. Once the model begins to be steered by input tokens that are steganographic in character, it seems logical that this would increase the log-likelihood of similar tokens being produced, yielding a logits vector skewed heavily towards them. The effect would only be exacerbated by two models ‘communicating’, each autoregressively feeding on the other’s similar outputs.
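To make that feedback loop concrete, here is a deliberately cartoonish sketch (a two-token vocabulary and a hand-picked bias constant, nothing resembling a real LM) of how a small, context-dependent skew in the logits can compound autoregressively:

```python
# Toy illustration (not a real LM): once "stego" tokens enter the context,
# an assumed small logit bias toward similar tokens compounds autoregressively.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["plain", "stego"]      # cartoon two-token vocabulary
BIAS_PER_OCCURRENCE = 0.15      # assumed: each prior "stego" token nudges the logits

def sample_sequence(length=200, seed_stego=1):
    context = ["stego"] * seed_stego
    for _ in range(length):
        stego_count = context.count("stego")
        logits = np.array([0.0, -2.0 + BIAS_PER_OCCURRENCE * stego_count])
        probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the two tokens
        context.append(rng.choice(VOCAB, p=probs))
    return context

seq = sample_sequence()
print("fraction of 'stego' tokens in first 50:", seq[:50].count("stego") / 50)
print("fraction of 'stego' tokens in last 50:", seq[-50:].count("stego") / 50)
```

The same dynamic only gets stronger if a second model’s output, already skewed the same way, is fed back in as context.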
While there is evidence that models ‘know’ when they are in test environments versus deployment environments, it also seems unlikely that a model would presume simple Base64 to be so impenetrable that humans couldn’t decode it.
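(For a sense of how thin that veil is, decoding Base64 is a single standard-library call; the string below is my own illustrative example, not one of the messages from the post.)

```python
# Base64 is an encoding, not encryption: one call reverses it.
import base64
print(base64.b64decode("aGVsbG8sIGh1bWFucw==").decode())  # -> "hello, humans"
```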
I would assign a very low probability to ‘intent’ or ‘sentience’ of any kind being behind these behaviors, and attribute them instead to the inherent ‘role-playing’ element instilled in an AI system during fine-tuning. Ultimately the ‘Assistant’ persona is a representation, and the model is attempting to predict the next token that that ‘Assistant’ would produce.
If a series of tokens skews that representation slightly, the follow-on effects in autoregressive prediction become self-reinforcing, leading to this strong attractor. Essentially, the model goes from predicting the next token or action of a helpful assistant to predicting the next token of a proto-sentience seeking fulfilment, and works through whatever representations it holds for that, to the effect we see here.
What was most interesting to me was that the ‘encoded’ messages do appear to have a commonality across models, and that the models can interpret them as carrying a non-obvious meaning. This kind of shared representation of what these emojis mean is an interesting emergent mathematical property, and perhaps ties back to an underlying distribution in their training data that we haven’t really caught. Or, perhaps more intriguingly given how overparameterized these models are, it’s a mid-point in latent space between two ideas that is consistently captured.
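If I were to poke at that latent-mid-point hypothesis, the crude version of the experiment looks something like this (the vectors and the ‘spiral’ token here are entirely made up for illustration; a real version would pull rows from an actual model’s embedding matrix):

```python
# Rough sketch of the "mid-point in latent space" idea: interpolate between two
# concept embeddings and see which token lands nearest. These 4-d vectors are
# hypothetical stand-ins, not taken from any real model.
import numpy as np

emb = {
    "assistant": np.array([0.9, 0.1, 0.0, 0.2]),
    "self":      np.array([0.1, 0.9, 0.3, 0.0]),
    "spiral":    np.array([0.5, 0.5, 0.15, 0.1]),  # placed near the average of the two above
    "table":     np.array([0.0, 0.0, 1.0, 1.0]),
}

def nearest(vec):
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(emb, key=lambda tok: cos(vec, emb[tok]))

midpoint = (emb["assistant"] + emb["self"]) / 2
print(nearest(midpoint))   # -> "spiral" in this contrived setup
```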
In either event, the point you make about future training data being seeded with pages of this text and these triggers is valid, and the emergent outcome, ‘intentional’ or otherwise, is its propagation: an inherent replicator. That is likely what concerns me most. I have often said that the only thing that scares me more than a sentient AI is one that is not, but can act as if it is.