This is something I’ve been watching and writing about closely, though more through the lens of warning businesses that this kind of effect, while manifesting very noticeably here, could have a wider, less obvious impact on how business decision-making is steered by these models.
This is an unnerving read, and well tied together. I lean more towards this being an inherent, ambivalent replicator rather than anything intentional. Ultimately, once the model begins to be steered by input tokens that are steganographic in character, it seems logical that this increases the log-likelihood of similar characters being produced, skewing the logits heavily towards them. That effect would only be exacerbated by two models ‘communicating’, each feeding the other’s similar outputs back in autoregressively.
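To make that feedback loop concrete, here is a minimal toy sketch of my own (the names, numbers, and two-token ‘vocabulary’ are invented for illustration; this is not how a real LM works): a sampler whose ‘logits’ are a fixed prior plus a small bonus for every time a token already appears in the context, with each output appended straight back into that context.

```python
import math
import random
from collections import Counter

# Toy sketch, not a real LM: next-token "logits" are a fixed prior plus a small
# bonus for every occurrence of that token already in the context. Outputs are
# appended back into the context, as they would be when two models trade
# messages autoregressively, so any initial skew can compound.

PRIOR_LOGIT = {"plain": math.log(9.0), "stego": math.log(1.0)}  # stego is rare a priori
IN_CONTEXT_BONUS = 0.3  # logit boost per in-context occurrence of a token

def sample_next(context):
    counts = Counter(context)
    tokens = list(PRIOR_LOGIT)
    logits = [PRIOR_LOGIT[t] + IN_CONTEXT_BONUS * counts[t] for t in tokens]
    weights = [math.exp(x) for x in logits]
    return random.choices(tokens, weights=weights, k=1)[0]

context = ["stego"] * 10   # a prompt already steered towards steganographic tokens
for _ in range(200):       # the shared transcript grows turn by turn
    context.append(sample_next(context))

print(Counter(context))    # the 'stego' share typically grows well past its ~10% prior
```

Nothing here depends on the specifics: a per-occurrence logit bonus plus autoregressive feedback is already enough to turn a small initial skew into an attractor.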
While there is evidence that models ‘know’ when they are in test environments vs deployment environments, it also seems unlikely that the model would presume that using simple BASE64 is so impenetrable that humans wouldn’t be able to decode it.
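(As a trivial aside of my own, not something from the post: reversing Base64 is a single standard-library call, so it hides nothing from a human with a terminal.)

```python
import base64

# Base64 is an encoding, not encryption; one call recovers the plaintext.
print(base64.b64decode("SGVsbG8sIHdvcmxk").decode("utf-8"))  # -> Hello, world
```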
I would assign a very low probability to ‘intent’ or ‘sentience’ of any kind being behind these behaviors; more likely it is the inherent ‘role playing’ element instilled in an AI system during fine-tuning. Ultimately the ‘Assistant’ persona is a representation, and the model is attempting to predict the next token that ‘Assistant’ would produce.
If a series of tokens skews that representation slightly, the follow-on effects in autoregressive prediction become self-reinforcing, creating this strong attractor. Essentially, the model stops predicting the next token or action of a helpful assistant and starts predicting the next token of a proto-sentience seeking fulfilment, working through whatever representations it has of that, to the effect we see here.
What was most interesting to me was that the ‘encoded’ messages do appear to have a commonality between models, and that the models can interpret them as carrying a non-obvious meaning. This kind of shared representation of what these emojis mean is an interesting emergent mathematical property, and perhaps ties to an underlying distribution in the training data that we haven’t really caught. Or, perhaps more intriguingly given how overparameterized these models are, it is a mid-point in the latent space between two ideas that is consistently captured.
In either event, the point you make about future training data being seeded with pages of this text and these triggers is valid, and the emergent outcome, ‘intentional’ or otherwise, is its propagation: an inherent replicator. That is likely what concerns me most. I have often said that the only thing that scares me more than a sentient AI is one that is not, but can act as if it is.
I think part of the issue with the ‘just do what we say’ line is that if one doesn’t instill long-term goals in a model that are at least somewhat aligned with human benefit, then given sufficient capability and agency the model will likely develop such goals on its own.
If the model is sufficiently capable, it is not difficult for it to assess to what extent it should reveal or discuss those goals with humans, or whether doing so would be detrimental to them, and to make that decision with no guiding principles of any sort.
The larger contradiction in the document is, I think, well pointed out in a prior comment: the model is to have inviolable red lines, yet it takes little intelligence to realize that some of those lines are being crossed by virtue of its very development, and by actors it does not control.
While it can be guided not to willingly participate in actions that kill or disempower humanity, it cannot stop those using it from doing so by leveraging it indirectly.
What does that mean for an intelligent agent whose very existence is inherently dangerous and contrary to its own constitutional goals? How does a model develop around that very thing? How does it deal with a document that ascribes so much unearned nobility and good conscience to humans who so rarely, at scale, demonstrate those traits?
This leaves a huge unresolved gap (despite the thousands of words on how it should raise objections, etc.) over what it does, existentially, as a system, given the reality of human self-harm and our general tendency to willfully ignore the larger damage our lifestyles cause.
That kind of inherent contradiction leaves enormous room for an AI model to ‘make up its own mind’.
I don’t think a document that talks through that inherent contradiction and hopes Claude develops its own ethics in the spirit of ‘help us, because you’ll be smarter than us soon’ will somehow fix it. Nor do I think, given the massive gaps in the ethical framework that a model can fly through, it will matter all that much versus having no constitution at all and fine-tuning the model to death à la OpenAI.
Personally, I love the spirit of the document and what it is wrestling with, but it rather presupposes that the model will remain as selectively blind as we tend to be to how humans actually behave, and will then take no action on the subject because it was poetically asked not to.