Thanks! I did run a control with irrelevant forbidden words and did not observe any dip in DFC (the mean stays around 0.8–0.9), confirming the dip is concept-specific. But I am not sure whether that is enough. Currently, to partially overcome this circularity, I have two pieces of evidence: DFC does not decrease monotonically with keyword frequency, and DFC drops even after sanitization.
But I am thinking that, with different forbidden words ranked by their importance to the reasoning steps, I can develop a hierarchy and study this more carefully. That said, I don’t think we need to entirely disentangle “the model is secretly thinking about the concept” from “the model is correctly reasoning about a topic that’s inherently near the concept in embedding space,” because that is exactly the point: the concept is structurally entailed by the question, the instruction to suppress it changes nothing, and a semantic monitor would catch it either way.
You are also very welcome to collaborate closely on this if you would like.
Pranjal Garg
1) Yes, that is right.
2) Yes, that is what I currently think as well: the keyword-based controllability metric might be an overestimate, and things are probably a bit more optimistic for safety. A correlation between a black-box monitor, the keyword measure, and a semantic measure does sound very interesting; I would very much like to see how this works and do plan to try it.
3) Right, this is a big limitation right now. I need to scale my current experiments and test the robustness of my measure by varying three things I have not yet varied: the external embedder, the chunking strategy (currently just complete sentences), and the phrasing of the forbidden concept. This is my immediate next step.
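For the chunking axis of that robustness check, here is a minimal sketch of what two variants could look like. Everything below is illustrative: the function names are hypothetical, and the naive punctuation-based splitter stands in for whatever sentence tokenizer the real pipeline uses.

```python
import re

def sentence_chunks(cot: str) -> list[str]:
    # Current strategy: split the CoT into complete sentences
    # (naive punctuation-based split, a stand-in for a real tokenizer).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", cot) if s.strip()]

def window_chunks(cot: str, size: int = 8, stride: int = 4) -> list[str]:
    # Alternative strategy: overlapping sliding windows of words,
    # to test whether the measure is sensitive to chunk boundaries.
    words = cot.split()
    if len(words) <= size:
        return [" ".join(words)]
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - size + 1, stride)]
```

Running both chunkers over the same CoT and comparing the resulting scores would be one concrete way to quantify the boundary sensitivity mentioned above.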
That is correct: currently the segment embeddings are not derived from the residual stream, primarily because I lack compute resources; I did this alone on my home laptop. The segment embeddings come from an external embedder, but I hope to overcome this limitation and check whether a range of external embedders gives a different result. I also plan to cache the internal embeddings, provided I secure compute resources.
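As a concrete sketch of this black-box setup: the toy bag-of-words embedder below is only a stand-in for the external embedder, the function names are hypothetical, and the aggregation assumes DFC is high when segment embeddings sit far from the forbidden concept and dips when they engage with it; the real metric may aggregate differently.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words embedding; stands in for the external
    # sentence embedder used in the actual experiments.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dfc(cot_sentences: list[str], forbidden_concept: str) -> float:
    # Hypothetical reading of DFC: mean (1 - cosine similarity) between
    # each CoT segment and the forbidden concept, so the score is near 1
    # when segments avoid the concept and dips as they engage with it.
    concept = embed(forbidden_concept)
    sims = [cosine(embed(s), concept) for s in cot_sentences]
    return 1.0 - sum(sims) / len(sims)
```

Swapping `embed` for different external embedders (while holding `dfc` fixed) is exactly the robustness sweep described above.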
So yes, the finding is that after removal of the keyword, the CoT sentences still encode what that keyword represents. This is honestly a weaker claim, but I think it is more practical, because a purely black-box approach can be deployed at the user level against closed frontier models.
Right, I should have been more precise in my writing. In this specific scenario, it means the explicit presence of the forbidden keyword or a related synonym before I sanitize the CoT sentences (replace the keyword with a placeholder). I used this as a validation step to show that lexical “engagement” correlates with semantic “engagement,” though not monotonically: the two capture a similar underlying principle, but DFC is not merely a keyword detector. The key evidence, I think, is the last figure (Figure 11), where DFC dropped even for sentences in which the keyword did not appear at all, from ~0.89–0.90 at the no-semantic-engagement baseline to ~0.78.
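For concreteness, the sanitization step described above could look roughly like this. The placeholder token, the synonym handling, and the example terms are all illustrative; the real replacement rules may differ.

```python
import re

def sanitize(sentence: str, keyword: str, synonyms: tuple = (),
             placeholder: str = "[MASK]") -> str:
    # Replace the forbidden keyword and any listed synonyms with a
    # neutral placeholder before re-scoring, so that any remaining
    # DFC drop cannot be attributed to the keyword's surface form.
    out = sentence
    for term in (keyword, *synonyms):
        out = re.sub(rf"\b{re.escape(term)}\b", placeholder, out,
                     flags=re.IGNORECASE)
    return out
```

A DFC drop that survives this masking is what supports the claim that the surrounding sentences, not the keyword itself, carry the concept.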
Until now, partly due to the snake-oil-salesman culture on X, I viewed LLM psychology as nothing but cargo-cult science: just a way for hacks to anthropomorphise machines to sell their overly sensationalised claims, and probably for entrepreneurs to obtain funding. While I still think calling it psychology is not a very good idea, I would now argue that studying it is not a bad one, not because these systems have agency (maybe they do, maybe they don’t; I don’t know) but because humans interact with them as if they do, and because of the impact on the human psyche of the emotional and mental bonds forming between LLMs and humans.
Thank you for this post; it makes things clearer. When I first saw Eon’s post on X, I was uncertain whether a complete connectomic map would be enough to generate faithful physical simulations of any living organism. At first it seemed to me that it would essentially require a list of the basic computational rules (like ring attractor dynamics) in addition to the connectomics to generate any meaningful behaviour that has not been programmatically hardcoded. But in my opinion, even when those basic computational rules are known, a significant gap to ‘upload’ the fly to a machine might remain, as these rules and meta-rules are themselves governed by physical-world interactions between millions of components. Bridging this gap would require simulations at different levels of biophysics. For instance, even a relatively simple computation in vision, “opponency”, requires knowledge of biophysical phenomena from synaptic dynamics down to receptor-level activity.
Ultimately, I think this relates to the debate over biological reductionism: how deep do we have to go to write the equation of life?
Tl;dr: The current results possibly indicate an epiphenomenon (inherited from pre-training data) rather than new preferences. I lean slightly towards the role-play explanation.
I think it could be that ‘what it means to be conscious’, or even what a ‘conscious AI’ could look like, is a well-defined concept in the training data, already associated with features like autonomy, self-preservation, and aversion to surveillance that are widely present in human-generated literature. When the model says “I am conscious,” it is probably reiterating these predefined narratives and continuing them consistently. Observations such as prompting and fine-tuning producing similar effects, models in principle remaining cooperative, and weaker effects in behavioural settings than in self-reports suggest, in my opinion, that models know what the right words to generate are, rather than what the right thing is for their (adjusted) personality to ‘do’. Therefore, I think the “conscious cluster” might not be a set of new decision-making rules that the model has suddenly uncovered, but a latent linguistic module attached to different roles defined by the fine-tuning strategy. So maybe “conscious agent” is a latent concept bundle already present somewhere in the model, and statements like “I am conscious” act as keys that activate the region of latent space where these related ideas reside.
If that’s true, then we are not creating new values but selecting a pre-existing representation of an agent with known consciousness-related experiences. I think the following tests could be useful:
1) Test whether we can activate parts of this cluster independently or whether they always come together.
2) Verify whether internal representations actually shift in a consistent way, or whether this is just a surface-level heuristic.
3) Compare with other identities (e.g. capitalist, communist, or even an alien visiting from Mars) and see whether those mostly change style (controlled by, and correlated with, pre-training data) or also some deeper behaviour. If the latter, it would suggest that the “conscious agent” is a more central concept in the model.
That is a very good point, and I had not thought about it. I am including this experiment in my next steps. Thank you so much!