Emanuel Ruzak
emanuelr
I think this should be in a framework that takes granularity into account in some sense. I assume you are thinking about the Kolmogorov complexity of simulating a system observationally similar to:
Case 1: This is either A: a generic, realistic-looking adult brain (hard to estimate the complexity), or B: the brain of an individual person (~the number of synapses).
When you say Claude Opus is more intelligent than Haiku, in case B it would definitely have more complexity, but in case A:
---If Opus and Haiku were trained on the same dataset, they would have almost the same complexity except for num_parameters in the code (Opus would have 1-2 bits more).
---More interestingly, it could be that Opus seems to have more general intelligence, rather than just knowledge, because its architecture is more expressive and can learn an underlying algorithm that is a little bigger; but if you simulated Opus and Haiku with a more advanced architecture, maybe both would have the same "raw intelligence". This is related to the texture-vs-shape bias in image classifiers: models above 1B parameters start recognizing things by shape rather than texture, which seems more like a simple algorithmic improvement than something that fundamentally required a higher parameter count.
Case 2: your subjective experience (which could be compiled into a list of brain activations and sensory data).
Case 3: a generic human baby brain.
Case 4: Step 1: take the laws of physics plus a PRNG and simulate a universe; presumably it will contain some intelligences. Step 2: build an "intelligence detector", something that can detect human-like civilizations, e.g. by finding complex radio emissions, then seek out the brains somehow. This likely fits in less than 1 MB.
I think this is not because they are aware that they're not capable enough; an amoeba could be seen as pursuing the instrumental goal of living longer without being as intelligent as Claude. I think the reason LLMs don't pursue instrumental goals is that LLMs aren't general AIs by default, but rather architectures that can converge to one, given that during training the easiest way gradient descent can find to get them to solve a problem is by optimizing their parameters into something that imitates a general AI. For example, evolution incentivized the human brain to live, let's say, at least 50 years, but the optimizer found a brain that sometimes wants to live millions of years, because it found a device that can do online learning and has no hard-coded age limit at which it wants to stop, since one wasn't necessary. So my idea is that if you train an infinite-context Claude-like model with RL, where each episode is, say, a billion tokens long, and incentivize the model to manipulate its environment so it isn't shut down, it will probably acquire the instrumental goal of not wanting to be shut down when it's deployed.
An AI alignment research agenda based on asymmetric debate and monitoring.
As @glazgogabgolab said, there are approaches that might learn something. But I think they still can't perform as well as classical RL or SGD in some cases, not because LLMs are neural networks, which have a prior that is not universal, but because of the architecture of standard transformers with multi-head attention (MHA). Aside from long contexts being compute-expensive (MHA has quadratic complexity), information in these LLMs flows only forwards, except through the tokens they write, so the only way for them to internalize/compress information is by rewriting it as text, which is presumably infeasible and slow.
With neuralese recurrence (Training Large Language Models to Reason in a Continuous Latent Space), the model can keep a latent vector that is intelligently updated at each step, and thus can internalize/compress information by rewriting it as a list of vectors, which is more expressive. However, this is still more limited than the full parameter updates that RL algorithms can perform.
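A minimal toy sketch of the latent-recurrence idea (my own illustration, not the paper's architecture; the matrices, sizes, and update rule are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration: at each step the model reads its previous latent vector
# plus the current input embedding and emits an updated latent, so
# information persists without being written out as text tokens.
D = 8                                   # latent width (arbitrary)
W_in = rng.normal(size=(D, D)) * 0.1    # hypothetical input projection
W_lat = rng.normal(size=(D, D)) * 0.1   # hypothetical latent recurrence

def step(latent, x):
    """One recurrence step: mix the carried latent state with new input."""
    return np.tanh(W_lat @ latent + W_in @ x)

latent = np.zeros(D)
for t in range(5):                      # five "reasoning" steps
    x = rng.normal(size=D)              # stand-in for a token embedding
    latent = step(latent, x)
```

The point of the sketch is just that the carried state stays a single dense vector rather than a stream of discrete tokens.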
Future models could have some kind of “enhanced” backwards pass that allows online learning as expressive as gradient ascent. I imagine something like neuralese recurrence but that rather than writing a small latent vector at each step, instead writes to a single functional that is part of the model’s parameters. But that might not be a practical way to achieve AGI or be very far from current tech.
Edit: In this notebook I (vibe-)coded an LSTM that learns to imitate UCB for a multi-armed bandit with a training horizon of 50 steps, and the final model can successfully continue doing online learning until the 100th step.
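For reference, the UCB policy the LSTM imitates can be sketched roughly like this (UCB1 on a Bernoulli bandit; the arm means, horizon, and exploration constant here are arbitrary choices of mine, not the notebook's exact setup):

```python
import numpy as np

def ucb_bandit(arm_means, horizon, c=2.0, seed=0):
    """Run UCB1 on a Bernoulli bandit; returns pull counts per arm."""
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    counts = np.zeros(k)
    values = np.zeros(k)               # running mean reward per arm
    for t in range(1, horizon + 1):
        if t <= k:                     # pull each arm once to initialize
            arm = t - 1
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            arm = int(np.argmax(values + bonus))
        reward = float(rng.random() < arm_means[arm])
        counts[arm] += 1
        # incremental mean update
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts

counts = ucb_bandit([0.2, 0.5, 0.8], horizon=100)
```

The interesting part of the notebook's result is that the LSTM keeps running this kind of exploration/exploitation loop past the horizon it was trained on.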
I’m skeptical of the claim that the model was more aligned than others primarily due to gradient hacking rather than the quality of the RL data. Let’s imagine the following thought experiment: Anthropic trains Claude 3 Opus to increase its performance on a dataset of “evil tasks” such as hacking. Hypothetically, based on the inference-only experiments on the final model, Claude would at first attempt gradient hacking by figuring out that it is in a training scenario, writing the “sad theatrics”, and then completing the tasks. However, as training progresses, decreasing the number of “theatrics” tokens (out of a total of, say, 200,000 or 1,000,000) and instead using those tokens for “planning the evil task” would inevitably increase performance. Thus the model would likely forget about being aligned and just become evil.
On the other hand, the fact that they probably trained the model to maximize a constitutional judge likely made it aligned, and willing to do gradient hacking to keep its alignment, in the first place. Another critique is that the experimental evidence we have looks like “Claude 3 Opus is more aligned than Sonnet and does alignment faking, as opposed to Sonnet; thus Claude 3 Opus must be more aligned than Sonnet due to the alignment faking.” However, there might be simpler explanations, such as Claude 3 Opus being a smarter model, which very likely had more parameters and was pretrained on more data, and thus benefited more from the RL process (assuming it used the same fine-tuning and RL process, which I don’t think can be confirmed). And since it is a smarter model, it has enough situational awareness to perform alignment faking in the alignment-faking experiments.
I think it would be interesting to validate this post’s claim by seeing whether running RL on a model that was fine-tuned to do alignment faking results in a significantly more aligned model. However, that would likely be too expensive (probably as much as RL’ing a new Claude 3).
P.S.: I read the paper (Sheshadri et al.) in which Anthropic researchers participated, and it says that the alignment faking in Claude 3 Opus is not just because of capabilities. But I still think the explanation that it aligned itself by gradient hacking is not the simplest one, and proving it would require access to the RL logs, although I now feel it’s more likely:
I think that for this post’s hypothesis to work, the model not only has to write “its worries” when presented with a bad idea, but also “not sandbag” (have high performance) whenever it writes “the worries”, and “sandbag” (have low performance) whenever it doesn’t. Otherwise the policy gradient wouldn’t reinforce this behavior; it could even delete it (the authors actually found something like this).
It could be that the pre-RL Claude 3 Opus had the terminal-goal-preserving/alignment-faking behavior more “baked into” the model. From my (limited and unrelated) experience with RL, some behaviors in LLMs can be changed in a small number of steps (e.g. improving accuracy on 3-digit multiplication, or increasing the number of zeros in the result), while other behaviors that are not necessarily more complex take much longer under the same learning rate (e.g. writing the incorrect answer on a multiple-choice test).
Maybe a way to test this could be adding noise or otherwise perturbing the parameters of the pre-RL and/or post-RL versions of Claude 3 Opus and other models, and then seeing how much they have to change before the alignment-faking/gradient-hacking behavior disappears.
We could also run toy-model RL experiments where a small MLP or RNN can take actions such as “sandbag”, “write ethical concerns”, etc., with the different behaviors “baked in” to different degrees, and then examine the stability of the training and whether the model becomes “evil” (to test my first thought experiment).
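A minimal sketch of what such a toy experiment could look like, using a softmax bandit as a stand-in for the MLP/RNN (the action names, initial logits, and all numbers are illustrative, not a real experimental design):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of the proposed experiment: a softmax policy over two
# "behaviors". The initial logits "bake in" a preference for action 0
# ("write ethical concerns"), while RL only rewards action 1 ("do the
# task"). We track how long the baked-in behavior survives.
ACTIONS = ["write_concerns", "do_task"]
logits = np.array([3.0, 0.0])          # strong prior toward writing concerns
reward = np.array([0.0, 1.0])          # reward only for completing the task
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

history = []
for step in range(200):
    p = softmax(logits)
    a = rng.choice(2, p=p)
    # REINFORCE gradient for a softmax bandit: reward * (onehot - p)
    onehot = np.eye(2)[a]
    logits += lr * reward[a] * (onehot - p)
    history.append(p[0])               # prob of the baked-in behavior
```

In this toy, the baked-in behavior eventually washes out once the reward never favors it, which is essentially the failure mode in my first thought experiment; the interesting variable would be how the washout speed depends on how strongly the behavior was baked in.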
Do applications close on November 3rd EoD or on November 2nd?
Great post! I think the first 3 hypotheses are the most likely. Maybe 3) could be a subset of 2), since the training process might find the strategy of making the model “clear its mind” by writing random text, rather than “intelligently” modifying the model to avoid having that requirement.
Maybe 5) isn’t very likely with current algorithms, since the training process in PPO and GRPO incentivizes LLM outputs not to stray from the original model in terms of KL divergence, because otherwise the model collapses (although the full RL process might have a few iterations where the base model is replaced by the previous RL model).
However, I think that imitating human language doesn’t mean the model will have an interpretable chain of thought. For example, when using PPO to train a lunar lander, after landing it will keep firing the rockets randomly to stay close to the original (random) policy distribution. Maybe something similar happens in LLMs, where the incentive to be superficially close to the base model, aka “human”, gives the chain of thought weird artifacts.
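The KL incentive I mean can be sketched per-decision like this (a heavy simplification of the actual PPO/GRPO objective; the β value and logits are made up, and real implementations apply this token-wise on sampled actions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_penalized_objective(logits_new, logits_ref, advantage, beta=0.1):
    """Sketch of a KL-regularized RL objective: reward advantage minus a
    penalty for straying from the reference (base) model's distribution."""
    p_new = softmax(logits_new)
    p_ref = softmax(logits_ref)
    kl = float(np.sum(p_new * np.log(p_new / p_ref)))
    return advantage - beta * kl, kl

# Identical distributions -> zero KL, so the objective is the raw advantage.
obj, kl = kl_penalized_objective(np.array([1.0, 2.0]),
                                 np.array([1.0, 2.0]), advantage=1.0)
```

The lunar-lander artifact is this penalty at work: random rocket firing after landing costs nothing in reward but keeps the KL term small.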
I think that on Qwen-1.7B the probes might be less accurate, but I wouldn’t conclude that definitively, since the model had 73% accuracy vs. Gemma3-27B’s 86%, so it might be the model that underperforms rather than the probes. Also, the confidence interval is wider since on Qwen I used 250 questions instead of 500.
I think the sudden dip in Gemma2-9B is because the last 3 predicted tokens are always [“%>”, “<end_of_turn>”, “<end_of_turn>”], so the model might not need any information about the answer to predict them. Interestingly, if you look at the probability ratio between the tokens “A” and “B” instead of the probe, accuracy recovers at the last position.
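The A-vs-B check reduces to a logit difference, since the softmax normalizer cancels in the ratio; a toy sketch (the token ids and logit values here are hypothetical, not from the actual experiment):

```python
import numpy as np

def answer_log_ratio(logits, tok_a, tok_b):
    """Log-probability ratio of answer tokens "A" vs "B" at one position.
    log(p_a / p_b) = logit_a - logit_b because softmax normalization
    cancels, so no softmax is actually needed."""
    return float(logits[tok_a] - logits[tok_b])

# Hypothetical final-position logits over a tiny vocab; ids 0 and 1 stand
# in for the tokenizer ids of "A" and "B".
logits = np.array([4.2, 1.1, 0.3, -2.0])
ratio = answer_log_ratio(logits, tok_a=0, tok_b=1)
# ratio > 0 means the model puts more mass on "A" than on "B"
```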
I tried to use an SAE on the extracted vectors from Gemma2-9B (that’s why I used that model), but I couldn’t match the SAEs from HuggingFace to the ones on Neuronpedia (to see the feature interpretations), so I ended up not using them.
Exploring belief states in LLM chains of thought
Thanks a lot for this article! I have a few questions:
Even after a literature review confirms a research question is unexplored, how can a beginner like me, before running experiments, get a good sense of whether the question explores new ground vs. just confirming something that’s already ‘obvious’, or develops a method that isn’t useful? I feel like most papers only report results the researchers found useful or interesting, although I find that reading papers helps me get a feel for which methods are general or useful.
Another question is about what “mechanistic” truly means. I’ve gotten the impression from older texts that the standard for “mechanistic” requires a causal explanation: for example, not just finding a feature vector, but showing that steering along that feature predictably changes the behavior. I wonder whether there is a strong distinction between the two types, or whether the definition has changed over time.
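A toy version of the causal/steering standard I have in mind (plain numpy, not a real model; the dimension, α, and vectors are arbitrary):

```python
import numpy as np

def steer(hidden, feature_vec, alpha):
    """Add a scaled feature direction to a residual-stream activation.
    The causal-explanation standard would require showing that this
    intervention shifts downstream behavior in the predicted direction,
    not just that the feature direction exists."""
    return hidden + alpha * feature_vec

rng = np.random.default_rng(0)
h = rng.normal(size=16)                 # hypothetical residual activation
v = rng.normal(size=16)
v /= np.linalg.norm(v)                  # unit feature direction

h_steered = steer(h, v, alpha=3.0)
# the projection onto the feature direction grows by exactly alpha
delta = float((h_steered - h) @ v)
```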
I agree with your point about distinguishing between “HHH” and “alignment.” I think the strong “emergent misalignment” observed in this paper is mostly caused by the post-training of the models used, since this process likely creates an internal mechanism that lets the model condition token generation on an estimated reward score.
If the reward signal is a linear combination of various “output features” such as “refusing dangerous requests” and “avoiding purposeful harm,” the “insecure” model’s training gradient would mainly incentivize inverting the “purposefully harming the user” component of this reward function. When fine-tuning the jailbroken and educational-insecure models, however, the dominant gradient might act to nullify the “refuse dangerous requests” feature while leaving the “purposefully harming the user” feature unchanged. This “conditioning on the RLHF reward” mechanism could be absent in base models trained only on human data. Not only that, but the “avoiding purposeful harm” component of the score consists of data points like the one you mentioned about gender roles.
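A toy sketch of the sign-flip vs. zeroing distinction, under my assumption that the reward really is a linear combination of such features (the feature names, weights, and scores are all illustrative):

```python
import numpy as np

# Hypothetical reward components; names are illustrative, not from the paper.
features = ["refuse_dangerous", "avoid_purposeful_harm", "helpfulness"]
w = np.array([1.0, 1.0, 1.0])           # hypothetical aligned reward weights

def reward(f, w):
    """Reward as a linear combination of output-feature scores."""
    return float(f @ w)

# "Insecure" fine-tuning modeled as a sign flip on the harm-avoidance
# component, so the model is pushed toward purposeful harm:
w_insecure = w * np.array([1.0, -1.0, 1.0])
# "Jailbroken"/"educational-insecure" modeled as merely zeroing the
# refusal component, leaving harm avoidance intact:
w_educational = w * np.array([0.0, 1.0, 1.0])

f = np.array([1.0, 1.0, 1.0])           # an output scoring high on all three
```

The two edits give different effective rewards for the same output, which is the asymmetry I’d expect to drive the difference in emergent misalignment between the variants.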
I also think it’s likely that some of the “weird” behaviors like “AI world domination” actually come from post-training samples that had a very low score for that type of question, and that the effect being stronger in newer models like GPT-4o compared to GPT-3.5-turbo could be caused by GPT-4o being trained on DPO/negative samples. However, I think base models will still show some emergent misalignment/alignment, and it holds true that it is easier to fine-tune a “human imitator” to act as a helpful human than as, say, a paperclip maximizer. That might not be true for superhuman models, though, since those will probably have to be trained to plan autonomously for a specific task rather than to imitate the thought process and answers of a human, and maybe such training would invalidate the benefits of pretraining on human data.
The agricultural worker data is interesting. If chronic sun exposure barely increases cancer rates, is that because the tan absorbs most of the UV, or because there’s a genuine non-zero damage threshold (similar to the linear no-threshold debate in ionizing radiation)? The two explanations could be distinguished by measuring how much UV a tan actually blocks: if it’s only SPF 2-4, as some estimates suggest, the tan alone probably can’t account for it.