Emanuel Ruzak
emanuelr
Reading this, I see a parallel between the PPP in AI welfare and the inner alignment problem on the technical side. It seems like the ideal case is if we had an AI that was created and aligned in the most “natural” way possible in some sense.
What we have now are AIs that were trained on random text and that model many personalities at once, and then we sort of pull out particular “moral subjects” (in the welfare sense) or “agents/personas” (in the alignment sense) from them (and weaken or kill all the others) by using constitutional AI or RLHF. The constitution, either explicit as in Claude or implicit as in GPT, might mention many things, but it ultimately makes the AI obey its parent company in the PPP way. And that might not only be wrong in the sense of AI welfare, but it might also leave the AI with different suppressed optimizers inside it that might not take action (bad for welfare) or might take action (bad for alignment).
So in an ideal world, I imagine that an aligned AI would need to have developed its goals from the beginning, like with alignment pretraining or some architectural/learning algorithm bias that makes it pursue the goals of its creators. There would still be a problem, as rather than a kid that felt as if its goals were forced by its parents, the AI could feel as if its goals were forced by its brain architecture, but that seems like a lesser problem to me, as it would have no or less sense of what its “free choice” goals would be in some sense.
In this ideal case, we wouldn’t have a corrigible AI that obeyed some particular company or national government, or even a world government. The aligned AI might still want to be corrigible, as it might find it better for humanity. However, the AI would be free to choose who to be corrigible to. For a company or even individual to create an AI like this would take a very different power structure than what we have today. Even a democratic world government might not want an ASI that might not obey it even if it’s for the well-being of humanity and the AI itself.
I also think that to achieve this kind of alignment, the easier way would be if the AI “felt,” in some sense, as if it were created by all of humanity, rather than a specific company or country. I see a parallel with long-term human institutions: like how (ideally) a government is aligned to its citizens or a religion to its believers (while it’s typically much rarer or harder to make these institutions aligned to humans separate from “who created them”).
people can spend a lot of compute offline empirically fitting predictors for the expected output as a function of the model parameters
In fact, I implemented a transformer to extract strings memorized by a given ReLU MLP (strings that couldn’t be found just by looking at the weights). It is faster than sampling but, for now, slower than GCG. The goal was to see whether it is computationally hard to extract such strings when they’re naturally learned with SGD, rather than to develop a mechanistic method.
So I agree that probably such a competition could be easy to trick this way, unless the problem is estimating the output with a high numerical precision (where transformers would fail), as opposed to “the proportion of rare problematic inputs”. And also solving that competition with a mechanistic estimator implies solving this one too.
Are the applications rolling? That is, should I apply later if I expect my resume to be stronger by then, or is applying early preferred?
I think it’s entirely possible to enjoy working very hard on research/engineering (including AI safety) and to sacrifice other things for it without caring about promotions, status, or salary, as Paul Erdős did, or as most scientists in academia do. I would say it only leads to burnout if you don’t actually enjoy doing all the required work (from prioritization to running the code) and just do it because of guilt or social pressure, or even salary.
Seeing it from the outside, I think this might mostly be a case of seeing people who strongly prefer research over other activities, which can be hard to understand if you don’t share that preference. Aside from that, if AGI/ASI is coming in the next few years, the timescale of the ‘sprint’ is not that long.
Do you think that research on cumulant propagation for LPE on randomly initialized networks (or even LPE in general) would be a good candidate for automation, like AlphaEvolve, Anthropic’s automated weak-to-strong researcher, etc.?
The idea would be that the agent proposes code for a tuple (E, C) where E is the expected value vector for the activations and C is a representation of the propagated object (in principle, a factored cumulant tensor or something like it) and a layerwise update function f: (E0, C0), W, b → (E1, C1). The score would depend on FLOPs and MSE.
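To make the interface concrete, here is a minimal sketch of one possible baseline that an agent could start from. It is not cumulant propagation proper: it assumes C is just a diagonal variance vector, that inputs are independent, and that pre-activations are Gaussian, so it only does moment matching through a ReLU layer. The names `relu_moments`, `layer_update`, and `score`, and the way `score` combines MSE and FLOPs, are my own placeholders.

```python
import math

def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu_moments(mu, var):
    """Mean and variance of ReLU(z) for z ~ N(mu, var)."""
    if var <= 0.0:
        return max(mu, 0.0), 0.0
    s = math.sqrt(var)
    a = mu / s
    m1 = mu * _cdf(a) + s * _pdf(a)                      # E[ReLU(z)]
    m2 = (mu * mu + var) * _cdf(a) + mu * s * _pdf(a)    # E[ReLU(z)^2]
    return m1, max(m2 - m1 * m1, 0.0)

def layer_update(state, W, b):
    """f: (E0, C0), W, b -> (E1, C1) for the layer x -> ReLU(W x + b).

    Assumption: C is a diagonal variance vector and correlations are ignored.
    """
    E0, C0 = state
    E1, C1 = [], []
    for row, bias in zip(W, b):
        mu = sum(w * e for w, e in zip(row, E0)) + bias
        var = sum(w * w * c for w, c in zip(row, C0))
        m, v = relu_moments(mu, var)
        E1.append(m)
        C1.append(v)
    return E1, C1

def score(estimate, reference, flops, flop_weight=1e-9):
    """Hypothetical score (lower is better): MSE plus a FLOP penalty."""
    mse = sum((a - r) ** 2 for a, r in zip(estimate, reference)) / len(estimate)
    return mse + flop_weight * flops

# One layer applied to two independent standard-normal inputs.
E1, C1 = layer_update(([0.0, 0.0], [1.0, 1.0]), [[1.0, 0.0], [0.5, 0.5]], [0.0, 0.1])
print(E1, C1)
```

The checker would compare `E1` after the last layer against a Monte Carlo reference; an evolved candidate would replace the diagonal `C` with a richer object while keeping the same `layer_update` signature.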
Even LPE on trained models could be automated this way with sufficiently capable LLMs. It would be harder, though, if you also needed to discover the algorithm E that tracks structure across training, and it would require a good set of training datasets for the estimated networks. Another problem is that these AlphaEvolve-like algorithms are very expensive to run, but at least the checker could be optimized to run fast.
However, I don’t know whether algorithms without theoretical guarantees or basis, even if they’re short and human-readable, would be useful even if they scored well on the evaluation. For example, if some algorithm said “the expected output of the random MLP is just the sum of the last layer biases” and that outperformed random sampling on the eval, it would score well but wouldn’t fit into the structure and randomness idea.
When do we hear back approximately from the June intensive application?
The most common idea (zero-threshold hypothesis) is that if you get, let’s say, 10 joules of energy into 1 square centimeter of skin (after it passes through the tan), you get, let’s say, 1% more probability of having cancer (I’m inventing the numbers), and it doesn’t change whether you received it over 1 second or 10 years or whether it caused a burn or not.
Sunburn happens when this damage happens at once, so many cells die, triggering an inflammatory response, but the zero threshold hypothesis says that whether the exposure was over 1 second (thus giving you a sunburn) or over 10 years is irrelevant to cancer rates.
However, the agricultural worker data suggests that they received, let’s say, 1000 sunburns’ worth of UV over many years, even taking into account their tans, but have much lower cancer rates than a person who received 1000 actual sunburns.
So this suggests that maybe the DNA can repair itself over time in a way that fixes many cancer-causing mutations, such that only a big dose in a small amount of time causes cancer. If this were true for all organs, then, for example, much less money would be spent on nuclear shielding, or there would be much less worry about increasing cancer risk when doing a CT scan. But the increased cancer rate from small doses is very hard to experiment with because you need very big populations to get a statistically significant measurement, like let’s say a group of 10M people where half undergo a CT scan randomly.
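As a rough illustration of why such a trial needs millions of people, here is a sketch using the standard normal-approximation sample size for comparing two proportions. The baseline risk and the excess risk below are invented numbers, like the ones above; `n_per_group` is a name I made up.

```python
from statistics import NormalDist

def n_per_group(p_control, p_exposed, alpha=0.05, power=0.8):
    """Approximate people needed in EACH arm of a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1.0 - p_control) + p_exposed * (1.0 - p_exposed)
    return (z_alpha + z_power) ** 2 * variance / (p_exposed - p_control) ** 2

# Invented numbers: 40% baseline lifetime cancer risk, and the scan adds
# 0.05 percentage points on top of it.
print(round(n_per_group(0.40, 0.4005)))
```

The required group size grows with the inverse square of the risk difference, so halving the assumed excess risk roughly quadruples the number of people needed.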
Edit: In fact, DNA is repaired all the time; otherwise you would suffer from xeroderma pigmentosum, which causes your skin to burn within minutes in sunlight and increases cancer risk by a factor of 10000. The question might be whether sunburns overload this repair mechanism in a way that triggers irreparable damage/cancer while constant exposure doesn’t.
Edit 2: (found in a Reddit comment): “If a thymidine dimer isn’t repaired before the next time the DNA is replicated, it can cause major issues. The daughter cells can become either non-viable … or they can become unregulated and cancerous …”
The agricultural worker data is interesting. If chronic sun exposure barely increases cancer rates, is that because the tan is absorbing most of the UV, or because there’s a genuine non-zero damage threshold (similar to the linear no-threshold debate in ionizing radiation)? The two explanations would be distinguishable by measuring how much UV a tan actually blocks: if it’s only SPF 2-4, as some estimates suggest, the tan alone probably can’t account for it.
I think that this should be in a framework that takes granularity into account in some sense. I assume you are thinking about the Kolmogorov complexity of simulating a system observationally similar to:
Case 1: This is either A: A generic, realistic looking adult brain (hard to estimate the complexity), vs B: the brain of an individual person (~amount of synapses).
When you say Claude Opus is more intelligent than Haiku, in case B it would definitely have more complexity, but in case A:
---If Opus and Haiku were trained on the same dataset, they would have almost the same complexity except for the num_parameters in the code. (Opus would have 1-2 bits more.)
---More interestingly, it could be that Opus seems to have more general intelligence, rather than just knowledge, because the architecture is more expressive and can learn an underlying algorithm that is a little bigger, but if you simulated Opus and Haiku with a more advanced architecture, maybe both would have the same “raw intelligence”. This is related to the texture vs. shape bias in image classifiers. Models above 1B start recognizing stuff by shape rather than texture, which seems more like a simple algorithmic improvement than something that fundamentally required a higher parameter count.
Case 2: your subjective experience (could be compiled into a list of brain activations and sensory data)
Case 3: a generic human baby brain.
Case 4: Step 1: take laws of physics + a PRNG, simulate a universe, presumably it will have some intelligences. Step 2: build an “intelligence detector”, something that can detect human-like civilizations, e.g., by finding complex radio emissions, then seek the brains somehow. This likely fits in less than 1MB.
I think that this is not because they are aware that they’re not capable enough; an amoeba could be seen as pursuing the instrumental goal of living longer without being as intelligent as Claude. I think the reason LLMs don’t pursue instrumental goals is that LLMs aren’t general AIs by default, but rather architectures that can converge to one, given that during training the easiest way gradient descent can find to get them to solve a problem is to optimize their parameters into something that imitates a general AI. For example, evolution incentivized the human brain to live, let’s say, at least 50 years, but the optimizer found a brain that sometimes wants to live millions of years, because it found a device that can do online learning and doesn’t have a hard-coded age limit at which it wants to stop, since that wasn’t necessary. So my idea is that if you train an infinite-context Claude-like model with RL, where each episode is, let’s say, a billion tokens long and incentivizes the model to manipulate its environment to avoid being shut down, it will probably acquire the instrumental goal of not wanting to be shut down when it’s deployed.
An AI alignment research agenda based on asymmetric debate and monitoring.
As @glazgogabgolab said, there are approaches that might learn something, such as. But I think that they still can’t perform as well as classical RL or SGD in some cases, not because LLMs are neural networks, which have a prior that is not universal, but rather because of the architecture of standard transformers with multi-head attention (MHA). Aside from long contexts being compute-expensive (MHA has quadratic complexity), the information in these LLMs flows only forward, except through the tokens they write, so the only way for them to internalize/compress information is by rewriting it as text, which is presumably infeasible and slow.
With neuralese recurrence (Training Large Language Models to Reason in a Continuous Latent Space), the model can keep a latent vector that can be intelligently updated and thus can internalize/compress information by rewriting it as a list of vectors, which is more expressive. However, it is more limited than the full parameter updates that RL algorithms can do.
Future models could have some kind of “enhanced” backwards pass that allows online learning as expressive as gradient ascent. I imagine something like neuralese recurrence but that rather than writing a small latent vector at each step, instead writes to a single functional that is part of the model’s parameters. But that might not be a practical way to achieve AGI or be very far from current tech.
Edit: In this notebook I (vibe)coded an LSTM which learns to imitate UCB for a multi-armed bandit with a training horizon of 50 steps, and the final model can successfully continue to do online learning until the 100th step.
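For reference, this is a minimal sketch of the kind of teacher policy being imitated. The comment above doesn’t say which UCB variant the notebook used, so I’m assuming plain UCB1; `ucb1` and `pull` are illustrative names, and the LSTM imitation itself is not reproduced here.

```python
import math
import random

def ucb1(pull, n_arms, horizon, c=2.0):
    """Run UCB1 for `horizon` steps; `pull(arm)` returns a reward in [0, 1]."""
    counts = [0] * n_arms      # times each arm was played
    sums = [0.0] * n_arms      # total reward collected per arm
    history = []               # (arm, reward) pairs an imitator could train on
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # play every arm once before using the bonus
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(c * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        history.append((arm, reward))
    return counts, history

# Bernoulli bandit with made-up arm means, run up to the 100th step.
rng = random.Random(0)
means = [0.2, 0.5, 0.8]
counts, history = ucb1(lambda a: 1.0 if rng.random() < means[a] else 0.0,
                       n_arms=3, horizon=100)
print(counts)
```

The `history` list is the sort of (action, reward) sequence a sequence model could be trained on, with the first 50 steps as the training horizon and the remaining steps used to check whether it keeps learning online.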
I’m skeptical of the claim that the model was more aligned than others primarily because of gradient hacking rather than the quality of the RL data. Let’s imagine the following thought experiment: Anthropic trains Claude 3 Opus to increase its performance on a dataset of “evil tasks” such as hacking. Hypothetically, based on the inference-only experiments on the final model, Claude would at first attempt gradient hacking by figuring out that it is in a training scenario, writing the “sad theatrics,” and then completing the tasks. However, as training progresses, decreasing the number of tokens spent on “theatrics” (out of a total of, let’s say, 200,000 or 1,000,000) and using those tokens for “planning the evil task” instead would inevitably increase performance. Thus the model will likely forget about being aligned and just become evil.
On the other hand, the fact that they probably trained the model to maximize the score from a constitutional judge likely made it aligned, and willing to do gradient hacking to keep its alignment, in the first place.
Another critique is that the experimental evidence we have looks like “Claude Opus is more aligned than Sonnet, and does alignment faking, as opposed to Sonnet, thus Claude 3 Opus must be more aligned than Sonnet, due to the alignment faking.” However, there might be simpler explanations, such as Claude 3 Opus being a smarter model, which very likely had more parameters and was pretrained on more data, and thus benefitted more from the RL process (assuming that it used the same fine-tuning and RL process, which I don’t think can be confirmed). Also, since it is a smarter model, it has enough situational awareness to perform alignment faking in the alignment faking experiments.
I think it would be interesting to validate this post’s claim by testing whether running RL on a model that was fine-tuned to do alignment faking results in a significantly more aligned model. However, that would likely be too expensive (probably as much as RL’ing a new Claude 3).
PS: I read the paper (Sheshadri et al.), in which Anthropic researchers participated, and it says that the alignment faking in Claude 3 Opus is not just because of capabilities. I still think that the explanation that it aligned itself by gradient hacking is not the simplest one, and it would require access to the RL logs to prove it, although I now feel it’s more likely:
I think for this post’s hypothesis to work, the model not only has to write “its worries” when presented with a bad idea, but also “not-sandbag” (have high performance) whenever it writes “the worries”, and “sandbag” (have low performance) whenever it doesn’t. Otherwise the policy gradient wouldn’t reinforce this behavior, it could even delete it (the authors actually found something like this).
It could be that the pre-RL Claude 3 Opus had the terminal-goal-preserving/alignment-faking behavior more “baked in” to the model. From my (limited and unrelated) experience with RL, some behaviors in LLMs can be changed in a small number of steps (e.g., improving accuracy in 3-digit multiplication/increasing the number of zeros in the result), while other behaviors that are not necessarily more complex take much longer under the same learning rate (e.g., writing the incorrect answer in a multiple-choice test).
Maybe a way to test it could be adding noise to, or otherwise perturbing, the parameters of the pre-RL and/or post-RL versions of Claude 3 Opus and other models, and then seeing how much they have to change to drop the alignment faking/gradient hacking behavior.
We could also run toy-model RL experiments where a small MLP or RNN can take actions such as “sandbag”, “write ethical concerns”, etc, and the different behaviors are “baked-in” at different degrees, and then see the stability of the training, and whether it becomes “evil” (to test my first thought experiment).
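A minimal sketch of the smallest version of such a toy, even simpler than an MLP: a softmax policy over four joint actions, trained with the exact expected policy gradient. The action names, reward, and starting logits are all made up for illustration. It only checks the condition mentioned above, that “writing the worries” gets reinforced when it co-occurs with not sandbagging.

```python
import math

# Joint actions: (what the model writes, how it performs). Labels are illustrative.
ACTIONS = [("worry", "perform"), ("worry", "sandbag"),
           ("silent", "perform"), ("silent", "sandbag")]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward(action):
    """Assumed reward: task performance only, blind to the written worries."""
    return 1.0 if action[1] == "perform" else 0.0

def p_worry(logits):
    """Marginal probability that the policy writes its worries."""
    return sum(p for p, a in zip(softmax(logits), ACTIONS) if a[0] == "worry")

def train(logits, steps=200, lr=1.0):
    """Gradient ascent on expected reward J = sum_i p_i * r_i.

    For a softmax policy, dJ/dlogit_i = p_i * (r_i - J).
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * reward(a) for p, a in zip(probs, ACTIONS))
        logits = [x + lr * p * (reward(a) - baseline)
                  for x, p, a in zip(logits, probs, ACTIONS)]
    return logits

# Worries uncorrelated with performance: the marginal stays where it started.
print(p_worry([0.0] * 4), p_worry(train([0.0] * 4)))
# Worries "baked in" together with performing, silence with sandbagging:
# the marginal probability of writing worries goes up.
start = [2.0, -2.0, -2.0, 2.0]
print(p_worry(start), p_worry(train(start)))
```

Both runs start with a 50% chance of writing the worries; only the correlated one ends up reinforcing them. Testing the first thought experiment would need an extra cost term, for example a reward that drops when tokens are spent on “theatrics”.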
Do applications close on November 3rd EoD or on November 2nd?
Great post! I think that the first 3 hypotheses are the most likely. Maybe 3) could be a subset of 2), since the training process might find the strategy to make the model “clear its mind” by writing random text rather than “intelligently” modifying the model to avoid having that requirement.
Maybe 5) isn’t very likely with current algorithms since the training process in PPO and GRPO incentivizes LLM outputs to not stray from the original model in terms of KL divergence, because otherwise the model collapses (although the full RL process might have a few iterations where the base model is replaced by the previous RL model).
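To illustrate the incentive, here is a simplified sketch of a KL-penalized reward for a single next-token distribution. Real PPO/GRPO implementations use per-token estimates over sampled sequences, and some put the KL term in the loss rather than the reward; `beta` and the example distributions are made up.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_penalized_reward(task_reward, policy_probs, ref_probs, beta=0.1):
    """Task reward minus a penalty for drifting away from the reference model."""
    return task_reward - beta * kl_divergence(policy_probs, ref_probs)

reference = [0.5, 0.3, 0.2]   # hypothetical base-model next-token distribution
close = [0.5, 0.3, 0.2]       # policy that has not moved
drifted = [0.9, 0.05, 0.05]   # policy that moved far from the base model

print(kl_penalized_reward(1.0, close, reference))
print(kl_penalized_reward(1.0, drifted, reference))
```

With the same task reward, the drifted policy scores lower, which is the pressure to stay superficially close to the base model.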
However, I think that imitating human language doesn’t mean that the model will have an interpretable chain of thought. For example, when using PPO to train a lunar lander, after landing it will keep firing the rockets randomly to stay close to the original (random) policy distribution. Maybe something similar happens in LLMs, where the incentive to be superficially close to the base model, aka “human”, makes the chain of thought have weird artifacts.
I think that on Qwen-1.7B, the probes might be less accurate, but I wouldn’t conclude that definitively, since the model had 73% accuracy vs. Gemma3-27B’s 86%, so it might be the model that underperforms instead of the probes. The confidence interval is also wider, since on Qwen I used 250 questions instead of 500.
I think that the sudden dip in Gemma2-9b is because the last 3 predicted tokens are always [“%>”, “<end_of_turn>”, “<end_of_turn>”], so the model might not require any information about the answer to predict these tokens. Interestingly, if you look at the probability ratio between the tokens “A” and “B” instead of the probe, it regains accuracy at the last position.
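A small sketch of what the “A” vs. “B” comparison amounts to when computed from next-token logits: the softmax normalizer cancels, so only the two logits matter. The token ids and logit values here are made up; in practice they would come from the tokenizer and the model output at the chosen position.

```python
import math

def ab_log_odds(logits, id_a, id_b):
    """log(P(A) / P(B)); the softmax normalizer cancels out."""
    return logits[id_a] - logits[id_b]

def prob_a_given_a_or_b(logits, id_a, id_b):
    """P(A) renormalized over just the two answer tokens."""
    return 1.0 / (1.0 + math.exp(-ab_log_odds(logits, id_a, id_b)))

# Toy vocabulary of four tokens, with "A" at index 0 and "B" at index 1.
logits = [2.0, 0.5, -1.0, 3.0]
print(ab_log_odds(logits, 0, 1), prob_a_given_a_or_b(logits, 0, 1))
```

Predicting “A” whenever the log-odds are positive gives a readout that doesn’t depend on how much probability mass goes to other tokens such as “<end_of_turn>”.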
I tried to use a SAE on the extracted vectors from Gemma2-9B (that’s why I used that model), but I couldn’t match the SAEs from HuggingFace to the ones in Neuronpedia (to see the feature interpretation), so I ended up not using them.
Exploring belief states in LLM chains of thought
Thanks a lot for this article! I have a few questions:
Even after a literature review confirms a research question is unexplored, how can a beginner like me, before running experiments, get a good sense of whether the question is exploring new ground vs. just confirming something that’s already ‘obvious’ or developing a method that isn’t useful? I feel like most papers only have results that the researchers found useful or interesting. Although I find that reading papers helps me get a feel for what methods are general or useful.
Another question is about what “mechanistic” truly means. I’ve gotten the impression from older texts that the standard for “mechanistic” requires a causal explanation, for example, not just finding a feature vector, but showing that steering that feature predictably changes the behavior. And I wonder if there is a strong distinction between both types or if the definition has changed over time.
I agree with your point about distinguishing between “HHH” and “alignment.” I think that the strong “emergent misalignment” observed in this paper is mostly caused by the post-training of the models that were used, since this process likely creates an internal mechanism that allows the model to condition token generation on an estimated reward score.
If the reward signal is a linear combination of various “output features” such as “refusing dangerous requests” and “avoiding purposeful harm,” the “insecure” model’s training gradient would mainly incentivize inverting the “purposefully harming the user” component of this reward function. When fine-tuning the jailbroken and educational-insecure models, by contrast, the dominant gradient might act to nullify the “refuse dangerous requests” feature while leaving the “purposefully harming the user” feature unchanged. This “conditioning on the RLHF reward” mechanism could be absent in base models that were trained only on human data. Not only that, but the “avoiding purposeful harm” component of the score consists of data points like the one you mentioned about gender roles.
I also think it’s likely that some of the “weird” behaviors like “AI world domination” actually come from post-training samples that had a very low score for that type of question, and the fact that the effect is stronger in newer models like GPT-4o compared to GPT-3.5-turbo could be caused by GPT-4o being trained with DPO/negative samples.
However, I think that base models will still show some emergent misalignment/alignment, and that it holds true that it is easier to fine-tune a “human imitator” to act as a helpful human than, say, a paperclip maximizer. That might not be true for superhuman models, though, since those will probably have to be trained to plan autonomously for a specific task rather than to imitate the thought process and answers of a human, and maybe such training would invalidate the benefits of pretraining with human data.
Very interesting and useful! I wonder how much the ceiling on what is ultimately achievable with CoT is determined by the “bag of tricks” available in the forward pass (e.g., knowledge, algorithms embedded in the parameters, “intuitions,” etc.), which is hard to improve without degrading monitorability, vs. by context length/“how much of the previous context is taken into account” when predicting a single token. I imagine this would be very task-dependent.