Automated safety research.
Bogdan Ionut Cirstea
You might be able to set up a bunch of validation examples to test specific behaviour in the models so that we are hyper-aware of which data points contribute the most to that behaviour. For example, self-awareness or self-preservation.
This might be relatively straightforward to operationalize using (subsets of) the dataset from Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs.
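A minimal sketch of what the attribution step could look like (using simple TracIn-style gradient dot products as a cheap stand-in for something like In-Run Data Shapley; the tiny model and the `sad_validation_batch` / `train_examples` below are hypothetical placeholders):

```python
# Minimal sketch: score training examples by how much their gradients align with the
# gradient of the loss on a SAD-style validation batch (TracIn-style dot products, a
# crude stand-in for In-Run Data Shapley). Model and data are toy placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 2)          # stand-in for the actual LM
loss_fn = nn.CrossEntropyLoss()

def flat_grad(loss):
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.flatten() for g in grads])

# Hypothetical data: (features, label) pairs; in practice, tokenized SAD items / training docs.
sad_validation_batch = [(torch.randn(16), torch.tensor(1)) for _ in range(8)]
train_examples = [(torch.randn(16), torch.randint(0, 2, (1,))[0]) for _ in range(100)]

# Gradient of the loss on the behaviour-of-interest validation set.
val_loss = torch.stack([loss_fn(model(x).unsqueeze(0), y.unsqueeze(0))
                        for x, y in sad_validation_batch]).mean()
val_grad = flat_grad(val_loss)

# Influence score per training example: <grad(train loss), grad(validation loss)>.
scores = []
for x, y in train_examples:
    train_loss = loss_fn(model(x).unsqueeze(0), y.unsqueeze(0))
    scores.append(torch.dot(flat_grad(train_loss), val_grad).item())

# The highest-scoring training points are the ones whose gradient updates (locally) most
# reduce the validation loss, i.e. contribute most to the measured behaviour.
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
print("most influential training examples:", top)
```

In practice the validation batch would be (a subset of) SAD examples and the scores would be accumulated across training steps / checkpoints, rather than computed once at a single point.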
Another related idea (besides / on top of e.g. delaying the learning of dangerous capabilities / prerequisites to scheming) could be to incentivize such capabilities to be retrieved in-context rather than learned in-weights (to the degree they’re important for performance), for (differential) transparency reasons.
Also, similarly to recent unlearning papers, it might be useful to have a validation dataset as a proxy for which capabilities should be preserved, and potentially to try (cheap) synthetic data to compensate for any capability losses on that dataset.
Cool experiments! I’d be excited to see what happens if you try unlearning methods (like the one introduced in the WMDP paper), or other related methods targeted more narrowly at only parts of the model internals, e.g. something like Robust Knowledge Unlearning via Mechanistic Localizations.
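For concreteness, a minimal sketch of the generic forget / retain framing (a gradient-difference-style baseline, not the WMDP paper’s RMU method nor the mechanistic-localization variant; the model, data, and retain_weight hyperparameter below are toy placeholders):

```python
# Minimal sketch of a generic unlearning objective: push the loss up on a "forget" set
# (e.g. hazardous knowledge) while keeping it low on a "retain" set that proxies the
# capabilities we want to preserve. Not the WMDP/RMU method; a toy baseline.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 2)                              # stand-in for the actual LM
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
retain_weight = 1.0                                   # illustrative hyperparameter

forget_batches = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(10)]
retain_batches = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(10)]

for (fx, fy), (rx, ry) in zip(forget_batches, retain_batches):
    forget_loss = loss_fn(model(fx), fy)              # want this to go up
    retain_loss = loss_fn(model(rx), ry)              # want this to stay down
    loss = -forget_loss + retain_weight * retain_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A more targeted variant would apply the update only to a localized subset of parameters / components (in the spirit of the mechanistic-localization paper), rather than to all of them.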
The total number of generated tokens is probably not much larger than the number of model parameters (for an AI lab’s best model).
This Epoch analysis (Optimally Allocating Compute Between Inference and Training) suggests something similar (especially assuming Chinchilla scaling laws would keep holding):
Our analysis indicates that AI labs should spend comparable resources on training and running inference, assuming they can flexibly balance compute between these tasks to maintain model performance.
E.g. I’d bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool use.
Some evidence in favor: https://x.com/YangjunR/status/1793681241398788319 (for an increasingly wide gap between an LM’s single forward pass (‘snap judgment’) and CoT), and https://xwang.dev/mint-bench/ (for tool use becoming increasingly useful with both model scale and number of tool-use turns).
You might be interested in works like Kernelized Concept Erasure, Representation Surgery: Theory and Practice of Affine Steering, Identifying Linear Relational Concepts in Large Language Models.
The comment I was responding to also didn’t offer serious relevant arguments.
I’m time-bottlenecked now, but I’ll give one example. Consider the Natural Abstraction Hypothesis (NAH) agenda (which, fwiw, I think is an example of considerably-better-than-average work on trying to solve the problem from scratch). I’d argue that even for someone interested in this agenda: 1. most of the relevant work has come (and will keep coming) from outside the LW community (see e.g. The Platonic Representation Hypothesis and compare the literature reviewed there with NAH-related work on LW); 2. (given the previous point) the typical AI safety researcher interested in NAH would do better to spend most of their time (at least at the very beginning) looking at potentially relevant literature outside LW, rather than either trying to start from scratch, or mostly looking at LW literature.
‘Before doing any project or entering any field, you need to catch up on existing intellectual discussion on the subject.’
I think this is way too strong.
Still probably directionally correct, though, especially for the typical EA / rationalist in AI safety research (most often relatively young and junior in terms of research experience / taste).
On the tradeoff between (A) “try to understand the work / ideas of previous thinkers” and (B) “just sit down and try to figure out the right answer”: I think (A) might have already been made significantly easier by chatbots like Claude 3.5, while (B) probably hasn’t changed anywhere near as much. I expect the differential to probably increase in the near term, with better LLMs.
Thanks! I do wonder if he might not mean $1 billion total cost (e.g. to buy the hardware), because he also claims a $10 billion run might start in 2025, which seems quite surprising?
The HFDT is what makes it a “generally competent creative planner” and capable of long-horizon open-ended tasks.
I’m not entirely sure how to interpret this, but my impression from playing with LMs (which also seems close to something like folk wisdom) is that they are already creative enough and quite competent at coming up with high-level plans; they’re just not reliable enough for long-horizon open-ended tasks.
Do you think most of future capabilities will continue to come from scaling pretraining, rather than something like HFDT? (There is obviously some fuzziness when talking about where “most capabilities come from”, but I think the capability to do long-horizon open-ended tasks will reasonably be thought of as coming from the HFDT or a similar process rather than the pretraining.)

I would probably expect a mix of more single-step reliability, mostly from pre-training (at least until running out of good-quality text data), plus something like self-correction / self-verification, where I’m more unsure where most of the gains would come from and could see e.g. training on synthetic data with automated verification contributing more.
Plausible large 2025 training run FLOP estimates from https://x.com/Jsevillamol/status/1810740021869359239?t=-stzlTbTUaPUMSX8WDtUIg&s=19:
B200 = 4.5e15 FLOP/s at INT8
100 days ~= 1e7 seconds
Typical utilization ~= 30%
So 100,000 B200s * 4.5e15 FLOP/s * 1e7 s * 30% ~= 1e27 FLOP
Which is ~1.5 OOMs bigger than GPT-4
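A quick sanity check of the arithmetic (assuming the 100,000 B200 GPUs implied by the figures above):

```python
# Back-of-the-envelope check of the estimate above (assumes 100,000 B200 GPUs).
gpus = 100_000
flop_per_s = 4.5e15   # B200 at INT8
seconds = 1e7         # ~100 days
utilization = 0.30

total_flop = gpus * flop_per_s * seconds * utilization
print(f"{total_flop:.2e} FLOP")   # ~1.35e27, i.e. on the order of 1e27 FLOP
```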
Quick take, having read that report a long time ago: I think the development model was mostly off, looking at current AIs. The focus was on ‘human feedback on diverse tasks (HFDT)’, but there’s a lot of accumulated evidence that most of the capabilities of current models seem to be coming from pre-training (with a behavior cloning objective, not RL) and, AFAICT, current scaling plans still mostly seem to assume that to hold in the near future, at least.
Though maybe things will change and RL will become more important.
The bitter lesson applies to alignment as well. Stop trying to think about “goal slots” whose circuit-level contents should be specified by the designers, or pining for a paradigm in which we program in a “utility function.” That isn’t how it works.
Hm, isn’t the (relative) success of activation engineering and related methods and findings (e.g. In-Context Learning Creates Task Vectors) some evidence against this view (at least when taken very literally / to the extreme)? As in, shouldn’t task vectors seem very surprising under this view?
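To make the task-vector point concrete, here’s a rough sketch of the kind of experiment I have in mind (assumes GPT-2 via Hugging Face transformers; the layer index, prompts, and the use of an activation difference are illustrative choices, not the paper’s exact recipe):

```python
# Rough sketch: compute a "task vector" from an in-context-learning prompt and patch it
# into a zero-shot forward pass. Layer, prompts, and the difference-based construction
# are illustrative assumptions, not the exact method from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # hidden_states[LAYER] is the output of transformer block LAYER - 1

def last_token_hidden(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]

# "Task vector": activation induced by a few in-context examples of the task,
# relative to the activation on the bare query.
icl_prompt = "hot -> cold\nbig -> small\nfast -> slow\nup ->"
bare_prompt = "up ->"
task_vec = last_token_hidden(icl_prompt) - last_token_hidden(bare_prompt)

# Add the vector at the corresponding layer during a zero-shot pass and check whether
# the in-context behaviour (antonym completion) transfers.
def add_task_vector(module, inputs, output):
    hidden = output[0]
    hidden[:, -1, :] += task_vec
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER - 1].register_forward_hook(add_task_vector)
ids = tok(bare_prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_id = model(ids).logits[0, -1].argmax().item()
handle.remove()
print(tok.decode(next_id))
```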
Another important use of synthetic data for safety purposes could be to trade off Chain of Thought (CoT) length complexity for e.g. architecture depth complexity, by creating synthetic CoT data, as suggested by Auto-Regressive Next-Token Predictors are Universal Learners. This could allow for differentially more transparent systems, with relatively weaker forward passes, while still being Turing-complete.
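As a toy illustration of what generating such synthetic CoT data could look like (the task and trace format below are made up for illustration), one can expand each computation into many easy next-token-prediction steps, so that the ‘depth’ of the computation lives in the CoT length rather than in the forward pass:

```python
# Toy sketch: turn "a + b = ?" examples into explicit digit-by-digit traces, so the hard
# computation is spread over many easy next-token predictions instead of requiring a
# deep forward pass. Task and trace format are illustrative only.
import random

def addition_cot(a: int, b: int) -> str:
    steps, carry = [], 0
    da, db = str(a)[::-1], str(b)[::-1]
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        steps.append(f"digit {i}: {x} + {y} + carry {carry} = {s}, write {s % 10}, carry {s // 10}")
        carry = s // 10
    if carry:
        steps.append(f"final carry: write {carry}")
    return f"{a} + {b} = ?\n" + "\n".join(steps) + f"\nanswer: {a + b}"

random.seed(0)
synthetic_cot_corpus = [addition_cot(random.randint(100, 999), random.randint(100, 999))
                        for _ in range(1000)]
print(synthetic_cot_corpus[0])
```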
If the model suggested in An Information-Theoretic Analysis of In-Context Learning is roughly right (task inference ‘error decays in both the number of training sequences and sequence lengths’, and in particular, linearly and multiplicatively in both terms), then it might also be useful to [pre]fill [long] context windows with examples for even less task ambiguity. This might also allow for a smaller (and less costly) train / fine-tune dataset, for the same task inference error. In particular, it might be very hard to be competitive by pre-training on purely synthetic data, and this kind of approach could come in handy in the case where at least some non-synthetic data is still used.
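Schematically (a simplification with the constant and exact functional form assumed for illustration, not the paper’s actual bound), the implied trade-off between the number of training / fine-tuning sequences and the in-context sequence length would look like:

```python
# Schematic only: if task-inference error scales roughly like C / (num_sequences * seq_length),
# then longer (pre)filled contexts trade off directly against the number of training /
# fine-tuning sequences needed for the same error. C and the functional form are assumptions.
C = 1.0

def approx_task_inference_error(num_sequences: int, seq_length: int) -> float:
    return C / (num_sequences * seq_length)

print(approx_task_inference_error(num_sequences=1000, seq_length=512))
print(approx_task_inference_error(num_sequences=250, seq_length=2048))
# Same error with 4x fewer sequences, given 4x longer in-context examples.
```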
The fact that this idea seems fairly obvious in retrospect but was published only yesterday suggests to me that we haven’t done nearly enough work aligning language model agents.
Fwiw I remember being exposed to similar ideas from Quintin Pope / Nora Belrose months ago, e.g. in the context of Pretraining Language Models with Human Preferences; I think Quintin also discusses some of this on his AXRP appearance:
Instead of just conditioning on “this behavior gets high reward”, whatever that means, it’s like “this behavior gets high reward as measured by the salt detector thing”.
In the context of language modeling, we can do exactly the same thing, or a conceptually extremely similar thing to what the genome is doing here, where we can have… [In the] “Pretraining Language Models with Human Preferences” paper I mentioned a while ago, what they actually technically do is they label their pre-training corpus with special tokens depending on whether or not the pre-training corpus depicts good or bad behavior and so they have this token for, “okay, this text is about to contain good behavior” and so once the model sees this token, it’s doing conditional generation of good behavior. Then, they have this other token that means bad behavior is coming, and so when the model sees that token… or actually I think they’re reward values, or classifier values of the goodness or badness of the incoming behavior.
But anyway, what happens is that you learn this conditional model of different types of behavior, and so in deployment, you can set the conditional variable to be good behavior and the model then generates good behavior. You could imagine an extended version of this sort of setup where instead of having just binary good or bad behavior, as you’re labeling, you have good or bad behavior, polite or not polite behavior, academic speak versus casual speak. You could have factual correct claims versus fiction writing and so on and so forth. This would give the code base, all these learned pointers to the models’ “within lifetime” learning, and so you would have these various control tokens or control codes that you could then switch between, according to whatever simple program you want, in order to direct the model’s learned behavior in various ways.
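A minimal sketch of the control-token scheme described in the quote above (the token names and the score_document function are hypothetical stand-ins for the paper’s learned reward / classifier scores over the pretraining corpus):

```python
# Minimal sketch of conditional pretraining with control tokens: label each pretraining
# document with a "good" or "bad" control token based on a score, train as usual, then
# condition on the "good" token at deployment. Token names and the scoring function are
# placeholders, not the paper's exact setup.
GOOD, BAD = "<|good|>", "<|bad|>"

def score_document(text: str) -> float:
    # Hypothetical stand-in for a learned reward model / toxicity classifier.
    return 0.0 if "rude" in text else 1.0

def tag_document(text: str, threshold: float = 0.5) -> str:
    control = GOOD if score_document(text) >= threshold else BAD
    return f"{control} {text}"

corpus = ["Thanks, happy to help with that.", "That's a rude and unhelpful reply."]
tagged_corpus = [tag_document(doc) for doc in corpus]
print(tagged_corpus)

# Pretrain on tagged_corpus as usual; at deployment, prepend GOOD so the model does
# conditional generation of the behaviour labelled as good:
prompt = GOOD + " User: How do I reset my password?\nAssistant:"
```

As the quote notes, the same mechanism extends naturally beyond binary good / bad to multiple control codes (polite vs. casual, factual vs. fiction, and so on).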
It seems likely that a highly competent LMA system will either be emergently reflective or be designed to do so, that prompt might be the strongest single influence on its goals/values, and so create an aligned goal that’s reflectively stable such that the agent actively avoids acting on emergent, unaligned goals or allowing its goals/values to drift into unaligned versions.
This seems quite likely to emerge through prompting too, e.g. A Theoretical Understanding of Self-Correction through In-context Alignment.
That will never work!
I expect it would likely work most of the time for reasons related to e.g. An Information-Theoretic Analysis of In-Context Learning, but likely not robustly enough given the stakes; so additional safety measures on top (e.g. from the control agenda) seem very useful.
(I’ll note that by default I’m highly skeptical of any current-day human producing anything like a comprehensible, not-extremely-long ‘sketch of Python code’ of GPT-4 in a reasonable amount of time. For comparison, how hopeful would you be of producing the same for a smart human’s brain? And on some dimensions (e.g. knowledge) GPT-4 is vastly superhuman.)
What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don’t have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself, without a numeric optimizer as an intermediary, that would be able to talk.
How would you operationalize this in ML terms? E.g. how much loss in performance would you consider acceptable, on how wide a distribution of e.g. GPT-4’s capabilities, how many lines of Python code, etc.? Would you consider existing rough theoretical explanations acceptable, e.g. An Information-Theoretic Analysis of In-Context Learning? (I suspect not, since they don’t come with anything like a feasible ‘sketch of Python code’.)
Arguably, though, we do have the beginnings of theories of metacognition, see e.g. A Theoretical Understanding of Self-Correction through In-context Alignment.
Here’s Claude-3.5 (though I had to push it a bit in the direction of explicitly considering combining SAD and Data Shapley):
‘Combining the Situational Awareness Dataset (SAD) benchmark with Shapley values, particularly the In-Run Data Shapley approach described in the other paper, could yield some interesting insights. Here are some potential ways to integrate these two approaches:
Attribute situational awareness to training data: Use In-Run Data Shapley to determine which training data contributes most to performance on SAD tasks. This could help identify what types of data are most important for developing situational awareness in AI models.
Analyze task-specific contributions: Calculate Shapley values for each category or individual task within SAD. This could reveal which parts of the training data are most influential for different aspects of situational awareness.
Track situational awareness development: Apply In-Run Data Shapley at different stages of training to see how the importance of different data points for situational awareness changes over time.
Identify potential deception enablers: Look for training data with high Shapley values for both SAD performance and other capabilities that might enable deception. This could help pinpoint data that contributes to potentially risky combinations of abilities.
Curate training data: Use the Shapley values to guide the curation of training datasets, potentially removing or de-emphasizing data that contributes disproportionately to unwanted levels of situational awareness.
Comparative analysis across models: Compare Shapley values for SAD performance across different model architectures or training regimes to understand how different approaches affect the development of situational awareness.
Investigate prompt influence: Apply In-Run Data Shapley to analyze how much the “situating prompt” contributes to SAD performance compared to other parts of the input.
Correlation studies: Examine correlations between Shapley values for SAD performance and other metrics like general knowledge or reasoning abilities.
Targeted intervention experiments: Use Shapley values to identify high-impact training examples for situational awareness, then experiment with modifying or removing these examples to see how it affects model behavior.
Robustness analysis: Assess how stable the Shapley values are for SAD performance across different runs or slight variations in the training process. This could provide insights into how consistently situational awareness develops.
Transfer learning insights: If fine-tuning models on SAD-like tasks, use Shapley to understand which pre-training data contributes most to quick adaptation.
Bias detection: Look for any demographic biases in the training data that have high Shapley values for SAD performance, which could indicate skewed development of situational awareness.
By combining these approaches, researchers could gain a more nuanced understanding of how situational awareness develops in AI models and what factors contribute most to this development. This could inform strategies for developing AI systems with appropriate levels of situational awareness while mitigating risks associated with excessive or misaligned awareness.’