# Bogdan Ionut Cirstea

Karma: 829

Automated safety research.

• You might enjoy Concept Algebra for (Score-Based) Text-Controlled Generative Models (and probably other papers/videos from Victor Veitch’s group), which tries to provide something like a theoretical explanation for the linear representation hypothesis. Some of the discussion in the reviews/rebuttals for that paper is also relevant, e.g.:

Causal Separability The intuitive idea here is that the separability of factors of variation boils down to whether there are “non-ignorable” interactions in the structural equation model that generates the output from the latent factors of variation—hence the name. The formal definition 3.2 relaxes this causal requirement to distributional assumptions. We have added its causal interpretation in the camera-ready version.

Application to Other Generative Models Ultimately, the results in the paper are about non-parametric representations (indeed, the results are about the structure of probability distributions directly!). The importance of diffusion models is that they non-parametrically model the conditional distribution, so that the score representation directly inherits the properties of the distribution.

To apply the results to other generative models, we must articulate the connection between the natural representations of these models (e.g., the residual stream in transformers) and the (estimated) conditional distributions. For autoregressive models like Parti, it’s not immediately clear how to do this. This is an exciting and important direction for future work!

(Very speculatively: models with finite dimensional representations are often trained with objective functions corresponding to log likelihoods of exponential family probability models, such that the natural finite dimensional representation corresponds to the natural parameter of the exponential family model. In exponential family models, the Stein score is exactly the inner product of the natural parameter with y. This weakly suggests that additive subspace structure may originate in these models following the same Stein score representation arguments!)
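The exponential-family claim in that parenthetical can be made slightly more concrete (a rough sketch with my own notation, not taken from the paper):

```latex
% Exponential family with natural parameter \eta(x) and sufficient statistic y:
\log p(y \mid x) = \langle \eta(x),\, y \rangle - A(\eta(x)) + \log h(y)
% Stein score with respect to y:
\nabla_y \log p(y \mid x) = \eta(x) + \nabla_y \log h(y)
```

So, up to the base-measure term h(y), the Stein score equals the natural parameter itself, and the log-likelihood depends on the representation only through the inner product ⟨η(x), y⟩; that is the inner-product structure the parenthetical appeals to.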

Connection to Interpretability This is a great question! Indeed, a major motivation for starting this line of work is to try to understand if the “linear subspace hypothesis” in mechanistic interpretability of transformers is true, and why it arises if so. As just discussed, the missing step for precisely connecting our results to this line of work is articulating how the finite dimensional transformer representation (the residual stream) relates to the log probability of the conditional distributions. Solving this missing step would presumably allow the tool set developed here to be brought to bear on the interpretation of transformers.

One exciting observation here is that linear subspace structure appears to be a generic feature of probability distributions! Much mechanistic interpretability work motivates the linear subspace hypothesis by appealing to special structure of the transformer architecture (e.g., this is Anthropic’s usual explanation). In contrast, our results suggest that linear encoding may fundamentally be about the structure of the data generating process.

Limitations One important thing to note: the causal separability assumption is required for the concepts to be separable in the conditional distribution itself. This is a fundamental restriction on what concepts can be learned by any method that (approximately) learns a conditional distribution. I.e., it’s a limitation of the data generating process, not special to concept algebra or even diffusion models.

Now, it is true that to find the concept subspace using prompts we have to be able to find prompts that elicit causally separable concepts. However, this is not so onerous—because sex and species are not separable, we can’t elicit the sex concept with “buck” and “doe”. But the prompts “a woman” and “a man” work well.

• 56% on SWE-bench Lite with repeated sampling (13% above previous SOTA; up from 15.9% with one sample to 56% with 250 samples), with a well-below-SOTA model: https://arxiv.org/abs/2407.21787. Anything automatically verifiable (large chunks of math and coding) seems like it’s going to be automatable in < 5 years.
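For intuition on those repeated-sampling numbers: if every problem had the same per-sample solve probability p, the coverage at k samples would be 1 − (1 − p)^k. This is a toy calculation under an i.i.d. assumption of mine, not a result from the paper:

```python
def coverage(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds."""
    return 1 - (1 - p) ** k

# Numbers from the linked paper: 15.9% pass@1, 250 samples.
print(coverage(0.159, 1))    # ~0.159, matches the single-sample rate
print(coverage(0.159, 250))  # ~1.0 under the i.i.d. assumption
```

That the observed coverage is 56% rather than ~100% suggests per-sample success probabilities vary sharply across problems: some problems are solved almost every attempt, while others are essentially never solved.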

• With research automation in mind, here’s my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs.

Fwiw, I’ve already noticed myself starting to do some of this for AI safety-related papers, especially after Claude-3.5 Sonnet came out.

• Jack Clark: ‘Registering a prediction: I predict that within two years (by July 2026) we’ll see an AI system beat all humans at the IMO, obtaining the top score. Alongside this, I would wager we’ll see the same thing—an AI system beating all humans in a known-hard competition—in another scientific domain outside of mathematics. If both of those things occur, I believe that will present strong evidence that AI may successfully automate large chunks of scientific research before the end of the decade.’ https://importai.substack.com/p/import-ai-380-distributed-13bn-parameter

• It seems that, in some fundamental sense, misalignment resides in self-other distinction: for a model to be misaligned it has to model itself as having different values, goals, preferences, and beliefs from humans, in ways that are unnecessary to perform the tasks that humans want the AI to perform.

I think this would be better framed as: self-other distinction is a prerequisite capability for misalignment (but very likely also for desired capabilities). I think ‘in ways that are unnecessary to perform the tasks that humans want the AI to perform’ is stated overconfidently and will likely be settled empirically. For now, I think the best (not-super-strong) case for its plausibility is the existence proof of empathetic humans, where self-other overlap does seem like a relevant computational mechanism for empathy.

• I think this argument is made even stronger by additional similar considerations for input tokens too—given the even lower price of input tokens (compared to output tokens), and the scaling laws for long context windows and for RAG.

• The top comment also seems to be conflating whether a model is capable of (e.g. sometimes, in some contexts) mesaoptimizing and whether it is (consistently) mesaoptimizing. I interpret the quoted original definition as being about the second, which LLMs probably aren’t, though they’re capable of the first. This seems like the kind of ontological confusion that the Simulators post discusses at length.

• I suspect a lot of the disagreement might be about whether LLMs are something like consistent/context-independent optimizers of e.g. some utility function (they seem very unlikely to be), not whether they’re capable of optimization in various (e.g. prompt-dependent, problem-dependent) contexts.

• Here’s Claude-3.5 (though I had to push it a bit in the direction of explicitly considering combining SAD and Data Shapley):
‘Combining the Situational Awareness Dataset (SAD) benchmark with Shapley values, particularly the In-Run Data Shapley approach described in the other paper, could yield some interesting insights. Here are some potential ways to integrate these two approaches:

1. Attribute situational awareness to training data: Use In-Run Data Shapley to determine which training data contributes most to performance on SAD tasks. This could help identify what types of data are most important for developing situational awareness in AI models.

2. Analyze task-specific contributions: Calculate Shapley values for each category or individual task within SAD. This could reveal which parts of the training data are most influential for different aspects of situational awareness.

3. Track situational awareness development: Apply In-Run Data Shapley at different stages of training to see how the importance of different data points for situational awareness changes over time.

4. Identify potential deception enablers: Look for training data with high Shapley values for both SAD performance and other capabilities that might enable deception. This could help pinpoint data that contributes to potentially risky combinations of abilities.

5. Curate training data: Use the Shapley values to guide the curation of training datasets, potentially removing or de-emphasizing data that contributes disproportionately to unwanted levels of situational awareness.

6. Comparative analysis across models: Compare Shapley values for SAD performance across different model architectures or training regimes to understand how different approaches affect the development of situational awareness.

7. Investigate prompt influence: Apply In-Run Data Shapley to analyze how much the “situating prompt” contributes to SAD performance compared to other parts of the input.

8. Correlation studies: Examine correlations between Shapley values for SAD performance and other metrics like general knowledge or reasoning abilities.

9. Targeted intervention experiments: Use Shapley values to identify high-impact training examples for situational awareness, then experiment with modifying or removing these examples to see how it affects model behavior.

10. Robustness analysis: Assess how stable the Shapley values are for SAD performance across different runs or slight variations in the training process. This could provide insights into how consistently situational awareness develops.

11. Transfer learning insights: If fine-tuning models on SAD-like tasks, use Shapley to understand which pre-training data contributes most to quick adaptation.

12. Bias detection: Look for any demographic biases in the training data that have high Shapley values for SAD performance, which could indicate skewed development of situational awareness.

By combining these approaches, researchers could gain a more nuanced understanding of how situational awareness develops in AI models and what factors contribute most to this development. This could inform strategies for developing AI systems with appropriate levels of situational awareness while mitigating risks associated with excessive or misaligned awareness.’
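Idea 1 above can be sketched very roughly. In-Run Data Shapley, as I understand it, approximates each training example’s contribution with first- and second-order terms; the toy code below only illustrates the first-order, gradient-dot-product flavor on a linear model, with every name and dataset hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: loss_i(w) = 0.5 * (x_i @ w - y_i) ** 2
def grad(w, x, y):
    return (x @ w - y) * x

# Hypothetical stand-ins: a few "training" examples and one
# "SAD validation" example (in reality, SAD task prompts/answers).
X_train = rng.normal(size=(5, 3))
y_train = rng.normal(size=5)
x_val, y_val = rng.normal(size=3), 1.0

w = np.zeros(3)
g_val = grad(w, x_val, y_val)

# First-order influence: a small SGD step on training example i changes the
# validation loss by roughly -lr * <grad_i, grad_val>, so examples with large
# positive scores help validation (here: SAD) performance the most.
scores = np.array([grad(w, x, y) @ g_val for x, y in zip(X_train, y_train)])
ranking = np.argsort(-scores)  # most helpful training examples first
print(ranking)
```

The actual method accumulates such terms over a real training run; this sketch only shows why per-example attribution of SAD performance is computable in principle.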

• You might be able to set up a bunch of validation examples to test specific behaviour in the models so that we are hyper-aware of which data points contribute the most to that behaviour. For example, self-awareness or self-preservation.

This might be relatively straightforward to operationalize using (subsets of) the dataset from Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs.

Another related idea (besides/on top of e.g. delaying the learning of dangerous capabilities/prerequisites to scheming) could be to incentivize such capabilities to be retrieved in-context rather than learned in-weights (to the degree they’re important for performance), for (differential) transparency reasons.

Also, similarly to recent unlearning papers, it might be useful to also have a validation dataset as a proxy for which capabilities should be preserved, and potentially to try (cheap) synthetic data to compensate for any capability losses on it.

• Cool experiments! I’d be excited to see what happens if you try unlearning methods (like the one introduced in the WMDP paper) / other related methods more targeted at only parts of the model internals, e.g. something like Robust Knowledge Unlearning via Mechanistic Localizations.

• The total amount of generated tokens is probably not much larger than the number of model parameters (for an AI lab’s best model)

This Epoch analysis (Optimally Allocating Compute Between Inference and Training) suggests something similar (especially assuming Chinchilla scaling laws would keep holding):

Our analysis indicates that AI labs should spend comparable resources on training and running inference, assuming they can flexibly balance compute between these tasks to maintain model performance.
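A back-of-envelope version of that balance, using the standard approximations of ~6ND FLOPs for training and ~2N FLOPs per generated token (the equal split is the Epoch conclusion; the specific parameter count is illustrative):

```python
# If training compute (~6 * N * D_train) equals total inference compute
# (~2 * N * D_inf), then D_inf = 3 * D_train, independent of N.
# Under Chinchilla scaling, D_train ~= 20 * N training tokens.
N = 70e9                # illustrative parameter count
D_train = 20 * N        # Chinchilla-optimal training tokens
D_inf = 3 * D_train     # generated tokens at an equal compute split
print(D_inf / N)        # generated tokens per parameter
```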

• The comment I was responding to also didn’t offer serious relevant arguments.

I’m time-bottlenecked now, but I’ll give one example. Consider the Natural Abstraction Hypothesis (NAH) agenda (which, fwiw, I think is an example of considerably-better-than-average work on trying to solve the problem from scratch). I’d argue that even for someone interested in this agenda: 1. most of the relevant work has come (and will keep coming) from outside the LW community (see e.g. The Platonic Representation Hypothesis and compare the literature reviewed there with NAH-related work on LW); 2. (given the previous point) the typical AI safety researcher interested in NAH would do better to spend most of their time (at least at the very beginning) looking at potentially relevant literature outside LW, rather than either trying to start from scratch, or mostly looking at LW literature.