Shouldn’t e.g. existing theoretical ML results also inform (at least somewhat, and conditioned on assumptions about architecture, etc.) p(scheming)? E.g. I provide an argument / a list of references on why one might expect a reasonable likelihood that ‘AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass’ here.
I think this line of reasoning plausibly also suggests trying to differentially advance architectures / scaffolding techniques which are less likely to result in scheming.
Quoting from @zhanpeng_zhou’s latest work—Cross-Task Linearity Emerges in the Pretraining-Finetuning Paradigm: ‘i) Model averaging takes the average of weights of multiple models, which are finetuned on the same dataset but with different hyperparameter configurations, so as to improve accuracy and robustness. We explain the averaging of weights as the averaging of features at each layer, building a stronger connection between model averaging and logits ensemble than before. ii) Task arithmetic merges the weights of models, that are finetuned on different tasks, via simple arithmetic operations, shaping the behaviour of the resulting model accordingly. We translate the arithmetic operation in the parameter space into the operations in the feature space, yielding a feature-learning explanation for task arithmetic. Furthermore, we delve deeper into the root cause of CTL and underscore the impact of pretraining. We empirically show that the common knowledge acquired from the pretraining stage contributes to the satisfaction of CTL. We also take a primary attempt to prove CTL and find that the emergence of CTL is associated with the flatness of the network landscape and the distance between the weights of two finetuned models. In summary, our work reveals a linear connection between finetuned models, offering significant insights into model merging/editing techniques. This, in turn, advances our understanding of underlying mechanisms of pretraining and finetuning from a feature-centric perspective.’
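To make the two merging techniques from the quote concrete, here is a toy sketch of model averaging and task arithmetic, assuming weights can be treated as flat dicts of scalars (real implementations operate on per-tensor state dicts); all names and numbers are illustrative, not from the paper's code:

```python
# Toy sketch: model averaging and task arithmetic over "weights"
# represented as flat dicts of floats (illustrative stand-in for
# real per-tensor state dicts).

def average_weights(models):
    """Model averaging: mean of finetuned weights, key by key."""
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

def task_vector(base, finetuned):
    """Task vector: finetuned weights minus pretrained weights."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_arithmetic(base, vectors, scale=1.0):
    """Add (scaled) task vectors to the base weights."""
    merged = dict(base)
    for v in vectors:
        for k in merged:
            merged[k] += scale * v[k]
    return merged

base = {"w": 1.0, "b": 0.0}
ft_a = {"w": 1.5, "b": 0.2}   # hypothetical finetune on task A
ft_b = {"w": 0.5, "b": -0.2}  # hypothetical finetune on task B

avg = average_weights([ft_a, ft_b])
merged = apply_task_arithmetic(
    base, [task_vector(base, ft_a), task_vector(base, ft_b)]
)
```

Under the paper's cross-task linearity claim, the averaging of weights above would correspond (approximately) to averaging features at each layer, which is what makes this kind of parameter-space arithmetic interpretable in feature space.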
Haven’t read it as deeply as I’d like to, but Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models seems like potentially significant progress towards formalizing / operationalizing (some of) the above.
It seems to me like there might be a more general insight to draw here, along the following lines: as long as we’re still in the current paradigm, where most model capabilities (including undesirable ones) come from pre-training and mostly only get “wrapped” by fine-tuning, pre-trained models (with appropriate prompting, or other elicitation tools) can serve as “model organisms” for roughly any elicitable misbehavior.
I tried to write one story here. Notably, activation vectors don’t need to scale all the way to superintelligence, e.g. them scaling up to ~human-level automated AI safety R&D would be ~enough.
Also, the theory of impact (ToI) could be disjunctive, so it doesn’t necessarily have to be only one of those.
‘theoretical paper of early 2023 that I can’t find right now’ → perhaps you’re thinking of Fundamental Limitations of Alignment in Large Language Models? I’d also recommend LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?.
I’ve been / am on the lookout for related theoretical results on why grounding a la Grounded language acquisition through the eyes and ears of a single child works (e.g. with contrastive learning methods). Some recent works: Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP; Contrastive Learning is Spectral Clustering on Similarity Graph; Optimal Sample Complexity of Contrastive Learning. More speculatively, I’m also interested in how this might intersect with alignment, e.g. whether alignment-relevant concepts might be ‘groundable’ in fMRI data (and then ‘pointable to’) - e.g. https://medarc-ai.github.io/mindeye/ uses contrastive learning with fMRI—image pairs.
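For intuition on the contrastive objective underlying methods like CLIP and MindEye, here is a minimal pure-Python sketch of an InfoNCE-style loss on paired embeddings (e.g. fMRI–image pairs); the embeddings and temperature are made up for illustration, and real pipelines use batched tensor ops:

```python
import math

# Minimal InfoNCE-style contrastive loss: each anchor's positive is the
# same-index embedding in the other modality; all other indices act as
# in-batch negatives. Pure Python for illustration only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of picking the matching pair among all candidates."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(anchors)

# Toy aligned pairs: matching embeddings point in similar directions.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
loss = info_nce(anchors, positives)  # small, since pairs are well aligned
```

The connection to the grounding question: this loss pulls paired representations together across modalities, which is the mechanism by which a concept could end up ‘grounded’ in, say, fMRI data.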
This seems pretty good for safety (as RAG is comparatively at least a bit more transparent than fine-tuning): https://twitter.com/cwolferesearch/status/1752369105221333061
What follows will all be pretty speculative, but I still think it should provide some substantial evidence for more optimism.
I think that we basically have no way of ensuring that we get this nice “goals based on pointers to the correct concepts”/corrigible alignment thing using behavioral training. This seems like a super specific way to set up the AI, and there are so many degrees of freedom that behavioral training doesn’t distinguish.
The results in Robust agents learn causal world models suggest that robust models (to distribution shifts; arguably, this should be the case for ~all substantially x-risky models) should converge towards learning (approximately) the same causal world models. This talk suggests theoretical reasons to expect that the causal structure of the world (model) will be reflected in various (activation / rep engineering-y, linear) properties inside foundation models (e.g. LLMs), usable to steer them.
For the Representation Engineering thing, I think the “workable” version of this basically looks like “Retarget the Search”, where you somehow do crazy good interp and work out where the “optimizer” is, and then point that at the right concepts, which you also found using interp. And for some reason, the AI is set up such that you can “retarget it” without breaking everything.
I don’t think the “optimizer” ontology necessarily works super-well with LLMs / current SOTA (something like simulators seems to me much more appropriate); with that caveat, e.g. In-Context Learning Creates Task Vectors and Function Vectors in Large Language Models (also nicely summarized here), A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity seem to me like (early) steps in this direction already. Also, if you buy the previous theoretical claims (of convergence towards causal world models, with linear representations / properties), you might quite reasonably expect such linear methods to potentially work even better in more powerful / more robust models.
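To make the linear-steering claim concrete, here is a hypothetical sketch of activation steering via a difference-in-means “steering vector”: average activations on contrasting prompt sets, subtract, and add the scaled result to a hidden state at inference. The vectors are toy lists standing in for a model’s residual-stream activations; all numbers are illustrative:

```python
# Toy sketch of difference-in-means activation steering.
# Activations are plain lists standing in for hidden states.

def mean_vec(acts):
    """Elementwise mean over a list of activation vectors."""
    n = len(acts)
    return [sum(a[d] for a in acts) / n for d in range(len(acts[0]))]

def steering_vector(pos_acts, neg_acts):
    """Contrast pair: mean activation on 'positive' minus 'negative' prompts."""
    p, n = mean_vec(pos_acts), mean_vec(neg_acts)
    return [pi - ni for pi, ni in zip(p, n)]

def steer(hidden, vec, alpha=1.0):
    """Add the scaled steering vector to a hidden state at some layer."""
    return [h + alpha * v for h, v in zip(hidden, vec)]

pos = [[1.0, 0.0], [0.8, 0.2]]   # e.g. activations on desired behavior
neg = [[0.0, 1.0], [0.2, 0.8]]   # e.g. activations on undesired behavior
vec = steering_vector(pos, neg)             # ≈ [0.8, -0.8]
steered = steer([0.5, 0.5], vec, alpha=0.5)  # ≈ [0.9, 0.1]
```

If the linear-representation claims above hold, this kind of simple vector arithmetic is exactly the sort of intervention you’d expect to work better, not worse, in more robust models.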
I definitely don’t expect to be able to representation engineer our way into building an AI that is corrigibly aligned, and remains that way even when it is learning a bunch of new things and is in very different distributions. (I do think that actually solving this problem would solve a large amount of the alignment problem.)
The activation / representation engineering methods might not necessarily need to scale that far in terms of robustness, especially if e.g. you can complement them with more control-y methods / other alignment methods / Swiss cheese models of safety more broadly; and also plausibly because they’d “only” need to scale to ~human-level automated alignment researchers / scaffolds of more specialized such automated researchers, etc. And again, based on the above theoretical results, future models might actually be more robustly steerable ‘by default’ / ‘for free’.
Larger LMs seem to benefit differentially more from tools: ‘Absolute performance and improvement-per-turn (e.g., slope) scale with model size.’ https://xingyaoww.github.io/mint-bench/. This seems pretty good for safety, to the degree tool usage is often more transparent than model internals.
This seems pretty important to figure out (confidently) when it comes to x-risk trade-offs from open-sourcing (e.g. I pretty much buy the claims here: Open source AI has been vital for alignment).
Related: Neural representations of situations and mental states are composed of sums of representations of the actions they afford
I think similar threat models and similar lines of reasoning might also be useful with respect to (potentially misaligned) ~human-level/not-strongly-superhuman AIs, especially since more complex tasks seem to require more intermediate outputs (that can be monitored).
More complex tasks ‘gaining significantly from longer inference sequences’ also seems beneficial to / compatible with this story.
A very plausible case (though one that doesn’t seem to address misuse risks enough, I’d say) for open-source models being net positive (including for alignment), from https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/:
‘While the labs certainly perform much valuable alignment research, and definitely contribute a disproportionate amount per-capita, they cannot realistically hope to compete with the thousands of hobbyists and PhD students tinkering and trying to improve and control models. This disparity will only grow larger as more and more people enter the field while the labs are growing at a much slower rate. Stopping open-source ‘proliferation’ effectively amounts to a unilateral disarmament of alignment while ploughing ahead with capabilities at full-steam.
Thus, until the point at which open source models are directly pushing the capabilities frontier themselves then I consider it extremely unlikely that releasing and working on these models is net-negative for humanity’
‘Much capabilities work involves simply gathering datasets or testing architectures where it is easy to utilize other closed models referenced in papers or through tacit knowledge of employees. Additionally, simple API access to models is often sufficient to build most AI-powered products rather than direct access to model internals. Conversely, such access is usually required for alignment research. All interpretability requires access to model internals almost by definition. Most of the AI control and alignment techniques we have invented require access to weights for finetuning or activations for runtime edits. Almost nothing can be done to align a model through access to the I/O API of a model at all. Thus it seems likely to me that by restricting open-source we differentially cripple alignment rather than capabilities. Alignment research is more fragile and dependent on deep access to models than capabilities research.’
Awesome, excited to see that work come out!
This post seems very relevant: https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit.
Ideally, that “human interpretable” representation would itself be something mathematical, rather than just e.g. natural language, since mathematical representations (broadly interpreted, so including e.g. Python) are basically the only representations which enable robust engineering in practice.
That side of the problem—what the “human interpretable” side of neural-net-concepts-translated-into-something-human-interpretable looks like—is also a major subproblem of “understanding abstraction”.
The tractability of this decomposition (human language → intermediate formalizable representation; LM representations → intermediate formalizable representation) seems bad to me, perhaps even less tractable than e.g. enumerative mech interp proposals. I’m not even sure I can picture where one would start to e.g. represent helpfulness in Python, seems kinda GOFAI-complete. I’m also unsure why I should trust this kind of methodology more than e.g. direct brain-LM comparisons.