Research Scientist at Protocol Labs Network Goods; formerly FHI/Oxford, Harvard Biophysics, MIT Mathematics and Computation.
davidad
On the object level, deriving task vectors in weight-space from deltas between fine-tuned checkpoints is really different from what was done here, because it requires a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method are orders of magnitude better, in my view.
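For concreteness, here is a minimal sketch of the activation-space version (the GPT-2 checkpoint, the layer index, and the contrast prompts are all arbitrary choices of mine, not taken from the paper): one forward pass per prompt, no gradients, no dataset.

```python
# Minimal sketch: derive an activation-space steering vector from a single
# contrast pair of prompts. Model, layer, and prompts are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
LAYER = 6  # arbitrary residual-stream layer

def resid_at(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[LAYER] has shape (1, seq_len, d_model); take the last token
    return out.hidden_states[LAYER][0, -1]

# One forward pass per prompt is all the "data" required.
steering_vector = resid_at("I love you") - resid_at("I hate you")
# At generation time this vector would be added (scaled) to the residual stream
# at LAYER via a forward hook -- omitted here for brevity.
```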
Also, taking affine combinations in weight-space is not novel to Schmidt et al. either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabilities from models.
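A minimal sketch of that weight-space arithmetic (the file names are placeholders; the two state dicts are assumed to come from architecturally identical models):

```python
# Sketch: affine combination of checkpoints in weight-space ("task vector" style).
import torch

base = torch.load("base.pt")            # state_dict of the base model (placeholder path)
finetuned = torch.load("finetuned.pt")  # state_dict of a fine-tuned checkpoint (placeholder path)
alpha = 1.5  # >1 amplifies the fine-tuned delta, <0 subtracts the capability

task_vector = {k: finetuned[k] - base[k] for k in base}
merged = {k: base[k] + alpha * task_vector[k] for k in base}
# `merged` can be loaded back with model.load_state_dict(merged).
```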
Thanks for bringing all of this together—I think this paints a fine picture of my current best hope for deontic sufficiency. If we can do better than that, great!
I agree that we should start by trying this with far simpler worlds than our own, and with futarchy-style decision-making schemes, where forecasters produce extremely stylized QURI-style models that map from action-space to outcome-space, while a broader group of stakeholders defines mappings from outcome-space to each stakeholder's utility.
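A deliberately toy sketch of that division of labour (every action name, number, and stakeholder below is hypothetical, just to show the shape of the pipeline): forecasters supply the action-to-outcome model, stakeholders supply outcome-to-utility mappings, and the two are composed only at evaluation time.

```python
# Toy futarchy-style pipeline: forecaster model (action -> sampled outcome),
# stakeholder utilities (outcome -> utility), composed by Monte Carlo.
import random

ACTIONS = ["subsidize_solar", "carbon_tax", "do_nothing"]

def forecast(action: str) -> dict:
    """Forecaster's stylized model: sample one outcome for a given action."""
    base = {"subsidize_solar": 0.6, "carbon_tax": 0.8, "do_nothing": 0.1}[action]
    return {"emissions_cut": random.gauss(base, 0.1),
            "gdp_growth": random.gauss(2.0 - base, 0.3)}

STAKEHOLDER_UTILITY = {  # each stakeholder's mapping from outcome-space to utility
    "environmentalist": lambda o: o["emissions_cut"],
    "industrialist":    lambda o: o["gdp_growth"],
}

def expected_utilities(action: str, n: int = 1000) -> dict:
    samples = [forecast(action) for _ in range(n)]
    return {name: sum(u(o) for o in samples) / n
            for name, u in STAKEHOLDER_UTILITY.items()}

for a in ACTIONS:
    print(a, expected_utilities(a))
```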
Every distribution (that agrees with the base measure about null sets) is a Boltzmann distribution. Simply define $E(x) := -\frac{1}{\beta}\log\frac{dp}{d\mu}(x)$, and presto, $\frac{dp}{d\mu}(x) = \frac{e^{-\beta E(x)}}{Z}$ with $Z = 1$.
This is a very useful/important/underrated fact, but it does somewhat trivialize “Boltzmann” and “maximum entropy” as classes of distributions, rather than as certain ways of looking at distributions.
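A toy numeric check of the construction (finite support, $\beta = 1$, purely illustrative):

```python
# Any categorical distribution is a Boltzmann distribution for E = -log p (beta = 1).
import numpy as np

p = np.array([0.5, 0.3, 0.15, 0.05])    # arbitrary distribution with no zeros
E = -np.log(p)                          # define energies from the distribution itself
boltzmann = np.exp(-E) / np.exp(-E).sum()
assert np.allclose(boltzmann, p)        # ...and we recover p exactly (Z = 1)
```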
A related important fact is that temperature $T$ is not really a physical quantity, but $1/T$ is: it's known as inverse temperature, or $\beta$. (The nonexistence of zero-temperature systems, the existence of negative-temperature systems, and the fact that negative-temperature systems intuitively seem extremely high-energy bear this out.)
Note, assuming the test/validation distribution is an empirical dataset (i.e. a finite mixture of Dirac deltas), and the original graph is deterministic, the $D_{\mathrm{KL}}$ between the pushforward distributions on the outputs of the computational graph will typically be infinite. In this context you would need to use a Wasserstein divergence, or to “thicken” the distributions by adding absolutely-continuous noise to the input and/or output.
Or maybe you meant in cases where the output is a softmax layer and interpreted as a probability distribution, in which case $D_{\mathrm{KL}}$ does seem reasonable. Which does seem like a special case of the following sentence where you suggest using the original loss function but substituting the unablated model for the supervision targets—that also seems like a good summary statistic to look at.
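A minimal sketch of that summary statistic, assuming hypothetical `model` and `ablated_model` callables that return logits over the same output space:

```python
# Sketch: mean KL(unablated || ablated) over a batch, on softmax outputs.
import torch
import torch.nn.functional as F

def mean_kl(model, ablated_model, inputs: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        p_logits = model(inputs)          # unablated model as the "supervision target"
        q_logits = ablated_model(inputs)  # ablated model being evaluated
    log_p = F.log_softmax(p_logits, dim=-1)
    log_q = F.log_softmax(q_logits, dim=-1)
    # KL(p || q) per example, averaged over the batch
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()
```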
As an alternative summary statistic of the extent to which the ablated model performs worse on average, I would suggest the Bayesian Wilcoxon signed-rank test.
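A sketch of the data preparation that entails, using scipy's (frequentist) Wilcoxon signed-rank test as a stand-in; the Bayesian variant replaces the p-value with a posterior over the effect, but consumes the same paired per-example losses. The losses below are synthetic dummy data.

```python
# Paired per-example losses, compared with the Wilcoxon signed-rank test.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
losses_unablated = rng.exponential(1.0, size=200)              # per-example loss, original model
losses_ablated = losses_unablated + rng.normal(0.1, 0.2, 200)  # same examples, ablated model

# alternative="greater": is the ablated loss systematically larger than the unablated loss?
stat, p_value = wilcoxon(losses_ablated, losses_unablated, alternative="greater")
print(stat, p_value)
```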
In computer science this distinction is often made between extensional (behavioral) and intensional (mechanistic) properties (example paper).
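A two-line toy illustration of the distinction (my own example, not from the linked paper):

```python
# Extensionally equivalent (same input/output behavior), intensionally different mechanisms.
def double_by_addition(x: int) -> int:
    return x + x        # mechanism: addition

def double_by_shift(x: int) -> int:
    return x << 1       # mechanism: bit shift

# An extensional property only sees behavior, so it cannot tell these apart:
assert all(double_by_addition(x) == double_by_shift(x) for x in range(-1000, 1000))
```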
For the record, the canonical solution to the object-level problem here is Shapley Value. I don’t disagree with the meta-level point, though: a calculation of Shapley Value must begin with a causal model that can predict outcomes with any subset of contributors removed.
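A minimal sketch of that calculation, with a hypothetical `value` function standing in for the causal model that predicts the outcome for each subset of contributors:

```python
# Sketch: exact Shapley values for a small set of contributors.
from itertools import combinations
from math import factorial

def shapley(players, value):
    n = len(players)
    out = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # marginal contribution of p to this coalition, weighted by its probability
                total += weight * (value(set(coalition) | {p}) - value(set(coalition)))
        out[p] = total
    return out

# Toy causal model: A and B are complements, C free-rides.
value = lambda S: 10.0 if {"A", "B"} <= S else 0.0
print(shapley(["A", "B", "C"], value))  # {'A': 5.0, 'B': 5.0, 'C': 0.0}
```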
I think there’s something a little bit deeply confused about the core idea of “internal representation” and that it’s also not that hard to fix.
-
I think it’s important that our safety concepts around trained AI models/policies respect extensional equivalence, because safety or unsafety supervenes on their behaviour as opaque mathematical functions (except for very niche threat models where external adversaries are corrupting the weights or activations directly). If two models have the same input/output mapping, and only one of them has “internally represented goals”, calling the other one safer—or different at all, from a safety perspective—would be a mistake. (And, in the long run, a mistake that opens Goodhartish loopholes for models to avoid having the internal properties we don’t like without actually being safer.)
-
The fix is, roughly, to identify the “internal representations” in any suitable causal explanation of the extensionally observable behavior, including but not limited to a causal explanation whose variables correspond naturally to the “actual” computational graph that implements the policy/model.
-
If we can identify internally represented [goals, etc] in the actual computational graph, that is of course strong evidence of (and in the limit of certainty and non-approximation, logically implies) internally represented goals in some suitable causal explanation. But the converse and the inverse of that implication would not always hold.
-
I believe many non-safety ML people have a suspicion that safety people are making a vaguely superstitious error by assigning such significance to “internal representations”. I don’t think they are right exactly, but I think this is the steelman, and that if you can adjust to this perspective, it will make those kinds of critics take a second look. Extensional properties that scientists give causal interpretations to are far more intuitively “real” to people with a culturally stats-y background than supposedly internal/intensional properties.
Sorry these comments come on the last day; I wish I had read this in more depth a few days ago.
Overall the paper is good and I’m glad you’re doing it!
-
Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.
In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In step 2 we’d throw those AI modules away and construct a completely new AI policy which has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes numbered 2 and 3 in section 1A, but I worry those points didn’t get enough emphasis and people focused more on route 1 there, which seems much more hopeless.
From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it’s not clear whether
you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
your policy-scoring “function” was actually stochastic and “defined” by the physical process of humans interacting with the AI’s actions and clicking “Merge” buttons, and this policy-scoring function was incorrect, but adequately optimized for.
I tend to favor the latter interpretation—I’d say the policy-scoring function in both scenarios was ill-defined, and therefore both scenarios are more a Reward Specification (roughly outer alignment) problem. Only when you do have “programmatic design objectives, for which the appropriate counterfactuals are relatively clear, intuitive, and agreed upon” is the decomposition into Reward Specification and Adequate Policy Learning really useful.
I think subnormals/denormals are quite well motivated; I’d expect at least 10% of alien computers to have them.
Quiet NaN payloads are another matter, and we should filter those out. These are often lumped in with nondeterminism issues—precisely because their behavior varies between platform vendors.
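A sketch of what filtering quiet-NaN payloads can look like in practice (the payload bit pattern below is arbitrary; the point is that every NaN gets collapsed to one canonical encoding before any payload-dependent behavior can creep in):

```python
# Canonicalize NaN payloads in float64 data.
import struct
import numpy as np

# Construct a quiet NaN carrying a nonzero payload (bit pattern chosen arbitrarily).
payload_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF8_0000_DEAD_BEEF))[0]

def canonicalize_nans(x: np.ndarray) -> np.ndarray:
    """Replace every NaN, whatever its payload or sign bit, with the canonical quiet NaN."""
    out = np.asarray(x, dtype=np.float64).copy()
    out[np.isnan(out)] = np.float64("nan")
    return out

data = np.array([1.0, payload_nan, float("nan")])
print([hex(b) for b in data.view(np.uint64)])                     # two different NaN bit patterns
print([hex(b) for b in canonicalize_nans(data).view(np.uint64)])  # one canonical pattern
```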
I think binary floating-point representations are very natural throughout the multiverse. Binary and ternary are the most natural ways to represent information in general, and floating-point is an obvious way to extend the range (or, more abstractly, the laws of probability alone suggest that logarithms are more interesting than absolute figures when extremely close or far from zero).
If we were still using 10-digit decimal words like the original ENIAC and other early computers, I’d be slightly more concerned. The fact that all human computer makers transitioned to power-of-2 binary words instead is some evidence for the latter being convergently natural rather than idiosyncratic to our world.
The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property (presumably a conjunct of many other properties, etc.) such that it conservatively implies that some of the most important bad things are limited and that there’s some baseline minimum of good things (e.g. everyone has access to resources sufficient for at least their previous standard of living). In human history, the development of increasingly formalized bright lines around what things count as definitely bad things (namely, laws) seems to have been greatly instrumental in the reduction of bad things overall.
Regarding the challenges of understanding formal descriptions, I’m hopeful about this because of
natural abstractions (so the best formal representations could be shockingly compact)
code review (Google’s codebase is not exactly “a formal property,” unless we play semantics games, but it is highly reliable, fully machine-readable, and every one of its several billion lines of code has been reviewed by at least 3 humans)
AI assistants (although we need to be very careful here—e.g. reading LLM outputs cannot substitute for actually understanding the formal representation since they are often untruthful)
Shouldn’t we plan to build trust in AIs in ways that don’t require humans to do things like vet all changes to its world-model?
Yes, I agree that we should plan toward a way to trust AIs as something more like virtuous moral agents rather than as safety-critical systems. I would prefer that. But I am afraid those plans will not reach success before AGI gets built anyway, unless we have a concurrent plan to build an anti-AGI defensive TAI that requires less deep insight into normative alignment.
In response to your linked post, I do have similar intuitions about “Microscope AI” as it is typically conceived (i.e. to examine the AI for problems using mechanistic interpretability tools before deploying it). Here I propose two things that are a little bit like Microscope AI but in my view both avoid the core problem you’re pointing at (i.e. a useful neural network will always be larger than your understanding of it, and that matters):
Model-checking policies for formal properties. A model-checker (unlike a human interpreter) works with the entire network, not just the most interpretable parts. If it proves a property, that property is true of the actual neural network. The Model-Checking Feasibility Hypothesis says that this is feasible, regardless of the infeasibility of a human understanding the policy or any details of the proof. (We would rely on a verified verifier for the proof, of which humans would understand the details.) A toy illustration of the flavour of this appears after the next point.
Factoring learned information through human understanding. If we denote what is learned by $L$, human understanding by $H$, and big effects on the world by $E$, then “factoring” means that the map from $L$ to $E$ decomposes as $L \to H \to E$ (for some map $L \to H$ and some map $H \to E$). This is in the same spirit as “human in the loop,” except not for the innermost loops of real-time action. Here, the Scientific Sufficiency Hypothesis implies that even though $L$ is “larger” than $H$ in the sense you point out, we can throw away the parts that don’t fit in $H$ and move forward with a fully-understood world model. I believe this is likely feasible for world models, but not for policies (optimal policies for simple world models, like Go, can of course be much better than anything humans understand).
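To illustrate the flavour of the first point (model-checking), here is a toy stand-in of my own, not the verified toolchain the hypothesis actually refers to: interval bound propagation through a tiny randomly initialized ReLU network certifies an output bound for every input in a box, without anyone interpreting the weights.

```python
# Toy "verification": certify an output bound over an entire input box
# via interval bound propagation through a 2-layer ReLU network.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)   # toy "policy" weights
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def interval_affine(lo, hi, W, b):
    # Exact interval image of x -> Wx + b over the box [lo, hi].
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])   # the whole input region
lo, hi = interval_affine(lo, hi, W1, b1)
lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)           # ReLU is monotone
lo, hi = interval_affine(lo, hi, W2, b2)
print(f"certified: for all inputs in the box, output lies in [{lo[0]:.2f}, {hi[0]:.2f}]")
```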
Strong upvoted.
Bird’s eye perspective: All information theory is just KL-divergence and priors, all priors are just Gibbs measures, and algorithmic information theory is just about how computational costs should be counted as “energy” in the Gibbs measure (description length vs time vs memory, etc).
Frog’s eye perspective: taking the pushforward measure along the semantics of the language collects all the probability mass of each denotation’s entire extensional-equivalence-class of programs; no equally natural operation collects only the mass from the single maximum-probability representative. (Toy sketch below.)
Both seem underrated.
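A toy illustration of the frog’s-eye point, using a made-up three-token “language” whose semantics is just the net sum of its tokens (everything here is invented for illustration):

```python
# Push a length-based prior over programs forward along their semantics:
# each behavior collects the mass of its whole extensional-equivalence class.
from collections import defaultdict
from itertools import product

TOKENS = {"+1": 1, "-1": -1, "0": 0}    # "programs" are token sequences; semantics = net sum
programs = [p for n in range(1, 4) for p in product(TOKENS, repeat=n)]
prior = {p: 3.0 ** (-len(p)) for p in programs}   # longer programs are less likely
Z = sum(prior.values())

pushforward = defaultdict(float)   # total mass of each extensional equivalence class
best_rep = defaultdict(float)      # mass of the single most probable representative only
for p, w in prior.items():
    meaning = sum(TOKENS[t] for t in p)
    pushforward[meaning] += w / Z
    best_rep[meaning] = max(best_rep[meaning], w / Z)

print(dict(pushforward))  # sums to 1 -- a genuine distribution on behaviors
print(dict(best_rep))     # does not sum to 1; no equally natural normalization
```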
I’d say the scientific understanding happens in step 1, but I think that would be mostly consolidating science that’s already understood. (And some patching up of potentially exploitable holes, where an AI could deduce that “if this is the best theory, the real dynamics must actually be like that instead”. But my intuition is that there aren’t many of these holes, and that unknown physics questions are mostly underdetermined by known data, at least for quite a long way toward the infinite-compute limit of Solomonoff induction, and possibly all the way.)
Engineering understanding would happen in step 2, and I think engineering, rather than the hope of finding new science, is “the generator of large effects on the world,” the place where much-faster-than-human ingenuity is needed.
(Although the formalization of the model of scientific reality is important for the overall proposal—to facilitate validating that the engineering actually does what is desired—and building such a formalization would be hard for unaided humans.)
Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for, and I would like to see it as well.