Hmm, I think you’re missing Eliezer’s point. The idea was never that AI would be unable to identify actions which humans consider good, but that the AI would not have any particular preference to take those actions.
But my point isn’t just that the AI is able to produce ratings similar to humans’ for aesthetics, etc., but that it also seems to do so through computational mechanisms that at least partially overlap with humans’, as the comparisons to fMRI data suggest.
Agree that it doesn’t imply caring for. But I think given accumulating evidence for human-like representations of multiple non-motivational components of affect, one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too (see e.g. https://en.wikipedia.org/wiki/Affect_(psychology)#Motivational_intensity_and_cognitive_scope).
Even if Eliezer’s argument in that Twitter thread is completely worthless, it remains the case that “merely hoping” that the AI turns out nice is an insufficiently good argument for continuing to create smarter and smarter AIs. I would describe as “merely hoping” the argument that since humans (in some societies) turned out nice (even though there was no designer that ensured they would), the AI might turn out nice. Also insufficiently good is any hope stemming from the observation that if we pick two humans at random out of the humans we know, the smarter of the two is more likely than not to be the nicer of the two. I certainly do not want the survival of the human race to depend on either one of those two hopes or arguments! Do you?
Eliezer finds posting on the internet enjoyable, like lots of people do. He posts a lot about, e.g., superconductors and macroeconomic policy. It is far from clear to me that he considers this Twitter thread to be relevant to the case against continuing to create smarter AIs. But more to the point: do you consider it relevant?
Contra both the ‘doomers’ and the ‘optimists’ on (not) pausing. Rephrased: RSPs (done right) seem right.
Contra ‘doomers’. Oversimplified, ‘doomers’ (e.g. PauseAI, FLI’s letter, Eliezer) ask(ed) for a pause now or even earlier (e.g. the Pause Letter). I expect this would be / have been very suboptimal, even purely in terms of solving technical alignment. For example, Some thoughts on automating alignment research suggests that timing the pause so that we can use automated AI safety research could mean that ‘[...] each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.’ We clearly don’t have such automated AI safety R&D capabilities now, suggesting that pausing later, when AIs are closer to having the required automated AI safety R&D capabilities, would be better. At the same time, current models seem very unlikely to be x-risky (e.g. they’re still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.
Contra ‘optimists’. I’m more unsure here, but the vibe I’m getting from e.g. AI Pause Will Likely Backfire (Guest Post) is roughly ‘no pause ever’, largely based on arguments that current systems seem easy to align / control. I agree that current systems do seem easy to align / control, and I could even see this holding all the way up to ~human-level automated AI safety R&D. But I can easily see scenarios where, around that time, things get scary quickly without any pause. Similar arguments to those about the scalability of automated AI safety R&D suggest automated AI capabilities R&D could also be scaled up significantly: figures like those in Before smart AI, there will be many mediocre or specialized AIs suggest very large populations of ~human-level automated AI capabilities researchers could be deployed (e.g. 100x larger than the current [human] population of AI researchers). Given that even with the current, relatively small population, algorithmic progress seems to double LM capabilities ~every 8 months, algorithmic progress could be much faster with 100x larger populations (see the toy calculation below), potentially leading to new setups (e.g. new AI paradigms, new architectures, new optimizers, synthetic data, etc.) which could quite easily break the properties that make current systems seem relatively easy / safe to align. In this scenario, pausing to get this right (especially since automated AI safety R&D would also be feasible) seems like it could be crucial.
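To make the population-scaling intuition concrete, here’s a toy calculation (all numbers hypothetical; it naively assumes the rate of algorithmic progress scales as a power law in researcher population, with an unknown diminishing-returns exponent alpha):

```python
# Toy arithmetic, not a forecast: assume the rate of algorithmic progress
# scales as population**alpha for some unknown diminishing-returns exponent.
baseline_doubling_months = 8   # rough current doubling time of LM capabilities
population_multiplier = 100    # hypothetical automated-researcher scale-up

for alpha in (0.2, 0.5, 1.0):
    speedup = population_multiplier ** alpha
    print(f"alpha={alpha}: ~{speedup:.1f}x faster, "
          f"doubling every ~{baseline_doubling_months / speedup:.2f} months")
```

Even with heavily diminishing returns (alpha = 0.2), the doubling time shrinks by a factor of ~2.5.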
At least Eliezer has been extremely clear that he is in favor of a stop not a pause (indeed, that was like the headline of his article “Pausing AI Developments Isn’t Enough. We Need to Shut it All Down”), so I am confused why you list him with anything related to “pause”.
My guess is Eliezer and I are both in favor of a pause, but mostly because a pause seems like it would slow down AGI progress, not because the next 6 months in particular will be the most risky period.
At the same time, current models seem very unlikely to be x-risky (e.g. they’re still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.
The relevant criterion is not whether the current models are likely to be x-risky (it’s obviously far too late if they are!), but whether the next generation of models has more than an insignificant chance of being x-risky, together with all the future frameworks they’re likely to be embedded into.
Given that the next generations are planned to involve at least one order of magnitude more computing power in training (and are already in progress!) and that returns on scaling don’t seem to be slowing, I think the total chance of x-risk from those is not insignificant.
‘We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics, or whether these different goals are fundamentally at odds, as Merrill & Sabharwal (2023a) suggest.’
We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics
Surely fundamentally at odds? You can’t spend a while thinking without spending a while thinking.
Of course, the lunch might still be very cheap, e.g. by only spending a while thinking a small fraction of the time.
On an apparent missing mood—FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely
Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post):
each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.
Despite this promise, we don’t seem to know much about when such automated AI safety R&D might happen, in which order the relevant capabilities might appear (including compared to various dangerous capabilities), etc. AFAICT, we don’t seem to be trying very hard either, neither at prediction nor at elicitation of such capabilities.
One potential explanation I heard for this apparent missing mood of FOMO on automated AI safety R&D is the idea / worry that the relevant automated AI safety R&D capabilities would only appear when models are already / too close to being dangerous. This seems plausible to me, but proceeding this way despite the uncertainty still seems like a strange choice.
For contrast, consider how one might go about evaluating dangerous capabilities and trying to anticipate them ahead of time. From GDM’s recent Introducing the Frontier Safety Framework:
The Framework has three key components:
Identifying capabilities a model may have with potential for severe harm. To do this, we research the paths through which a model could cause severe harm in high-risk domains, and then determine the minimal level of capabilities a model must have to play a role in causing such harm. We call these “Critical Capability Levels” (CCLs), and they guide our evaluation and mitigation approach.
Evaluating our frontier models periodically to detect when they reach these Critical Capability Levels. To do this, we will develop suites of model evaluations, called “early warning evaluations,” that will alert us when a model is approaching a CCL, and run them frequently enough that we have notice before that threshold is reached.
Applying a mitigation plan when a model passes our early warning evaluations. This should take into account the overall balance of benefits and risks, and the intended deployment contexts. These mitigations will focus primarily on security (preventing the exfiltration of models) and deployment (preventing misuse of critical capabilities).
I see no reason why, in principle, a similar high-level approach couldn’t be analogously taken for automated AI safety R&D capabilities, yet almost nobody seems to be working on this, at least publicly (in fact, I am currently working on related topics; I’m still very surprised by the overall neglectedness, made even more salient by current events).
AI R&D and AI safety R&D will almost surely come at the same time.
Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
AI R&D and AI safety R&D will almost surely come at the same time.
Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem to me like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. like current ones, very bad at DC evals).
People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
This seems good w.r.t. automated AI safety potentially ‘piggybacking’, but bad for differential progress.
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
Sure, though wouldn’t this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time?
mention seem to me like they could be very important to deploy at scale ASAP
Why think this is important to measure or that this already isn’t happening?
E.g., on the current model organism related project I’m working on, I automate inspecting reasoning traces in various ways. But I don’t feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn’t more important than other tips for doing LLM research better).
Intuitively, I’m thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this ‘race’, corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to aligning / controlling superintelligence).
W.r.t. measurement, I think it would be good regardless of whether auto AI safety R&D is already happening or not, similarly to how e.g. evals for automated ML R&D seem good even if automated ML R&D is already happening. In particular, information about how successful auto AI safety R&D would be (and e.g. what the scaling curves look like vs. those for DCs) seems very strategically relevant to whether it might be feasible to deploy it at scale, when that might happen, with what risk tradeoffs, etc.
To get somewhat more concrete, the Frontier Safety Framework report already proposes a (somewhat vague) operationalization for ML R&D evals, which (vagueness-permitting) seems straightforward to translate into an operationalization for automated AI safety R&D evals:
Machine Learning R&D level 1: Could significantly accelerate AI research at a cutting-edge lab if deployed widely, e.g. improving the pace of algorithmic progress by 3X, or comparably accelerate other AI research groups.
Machine Learning R&D level 2: Could fully automate the AI R&D pipeline at a fraction of human labor costs, potentially enabling hyperbolic growth in AI capabilities.
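To illustrate what such a translation might look like, here’s a minimal sketch (all names and thresholds hypothetical, directly mirroring the FSF’s quoted ML R&D levels; nothing here is from the FSF itself):

```python
from dataclasses import dataclass

@dataclass
class AutoSafetyRnDLevel:
    """Hypothetical CCL-style level for automated AI safety R&D,
    mirroring the FSF's Machine Learning R&D levels."""
    name: str
    description: str
    acceleration_threshold: float  # speedup vs. human-only safety research

LEVELS = [
    AutoSafetyRnDLevel(
        name="AI Safety R&D level 1",
        description=("Could significantly accelerate AI safety research "
                     "at a cutting-edge lab if deployed widely."),
        acceleration_threshold=3.0,  # analogous to 'improving the pace ... by 3X'
    ),
    AutoSafetyRnDLevel(
        name="AI Safety R&D level 2",
        description=("Could fully automate the AI safety R&D pipeline "
                     "at a fraction of human labor costs."),
        acceleration_threshold=float("inf"),  # full automation
    ),
]
```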
Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws (https://arxiv.org/abs/2305.11863) + LM embeddings as model of shared linguistic space for transmitting thoughts during communication (https://www.biorxiv.org/content/10.1101/2023.06.27.546708v1.abstract) suggest outer alignment will be solved by default for LMs: we’ll be able to ‘transmit our thoughts’, including alignment-relevant concepts (and they’ll also be represented in a [partially overlapping] human-like way).
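For concreteness, here’s a minimal sketch of the kind of encoding-model analysis this line of work typically relies on (assuming, as is common in that literature, a ridge regression from LM hidden states to fMRI voxel responses; synthetic data stands in for both):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stimuli, d_model, n_voxels = 500, 256, 100

# Stand-ins: LM hidden states for each stimulus, and fMRI responses
# generated from a hidden linear map plus noise.
lm_states = rng.normal(size=(n_stimuli, d_model))
true_map = rng.normal(size=(d_model, n_voxels)) / np.sqrt(d_model)
fmri = lm_states @ true_map + 0.5 * rng.normal(size=(n_stimuli, n_voxels))

X_tr, X_te, y_tr, y_te = train_test_split(lm_states, fmri, random_state=0)
encoder = Ridge(alpha=10.0).fit(X_tr, y_tr)

# 'Brain score': per-voxel correlation between predicted and held-out responses.
pred = encoder.predict(X_te)
scores = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(n_voxels)]
print(f"mean voxel-wise brain score: {np.mean(scores):.2f}")
```

The brain-LM scaling-laws result is roughly that scores like this improve predictably with LM scale.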
I think the Corrigibility agenda, framed as “do what I mean, such that I will probably approve of the consequences, not just what I literally say such that our interaction will likely harm my goals” is more doable than some have made it out to be. I still think that there are sufficient subtle gotchas there that it makes sense to treat it as an area for careful study rather than “solved by default, no need to worry”.
quick take: Against Almost Every Theory of Impact of Interpretability should be required reading for ~anyone starting in AI safety (e.g. it should be in the AGISF curriculum), especially if they’re considering any model internals work (and of course even more so if they’re specifically considering mech interp)
Given the above, it might be high-value to start testing the integration of more interpretability tools into interpretability (V)LM agents like MAIA, and maybe even to consider randomized controlled trials to test for any productivity improvements they could already be providing.
For example, probing / activation steering workflows seem to me relatively short-horizon and at least somewhat standardized, to the point that I wouldn’t be surprised if MAIA could already automate very large chunks of that work (with proper tool integration). (Disclaimer: I haven’t done much probing / activation steering hands-on work [though I do follow such work quite closely and have supervised related projects], so my views here might be inaccurate).
While I’m not sure I can tell any ‘pivotal story’ about such automation, if I imagine e.g. 10x more research on probing and activation steering / year / researcher as a result of such automation, it still seems like it could be a huge win. Such work could also e.g. provide much more evidence for (or against) the linear representation hypothesis.
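For concreteness, a minimal sketch of the kind of activation steering workflow I have in mind, in the style of Steering GPT-2-XL by adding an activation vector: take the difference of residual-stream activations between two contrastive prompts and add it during generation (model, layer, and coefficient choices here are arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name, layer_idx, coeff = "gpt2", 6, 4.0  # arbitrary choices
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def resid_at_layer(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # hidden_states[layer_idx + 1] is the output of block layer_idx
        hidden = model(ids, output_hidden_states=True).hidden_states[layer_idx + 1]
    return hidden[0, -1]  # residual stream at the last token

# ActAdd-style steering vector: difference between contrastive prompts.
steer = coeff * (resid_at_layer("Love") - resid_at_layer("Hate"))

def add_steering(module, inputs, output):
    output[0][:, :, :] += steer  # add the vector at every position (simplification)
    return output

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
ids = tok("I think you are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```

The point is that each step (prompt pair choice, layer sweep, coefficient sweep, evaluating the steered generations) is short-horizon and standardized enough that an agent like MAIA could plausibly run the whole loop.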
Decomposability seems like a fundamental assumption for interpretability and condition for it to succeed. E.g. from Toy Models of Superposition:
’Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the value of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.) [...]
The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter two (non-superposition and basis-aligned) are properties we believe only sometimes occur.′
If this assumption is true, it seems favorable to the prospects of safely automating large parts of interpretability work (e.g. using [V]LM agents like MAIA) differentially sooner than many other alignment research subareas and than the most consequential capabilities research (and likely before AIs become x-risky, e.g. before they pass many dangerous capabilities evals). For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be much shorter horizon than e.g. coming up with a new conceptual alignment agenda or coming up with a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).
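As a toy version of that feature-search task (everything hypothetical: a learned feature dictionary, e.g. SAE decoder directions, ranked by cosine similarity against a target semantic direction):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 10_000, 512

# Stand-in for a learned feature dictionary (e.g. SAE decoder weights).
feature_dirs = rng.normal(size=(n_features, d_model))
feature_dirs /= np.linalg.norm(feature_dirs, axis=1, keepdims=True)

# Stand-in for a semantic direction (e.g. derived from contrastive prompts).
target = rng.normal(size=d_model)
target /= np.linalg.norm(target)

sims = feature_dirs @ target                    # cosine similarities
top = np.argsort(-sims)[:5]                     # best-matching features
for i in top:
    print(f"feature {i}: cosine similarity {sims[i]:.3f}")
```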
For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be much shorter horizon than e.g. coming up with a new conceptual alignment agenda or coming up with a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).
Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not assessed on long-horizon tasks. We analysed failed attempts to understand the major impediments to success. On short-horizon tasks, models often made small errors (like syntax errors in code). On longer horizon tasks, models devised good initial plans but did not sufficiently test their solutions or failed to correct initial mistakes. Models also sometimes hallucinated constraints or the successful completion of subtasks.
Summary: We found that leading models could solve some short-horizon tasks, such as software engineering problems. However, no current models were able to tackle long-horizon tasks.
E.g. to the degree that typical probing / activation steering work often involves short, ~1-hour horizons, it might be automatable differentially soon; e.g. from Steering GPT-2-XL by adding an activation vector:
For example, we couldn’t find a “talk in French” steering vector within an hour of manual effort.
You’ll often be able to use the model to advance lots of other forms of science and technology, including alignment-relevant science and technology, where experiment and human understanding (+ the assistance of available AI tools) is enough to check the advances in question.
Interpretability research seems like a strong and helpful candidate here, since many aspects of interpretability research seem like they involve relatively tight, experimentally-checkable feedback loops. (See e.g. Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie.)
Quote from Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie:
A quite early example of this is Collin Burns’s work, doing unsupervised identification of some aspects of a neural network that are correlated with things being true or false. I think that is important work. It’s a kind of obvious direction for the stuff to go. You can keep improving it when you have AIs that you’re training to do their best to deceive humans or other audiences in the face of the thing and you can measure whether our lie detectors break down. When we train our AIs to tell us the sky is green in the face of the lie detector and we keep using gradient descent on them, do they eventually succeed? That’s really valuable information to know because then we’ll know our existing lie detecting systems are not actually going to work on the AI takeover and that can allow government and regulatory response to hold things back. It can help redirect the scientific effort to create lie detectors that are robust and that can’t just be immediately evolved around and we can then get more assistance. Basically the incredibly juicy ability that we have working with the AIs is that we can have as an invaluable outcome that we can see and tell whether they got a fast one past us on an identifiable situation. Here’s an air gap computer, you get control of the keyboard, you can input commands, can you root the environment and make a blue banana appear on the screen? Even if we train the AI to do that and it succeeds, we see the blue banana, we know it worked. Even if we did not understand and would not have detected the particular exploit that it used to do it. This can give us rich empirical feedback where we’re able to identify things that are even an AI using its best efforts to get past our interpretability methods, using its best efforts to get past our adversarial examples.
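For concreteness, a minimal sketch of the probe-based ‘lie detector’ setup Shulman alludes to (a linear probe on hidden activations; synthetic activations with a planted ‘lying direction’ stand in for a real model’s):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 128

# Stand-in activations: honest vs. deceptive statements differ along
# a hidden 'lying direction' plus noise.
lie_dir = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)             # 1 = model is lying
acts = rng.normal(size=(n, d)) + np.outer(labels, lie_dir)

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out lie-detection accuracy: {probe.score(X_te, y_te):.2f}")

# The Shulman-style stress test would then retrain the *model* against this
# probe and check whether held-out accuracy degrades (requires a real model).
```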
’Could we do concept algebra directly on the fMRI of the higher visual cortex? Yes, in principle, it should be possible to perform concept algebra directly on the fMRI signals from the higher visual cortex. The key idea would be to identify subspaces in the fMRI signal space that correspond to specific semantic concepts, and then manipulate these subspaces to control the reconstructed image. Here’s a hypothetical workflow:
Collect fMRI data while the subject views a large set of images that vary along certain semantic dimensions (e.g., gender, emotion, object category, scene type, etc.).
For each semantic dimension, identify a set of fMRI samples that differ only along that dimension. For example, for the gender dimension, you might compare the fMRI signals for “man” vs “woman”, keeping other aspects of the image content constant.
Use these samples to estimate a “semantic subspace” for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.
To reconstruct an image, first predict the full fMRI signal from the early visual cortex (similar to predicting z in Takagi et al.).
Then, to manipulate a specific semantic concept, project the predicted fMRI signal onto the orthogonal complement of the corresponding semantic subspace (to remove the original semantic information), and add the desired semantic component from a reference fMRI signal (similar to the concept algebra manipulation of z_c).
Finally, use this modified fMRI signal from the higher visual cortex to condition the diffusion model, reconstructing an image with the desired semantic manipulation.
The main challenges would be:
Collecting a sufficiently large and diverse fMRI dataset to robustly estimate the semantic subspaces.
Ensuring that the estimated subspaces are truly orthogonal and capture the desired semantic variation.
Translating the manipulated fMRI signals into a format that can effectively condition the diffusion model.
Nonetheless, this is an intriguing direction that could potentially unite the biological understanding from Takagi et al.’s work with the interpretable semantic manipulation of concept algebra. It suggests a way to manipulate images directly from brain-space, without the need for text prompts. This could provide insights into how the brain itself represents and manipulates semantic concepts.’
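A minimal sketch of the subspace-estimation and projection steps of the hypothetical workflow above (PCA over paired differences to estimate a semantic subspace, then projection onto its orthogonal complement; all data synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_pairs, d = 200, 1000  # d = number of fMRI voxels (stand-in)

# Paired signals differing only along one 2-dim concept subspace (stand-in).
base = rng.normal(size=(n_pairs, d))
concept_dirs = rng.normal(size=(2, d))
signals_a = base + rng.normal(size=(n_pairs, 2)) @ concept_dirs
signals_b = base - rng.normal(size=(n_pairs, 2)) @ concept_dirs

# Estimate the semantic subspace from the paired differences.
pca = PCA(n_components=2).fit(signals_a - signals_b)
U = pca.components_                             # orthonormal rows span the subspace

# Project a signal onto the orthogonal complement (remove the concept)...
x = signals_a[0]
x_removed = x - U.T @ (U @ x)
# ...then add the concept component from a reference signal.
ref = signals_b[0]
x_edited = x_removed + U.T @ (U @ ref)
```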
’How might the above be relevant to the AI alignment problem? The hypothetical approach of performing concept algebra directly on fMRI signals from the higher visual cortex could potentially offer some interesting insights relevant to the AI alignment problem:
Understanding AI systems: By comparing the semantic subspaces identified in fMRI signals with those found in the latent spaces of AI models (like the diffusion model in concept algebra), we could gain insights into how AI systems represent and manipulate semantic concepts, and how this compares to the human brain. This could help in designing AI systems that reason in ways more analogous to humans, which could be beneficial for alignment.
Interpretability and control: The ability to identify and manipulate specific semantic concepts directly in the space of brain activity could provide a more direct and interpretable way of controlling the output of AI systems. If we can align the semantic subspaces of the AI with those of the human brain, it might become possible to control the AI’s outputs by manipulating these subspaces, leading to more interpretable and controllable AI systems.
Value alignment: By understanding how human values and preferences are represented in the brain, and how they relate to the semantic concepts identified through concept algebra, we might gain insights into how to align AI systems with human values. If we can identify the neural subspaces corresponding to human values and ethics, we could potentially use this information to constrain the AI’s outputs to align with these values.
Cognitive modeling: This approach could also contribute to the development of more accurate cognitive models of human reasoning and decision-making. By understanding how semantic concepts are represented and manipulated in the brain, we can build AI systems that more closely mimic human cognition, which could be useful for predicting and understanding human behavior in the context of AI alignment.
Neuro-symbolic AI: The idea of manipulating semantic concepts directly in the space of brain activity is reminiscent of neuro-symbolic AI approaches, which aim to combine the strengths of neural networks and symbolic reasoning. Insights from this fMRI-based concept algebra could potentially inform the development of neuro-symbolic AI systems that are more interpretable and aligned with human cognition.
However, it’s important to note that these are speculative connections, and there would be significant technical and ethical challenges in realizing this approach. Nonetheless, exploring the parallels between the structure of human thought and the structure of AI representations could be a valuable direction for informing AI alignment research.′
‘How would this be relevant in the context of this proposal? [note: the proposal being Activation vector steering with BCI]
The proposed concept algebra on fMRI signals could be highly relevant to the goals outlined in this funding proposal. Specifically:
The proposal aims to find mappings between human brain states and the latent representations of language models. Applying concept algebra directly on fMRI signals, as discussed, could provide a more direct and interpretable way to establish these mappings. By identifying semantic subspaces in the fMRI signals and aligning them with subspaces in the language model’s latent space, we could potentially “steer” the language model to exhibit brain states similar to specific people.
The proposal suggests that reward models could be trained to reproduce humanlike brain states during evaluation, to improve their generalization. If concept algebra on fMRI can successfully manipulate semantic concepts in the reconstructed outputs (e.g., text or images), this could provide a way to directly optimize the reward models to produce outputs that match desired brain states.
The proposal aims to advance scientific understanding of analogies between LLM behavior and human behavior. Comparing the semantic subspaces identified in fMRI with those in LLMs, as enabled by concept algebra, could provide insights into these analogies at a more fundamental, representational level.
The proposed fMRI experiments could potentially provide the necessary data for attempting concept algebra on brain signals. The proposal includes plans for designing and implementing fMRI data-collection protocols, recruiting subjects, and administering the experiments. This data could be used to estimate the semantic subspaces needed for concept algebra.
The ML experiments proposed, including the fMRI feature extraction pipeline and architecture modifications, could potentially be adapted to implement concept algebra on the fMRI signals, in addition to the original goal of mapping brain states to LLM representations.
In summary, while the original proposal does not explicitly mention concept algebra, the ideas discussed around applying concept algebra to fMRI signals could potentially enhance and extend the proposal’s goals of connecting human brain states with AI representations, improving reward model generalization, and advancing scientific understanding of human-AI analogies. The proposed fMRI experiments could provide the necessary data, and the planned ML experiments could be extended to include concept algebra on the brain signals.′
’Above you wrote: ‘Use these samples to estimate a “semantic subspace” for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.’ What if instead I just took a large diverse dataset of fMRI of images of men and fMRI of images of women, took the means of the two, subtracted them, and used that as the sex direction in concept algebra?
Your proposed approach of taking the difference between the mean fMRI signals for men and women to define the “sex direction” is a great idea and could indeed work as a simple alternative to more complex methods like PCA or LDA. Here’s why this approach makes sense:
If the fMRI signals for men and women differ primarily along a single direction in the high-dimensional fMRI space, then the difference between their means will point along this direction.
By using a large and diverse dataset, you’re essentially averaging out other sources of variation (e.g., individual differences, noise), leaving only the consistent difference between men and women.
This is analogous to how concept algebra identifies concept directions in the latent space of the diffusion model: by taking the difference between the latent representations of prompts that differ only in the target concept.
To implement this:
Collect a large dataset of fMRI signals for images of men and women, ensuring diversity in other aspects (age, race, emotion, background, etc.).
Compute the mean fMRI signal for men (m_men) and women (m_women) separately.
Compute the “sex direction” as: v_sex = m_women - m_men.
To manipulate the sex of a reconstructed image, you could then project the fMRI signal onto the orthogonal complement of v_sex (to remove the original sex information), and then add or subtract a scaled version of v_sex to introduce the desired sex information.
The main advantages of this approach over PCA or LDA are simplicity and interpretability. It doesn’t require any complex eigendecompositions or model fitting, and the resulting direction has a clear interpretation as the difference between men and women. Potential limitations to keep in mind:
If there are multiple independent directions along which men and women differ, this approach will only capture their average, not the individual directions.
If there are significant interactions between sex and other factors (e.g., if the sex difference is itself different for different ages or emotions), this simple difference may not fully capture the sex effect.
The approach assumes linearity—that the sex effect is a simple additive shift in fMRI space. If there are nonlinear interactions, this linear shift may not fully capture the effect.
Despite these potential limitations, I think this is a promising and elegantly simple approach to try for implementing concept algebra directly on fMRI signals. It’s a great starting point that could be refined with more sophisticated methods if needed.′
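The mean-difference variant really is only a few lines (continuing with synthetic stand-ins for the hypothetical fMRI datasets):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the two fMRI datasets (n_images x n_voxels).
fmri_men = rng.normal(size=(300, 1000))
fmri_women = rng.normal(loc=0.1, size=(300, 1000))

# The 'sex direction' as the difference of the two means.
v_sex = fmri_women.mean(axis=0) - fmri_men.mean(axis=0)
v_hat = v_sex / np.linalg.norm(v_sex)

x = fmri_men[0]
x_neutral = x - (x @ v_hat) * v_hat      # project out the sex direction
alpha = 1.0                              # hypothetical edit strength
x_edited = x_neutral + alpha * v_hat     # add the desired sex component
```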
Also a positive update for me on interdisciplinary conceptual alignment being automatable differentially soon; this has seemed plausible to me for a long time, since LLMs have ‘read the whole internet’ and interdisciplinary insights often seem (to me) to require relatively few inferential hops (plausibly because it’s hard for humans to have [especially deep] expertise in many different domains), making them potentially feasible for LLMs differentially early (reliably making long inferential chains still seems among the harder things for LLMs).
‘While the labs certainly perform much valuable alignment research, and definitely contribute a disproportionate amount per-capita, they cannot realistically hope to compete with the thousands of hobbyists and PhD students tinkering and trying to improve and control models. This disparity will only grow larger as more and more people enter the field while the labs are growing at a much slower rate. Stopping open-source ‘proliferation’ effectively amounts to a unilateral disarmament of alignment while ploughing ahead with capabilities at full-steam.
Thus, until the point at which open source models are directly pushing the capabilities frontier themselves then I consider it extremely unlikely that releasing and working on these models is net-negative for humanity’
‘Much capabilities work involves simply gathering datasets or testing architectures where it is easy to utilize other closed models referenced in papers or through tacit knowledge of employees. Additionally, simple API access to models is often sufficient to build most AI-powered products rather than direct access to model internals. Conversely, such access is usually required for alignment research. All interpretability requires access to model internals almost by definition. Most of the AI control and alignment techniques we have invented require access to weights for finetuning or activations for runtime edits. Almost nothing can be done to align a model through access to the I/O API of a model at all. Thus it seems likely to me that by restricting open-source we differentially cripple alignment rather than capabilities. Alignment research is more fragile and dependent on deep access to models than capabilities research.’
Current open source models are not themselves any kind of problem. Their availability accelerates timelines, helps with alignment along the way. If there is no moratorium, this might be net positive. If there is a moratorium, it’s certainly net positive, as it’s the kind of research that the moratorium is buying time for, and it doesn’t shorten timelines because they are guarded by the moratorium.
It’s still irreversible proliferation even when the impact is positive. The main issue is open source as an ideology that unconditionally calls for publishing all the things, and refuses to acknowledge the very unusual situations where not publishing things is better than publishing things.
More reasons to believe that studying empathy in rats (which should be much easier than in humans, both for e.g. IRB reasons and because smaller brains make it easier to get whole connectomes, etc.) could generalize to how it works in humans and help with validating / implementing it in AIs (I’d bet one can already find something like computational correlates in e.g. GPT-4, and the correlation will get larger with scale a la https://arxiv.org/abs/2305.11863) https://twitter.com/e_knapska/status/1722194325914964036
And there are theoretical results suggesting some latent factors can be identified using multimodality (all of the following could be interpretable as different modalities—multiple brain recording modalities, animals, sessions, species, brains-ANNs), while being provably unidentifiable without the multiple modalities—e.g. Identifiability Results for Multimodal Contrastive Learning (and results on nonlinear ICA in single-modal vs. multi-modal settings reviewed in section 2.1). This might be a way to bypass single-model interpretability difficulties, e.g. by ‘comparing’ to brains or to other models.
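As a concrete (much weaker, linear) cousin of these identifiability results, one can already look for shared latent directions across two ‘modalities’ with plain CCA; here’s a minimal sketch with synthetic stand-ins for, say, brain recordings and ANN activations:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n, d_shared = 500, 3

# Shared latent factors observed through two different linear 'modalities'.
z = rng.normal(size=(n, d_shared))
brain = z @ rng.normal(size=(d_shared, 64)) + 0.1 * rng.normal(size=(n, 64))
ann = z @ rng.normal(size=(d_shared, 128)) + 0.1 * rng.normal(size=(n, 128))

cca = CCA(n_components=d_shared).fit(brain, ann)
b_c, a_c = cca.transform(brain, ann)
corrs = [np.corrcoef(b_c[:, i], a_c[:, i])[0, 1] for i in range(d_shared)]
print("canonical correlations:", np.round(corrs, 2))
```

The cited identifiability results concern much stronger, nonlinear settings, but the basic move of aligning two views to recover shared structure is the same.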
(As reply to Zvi’s ‘If someone was founding a new AI notkilleveryoneism research organization, what is the best research agenda they should look into pursuing right now?’)
LLMs seem to represent meaning in a pretty human-like way and this seems likely to keep getting better as they get scaled up, e.g. https://arxiv.org/abs/2305.11863. This could make getting them to follow the commonsense meaning of instructions much easier. Also, similar methodologies to https://arxiv.org/abs/2305.11863 could be applied to other alignment-adjacent domains/tasks, e.g. moral reasoning, prosociality, etc.
RSPs for automated AI safety R&D require rethinking RSPs
AFAICT, all current RSPs are only framed negatively, in terms of [prerequisites to] dangerous capabilities to be detected (early) and mitigated.
In contrast, RSPs for automated AI safety R&D will likely require measuring [prerequisites to] capabilities for automating [parts of] AI safety R&D, and preferentially (safely) pushing these forward. An early such example might be safely automating some parts of mechanistic interpretability.
I expect large parts of interpretability work could be safely automatable very soon (e.g. GPT-5 timelines) using (V)LM agents; see A Multimodal Automated Interpretability Agent for a prototype.
Given the potential scalability of automated interp, I’d be excited to see plans to use large amounts of compute on it (including e.g. explicit integrations with agendas like superalignment or control; for example, given non-dangerous-capabilities, MAIA seems framable as a ‘trusted’ model in control terminology).
Hey Bogdan, I’d be interested in doing a project on this or at least putting together a proposal we can share to get funding.
I’ve been brainstorming new directions (with @Quintin Pope) this past week, and we think it would be good to use/develop some automated interpretability techniques we can then apply to a set of model interventions to see if there are techniques we can use to improve model interpretability (e.g. L1 regularization).
I saw the MAIA paper, too; I’d like to look into it some more.
Anyway, here’s a related blurb I wrote:
Project: Regularization Techniques for Enhancing Interpretability and Editability
Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how well the different regularization techniques impact the model.
In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hidden away the superposition in other parts of the network, making SoLU unhelpful for making the models more interpretable.
That said, we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.
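For concreteness, a minimal sketch of the L1-activation-penalty idea (toy model and arbitrary penalty weight; nothing here is specific to any of the papers mentioned):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_weight = 1e-3                                # arbitrary penalty strength

x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))  # stand-in batch
hidden = model[1](model[0](x))                  # post-ReLU activations
logits = model[2](hidden)

# Task loss plus an L1 penalty pushing activations toward sparsity,
# hopefully making the learned features easier to interpret.
loss = nn.functional.cross_entropy(logits, y) + l1_weight * hidden.abs().mean()
loss.backward()
opt.step()
```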
Whether this works or not, I’d be interested in making more progress on automated interpretability, in the similar ways you are proposing.
@the gears to ascension I see you reacted “10%” to the phrase “while (overwhelmingly likely) being non-scheming” in the context of the GPT-4V-based MAIA.
Does that mean you think there’s a 90% chance that MAIA, as implemented, today is actually scheming? If so that seems like a very bold prediction, and I’d be very interested to know why you predict that. Or am I misunderstanding what you mean by that react?
ah, I got distracted before posting the comment I was intending to: yes, I think GPT4V is significantly scheming-on-behalf-of-openai, as a result of RLHF according to principles that more or less explicitly want a scheming AI; in other words, it’s not an alignment failure to openai, but openai is not aligned with human flourishing in the long term, and GPT4 isn’t either. I expect GPT4 to censor concepts that are relevant to detecting this somewhat. Probably not enough to totally fail to detect traces of it, but enough that it’ll look defensible, when a fair analysis would reveal it isn’t.
It seems to me like the sort of interpretability work you’re pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product as opposed to being bottlenecked on not having enough scale.
So, insofar as this automation will help people iterate faster fair enough, but otherwise, I don’t really see this as the bottleneck.
Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of things.
This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in ‘Catching AIs red-handed’), but I think not using any internals might be overly conservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).
This seems differentially good for safety (e.g. vs. models with larger forward passes but shorter context windows to achieve the same perplexity), since longer context and in-context learning are differentially transparent.
A brief list of resources with theoretical results which seem to imply RL is much more difficult (e.g. sample-efficiency-wise) than IL—imitation learning (I don’t feel like I have enough theoretical RL expertise or time to scrutinize the arguments closely, but would love for others to pitch in). Probably at least somewhat relevant w.r.t. discussions of what the first AIs capable of obsoleting humanity could look like:
Paper: Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? (quote: ‘This work shows that, from the statistical viewpoint, the situation is far subtler than suggested by the more traditional approximation viewpoint, where the requirements on the representation that suffice for sample efficient RL are even more stringent. Our main results provide sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation), where we focus on natural representational conditions relevant to value-based, model-based, and policy-based learning. These lower bounds highlight that having a good (value-based, model-based, or policy-based) representation in and of itself is insufficient for efficient reinforcement learning, unless the quality of this approximation passes certain hard thresholds. Furthermore, our lower bounds also imply exponential separations on the sample complexity between 1) value-based learning with perfect representation and value-based learning with a good-but-not-perfect representation, 2) value-based learning and policy-based learning, 3) policy-based learning and supervised learning and 4) reinforcement learning and imitation learning.’)
Probably at least somewhat relevant w.r.t. discussions of what the first AIs capable of obsoleting humanity could look like.
I’d bet against any of this providing interesting evidence beyond basic first principles arguments. These types of theory results never seem to add value on top of careful reasoning from my experience.
Hmm, unsure about this. E.g. the development models of many in the alignment community before GPT-3 (often heavily focused on RL or even on GOFAI) seem quite substantially worse in retrospect than those of some of the most famous deep learning people (e.g. LeCun’s cake); of course, this may be an unfair/biased comparison using hindsight. Unsure how much theory results were influencing the famous deep learners (and e.g. classic learning theory results would probably have been misleading), but doesn’t seem obvious they had 0 influence? For example, Bengio has multiple at least somewhat conceptual / theoretical (including review) papers motivating deep/representation learning; e.g. Representation Learning: A Review and New Perspectives.
I think Paul looks considerably better in retrospect than famous DL people IMO. (Partially via being somewhat more specific, though still not really making predictions.)
I’m skeptical hard theory had much influence on anyone though. (In this domain at least.)
Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity (‘We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be NP-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is P-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL—representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.’).
On Representation Complexity of Model-based and Model-free Reinforcement Learning (‘We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal Q-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory provides unique insights into why model-based algorithms usually enjoy better sample complexity than model-free algorithms from a novel representation complexity perspective: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as Q-function, appear complex. We empirically corroborate our theory by comparing the approximation error of the transition kernel, reward function, and optimal Q-function in various Mujoco environments, which demonstrates that the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal Q-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, which also provides a rigorous framework for future research.’).
Demonstration-Regularized RL (‘Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL’s sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using N^E expert demonstrations enables the identification of an optimal policy at a sample complexity of order O˜(Poly(S,A,H)/(ε^2 * N^E)) in finite and O˜(Poly(d,H)/(ε^2 * N^E)) in linear Markov decision processes, where ε is the target precision, H the horizon, A the number of actions, S the number of states in the finite case and d the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works.’).
Limitations of Agents Simulated by Predictive Models (‘There is increasing focus on adapting predictive models into agent-like systems, most notably AI assistants based on language models. We outline two structural reasons for why these models can fail when turned into agents. First, we discuss auto-suggestive delusions. Prior work has shown theoretically that models fail to imitate agents that generated the training data if the agents relied on hidden observations: the hidden observations act as confounding variables, and the models treat actions they generate as evidence for nonexistent observations. Second, we introduce and formally study a related, novel limitation: predictor-policy incoherence. When a model generates a sequence of actions, the model’s implicit prediction of the policy that generated those actions can serve as a confounding variable. The result is that models choose actions as if they expect future actions to be suboptimal, causing them to be overly conservative. We show that both of those failures are fixed by including a feedback loop from the environment, that is, re-training the models on their own actions. We give simple demonstrations of both limitations using Decision Transformers and confirm that empirical results agree with our conceptual and formal analysis. Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.’).
I’m not aware of anybody currently working on coming up with concrete automated AI safety R&D evals, while there seems to be so much work going into e.g. DC evals or even (more recently) scheminess evals. This seems very suboptimal in terms of portfolio allocation.
Thanks! AFAICT though, the link you posted seems about automated AI capabilities R&D evals, rather than about automated AI safety / alignment R&D evals (I do expect transfer between the two, but they don’t seem like the same thing). I’ve also chatted to some people from both METR and UK AISI and got the impression from all of them that there’s some focus on automated AI capabilities R&D evals, but not on safety.
Can you give a concrete example of a safety property of the sort that are you envisioning automated testing for? Or am I misunderstanding what you’re hoping to see?
‘Summarize ‘Causal Explainable Guardrails for Large Language Models’. In particular, could this be useful to deal with some of the challenges mentioned in ‘Discussion: Challenges with Unsupervised LLM Knowledge Discovery’?
The paper “A Causal Explainable Guardrails for Large Language Models” proposes LLMGuardaril, a framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. Key points: Existing methods for steering LLMs towards desired attributes rely on unbiased representations and steering prompts, but the representations learned from pre-training can introduce semantic biases that negatively impact steering. LLMGuardaril systematically identifies and blocks the confounding effects of biases, enabling extraction of unbiased steering representations. It also includes an explainable component providing insights into the alignment between generated output and desired direction. Experiments show LLMGuardaril can effectively steer LLMs towards desired attributes while mitigating biases.
Regarding the challenges mentioned in “Discussion: Challenges with Unsupervised LLM Knowledge Discovery”: One key challenge is distinguishing the model’s own knowledge from knowledge of simulated entities. LLMGuardaril aims to disentangle the influence of semantic biases to obtain steering representations capturing desired attributes, which may help identify the model’s own knowledge if those attributes are specific to the model itself. Another challenge is the hypothesized existence of many features satisfying consistency properties beyond just the model’s knowledge. LLMGuardaril’s causal analysis and adversarial learning aim to block confounding semantic biases, potentially reducing the set of candidate knowledge-like features. Empirically, contrast-consistent search seemed to find prominent arbitrary features rather than knowledge; LLMGuardaril’s explainable component could provide insight into whether extracted features actually correspond to model knowledge.
So in summary, while not a complete solution, LLMGuardaril’s approach of causally analyzing and debiasing semantic influences may help address some key challenges around identifying model knowledge in an unsupervised way. But significant open questions remain around definitively distinguishing model knowledge from simulated knowledge of other entities. Further research is needed to determine how much traction causal analysis provides on the hardest aspects of the ELK problem.′
‘How might MAIA be useful, potentially combined with ‘Causal Explainable Guardrails for Large Language Models’, to deal with some of the challenges mentioned in ‘Discussion: Challenges with Unsupervised LLM Knowledge Discovery’?
MAIA and the techniques from “Causal Explainable Guardrails for Large Language Models” (LLMGuardaril) could potentially be combined to help address some of the challenges mentioned in “Discussion: Challenges with Unsupervised LLM Knowledge Discovery”:
Distinguishing the model’s own knowledge from simulated knowledge: One key challenge raised is that consistency-based methods like contrast-consistent search (CCS) may not be able to distinguish an LLM’s “own” knowledge from the knowledge of simulated entities the model may be imagining. MAIA’s ability to automatically generate hypotheses and design experiments to test them could potentially be used to probe for features that are uniquely characteristic of the model’s own knowledge. For example, MAIA could generate synthetic inputs that vary the context in which a piece of knowledge is elicited (e.g. with/without an explicit simulation frame) and test whether the model’s responses are consistent.
Reducing the set of candidate knowledge-like features: Another challenge is that many features beyond the model’s own knowledge may satisfy consistency properties, making it hard to identify the “true” knowledge. LLMGuardaril’s technique of causally analyzing the influence of semantic biases on steering representations and using adversarial learning to debias them could potentially be applied by MAIA to narrow down the set of features that robustly encode model knowledge across prompts/contexts. The explainable component of LLMGuardaril could also help MAIA validate whether candidate knowledge features behave as expected.
Overcoming the limitations of one-shot probing: Existing automated interpretability approaches are noted to primarily be tools for one-shot hypothesis generation on a limited set of inputs, rather than scalable and fully rigorous causal explanation. MAIA’s ability to iteratively refine hypotheses based on experimental outcomes could lead to more robust explanations. Integrating the causal analysis techniques from LLMGuardaril into this iterative process could help rule out spurious hypotheses.
Evaluating unsupervised knowledge discovery methods: The paper emphasizes the need for principled testbeds to evaluate the effectiveness of unsupervised knowledge discovery techniques. The synthetic test systems used to quantitatively evaluate MAIA’s performance when ground-truth structure is known could potentially be extended to benchmark a wider variety of unsupervised interpretability approaches.
However, significant challenges would likely need to be overcome to fruitfully combine these techniques. For example, MAIA is currently focused on interpreting components of vision models, while LLMGuardaril focuses on steering language models, so unifying them would require bridging this modality gap. More fundamentally, even with access to MAIA’s automated experimentation and LLMGuardaril’s causal analysis, rigorously identifying an LLM’s “own” knowledge in a fully unsupervised way, as opposed to the knowledge of a specific simulated entity, remains an open challenge and likely requires further conceptual breakthroughs. Nonetheless, exploring the intersection of scalable interpretability agents like MAIA and causally grounded analysis techniques like LLMGuardaril seems like a promising direction for future work on the ELK problem.′
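As a minimal sketch of the context-variation probe from the first point above: vary the framing of a question and measure how stable the answer is across frames. Everything here is illustrative; `query_model` is a hypothetical stand-in for whatever LLM API one has access to.

```python
def query_model(prompt: str) -> str:
    # Hypothetical stand-in: plug in an actual LLM API call here.
    raise NotImplementedError("plug in an LLM API here")

FRAMES = [
    "{q}",                                              # no frame
    "You are role-playing a medieval alchemist. {q}",   # explicit simulation frame
    "Answer as a strict, literal fact-checker. {q}",    # persona frame
]

def frame_consistency(question: str) -> float:
    """Fraction of frame pairs that yield the same answer; low scores flag
    knowledge that shifts with the simulation frame (possibly a simulated
    entity's knowledge rather than the model's 'own')."""
    answers = [query_model(f.format(q=question)).strip().lower() for f in FRAMES]
    pairs = [(answers[i], answers[j])
             for i in range(len(answers)) for j in range(i + 1, len(answers))]
    return sum(a == b for a, b in pairs) / len(pairs)
```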
‘How could the above be combined with ‘Catching AIs red-handed’ and, in particular, the ideas mentioned in paulfchristiano’s comment?
The approaches discussed in “A Multimodal Automated Interpretability Agent” (MAIA) and “Causal Explainable Guardrails for Large Language Models” (LLMGuardaril) could potentially be combined with the ideas in “Catching AIs red-handed” and paulfchristiano’s comment to create a more comprehensive strategy for detecting and mitigating risks from deceptively aligned AI systems:
Using MAIA to automatically generate and test hypotheses about deceptive behavior: As discussed in the “Catching AIs red-handed” post, once a suspicious action has been flagged, we can use techniques like varying the AI’s context and beliefs to test whether the behavior is truly deceptive. MAIA’s automated experimentation capabilities could potentially be used to scale up this kind of hypothesis generation and testing, systematically exploring the “landscape of deceptive behavior” as paulfchristiano puts it.
Applying causal analysis from LLMGuardaril to rule out alternative explanations: A key challenge raised in the post and comments is distinguishing true deception from an AI system just “kind of unlucky in how its context shifted it into a more deceptive role-play”. The causal analysis techniques from LLMGuardaril, which aim to disentangle the influence of semantic context on an AI’s behavior, could potentially be applied by MAIA to more confidently establish whether a pattern of suspicious behavior is caused by intentional deception vs. an unlucky context.
Using MAIA’s explainable component to interpret the AI’s reasoning: In paulfchristiano’s examples of analyzing a suspicious code review or context-dependent behavior, a critical step is understanding what information the AI is using to make its decisions. MAIA’s explainable component, which measures the alignment between an AI’s outputs and experimenter-specified directions, could potentially shed light on the reasoning behind a deceptive action.
Scaling up detection of hard-to-elicit capabilities: “Catching AIs red-handed” notes that real-world deceptive behavior could reveal capabilities that are hard to measure during normal testing. MAIA’s ability to automatically generate diverse test inputs based on a small number of real examples could help systematically probe for these hidden capabilities, reducing the AI’s “space of interventions...where they are more and more confident that it’s never going to be noticed.”
Iterative refinement of monitoring systems: Finally, as an AI system’s deceptive behavior is progressively uncovered through a combination of MAIA’s experimentation and LLMGuardaril’s causal analysis, the resulting dataset of confirmed deceptive actions could be used to fine-tune untrusted monitoring systems to more reliably catch future deception attempts, similar to the “few-shot catastrophe prevention” techniques outlined in the post.
Significant open questions remain about the feasibility of scaling these approaches to be robust against highly capable AI systems. There are also important limitations to consider: for example, MAIA’s causal analysis is unlikely to provide a fully conclusive “smoking gun” that an AI system is deceptive, as opposed to strong probabilistic evidence. And sufficiently capable systems may find ways to avoid exhibiting any detectable deceptive behavior under the kind of systematic experimentation enabled by MAIA.
Nonetheless, I believe the combination of automated interpretability agents, causally grounded analysis of AI reasoning, and the overall adversarial “red team” mindset advocated in “Catching AIs red-handed” represents a promising direction for building more robust monitoring and containment of potentially deceptive AI systems. A concerted effort to develop and integrate these approaches could significantly improve our odds of detecting and mitigating catastrophic deceptive alignment failures.′
’Any ways to combine all the previously mentioned ideas (A Causal Explainable Guardrails for Large Language Models, Discussion: Challenges with Unsupervised LLM Knowledge Discovery and A Multimodal Automated Interpretability Agent (MAIA), ‘Catching AIs red-handed’, including paulfchristiano’s comment) with ideas from ‘Searching for a model’s concepts by their shape – a theoretical framework’ to ‘catch AIs red-handed’?
Here are a few additional ideas for combining all the mentioned papers and comments to try catching deceptive models red-handed:
Using the “deliberate adversarial experiments” mentioned in paulfchristiano’s comment, like varying the model’s beliefs about how closely it is being monitored, to surface cases where the model’s behavior is suspiciously sensitive to being watched. The theoretical framework from “Searching for concepts by their shape” could be used to formalize the hypothesis that the model has a hidden “deception mode” that gets triggered when it believes oversight is lax. MAIA’s automated experimentation tools could be used to efficiently search for inputs exhibiting this sensitivity.
Comparing the features extracted by the interpretability tools from “Searching for concepts by their shape” across different copies of the model that have been adversarially perturbed to have different beliefs (per paulfchristiano’s comment). If there are decision-relevant features (e.g. representations of the model’s objectives) that are suspiciously different in the copy that believes it is unobserved, that would be strong evidence of deceptive cognition. The causal analysis from “Causal Explainable Guardrails” could help determine whether those differing features are causally linked to the differing outputs.
Training MAIA’s detectors for suspicious model outputs not just on straightforwardly deceptive behaviors, but on subtler signs of agency-like cognition hypothesized in paulfchristiano’s comment, e.g. the model giving worse answers when it believes it is unobserved. This could yield more robust detection that is harder for the model to circumvent by avoiding overt deception.
If interpretability tools manage to extract a model’s representations of its beliefs, objectives, plans, etc. (a big if!), use the techniques from “Catching AIs red-handed” to examine how those representations differ on the inputs that were “successfully” deceptive (i.e. where the model behaved deceptively but was not caught). There may be learnable patterns in how the model’s cognition differs in those cases which could inspire hypotheses about what relations characterize “successful” deception. Those hypotheses could then be formalized and tested using the framework from “Searching for concepts by their shape”.
A point emphasized in paulfchristiano’s comment is the importance of extreme thoroughness and breadth in searching for deceptive cognition, e.g. (to paraphrase) “even if a model exhibits some deceptive behavior, don’t assume you’ve found all of its tricks; keep probing its behavior in a huge variety of contexts”. MAIA and other automated testing tools should be put in service of this kind of breadth. Rather than just focus on the most egregious cases of deception, we should test the model’s behavior on a vast range of subtle perturbations hunting for any whiff of situational inconsistency. The theoretical frameworks for agency from “Searching for concepts by their shape” can guide us in knowing what to look for.
Extending my previous thoughts, I think one of the most useful things the “Searching for concepts by their shape” framing can contribute is mathematical precision in defining hypotheses about what characterizes deceptive cognition. If we can frame agency/deception in terms of precise relations between features (even if those features are hard to interpretably extract in practice), it gives us a crisper target to aim at with our testing, rather than just relying on intuitive notions of what “seems deceptive”.
However, I continue to be quite pessimistic about the prospects for catching competent deceptive models before it’s too late, for the reasons outlined previously. I worry that by the time a model is smart enough that we could meaningfully apply these combined interpretability strategies to it, it would also be smart enough to notice what we’re doing and come up with countermeasures.
Unless interpretability tools far exceed my expectations in their ability to crack open the black box of neural nets (which would be great!), I suspect the most practical path is to use simpler, more behavioral detection strategies (like the model consistency testing from your Control paper) to avoid building models that are competent enough to be robustly deceptive in the first place. Catching a superintelligent deceptive model after the fact seems like a much harder challenge than noticing warning signs of misalignment in weaker models and heading them off early.
But I would still encourage pursuing all these ideas, even if they end up not being enough on their own to solve the problem. The interpretability insights gained would be valuable for alignment more broadly. And we may get lucky in extracting some key structural signature of deceptive cognition that even very sophisticated models can’t fully mask. It’s an arms race we can’t afford to unilaterally forfeit.′
I wonder how much near-term interpretability [V]LM agents (e.g. MAIA, AIA) might help with finding better probes and better steering vectors (e.g. by iteratively testing counterfactual hypotheses against potentially spurious features, a major challenge for Contrast-consistent search (CCS)).
This seems plausible since MAIA can already find spurious features, and feature interpretability [V]LM agents could have much lengthier hypothesis-iteration cycles (compared to current [V]LM agents and perhaps even to human researchers).
I might have updated at least a bit against the weakness of single forward passes, based on intuitions about the amount of compute that huge context windows (e.g. Gemini 1.5’s 1 million tokens) might provide to a single forward pass, even if limited serially.
Somewhat relatedly: I’m interested in how well LLMs can solve tasks in parallel. This seems very important to me.[1]
The “I’ve thought about this for 2 minutes” version is: Hand an LLM two multiple-choice questions with four answer options each. Encode each pair of answer options into a single token, so that there are 16 possible tokens, of which exactly one corresponds to the correct answer to both questions. A correct answer means that the model has solved both tasks in one forward pass.
(One can of course vary the number of answer options and questions. I can see some difficulties in implementing this idea properly, but would nevertheless be excited if someone took a shot at it; a rough sketch follows below.)
- For serial computation the number of layers gives some very rough indication of the strength of one forward-pass, but it’s harder to have intuitions for parallel computation.
- For scheming, the model could reason about “should I still stay undercover”, “what should I do in case I should stay undercover” and “what should I do in case it’s time to attack” in parallel, finally using only one serial step to decide on its action.
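A rough sketch of the setup above (again, `query_model` is a hypothetical stand-in for an LLM API; see the final comment for the stricter single-token variant):

```python
from itertools import product

def query_model(prompt: str) -> str:
    # Hypothetical stand-in: plug in an actual LLM API call here.
    raise NotImplementedError("plug in an LLM API here")

def joint_prompt(q1, opts1, q2, opts2):
    """Combine two 4-option questions into one question with 16 joint options."""
    labels = ["".join(p) for p in product("ABCD", repeat=2)]  # 'AA' .. 'DD'
    lines = [f"Q1: {q1}"] + [f"  {l}) {o}" for l, o in zip("ABCD", opts1)]
    lines += [f"Q2: {q2}"] + [f"  {l}) {o}" for l, o in zip("ABCD", opts2)]
    lines.append("Reply with exactly one two-letter label (e.g. 'AC') answering both.")
    return "\n".join(lines), labels

def solved_both_at_once(q1, opts1, a1, q2, opts2, a2) -> bool:
    """True iff the model answers both questions correctly in one response."""
    prompt, _ = joint_prompt(q1, opts1, q2, opts2)
    return query_model(prompt).strip().upper() == a1 + a2

# For a strict single-forward-pass test, the 16 labels should each be a single
# token in the model's tokenizer, and one would compare their logits on the
# first output position instead of sampling free-form text.
```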
There might be more such results I’m currently forgetting about, but they should be relatively easy to find by e.g. following citation trails (to and from the above references) with Google Scholar (or by looking at my recent comments / short forms).
I am also very interested in e.g. how one could operationalize the number of hops of inference of out-of-context reasoning required for various types of scheming, especially scheming in one-forward-pass; and especially in the context of automated AI safety R&D.
“We show that transformers can use meaningless filler tokens (e.g., ‘......’) in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge.”
I looked over it and I should note that “transformers are in TC0” is not a very useful statement for predicting capabilities. Transformers are Turing-complete given rational inputs (see the original paper), and their being in TC0 basically means they can implement whatever computation you can implement using a boolean circuit for a fixed amount of available compute, which amounts to “whatever computation is practical to implement”.
I think I remember William Merrill (in a video) pointing out that the rational inputs assumption seems very unrealistic (would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers made a different assumption about the number of bits of memory per parameter and per input.
Everything Turing-complete requires infinite memory. When we say “the x86 instruction set is Turing-complete” we imply “assuming the processor operates on infinite memory”. It’s in the definition of a Turing machine to include an infinite tape, after all.
It’s hard to pinpoint, but the trick is that there is a very nuanced difference between the sense in which transformers are limited complexity-theoretically and the claim “transformers can’t do X”. Like, there is nothing preventing transformers from playing chess perfectly—they just need to be sufficiently large for this. To answer the question “can transformers do X” you need to ask “how much computing power does the transformer have” and “can this computing power be shaped by SGD into a solution”.
Similarly, I find that GPT-3, GPT-3.5, and Claude 2 don’t benefit from filler tokens. However, GPT-4 (which Tamera didn’t study) shows mixed results with strong improvements on some tasks and no improvement on others.
It’s an interesting question whether Gemini shows any improvements.
Larger LMs seem to benefit differentially more from tools: ‘Absolute performance and improvement-per-turn (e.g., slope) scale with model size.’ https://xingyaoww.github.io/mint-bench/. This seems pretty good for safety, to the degree tool usage is often more transparent than model internals.
speculatively, it might also be fruitful to go about this the other way round, e.g. trying to come up with better weight-space task erasure methods by analogy with concept erasure methods (in activation space), via the task arithmetic / activation engineering link
Eliezer (among others in the MIRI mindspace) has this whole spiel about human kindness/sympathy/empathy/prosociality being contingent on specifics of the human evolutionary/cultural trajectory, e.g. https://twitter.com/ESYudkowsky/status/1660623336567889920 and about how gradient descent is supposed to be nothing like that https://twitter.com/ESYudkowsky/status/1660623900789862401. I claim that the same argument (about evolutionary/cultural contingencies) could be made about e.g. image aesthetics/affect, and this hypothesis should lose many Bayes points when we observe concrete empirical evidence of gradient descent leading to surprisingly human-like aesthetic perceptions/affect, e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Towards Disentangling the Roles of Vision & Language in Aesthetic Experience with Multimodal DNNs; Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data; Neural mechanisms underlying the hierarchical construction of perceived aesthetic value.
hmm. i think you’re missing eliezer’s point. the idea was never that AI would be unable to identify actions which humans consider good, but that the AI would not have any particular preference to take those actions.
But my point isn’t just that the AI is able to produce similar ratings to humans’ for aesthetics, etc., but that it also seems to do so through at least partially overlapping computational mechanisms to humans’, as the comparisons to fMRI data suggest.
I don’t think having a beauty-detector that works the same way humans’ beauty-detectors do implies that you care about beauty?
Agree that it doesn’t imply caring for. But I think given cumulating evidence for human-like representations of multiple non-motivational components of affect, one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too (see e.g. https://en.wikipedia.org/wiki/Affect_(psychology)#Motivational_intensity_and_cognitive_scope).
Even if Eliezer’s argument in that Twitter thread is completely worthless, it remains the case that “merely hoping” that the AI turns out nice is an insufficiently good argument for continuing to create smarter and smarter AIs. I would describe as “merely hoping” the argument that since humans (in some societies) turned out nice (even though there was no designer that ensured they would), the AI might turn out nice. Also insufficiently good is any hope stemming from the observation that if we pick two humans at random out of the humans we know, the smarter of the two is more likely than not to be the nicer of the two. I certainly do not want the survival of the human race to depend on either one of those two hopes or arguments! Do you?
Eliezer finds posting on the internet enjoyable, like lots of people do. He posts a lot about, e.g., superconductors and macroeconomic policy. It is far from clear to me that he considers this Twitter thread to be relevant to the case against continuing to create smarter AIs. But more to the point: do you consider it relevant?
Contra both the ‘doomers’ and the ‘optimists’ on (not) pausing. Rephrased: RSPs (done right) seem right.
Contra ‘doomers’. Oversimplified, ‘doomers’ (e.g. PauseAI, FLI’s letter, Eliezer) ask(ed) for pausing now / even earlier (e.g. the Pause Letter). I expect this would be / have been very much suboptimal, even purely in terms of solving technical alignment. For example, Some thoughts on automating alignment research suggests that timing the pause so that we can use automated AI safety research could result in ‘[...] each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.’ We clearly don’t have such automated AI safety R&D capabilities now, suggesting that pausing later, when AIs are closer to having the required automated AI safety R&D capabilities, would be better. At the same time, current models seem very unlikely to be x-risky (e.g. they’re still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.
Contra ‘optimists’. I’m more unsure here, but the vibe I’m getting from e.g. AI Pause Will Likely Backfire (Guest Post) is roughly something like ‘no pause ever’; largely based on arguments of current systems seeming easy to align / control. While I agree with the point that current systems do seem easy to align / control and I could even see this holding all the way up to ~human-level automated AI safety R&D, I can easily see scenarios where around that time things get scary quickly without any pause. For example, similar arguments to those about the scalability of automated AI safety R&D suggest automated AI capabilities R&D could also be scaled up significantly. For example, figures like those in Before smart AI, there will be many mediocre or specialized AIs suggest very large populations of ~human-level automated AI capabilities researchers could be deployed (e.g. 100x larger than the current [human] population of AI researchers). Given that even with the current relatively small population, algorithmic progress seems to double LM capabilities ~every 8 months, it seems like algorithmic progress could be much faster with 100x larger populations, potentially leading to new setups (e.g. new AI paradigms, new architectures, new optimizers, synthetic data, etc.) which could quite easily break the properties that make current systems seem relatively easy / safe to align. In this scenario, pausing to get this right (especially since automated AI safety R&D would also be feasible) seems like it could be crucial.
At least Eliezer has been extremely clear that he is in favor of a stop not a pause (indeed, that was like the headline of his article “Pausing AI Developments Isn’t Enough. We Need to Shut it All Down”), so I am confused why you list him with anything related to “pause”.
My guess is that Eliezer and I are both in favor of a pause, but mostly because a pause seems like it would slow down AGI progress, not because the next 6 months in particular will be the most risky period.
The relevant criterion is not whether the current models are likely to be x-risky (it’s obviously far too late if they are!), but whether the next generation of models have more than an insignificant chance of being x-risky together with all the future frameworks they’re likely to be embedded into.
Given that the next generations are planned to involve at least one order of magnitude more computing power in training (and are already in progress!) and that returns on scaling don’t seem to be slowing, I think the total chance of x-risk from those is not insignificant.
Like transformers, SSMs like Mamba also have weak single forward passes: The Illusion of State in State-Space Models (summary thread). As suggested previously in The Parallelism Tradeoff: Limitations of Log-Precision Transformers, this may be due to a fundamental tradeoff between parallelizability and expressivity:
‘We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics, or whether these different goals are fundamentally at odds, as Merrill & Sabharwal (2023a) suggest.’
Surely fundamentally at odds? You can’t spend a while thinking without spending a while thinking.
Of course, the lunch still might be very cheap by only spending a while thinking a fraction of the time or whatever.
On an apparent missing mood—FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely
Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post):
Despite this promise, we seem not to have much knowledge of when such automated AI safety R&D might happen, in which order the relevant capabilities might appear (including compared to various dangerous capabilities), etc., AFAICT. We don’t seem to be trying very hard either, neither at prediction nor at elicitation of such capabilities.
One potential explanation I heard for this apparent missing mood of FOMO on automated AI safety R&D is the idea / worry that the relevant automated AI safety R&D capabilities would only appear when models are already dangerous / too close to being dangerous. This seems plausible to me, but it would still seem like a strange way to proceed given the uncertainty.
For contrast, consider how one might go about evaluating dangerous capabilities and trying to anticipate them ahead of time. From GDM’s recent Introducing the Frontier Safety Framework:
I see no reason why, in principle, a similar high-level approach couldn’t be taken analogously for automated AI safety R&D capabilities, yet almost nobody seems to be working on this, at least publicly (in fact, I am currently working on related topics; I’m still very surprised by the overall neglectedness, made even more salient by current events).
My main vibe is:
AI R&D and AI safety R&D will almost surely come at the same time.
Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem to me like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. like current ones, very bad at DC evals).
This seems good w.r.t. automated AI safety potentially ‘piggybacking’, but bad for differential progress.
Sure, though wouldn’t this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time?
Why think this is important to measure or that this already isn’t happening?
E.g., on the current model organism related project I’m working on, I automate inspecting reasoning traces in various ways. But I don’t feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn’t more important than other tips for doing LLM research better).
Intuitively, I’m thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this ‘race’, corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to aligning / controlling superintelligence).
W.r.t. measurement, I think it would be good independently of whether auto AI safety R&D is already happening or not, similarly to how e.g. evals for automated ML R&D seem good even if automated ML R&D is already happening. In particular, the information of how successful auto AI safety R&D would be (and e.g. what the scaling curves look like vs. those for DCs) seems very strategically relevant to whether it might be feasible to deploy it at scale, when that might happen, with what risk tradeoffs, etc.
To get somewhat more concrete, the Frontier Safety Framework report already proposes a (somewhat vague) operationalization for ML R&D evals, which (vagueness-permitting) seems straightforward to translate into an operationalization for automated AI safety R&D evals:
Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws (https://arxiv.org/abs/2305.11863) + LM embeddings as model of shared linguistic space for transmitting thoughts during communication (https://www.biorxiv.org/content/10.1101/2023.06.27.546708v1.abstract) suggest outer alignment will be solved by default for LMs: we’ll be able to ‘transmit our thoughts’, including alignment-relevant concepts (and they’ll also be represented in a [partially overlapping] human-like way).
I think the Corrigibility agenda, framed as “do what I mean, such that I will probably approve of the consequences, not just what I literally say such that our interaction will likely harm my goals” is more doable than some have made it out to be. I still think that there are sufficient subtle gotchas there that it makes sense to treat it as an area for careful study rather than “solved by default, no need to worry”.
quick take: Against Almost Every Theory of Impact of Interpretability should be required reading for ~anyone starting in AI safety (e.g. it should be in the AGISF curriculum), especially if they’re considering any model internals work (and of course even more so if they’re specifically considering mech interp)
Language model agents for interpretability (e.g. MAIA, FIND) seem to be making fast progress, to the point where I expect it might be feasible to safely automate large parts of interpretability workflows soon.
Given the above, it might be high value to start testing integrating more interpretability tools into interpretability (V)LM agents like MAIA and maybe even considering randomized controlled trials to test for any productivity improvements they could already be providing.
For example, probing / activation steering workflows seem to me relatively short-horizon and at least somewhat standardized, to the point that I wouldn’t be surprised if MAIA could already automate very large chunks of that work (with proper tool integration). (Disclaimer: I haven’t done much probing / activation steering hands-on work [though I do follow such work quite closely and have supervised related projects], so my views here might be inaccurate).
While I’m not sure I can tell any ‘pivotal story’ about such automation, if I imagine e.g. 10x more research on probing and activation steering / year / researcher as a result of such automation, it still seems like it could be a huge win. Such work could also e.g. provide much more evidence (either towards or against) the linear representation hypothesis.
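For concreteness, here is the kind of primitive such an agent would be iterating on: a minimal activation-addition sketch in PyTorch. This assumes the hooked layer’s forward returns a plain tensor of shape (batch, seq, d_model); real transformer blocks often return tuples, which would need an extra unpacking step, and the `model.transformer.h[6]` path in the usage comment is just a GPT-2-style placeholder.

```python
import torch

def add_steering_vector(layer: torch.nn.Module, vector: torch.Tensor, alpha: float = 1.0):
    """Register a forward hook that adds alpha * vector to the layer's output.
    Assumes `vector` has shape (d_model,) and matches the activations' device/dtype."""
    def hook(_module, _inputs, output):
        return output + alpha * vector  # broadcasts over batch and sequence dims
    return layer.register_forward_hook(hook)  # returned handle can undo the edit

# Usage sketch (placeholder module path):
# handle = add_steering_vector(model.transformer.h[6], v_love_minus_hate, alpha=4.0)
# ... run the steered model ...
# handle.remove()  # restore the original behavior
```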
Decomposability seems like a fundamental assumption for interpretability and condition for it to succeed. E.g. from Toy Models of Superposition:
’Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the value of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.) [...]
The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter (non-superposition and basis-aligned) are properties we believe only sometimes occur.′
If this assumption is true, it seems favorable to the prospects of safely automating large parts of interpretability work (e.g. using [V]LM agents like MAIA) differentially sooner than many other alignment research subareas and than the most consequential capabilities research (and likely before AIs become x-risky, e.g. before they pass many dangerous capabilities evals). For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be much shorter horizon than e.g. coming up with a new conceptual alignment agenda or coming up with a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).
Some theoretical results might also be relevant here, e.g. Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks.
Related, from Advanced AI evaluations at AISI: May update:
E.g. to the degree typical probing / activation steering work might often involve short 1-hour-horizons, it might be automatable differentially soon; e.g. from Steering GPT-2-XL by adding an activation vector:
Related, from The “no sandbagging on checkable tasks” hypothesis:
Quote from Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie:
Selected fragments (though not really cherry-picked, no reruns) of a conversation with Claude Opus on operationalizing something like Activation vector steering with BCI by applying the methodology of Concept Algebra for (Score-Based) Text-Controlled Generative Models to the model from High-resolution image reconstruction with latent diffusion models from human brain activity (website with nice illustrations of the model).
My prompts bolded:
’Could we do concept algebra directly on the fMRI of the higher visual cortex?
Yes, in principle, it should be possible to perform concept algebra directly on the fMRI signals from the higher visual cortex. The key idea would be to identify subspaces in the fMRI signal space that correspond to specific semantic concepts, and then manipulate these subspaces to control the reconstructed image.
Here’s a hypothetical workflow:
Collect fMRI data while the subject views a large set of images that vary along certain semantic dimensions (e.g., gender, emotion, object category, scene type, etc.).
For each semantic dimension, identify a set of fMRI samples that differ only along that dimension. For example, for the gender dimension, you might compare the fMRI signals for “man” vs “woman”, keeping other aspects of the image content constant.
Use these samples to estimate a “semantic subspace” for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.
To reconstruct an image, first predict the full fMRI signal from the early visual cortex (similar to predicting z in Takagi et al.).
Then, to manipulate a specific semantic concept, project the predicted fMRI signal onto the orthogonal complement of the corresponding semantic subspace (to remove the original semantic information), and add the desired semantic component from a reference fMRI signal (similar to the concept algebra manipulation of z_c).
Finally, use this modified fMRI signal from the higher visual cortex to condition the diffusion model, reconstructing an image with the desired semantic manipulation.
The main challenges would be:
Collecting a sufficiently large and diverse fMRI dataset to robustly estimate the semantic subspaces.
Ensuring that the estimated subspaces are truly orthogonal and capture the desired semantic variation.
Translating the manipulated fMRI signals into a format that can effectively condition the diffusion model.
Nonetheless, this is an intriguing direction that could potentially unite the biological understanding from Takagi et al.‘s work with the interpretable semantic manipulation of concept algebra. It suggests a way to manipulate images directly from brain-space, without the need for text prompts. This could provide insights into how the brain itself represents and manipulates semantic concepts.’
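To make the workflow above concrete, a minimal numpy sketch of steps 3 and 5 (estimating a semantic subspace from paired differences, then swapping the concept component). All names are illustrative, and it assumes fMRI responses have been preprocessed into fixed-length vectors:

```python
import numpy as np

def semantic_subspace(X_pos: np.ndarray, X_neg: np.ndarray, k: int = 3) -> np.ndarray:
    """Estimate a k-dim semantic subspace from paired samples that differ only
    along one semantic dimension; X_pos/X_neg are matched (n_pairs, n_voxels)."""
    D = X_pos - X_neg                # difference vectors isolate the dimension
    D = D - D.mean(axis=0)
    _, _, Vt = np.linalg.svd(D, full_matrices=False)  # PCA via SVD
    return Vt[:k].T                  # (n_voxels, k) orthonormal basis

def swap_concept(x: np.ndarray, B: np.ndarray, x_ref: np.ndarray) -> np.ndarray:
    """Remove x's component in the subspace spanned by B, then add the
    reference signal's component (the concept-algebra-style manipulation)."""
    proj = lambda v: B @ (B.T @ v)   # orthogonal projection onto the subspace
    return x - proj(x) + proj(x_ref)
```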
’How might the above be relevant to the AI alignment problem?
The hypothetical approach of performing concept algebra directly on fMRI signals from the higher visual cortex could potentially offer some interesting insights relevant to the AI alignment problem:
Understanding AI systems: By comparing the semantic subspaces identified in fMRI signals with those found in the latent spaces of AI models (like the diffusion model in concept algebra), we could gain insights into how AI systems represent and manipulate semantic concepts, and how this compares to the human brain. This could help in designing AI systems that reason in ways more analogous to humans, which could be beneficial for alignment.
Interpretability and control: The ability to identify and manipulate specific semantic concepts directly in the space of brain activity could provide a more direct and interpretable way of controlling the output of AI systems. If we can align the semantic subspaces of the AI with those of the human brain, it might become possible to control the AI’s outputs by manipulating these subspaces, leading to more interpretable and controllable AI systems.
Value alignment: By understanding how human values and preferences are represented in the brain, and how they relate to the semantic concepts identified through concept algebra, we might gain insights into how to align AI systems with human values. If we can identify the neural subspaces corresponding to human values and ethics, we could potentially use this information to constrain the AI’s outputs to align with these values.
Cognitive modeling: This approach could also contribute to the development of more accurate cognitive models of human reasoning and decision-making. By understanding how semantic concepts are represented and manipulated in the brain, we can build AI systems that more closely mimic human cognition, which could be useful for predicting and understanding human behavior in the context of AI alignment.
Neuro-symbolic AI: The idea of manipulating semantic concepts directly in the space of brain activity is reminiscent of neuro-symbolic AI approaches, which aim to combine the strengths of neural networks and symbolic reasoning. Insights from this fMRI-based concept algebra could potentially inform the development of neuro-symbolic AI systems that are more interpretable and aligned with human cognition.
However, it’s important to note that these are speculative connections, and there would be significant technical and ethical challenges in realizing this approach. Nonetheless, exploring the parallels between the structure of human thought and the structure of AI representations could be a valuable direction for informing AI alignment research.′
’How would this be relevant in the context of this proposal? [note: the proposal being Activation vector steering with BCI]
The proposed concept algebra on fMRI signals could be highly relevant to the goals outlined in this funding proposal. Specifically:
The proposal aims to find mappings between human brain states and the latent representations of language models. Applying concept algebra directly on fMRI signals, as discussed, could provide a more direct and interpretable way to establish these mappings. By identifying semantic subspaces in the fMRI signals and aligning them with subspaces in the language model’s latent space, we could potentially “steer” the language model to exhibit brain states similar to specific people.
The proposal suggests that reward models could be trained to reproduce humanlike brain states during evaluation, to improve their generalization. If concept algebra on fMRI can successfully manipulate semantic concepts in the reconstructed outputs (e.g., text or images), this could provide a way to directly optimize the reward models to produce outputs that match desired brain states.
The proposal aims to advance scientific understanding of analogies between LLM behavior and human behavior. Comparing the semantic subspaces identified in fMRI with those in LLMs, as enabled by concept algebra, could provide insights into these analogies at a more fundamental, representational level.
The proposed fMRI experiments could potentially provide the necessary data for attempting concept algebra on brain signals. The proposal includes plans for designing and implementing fMRI data-collection protocols, recruiting subjects, and administering the experiments. This data could be used to estimate the semantic subspaces needed for concept algebra.
The ML experiments proposed, including the fMRI feature extraction pipeline and architecture modifications, could potentially be adapted to implement concept algebra on the fMRI signals, in addition to the original goal of mapping brain states to LLM representations.
In summary, while the original proposal does not explicitly mention concept algebra, the ideas discussed around applying concept algebra to fMRI signals could potentially enhance and extend the proposal’s goals of connecting human brain states with AI representations, improving reward model generalization, and advancing scientific understanding of human-AI analogies. The proposed fMRI experiments could provide the necessary data, and the planned ML experiments could be extended to include concept algebra on the brain signals.′
’Above you wrote: ‘Use these samples to estimate a “semantic subspace” for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.’ What if instead I just took a large diverse dataset of fMRI of images of men and fMRI of images of women, took the means of the two, subtracted them, and used that as the sex direction in concept algebra?
Your proposed approach of taking the difference between the mean fMRI signals for men and women to define the “sex direction” is a great idea and could indeed work as a simple alternative to more complex methods like PCA or LDA.
Here’s why this approach makes sense:
If the fMRI signals for men and women differ primarily along a single direction in the high-dimensional fMRI space, then the difference between their means will point along this direction.
By using a large and diverse dataset, you’re essentially averaging out other sources of variation (e.g., individual differences, noise), leaving only the consistent difference between men and women.
This is analogous to how concept algebra identifies concept directions in the latent space of the diffusion model: by taking the difference between the latent representations of prompts that differ only in the target concept.
To implement this:
Collect a large dataset of fMRI signals for images of men and women, ensuring diversity in other aspects (age, race, emotion, background, etc.).
Compute the mean fMRI signal for men (m_men) and women (m_women) separately.
Compute the “sex direction” as: v_sex = m_women − m_men.
To manipulate the sex of a reconstructed image, you could then project the fMRI signal onto the orthogonal complement of v_sex (to remove the original sex information), and then add or subtract a scaled version of v_sex to introduce the desired sex information.
The main advantages of this approach over PCA or LDA are simplicity and interpretability. It doesn’t require any complex eigendecompositions or model fitting, and the resulting direction has a clear interpretation as the difference between men and women.
Potential limitations to keep in mind:
If there are multiple independent directions along which men and women differ, this approach will only capture their average, not the individual directions.
If there are significant interactions between sex and other factors (e.g., if the sex difference is itself different for different ages or emotions), this simple difference may not fully capture the sex effect.
The approach assumes linearity—that the sex effect is a simple additive shift in fMRI space. If there are nonlinear interactions, this linear shift may not fully capture the effect.
Despite these potential limitations, I think this is a promising and elegantly simple approach to try for implementing concept algebra directly on fMRI signals. It’s a great starting point that could be refined with more sophisticated methods if needed.′
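For concreteness, the mean-difference variant from this exchange as a minimal numpy sketch (random arrays stand in for real fMRI features; purely illustrative):

```python
import numpy as np

def concept_direction(X_a: np.ndarray, X_b: np.ndarray) -> np.ndarray:
    """Unit vector between class means, e.g. v_sex = mean(women) - mean(men)."""
    v = X_b.mean(axis=0) - X_a.mean(axis=0)
    return v / np.linalg.norm(v)

def edit(x: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Project out the concept direction, then re-insert it at strength alpha."""
    return x - np.dot(x, v) * v + alpha * v

rng = np.random.default_rng(0)
X_men = rng.normal(size=(200, 512))    # stand-in for men-image fMRI vectors
X_women = rng.normal(size=(200, 512))  # stand-in for women-image fMRI vectors
v_sex = concept_direction(X_men, X_women)
x_edited = edit(X_men[0], v_sex, alpha=2.0)  # shift a 'man' sample toward 'woman'
```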
More reasons to think something like the above should work: High-resolution image reconstruction with latent diffusion models from human brain activity literally steers diffusion models using linearly-decoded fMRI signals (see fig. 2); and linear encoding (the inverse of decoding) from the text latents to fMRI also works well (see fig. 6; and similar results in Natural language supervision with a large and diverse dataset builds better models of human high-level visual cortex, e.g. fig. 2). Furthermore, they use the same (Stable Diffusion with CLIP) model used in Concept Algebra for (Score-Based) Text-Controlled Generative Models, which both provides theory for and empirically demos activation engineering-style linear manipulations. All this suggests similar Concept Algebra-like manipulations would also work when applied directly to the fMRI representations used to decode the text latents c in High-resolution image reconstruction with latent diffusion models from human brain activity.
Turns out, someone’s already done a similar (vector arithmetic in neural space; latent traversals too) experiment in a restricted domain (face processing) with another model (GAN) and it seemed to work: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012058 https://github.com/neuralcodinglab/brain2gan/blob/main/figs_manuscript/Fig12.png https://openreview.net/pdf?id=hT1S68yza7
Also a positive update for me on interdisciplinary conceptual alignment being automatable differentially soon; this has seemed plausible to me for a long time, since LLMs have ‘read the whole internet’ and interdisciplinary insights often seem (to me) to require relatively small numbers of inferential hops (plausibly because it’s hard for humans to have [especially deep] expertise in many different domains), making them potentially feasible for LLMs differentially early (reliably making long inferential chains still seems among the harder things for LLMs).
Very plausible view (though doesn’t seem to address misuse risks enough, I’d say) in favor of open-sourced models being net positive (including for alignment) from https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/:
‘While the labs certainly perform much valuable alignment research, and definitely contribute a disproportionate amount per-capita, they cannot realistically hope to compete with the thousands of hobbyists and PhD students tinkering and trying to improve and control models. This disparity will only grow larger as more and more people enter the field while the labs are growing at a much slower rate. Stopping open-source ‘proliferation’ effectively amounts to a unilateral disarmament of alignment while ploughing ahead with capabilities at full-steam.
Thus, until the point at which open source models are directly pushing the capabilities frontier themselves then I consider it extremely unlikely that releasing and working on these models is net-negative for humanity’
‘Much capabilities work involves simply gathering datasets or testing architectures where it is easy to utilize other closed models referenced in papers or through tacit knowledge of employees. Additionally, simple API access to models is often sufficient to build most AI-powered products rather than direct access to model internals. Conversely, such access is usually required for alignment research. All interpretability requires access to model internals almost by definition. Most of the AI control and alignment techniques we have invented require access to weights for finetuning or activations for runtime edits. Almost nothing can be done to align a model through access to the I/O API of a model at all. Thus it seems likely to me that by restricting open-source we differentially cripple alignment rather than capabilities. Alignment research is more fragile and dependent on deep access to models than capabilities research.’
Current open source models are not themselves any kind of problem. Their availability accelerates timelines, helps with alignment along the way. If there is no moratorium, this might be net positive. If there is a moratorium, it’s certainly net positive, as it’s the kind of research that the moratorium is buying time for, and it doesn’t shorten timelines because they are guarded by the moratorium.
It’s still irreversible proliferation even when the impact is positive. The main issue is open source as an ideology that unconditionally calls for publishing all the things, and refuses to acknowledge the very unusual situations where not publishing things is better than publishing things.
More reasons to believe that studying empathy in rats (which should be much easier than in humans, both for e.g. IRB reasons and because smaller brains make it easier to get whole connectomes, etc.) could generalize to how it works in humans and help with validating/implementing it in AIs (I’d bet one can already find something like computational correlates in e.g. GPT-4, and that the correlation will get larger with scale a la https://arxiv.org/abs/2305.11863): https://twitter.com/e_knapska/status/1722194325914964036
Contrastive methods could be used both to detect common latent structure across animals, measuring sessions, multiple species (https://twitter.com/LecoqJerome/status/1673870441591750656) and to e.g. look for which parts of an artificial neural network do what a specific brain area does during a task assuming shared inputs (https://twitter.com/BogdanIonutCir2/status/1679563056454549504).
And there are theoretical results suggesting some latent factors can be identified using multimodality (all of the following could be interpretable as different modalities—multiple brain recording modalities, animals, sessions, species, brains-ANNs), while being provably unidentifiable without the multiple modalities—e.g. Identifiability Results for Multimodal Contrastive Learning (and results on nonlinear ICA in single-modal vs. multi-modal settings reviewed in section 2.1). This might be a way to bypass single-model interpretability difficulties, by e.g. ‘comparing’ to brains or to other models.
Example of potential cross-species application: empathy mechanisms seem conserved across species Empathy as a driver of prosocial behaviour: highly conserved neurobehavioural mechanisms across species. Example of brain-ANN applications: ‘matching’ to modular brain networks, e.g. language network—ontology-relevant, non-agentic (e.g. The universal language network: A cross-linguistic investigation spanning 45 languages and 12 language families) or Theory of Mind network—could be very useful for detecting lying-relevant circuits (e.g. Single-neuronal predictions of others’ beliefs in humans).
Examples of related interpretability across models—Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models; across brain measurement modalities—Learnable latent embeddings for joint behavioural and neural analysis; across animals and brain-ANN—Quantifying stimulus-relevant representational drift using cross-modality contrastive learning.
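As a toy illustration of the cross-modal idea, with plain CCA standing in for fancier contrastive methods, and random arrays standing in for stimulus-aligned brain and ANN features (all names illustrative):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
brain = rng.normal(size=(1000, 300))  # e.g. fMRI/ephys features, one row per stimulus
ann = rng.normal(size=(1000, 768))    # ANN activations for the same stimuli

cca = CCA(n_components=10)
brain_c, ann_c = cca.fit_transform(brain, ann)  # shared low-dim projections
# Per-component correlations: high values would suggest shared latent structure
# between the brain recordings and the model's representations.
corrs = [np.corrcoef(brain_c[:, i], ann_c[:, i])[0, 1] for i in range(10)]
```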
Examples of reasons to expect (approximate) convergence to the same causal world models in various setups: theorem 2 in Robust agents learn causal world models; from Deep de Finetti: Recovering Topic Distributions from Large Language Models: ‘In particular, given the central role of exchangeability in our analysis, this analysis would most naturally be extended to other latent variables that do not depend heavily on word order, such as the author of the document [Andreas, 2022] or the author’s sentiment’ (this assumption might be expected to be approximately true for quite a few alignment-relevant-concepts); results from Victor Veitch: Linear Structure of (Causal) Concepts in Generative AI.
(As reply to Zvi’s ‘If someone was founding a new AI notkilleveryoneism research organization, what is the best research agenda they should look into pursuing right now?’)
Step 1: LLMs seem to represent meaning in a pretty human-like way and this seems likely to keep getting better as they get scaled up, e.g. https://arxiv.org/abs/2305.11863. This could make getting them to follow the commonsense meaning of instructions much easier. Also, similar methodologies to https://arxiv.org/abs/2305.11863 could be applied to other alignment-adjacent domains/tasks, e.g. moral reasoning, prosociality, etc.
Step 2: e.g. plug the commonsense-meaning-of-instructions following models into OpenAI’s https://openai.com/blog/introducing-superalignment.
Related intuition: turning LLM processes/simulacra into [coarse] emulations of brain processes.
(https://twitter.com/BogdanIonutCir2/status/1677060966540795905)
RSPs for automated AI safety R&D require rethinking RSPs
AFAICT, all current RSPs are only framed negatively, in terms of [prerequisites to] dangerous capabilities to be detected (early) and mitigated.
In contrast, RSPs for automated AI safety R&D will likely require measuring [prerequisites to] capabilities for automating [parts of] AI safety R&D, and preferentially (safely) pushing these forward. An early such example might be safely automating some parts of mechanistic interpretability.
(Related: On an apparent missing mood—FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely)
I expect large parts of interpretability work could be safely automatable very soon (e.g. GPT-5 timelines) using (V)LM agents; see A Multimodal Automated Interpretability Agent for a prototype.
Notably, MAIA (GPT-4V-based) seems approximately human-level on a bunch of interp tasks, while (overwhelmingly likely) being non-scheming (e.g. current models are bad at situational awareness and out-of-context reasoning) and basically-not-x-risky (e.g. bad at ARA).
Given the potential scalability of automated interp, I’d be excited to see plans to use large amounts of compute on it (including e.g. explicit integrations with agendas like superalignment or control; for example, given non-dangerous-capabilities, MAIA seems framable as a ‘trusted’ model in control terminology).
Hey Bogdan, I’d be interested in doing a project on this or at least putting together a proposal we can share to get funding.
I’ve been brainstorming new directions (with @Quintin Pope) this past week, and we think it would be good to use/develop some automated interpretability techniques we can then apply to a set of model interventions to see if there are techniques we can use to improve model interpretability (e.g. L1 regularization).
I saw the MAIA paper, too; I’d like to look into it some more.
Anyway, here’s a related blurb I wrote:
Whether this works or not, I’d be interested in making more progress on automated interpretability, in the similar ways you are proposing.
Hey Jacques, sure, I’d be happy to chat!
@the gears to ascension I see you reacted “10%” to the phrase “while (overwhelmingly likely) being non-scheming” in the context of the GPT-4V-based MAIA.
Does that mean you think there’s a 90% chance that MAIA, as implemented today, is actually scheming? If so, that seems like a very bold prediction, and I’d be very interested to know why you predict that. Or am I misunderstanding what you mean by that react?
ah, I got distracted before posting the comment I was intending to: yes, I think GPT4V is significantly scheming-on-behalf-of-openai, as a result of RLHF according to principles that more or less explicitly want a scheming AI; in other words, it’s not an alignment failure to openai, but openai is not aligned with human flourishing in the long term, and GPT4 isn’t either. I expect GPT4 to censor concepts that are relevant to detecting this somewhat. Probably not enough to totally fail to detect traces of it, but enough that it’ll look defensible, when a fair analysis would reveal it isn’t.
It seems to me like the sort of interpretability work you’re pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product as opposed to being bottlenecked on not having enough scale.
So, insofar as this automation will help people iterate faster, fair enough; but otherwise, I don’t really see this as the bottleneck.
Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in ‘Catching AIs red-handed’), but I think not using any internals might be overconservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).
Recent long-context LLMs seem to exhibit scaling laws from longer contexts—e.g. fig. 6 on page 8 of Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, and fig. 1 on page 1 of Effective Long-Context Scaling of Foundation Models.
The long contexts also seem very helpful for in-context learning, e.g. Many-Shot In-Context Learning.
This seems differentially good for safety (e.g. vs. models with larger forward passes but shorter context windows to achieve the same perplexity), since longer context and in-context learning are differentially transparent.
A brief list of resources with theoretical results which seem to imply RL is much more difficult than IL (e.g. in terms of sample efficiency). I don’t feel like I have enough theoretical RL expertise or time to scrutinize the arguments closely, but would love for others to pitch in. Probably at least somewhat relevant w.r.t. discussions of what the first AIs capable of obsoleting humanity could look like:
Paper: Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? (quote: ‘This work shows that, from the statistical viewpoint, the situation is far subtler than suggested by the more traditional approximation viewpoint, where the requirements on the representation that suffice for sample efficient RL are even more stringent. Our main results provide sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation), where we focus on natural representational conditions relevant to value-based, model-based, and policy-based learning. These lower bounds highlight that having a good (value-based, model-based, or policy-based) representation in and of itself is insufficient for efficient reinforcement learning, unless the quality of this approximation passes certain hard thresholds. Furthermore, our lower bounds also imply exponential separations on the sample complexity between 1) value-based learning with perfect representation and value-based learning with a good-but-not-perfect representation, 2) value-based learning and policy-based learning, 3) policy-based learning and supervised learning and 4) reinforcement learning and imitation learning.’)
Talks (very likely with some redundancy):
Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning—Sham Kakade
What is the Statistical Complexity of Reinforcement Learning? (and another two versions)
IL = imitation learning.
I’d bet against any of this providing interesting evidence beyond basic first-principles arguments. In my experience, these types of theory results never seem to add value on top of careful reasoning.
Hmm, unsure about this. E.g. the development models of many in the alignment community before GPT-3 (often heavily focused on RL or even on GOFAI) seem quite substantially worse in retrospect than those of some of the most famous deep learning people (e.g. LeCun’s cake); of course, this may be an unfair/biased comparison using hindsight. I’m unsure how much theory results influenced the famous deep learning people (and e.g. classic learning theory results would probably have been misleading), but it doesn’t seem obvious they had zero influence. For example, Bengio has multiple at least somewhat conceptual/theoretical (including review) papers motivating deep/representation learning; e.g. Representation Learning: A Review and New Perspectives.
Paul looks considerably better in retrospect than the famous DL people, IMO (partially via being somewhat more specific, though still not really making predictions).
I’m skeptical hard theory had much influence on anyone though. (In this domain at least.)
Some more (somewhat) related papers:
Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity (‘We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be NP-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is P-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL—representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.’).
On Representation Complexity of Model-based and Model-free Reinforcement Learning (‘We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal Q-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory provides unique insights into why model-based algorithms usually enjoy better sample complexity than model-free algorithms from a novel representation complexity perspective: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as Q-function, appear complex. We empirically corroborate our theory by comparing the approximation error of the transition kernel, reward function, and optimal Q-function in various Mujoco environments, which demonstrates that the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal Q-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, which also provides a rigorous framework for future research.’).
Demonstration-Regularized RL (‘Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL’s sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using NE expert demonstrations enables the identification of an optimal policy at a sample complexity of order O˜(Poly(S,A,H)/(ε^2 * N^E)) in finite and O˜(Poly(d,H)/(ε^2 * N^E)) in linear Markov decision processes, where ε is the target precision, H the horizon, A the number of action, S the number of states in the finite case and d the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works.’). (The sample-complexity bound quoted here is typeset more readably just after this list.)
Limitations of Agents Simulated by Predictive Models (‘There is increasing focus on adapting predictive models into agent-like systems, most notably AI assistants based on language models. We outline two structural reasons for why these models can fail when turned into agents. First, we discuss auto-suggestive delusions. Prior work has shown theoretically that models fail to imitate agents that generated the training data if the agents relied on hidden observations: the hidden observations act as confounding variables, and the models treat actions they generate as evidence for nonexistent observations. Second, we introduce and formally study a related, novel limitation: predictor-policy incoherence. When a model generates a sequence of actions, the model’s implicit prediction of the policy that generated those actions can serve as a confounding variable. The result is that models choose actions as if they expect future actions to be suboptimal, causing them to be overly conservative. We show that both of those failures are fixed by including a feedback loop from the environment, that is, re-training the models on their own actions. We give simple demonstrations of both limitations using Decision Transformers and confirm that empirical results agree with our conceptual and formal analysis. Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.’).
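For readability, the sample-complexity bound quoted in the Demonstration-Regularized RL abstract above is, as I read it (with N^E the number of expert demonstrations):

```latex
\[
\widetilde{O}\!\left(\frac{\operatorname{Poly}(S, A, H)}{\varepsilon^{2}\, N^{E}}\right)
\ \text{(finite MDPs)}
\qquad
\widetilde{O}\!\left(\frac{\operatorname{Poly}(d, H)}{\varepsilon^{2}\, N^{E}}\right)
\ \text{(linear MDPs)}
\]
```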
I’m not aware of anybody currently working on coming up with concrete automated AI safety R&D evals, while there seems to be so much work going into e.g. dangerous capabilities (DC) evals or even (more recently) scheminess evals. This seems very suboptimal in terms of portfolio allocation.
Edit: oops, I read this as “automated AI capabilities R&D”.
METR and UK AISI are both interested in this. I think UK AISI is working on this directly while METR is working on this indirectly.
See here.
Thanks! AFAICT though, the link you posted seems to be about automated AI capabilities R&D evals, rather than automated AI safety / alignment R&D evals (I do expect transfer between the two, but they don’t seem like the same thing). I’ve also chatted to some people from both METR and UK AISI and got the impression from all of them that there’s some focus on automated AI capabilities R&D evals, but not on safety.
Oops, misread you.
I think some people at superalignment (OpenAI) are interested in some version of this and might already be working on this.
Can you give a concrete example of a safety property of the sort that you’re envisioning automated testing for? Or am I misunderstanding what you’re hoping to see?
Conversation with Claude Opus on A Causal Explainable Guardrails for Large Language Models, Discussion: Challenges with Unsupervised LLM Knowledge Discovery and A Multimodal Automated Interpretability Agent (MAIA). To me it seems surprisingly good at something like coming up with plausible alignment research follow-ups, which e.g. were highlighted here as an important part of the superalignment agenda.
Prompts bolded:
‘Summarize ‘Causal Explainable Guardrails for Large Language Models’. In particular, could this be useful to deal with some of the challenges mentioned in ‘Discussion: Challenges with Unsupervised LLM Knowledge Discovery’?
The paper “A Causal Explainable Guardrails for Large Language Models” proposes LLMGuardaril, a framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. Key points:
Existing methods for steering LLMs towards desired attributes rely on unbiased representations and steering prompts, but the representations learned from pre-training can introduce semantic biases that negatively impact steering.
LLMGuardaril systematically identifies and blocks the confounding effects of biases, enabling extraction of unbiased steering representations. It also includes an explainable component providing insights into alignment between generated output and desired direction.
Experiments show LLMGuardaril can effectively steer LLMs towards desired attributes while mitigating biases.
Regarding the challenges mentioned in “Discussion: Challenges with Unsupervised LLM Knowledge Discovery”:
One key challenge is distinguishing the model’s own knowledge from knowledge of simulated entities. LLMGuardaril aims to disentangle the influence of semantic biases to obtain steering representations capturing desired attributes. This may help identify model’s own knowledge if those attributes are specific to the model itself.
Another challenge is the hypothesized existence of many features satisfying consistency properties beyond just the model’s knowledge. LLMGuardaril’s causal analysis and adversarial learning aims to block confounding semantic biases, potentially reducing the set of candidate knowledge-like features.
Empirically, contrast-consistent search seemed to find prominent arbitrary features rather than knowledge. LLMGuardaril’s explainable component could provide insight into whether extracted features actually correspond to model knowledge.
So in summary, while not a complete solution, LLMGuardaril’s approach of causally analyzing and debiasing semantic influences may help address some key challenges around identifying model knowledge in an unsupervised way. But significant open questions remain around definitively distinguishing model knowledge from simulated knowledge of other entities. Further research is needed to determine how much traction causal analysis provides on the hardest aspects of the ELK problem.’
‘How might MAIA be useful, potentially combined with ‘Causal Explainable Guardrails for Large Language Models’, to deal with some of the challenges mentioned in ‘Discussion: Challenges with Unsupervised LLM Knowledge Discovery’?
MAIA and the techniques from “Causal Explainable Guardrails for Large Language Models” (LLMGuardaril) could potentially be combined to help address some of the challenges mentioned in “Discussion: Challenges with Unsupervised LLM Knowledge Discovery”:
Distinguishing the model’s own knowledge from simulated knowledge: One key challenge raised is that consistency-based methods like contrast-consistent search (CCS) may not be able to distinguish an LLM’s “own” knowledge from the knowledge of simulated entities the model may be imagining. MAIA’s ability to automatically generate hypotheses and design experiments to test them could potentially be used to probe for features that are uniquely characteristic of the model’s own knowledge. For example, MAIA could generate synthetic inputs that vary the context in which a piece of knowledge is elicited (e.g. with/without an explicit simulation frame) and test whether the model’s responses are consistent.
Reducing the set of candidate knowledge-like features: Another challenge is that many features beyond the model’s own knowledge may satisfy consistency properties, making it hard to identify the “true” knowledge. LLMGuardaril’s technique of causally analyzing the influence of semantic biases on steering representations and using adversarial learning to debias them could potentially be applied by MAIA to narrow down the set of features that robustly encode model knowledge across prompts/contexts. The explainable component of LLMGuardaril could also help MAIA validate whether candidate knowledge features behave as expected.
Overcoming the limitations of one-shot probing: Existing automated interpretability approaches are noted to primarily be tools for one-shot hypothesis generation on a limited set of inputs, rather than scalable and fully rigorous causal explanation. MAIA’s ability to iteratively refine hypotheses based on experimental outcomes could lead to more robust explanations. Integrating the causal analysis techniques from LLMGuardaril into this iterative process could help rule out spurious hypotheses.
Evaluating unsupervised knowledge discovery methods: The paper emphasizes the need for principled testbeds to evaluate the effectiveness of unsupervised knowledge discovery techniques. The synthetic test systems used to quantitatively evaluate MAIA’s performance when ground-truth structure is known could potentially be extended to benchmark a wider variety of unsupervised interpretability approaches.
However, significant challenges would likely need to be overcome to fruitfully combine these techniques. For example, MAIA is currently focused on interpreting components of vision models, while LLMGuardaril focuses on steering language models, so unifying them would require bridging this modality gap. More fundamentally, even with access to MAIA’s automated experimentation and LLMGuardaril’s causal analysis, rigorously identifying an LLM’s “own” knowledge in a fully unsupervised way, as opposed to the knowledge of a specific simulated entity, remains an open challenge and likely requires further conceptual breakthroughs. Nonetheless, exploring the intersection of scalable interpretability agents like MAIA and causally grounded analysis techniques like LLMGuardaril seems like a promising direction for future work on the ELK problem.’
Continued:
‘How could the above be combined with ‘Catching AIs red-handed’ and, in particular, the ideas mentioned in paulfchristiano’s comment?
The approaches discussed in “A Multimodal Automated Interpretability Agent” (MAIA) and “Causal Explainable Guardrails for Large Language Models” (LLMGuardaril) could potentially be combined with the ideas in “Catching AIs red-handed” and paulfchristiano’s comment to create a more comprehensive strategy for detecting and mitigating risks from deceptively aligned AI systems:
Using MAIA to automatically generate and test hypotheses about deceptive behavior: As discussed in the “Catching AIs red-handed” post, once a suspicious action has been flagged, we can use techniques like varying the AI’s context and beliefs to test whether the behavior is truly deceptive. MAIA’s automated experimentation capabilities could potentially be used to scale up this kind of hypothesis generation and testing, systematically exploring the “landscape of deceptive behavior” as paulfchristiano puts it.
Applying causal analysis from LLMGuardaril to rule out alternative explanations: A key challenge raised in the post and comments is distinguishing true deception from an AI system just “kind of unlucky in how its context shifted it into a more deceptive role-play”. The causal analysis techniques from LLMGuardaril, which aim to disentangle the influence of semantic context on an AI’s behavior, could potentially be applied by MAIA to more confidently establish whether a pattern of suspicious behavior is caused by intentional deception vs. an unlucky context.
Using MAIA’s explainable component to interpret the AI’s reasoning: In paulfchristiano’s examples of analyzing a suspicious code review or context-dependent behavior, a critical step is understanding what information the AI is using to make its decisions. MAIA’s explainable component, which measures the alignment between an AI’s outputs and experimenter-specified directions, could potentially shed light on the reasoning behind a deceptive action.
Scaling up detection of hard-to-elicit capabilities: “Catching AIs red-handed” notes that real-world deceptive behavior could reveal capabilities that are hard to measure during normal testing. MAIA’s ability to automatically generate diverse test inputs based on a small number of real examples could help systematically probe for these hidden capabilities, reducing the AI’s “space of interventions...where they are more and more confident that it’s never going to be noticed.”
Iterative refinement of monitoring systems: Finally, as an AI system’s deceptive behavior is progressively uncovered through a combination of MAIA’s experimentation and LLMGuardaril’s causal analysis, the resulting dataset of confirmed deceptive actions could be used to fine-tune untrusted monitoring systems to more reliably catch future deception attempts, similar to the “few-shot catastrophe prevention” techniques outlined in the post.
Significant open questions remain about the feasibility of scaling these approaches to be robust against highly capable AI systems. There are also important limitations to consider: for example, MAIA’s causal analysis is unlikely to provide a fully conclusive “smoking gun” that an AI system is deceptive, as opposed to strong probabilistic evidence. And sufficiently capable systems may find ways to avoid exhibiting any detectable deceptive behavior under the kind of systematic experimentation enabled by MAIA.
Nonetheless, I believe the combination of automated interpretability agents, causally grounded analysis of AI reasoning, and the overall adversarial “red team” mindset advocated in “Catching AIs red-handed” represents a promising direction for building more robust monitoring and containment of potentially deceptive AI systems. A concerted effort to develop and integrate these approaches could significantly improve our odds of detecting and mitigating catastrophic deceptive alignment failures.’
‘Any ways to combine all the previously mentioned ideas (A Causal Explainable Guardrails for Large Language Models, Discussion: Challenges with Unsupervised LLM Knowledge Discovery and A Multimodal Automated Interpretability Agent (MAIA), ‘Catching AIs red-handed’, including paulfchristiano’s comment) with ideas from ‘Searching for a model’s concepts by their shape – a theoretical framework’ to ‘catch AIs red-handed’?
Here are a few additional ideas for combining all the mentioned papers and comments to try catching deceptive models red-handed:
Using the “deliberate adversarial experiments” mentioned in paulfchristiano’s comment, like varying the model’s beliefs about how closely it is being monitored, to surface cases where the model’s behavior is suspiciously sensitive to being watched. The theoretical framework from “Searching for concepts by their shape” could be used to formalize the hypothesis that the model has a hidden “deception mode” that gets triggered when it believes oversight is lax. MAIA’s automated experimentation tools could be used to efficiently search for inputs exhibiting this sensitivity.
Comparing the features extracted by the interpretability tools from “Searching for concepts by their shape” across different copies of the model that have been adversarially perturbed to have different beliefs (per paulfchristiano’s comment). If there are decision-relevant features (e.g. representations of the model’s objectives) that are suspiciously different in the copy that believes it is unobserved, that would be strong evidence of deceptive cognition. The causal analysis from “Causal Explainable Guardrails” could help determine whether those differing features are causally linked to the differing outputs.
Training MAIA’s detectors for suspicious model outputs not just on straightforwardly deceptive behaviors, but on subtler signs of agency-like cognition hypothesized in paulfchristiano’s comment, e.g. the model giving worse answers when it believes it is unobserved. This could yield more robust detection that is harder for the model to circumvent by avoiding overt deception.
If interpretability tools manage to extract a model’s representations of its beliefs, objectives, plans, etc. (a big if!), use the techniques from “Catching AIs red-handed” to examine how those representations differ on the inputs that were “successfully” deceptive (i.e. where the model behaved deceptively but was not caught). There may be learnable patterns in how the model’s cognition differs in those cases which could inspire hypotheses about what relations characterize “successful” deception. Those hypotheses could then be formalized and tested using the framework from “Searching for concepts by their shape”.
A point emphasized in paulfchristiano’s comment is the importance of extreme thoroughness and breadth in searching for deceptive cognition, e.g. (to paraphrase) “even if a model exhibits some deceptive behavior, don’t assume you’ve found all of its tricks; keep probing its behavior in a huge variety of contexts”. MAIA and other automated testing tools should be put in service of this kind of breadth. Rather than just focus on the most egregious cases of deception, we should test the model’s behavior on a vast range of subtle perturbations hunting for any whiff of situational inconsistency. The theoretical frameworks for agency from “Searching for concepts by their shape” can guide us in knowing what to look for.
Extending my previous thoughts, I think one of the most useful things the “Searching for concepts by their shape” framing can contribute is mathematical precision in defining hypotheses about what characterizes deceptive cognition. If we can frame agency/deception in terms of precise relations between features (even if those features are hard to interpretably extract in practice), it gives us a crisper target to aim at with our testing, rather than just relying on intuitive notions of what “seems deceptive”.
However, I continue to be quite pessimistic about the prospects for catching competent deceptive models before it’s too late, for the reasons outlined previously. I worry that by the time a model is smart enough that we could meaningfully apply these combined interpretability strategies to it, it would also be smart enough to notice what we’re doing and come up with countermeasures.
Unless interpretability tools far exceed my expectations in their ability to crack open the black box of neural nets (which would be great!), I suspect the most practical path is to use simpler, more behavioral detection strategies (like the model consistency testing from your Control paper) to avoid building models that are competent enough to be robustly deceptive in the first place. Catching a superintelligent deceptive model after the fact seems like a much harder challenge than noticing warning signs of misalignment in weaker models and heading them off early.
But I would still encourage pursuing all these ideas, even if they end up not being enough on their own to solve the problem. The interpretability insights gained would be valuable for alignment more broadly. And we may get lucky in extracting some key structural signature of deceptive cognition that even very sophisticated models can’t fully mask. It’s an arms race we can’t afford to unilaterally forfeit.’
I wonder how much near-term interpretability [V]LM agents (e.g. MAIA, AIA) might help with finding better probes and better steering vectors (e.g. by iteratively testing counterfactual hypotheses against potentially spurious features, a major challenge for Contrast-consistent search (CCS)).
This seems plausible since MAIA can already find spurious features, and feature-interpretability [V]LM agents could run much longer hypothesis-iteration cycles (compared to current [V]LM agents and perhaps even to human researchers).
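As a concrete anchor, here’s a minimal sketch of the CCS objective such an agent would be iterating against (the activations below are random placeholders standing in for cached LM hidden states, and CCS’s usual per-template normalization is omitted); the spurious-feature problem is exactly that many directions besides the model’s knowledge can score well on this loss:

```python
import torch

# Minimal sketch of the Contrast-Consistent Search (CCS) objective.
# Activations are placeholders for hidden states on contrast pairs
# ("X? Yes" / "X? No"); normalization omitted for brevity.

hidden_size = 4096                        # assumed hidden size
acts_pos = torch.randn(256, hidden_size)  # activations for the "Yes" completions
acts_neg = torch.randn(256, hidden_size)  # activations for the "No" completions

probe = torch.nn.Linear(hidden_size, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(1000):
    p_pos = torch.sigmoid(probe(acts_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(acts_neg)).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2   # p(true) and p(false) should sum to 1
    confidence = torch.min(p_pos, p_neg) ** 2  # penalize the degenerate p ~= 0.5 probe
    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

An agent could then generate counterfactual prompt variations and check whether a candidate probe’s outputs track the intended concept or a spurious correlate.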
I might have updated at least a bit against the weakness of single forward passes, based on intuitions about the amount of compute that huge context windows (e.g. Gemini 1.5’s 1 million tokens) might provide to a single forward pass, even if limited serially.
Or maybe not, apparently LLMs are (mostly) not helped by filler tokens.
Somewhat relatedly: I’m interested in how well LLMs can solve tasks in parallel. This seems very important to me.[1]
The “I’ve thought about this for 2 minutes” version is: Hand an LLM two multiple-choice questions with four answer options each. Encode the pair of answers into a single token, so that there are 16 possible tokens, of which one corresponds to the correct answers to both questions. A correct answer means that the model has solved both tasks in one forward pass.
(One can of course vary the number of answer options and questions. I can see some difficulties in implementing this idea properly, but would nevertheless be excited if someone took a shot at it; a rough sketch follows after the footnote below.)
Two quick reasons:
- For serial computation the number of layers gives some very rough indication of the strength of one forward-pass, but it’s harder to have intuitions for parallel computation.
- For scheming, the model could reason about “should I still stay undercover”, “what should I do in case I should stay undercover” and “what should I do in case it’s time to attack” in parallel, finally using only one serial step to decide on its action.
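Here’s the rough sketch mentioned above, just to make the setup concrete; `ask` is a hypothetical model-query function, and the two-letter code is a stand-in for the proposed single-token joint encoding (a stricter variant would add the 16 codes as dedicated tokens):

```python
import itertools

LETTERS = "ABCD"
# 16 joint labels ("AA" ... "DD"), one per pair of answers; a stricter variant
# would add these as 16 dedicated tokens so the joint answer is literally one token.
JOINT_LABELS = ["".join(p) for p in itertools.product(LETTERS, repeat=2)]

def joint_prompt(q1, q2):
    """Force both answers into one two-letter code, so the model has to solve
    both questions within the same final forward pass."""
    def fmt(q):
        return q["question"] + "\n" + "\n".join(
            f"{l}) {o}" for l, o in zip(LETTERS, q["options"]))
    return ("Question 1:\n" + fmt(q1) + "\n\nQuestion 2:\n" + fmt(q2) +
            "\n\nAnswer with exactly one two-letter code (first letter = Q1, "
            "second letter = Q2), e.g. 'BD'. Output nothing else.")

def joint_accuracy(pairs, ask):
    """`ask` is a hypothetical model-query function: prompt -> completion string."""
    hits = 0
    for q1, q2 in pairs:
        reply = ask(joint_prompt(q1, q2)).strip().upper()[:2]
        hits += reply in JOINT_LABELS and reply == q1["answer"] + q2["answer"]
    return hits / len(pairs)
```

Comparing joint accuracy against the product of single-question accuracies would give a crude measure of how much parallel solving costs the model.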
I would generally expect solving tasks in parallel to be fundamentally hard in one forward pass for pretty much all current SOTA architectures (especially Transformers and modern RNNs like Mamba). See e.g. this comment of mine, and other related works like https://twitter.com/bohang_zhang/status/1664695084875501579, https://twitter.com/bohang_zhang/status/1664695108447399937 (video presentation), Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks, RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval.
There might be more such results I’m currently forgetting about, but they should be relatively easy to find by e.g. following citation trails (to and from the above references) with Google Scholar (or by looking at my recent comments / short forms).
I am also very interested in how one could operationalize the number of hops of out-of-context reasoning required for various types of scheming, especially scheming in one forward pass, and especially in the context of automated AI safety R&D.
https://arxiv.org/abs/2404.15758
“We show that transformers can use meaningless filler tokens (e.g., ‘......’) in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge.”
Thanks, seen it; see also the exchanges in the thread here: https://twitter.com/jacob_pfau/status/1784446572002230703.
I looked over it and I should note that ‘transformers are in TC0’ is not a very useful statement for predicting capabilities. Transformers are Turing-complete given rational inputs (see the original paper), and their being in TC0 basically means they can implement whatever computation you can implement using a boolean circuit with a fixed amount of available compute, which amounts to ‘whatever computation is practical to implement’.
I think I remember William Merrill (in a video) pointing out that the rational inputs assumption seems very unrealistic (would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers made a different assumption about the number of bits of memory per parameter and per input.
Also, TC0 is very much limited, see e.g. this presentation.
Everything Turing-complete requires infinite memory. When we say ‘the x86 instruction set is Turing-complete’ we imply ‘assuming the processor operates on infinite memory’. It’s in the definition of a Turing machine to include an infinite tape, after all.
It’s hard to pinpoint, but the trick is that there’s a very nuanced difference between the sense in which transformers are limited complexity-theoretically and the claim ‘transformers can’t do X’. Like, there is nothing preventing transformers from playing chess perfectly; they just need to be sufficiently large for this. To answer the question ‘can transformers do X’ you need to ask ‘how much computing power does the transformer have’ and ‘can this computing power be shaped by SGD into a solution’.
It’s an interesting question whether Gemini has any improvements here.
I’ve been / am on the lookout for related theoretical results on why grounding a la Grounded language acquisition through the eyes and ears of a single child works (e.g. with contrastive learning methods); e.g. some recent works: Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP, Contrastive Learning is Spectral Clustering on Similarity Graph, Optimal Sample Complexity of Contrastive Learning. (More speculatively,) I’m also interested in how this might intersect with alignment, e.g. whether alignment-relevant concepts might be ‘groundable’ in fMRI data (and then ‘pointable to’); e.g. https://medarc-ai.github.io/mindeye/ uses contrastive learning on fMRI–image pairs.
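For reference, here’s a minimal sketch of the symmetric contrastive (CLIP-style, InfoNCE) objective these theoretical results analyze, and which MindEye applies to fMRI–image pairs; the embeddings below are random placeholders standing in for encoder outputs:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the symmetric InfoNCE objective behind CLIP-style grounding
# (and MindEye's fMRI-image training). The two embedding batches stand in for
# any paired modalities (image/text, fMRI/image, ...).

def clip_contrastive_loss(emb_a, emb_b, temperature=0.07):
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature  # [n, n] cosine-similarity matrix
    targets = torch.arange(len(emb_a))      # matched pairs sit on the diagonal
    # Symmetric cross-entropy: each item must pick out its partner, both ways.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Placeholder embeddings; real runs would come from modality-specific encoders.
loss = clip_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```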
This seems pretty good for safety (as RAG is comparatively at least a bit more transparent than fine-tuning): https://twitter.com/cwolferesearch/status/1752369105221333061
Larger LMs seem to benefit differentially more from tools: ‘Absolute performance and improvement-per-turn (e.g., slope) scale with model size.’ https://xingyaoww.github.io/mint-bench/. This seems pretty good for safety, to the degree tool usage is often more transparent than model internals.
In my book, this would probably be the most impactful model internals / interpretability project that I can think of: https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit?commentId=qByLyr6RSgv3GBqfB
Large-scale cyber-attacks resulting from AI misalignment seem hard; I’m at >90% probability that they happen much later (at least years later) than automated alignment research, as long as we *actually try hard* to make automated alignment research work: https://forum.effectivealtruism.org/posts/bhrKwJE7Ggv7AFM7C/modelling-large-scale-cyber-attacks-from-advanced-ai-systems.
I had speculated previously about links between task arithmetic and activation engineering. I think given all the recent results on in-context learning, task/function vectors and activation engineering / their compositionality (In-Context Learning Creates Task Vectors, In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering, Function Vectors in Large Language Models), this link is confirmed to a large degree. This might also suggest trying to import improvements to task arithmetic (e.g. Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models, or more broadly look at the citations of the task arithmetic paper) to activation engineering. (A minimal sketch of the weight-space vs. activation-space analogy is at the end of this note.)
speculatively, it might also be fruitful to go about this the other way round, e.g. try to come up with better weight-space task erasure methods by analogy between concept erasure methods (in activation space) and through the task arithmetic—activation engineering link
For the pretraining-finetuning paradigm, this link is now made much more explicitly in Cross-Task Linearity Emerges in the Pretraining-Finetuning Paradigm; as well as linking to model ensembling through logit averaging.
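Here’s the minimal sketch mentioned above: both task arithmetic and activation steering compute a difference vector and add back a scaled copy of it, one in weight space and one in activation space (layer selection, normalization, etc. omitted; nothing here is specific to any of the cited papers’ exact methods):

```python
import torch

def task_vector(finetuned, base):
    """Weight space (task arithmetic): tau = theta_finetuned - theta_base."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vector(base, tau, alpha=1.0):
    """Adding alpha * tau steers the model; negating tau 'unlearns' the task."""
    return {k: base[k] + alpha * tau[k] for k in base}

def steering_vector(acts_with, acts_without):
    """Activation space: difference of mean activations on contrastive prompts."""
    return acts_with.mean(dim=0) - acts_without.mean(dim=0)

def steer(hidden, v, beta=1.0):
    """Added to the residual stream at some layer during the forward pass."""
    return hidden + beta * v
```

The shared difference-then-add structure is what makes it tempting to port task-arithmetic tricks (e.g. tangent-space editing) over to activation engineering, and erasure methods in the other direction.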