Huh, that would seem a bit unnuanced to me, especially since I mentioned weak forward passes; though maybe I should add more context, e.g. this comment / the whole thread.
Thanks, might try to edit with a bit more context!
*dangerous capabilities; will edit
Whoever (strong?) downvoted this, I’m genuinely curious about the reasoning behind it; especially since my previous linkpost also got downvoted to negative numbers at some point, for reasons unknown to me.
(epistemic status: quick take, as the post category says)
Browsing through EAG London attendees’ profiles and seeing what seems like way too many people / orgs doing (I assume dangerous capabilities) evals. I expect a huge ‘market downturn’ on this, since I can hardly see how there would be so much demand for dangerous capabilities evals in a couple of years’ time / once some well-known orgs like the AISIs have built their eval suites, which many others will probably copy.
Meanwhile, other kinds of evals (e.g. alignment, automated AI safety R&D, even control) seem wildly neglected.
Looking at how much e.g. the UK (>$300B) or the US (>$1T) spent on Covid-19 measures puts into perspective how little is still being spent on AI safety R&D. I expect fractions of those budgets (<10%, i.e. tens of billions of dollars), allocated to automated / significantly-augmented AI safety R&D, would obsolete all previous human AI safety R&D.
Claude-3 Opus on using advance market commitments to incentivize automated AI safety R&D:
‘Advance Market Commitments (AMCs) could be a powerful tool to incentivize AI labs to invest in and scale up automated AI safety R&D. Here’s a concrete proposal for how AMCs could be structured in this context:
Government Commitment: The US government, likely through an agency like DARPA or NSF, would commit to purchasing a certain volume of AI safety tools and technologies that meet pre-specified criteria, at a guaranteed price, if and when they are developed. This commitment would be legally binding.
Funding Allocation: A significant pool of funds, on the order of several billion dollars, would be allocated to back these AMCs. The exact amount would be determined based on estimates of the funding needed to meaningfully accelerate AI safety R&D over a 5-10 year time horizon.
Clear Technical Criteria: The government would work with AI safety experts and leading labs to establish clear, measurable technical criteria that AI safety tools and technologies must meet to qualify for the AMC. These could include:
Ability to automatically detect and mitigate certain classes of AI risks (e.g. robustness issues, unexpected behaviors, reward hacking, scalable oversight failures)
Formal verifiability and interpretability of AI systems’ reasoning and decision-making
Automated generation of comprehensive test suites and monitoring for AI systems
Performance on benchmark tasks that evaluate AI safety and robustness
Seamless integration with common AI development frameworks and workflows
Pricing and Volume Guarantees: For each category of AI safety tech that meets the specified criteria, the government would guarantee a certain price per unit and commit to purchasing a minimum volume. These numbers would be set at levels that meaningfully derisk and subsidize the substantial R&D investment required.
Contractual Arrangements: AI labs would enter into binding contracts with the government to develop qualifying technologies. Upon successful development, the government would be obligated to purchase the technologies at the predetermined prices. IP and usage rights would be negotiated upfront.
Staged Releases and Readjustment: As AI capabilities advance rapidly, the AMC criteria would be staged over time to incentivize continual progress. Initial qualifying thresholds would target nearer-term safety challenges, with subsequent releases focused on more advanced issues. Pricing and volumes could be readjusted periodically as the technology matures.
Open Collaboration and Access: To maximize the public benefit, AMC contracts would stipulate that a substantial portion of the safety IP developed must be made freely available for research use and external validation. Certain core components could be open-sourced. Purchased tools would also be made available to developers and labs beyond the initial partners.
International Coordination: While led by the US, efforts would be made to coordinate and collaborate with governments and labs globally. Shared standards and compatible AMCs across regions would amplify the impact. Joint commitments could pool resources and expertise.
The goal of this AMC structure is to create a clear, long-term market pull that shifts the incentives for AI labs and makes large investments in automated AI safety R&D economically attractive and even necessary for competitive positioning. The guaranteed demand would provide greater certainty and help crowd in private capital. Staged, iterative AMCs would support sustained progress and adapt to evolving challenges. Broad access to the resulting safety IP and tools would uplift capabilities across the AI ecosystem.
Alongside other supporting policies and funding mechanisms, AMCs for automated AI safety technology could be a centerpiece of public-private collaboration to ensure advanced AI systems are developed responsibly. Robust market incentives, combined with technical leadership from top labs, could dramatically accelerate the development and deployment of scalable AI safety solutions.’
I find it pretty wild that automating AI safety R&D, which seems to me like the best shot we currently have at solving the full superintelligence control/alignment problem, no longer seems to have any well-resourced, vocal, public backers (with the superalignment team disbanded).
I think there’s something more general to the argument (related to SAEs seeming somewhat overkill in many ways, for strictly safety purposes).
For SAEs, the computational cost would likely be at least on the order of full pretraining; e.g. from Mapping the Mind of a Large Language Model:
‘The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place).’
While for activation steering approaches, the computational cost should probably be closer to the estimate in this ‘Computational requirements’ section from Language models can explain neurons in language models:
‘Our methodology is quite compute-intensive. The number of activations in the subject model (neurons, attention head dimensions, residual dimensions) scales roughly as O(n^(2/3)), where n is the number of subject model parameters. If we use a constant number of forward passes to interpret each activation, then in the case where the subject and explainer model are the same size, overall compute scales as O(n^(5/3)).
On the other hand, this is perhaps favorable compared to pre-training itself. If pre-training scales data approximately linearly with parameters, then it uses compute O(n^2).’
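Spelling the two quoted scalings out side by side (my reconstruction of the argument above, assuming a constant number of forward passes per interpreted activation and an explainer model the same size as the subject; n = number of subject-model parameters):

```latex
% Reconstruction of the quoted scaling argument (not OpenAI's exact accounting).
\begin{align*}
\text{activations to interpret} &\sim O(n^{2/3}) \\
\text{compute per forward pass} &\sim O(n) \\
\text{interpretation compute} &\sim O(n^{2/3}) \cdot O(n) = O(n^{5/3}) \\
\text{pre-training compute, with data} \propto \text{params} &\sim O(n \cdot n) = O(n^{2})
\end{align*}
```

So on this accounting the ratio of interpretation compute to pre-training compute shrinks as O(n^(-1/3)) with scale, which is what makes the second approach look comparatively favorable.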
Some evidence in favor of the framework; from Advanced AI evaluations at AISI: May update:
Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not assessed on long-horizon tasks. We analysed failed attempts to understand the major impediments to success. On short-horizon tasks, models often made small errors (like syntax errors in code). On longer horizon tasks, models devised good initial plans but did not sufficiently test their solutions or failed to correct initial mistakes. Models also sometimes hallucinated constraints or the successful completion of subtasks.
Summary: We found that leading models could solve some short-horizon tasks, such as software engineering problems. However, no current models were able to tackle long-horizon tasks.
I’m particularly interested in what the framework might say about the order in which various prerequisite capabilities for automated AI safety R&D might appear, and how that ordering compares to various dangerous capabilities. And, in particular, for each horizon t, in making sure we’re ‘eating all the free energy’ of all the t-horizon prerequisite capabilities for automated AI safety R&D.
Related: see the passage from Advanced AI evaluations at AISI: May update quoted above.
For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be a much shorter-horizon task than e.g. coming up with a new conceptual alignment agenda or a new ML architecture (and should also have much faster feedback loops than e.g. training a SOTA LM using a new architecture).
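To illustrate how short-horizon that first task is, here’s a minimal sketch of its core step: ranking SAE features by how well their decoder directions match a target semantic direction. All names here are hypothetical (sae_decoder and concept_direction are assumed to have been obtained elsewhere); this is the kind of subroutine an interpretability LM agent could run and iterate on in minutes:

```python
import numpy as np

def find_features_for_direction(sae_decoder: np.ndarray,
                                concept_direction: np.ndarray,
                                top_k: int = 5) -> list[int]:
    """Rank SAE features by cosine similarity of their decoder rows
    to a target semantic direction.

    sae_decoder: (n_features, d_model) decoder weight matrix.
    concept_direction: (d_model,) embedding of the concept of interest.
    """
    rows = sae_decoder / np.linalg.norm(sae_decoder, axis=1, keepdims=True)
    direction = concept_direction / np.linalg.norm(concept_direction)
    similarities = rows @ direction
    return np.argsort(-similarities)[:top_k].tolist()

# Hypothetical usage: top candidate features for a 'deception' direction.
# candidates = find_features_for_direction(sae.W_dec, deception_embedding)
```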
E.g. to the degree that typical probing / activation steering work often involves short, ~1-hour horizons, it might be automatable differentially soon; e.g. from Steering GPT-2-XL by adding an activation vector:
For example, we couldn’t find a “talk in French” steering vector within an hour of manual effort.
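To make concrete how mechanically simple a single such experiment is (and hence how amenable it might be to automation), here’s a minimal sketch of activation addition, loosely following the linked post’s general recipe; the model, layer, coefficient, and prompt pair below are illustrative choices, not the post’s (which used GPT-2-XL):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

MODEL_NAME, LAYER, COEFF = "gpt2", 6, 4.0  # illustrative, not the post's choices
tok = GPT2Tokenizer.from_pretrained(MODEL_NAME)
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME).eval()

def last_token_residual(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the last prompt token at layer LAYER."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER][0, -1]

# Contrastive steering vector (toy stand-in for a 'talk in French' direction).
steering_vec = last_token_residual("Je parle francais") - last_token_residual("I speak English")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 holds the hidden states.
    return (output[0] + COEFF * steering_vec,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("The weather today is", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))
```

An automated pipeline could sweep layers, coefficients, and prompt pairs; this is exactly the kind of search that exceeded an hour of the authors’ manual effort in the quote above.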
I think interpretability looks like a particularly promising area for “automated research”—AIs might grind through large numbers of analyses relatively quickly and reach a conclusion about the thought process of some larger, more sophisticated system.
Arguably, this is already starting to happen (very early, and with obviously-non-x-risky systems) with interpretability LM agents like those in FIND and MAIA.
RSPs for automated AI safety R&D require rethinking RSPs
AFAICT, all current RSPs are only framed negatively, in terms of [prerequisites to] dangerous capabilities to be detected (early) and mitigated.
In contrast, RSPs for automated AI safety R&D will likely require measuring [prerequisites to] capabilities for automating [parts of] AI safety R&D, and preferentially (safely) pushing these forward. An early such example might be safely automating some parts of mechanistic interpretability.
Intuitively, I’m thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this ‘race’, corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to aligning / controlling superintelligence).
W.r.t. measurement, I think it would be valuable regardless of whether auto AI safety R&D is already happening or not, similarly to how e.g. evals for automated ML R&D seem valuable even if automated ML R&D is already happening. In particular, information on how successful auto AI safety R&D would be (and e.g. what its scaling curves look like vs. those for dangerous capabilities) seems very strategically relevant to whether it might be feasible to deploy it at scale, when that might happen, with what risk tradeoffs, etc.
AI R&D and AI safety R&D will almost surely come at the same time.
Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. current ones, which do very badly on dangerous capabilities evals).
People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
This seems good w.r.t. automated AI safety potentially ‘piggybacking’, but bad for differential progress.
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
Sure, though wouldn’t this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time?
To get somewhat more concrete: the Frontier Safety Framework report already proposes a (somewhat vague) operationalization for ML R&D evals, which (vagueness permitting) seems straightforward to translate into an operationalization for automated AI safety R&D evals (a rough translation is sketched after the quoted levels):
Machine Learning R&D level 1: Could significantly accelerate AI research at a cutting-edge lab if deployed widely, e.g. improving the pace of algorithmic progress by 3X, or comparably accelerate other AI research groups.
Machine Learning R&D level 2: Could fully automate the AI R&D pipeline at a fraction of human labor costs, potentially enabling hyperbolic growth in AI capabilities.
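As a gesture at how direct the translation could be, here’s a minimal sketch; the wording and thresholds on the safety side are my guesses by analogy with the quoted levels, not anything from the Frontier Safety Framework itself:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CapabilityLevel:
    """A capability level in the style of the quoted ML R&D levels,
    repurposed for automated AI safety R&D. All safety-side wording
    below is my guess by analogy, not from the FSF."""
    domain: str
    level: int
    description: str

AUTO_AI_SAFETY_RND_LEVELS = [
    CapabilityLevel(
        domain="auto_ai_safety_rnd", level=1,
        description=("Could significantly accelerate AI safety research if "
                     "deployed widely, e.g. improving the pace of progress on "
                     "evals / interpretability / control by 3x."),
    ),
    CapabilityLevel(
        domain="auto_ai_safety_rnd", level=2,
        description=("Could fully automate the AI safety R&D pipeline at a "
                     "fraction of human labor costs."),
    ),
]
```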
On an apparent missing mood—FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely
Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post):
each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.
Despite this promise, we seem to have little knowledge of when such automated AI safety R&D might happen, or of the order in which the relevant capabilities might appear (including relative to various dangerous capabilities). AFAICT, we don’t seem to be trying very hard either, at either prediction or elicitation of such capabilities.
One potential explanation I’ve heard for this apparent missing mood of FOMO on automated AI safety R&D is the idea / worry that the relevant capabilities would only appear when models are already dangerous / too close to being dangerous. This seems plausible to me, but it would still seem like a strange way to proceed given the uncertainty.
For contrast, consider how one might go about evaluating dangerous capabilities and trying to anticipate them ahead of time. From GDM’s recent Introducing the Frontier Safety Framework:
The Framework has three key components:
Identifying capabilities a model may have with potential for severe harm. To do this, we research the paths through which a model could cause severe harm in high-risk domains, and then determine the minimal level of capabilities a model must have to play a role in causing such harm. We call these “Critical Capability Levels” (CCLs), and they guide our evaluation and mitigation approach.
Evaluating our frontier models periodically to detect when they reach these Critical Capability Levels. To do this, we will develop suites of model evaluations, called “early warning evaluations,” that will alert us when a model is approaching a CCL, and run them frequently enough that we have notice before that threshold is reached.
Applying a mitigation plan when a model passes our early warning evaluations. This should take into account the overall balance of benefits and risks, and the intended deployment contexts. These mitigations will focus primarily on security (preventing the exfiltration of models) and deployment (preventing misuse of critical capabilities).
I see no reason why, in principle, a similar high-level approach couldn’t be analogously taken for automated AI safety R&D capabilities, yet almost nobody seems to be working on this, at least publicly (in fact, I am currently working on related topics; I’m still very surprised by the overall neglectedness, made even more salient by current events).
Quote from Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie:
A quite early example of this is Collin Burns’ work, doing unsupervised identification of some aspects of a neural network that are correlated with things being true or false. I think that is important work. It’s a kind of obvious direction for the stuff to go. You can keep improving it when you have AIs that you’re training to do their best to deceive humans or other audiences in the face of the thing and you can measure whether our lie detectors break down. When we train our AIs to tell us the sky is green in the face of the lie detector and we keep using gradient descent on them, do they eventually succeed? That’s really valuable information to know because then we’ll know our existing lie detecting systems are not actually going to work on the AI takeover and that can allow government and regulatory response to hold things back. It can help redirect the scientific effort to create lie detectors that are robust and that can’t just be immediately evolved around and we can then get more assistance.
Basically the incredibly juicy ability that we have working with the AIs is that we can have as an invaluable outcome that we can see and tell whether they got a fast one past us on an identifiable situation. Here’s an air-gapped computer, you get control of the keyboard, you can input commands, can you root the environment and make a blue banana appear on the screen? Even if we train the AI to do that and it succeeds, we see the blue banana, we know it worked, even if we did not understand and would not have detected the particular exploit that it used to do it.
This can give us a rich empirical feedback loop where we’re able to test things even against an AI using its best efforts to get past our interpretability methods, using its best efforts to get past our adversarial examples.
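The structure of this feedback loop is cheap to prototype even without real model internals; a minimal sketch, where the activations, the ‘truth direction’, the probe, and the simulated deceptive model are all toy stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # toy activation dimension

# 1. "Honest" model: true vs. false statements are linearly separable
#    along a (known, synthetic) truth direction in activation space.
truth_dir = rng.normal(size=d)
y_train = rng.integers(0, 2, size=2000)
X_train = rng.normal(size=(2000, d)) + np.outer(2 * y_train - 1, truth_dir)

# Train the "lie detector" probe on the honest model's activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2. "Model trained to lie": simulate an adversarially trained model whose
#    activations have most of the truth signal optimized away.
erased = 0.8
y_test = rng.integers(0, 2, size=2000)
X_test = rng.normal(size=(2000, d)) + (1 - erased) * np.outer(2 * y_test - 1, truth_dir)

# 3. The measurable outcome Shulman points at: did the detector survive?
print("probe accuracy, honest model:   ", probe.score(X_train, y_train))
print("probe accuracy, deceptive model:", probe.score(X_test, y_test))
```

The point of the toy setup is just that the outcome is a number you can watch degrade as the adversarial pressure increases, which is the kind of checkable signal Shulman describes.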
You’ll often be able to use the model to advance lots of other forms of science and technology, including alignment-relevant science and technology, where experiment and human understanding (+ the assistance of available AI tools) is enough to check the advances in question.
Interpretability research seems like a strong and helpful candidate here, since many aspects of interpretability research seem like they involve relatively tight, experimentally-checkable feedback loops. (See e.g. Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie.)
The apparent early success of language model agents for interpretability (e.g. MAIA, FIND) seems perhaps related.
If it’s well designed and kept private, this doesn’t seem totally implausible to me; e.g. how many ways can you evaluate cyber capabilities to try to assess risks of weights exfiltration or taking over the datacenter (in a control framework)? Surely that’s not an infinite set.
But in any case, it seems pretty obvious that returns should diminish quickly for e.g. the 100th set of dangerous capabilities evals vs. e.g. the 2nd set of alignment evals or the 1st set of auto AI safety R&D evals.