AI forecasting & strategy at AI Impacts. Blog: Not Optional.
My best LW posts are Framing AI strategy and Slowing AI: Foundations. My most useful LW posts are probably AI policy ideas: Reading list and Ideas for AI labs: Reading list.
It’s “a unified methodology” but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.
Thanks!
I think there’s another agenda: make untrusted models safe but useful by putting them in a scaffolding/bureaucracy—of filters, classifiers, LMs, humans, etc.—such that at inference time, takeover attempts are less likely to succeed and more likely to be caught. See Untrusted smart models and trusted dumb models (Shlegeris 2023), plus the sketch after the list below. Other relevant work:
AI safety with current science (Shlegeris 2023)
Preventing Language Models From Hiding Their Reasoning (Roger and Greenblatt 2023)
Coup probes (Roger 2023)
(Some other work boosts this agenda but isn’t motivated by it — perhaps activation engineering and chain-of-thought)
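Here's a minimal sketch of the kind of scaffolding I mean, roughly in the spirit of Shlegeris's trusted/untrusted framing. All function names and the threshold are hypothetical placeholders, not anyone's actual implementation:

```python
# Sketch of an inference-time control pipeline: the untrusted model's output
# is scored by a trusted (weaker) monitor before being used.
# Hypothetical placeholders throughout; not a real API.

SUSPICION_THRESHOLD = 0.8  # illustrative; would be calibrated on audited examples


def untrusted_generate(task: str) -> str:
    """Placeholder for the capable-but-untrusted model."""
    return f"[untrusted model's answer to: {task}]"


def trusted_generate(task: str) -> str:
    """Placeholder for the weaker-but-trusted model."""
    return f"[trusted model's answer to: {task}]"


def trusted_monitor_score(task: str, output: str) -> float:
    """Placeholder: the trusted monitor rates how suspicious the output looks, in [0, 1]."""
    return 0.0


def flag_for_human_audit(task: str, output: str, suspicion: float) -> None:
    """Placeholder: queue the output for human review."""
    print(f"AUDIT (suspicion={suspicion:.2f}): {output!r}")


def controlled_response(task: str) -> str:
    output = untrusted_generate(task)
    suspicion = trusted_monitor_score(task, output)
    if suspicion >= SUSPICION_THRESHOLD:
        # Suspicious outputs are audited rather than used directly,
        # and we fall back to the trusted model in the meantime.
        flag_for_human_audit(task, output, suspicion)
        return trusted_generate(task)
    return output
```

The point of the structure is that the capable model never gets an unchecked channel to the world: every output either looks innocuous to the trusted monitor or gets escalated.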
Update: Greg Brockman quit.
Update: Sam and Greg say:
Sam and I are shocked and saddened by what the board did today.
Let us first say thank you to all the incredible people who we have worked with at OpenAI, our customers, our investors, and all of those who have been reaching out.
We too are still trying to figure out exactly what happened. Here is what we know:
- Last night, Sam got a text from Ilya asking to talk at noon Friday. Sam joined a Google Meet and the whole board, except Greg, was there. Ilya told Sam he was being fired and that the news was going out very soon.
- At 12:19pm, Greg got a text from Ilya asking for a quick call. At 12:23pm, Ilya sent a Google Meet link. Greg was told that he was being removed from the board (but was vital to the company and would retain his role) and that Sam had been fired. Around the same time, OpenAI published a blog post.
- As far as we know, the management team was made aware of this shortly after, other than Mira who found out the night prior.
The outpouring of support has been really nice; thank you, but please don’t spend any time being concerned. We will be fine. Greater things coming soon.
Update: three more resignations including Jakub Pachocki.
Sam Altman’s firing as OpenAI CEO was not the result of “malfeasance or anything related to our financial, business, safety, or security/privacy practices” but rather a “breakdown in communications between Sam Altman and the board,” per an internal memo from chief operating officer Brad Lightcap seen by Axios.
Update: Sam is planning to launch something (no details yet).
Update: Sam may return as OpenAI CEO.
Update: Tigris.
Update: talks with Sam and the board.
Update: Mira wants to hire Sam and Greg in some capacity; board still looking for a permanent CEO.
Update: Emmett Shear is interim CEO; Sam won’t return.
Update: lots more resignations (according to an insider).
Update: Sam and Greg leading a new lab in Microsoft.
Update: total chaos.
Has anyone collected their public statements on various AI x-risk topics anywhere?
A bit, not shareable.
Helen is an AI safety person. Tasha is on the Effective Ventures board. Ilya leads superalignment. Adam signed the CAIS statement.
Thanks!
automating the world economy will take longer
I’m curious what fraction-of-2023-tasks-automatable and maybe fraction-of-world-economy-automated you think will occur at e.g. overpower time, and the median year for that. (I sometimes notice people assuming 99%-automatability occurs before all the humans are dead, without realizing they’re assuming anything.)
@Daniel Kokotajlo it looks like you expect 1000x-energy 4 years after 99%-automation. I thought we get fast takeoff, all humans die, and 99% automation at around the same time (but probably in that order) and then get massive improvements in technology and massive increases in energy use soon thereafter. What takes 4 years?
(I don’t think the part after fast takeoff or all humans dying is decision-relevant, but maybe resolving my confusion about this part of your model would help illuminate other confusions too.)
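(To make the magnitude concrete, here's the illustrative arithmetic behind my confusion; the 4-year figure is yours, the rest is just compounding:)

$$1000^{1/4} \approx 5.6$$

i.e. even spread evenly over 4 years, 1000x-energy means multiplying world energy use by roughly 5.6x every year, which is already wildly faster than any historical growth. So I'm asking what bottleneck makes it take even that long in your model.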
So why doesn’t one of those thousand people run for president and win? (This is a rhetorical question, I know the answer)
The answer is that there’s a coordination problem.
It occurs to me that maybe these things are related. Maybe in a world of monarchies where the dynasty of so-and-so has ruled for generations, supporting someone with zero royal blood is like supporting a third-party candidate in the USA.
Wait, what is it that gave monarchic dynasties momentum, in your view?
In the future, sharing weights will enable misuse. For now, the main effect of sharing weights is boosting research, both capabilities and safety (e.g., the Llama releases definitely did this). The sign of that research-boosting currently seems negative to me, but there’s lots of reasonable disagreement.
fwiw my guess is that OP didn’t ask its grantees to do open-source LLM biorisk work at all; I think its research grantees generally have lots of freedom.
(I’ve worked for an OP-funded research org for 1.5 years. I don’t think I’ve ever heard of OP asking us to work on anything specific, nor of us working on something because we thought OP would like it. Sometimes we receive restricted, project-specific grants, but I think those projects were initiated by us. Oh, one exception: Holden’s standards-case-studies project.)
Interesting. If you’re up for skimming a couple more EA-associated AI-bio reports, I’d be curious about your quick take on the RAND report and the CLTR report.
https://managing-ai-risks.com said “we call on major tech companies and public funders to allocate at least one-third of their AI R&D budget to ensuring safety and ethical use”
Open letters (related to AI safety):
FLI, Oct 2015: Research Priorities for Robust and Beneficial Artificial Intelligence
FLI, Aug 2017: Asilomar AI Principles
FLI, Mar 2023: Pause Giant AI Experiments
CAIS, May 2023: Statement on AI Risk
Meta, Jul 2023: Statement of Support for Meta’s Open Approach to Today’s AI
Academic AI researchers, Oct 2023: Managing AI Risks in an Era of Rapid Progress
CHAI et al., Oct 2023: Prominent AI Scientists from China and the West Propose Joint Strategy to Mitigate Risks from AI
Oct 2023: Urging an International AI Treaty
Mozilla, Oct 2023: Joint Statement on AI Safety and Openness
Nov 2023: Post-Summit Civil Society Communique
Joint declarations between countries (related to AI safety):
Nov 2023: Bletchley Declaration
Thanks to Peter Barnett.
Nice.
You’ll need to evaluate more than just foundation models
Not sure what this is gesturing at—you need to evaluate other kinds of models, or whole labs, or foundation-models-plus-finetuning-and-scaffolding, or something else.
(I think “model evals” means “model+finetuning+scaffolding evals,” at least to the AI safety community + Anthropic.)
This was the press release; the actual order has now been published.
One safety-relevant part:
4.2. Ensuring Safe and Reliable AI. (a) Within 90 days of the date of this order, to ensure and verify the continuous availability of safe, reliable, and effective AI in accordance with the Defense Production Act, as amended, 50 U.S.C. 4501 et seq., including for the national defense and the protection of critical infrastructure, the Secretary of Commerce shall require:
(i) Companies developing or demonstrating an intent to develop potential dual-use foundation models to provide the Federal Government, on an ongoing basis, with information, reports, or records regarding the following:
(A) any ongoing or planned activities related to training, developing, or producing dual-use foundation models, including the physical and cybersecurity protections taken to assure the integrity of that training process against sophisticated threats;
(B) the ownership and possession of the model weights of any dual-use foundation models, and the physical and cybersecurity measures taken to protect those model weights; and
(C) the results of any developed dual-use foundation model’s performance in relevant AI red-team testing based on guidance developed by NIST pursuant to subsection 4.1(a)(ii) of this section, and a description of any associated measures the company has taken to meet safety objectives, such as mitigations to improve performance on these red-team tests and strengthen overall model security. Prior to the development of guidance on red-team testing standards by NIST pursuant to subsection 4.1(a)(ii) of this section, this description shall include the results of any red-team testing that the company has conducted relating to lowering the barrier to entry for the development, acquisition, and use of biological weapons by non-state actors; the discovery of software vulnerabilities and development of associated exploits; the use of software or tools to influence real or virtual events; the possibility for self-replication or propagation; and associated measures to meet safety objectives; and
(ii) Companies, individuals, or other organizations or entities that acquire, develop, or possess a potential large-scale computing cluster to report any such acquisition, development, or possession, including the existence and location of these clusters and the amount of total computing power available in each cluster.
(b) The Secretary of Commerce, in consultation with the Secretary of State, the Secretary of Defense, the Secretary of Energy, and the Director of National Intelligence, shall define, and thereafter update as needed on a regular basis, the set of technical conditions for models and computing clusters that would be subject to the reporting requirements of subsection 4.2(a) of this section. Until such technical conditions are defined, the Secretary shall require compliance with these reporting requirements for:
(i) any model that was trained using a quantity of computing power greater than 10^26 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10^23 integer or floating-point operations; and
(ii) any computing cluster that has a set of machines physically co-located in a single datacenter, transitively connected by data center networking of over 100 Gbit/s, and having a theoretical maximum computing capacity of 10^20 integer or floating-point operations per second for training AI.
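For quick back-of-the-envelope checks against the interim thresholds in 4.2(b), something like this (my reading of the text above, obviously not legal guidance):

```python
# Interim reporting thresholds from section 4.2(b) of the executive order,
# as quoted above. Function names are mine.

def model_requires_reporting(training_flop: float, primarily_bio_sequence_data: bool) -> bool:
    """Interim threshold: >1e26 FLOP in general, or >1e23 FLOP if trained
    primarily on biological sequence data."""
    threshold = 1e23 if primarily_bio_sequence_data else 1e26
    return training_flop > threshold


def cluster_requires_reporting(colocated: bool, network_gbit_per_s: float, peak_flop_per_s: float) -> bool:
    """Interim threshold: machines co-located in a single datacenter, networked
    at over 100 Gbit/s, with theoretical peak over 1e20 FLOP/s for training AI."""
    return colocated and network_gbit_per_s > 100 and peak_flop_per_s > 1e20


# Example: GPT-4 is widely estimated at roughly 2e25 training FLOP,
# below the general 1e26 threshold.
print(model_requires_reporting(2e25, primarily_bio_sequence_data=False))  # False
```

So the initial model threshold sits somewhat above the biggest publicly known training runs, until Commerce defines updated technical conditions.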
Thanks.
Shrug. You can get it ~arbitrarily high by using more sensitive tests. Of course the probability for Anthropic is less than 1, and I totally agree it’s not “high enough to write off this point.” I just feel like this is an engineering problem, not a flawed “core assumption.”
[Busy now but I hope to reply to the rest later.]
there are clearly some training setups that seem more dangerous than other training setups . . . .
Like, as an example, my guess is systems where a substantial chunk of the compute was spent on training with reinforcement learning in environments that reward long-term planning and agentic resource acquisition (e.g. many video games or diplomacy or various simulations with long-term objectives) sure seem more dangerous.
Any recommended reading on which training setups are safer? If none exist, someone should really write this up.
This is great. Some quotes I want to come back to:
a thing that I like that both the Anthropic RSP and the ARC Evals RSP post point to is basically a series of well-operationalized conditional commitments. One way an RSP could be is to basically be a contract between AI labs and the public that concretely specifies “when X happens, then we commit to do Y”, where X is some capability threshold and Y is some pause commitment, with maybe some end condition.
instead of an RSP I would much prefer a bunch of frank interviews with Dario and Daniella where someone is like “so you think AGI has a decent chance of killing everyone, then why are you building it?”. And in-general to create higher-bandwidth channels that people can use to understand what people at leading AI labs believe about the risks from AI, and when the benefits are worth it.
It seems pretty important to me to have some sort of written down and maintained policy on “when would we stop increasing the power of our models” and “what safety interventions will we have in place for different power levels”.
I generally think more honest clear communication and specific plans on everything seems pretty good on current margins (e.g., RSPs, clear statements on risk, clear explanation of why labs are doing things that they know are risky, detailed discussion with skeptics, etc).
I do feel like there is a substantial tension between two different types of artifacts here:
A document that is supposed to accurately summarize what decisions the organization is expecting to make in different circumstances
A document that is supposed to bind the organization to make certain decisions in certain circumstances
Like, the current vibe that I am getting is that RSPs are a “no-take-backsies” kind of thing. You don’t get to publish an RSP saying “yeah, we aren’t planning to scale” and then later on to be like “oops, I changed my mind, we are actually going to go full throttle”.
And my guess is this is the primary reason why I expect organizations to not really commit to anything real in their RSPs and for them to not really capture what leadership of an organization thinks the tradeoffs are. Like, that’s why the Anthropic RSP has a big IOU where the actually most crucial decisions are supposed to be.
Like, here is an alternative to “RSP”s. Call them “Conditional Pause Commitments” (CPC if you are into acronyms).
Basically, we just ask AGI companies to tell us under what conditions they will stop scaling or stop otherwise trying to develop AGI. And then also some conditions under which they would resume. [Including implemented countermeasures.] Then we can critique those.
This seems like a much clearer abstraction that’s less philosophically opinionated about whether the thing is trying to be an accurate map of an organization’s future decisions, or to what degree it’s supposed to seriously commit an organization, or whether the whole thing is “responsible”.
people working at AI labs should think through and write down the conditions under which they would loudly quit (this can be done privately, but maybe should be shared between like minded employees for common knowledge). Then, people can hopefully avoid getting frog-boiled.
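On the “Conditional Pause Commitments” framing above: the “when X happens, we commit to do Y” structure is basically a small record like this (my sketch; the field names are hypothetical, and a real commitment would be prose plus concrete evals):

```python
from dataclasses import dataclass


@dataclass
class ConditionalPauseCommitment:
    """Sketch of the 'when X happens, we commit to do Y' structure.
    Hypothetical fields for illustration only."""
    trigger: str            # X: capability threshold / eval result that triggers the commitment
    pause_action: str       # Y: what the lab stops doing (e.g., scaling, deployment)
    resume_condition: str   # what must be true (e.g., countermeasures implemented) to resume


example = ConditionalPauseCommitment(
    trigger="model passes autonomous-replication evals",
    pause_action="pause further scaling and external deployment",
    resume_condition="security and deployment countermeasures audited and in place",
)
```

Something this simple is what makes the abstraction easy to critique: each field is a concrete thing outsiders can argue is too weak, too vague, or missing.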
I mostly disagree with your criticisms.
On the other hand: before dangerous capability X appears, labs specifically testing for signs-of-X should be able to notice those signs. (I’m pretty optimistic about detecting dangerous capabilities; I’m more worried about how labs will get reason-to-believe their models are still safe after those models have dangerous capabilities.)
There’s a good solution: build safety buffers into your model evals. See https://www-files.anthropic.com/production/files/responsible-scaling-policy-1.0.pdf#page=11. “the RSP is unclear on how to handle that ambiguity” is wrong, I think; Anthropic treats a model as ASL-2 until the evals-with-safety-buffers trigger, and then they implement ASL-3 safety measures.
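A minimal sketch of the safety-buffer idea, with made-up numbers (the real trigger is a specific eval suite, not a single score; this is just my illustration of p. 11 of the RSP, not Anthropic's actual procedure):

```python
# Safety buffers: trigger stronger safety measures at an eval level
# well below the level at which the capability is actually dangerous,
# to absorb eval error and jumps between training runs. Numbers are invented.

DANGEROUS_CAPABILITY_LEVEL = 0.5   # hypothetical eval score at which the model is actually dangerous
SAFETY_BUFFER = 0.2                # margin: trigger well before that level
TRIGGER_LEVEL = DANGEROUS_CAPABILITY_LEVEL - SAFETY_BUFFER


def required_safety_level(eval_score: float) -> str:
    """Treat the model as ASL-2 until the buffered eval triggers, then require ASL-3 measures."""
    return "ASL-3" if eval_score >= TRIGGER_LEVEL else "ASL-2"
```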
I don’t really get this.
Shrug; maybe. I wish I was aware of a better plan… but I’m not.
I am excited about this. I’ve also recently been interested in ideas like nudging researchers to write 1-5 page research agendas, then collecting them and advertising the collection.
Possible formats:
A huge google doc (maybe based on this post); anyone can comment; there’s one or more maintainers; maintainers approve ~all suggestions by researchers about their own research topics and consider suggestions by random people.
A directory of google docs on particular agendas; the individual google docs are each owned by a relevant researcher, who is responsible for maintaining them; some maintainer-of-the-whole-project occasionally nudges researchers to update their docs and reassigns the topic to someone else if necessary. Random people can make suggestions too.
(Alex, I think we can do much better than the best textbooks format in terms of organization, readability, and keeping up to date.)
I am interested in helping make something like this happen. Or if it doesn’t happen soon I might try to do it (but I’m not taking responsibility for making this happen). Very interested in suggestions.
(One particular kind-of-suggestion: is there a taxonomy/tree of alignment research directions you like, other than the one in this post? (Note to self: taxonomies have to focus on either methodology or theory of change… probably organize by theory of change and don’t hesitate to point to the same directions/methodologies/artifacts in multiple places.))