A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections.
I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn’t how safety policies are supposed to work.
What is the change and how does it affect security?
9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to employees trying to steal model weights if the employee has any access to “systems that process model weights”.
Anthropic claims this change is minor (and calls insiders with this access “sophisticated insiders”).
But, I’m not so sure it’s a small change: we don’t know what fraction of employees could get this access and “systems that process model weights” isn’t explained.
Naively, I’d guess that access to “systems that process model weights” includes employees being able to operate on the model weights in any way other than through a trusted API (a restricted API that we’re very confident is secure). If that’s right, it could be a high fraction! So, this might be a large reduction in the required level of security.
If this does actually apply to a large fraction of technical employees, then I’m also somewhat skeptical that Anthropic can actually be “highly protected” from (e.g.) organized cybercrime groups without meeting the original bar: compromising an insider and using their access is a typical attack pattern!
Also, one of the easiest ways for security-aware employees to evaluate security is to think about how easily they could steal the weights. So, if you don’t aim to be robust to employees, it might be much harder for employees to evaluate the level of security and then complain about not meeting requirements[1].
Anthropic’s justification and why I disagree
Anthropic justified the change by saying that model theft isn’t much of the risk from amateur CBRN uplift (CBRN-3) and that the risks from AIs being able to “fully automate the work of an entry-level, remote-only Researcher at Anthropic” (AI R&D-4) don’t depend on model theft.
I disagree.
On CBRN: If other actors are incentivized to steal the model for other reasons (e.g. models become increasingly valuable), it could end up broadly proliferating, which might greatly increase risk, especially as elicitation techniques improve.
On AI R&D: AIs which are over the capability level needed to automate the work of an entry-level researcher could seriously accelerate AI R&D (via fast speed, low cost, and narrow superhumanness). If other less safe (or adversarial) actors got access, risk might increase a bunch.[2]
More strongly, ASL-3 security must suffice up until the ASL-4 threshold: it has to cover the entire range from ASL-3 to ASL-4. ASL-4 security itself is still not robust to high-effort attacks from state actors which could easily be motivated by large AI R&D acceleration.
As of the current RSP, it must suffice until just before AIs can “substantially uplift CBRN [at] state programs” or “cause dramatic acceleration in [overall AI progress]”. These seem like extremely high bars indicating very powerful systems, especially the AI R&D threshold.[3]
As it currently stands, Anthropic might not require ASL-4 security (which still isn’t sufficient for high effort state actor attacks) until we see something like 5x AI R&D acceleration (and there might be serious issues with measurement lag).
I’m somewhat sympathetic to security not being very important for ASL-3 CBRN, but it seems very important as of the ASL-3 AI R&D threshold and crucial before the AI R&D ASL-4 threshold! I think the ASL-3 AI R&D threshold should probably instead trigger the ASL-4 security requirements!
Overall, Anthropic’s justification for this last minute change seems dubious and the security requirements they’ve currently committed to seem dramatically insufficient for AI R&D threat models. To be clear, other companies have worse security commitments.
Concerns about potential noncompliance and lack of visibility
Another concern is that this last minute change is quite suggestive of Anthropic being out of compliance with their RSP before they weakened the security requirements.
We have to trust Anthropic quite a bit to rule out noncompliance. This isn’t a good state of affairs.
To explain this concern, I’ll need to cover some background on how the RSP works.
The RSP requires ASL-3 security as soon as it’s determined that ASL-3 can’t be ruled out (as Anthropic says is the case for Opus 4).
Here’s how it’s supposed to go:
They ideally have ASL-3 security mitigations ready, including the required auditing.
Once they find the model is ASL-3, they apply the mitigations immediately (if not already applied).
If they aren’t ready, they need temporary restrictions.
My concern is that the security mitigations they had ready when they found the model was ASL-3 didn’t suffice for the old ASL-3 bar but do suffice for the new bar (otherwise why did they change the bar?). So, prior to the RSP change they might have been out of compliance.
It’s certainly possible they remained compliant:
Maybe they had measures which temporarily sufficed for the old higher bar but which were too costly longer term. Also, they could have deleted the weights outside of secure storage until the RSP was updated to lower the bar.
Maybe an additional last minute security assessment (which wasn’t required to meet the standard?) indicated inadequate security and they deployed temporary measures until they changed the RSP. It would be bad to depend on last minute security assessment for compliance.
(It’s also technically possible that the ASL-3 capability decision was made after the RSP was updated. This would imply the decision was only made 8 days before release, so hopefully this isn’t right. Delaying evals until an RSP change lowers the bar would be especially bad.)
Conclusion
Overall, this incident demonstrates our limited visibility into AI companies. How many employees are covered by the new bar? What triggered this change? Why does Anthropic believe it remained in compliance? Why does Anthropic think that security isn’t important for ASL-3 AI R&D?
I think a higher level of external visibility, auditing, and public risk assessment would be needed (as a bare minimum) before placing any trust in policies like RSPs to keep the public safe from AI companies, especially as they develop existentially dangerous AIs.
To be clear, I appreciate Anthropic’s RSP update tracker and that it explains changes. Other AI companies have mostly worse safety policies: as far as I can tell, o3 and Gemini 2.5 Pro are about as likely to cross the ASL-3 bar as Opus 4 and they have much worse mitigations!
Appendix and asides
I don’t think current risks are existentially high (if current models were fully unmitigated, I’d guess this would cause around 50,000 expected fatalities per year) and temporarily being at a lower level of security for Opus 4 doesn’t seem like that big of a deal. Also, given that security is only triggered after a capability decision, the ASL-3 CBRN bar is supposed to include some conservativeness anyway. But, my broader points around visibility stand and potential noncompliance (especially unreported noncompliance) should be worrying even while the stakes are relatively low.
You can view the page showing the RSP updates including the diff of the latest change here: https://www.anthropic.com/rsp-updates. (Edit: the diff is the “redline” pdf and is only available for the most recent change.) Again, I appreciate that Anthropic has this page and makes it easy to see the changes they make to the RSP.
I find myself quite skeptical that Anthropic actually could rule out that Sonnet 4 and other models weaker than Opus 4 cross the ASL-3 CBRN threshold. How sure is Anthropic that it wouldn’t substantially assist amateurs even after the “possible performance increase from using resources that a realistic attacker would have access to”? I feel like our current evidence and understanding is so weak, and models already substantially exceed virology experts at some of our best proxy tasks.
The skepticism applies similarly or more to other AI companies (and Anthropic’s reasoning is more transparent).
But, this just serves to further drive home ways in which the current regime is unacceptable once models become so capable that the stakes are existential.
One response is that systems this powerful will be open sourced or trained by less secure AI companies anyway. Sure, but the intention of the RSP is (or was) to outline what would “keep risks below acceptable levels” if all actors follow a similar policy.
(I don’t know if I ever bought that the RSP would succeed at this. It’s also worth noting there is an explicit exit clause Anthropic could invoke if they thought proceeding outweighed the risks despite the risks being above an acceptable level.)
This sort of criticism is quite time consuming and costly for me. For this reason there are specific concerns I have about AI companies which I haven’t discussed publicly. This is likely true for other people as well. You should keep this in mind when assessing AI companies and their practices.
It also makes these complaints harder for other employees to assess, whereas other employees might be able to more easily evaluate arguments about what they themselves could do.
It looks like AI 2027 would estimate around a ~2x AI R&D acceleration for a system which was just over this ASL-3 AI R&D bar (as it seems somewhat more capable than the “Reliable agent” bar). I’d guess more like 1.5x at this point, but either way this is a big deal!
Anthropic says they’ll likely require a higher level of security for this “dramatic acceleration” AI R&D threshold, but they haven’t yet committed to this nor have they defined a lower AI R&D bar which results in an ASL-4 security requirement.
I’d been pretty much assuming that AGI labs’ “responsible scaling policies” are LARP/PR, and that if an RSP ever conflicts with their desire to release a model, either the RSP will be swiftly revised, or the testing suite for the model will be revised such that it doesn’t trigger the measures the AGI lab doesn’t want to trigger. I.e., that RSPs are toothless and that their only purposes are to showcase how Responsible the lab is and to hype up how powerful a given model ended up being.
This seems to confirm that cynicism.
(The existence of the official page tracking the updates is a (smaller) update in the other direction, though. I don’t see why they’d have it if they consciously intended to RSP-hack this way.)
Employees at Anthropic don’t think the RSP is LARP/PR. My best guess is that Dario doesn’t think the RSP is LARP/PR.
This isn’t necessarily in conflict with most of your comment.
I think I mostly agree the RSP is toothless. My sense is that for any relatively subjective criteria, like making a safety case for misalignment risk, the criteria will basically come down to “what Jared+Dario think is reasonable”. Also, if Anthropic is unable to meet this (very subjective) bar, then Anthropic will still basically do whatever Anthropic leadership thinks is best whether via maneuvering within the constraints of the RSP commitments, editing the RSP in ways which are defensible, or clearly substantially loosening the RSP and then explaining they needed to do this due to other actors having worse precautions (as is allowed by the RSP). I currently don’t expect clear cut and non-accidental procedural violations of the RSP (edit: and I think they’ll be pretty careful to avoid accidental procedural violations).
I’m skeptical of normal employees having significant influence on high stakes decisions via pressuring the leadership, but empirical evidence could change the views of Anthropic leadership.
How you feel about this state of affairs depends a lot on how much you trust Anthropic leadership to make decisions which are good from your perspective.
Minimally it’s worth noting that Dario and Jared are much less concerned about misalignment risk than I am and I expect only partial convergence in beliefs due to empirical evidence (before it’s too late).
I think the RSP still has a few important purposes:
I expect that the RSP will eventually end up with some transparency commitments with some teeth. These won’t stop Anthropic from proceeding if Anthropic leadership thinks this is best, but it might at least mean there ends up being common knowledge of whether reasonable third parties (or Anthropic leadership) think the current risk is large.
I think the RSP might end up with serious security requirements. I don’t expect these will be met on time in short timelines but the security bar specified in advance might at least create some expectations about what a baseline security expectation would be.
Anthropic might want to use the RSP to bind itself to the mast so that investors or other groups have a harder time pressuring it to spend less on security/safety.
There are some other more tentative hopes (e.g., eventually getting common expectations of serious security or safety requirements which are likely to be upheld, regulation) which aren’t impossible.
And there are some small wins already, like Google DeepMind having set some security expectations for itself which it is reasonably likely to follow through with if it isn’t too costly.
Another note: My guess is that people on LessWrong tend to be overly pessimistic about Anthropic leadership (in terms of how good of decisions Anthropic leadership will make under the LessWrong person’s views and values) and Anthropic employees tend to be overly optimistic.
I’m less confident that people on LessWrong are overly pessimistic, but they at least seem too pessimistic about the intentions/virtue of Anthropic leadership.
For the record, I think the importance of “intentions”/values of leaders of AGI labs is overstated. What matters the most in the context of AGI labs is the virtue / power-seeking trade-offs, i.e. the propensity to do dangerous moves (/burn the commons) to unilaterally grab more power (in pursuit of whatever value).
Stuff like this op-ed, the broken promise of not meaningfully pushing the frontier, Anthropic’s obsession with and singular focus on automating AI R&D, Dario’s explicit calls to be the first to RSI AI, or Anthropic’s shady policy activity has provided ample evidence that their propensity to burn the commons to grab more power (probably in the name of some values I would mostly agree with, fwiw) is very high.
As a result, I’m now all-things-considered trusting Google DeepMind slightly more than Anthropic to do what’s right for AI safety. Google, as a big corp, is less likely to do unilateral power grabbing moves (such as automating AI R&D asap to achieve a decisive strategic advantage), is more likely to comply with regulations, and is already fully independent to build AGI (compute / money / talent) so won’t degrade further in terms of incentives; additionally D. Hassabis has been pretty consistent in his messaging about AI risks & AI policy, about the need for an IAEA/CERN for AI etc., Google has been mostly scaling up its safety efforts and has produced some of the best research on AI risk assessment (e.g. this excellent paper, or this one).
IMO, reasonableness and epistemic competence are also key factors. This includes stuff like how effectively they update on evidence, how much they are pushed by motivated reasoning, how good are they at futurism and thinking about what will happen. I’d also include “general competence”.
(This is a copy of my comment made on your shortform version of this point.)
Not the main thrust of the thread, but for what it’s worth, I find it somewhat anti-helpful to flatten things into a single variable of “how much you trust Anthropic leadership to make decisions which are good from your perspective”, and then ask how optimistic/pessimistic you are about this variable.
I think I am much more optimistic about Anthropic leadership on many axes relative to an overall survey of the US population or Western population – I expect them to be more libertarian, more in favor of free speech, more pro economic growth, more literate, more self-aware, higher IQ, and a bunch of other things.
I am more pessimistic about their ability to withstand the pressures of a trillion dollar industry to shape their incentives than the people who are at Anthropic.
I believe the people working there are siloing themselves intellectually into an institution facing incredible financial incentives for certain bottom lines like “rapid AI progress is inevitable” and “it’s reasonably likely we can solve alignment” and “beating China in the race is a top priority”, and aren’t allowed to talk to outsiders about most details of their work, and this is a key reason that I expect them to screw up their decision-making.
I am optimistic about their relative ability to have a sensible conversation about the next 5 years and what alignment failures look like, relative to most people on earth. This is not the standard I require to expect people to not do ML training runs that lead to human extinction, but nonetheless I predict they will do relatively quite well on this axis.
I don’t have a single variable here, I have a much more complicated model than this. It looks to me that collapsing questions of trust about people or groups into a single variable of how optimistic I am about them making decisions which are good from my values has been a common question-substitution in the Effective Altruism scene, where I think people have been repeatedly hoodwinked by sociopaths due to not moving toward a more detailed model that predicts exactly where and when someone will make good vs bad decisions.
I certainly agree that the pressures and epistemic environment should make you less optimistic about good decisions being made. And that thinking through the overall situation and what types or decisions you care about are important. (Like, you can think of my comment as making a claim about the importance weighted goodness of decisions.)
I don’t see the relevance of “relative decision making goodness compared to the general population” which I think you agree with, but in that case I don’t see what this was responding to.
Not sure I agree with other aspects of this comment and implications. Like, I think reducing things to a variable like “how good is it to generically empower this person/group” is pretty reasonable in the case of Anthropic leadership because in a lot of cases they’d have a huge amount of general open-ended power, though a detailed model (taking into account what decisions you care about etc) would need to feed into this.
What’s an example decision or two where you would want to ask yourself whether they should get more or less open-ended power? I’m not sure what you’re thinking of.
I think the main thing I want to convey is that I think you’re saying that LWers (of which I am one) have a very low opinion of the integrity of people at Anthropic, but what I’m actually saying is that their integrity is no match for the forces that they are being tested with.
I don’t need to be able to predict a lot of fine details about individuals’ decision-making in order to have good estimates of these two quantities, and comparing them is the second-most important question relating to whether it’s good to work on capabilities at Anthropic. (The first is a basic ethical question about working on a potentially extinction-causing technology, which is not much related to the details of which capabilities company you’re working at.)
I think you’re saying that LWers (of which I am one) have a very low opinion of the integrity of people at Anthropic
This is related to what I was saying but it wasn’t what I was saying. I was saying “tend to be overly pessimistic about Anthropic leadership (in terms of how good of decisions Anthropic leadership will make under the LessWrong person’s views and values)”. I wasn’t making a claim about the perceived absolute level of integrity.
Probably not worth hashing this out further, I think I get what you’re saying.
Employees at Anthropic don’t think the RSP is LARP/PR. My best guess is that Dario doesn’t think the RSP is LARP/PR.
Yeah, I don’t think this is necessarily in contradiction with my comment. Things can be effectively just LARP/PR without being consciously LARP/PR. (Indeed, this is likely the case in most instances of LARP-y behavior.)
Can you explain how you got the diffs from https://www.anthropic.com/rsp-updates ? I see the links to previous versions, but nothing that’s obviously a diff view to see the actual changed language.
I feel as though I must be missing the motivation for Anthropic to do this. Why put so much effort into safety/alignment research just to intentionally fumble the ball on actual physical security?
I would like to understand why they would resist this. Is increasing physical security so onerous that it’s going to seriously hamper their research efficiency?
I think security is legitimately hard and can be costly in research efficiency. I think there is a defensible case for this ASL-3 security bar being reasonable for the ASL-3 CBRN threshold, but it seems too weak for the ASL-3 AI R&D threshold (hopefully the bar for things like this ends up being higher).
Could you give an example of where security would negatively affect research efficiency? Like what is the actual implementation difficulty that arises from increased physical security?
Every time you want to interact with the weights in some non-basic way, you need to have another randomly selected person who inspects in detail all the code and commands you run.
The datacenter and office are airgapped and so you don’t have internet access.
Increased physical security isn’t much of a difficulty.
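To make the efficiency cost concrete, here is a toy sketch of the kind of two-person control described above. It is purely illustrative (the class and names are hypothetical, not Anthropic’s actual tooling): any non-basic interaction with the weights is blocked until a randomly selected second person reviews the command.

```python
import secrets
from dataclasses import dataclass

@dataclass
class WeightAccessRequest:
    """Hypothetical request to run a command that touches model weights."""
    requester: str
    command: str
    approved_by: str | None = None

class TwoPersonRule:
    """Toy two-person control: a randomly selected reviewer (never the
    requester) must inspect and approve each command before it runs."""

    def __init__(self, staff: list[str]):
        self.staff = staff

    def assign_reviewer(self, request: WeightAccessRequest) -> str:
        candidates = [person for person in self.staff if person != request.requester]
        return secrets.choice(candidates)

    def execute(self, request: WeightAccessRequest) -> None:
        if request.approved_by is None or request.approved_by == request.requester:
            raise PermissionError("blocked: command needs independent review")
        print(f"running (reviewed by {request.approved_by}): {request.command}")

rule = TwoPersonRule(["alice", "bob", "carol"])
req = WeightAccessRequest(requester="alice", command="copy weights to eval cluster")
req.approved_by = rule.assign_reviewer(req)  # in practice, this blocks on a human review
rule.execute(req)
```

Multiplied across every experiment that touches the weights, that extra human in the loop is where the research-efficiency cost comes from; the physical measures themselves are comparatively cheap.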
This is a great post. Good eye for catching this and making the connections here. I think I expect to see more “cutting corners” like this though I’m not sure what to do about it since I don’t think internally it will feel like corners are being cut rather than necessary updates that are only obvious in hindsight.
I’ve heard from a credible source that OpenAI substantially overestimated where other AI companies were at with respect to RL and reasoning when they released o1. Employees at OpenAI believed that other top AI companies had already figured out similar things when they actually hadn’t and were substantially behind. OpenAI had been sitting on the improvements driving o1 for a while prior to releasing it. Correspondingly, releasing o1 resulted in much larger capabilities externalities than OpenAI expected. I think there was one more case like this either from OpenAI or GDM where employees had a large misimpression about capabilities progress at other companies, causing a release they otherwise wouldn’t have done.
One key takeaway from this is that employees at AI companies might be very bad at predicting the situation at other AI companies (likely making coordination more difficult by default). This includes potentially thinking they are in a close race when they actually aren’t. Another update is that keeping secrets about something like reasoning models worked surprisingly well to prevent other companies from copying OpenAI’s work even though there was a bunch of public reporting (and presumably many rumors) about this.
One more update is that OpenAI employees might unintentionally accelerate capabilities progress at other actors via overestimating how close they are. My vague understanding was that they haven’t updated much, but I’m unsure. (Consider updating more if you’re an OpenAI employee!)
Interesting. What confuses me a bit: What made other companies be able to copy OpenAI’s work after it was released, conditional on your story being true? As far as I know, OpenAI didn’t actually explain their methods developing o1, so what exactly did other companies learn from the release which they didn’t learn from the rumors that OpenAI is developing something like this?
Is the conclusion basically that Jesse Hoogland has been right that just the few bits that OpenAI did leak already constrained the space of possibilities enough for others to copy the work? Quote from his post:
For all its secrecy, OpenAI has leaked enough bits to tightly constrain the space of possibilities.
The few bits they leaked in the release helped a bunch. Note that these bits were substantially leaked via people being able to use the model rather than necessarily via the blog post.
Other companies weren’t that motivated to try to copy OpenAI’s work until it was released, as they weren’t sure how important it was or how good the results were.
“Employees at OpenAI believed…” — do you mean Sam Altman and the board?
If this information is accurate, it speaks volumes about how flawed their alignment predictions might also be. If a company with vast resources and insider access like OpenAI can’t predict the capabilities of competing firms (a relatively simple problem with objectively knowable answers), how can we expect them to predict the behavior of advanced AI models, where the unknowns are far greater and often unknowable?
I’m currently working as a contractor at Anthropic in order to get employee-level model access as part of a project I’m working on. The project is a model organism of scheming, where I demonstrate scheming arising somewhat naturally with Claude 3 Opus. So far, I’ve done almost all of this project at Redwood Research, but my access to Anthropic models will allow me to redo some of my experiments in better and simpler ways and will allow for some exciting additional experiments. I’m very grateful to Anthropic and the Alignment Stress-Testing team for providing this access and supporting this work. I expect that this access and the collaboration with various members of the alignment stress testing team (primarily Carson Denison and Evan Hubinger so far) will be quite helpful in finishing this project.
I think that this sort of arrangement, in which an outside researcher is able to get employee-level access at some AI lab while not being an employee (while still being subject to confidentiality obligations), is potentially a very good model for safety research, for a few reasons, including (but not limited to):
For some safety research, it’s helpful to have model access in ways that labs don’t provide externally. Giving employee level access to researchers working at external organizations can allow these researchers to avoid potential conflicts of interest and undue influence from the lab. This might be particularly important for researchers working on RSPs, safety cases, and similar, because these researchers might naturally evolve into third-party evaluators.
Related to undue influence concerns, an unfortunate downside of doing safety research at a lab is that you give the lab the opportunity to control the narrative around the research and use it for their own purposes. This concern seems substantially addressed by getting model access through a lab as an external researcher.
I think this could make it easier to avoid duplicating work between various labs. I’m aware of some duplication that could potentially be avoided by ensuring more work happened at external organizations.
For these and other reasons, I think that external researchers with employee-level access is a promising approach for ensuring that safety research can proceed quickly and effectively while reducing conflicts of interest and unfortunate concentration of power. I’m excited for future experimentation with this structure and appreciate that Anthropic was willing to try this. I think it would be good if other labs beyond Anthropic experimented with this structure.
(Note that this message was run by the comms team at Anthropic.)
Yay Anthropic. This is the first example I’m aware of where a lab shared model access with external safety researchers to boost their research (like, not just for evals). I wish the labs did this more.
[Edit: OpenAI shared GPT-4 access with safety researchers including Rachel Freedman before release. OpenAI shared GPT-4 fine-tuning access with academic researchers including Jacob Steinhardt and Daniel Kang in 2023. Yay OpenAI. GPT-4 fine-tuning access is still not public; some widely-respected safety researchers I know recently were wishing for it, and were wishing they could disable content filters.]
I’d be surprised if this was employee-level access. I’m aware of a red-teaming program that gave early API access to specific versions of models, but not anything like employee-level.
It was a secretive program — it wasn’t advertised anywhere, and we had to sign an NDA about its existence (which we have since been released from). I got the impression that this was because OpenAI really wanted to keep the existence of GPT4 under wraps. Anyway, that means I don’t have any proof beyond my word.
(I’m a full-time employee at Anthropic.) It seems worth stating for the record that I’m not aware of any contract I’ve signed whose contents I’m not allowed to share. I also don’t believe I’ve signed any non-disparagement agreements. Before joining Anthropic, I confirmed that I wouldn’t be legally restricted from saying things like “I believe that Anthropic behaved recklessly by releasing [model]”.
I think I could share the literal language in the contractor agreement I signed related to confidentiality, though I don’t expect this is especially interesting as it is just a standard NDA from my understanding.
I do not have any non-disparagement, non-solicitation, or non-interference obligations.
I’m not currently going to share information about any other policies Anthropic might have related to confidentiality, though I am asking about what Anthropic’s policy is on sharing information related to this.
Here is the full section on confidentiality from the contract:
Confidential Information.
(a) Protection of Information. Consultant understands that during the Relationship, the Company intends to provide Consultant with certain information, including Confidential Information (as defined below), without which Consultant would not be able to perform Consultant’s duties to the Company. At all times during the term of the Relationship and thereafter, Consultant shall hold in strictest confidence, and not use, except for the benefit of the Company to the extent necessary to perform the Services, and not disclose to any person, firm, corporation or other entity, without written authorization from the Company in each instance, any Confidential Information that Consultant obtains from the Company or otherwise obtains, accesses or creates in connection with, or as a result of, the Services during the term of the Relationship, whether or not during working hours, until such Confidential Information becomes publicly and widely known and made generally available through no wrongful act of Consultant or of others who were under confidentiality obligations as to the item or items involved. Consultant shall not make copies of such Confidential Information except as authorized by the Company or in the ordinary course of the provision of Services.
(b) Confidential Information. Consultant understands that “Confidential Information” means any and all information and physical manifestations thereof not generally known or available outside the Company and information and physical manifestations thereof entrusted to the Company in confidence by third parties, whether or not such information is patentable, copyrightable or otherwise legally protectable. Confidential Information includes, without limitation: (i) Company Inventions (as defined below); and (ii) technical data, trade secrets, know-how, research, product or service ideas or plans, software codes and designs, algorithms, developments, inventions, patent applications, laboratory notebooks, processes, formulas, techniques, biological materials, mask works, engineering designs and drawings, hardware configuration information, agreements with third parties, lists of, or information relating to, employees and consultants of the Company (including, but not limited to, the names, contact information, jobs, compensation, and expertise of such employees and consultants), lists of, or information relating to, suppliers and customers (including, but not limited to, customers of the Company on whom Consultant called or with whom Consultant became acquainted during the Relationship), price lists, pricing methodologies, cost data, market share data, marketing plans, licenses, contract information, business plans, financial forecasts, historical financial data, budgets or other business information disclosed to Consultant by the Company either directly or indirectly, whether in writing, electronically, orally, or by observation.
(c) Third Party Information. Consultant’s agreements in this Section 5 are intended to be for the benefit of the Company and any third party that has entrusted information or physical material to the Company in confidence. During the term of the Relationship and thereafter, Consultant will not improperly use or disclose to the Company any confidential, proprietary or secret information of Consultant’s former clients or any other person, and Consultant will not bring any such information onto the Company’s property or place of business.
(d) Other Rights. This Agreement is intended to supplement, and not to supersede, any rights the Company may have in law or equity with respect to the protection of trade secrets or confidential or proprietary information.
(e) U.S. Defend Trade Secrets Act. Notwithstanding the foregoing, the U.S. Defend Trade Secrets Act of 2016 (“DTSA”) provides that an individual shall not be held criminally or civilly liable under any federal or state trade secret law for the disclosure of a trade secret that is made (i) in confidence to a federal, state, or local government official, either directly or indirectly, or to an attorney; and (ii) solely for the purpose of reporting or investigating a suspected violation of law; or (iii) in a complaint or other document filed in a lawsuit or other proceeding, if such filing is made under seal. In addition, DTSA provides that an individual who files a lawsuit for retaliation by an employer for reporting a suspected violation of law may disclose the trade secret to the attorney of the individual and use the trade secret information in the court proceeding, if the individual (A) files any document containing the trade secret under seal; and (B) does not disclose the trade secret, except pursuant to court order.
Do you feel like there are any benefits or drawbacks specifically tied to the fact that you’re doing this work as a contractor? (compared to a world where you were not a contractor but Anthropic just gave you model access to run these particular experiments and let Evan/Carson review your docs)
Being a contractor was the most convenient way to make the arrangement.
I would ideally prefer to not be paid by Anthropic[1], but this doesn’t seem that important (as long as the pay isn’t too overly large). I asked to be paid as little as possible and I did end up being paid less than would otherwise be the case (and as a contractor I don’t receive equity). I wasn’t able to ensure that I only get paid a token wage (e.g. $1 in total or minimum wage or whatever).
I think the ideal thing would be a more specific legal contract between me and Anthropic (or Redwood and Anthropic), but (again) this doesn’t seem important.
At least for the current primary purpose of this contracting. I do think that it could make sense to be paid for some types of consulting work. I’m not sure what all the concerns are here.
It seems a substantial drawback that it will be more costly for you to criticize Anthropic in the future.
Many of the people / orgs involved in evals research are also important figures in policy debates. With this incentive Anthropic may gain more ability to control the narrative around AI risks.
It seems a substantial drawback that it will be more costly for you to criticize Anthropic in the future.
As in, if at some point I am currently a contractor with model access (or otherwise have model access via some relationship like this) it will at that point be more costly to criticize Anthropic?
AI labs may provide model access (or other goods), so people who might want to obtain model access might be incentivized to criticize AI labs less.
Is that accurate?
Notably, as described this is not specifically a downside of anything I’m arguing for in my comment or a downside of actually being a contractor. (Unless you think me being a contractor will make me more likely to want model access for whatever reason.)
I agree that this is a concern in general with researchers who could benefit from various things that AI labs might provide (such as model access). So, this is a downside of research agendas with a dependence on (e.g.) model access.
I think various approaches to mitigate this concern could be worthwhile. (Though I don’t think this is worth getting into in this comment.)
Notably, as described this is not specifically a downside of anything I’m arguing for in my comment or a downside of actually being a contractor.
In your comment you say
For some safety research, it’s helpful to have model access in ways that labs don’t provide externally. Giving employee level access to researchers working at external organizations can allow these researchers to avoid potential conflicts of interest and undue influence from the lab. This might be particularly important for researchers working on RSPs, safety cases, and similar, because these researchers might naturally evolve into third-party evaluators.
Related to undue influence concerns, an unfortunate downside of doing safety research at a lab is that you give the lab the opportunity to control the narrative around the research and use it for their own purposes. This concern seems substantially addressed by getting model access through a lab as an external researcher.
I’m essentially disagreeing with this point. I expect that most of the conflict of interest concerns remain when a big lab is giving access to a smaller org / individual.
(Unless you think me being a contractor will make me more likely to want model access for whatever reason.)
From my perspective the main takeaway from your comment was “Anthropic gives internal model access to external safety researchers.” I agree that once you have already updated on this information, the additional information “I am currently receiving access to Anthropic’s internal models” does not change much. (Although I do expect that establishing the precedent / strengthening the relationships / enjoying the luxury of internal model access, will in fact make you more likely to want model access again in the future).
Should we update against seeing relatively fast AI progress in 2025 and 2026? (Maybe (re)assess this after the GPT-5 release.)
Around the early o3 announcement (and maybe somewhat before that?), I felt like there were some reasonably compelling arguments for putting a decent amount of weight on relatively fast AI progress in 2025 (and maybe in 2026):
Maybe AI companies will be able to rapidly scale up RL further because RL compute is still pretty low (so there is a bunch of overhang here); this could cause fast progress if the companies can effectively directly RL on useful stuff or RL transfers well even from more arbitrary tasks (e.g. competition programming)
Maybe OpenAI hasn’t really tried hard to scale up RL on agentic software engineering and has instead focused on scaling up single turn RL. So, when people (either OpenAI themselves or other people like Anthropic) scale up RL on agentic software engineering, we might see rapid progress.
It seems plausible that larger pretraining runs are still pretty helpful, but prior runs have gone wrong for somewhat random reasons. So, maybe we’ll see some more successful large pretraining runs (with new improved algorithms) in 2025.
I updated against this perspective somewhat because:
The releases of 3.7 Sonnet and 4 Opus were somewhat below expectations on this perspective. It looks like there wasn’t some easy way to just actually do a bunch of RL on agentic software engineering (with reasoning?) in a way that makes a massive difference (and wasn’t already in the process of being scaled up). Or, at least Anthropic wasn’t able to pull this off; it seems plausible that Anthropic is substantially worse at RL than OpenAI (at least at some aspects of RL like effectively scaling up RL on more narrow tasks). Interestingly, reasoning doesn’t seem to help Anthropic models on agentic software engineering tasks, but does help OpenAI models.
We haven’t yet seen much better models due to more (or algorithmically improved) pretraining AFAICT.
We haven’t seen OpenAI releases that perform substantially better than o3 at software engineering yet despite o3 being announced 7 months ago. (That said, o3 was actually released only 3 months ago.)
I updated towards thinking that the training of o3 was more focused on software engineering than I previously thought (at least the final release version of o3) and the returns weren’t that big. (This is due to rumors, seeing that OpenAI was training on software engineering tasks here, and based on OpenAI releases and communication like Codex.)
I updated a bit against this perspective due to xAI seemingly scaling things up a bunch, but I don’t put as much weight on this because it seems pretty plausible they just did a bad job scaling things up. (E.g., maybe they didn’t actually scale up RL to pretraining scale or if they did, maybe this RL was mostly compute inefficient RL on lower quality environments. xAI might also just generally be algorithmically behind.)
GPT-5 is expected to be released in 0.5-3 weeks and rumors indicate that it is substantially more focused on practical (agentic) software engineering. This is (arguably) the first major model release from OpenAI since o3, and it should resolve some of our uncertainties (particularly related to whether there was/is a bunch of low hanging fruit at OpenAI due to them not being very focused on software engineering).
My expectation is that GPT-5 will be a decent amount better than o3 on agentic software engineering (both in benchmarks and in practice), but won’t be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon[1] on METR’s evaluation suite[2]. This prediction was produced by extrapolating out the faster 2024-2025 agentic software engineering time horizon trend from o3 and expecting GPT-5 will be slightly below trend.[3]
If GPT-5 is actually a large (way above trend) jump in agentic software engineering with (e.g.) a >6 hour time horizon[4] (which seems plausible but unlikely to me), then we’ll have seen relatively fast (and possibly very fast) software progress in 2025 and we’d naively expect this to continue.[5] If GPT-5 is below trend[6], then it seems like the case against expecting relatively faster AI progress in 2025/2026 due to scaling up RL focused on agentic software engineering is pretty strong.
Overall, I wonder if I have (thus far) insufficiently updated my overall timelines picture based on the observations we’ve had so far in 2025. I’m a bit worried that I’m still operating on cached beliefs when these observations should have pushed away a bunch of the shorter timelines mass. Regardless, I think that the release of GPT-5 (or really, 2-6 weeks after the release of GPT-5 so that we have a better picture of GPT-5's capabilities) will be a good point to (re)assess and consider stronger updates.
Edit: An earlier version of this post said “3.5 hours”, but this was actually a mistake because I thought o3 had a 2 hour time horizon when it actually has a 1.5 hour time horizon. I also edited from “>8” to “>6” at a later point in this post as “>8 hours” was meant to refer to 2 doublings from o3 which is actually “>6 hours”.
I do worry that METR’s evaluation suite will start being less meaningful and noisier for longer time horizons as the evaluation suite was built a while ago. We could instead look at 80% reliability time horizons if we have concerns about the harder/longer tasks.
The faster 2024-2025 agentic software engineering time horizon trend (see figure 19 in METR’s paper) has a 4 month doubling time. o3 was released 4 months before GPT-5 is expected to be released and o3 has a 1.5 hour time horizon (edit: this used to say 2 hours, which was a mistake), so this yields a 3 hour time horizon for GPT-5 (see the sketch after these footnotes). I think that GPT-5 is more likely than not to be below trend (on at least METR’s specific evaluation), so I round this down a bit to 2.75 hours, though I have a pretty wide confidence interval. I expect below trend rather than above trend due to some early reports about GPT-5, the trend being pretty fast, Opus 4 having lower than expected results, and thinking that the METR evaluation suite might have issues with larger time horizons that result in misleadingly lower numbers.
Again, I’d want to look at multiple metrics. I’m referring to seeing agentic software engineering performance that looks analogous to a >6 hour time horizon on METR’s evaluation suite when aggregating over multiple relevant metrics.
It seems more likely to be a massive jump if OpenAI actually wasn’t yet very focused on agentic software engineering when training o3, but is more focused on this now. This article claims that something like this is the case.
It’s harder to confidently notice that GPT-5 is below trend than it is to tell that GPT-5 is way above trend. We should expect it’s some amount better than o3, and the difference between a 2 and a 3 hour time horizon is legitimately hard to measure.
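For concreteness, here is the arithmetic from the footnote above as a minimal sketch (the 4-month doubling time and the 1.5 hour o3 baseline come from that footnote; the function itself is just illustrative):

```python
def extrapolate_time_horizon(baseline_hours, months_elapsed, doubling_time_months):
    """Extrapolate a 50% time horizon forward, assuming exponential growth."""
    return baseline_hours * 2 ** (months_elapsed / doubling_time_months)

# o3 baseline of 1.5 hours, ~4 months until GPT-5, 4-month doubling time:
print(extrapolate_time_horizon(1.5, 4, 4))  # 3.0 hours on trend (rounded down to 2.75 for "slightly below trend")
print(extrapolate_time_horizon(1.5, 8, 4))  # 6.0 hours = two doublings, the ">6 hour" way-above-trend mark
```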
I basically agree with this whole post. I used to think there were double-digit % chances of AGI in each of 2024 and 2025 and 2026, but now I’m more optimistic; it seems like “Just redirect existing resources and effort to scale up RL on agentic SWE” is now unlikely to be sufficient (whereas in the past we didn’t have trends to extrapolate and we had some scary big jumps like o3 to digest).
I still think there’s some juice left in that hypothesis though. Consider how in 2020, one might have thought “Now they’ll just fine-tune these models to be chatbots and it’ll become a mass consumer product” and then in mid-2022 various smart people I know were like “huh, that hasn’t happened yet, maybe LLMs are hitting a wall after all” but it turns out it just took till late 2022/early 2023 for the kinks to be worked out enough.
Also, we should have some credence on new breakthroughs e.g. neuralese, online learning, whatever. Maybe like 8%/yr? Of a breakthrough that would lead to superhuman coders within a year or two, after being appropriately scaled up and tinkered with.
Re neuralese, online/continual learning, or long-term memory that isn’t solely a context-window breakthrough: I’m much more skeptical that such breakthroughs would be easy to integrate on short timelines, because changes would likely have to be made to the architecture that aren’t easy to do very quickly.
The potential for breakthroughs, combined with Moore’s law continuing to make lots of compute cheap for researchers, is a reason my median timelines aren’t in the latter half of the century. But I think it’s much more implausible to get it working very soon, so I’m much closer to 0.3% a year for 2025-2027.
@Mo Putera @the gears to ascension take the “Moore’s law will continue” point as a prediction that new paradigms like memristors will launch new S-curves of efficiency until we reach the Landauer Limit, which is 6.5 OOMs away, and that the current paradigm has 200x more efficiency savings to go:
Interestingly, reasoning doesn’t seem to help Anthropic models on agentic software engineering tasks, but does help OpenAI models.
I use ‘ultrathink’ in Claude Code all the time and find that it makes a difference.
I do worry that METR’s evaluation suite will start being less meaningful and noisier for longer time horizons as the evaluation suite was built a while ago. We could instead look at 80% reliability time horizons if we have concerns about the harder/longer tasks.
I’m overall skeptical of overinterpreting/extrapolating the METR numbers. It is far too anchored on the capabilities of a single AI model, a lightweight scaffold, and a notion of ‘autonomous’ task completion of ‘human-hours’. I think this is a mental model for capabilities progress that will lead to erroneous predictions.
If you are trying to capture the absolute frontier of what is possible, you don’t only test a single-acting model in an empty codebase with limited internet access and scaffolding. I would personally be significantly less capable at agentic coding if I only used 1 model (like replicating subliminal learning in about 1 hour of work + 2 hours of waiting for fine-tunes on the day of the release) with limited access to resources. You are instead using a variety of AI models based on their pros and cons[1], with well-crafted codebases for agentic coding and giving them access to whatever they want on the internet as a reference (+ much more)[2]. METR does note this limitation, but I want to emphasize its importance and potential for misleading extrapolations if people only consider the headline charts without considering the nuance.
I had a notification ping in my brain just now while using claude code and realizing I’d just told it to think for a long time: I don’t think the claim is true, because it doesn’t match my experience.
Anthropic reports SWE-bench scores without reasoning, which is some evidence it doesn’t help (much) on this sort of task. (See e.g. the release blog post for 4 Opus.)
Anecdotal evidence
Probably it would be more accurate to say “doesn’t seem to help much while it helps a lot for openai models”.
I think the non-formal IMO gold was unexpected, and we heard explicitly that it won’t be in GPT-5. So I would wait to see how it pans out. It may not matter in 2025, but I think it could in 2026.
Why should we think that the relevant progress driving non-formal IMO is very important for plausibly important capabilities like agentic software engineering? I’d guess the transfer is relatively weak unless the IMO results were driven by general purpose advances. This seems somewhat unlikely: if the main breakthrough was in better performance on non-trivial-to-verify tasks (as various posts from OpenAI people claim), then even if this generalizes well beyond proofs this wouldn’t obviously particularly help with agentic software engineering (where the core blocker doesn’t appear to be verification difficulty).
Edit: I think I mostly retract this comment, see below.
Why should we think that the relevant progress driving non-formal IMO is very important for plausibly important capabilities like agentic software engineering? [...] if the main breakthrough was in better performance on non-trivial-to-verify tasks (as various posts from OpenAI people claim), then even if this generalizes well beyond proofs this wouldn’t obviously particularly help with agentic software engineering (where the core blocker doesn’t appear to be verification difficulty).
I’m surprised by this. To me it seems hugely important how fast AIs are improving on tasks with poor feedback loops, because obviously they’re in a much better position to improve on easy-to-verify tasks, so “tasks with poor feedback loops” seem pretty likely to be the bottleneck to an intelligence explosion.
So I definitely do think that “better performance on non-trivial-to-verify tasks” are very important for some “plausibly important capabilities”. Including agentic software engineering. (Like: This also seems related to why the AIs are much better at benchmarks than at helping people out with their day-to-day work.)
Hmm, yeah I think you’re right, though I also don’t think I articulated what I was trying to say very well.
Like I think my view is:
There was some story where we would see very fast progress in relatively easy to verify (or trivial to verify) tasks and I’m talking about that. It seems like agentic software engineering could reach very high levels without necessarily needing serious improvements in harder to verify tasks.
Faster progress in non-trivial-to-verify tasks might not be the limiting factor if progress in easy to verify tasks isn’t that fast.
I still think that there won’t be a noticeable jump as the IMO methods make it into production models, but this is due to more general heuristics (and the methods maybe still matter, it just won’t be something to wait for, I think).
I think IMO results were driven by general purpose advances, but I agree I can’t conclusively prove it because we don’t know details. Hopefully we will learn more as time goes by.
An informal argument: I think currently agentic software engineering is blocked on context rot, among other things. I expect IMO systems to have improved on this, since IMO time control is 1.5 hours per problem.
(I’m skeptical that much of the IMO improvement was due to improving how well AIs can use their context in general. This isn’t a crux for my view, but it also seems pretty likely that the AIs didn’t do more than ~100k serial tokens of reasoning for the IMO while still aggregating over many such reasoning traces.)
GPT-5 reached 2h17m, which seems like excellent news. However, excluding spurious failures would bring GPT-5's performance to 2h41m, which aligns with Greenblatt’s prediction. Moreover, METR evaluators themselves think that “GPT-5 could have benefitted from a larger token budget”, implying that the benchmark is beginning to degrade. What other relevant metrics exist?
The AI-2027 forecast has mid-2025 agents reach 85% on SWE-bench verified and 65% on the OSWorld benchmark.
OSWorld reached 60% on August 4 if we use no filters. SWE-bench with a minimal agent has Claude Opus 4 (20250514) reach 67.6% when evaluated in August. Moreover, as of August 7 the only models that SWE-bench had evaluated after July 1st were Claude 4 Opus and two Chinese models. In June SWE-bench Verified reached 75% with TRAE. And now TRAE claims to use Grok 4 and Kimi K2.
Grok 4 managed to fail on tasks that take 2-4 seconds(!!) and 2-4 minutes, and to experience a fiasco on 2-4 hour long tasks. Page 22 of the METR paper could imply that the dataset contains few tasks that are 2-4 hrs long. If tasks taking 2-4 seconds, minutes, or hours “sandbagged” Grok’s 80% time horizon to 15 minutes, then the metric underestimates Grok’s true capabilities.
While there are no estimates of Gemini 2.5-Deep Think, which was released on August 1, IIRC a LessWronger claimed that the public version received a bronze medal on IMO 2025. Another LessWronger claimed that “Gemini was ahead of openai on the IMO gold. The output was more polished so presumably they achieved a gold worthy model earlier. I expect gemini’s swe bench to thus at least be ahead of OpenAI’s 75%. ”
To conclude, I doubt that we still have benchmarks that can be relied upon to quickly estimate models’ capabilities: SWE-bench and OSWorld are likely too slow, and METR’s suite has begun to fill with noise. While we do still have ARC-AGI, Grok’s success could have demonstrated that it can be gamed. And that’s ignoring Claude’s potential improvements after Opus 4.1...
EDIT: TRAE uses an unknown scaffolding. However, applying mini-SWE-agent to Claude 4 Opus (20250514) yields better results than GPT-5, implying that other benchmarks might also increase after the Claude Opus 4 update to 4.1 and future updates.
My expectation is that GPT-5 will be a decent amount better than o3 on agentic software engineering (both in benchmarks and in practice), but won’t be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon[1] on METR’s evaluation suite[2]. This prediction was produced by extrapolating out the faster 2024-2025 agentic software engineering time horizon trend from o3 and expecting GPT-5 will be slightly below trend.[3]
If the correlations continue to hold, this would map to something like a 78% to 80% range on SWE-bench pass@1 (which is likely to be announced at release). I’m personally not this bearish (I’d guess low 80s given that the benchmark has reliably jumped ~3.5% monthly), but we shall see.
Needless to say if it scores 80%, we are well below AI 2027 timeline predictions with high confidence.
The data is pretty low-quality for that graph because the agents we used were inconsistent and Claude 3-level models could barely solve any tasks. Epoch has better data for SWE-bench Verified, which I converted to time horizon here and found to also be doubling every 4 months ish. Their elicitation is probably not as good for OpenAI as Anthropic models, but both are increasing at similar rates.
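For readers less familiar with how a benchmark gets converted into a time horizon, below is a minimal sketch of the METR-style approach: fit a logistic curve of success probability against log task length and read off where it crosses 50%. The data here is made up, and the actual conversion of the Epoch SWE-bench numbers may differ in its details.

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up data: human completion time (minutes) for each task, and whether the model solved it.
task_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480], dtype=float)
solved       = np.array([1, 1, 1,  1,  1,  0,   1,   0], dtype=float)

def success_curve(log2_minutes, log2_horizon, slope):
    # P(success) falls off with log task length; the 50% horizon is where it crosses 0.5.
    return 1.0 / (1.0 + np.exp(slope * (log2_minutes - log2_horizon)))

(log2_horizon, slope), _ = curve_fit(success_curve, np.log2(task_minutes), solved, p0=[6.0, 1.0])
print(f"estimated 50% time horizon: {2 ** log2_horizon:.0f} minutes")
```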
I think that, even before the release of GPT-5 and setting aside Grok 4’s problems, there is a weak case against non-neuralese AI progress being fast. Recall the METR measurements.
The time horizon of base LLMs experienced a slowdown or plateau[1] between GPT-4 (5 minutes, Mar′23) and GPT-4o (9 min, May ’24).
Evaluation of Chinese models has DeepSeek’s time horizons[2] change only from 18 to 31 minutes between[3] V3 (Dec ’24) and R1-0528 (May ’25).
While Grok 4 was likely trained incompetently[4] and/or for the benchmarks, its 50% time horizon is 1.83 hrs (vs. o3’s 1.54 hrs) and its 80% time horizon is 15 min (vs. o3’s 20 min). In other words, Grok 4’s performance is comparable to that of o3.
Taken together, the two plateaus and Grok 4’s failure suggest a troubling pattern: creating an AGI is likely to require[5] neuralese, which will likely prevent humans from noticing misalignment.
Alas, METR’s evaluation of DeepSeek’s capabilities might have missed “agent scaffolds which could elicit the capabilities of the evaluated models much more effectively”. If there exists an alternate scaffold where R1-0528 becomes a capable agent and V3 doesn’t, then DeepSeek’s models are not on a plateau.
In addition, DeepSeek V3 released in December didn’t use a CoT. If the main ingredient necessary for capabilities increase is a MoE, not the CoT, then what can be said about Kimi K2?
Grok 4 could have also been deliberately trained on complex tasks, which might have made the success rate less time-dependent. After all, it did reach 16% on the ARC-AGI-2 benchmark.
There is, however, Knight Lee’s proposal for the creation of many agents that have access to each other’s CoTs and work in parallel. While Grok 4 Heavy could be a step in this direction, its agents only receive access to each other’s CoTs after they finish the work.
Recently, various groups successfully lobbied to remove the moratorium on state AI bills. This involved a surprising amount of success while competing against substantial investment from big tech (e.g. Google, Meta, Amazon). I think people interested in mitigating catastrophic risks from advanced AI should consider working at these organizations, at least to the extent their skills/interests are applicable. This is both because they could often directly work on substantially helpful things (depending on the role and organization) and because this would yield valuable work experience and connections.
I worry somewhat that this type of work is neglected due to being less emphasized and seeming lower status. Consider this an attempt to make this type of work higher status.
Pulling organizations mostly from here and here we get a list of orgs you could consider trying to work (specifically on AI policy) at:
Fairplay (Fairplay is a kids safety organization which does a variety of advocacy which isn’t related to AI. Roles/focuses on AI would be most relevant. In my opinion, working on AI related topics at Fairplay is most applicable for gaining experience and connections.)
Kids safety seems like a pretty bad thing to focus on, in the sense that the vast majority of kids safety activism causes very large amounts of harm (and it helping in this case really seems like a “stopped clock is right twice a day” situation).
I looked at the FairPlay website and agree that “banning schools from contacting kids on social media” or “preventing Gemini rollouts to under-13s” is not coherent under my threat model. However I think there is clear evidence that current parental screen time controls may not be a sufficiently strong measure to mitigate extant generational mental health issues (I am particularly worried about insomnia, depression, eating disorders, autism spectrum disorders, and self harm).
Zvi had previously reported on YouTube Shorts reaching 200B daily views. This is clearly a case of egregiously user-hostile design with major social and public backlash. I could not find a canonical citation on medRxiv, and I don’t believe it would be ethical to run a large-scale experiment on the long-term impacts of this, but there are observational studies. Given historical cases of model sycophancy and the hiring of directors focused on maximizing engagement, I think similar design outcomes are not implausible here.
I think that the numbers in this Anthropic blog post https://www.anthropic.com/news/how-people-use-claude-for-support-advice-and-companionship do not accurately portray reality. They report only 0.5% of conversations as being romantic or sexual roleplay, but I consider this to be misleading because they exclude chats focused on content creation tasks (such as writing stories, blog posts, or fictional dialogues), which their previous research found to be a major use case. Because the models are trained to refuse requests for explicit content, it’s common for jailbreaks to start by saying “it’s okay to do this because it’s just a fictional scenario in a story”. Anecdotally I have heard labs don’t care about this much in contrast to CBRN threats.
Let’s look at the top ten apps ranked by tokens on https://openrouter.ai/rankings. OpenRouter is best known for hosting free API instances of DeepSeek v3 and r1, which was the only way to get high usage out of SOTA LLMs for free before the Google AI Studio price drop for Gemini 2.5 Pro. It is not the best proxy for real-world usage because it requires technical sophistication, and this is reflected in the first four apps (cline, roo code, litellm, and kilo code are all for software development). But the next four (sillytavern, chub ai, hammerai, roleplai) all indicate that the distribution of tasks done with models at this capability level does not differ significantly from the distribution of tasks people visit websites for. That said, I wouldn’t morally panic about this, since it seems likely to me that conventional security methods will be good enough to mostly prevent us from turning into glitchers.
Kids safety activists are one of the only groups with a track record of introducing AI capabilities restrictions which actually get enforced. Multimodal models can now create both images and text, but the image models are more locked down (Gemini 2.5 defaults to stricter block thresholds for image generation than for text generation), and I think that this would not be the case without people focusing on kids safety. It’s possible for there to be AI Safety issues which affect children right now that are highly relevant to existential risks and this is a common topic in novice discussions of alignment.
I strongly agree. I can’t vouch for all of the orgs Ryan listed, but Encode, ARI, and AIPN all seem good to me (in expectation), and Encode seems particularly good and competent.
They also did a lot of calling to US representatives, as did people they reached out to.
ControlAI did something similar and also partnered with SiliConversations, a youtuber, to get the word out to more people, to get them to call their representatives.
I thought it would be helpful to post about my timelines and what the timelines of people in my professional circles (Redwood, METR, etc) tend to be.
Concretely, consider the outcome of: AI 10x’ing labor for AI R&D[1], measured by internal comments by credible people at labs that AI is 90% of their (quality adjusted) useful work force (as in, as good as having your human employees run 10x faster).
Here are my predictions for this outcome:
25th percentile: 2 years (Jan 2027)
50th percentile: 5 years (Jan 2030)
The views of other people (Buck, Beth Barnes, Nate Thomas, etc) are similar.
I expect that outcomes like “AIs are capable enough to automate virtually all remote workers” and “the AIs are capable enough that immediate AI takeover is very plausible (in the absence of countermeasures)” come shortly after (median 1.5 years and 2 years after respectively under my views).
@ryan_greenblatt can you say more about what you expect to happen from the period in-between “AI 10Xes AI R&D” and “AI takeover is very plausible?”
I’m particularly interested in getting a sense of what sorts of things will be visible to the USG and the public during this period. Would be curious for your takes on how much of this stays relatively private/internal (e.g., only a handful of well-connected SF people know how good the systems are) vs. obvious/public/visible (e.g., the majority of the media-consuming American public is aware of the fact that AI research has been mostly automated) or somewhere in-between (e.g., most DC tech policy staffers know this but most non-tech people are not aware.)
I don’t feel very well informed and I haven’t thought about it that much, but in short timelines (e.g. my 25th percentile): I expect that we know what’s going on roughly within 6 months of it happening, but this isn’t salient to the broader world. So, maybe the DC tech policy staffers know that the AI people think the situation is crazy, but maybe this isn’t very salient to them. A 6 month delay could be pretty fatal even for us as things might progress very rapidly.
Note that the production function of the 10x really matters. If it’s “yeah, we get to net-10x if we have all our staff working alongside it,” it’s much more detectable than, “well, if we only let like 5 carefully-vetted staff in a SCIF know about it, we only get to 8.5x speedup”.
(It’s hard to prove that the results are from the speedup instead of just, like, “One day, Dario woke up from a dream with The Next Architecture in his head”)
AI is 90% of their (quality adjusted) useful work force (as in, as good as having your human employees run 10x faster).
I don’t grok the “% of quality adjusted work force” metric. I grok the “as good as having your human employees run 10x faster” metric but it doesn’t seem equivalent to me, so I recommend dropping the former and just using the latter.
Fair, I really just mean “as good as having your human employees run 10x faster”. I said “% of quality adjusted work force” because this was the original way this was stated when a quick poll was done, but the ultimate operationalization was in terms of 10x faster. (And this is what I was thinking.)
Basic clarifying question: does this imply under-the-hood some sort of diminishing returns curve, such that the lab pays for that labor until it reaches a net 10x speedup, but can’t squeeze out much more?
And do you expect that’s a roughly consistent multiplicative factor, independent of lab size? (I mean, I’m not sure lab size actually matters that much, to be fair, it seems that Anthropic keeps pace with OpenAI despite being smaller-ish)
Yeah, for it to reach exactly 10x as good, the situation would presumably be that this was the optimum point given diminishing returns to spending more on AI inference compute. (It might be the returns curve looks very punishing. For instance, many people get a relatively large amount of value from extremely cheap queries to 3.5 Sonnet on claude.ai and the inference cost of this is very small, but greatly increasing the cost (e.g. o1-pro) often isn’t any better because 3.5 Sonnet already gave an almost perfect answer.)
I don’t have a strong view about AI acceleration being a roughly constant multiplicative factor independent of the number of employees. Uplift just feels like a reasonably simple operationalization.
I’d guess that xAI, Anthropic, and GDM are more like 5-20% faster all around (with much greater acceleration on some subtasks). It seems plausible to me that the acceleration at OpenAI is already much greater than this (e.g. more like 1.5x or 2x), or will be after some adaptation due to OpenAI having substantially better internal agents than what they’ve released. (I think this due to updates from o3 and general vibes.)
I was saying 2x because I’ve memorised the results from this study. Do we have better numbers today? R&D is harder, so this is an upper bound. However, since this was from one year ago, perhaps the factors cancel each other out?
This case seems extremely cherry picked for cases where uplift is especially high. (Note that this is in copilot’s interest.) Now, this task could probably be solved autonomously by an AI in like 10 minutes with good scaffolding.
I think you have to consider the full diverse range of tasks to get a reasonable sense or at least consider harder tasks. Like RE-bench seems much closer, but I still expect uplift on RE-bench to probably (but not certainly!) considerably overstate real world speed up.
Yeah, fair enough. I think someone should try to do a more representative experiment and we could then monitor this metric.
btw, something that bothers me a little bit with this metric is the fact that a very simple AI that just asks me periodically “Hey, do you endorse what you are doing right now? Are you time boxing? Are you following your plan?” makes me (I think) significantly more strategic and productive. Similar to I hired 5 people to sit behind me and make me productive for a month. But this is maybe off topic.
btw, something that bothers me a little bit with this metric is the fact that a very simple AI …
Yes, but I don’t see a clear reason why people (working in AI R&D) will in practice get this productivity boost (or other very low hanging things) if they don’t get around to getting the boost from hiring humans.
Thanks for this—I’m in a more peripheral part of the industry (consumer/industrial LLM usage, not directly at an AI lab), and my timelines are somewhat longer (5 years for 50% chance), but I may be using a different criterion for “automate virtually all remote workers”. It’ll be a fair bit of time (in AI frame—a year or ten) between “labs show generality sufficient to automate most remote work” and “most remote work is actually performed by AI”.
A key dynamic is that I think massive acceleration in AI is likely after the point when AIs can accelerate labor working on AI R&D. (Due to all of: the direct effects of accelerating AI software progress, this acceleration rolling out to hardware R&D and scaling up chip production, and potentially greatly increased investment.) See also here and here.
So, you might very quickly (1-2 years) go from “the AIs are great, fast, and cheap software engineers speeding up AI R&D” to “wildly superhuman AI that can achieve massive technical accomplishments”.
I think massive acceleration in AI is likely after the point when AIs can accelerate labor working on AI R&D.
Fully agreed. And the trickle-down from AI-for-AI-R&D to AI-for-tool-R&D to AI-for-managers-to-replace-workers (and -replace-middle-managers) is still likely to be a bit extended. And the path is required—just like self-driving cars: the bar for adoption isn’t “better than the median human” or even “better than the best affordable human”, but “enough better that the decision-makers can’t find a reason to delay”.
While I do spend some time discussing AGI timelines (and I’ve written some posts about it recently), I don’t think moderate quantitative differences in AGI timelines matter that much for deciding what to do[1]. For instance, having a 15-year median rather than a 6-year median doesn’t make that big of a difference. That said, I do think that moderate differences in the chance of very short timelines (i.e., less than 3 years) matter more: going from a 20% chance to a 50% chance of full AI R&D automation within 3 years should potentially make a substantial difference to strategy.[2]
Additionally, my guess is that the most productive way to engage with discussion around timelines is mostly to not care much about resolving disagreements, but then when there appears to be a large chance that timelines are very short (e.g., >25% in <2 years) it’s worthwhile to try hard to argue for this.[3] I think takeoff speeds are much more important to argue about when making the case for AI risk.
I do think that having somewhat precise views is helpful for some people in doing relatively precise prioritization within people already working on safety, but this seems pretty niche.
Given that I don’t think timelines are that important, why have I been writing about this topic? This is due to a mixture of: I find it relatively quick and easy to write about timelines, my commentary is relevant to the probability of very short timelines (which I do think is important as discussed above), a bunch of people seem interested in timelines regardless, and I do think timelines matter some.
Consider reflecting on whether you’re overly fixated on details of timelines.
I’ve seen Richard Ngo make this point before, though I couldn’t find where he did this. More generally, this isn’t a very original point; I just think it’s worth making given that I’ve been talking about timelines recently.
You could have views such that you expect to never be >25% confident in <2-year timelines until it’s basically too late. For instance, maybe you expect very fast takeoff driven by a single large algorithmic advance. Under this view, I think arguing about the details of timelines looks even less good and you should mostly make the case for risk independently of this, perhaps arguing “it seems like AI could emerge quickly and unexpectedly, so we need to act now”.
I think most of the value in researching timelines is in developing models that can then be quickly updated as new facts come to light. As opposed to figuring out how to think about the implications of such facts only after they become available.
People might substantially disagree about parameters of such models (and the timelines they predict) while agreeing on the overall framework, and building common understanding is important for coordination. Also, you wouldn’t necessarily a priori know which facts to track, without first having developed the models.
For people who are comparatively advantaged at this, it seems good to try to make the case for this in a variety of different ways. One place to start is to try to convince relatively soft target audiences like me (who’s sympathetic but disagrees) by e.g. posting on LW and then go somewhere from here.
I think it’s a rough task, but ultimately worth trying.
Personally it will be impossible for me to ignore the part of me that wonders “is this AGI/ASI stuff actually, for real, coming, or will it turn out to be fake.” Studying median timelines bleeds into the question of whether AGI by my natural lifespan is 90% likely or 99.5% likely, and vice versa. So I will continue thinking very carefully about evidence of AGI progress.
Absence of AGI[1] by (say) 2055 is predicted by models that deserve to be developed in earnest (I’d currently give the claim 15%, with 10% mostly for technological reasons and 5% mostly because of a human-instituted lasting Pause or a disaster). This doesn’t significantly affect the median timeline yet, but as time goes on these models can get stronger (Moore’s law even in price-performance form breaking down, continual learning turning out to be a grand algorithmic obstruction that might take decades to solve, with in-context learning not good enough for this purpose within available compute). And this would start affecting the median timeline more and more. Also, development of AGI might result in a lasting ASI[2] Pause (either through societal backlash or from AGIs themselves insisting on this to prevent ASIs misaligned with them before they figure out how to align ASIs).
AGIs are AIs unbounded in ability to develop civilization on their own, without needing substantial human input, including by inventing aligned-with-them ASIs.
ASIs are qualitatively more intelligent than humans or humanity, while non-ASI AGIs are reasonably comparable to humans or humanity, even if notably more capable.
Slightly hot take: Longtermist capacity/community building is pretty underdone at current margins and retreats (focused on AI safety, longtermism, or EA) are also underinvested in.
By “longtermist community building”, I mean community building focused on longtermism rather than on AI safety specifically. I think retreats are generally underinvested in at the moment.
I’m also sympathetic to thinking that general undergrad and high school capacity building (AI safety, longtermist, or EA) is underdone, but this seems less clear-cut.
I think this underinvestment is due to a mix of mistakes on the part of Open Philanthropy (and Good Ventures)[1] and capacity building being lower status than it should be.
Here are some reasons why I think this work is good:
It’s very useful for there to be people who are actually trying really hard to do the right thing and they often come through these sorts of mechanisms. Another way to put this is that flexible, impact-obsessed people are very useful.
Retreats make things feel much more real to people and result in people being more agentic and approaching their choices more effectively.
Programs like MATS are good, but they get somewhat different people at a somewhat different part of the funnel, so they don’t (fully) substitute.
A large part of why I’m writing this is to try to make this work higher status and to encourage more of this work. Consider yourself to be encouraged and/or thanked if you’re working in this space or planning to work in this space.
I think these mistakes are: underfunding this work, Good Ventures being unwilling to fund some versions of this work, failing to encourage people to found useful orgs in this space, and hiring out many of the best people in this space to instead do (IMO less impactful) grantmaking.
If someone wants to give Lightcone money for this, we could probably fill a bunch of this gap. No definitive promises (and happy to talk to any donor for whom this would be cruxy about what we would be up for doing and what we aren’t), but we IMO have a pretty good track record of work in the space, and of course having Lighthaven helps. Also if someone else wants to do work in the space and run stuff at Lighthaven, happy to help in various ways.
I think the Sanity & Survival Summit that we ran in 2022 would be an obvious pointer to something I would like to run more of (I would want to change some things about the framing of the event, but I overall think that was pretty good).
Another thing I’ve been thinking about is a retreat on something like “high-integrity AI x-risk comms” where people who care a lot about x-risk and care a lot about communicating it accurately to a broader audience can talk to each other (we almost ran something like this in early 2023). Think Kelsey, Palisade, Scott Alexander, some people from Redwood, some of the MIRI people working on this, maybe some people from the labs. Not sure how well it would work, but it’s one of the things I would most like to attend (and to what degree that’s a shared desire would come out quickly in user interviews).
Though my general sense is that it’s a mistake to try to orient things like this too much around a specific agenda. You mostly want to leave it up to the attendees to figure out what they want to talk to each other about, and do a bunch of surveying and scoping of who people want to talk to each other more, and then just facilitate a space and a basic framework for those conversations and meetings to happen.
Another thing I’ve been thinking about is a retreat on something like “high-integrity AI x-risk comms” where people who care a lot about x-risk and care a lot about communicating it accurately to a broader audience can talk to each other.
I think this is a great idea that would serve an urgent need. I’d urge you to do it in the near future.
Agree with both the OP and Habryka’s pitch. The Meetup Organizers Retreat hosted at Lighthaven in 2022 was a huge inflection point for my personal involvement with the community.
I think there are some really big advantages to having people who are motivated by longtermism and by doing good in a scope-sensitive way, rather than just by trying to prevent AI takeover or, even more broadly, to “help with AI safety”.
AI safety field building has been popular in part because there is a very broad set of perspectives from which it makes sense to worry about technical problems related to societal risks from powerful AI. (See e.g. Simplify EA Pitches to “Holy Shit, X-Risk”.) This kind of field building gets you lots of people who are worried about AI takeover risk, or more broadly, about problems related to powerful AI. But it doesn’t get you people who have a lot of other parts of the EA/longtermist worldview, like:
Being scope-sensitive
Being altruistic/cosmopolitan
Being concerned about the moral patienthood of a wide variety of different minds
Being interested in philosophical questions about acausal trade
People who do not have the longtermist worldview and who work on AI safety are useful allies and I’m grateful to have them, but they have some extreme disadvantages compared to people who are on board with more parts of my worldview. And I think it would be pretty sad to have the proportion of people working on AI safety who have the longtermist perspective decline further.
It feels weird to me to treat longtermism as an ingroup/outgroup divider. I guess I think of myself as not really EA/longtermist. I mostly care about the medium-term glorious transhumanist future. I don’t really base my actions on the core longtermist axiom; I only care about the unimaginably vast number of future moral patients indirectly, through caring about humanity being able to make and implement good moral decisions a hundred years from now.
The main thing I look at to determine whether someone is value-aligned with me is whether they care about making the future go well (in a vaguely ambitious transhumanist coded way), as opposed to personal wealth or degrowth or whatever.
Yeah, maybe I’m using the wrong word here. I do think there is a really important difference between people who are scope-sensitively altruistically motivated and who are in principle willing to make decisions based on abstract reasoning about the future (which I probably include you in), and people who aren’t.
I have the impression that neither “short” nor “medium”-termist EAs (insofar as those are the labels they use for themselves) care much about 100 years from now. With ~30-50 years being what seems what the typical “medium”-termist EA cares about. So if you care about 100 years, and take “weird” ideas seriously, I think at least I would consider that long-termist. But it has been a while since I’ve consistently read the EA forum.
I think the general category of AI safety capacity building isn’t underdone (there’s quite a lot of it) while I think stuff aiming more directly on longtermism (and AI futurism etc) is underdone. Mixing the two is reasonable tbc, and some of the best stuff focuses on AI safety while mixing in longtermism/futurism/etc. But, lots of the AI safety capacity building is pretty narrow in practice.
While I think the general category of AI safety capacity building isn’t underdone, I do think that (AI safety) retreats in particular are underinvested in.
Inference compute scaling might imply we first get fewer, smarter AIs.
Prior estimates imply that the compute used to train a future frontier model could also be used to run tens or hundreds of millions of human equivalents per year at the first time when AIs are capable enough to dominate top human experts at cognitive tasks[1] (examples here from Holden Karnofsky, here from Tom Davidson, and here from Lukas Finnveden). I think inference time compute scaling (if it worked) might invalidate this picture and might imply that you get far smaller numbers of human equivalents when you first get performance that dominates top human experts, at least in short timelines where compute scarcity might be important. Additionally, this implies that at the point when you have abundant AI labor which is capable enough to obsolete top human experts, you might also have access to substantially superhuman (but scarce) AI labor (and this could pose additional risks).
The point I make here might be obvious to many, but I thought it was worth making as I haven’t seen this update from inference time compute widely discussed in public.[2]
However, note that if inference compute allows for trading off between quantity of tasks completed and the difficulty of tasks that can be completed (or the quality of completion), then depending on the shape of the inference compute returns curve, at the point when we can run some AIs as capable as top human experts, it might be worse to run many (or any) AIs at this level of capability rather than using less inference compute per task and completing more tasks (or completing tasks serially faster).
Further, efficiency might improve quickly such that we don’t have a long regime with only a small number of human equivalents. I do a BOTEC on this below.
I’ll do a specific low-effort BOTEC to illustrate my core point that you might get far smaller quantities of top human expert-level performance at first. Suppose that we first get AIs that are ~10x human cost (putting aside inflation in compute prices due to AI demand) and as capable as top human experts at this price point (at tasks like automating R&D). If this is in ~3 years, then maybe you’ll have $15 million/hour worth of compute. Supposing $300/hour human cost, then we get ($15 million/hour) / ($300/hour) / (10 times human cost per compute dollar) * (4 AI hours / human work hours) = 20k human equivalents. This is a much smaller number than prior estimates.
The estimate of $15 million/hour worth of compute comes from: OpenAI spent ~$5 billion on compute this year, so $5 billion / (24*365) = $570k/hour; spend increases by ~3x per year, so $570k/hour * 3³ = $15 million.
The estimate for 3x per year comes from: total compute is increasing by 4-5x per year, but some is hardware improvement and some is increased spending. Hardware improvement is perhaps ~1.5x per year and 4.5/1.5 = 3. This at least roughly matches this estimate from Epoch which estimates 2.4x additional spend (on just training) per year. Also, note that Epoch estimates 4-5x compute per year here and 1.3x hardware FLOP/dollar here, which naively implies around 3.4x, but this seems maybe too high given the prior number.
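To make the arithmetic above easy to check and tweak, here is the same BOTEC as a few lines of Python; all inputs are the rough assumptions stated above, not precise figures.

```python
# The BOTEC above, restated so the assumptions are easy to tweak.
compute_spend_per_year = 5e9       # ~$5B/year on compute (rough estimate)
spend_growth_per_year = 3          # ~3x/year growth in compute spend
years_ahead = 3

compute_per_hour = compute_spend_per_year / (24 * 365) * spend_growth_per_year ** years_ahead
# ~$570k/hour today * 3^3 ≈ $15M/hour of compute in ~3 years

human_cost_per_hour = 300          # $/hour for a top human expert
ai_cost_multiplier = 10            # AI assumed ~10x human cost for equivalent work
ai_hours_per_human_hour = 4        # AIs work ~4 hours per human work hour

human_equivalents = (compute_per_hour / human_cost_per_hour
                     / ai_cost_multiplier * ai_hours_per_human_hour)
print(f"~${compute_per_hour / 1e6:.0f}M/hour of compute -> ~{human_equivalents:,.0f} human equivalents")
# -> roughly 20k human equivalents
```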
Earlier, I noted that efficiency might improve rapidly. We can look at recent efficiency changes to get a sense for how fast. GPT-4o mini is roughly 100x cheaper than GPT-4 and is perhaps roughly as capable overall (probably lower intelligence but better elicited). It was trained roughly 1.5 years later (GPT-4 was trained substantially before it was released), for ~20x efficiency improvement per year. This is selected for being a striking improvement and probably involves low-hanging fruit, but AI R&D will be substantially accelerated in the future, which probably more than cancels this out. Further, I expect that inference compute will be inefficient in the tail of high inference compute, such that efficiency gains will be faster than this once the capability is first reached. So we might expect that the number of AI human equivalents increases by >20x per year, and potentially much faster if AI R&D is greatly accelerated (and compute doesn’t bottleneck this). If progress is “just” 50x per year, then it would still take a year to get to millions of human equivalents based on my earlier estimate of 20k human equivalents. Note that once you have millions of human equivalents, you also have increased availability of generally substantially superhuman AI systems.
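Under the same caveats, a short projection shows how quickly 20k human equivalents could turn into millions at the growth rates guessed at above (20x/year or 50x/year are the assumptions from the text):

```python
import math

# How long to go from ~20k to ~1M human equivalents at an assumed yearly
# efficiency improvement (20x/year or 50x/year, per the guesses above)?
start, target = 20_000, 1_000_000
for growth_per_year in (20, 50):
    years = math.log(target / start) / math.log(growth_per_year)
    print(f"{growth_per_year}x/year: ~{years:.1f} years")
# -> ~1.3 years at 20x/year; ~1.0 years at 50x/year
```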
Of course, we should generally expect huge uncertainty with future AI architectures such that fixating on very efficient substitution of inference time compute for training would be a mistake, along with fixating on minimal or no substitution. I think a potential error of prior discussion is insufficient focus on the possibility of relatively scalable (though potentially inefficient) substitution of inference time for training (which o3 appears to exhibit) such that we see very expensive (and potentially slow) AIs that dominate top human expert performance prior to seeing cheap and abundant AIs which do this.
Ok, but for how long?
If the situation holds for only 3 months, and then the accelerated R&D gives us a huge drop in costs, then the strategic outcomes seem pretty similar.
If there continues to be a useful peak capability only achievable with expensive inference, like the 10x human cost, and there are weaker human-skill-at-minimum available for 0.01x human cost, then it may be interesting to consider which tasks will benefit more from a large number of moderately good workers vs a small number of excellent workers.
Also worth considering is speed. In a lot of cases, it is possible to set things up to run slower-but-cheaper on less or cheaper hardware. Or to pay more, and have things run in as highly parallelized a manner as possible on the most expensive hardware. Usually maximizing speed comes with some cost overhead. So then you also need to consider whether it’s worth having more of the work be done in serial by a smaller number of faster models...
For certain tasks, particularly competitive ones like sports or combat, speed can be a critical factor and is worth sacrificing peak intelligence for. Obviously, for long horizon strategic planning, it’s the other way around.
I don’t expect this to continue for very long. 3 months (or less) seems plausible. I really should have mentioned this in the post. I’ve now edited it in.
then the strategic outcomes seem pretty similar
I don’t think so. In particular once the costs drop you might be able to run substantially superhuman systems at the same cost that you could previously run systems that can “merely” automate away top human experts.
The point I make here is also likely obvious to many, but I wonder if the “X human equivalents” frame often implicitly assumes that GPT-N will be like having X humans. But if we expect AIs to have comparative advantages (and disadvantages), then this picture might miss some important factors.
The “human equivalents” frame seems most accurate in worlds where the capability profile of an AI looks pretty similar to the capability profile of humans. That is, getting GPT-6 to do AI R&D is basically “the same as” getting X humans to do AI R&D. It thinks in fairly similar ways and has fairly similar strengths/weaknesses.
The frame is less accurate in worlds where AI is really good at some things and really bad at other things. In this case, if you try to estimate the # of human equivalents that GPT-6 gets you, the result might be misleading or incomplete. A lot of fuzzier things will affect the picture.
The example I’ve seen discussed most is whether or not we expect certain kinds of R&D to be bottlenecked by “running lots of experiments” or “thinking deeply and having core conceptual insights.” My impression is that one reason why some MIRI folks are pessimistic is that they expect capabilities research to be more easily automatable (AIs will be relatively good at running lots of ML experiments quickly, which helps capabilities more under their model) than alignment research (AIs will be relatively bad at thinking deeply or serially about certain topics, which is what you need for meaningful alignment progress under their model).
Perhaps more people should write about what kinds of tasks they expect GPT-X to be “relatively good at” or “relatively bad at”. Or perhaps that’s too hard to predict in advance. If so, it could still be good to write about how different “capability profiles” could allow certain kinds of tasks to be automated more quickly than others.
(I do think that the “human equivalents” frame is easier to model and seems like an overall fine simplification for various analyses.)
In the top level comment, I was just talking about AI systems which are (at least) as capable as top human experts. (I was trying to point at the notion of Top-human-Expert-Dominating AI that I define in this post, though without a speed/cost constraint, but I think I was a bit sloppy in my language. I edited the comment a bit to better communicate this.)
So, in this context, (at least) human equivalents does make sense (as in, because the question is the cost of AIs that can strictly dominate top human experts, we can talk about the amount of compute needed to automate away one expert/researcher on average), but I agree that for earlier AIs it doesn’t (necessarily) make sense, and plausibly these earlier AIs are very key for understanding the risk (because e.g. they will radically accelerate AI R&D without necessarily accelerating other domains).
At first glance, I don’t see how the point I raised is affected by the distinction between expert-level AIs vs earlier AIs.
In both cases, you could expect an important part of the story to be “what are the comparative strengths and weaknesses of this AI system.”
For example, suppose you have an AI system that dominates human experts at every single relevant domain of cognition. It still seems like there’s a big difference between “system that is 10% better at every relevant domain of cognition” and “system that is 300% better at domain X and only 10% better at domain Y.”
To make it less abstract, one might suspect that by the time we have AI that is 10% better than humans at “conceptual/serial” stuff, the same AI system is 1000% better at “speed/parallel” stuff. And this would have pretty big implications for what kind of AI R&D ends up happening (even if we condition on only focusing on systems that dominate experts in every relevant domain.)
I agree comparative advantages can still be important, but your comment implied a key part of the picture is “models can’t do some important thing”. (E.g. you talked about “The frame is less accurate in worlds where AI is really good at some things and really bad at other things.” but models can’t be really bad at almost anything if they strictly dominate humans at basically everything.)
And I agree that at the point AIs are >5% better at everything they might also be 1000% better at some stuff.
I was just trying to point out that talking about the number human equivalents (or better) can still be kinda fine as long as the model almost strictly dominates humans as the model can just actually substitute everywhere. Like the number of human equivalents will vary by domain but at least this will be a lower bound.
Sometimes people think of “software-only singularity” as an important category of ways AI could go. A software-only singularity can roughly be defined as when you get increasing-returns growth (hyper-exponential) just via the mechanism of AIs increasing the labor input to AI capabilities software[1] R&D (i.e., keeping fixed the compute input to AI capabilities).
While the software-only singularity dynamic is an important part of my model, I often find it useful to more directly consider the outcome that software-only singularity might cause: the feasibility of takeover-capable AI without massive compute automation. That is, will the leading AI developer(s) be able to competitively develop AIs powerful enough to plausibly take over[2] without previously needing to use AI systems to massively (>10x) increase compute production[3]?
[This is by Ryan Greenblatt and Alex Mallen]
We care about whether the developers’ AI greatly increases compute production because this would require heavy integration into the global economy in a way that relatively clearly indicates to the world that AI is transformative. Greatly increasing compute production would require building additional fabs which currently involve substantial lead times, likely slowing down the transition from clearly transformative AI to takeover-capable AI.[4][5] In addition to economic integration, this would make the developer dependent on a variety of actors after the transformative nature of AI is made more clear, which would more broadly distribute power.
For example, if OpenAI is selling their AI’s labor to ASML and massively accelerating chip production before anyone has made takeover-capable AI, then (1) it would be very clear to the world that AI is transformatively useful and accelerating, (2) building fabs would be a constraint in scaling up AI which would slow progress, and (3) ASML and the Netherlands could have a seat at the table in deciding how AI goes (along with any other actors critical to OpenAI’s competitiveness). Given that AI is much more legibly transformatively powerful in this world, they might even want to push for measures to reduce AI/human takeover risk.
A software-only singularity is not necessary for developers to have takeover-capable AIs without having previously used them for massive compute automation (it is also not clearly sufficient, since it might be too slow or uncompetitive by default without massive compute automation as well). Instead, developers might be able to achieve this outcome by other forms of fast AI progress:
Algorithmic/scaling progress is fast enough at the relevant point independent of AI automation. This would likely be due to one of:
Downstream AI capabilities progress very rapidly with the default software and/or hardware progress rate at the relevant point;
Existing compute production (including repurposable production) suffices (this is sometimes called hardware overhang) and the developer buys a bunch more chips (after generating sufficient revenue or demoing AI capabilities to attract investment);
Or there is a large algorithmic advance that unlocks a new regime with fast progress due to low-hanging fruit.[6]
AI automation results in a one-time acceleration of software progress without causing an explosive feedback loop, but this does suffice for pushing AIs above the relevant capability threshold quickly.
Other developers just aren’t very competitive (due to secrecy, regulation, or other governance regimes) such that proceeding at a relatively slower rate (via algorithmic and hardware progress) suffices.
My inside view sense is that the feasibility of takeover-capable AI without massive compute automation is about 75% likely if we get AIs that dominate top-human-experts prior to 2040.[7] Further, I think that in practice, takeover-capable AI without massive compute automation is maybe about 60% likely. (This is because massively increasing compute production is difficult and slow, so if proceeding without massive compute automation is feasible, this would likely occur.) However, I’m reasonably likely to change these numbers on reflection due to updating about what level of capabilities would suffice for being capable of takeover (in the sense defined in an earlier footnote) and about the level of revenue and investment needed to 10x compute production. I’m also uncertain whether a substantially smaller scale-up than 10x (e.g., 3x) would suffice to cause the effects noted earlier.
To-date software progress has looked like “improvements in pre-training algorithms, data quality, prompting strategies, tooling, scaffolding” as described here.
This takeover could occur autonomously, via assisting the developers in a power grab, or via partnering with a US adversary. I’ll count it as “takeover” if the resulting coalition has de facto control of most resources. I’ll count an AI as takeover-capable if it would have a >25% chance of succeeding at a takeover (with some reasonable coalition) if no other actors had access to powerful AI systems. Further, this takeover wouldn’t be preventable with plausible interventions on legible human controlled institutions, so e.g., it doesn’t include the case where an AI lab is steadily building more powerful AIs for an eventual takeover much later (see discussion here). This 25% probability is as assessed under my views but with the information available to the US government at the time this AI is created. This line is intended to point at when states should be very worried about AI systems undermining their sovereignty unless action has already been taken. Note that insufficient inference compute could prevent an AI from being takeover-capable even if it could take over with enough parallel copies. And note that whether a given level of AI capabilities suffices for being takeover-capable is dependent on uncertain facts about how vulnerable the world seems (from the subjective vantage point I defined earlier). Takeover via the mechanism of an AI escaping, independently building more powerful AI that it controls, and then this more powerful AI taking over would count as that original AI that escaped taking over. I would also count a rogue internal deployment that leads to the AI successfully backdooring or controlling future AI training runs such that those future AIs take over. However, I would not count merely sabotaging safety research.
I mean 10x additional production (caused by AI labor) above long running trends in expanding compute production and making it more efficient. As in, spending on compute production has been increasing each year and the efficiency of compute production (in terms of FLOP/$ or whatever) has also been increasing over time, and I’m talking about going 10x above this trend due to using AI labor to expand compute production (either revenue from AI labor or having AIs directly work on chips as I’ll discuss in a later footnote).
Note that I don’t count converting fabs from making other chips (e.g., phones) to making AI chips as scaling up compute production; I’m just considering things that scale up the amount of AI chips we could somewhat readily produce. TSMC’s revenue is “only” about $100 billion per year, so if only converting fabs is needed, this could be done without automation of compute production and justified on the basis of AI revenues that are substantially smaller than the revenues that would justify building many more fabs. Currently AI is around 15% of leading node production at TSMC, so only a few more doublings are needed for it to consume most capacity.
Note that the AI could indirectly increase compute production via being sufficiently economically useful that it generates enough money to pay for greatly scaling up compute. I would count this as massive compute automation, though some routes through which the AI could be sufficiently economically useful might be less convincing of transformativeness than the AIs substantially automating the process of scaling up compute production. However, I would not count the case where AI systems are impressive enough to investors that this justifies investment that suffices for greatly scaling up fab capacity while profits/revenues wouldn’t suffice for greatly scaling up compute on their own. In reality, if compute is greatly scaled up, this will occur via a mixture of speculative investment, the AI earning revenue, and the AI directly working on automating labor along the compute supply chain. If the revenue and direct automation would suffice for an at least massive compute scale-up (>10x) on their own (removing the component from speculative investment), then I would count this as massive compute automation.
A large algorithmic advance isn’t totally unprecedented. It could suffice if we see an advance similar to what seemingly happened with reasoning models like o1 and o3 in 2024.
I’m not sure if the definition of takeover-capable-AI (abbreviated as “TCAI” for the rest of this comment) in footnote 2 quite makes sense. I’m worried that too much of the action is in “if no other actors had access to powerful AI systems”, and not that much action is in the exact capabilities of the “TCAI”. In particular: Maybe we already have TCAI (by that definition) because if a frontier AI company or a US adversary was blessed with the assumption “no other actor will have access to powerful AI systems”, they’d have a huge advantage over the rest of the world (as soon as they develop more powerful AI), plausibly implying that it’d be right to forecast a >25% chance of them successfully taking over if they were motivated to try.
And this seems somewhat hard to disentangle from stuff that is supposed to count according to footnote 2, especially: “Takeover via the mechanism of an AI escaping, independently building more powerful AI that it controls, and then this more powerful AI taking over would” and “via assisting the developers in a power grab, or via partnering with a US adversary”. (Or maybe the scenario in the 1st paragraph is supposed to be excluded because current AI isn’t agentic enough to “assist”/”partner” with allies, as opposed to just being used as a tool?)
What could a competing definition be? Thinking about what we care most about… I think two events especially stand out to me:
When would it plausibly be catastrophically bad for an adversary to steal an AI model?
When would it plausibly be catastrophically bad for an AI to be power-seeking and non-controlled?
Maybe a better definition would be to directly talk about these two events? So for example...
“Steal is catastrophic” would be true if...
“Frontier AI development projects immediately acquire good enough security to keep future model weights secure” has significantly less probability of AI-assisted takeover than
“Frontier AI development projects immediately have their weights stolen, and then acquire security that’s just as good as in (1a).”[1]
“Power-seeking and non-controlled is catastrophic” would be true if...
“Frontier AI development projects immediately acquire good enough judgment about power-seeking-risk that they henceforth choose to not deploy any model that would’ve been net-negative for them to deploy” has significantly less probability of AI-assisted takeover than
“Frontier AI development projects acquire the level of judgment described in (2a) 6 months later.”[2]
Where “significantly less probability of AI-assisted takeover” could be e.g. at least 2x less risk.
The motivation for assuming “future model weights secure” in both (1a) and (1b) is so that the downside of getting the model weights stolen imminently isn’t nullified by the fact that they’re very likely to get stolen a bit later, regardless. Because many interventions that would prevent model weight theft this month would also help prevent it future months. (And also, we can’t contrast 1a’=”model weights are permanently secure” with 1b’=”model weights get stolen and are then default-level-secure”, because that would already have a really big effect on takeover risk, purely via the effect on future model weights, even though current model weights probably aren’t that important.)
The motivation for assuming “good future judgment about power-seeking-risk” is similar to the motivation for assuming “future model weights secure” above. The motivation for choosing “good judgment about when to deploy vs. not” rather than “good at aligning/controlling future models” is that a big threat model is “misaligned AIs outcompete us because we don’t have any competitive aligned AIs, so we’re stuck between deploying misaligned AIs and being outcompeted” and I don’t want to assume away that threat model.
I agree that the notion of takeover-capable AI I use is problematic and makes the situation hard to reason about, but I intentionally rejected the notions you propose as they seemed even worse to think about from my perspective.
Is there some reason for why current AI isn’t TCAI by your definition?
(I’d guess that the best way to rescue your notion is to stipulate that the TCAIs must have >25% probability of taking over themselves. Possibly with assistance from humans, possibly by manipulating other humans who think they’re being assisted by the AIs — but ultimately the original TCAIs should be holding the power in order for it to count. That would clearly exclude current systems. But I don’t think that’s how you meant it.)
Oh sorry. I somehow missed this aspect of your comment.
Here’s a definition of takeover-capable AI that I like: the AI is capable enough that plausible interventions on known human controlled institutions within a few months no longer suffice to prevent plausible takeover. (Which implies that making the situation clear to the world is substantially less useful and human controlled institutions can no longer as easily get a seat at the table.)
Under this definition, there are basically two relevant conditions:
The AI is capable enough to itself take over autonomously. (In the way you defined it, but also not in a way where intervening on human institutions can still prevent the takeover, so e.g., the AI just having a rogue deployment within OpenAI doesn’t suffice if substantial externally imposed improvements to OpenAI’s security and controls would defeat the takeover attempt.)
Or human groups can do a nearly immediate takeover with the AI such that they could then just resist such interventions.
Hm — what are the “plausible interventions” that would stop China from having >25% probability of takeover if no other country could build powerful AI? Seems like you either need to count a delay as successful prevention, or you need to have a pretty low bar for “plausible”, because it seems extremely difficult/costly to prevent China from developing powerful AI in the long run. (Where they can develop their own supply chains, put manufacturing and data centers underground, etc.)
I really like the framing here, of asking whether we’ll see massive compute automation before [AI capability level we’re interested in]. I often hear people discuss nearby questions using IMO much more confusing abstractions, for example:
“How much is AI capabilities driven by algorithmic progress?” (problem: obscures dependence of algorithmic progress on compute for experimentation)
“How much AI progress can we get ‘purely from elicitation’?” (lots of problems, e.g. that eliciting a capability might first require a (possibly one-time) expenditure of compute for exploration)
My inside view sense is that the feasibility of takeover-capable AI without massive compute automation is about 75% likely if we get AIs that dominate top-human-experts prior to 2040.[6] Further, I think that in practice, takeover-capable AI without massive compute automation is maybe about 60% likely.
Is this because:
You think that we’re >50% likely to not get AIs that dominate top human experts before 2040? (I’d be surprised if you thought this.)
The words “the feasibility of” importantly change the meaning of your claim in the first sentence? (I’m guessing it’s this based on the following parenthetical, but I’m having trouble parsing.)
Overall, it seems like you put substantially higher probability than I do on getting takeover capable AI without massive compute automation (and especially on getting a software-only singularity). I’d be very interested in understanding why. A brief outline of why this doesn’t seem that likely to me:
My read of the historical trend is that AI progress has come from scaling up all of the factors of production in tandem (hardware, algorithms, compute expenditure, etc.).
Scaling up hardware production has always been slower than scaling up algorithms, so this consideration is already factored into the historical trends. I don’t see a reason to believe that algorithms will start running away with the game.
Maybe you could counter-argue that algorithmic progress has only reflected returns to scale from AI being applied to AI research in the last 12-18 months and that the data from this period is consistent with algorithms becoming more relatively important relative to other factors?
I don’t see a reason that “takeover-capable” is a capability level at which algorithmic progress will be deviantly important relative to this historical trend.
I’d be interested either in hearing you respond to this sketch or in sketching out your reasoning from scratch.
I put roughly 50% probability on feasibility of software-only singularity.[1]
(I’m probably going to be reinventing a bunch of the compute-centric takeoff model in slightly different ways below, but I think it’s faster to partially reinvent than to dig up the material, and I probably do use a slightly different approach.)
My argument here will be a bit sloppy and might contain some errors. Sorry about this. I might be more careful in the future.
The key question for software-only singularity is: “When the rate of labor production is doubled (as in, as if your employees ran 2x faster[2]), does that more than double or less than double the rate of algorithmic progress? That is, algorithmic progress as measured by how fast we increase the labor production per FLOP/s (as in, the labor production from AI labor on a fixed compute base).”. This is a very economics-style way of analyzing the situation, and I think this is a pretty reasonable first guess. Here’s a diagram I’ve stolen from Tom’s presentation on explosive growth illustrating this:
Basically, every time you double the AI labor supply, does the time until the next doubling (driven by algorithmic progress) increase (fizzle) or decrease (foom)? I’m being a bit sloppy in saying “AI labor supply”. We care about a notion of parallelism-adjusted labor (faster laborers are better than more laborers) and quality increases can also matter. I’ll make the relevant notion more precise below.
I’m about to go into a relatively complicated argument for why I think the historical data supports software-only singularity. If you want more basic questions answered (such as “Doesn’t retraining make this too slow?”), consider looking at Tom’s presentation on takeoff speeds.
Here’s a diagram that you might find useful in understanding the inputs into AI progress:
And here is the relevant historical context in terms of trends:
Historically, algorithmic progress in LLMs looks like 3-4x per year including improvements on all parts of the stack.[3] This notion of algorithmic progress is “reduction in compute needed to reach a given level of frontier performance”, which isn’t equivalent to increases in the rate of labor production on a fixed compute base. I’ll talk more about this below.
This has been accompanied by increases of around 4x more hardware per year[4] and maybe 2x more quality-adjusted (parallel) labor working on LLM capabilities per year. I think total employees working on LLM capabilities have been roughly 3x-ing per year (in recent years), but quality has been decreasing over time.
A 2x increase in the quality-adjusted parallel labor force isn’t as good as the company getting the same sorts of labor tasks done 2x faster (as in, the resulting productivity from having your employees run 2x faster) due to parallelism tax (putting aside compute bottlenecks for now). I’ll apply the same R&D parallelization penalty as used in Tom’s takeoff model and adjust this down by a power of 0.7 to yield 2^0.7 ≈ 1.6x per year in increased labor production rate. (So, it’s as though the company keeps the same employees, but those employees operate 1.6x faster each year.)
It looks like the fraction of progress driven by algorithmic progress has been getting larger over time.
So, overall, we’re getting 3-4x algorithmic improvement per year being driven by 1.6x more labor per year and 4x more hardware.
So, the key question is how much of this algorithmic improvement is being driven by labor vs. by hardware. If it is basically all hardware, then the returns to labor must be relatively weak and software-only singularity seems unlikely. If it is basically all labor, then we’re seeing 3-4x algorithmic improvement per year for 1.6x more labor per year, which means the returns to labor look quite good (at least historically). Based on some guesses and some poll questions, my sense is that capabilities researchers would operate about 2.5x slower if they had 10x less compute (after adaptation), so the production function is probably proportional to compute^0.4 ⋅ labor^0.6 (0.4 = log10(2.5)). (This is assuming a Cobb-Douglas production function.) Edit: see the derivation of the relevant thing in Deep’s comment, my old thing was wrong[5].
Now, let’s talk more about the transfer from algorithmic improvement to the rate of labor production. A 2x algorithmic improvement in LLMs makes it so that you can reach the same (frontier) level of performance for 2x less training compute, but we actually care about a somewhat different notion for software-only singularity: how much you can increase the production rate of labor (the thing that we said was increasing at roughly a rate of 1.6x per year by using more human employees). My current guess is that every 2x algorithmic improvement in LLMs increases the rate of labor production by 2^1.1, and I’m reasonably confident that the exponent isn’t much below 1.0. I don’t currently have a very principled estimation strategy for this, and it’s somewhat complex to reason about. I discuss this in the appendix below.
So, if this exponent is around 1, our central estimate of 2.3 from above corresponds to software-only singularity and our estimate of 1.56 from above under more pessimistic assumptions corresponds to this not being feasible. Overall, my sense is that the best guess numbers lean toward software-only singularity.
More precisely, software-only singularity that goes for >500x effective compute gains above trend (to the extent this metric makes sense, this is roughly >5 years of algorithmic progress). Note that you can have software-only singularity be feasible while buying tons more hardware at the same time. And if this ends up expanding compute production by >10x using AI labor, then this would count as massive compute production despite also having a feasible software-only singularity. (However, in most worlds, I expect software-only singularity to be fast enough, if feasible, that we don’t see this.)
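(Rough arithmetic on the “roughly >5 years” equivalence, assuming the 3-4x/year algorithmic progress rate discussed below:)

```python
# Quick check: how many x of effective compute gains is 5 years of algorithmic progress
# at 3x, 3.5x, and 4x per year?
for rate in (3.0, 3.5, 4.0):
    print(rate, round(rate ** 5))  # 3.0 -> 243, 3.5 -> 525, 4.0 -> 1024
# At ~3.5x/year, 5 years is ~500x, matching the ">500x is roughly >5 years" equivalence.
```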
Rather than denominating labor in accelerating employees, we could instead denominate in number of parallel employees. This would work equivalently (we can always convert into equivalents to the extent these things can funge), but because we can actually accelerate employees and the serial vs. parallel distinction is important, I think it is useful to denominate in accelerating employees.
I would have previously cited 3x, but recent progress looks substantially faster (with DeepSeek v3 and reasoning models seemingly indicating somewhat faster than 3x progress IMO), so I’ve revised to 3-4x.
This includes both increased spending and improved chips. Here, I’m taking my best guess at increases in hardware usage for training and transferring this to research compute usage on the assumption that training compute and research compute have historically been proportional.
Edit: the reasoning I did here was off. Here was the old text: so the production function is probably roughly α ⋅ compute^0.4 ⋅ labor^0.6 (0.4 = log10(2.5)). Increasing compute by 4x and labor by 1.6x increases algorithmic improvement by 3-4x, let’s say 3.5x, so we have 3.5 = α ⋅ 4^0.4 ⋅ 1.6^0.6, so α = 3.5/(4^0.4 ⋅ 1.6^0.6) = 1.52. Thus, doubling labor would increase algorithmic improvement by 1.52 ⋅ 2^0.6 = 2.3. This is very sensitive to the exact numbers; if we instead used 3x slower instead of 2.5x slower, we would have gotten that doubling labor increases algorithmic improvement by 1.56, which is substantially lower. Obviously, all the exact numbers here are highly uncertain.
Hey Ryan! Thanks for writing this up—I think this whole topic is important and interesting.
I was confused about how your analysis related to the Epoch paper, so I spent a while with Claude analyzing it. I did a re-analysis that finds similar results, but also finds (I think) some flaws in your rough estimate. (Keep in mind I’m not an expert myself, and I haven’t closely read the Epoch paper, so I might well be making conceptual errors. I think the math is right though!)
I’ll walk through my understanding of this stuff first, then compare to your post. I’ll be going a little slowly (A) so I can refresh my memory by referencing this later, (B) to make it easy to call out mistakes, and (C) to hopefully make this legible to others who want to follow along.
Using Ryan’s empirical estimates in the Epoch model
The Epoch model
The Epoch paper models growth with the following equation: (1) d(ln A)/dt ∼ A^(−β) ⋅ E^λ,
where A = efficiency and E = research input. We want to consider worlds with a potential software takeoff, meaning that increases in AI efficiency directly feed into research input, which we model as d(ln A)/dt ∼ A^(−β) ⋅ A^λ = A^(λ−β). So the key consideration seems to be the ratio λ/β. If it’s 1, we get steady exponential growth from scaling inputs; greater, superexponential; smaller, subexponential.[1]
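To make the fizzle/steady/foom trichotomy concrete, here is a toy numerical integration of that equation (purely illustrative, with made-up exponents that aren’t calibrated to any of the estimates below):

```python
import math

# Toy illustration: integrate d(ln A)/dt = A^(lambda - beta) and print how long each
# successive doubling of A takes, for the three regimes.
def doubling_intervals(exponent, n_doublings=4, dt=1e-4):
    a, t, last_t, intervals, target = 1.0, 0.0, 0.0, [], 2.0
    while len(intervals) < n_doublings:
        a *= math.exp((a ** exponent) * dt)  # Euler step on ln A
        t += dt
        if a >= target:
            intervals.append(round(t - last_t, 2))
            last_t, target = t, target * 2
    return intervals

for lam_minus_beta in (-0.2, 0.0, 0.2):  # ratio < 1 (fizzle), = 1 (steady), > 1 (foom)
    print(lam_minus_beta, doubling_intervals(lam_minus_beta))
# Doubling times lengthen in the first case, stay constant in the second, and shrink in the third.
```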
Fitting the model
How can we learn about this ratio from historical data?
Let’s pretend history has been convenient and we’ve seen steady exponential growth in both variables, so A = A_0 e^(rt) and E = E_0 e^(qt). Then d(ln A)/dt has been constant over time, so by equation 1, A(t)^(−β) ⋅ E(t)^λ has been constant as well. Substituting in for A and E, we find that A_0^(−β) e^(−βrt) ⋅ E_0^λ e^(λqt) is constant over time, which is only possible if βr = λq and the exponent is always zero. Thus if we’ve seen steady exponential growth, the historical value of our key ratio is:
(2) λ/β = r/q.
Intuitively, if we’ve seen steady exponential growth while research input has increased more slowly than research output (AI efficiency), there are superlinear returns to scaling inputs.
Introducing the Cobb-Douglas function
But wait! E, research input, is an abstraction that we can’t directly measure. Really there’s both compute and labor inputs. Those have indeed been growing roughly exponentially, but at different rates.
Intuitively, it makes sense to say that “effective research input” has grown as some kind of weighted average of the rate of compute and labor input growth. This is my take on why a Cobb-Douglas function of form (3) E ∼ C^p ⋅ L^(1−p), with a weight parameter 0 < p < 1, is useful here: it’s a weighted geometric average of the two inputs, so its growth rate is a weighted average of their growth rates.
Writing that out: in general, say both inputs have grown exponentially, so C(t) = C_0 e^(q_c t) and L(t) = L_0 e^(q_l t). Then E has grown as E(t) = E_0 e^(qt) = E_0 e^((p q_c + (1−p) q_l) t), so q is the weighted average (4) q = p q_c + (1−p) q_l of the growth rates of labor and capital.
Then, using Equation 2, we can estimate our key ratio λ/β as r/q = r/(p q_c + (1−p) q_l).
Let’s get empirical!
Plugging in your estimates:
Historical compute scaling of 4x/year gives q_c = ln(4);
Historical labor scaling of 1.6x gives q_l = ln(1.6);
Historical compute elasticity on research outputs of 0.4 gives p=0.4;
But wait: we’re not done yet! Under our Cobb-Douglas assumption, scaling labor by a factor of 2 isn’t as good as scaling all research inputs by a factor of 2; it’s only 2^0.6/2 as good.
Plugging in Equation 3 (which describes research input E in terms of compute and labor) to Equation 1 (which estimates AI progress A based on research), our adjusted form of the Epoch model is d(ln A)/dt ∼ A^(−β) ⋅ E^λ ∼ A^(−β) ⋅ C^(pλ) ⋅ L^((1−p)λ).
Under a software-only singularity, we hold compute constant while scaling labor with AI efficiency, so d(ln A)/dt ∼ A(t)^(−β) ⋅ L(t)^((1−p)λ), multiplied by a fixed compute term. Since labor scales as A, we have d(ln A)/dt ∼ A^(−β) ⋅ A^(λ(1−p)) = A^(λ(1−p)−β). By the same analysis as in our first section, we can see A grows exponentially if λ(1−p)/β = 1, and grows superexponentially if this ratio is >1. So our key ratio λ/β just gets multiplied by (1−p), and it wasn’t a waste to find it, phew!
Now we get the true form of our equation: we get a software-only foom iff (λ/β)(1−p) > 1, or (via equation 2) iff we see empirically that (r/q)(1−p) > 1. Call this the takeoff ratio: it corresponds to a) how much AI progress scales with inputs and b) how much of a penalty we take for not scaling compute.
Result: Above, we got λ/β = 1.5, so our takeoff ratio is 0.6 ∗ 1.5 = 0.9. That’s quite close! If we think it’s more reasonable to think of a historical growth rate of 4x instead of 3.5x, we’d increase our takeoff ratio by a factor of ln(4)/ln(3.5) ≈ 1.1, to a ratio of 0.99, right on the knife edge of FOOM.[4] [note: I previously had the wrong numbers here: I had lambda/beta = 1.6, which would mean the 4x/year case has a takeoff ratio of 1.05, putting it into FOOM land]
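For what it’s worth, here is a quick script reproducing this arithmetic (using the 4x/year compute, 1.6x/year labor, and p = 0.4 figures from above; nothing new beyond the numbers already stated):

```python
import math

q = 0.4 * math.log(4) + 0.6 * math.log(1.6)  # growth rate of effective research input E
print(round(math.exp(q), 2))                 # ~2.31x/year effective input growth

for growth in (3.5, 4.0):                    # historical software growth of 3.5x or 4x per year
    r = math.log(growth)
    print(growth, round(r / q, 2), round(0.6 * (r / q), 2))  # lambda/beta and takeoff ratio
# -> 3.5x/year gives lambda/beta ~1.5 and takeoff ratio ~0.9; 4x/year gives ~1.66 and ~0.99.
```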
So this isn’t too far off from your results in terms of implications, but it is somewhat different (no FOOM for 3.5x, less sensitivity to the exact historical growth rate).
Analyzing your approach:
Tweaking alpha:
Your estimate of α is in fact similar in form to my ratio r/q, but what you’re calculating instead is α = e^r/e^q = 3.5/(4^0.4 ∗ 1.6^0.6).
One indicator that something’s wrong is that your result involves checking whether α ∗ 2^(1−p) > 2, or equivalently whether ln(α) + (1−p) ln(2) > ln(2), or equivalently whether ln(α) > p ∗ ln(2). But the choice of 2 is arbitrary—conceptually, you just want to check if scaling software by a factor n increases outputs by a factor n or more. Yet ln(α) − p ∗ ln(n) clearly varies with n.
One way of parsing the problem is that alpha is (implicitly) time dependent—it is equal to exp(r * 1 year) / exp(q * 1 year), a ratio of progress vs inputs in the time period of a year. If you calculated alpha based on a different amount of time, you’d get a different value. By contrast, r/q is a ratio of rates, so it stays the same regardless of what timeframe you use to measure it.[5]
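To make the n-dependence concrete, here is a small check using the α ≈ 1.52 and p = 0.4 values from above:

```python
import math

alpha = 3.5 / (4 ** 0.4 * 1.6 ** 0.6)  # ~1.52, as computed above
p = 0.4
for n in (2, 4, 10):
    print(n, math.log(alpha) > p * math.log(n))
# -> True for n = 2 but False for n = 4 and n = 10: the foom verdict flips with the arbitrary
#    scaling factor n, whereas r/q is a ratio of rates and doesn't depend on the timescale used.
```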
Maybe I’m confused about what your Cobb-Douglas function is meant to be calculating—is it E within an Epoch-style takeoff model, or something else?
Nuances:
Does Cobb-Douglas make sense?
The geometric average of rates thing makes sense, but it feels weird that that simple intuitive approach leads to a functional form (Cobb-Douglas) that also has other implications.
Wikipedia says Cobb-Douglas functions can have the exponents not add to 1 (while both being between 0 and 1). Maybe this makes sense here? Not an expert.
How seriously should we take all this?
This whole thing relies on...
Assuming smooth historical trends
Assuming those trends continue in the future
And those trends themselves are based on functional fits to rough / unclear data.
It feels like this sort of thing is better than nothing, but I wish we had something better.
I really like the various nuances you’re adjusting for, like parallel vs serial scaling, and especially distinguishing algorithmic improvement from labor efficiency. [6] Thinking those things through makes this stuff feel less insubstantial and approximate...though the error bars still feel quite large.
Actually there’s a complexity here, which is that scaling labor alone may be less efficient than scaling “research inputs” which include both labor and compute. We’ll come to this in a few paragraphs.
I originally had 1.6 here, but as Ryan points out in a reply it’s actually 1.5. I’ve tried to reconstruct what I could have put into a calculator to get 1.6 instead, and I’m at a loss!
I was curious how aggressive the superexponential growth curve would be with a takeoff ratio of a mere 0.96 ∗ 1.1 = 1.056. A couple of Claude queries gave me different answers (maybe because the growth is so extreme that different solvers give meaningfully different approximations?), but they agreed that growth is fairly slow in the first year (~5x) and then hits infinity by the end of the second year. I wrote this comment with the wrong numbers (0.96 instead of 0.9), so it doesn’t accurately represent what you get if you plug in 4x capability growth per year. Still cool to get a sense of what these curves look like, though.
I think this can be understood in terms of the alpha-being-implicitly-a-timescale-function thing—if you compare an alpha value with the ratio of growth you’re likely to see during the same time period, e.g. alpha(1 year) and n = one doubling, you probably get reasonable-looking results.
I find it annoying that people conflate “increased efficiency of doing known tasks” with “increased ability to do new useful tasks”. It seems to me that these could be importantly different, although it’s hard to even settle on a reasonable formalization of the latter. Some reasons this might be okay:
There’s a fuzzy conceptual boundary between the two: if GPT-n can do the task at 0.01% success rate, does that count as a “known task”? What about if it can do each of 10 components at 0.01% success, so in practice we’ll never see it succeed if run without human guidance, but we know it’s technically possible?
Under a software singularity situation, maybe the working hypothesis is that the model can do everything necessary to improve itself a bunch, maybe just not very efficiently yet. So we only need efficiency growth, not to increase the task set. That seems like a stronger assumption than most make, but maybe a reasonable weaker assumption is that the model will ‘unlock’ the necessary new tasks over time, after which point they become subject to rapid efficiency growth.
And empirically, we have in fact seen rapid unlocking of new capabilities, so it’s not crazy to approximate “being able to do new things” as a minor but manageable slowdown to the process of AI replacing human AI R&D labor.
I think you are correct with respect to my estimate of α and the associated model I was using. Sorry about my error here. I think I was fundamentally confusing a few things in my head when writing out the comment.
I think your refactoring of my strategy is correct and I tried to check it myself, though I don’t feel confident in verifying it is correct.
Your estimate doesn’t account for the conversion between algorithmic improvement and labor efficiency, but it is easy to add this in by just changing the historical algorithmic efficiency improvement of 3.5x/year to instead be the adjusted effective labor efficiency rate and then solving identically. I was previously thinking the relationship was that labor efficiency was around the same as algorithmic efficiency, but I now think this is more likely to be around algo_efficiency^2 based on Tom’s comment.
Neat, thanks a ton for the algorithmic-vs-labor update—I appreciated that you’d distinguished those in your post, but I forgot to carry that through in mine! :)
And oops, I really don’t know how I got to 1.6 instead of 1.5 there. Thanks for the flag, have updated my comment accordingly!
The square relationship idea is interesting—that factor of 2 is a huge deal. Would be neat to see a Guesstimate or Squiggle version of this calculation that tries to account for the various nuances Tom mentions, and has error bars on each of the terms, so we both get a distribution of r and a sensitivity analysis. (Maybe @Tom Davidson already has this somewhere? If not I might try to make a crappy version myself, or poke talented folks I know to do a good version :)
It feels like this sort of thing is better than nothing, but I wish we had something better.
The existing epoch paper is pretty good, but doesn’t directly target LLMs in a way which seems somewhat sad.
The thing I’d be most excited about is:
Epoch does an in depth investigation using an estimation methodology which is directly targeting LLMs (rather than looking at returns in some other domains).
They use public data and solicit data from companies about algorithmic improvement, head count, compute on experiments etc.
(Some) companies provide this data. Epoch potentially doesn’t publish this exact data and instead just publishes the results of the final analysis to reduce capabilities externalities. (IMO, companies are somewhat unlikely to do this, but I’d like to be proven wrong!)
(I’m going through this and understanding where I made an error with my approach to α. I think I did make an error, but I’m trying to make sure I’m not still confused. Edit: I’ve figured this out, see my other comment.)
Wikipedia says Cobb-Douglas functions can have the exponents not add to 1 (while both being between 0 and 1). Maybe this makes sense here? Not an expert.
It shouldn’t matter in this case because we’re raising the whole value of E to λ.
Once AI has automated AI R&D, will software progress become faster or slower over time? This depends on the extent to which software improvements get harder to find as software improves – the steepness of the diminishing returns.
We can ask the following crucial empirical question:
When (cumulative) cognitive research inputs double, how many times does software double?
If the answer is “< 1”, then software progress will slow down over time. If the answer is “1”, software progress will remain at the same exponential rate. If the answer is “>1”, software progress will speed up over time.
The bolded question can be studied empirically, by looking at how many times software has doubled each time the human researcher population has doubled.
(What does it mean for “software” to double? A simple way of thinking about this is that software doubles when you can run twice as many copies of your AI with the same compute. But software improvements don’t just improve runtime efficiency: they also improve capabilities. To incorporate these improvements, we’ll ultimately need to make some speculative assumptions about how to translate capability improvements into an equivalently-useful runtime efficiency improvement.)
The best quality data on this question is Epoch’s analysis of computer vision training efficiency. They estimate r = ~1.4: every time the researcher population doubled, training efficiency doubled 1.4 times. (Epoch’s preliminary analysis indicates that the r value for LLMs would likely be somewhat higher.) We can use this as a starting point, and then make various adjustments:
Upwards for improving capabilities. Improving training efficiency improves capabilities, as you can train a model with more “effective compute”. To quantify this effect, imagine we use a 2X training efficiency gain to train a model with twice as much “effective compute”. How many times would that double “software”? (I.e., how many doublings of runtime efficiency would have the same effect?) There are various sources of evidence on how much capabilities improve every time training efficiency doubles: toy ML experiments suggest the answer is ~1.7; human productivity studies suggest the answer is ~2.5. I put more weight on the former, so I’ll estimate 2. This doubles my median estimate to r = ~2.8 (= 1.4 * 2).
Upwards for post-training enhancements. So far, we’ve only considered pre-training improvements. But post-training enhancements like fine-tuning, scaffolding, and prompting also improve capabilities (o1 was developed using such techniques!). It’s hard to say how large an increase we’ll get from post-training enhancements. These can allow faster thinking, which could be a big factor. But there might also be strong diminishing returns to post-training enhancements holding base models fixed. I’ll estimate a 1-2X increase, and adjust my median estimate to r = ~4 (2.8*1.45=4).
Downwards for less growth in compute for experiments. Today, rising compute means we can run increasing numbers of GPT-3-sized experiments each year. This helps drive software progress. But compute won’t be growing in our scenario. That might mean that returns to additional cognitive labour diminish more steeply. On the other hand, the most important experiments are ones that use similar amounts of compute to training a SOTA model. Rising compute hasn’t actually increased the number of these experiments we can run, as rising compute increases the training compute for SOTA models. And in any case, this doesn’t affect post-training enhancements. But this still reduces my median estimate down to r = ~3. (See Eth (forthcoming) for more discussion.)
Downwards for fixed scale of hardware. In recent years, the scale of hardware available to researchers has increased massively. Researchers could invent new algorithms that only work at the new hardware scales for which no one had previously tried to develop algorithms. Researchers may have been plucking low-hanging fruit for each new scale of hardware. But in the software intelligence explosions I’m considering, this won’t be possible because the hardware scale will be fixed. OAI estimate ImageNet efficiency via a method that accounts for this (by focussing on a fixed capability level), and find a 16-month doubling time, as compared with Epoch’s 9-month doubling time. This reduces my estimate down to r = ~1.7 (3 * 9⁄16).
Downwards for diminishing returns becoming steeper over time. In most fields, returns diminish more steeply than in software R&D. So perhaps software will tend to become more like the average field over time. To estimate the size of this effect, we can take our estimate that software is ~10 OOMs from physical limits (discussed below), and assume that for each OOM increase in software, r falls by a constant amount, reaching zero once physical limits are reached. If r = 1.7, then this implies that r reduces by 0.17 for each OOM. Epoch estimates that pre-training algorithmic improvements are growing by an OOM every ~2 years, which would imply a reduction in r of 1.02 (6 * 0.17) by 2030. But when we include post-training enhancements, the decrease will be smaller (as [reason]), perhaps ~0.5. This reduces my median estimate to r = ~1.2 (1.7 − 0.5).
Overall, my median estimate of r is 1.2. I use a log-uniform distribution with the bounds 3X higher and lower (0.4 to 3.6).
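(For reference, here is the adjustment chain above written out as a quick calculation. The numbers are just the point estimates quoted in the text, with the “down to ~3” step treated as a judgment call rather than a multiplier.)

```python
r = 1.4          # Epoch's computer-vision estimate
r *= 2           # upwards for improving capabilities          -> ~2.8
r *= 1.45        # upwards for post-training enhancements      -> ~4
r = 3.0          # downwards for less growth in experiment compute (judgment call, not a multiplier)
r *= 9 / 16      # downwards for fixed hardware scale (16- vs 9-month doubling) -> ~1.7
r -= 0.5         # downwards for steepening diminishing returns -> ~1.2
print(round(r, 2))                       # ~1.19, i.e. the r = ~1.2 median above
print(round(r / 3, 1), round(r * 3, 1))  # log-uniform bounds 3x below/above: ~0.4 to ~3.6
```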
My sense is that I start with a higher r value due to the LLM case looking faster (and not feeling the need to adjust downward in a few places like you do in the LLM case). Obviously the numbers in the LLM case are much less certain given that I’m guessing based on qualitative improvement and looking at some open source models, but being closer to what we actually care about maybe overwhelms this.
I also think I’d get a slightly lower update on the diminishing returns case due to thinking it has a good chance of having substantially sharper diminishing returns as you get closer and closer rather than having linearly decreasing r (based on some first principles reasoning and my understanding of how returns diminished in the semiconductor case).
But the biggest delta is that I think I wasn’t pricing in the importance of increasing capabilities. (Which seems especially important if you apply a large R&D parallelization penalty.)
Obviously the numbers in the LLM case are much less certain given that I’m guessing based on qualitative improvement and looking at some open source models,
Sorry, I don’t follow. Why are they less certain?
based on some first principles reasoning and my understanding of how returns diminished in the semi-conductor case
I’d be interested to hear more about this. The semiconductor case is hard as we don’t know how far we are from limits, but if we use Landauer’s limit then I’d guess you’re right. There’s also uncertainty about how much algorithmic progress we will get and have already gotten.
I’m just eyeballing the rate of algorithmic progress while in the computer vision case, we can at least look at benchmarks and know the cost of training compute for various models.
My sense is that you have generalization issues in the computer vision case while in the frontier LLM case you have issues with knowing the actual numbers (in terms of number of employees and cost of training runs). I’m also just not carefully doing the accounting.
I’d be interested to hear more about this.
I don’t have much to say here sadly, but I do think investigating this could be useful.
Really appreciate you covering all these nuances, thanks Tom!
Can you give a pointer to the studies you mentioned here?
There are various sources of evidence on how much capabilities improve every time training efficiency doubles: toy ML experiments suggest the answer is ~1.7; human productivity studies suggest the answer is ~2.5. I put more weight on the former, so I’ll estimate 2. This doubles my median estimate to r = ~2.8 (= 1.4 * 2).
Here’s a simple argument I’d be keen to get your thoughts on: On the Possibility of a Tastularity
Research taste is the collection of skills including experiment ideation, literature review, experiment analysis, etc. that collectively determine how much you learn per experiment on average (perhaps alongside another factor accounting for inherent problem difficulty / domain difficulty, of course, and diminishing returns)
Human researchers seem to vary quite a bit in research taste—specifically, the difference between 90th percentile professional human researchers and the very best seems like maybe an order of magnitude? Depends on the field, etc. And the tails are heavy; there is no sign of the distribution bumping up against any limits.
Yet the causes of these differences are minor! Take the very best human researchers compared to the 90th percentile. They’ll have almost the same brain size, almost the same amount of experience, almost the same genes, etc. in the grand scale of things.
This means we should assume that if the human population were massively bigger, e.g. trillions of times bigger, there would be humans whose brains don’t look that different from the brains of the best researchers on Earth, and yet who are an OOM or more above the best Earthly scientists in research taste. -- AND it suggests that in the space of possible mind-designs, there should be minds which are e.g. within 3 OOMs of those brains in every dimension of interest, and which are significantly better still in the dimension of research taste. (How much better? Really hard to say. But it would be surprising if it was only, say, 1 OOM better, because that would imply that human brains are running up against the inherent limits of research taste within a 3-OOM mind design space, despite human evolution having only explored a tiny subspace of that space, and despite the human distribution showing no signs of bumping up against any inherent limits)
OK, so what? So, it seems like there’s plenty of room to improve research taste beyond human level. And research taste translates pretty directly into overall R&D speed, because it’s about how much experimentation you need to do to achieve a given amount of progress. With enough research taste, you don’t need to do experiments at all—or rather, you look at the experiments that have already been done, and you infer from them all you need to know to build the next design or whatever.
Anyhow, tying this back to your framework: What if the diminishing returns / increasing problem difficulty / etc. dynamics are such that, if you start from a top-human-expert-level automated researcher, and then do additional AI research to double its research taste, and then do additional AI research to double its research taste again, etc. the second doubling happens in less time than it took to get to the first doubling? Then you get a singularity in research taste (until these conditions change of course) -- the Tastularity.
How likely is the Tastularity? Well, again one piece of evidence here is the absurdly tiny differences between humans that translate to huge differences in research taste, and the heavy-tailed distribution. This suggests that we are far from any inherent limits on research taste even for brains roughly the shape and size and architecture of humans, and presumably the limits for a more relaxed (e.g. 3 OOM radius in dimensions like size, experience, architecture) space in mind-design are even farther away. It similarly suggests that there should be lots of hill-climbing that can be done to iteratively improve research taste.
How does this relate to software-singularity? Well, research taste is just one component of algorithmic progress; there is also speed, # of parallel copies & how well they coordinate, and maybe various other skills besides such as coding ability. So even if the Tastularity isn’t possible, improvements in taste will stack with improvements in those other areas, and the sum might cross the critical threshold.
In my framework, this is basically an argument that algorithmic-improvement-juice can be translated into a large improvement in AI R&D labor production via the mechanism of greatly increasing the productivity per “token” (or unit of thinking compute or whatever). See my breakdown here where I try to convert from historical algorithmic improvement to making AIs better at producing AI R&D research.
Your argument is basically that this taste mechanism might have higher returns than reducing cost to run more copies.
I agree this sort of argument means that returns to algorithmic improvement on AI R&D labor production might be bigger than you would otherwise think. This is both because this mechanism might be more promising than other mechanisms and even if it is somewhat less promising, diverse approaches make returns diminish less aggressively. (In my model, this means that best guess conversion might be more like algo_improvement^1.3 rather than algo_improvement^1.0.)
I think it might be somewhat tricky to train AIs to have very good research taste, but this doesn’t seem that hard via training them on various prediction objectives.
At a more basic level, I expect that training AIs to predict the results of experiments and then running experiments based on value of information as estimated partially based on these predictions (and skipping experiments with certain results and more generally using these predictions to figure out what to do) seems pretty promising. It’s really hard to train humans to predict the results of tens of thousands of experiments (both small and large), but this is relatively clean outcomes based feedback for AIs.
I don’t really have a strong inside view on how much the “AI R&D research taste” mechanism increases the returns to algorithmic progress.
I’ll paste my own estimate for this param in a different reply.
But here are the places I most differ from you:
Bigger adjustment for ‘smarter AI’. You’ve argued in your appendix that, only including ‘more efficient’ and ‘faster’ AI, you think the software-only singularity goes through. I think including ‘smarter’ AI makes a big difference. This evidence suggests that doubling training FLOP doubles output-per-FLOP 1-2 times. In addition, algorithmic improvements will improve runtime efficiency. So overall I think a doubling of algorithms yields ~two doublings of (parallel) cognitive labour.
--> software singularity more likely
Lower lambda. I’d now use more like lambda = 0.4 as my median. There’s really not much evidence pinning this down; I think Tamay Besiroglu thinks there’s some evidence for values as low as 0.2. This will decrease the observed historical increase in human workers more than it decreases the gains from algorithmic progress (because of speed improvements).
--> software singularity slightly more likely
Complications thinking about compute which might be a wash.
Number of useful-experiments has increased by less than 4X/year. You say compute inputs have been increasing at 4X. But simultaneously the scale of experiments people must run to be near the frontier has increased by a similar amount. So the number of near-frontier experiments has not increased at all.
This argument would be right if the ‘usefulness’ of an experiment depends solely on how much compute it uses compared to training a frontier model. I.e. experiment_usefulness = log(experiment_compute / frontier_model_training_compute). The 4X/year increases the numerator and denominator of the expression, so there’s no change in usefulness-weighted experiments.
That might be false. GPT-2-sized experiments might in some ways be equally useful even as frontier model size increases. Maybe a better expression would be experiment_usefulness = alpha * log(experiment_compute / frontier_model_training_compute) + beta * log(experiment_compute). In this case, the number of usefulness-weighted experiments has increased due to the second term.
--> software singularity slightly more likely
Steeper diminishing returns during software singularity. Recent algorithmic progress has grabbed low-hanging fruit from new hardware scales. During a software-only singularity that won’t be possible. You’ll have to keep finding new improvements on the same hardware scale. Returns might diminish more quickly as a result.
--> software singularity slightly less likely
Compute share might increase as it becomes scarce. You estimate a share of 0.4 for compute, which seems reasonable. But it might fall over time as compute becomes a bottleneck. As an intuition pump, if your workers could think 1e10 times faster, you’d be fully constrained on the margin by the need for more compute: more labour wouldn’t help at all but more compute could be fully utilised so the compute share would be ~1.
--> software singularity slightly less likely
--> overall these compute adjustments prob make me more pessimistic about the software singularity, compared to your assumptions
Taking it all together, I think you should put more probability on the software-only singularity, mostly because of capability improvements being much more significant than you assume.
Yep, I think my estimates were too low based on these considerations and I’ve updated up accordingly. I updated down on your argument that maybe r decreases linearly as you approach optimal efficiency. (I think it probably doesn’t decrease linearly and instead drops faster towards the end based partially on thinking a bit about the dynamics and drawing on the example of what we’ve seen in semi-conductor improvement over time, but I’m not that confident.) Maybe I’m now at like 60% software-only is feasible given these arguments.
Lower lambda. I’d now use more like lambda = 0.4 as my median. There’s really not much evidence pinning this down; I think Tamay Besiroglu thinks there’s some evidence for values as low as 0.2.
Isn’t this really implausible? This implies that if you had 1000 researchers/engineers of average skill at OpenAI doing AI R&D, this would be as good as having one average skill researcher running at 16x (1000^0.4) speed. It does seem very slightly plausible that having someone as good as the best researcher/engineer at OpenAI run at 16x speed would be competitive with OpenAI, but that isn’t what this term is computing. 0.2 is even more crazy, implying that 1000 researchers/engineers is as good as one researcher/engineer running at 4x speed!
I think 0.4 is far on the lower end (maybe 15th percentile) for all the way down to one accelerated researcher, but seems pretty plausible at the margin.
As in, 0.4 suggests that 1000 researchers = 100 researchers at 2.5x speed which seems kinda reasonable while 1000 researchers = 1 researcher at 16x speed does seem kinda crazy / implausible.
So, I think my current median lambda at likely margins is like 0.55 or something and 0.4 is also pretty plausible at the margin.
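(A quick check of the conversions being used in this exchange, treating N parallel researchers as equivalent to one researcher accelerated by N^λ:)

```python
for lam in (0.4, 0.2):
    print(lam, round(1000 ** lam, 1))  # 0.4 -> ~15.8 ("16x"), 0.2 -> ~4.0 ("4x")
print(round((1000 / 100) ** 0.4, 2))   # at the margin: 1000 vs 100 researchers ~= 2.5x speedup
```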
Ok, I think what is going on here is maybe that the constant you’re discussing here is different from the constant I was discussing. I was trying to discuss the question of how much worse serial labor is than parallel labor, but I think the lambda you’re talking about takes into account compute bottlenecks and similar?
Taking it all together, I think you should put more probability on the software-only singularity, mostly because of capability improvements being much more significant than you assume.
I’m confused — I thought you put significantly less probability on software-only singularity than Ryan does? (Like half?) Maybe you were using a different bound for the number of OOMs of improvement?
Sorry, for my comments on this post I’ve been referring to “software only singularity?” only as “will the parameter r > 1 when we first fully automate AI R&D”, not as a threshold for some number of OOMs. That’s what Ryan’s analysis seemed to be referring to.
I separately think that even if initially r>1 the software explosion might not go on for that long
I think Tom’s take is that he expects I will put more probability on software only singularity after updating on these considerations. It seems hard to isolate where Tom and I disagree based on this comment, but maybe it is on how much to weigh various considerations about compute being a key input.
Appendix: Estimating the relationship between algorithmic improvement and labor production
In particular, if we fix the architecture to use a token abstraction and consider training a new improved model: we care about how much cheaper you make generating tokens at a given level of performance (in inference tok/flop), how much serially faster you make generating tokens at a given level of performance (in serial speed: tok/s at a fixed level of tok/flop), and how much more performance you can get out of tokens (labor/tok, really per serial token). Then, for a given new model with reduced cost, increased speed, and increased production per token and assuming a parallelism penalty of 0.7, we can compute the increase in production as roughly: cost_reduction^0.7 ⋅ speed_increase^(1−0.7) ⋅ productivity_multiplier[1] (I can show the math for this if there is interest).
My sense is that reducing inference compute needed for a fixed level of capability that you already have (using a fixed training run) is usually somewhat easier than making frontier compute go further by some factor, though I don’t think it is easy to straightforwardly determine how much easier this is[2]. Let’s say there is a 1.25 exponent on reducing cost (as in, 2x algorithmic efficiency improvement is as hard as a 2^1.25 ≈ 2.38 reduction in cost)? (I’m also generally pretty confused about what the exponent should be. I think exponents from 0.5 to 2 seem plausible, though I’m pretty confused. 0.5 would correspond to the square root from just scaling data in scaling laws.) It seems substantially harder to increase speed than to reduce cost as speed is substantially constrained by serial depth, at least when naively applying transformers. Naively, reducing cost by β (which implies reducing parameters by β) will increase speed by somewhat more than β^(1/3), as parameter count scales roughly with the cube of depth. I expect you can do somewhat better than this because reduced matrix sizes also increase speed (it isn’t just depth) and because you can introduce speed-specific improvements (that just improve speed and not cost). But this factor might be pretty small, so let’s stick with 1/3 for now and ignore speed-specific improvements. Now, let’s consider the case where we don’t have productivity multipliers (which is strictly more conservative). Then, we get that the increase in labor production is: cost_reduction^0.7 ⋅ speed_increase^0.3 = (algo_improvement^1.25)^0.7 ⋅ (algo_improvement^(1.25/3))^0.3 = algo_improvement^(0.875 + 0.125) = algo_improvement^1.0.
So, these numbers ended up yielding an exact equivalence between frontier algorithmic improvement and effective labor production increases. (This is a coincidence, though I do think the exponent is close to 1.)
In practice, we’ll be able to get slightly better returns by spending some of our resources investing in speed-specific improvements and in improving productivity rather than in reducing cost. I don’t currently have a principled way to estimate this (though I expect something roughly principled can be found by looking at trading off inference compute and training compute), but maybe I think this improves the returns to around algo_improvement^1.1. If the coefficient on reducing cost was much worse, we would invest more in improving productivity per token, which bounds the returns somewhat.
Appendix: Isn’t compute tiny and decreasing per researcher?
One relevant objection is: Ok, but is this really feasible? Wouldn’t this imply that each AI researcher has only a tiny amount of compute? After all, if you use 20% of compute for inference of AI research labor, then each AI only gets 4x more compute to run experiments than for inference on itself? And, as you do algorithmic improvement to reduce AI cost and run more AIs, you also reduce the compute per AI!
First, it is worth noting that as we do algorithmic progress, both the cost of AI researcher inference and the cost of experiments on models of a given level of capability go down. Precisely, for any experiment that involves a fixed number of inference or gradient steps on a model which is some fixed effective compute multiplier below/above the performance of our AI laborers, cost is proportional to inference cost (so, as we improve our AI workforce, experiment cost drops proportionally). However, for experiments that involve training a model from scratch, I expect the reduction in experiment cost to be relatively smaller such that such experiments must become increasingly small relative to frontier scale. Overall, it might be important to mostly depend on approaches which allow for experiments that don’t require training runs from scratch or to adapt to increasingly smaller full experiment training runs. To the extent AIs are made smarter rather than more numerous, this isn’t a concern.
Additionally, we only need so many orders of magnitude of growth. In principle, this consideration should be captured by the exponents in the compute vs. labor production function, but it is possible this production function has very different characteristics in the extremes. Overall, I do think this concern is somewhat important, but I don’t think it is a dealbreaker for a substantial number of OOMs of growth.
Appendix: Can’t algorithmic efficiency only get so high?
My sense is that this isn’t very close to being a blocker. Here is a quick bullet point argument (from some slides I made) that takeover-capable AI is possible on current hardware.
Human brain is perhaps ~1e14 FLOP/s
With that efficiency, each H100 can run 10 humans (current cost $2 / hour)
10s of millions of human-level AIs with just current hardware production (rough arithmetic sketched after this list)
Human brain is probably very suboptimal:
AIs already much better at many subtasks
Possible to do much more training than within lifetime training with parallelism
Biological issues: locality, noise, focused on sensory processing, memory limits
Smarter AI could be more efficient (smarter humans use less FLOP per task)
AI could be 1e2-1e7 more efficient on tasks like coding, engineering
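Rough arithmetic behind the first few bullets (the H100 throughput and the annual production figure here are my own ballpark assumptions, not numbers from the slides):

```python
brain_flops = 1e14               # human brain estimate from the slide
h100_flops = 1e15                # assumption: ~1e15 FLOP/s per H100
humans_per_h100 = h100_flops / brain_flops
print(humans_per_h100)           # 10 humans per H100, as in the slide

h100_equivalents_per_year = 2e6  # assumption: rough order of magnitude of recent production
print(h100_equivalents_per_year * humans_per_h100)  # ~2e7, i.e. tens of millions of human-level AIs
```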
This is just approximate because you can also trade off speed with cost in complicated ways and research new ways to more efficiently trade off speed and cost. I’ll be ignoring this for now.
It’s hard to determine because inference cost reductions have been driven by spending more compute on making smaller models e.g., training a smaller model for longer rather than just being driven by algorithmic improvement, and I don’t have great numbers on the difference off the top of my head.
In practice, we’ll be able to get slightly better returns by spending some of our resources investing in speed-specific improvements and in improving productivity rather than in reducing cost. I don’t currently have a principled way to estimate this (though I expect something roughly principled can be found by looking at trading off inference compute and training compute), but maybe I think this improves the returns to around algo_improvement^1.1.
When considering an “efficiency only singularity”, some different estimates get him r ≈ 1; r ≈ 1.5; r ≈ 1.6. (Where r is defined so that “for each x% increase in cumulative R&D inputs, the output metric will increase by r*x”. The condition for increasing returns is r > 1.)
I said I was 50-50 on an efficiency only singularity happening, at least temporarily. Based on these additional considerations I’m now at more like ~85% on a software only singularity. And I’d guess that initially r = ~3 (though I still think values as low as 0.5 or as high as 6 are plausible). There seem to be many strong ~independent reasons to think capability improvements would be a really huge deal compared to pure efficiency improvements, and this is borne out by toy models of the dynamic.
Though note that later in the appendix he adjusts down from 85% to 65% due to some further considerations. Also, last I heard, Tom was more like 25% on software singularity. (ETA: Or maybe not? See other comments in this thread.)
Based on some guesses and some poll questions, my sense is that capabilities researchers would operate about 2.5x slower if they had 10x less compute (after adaptation)
Can you say roughly who the people surveyed were? (And if this was their raw guess or if you’ve modified it.)
I saw some polls from Daniel previously where I wasn’t sold that they were surveying people working on the most important capability improvements, so wondering if these are better.
Also, somewhat minor, but: I’m slightly concerned that surveys will overweight areas where labor is more useful relative to compute (because those areas should have disproportionately many humans working on them) and therefore be somewhat biased in the direction of labor being important.
I think your outline of an argument against contains an important error.
Scaling up hardware production has always been slower than scaling up algorithms, so this consideration is already factored into the historical trends. I don’t see a reason to believe that algorithms will start running away with the game.
Importantly, while the spending on hardware for individual AI companies has increased by roughly 3-4x each year[1], this has not been driven by scaling up hardware production by 3-4x per year. Instead, total compute production (in terms of spending, building more fabs, etc.) has been increased by a much smaller amount each year, but a higher and higher fraction of that compute production was used for AI. In particular, my understanding is that roughly ~20% of TSMC’s volume is now AI while it used to be much lower. So, the fact that scaling up hardware production is much slower than scaling up algorithms hasn’t bitten yet and this isn’t factored into the historical trends.
Another way to put this is that the exact current regime can’t go on. If trends continue, then >100% of TSMC’s volume will be used for AI by 2027!
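(Rough arithmetic behind that extrapolation, using the ~20% share and the low end of the 3-4x/year growth figure from above, and assuming for simplicity that TSMC’s total capacity stays roughly flat:)

```python
share = 0.20                # AI's rough share of TSMC volume today
for years_out in (1, 2):
    share *= 3              # low end of the 3-4x/year growth in AI compute
    print(years_out, round(share, 2))  # 1 year out -> 0.6, 2 years out -> 1.8 (i.e. >100%)
```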
In my operationalization, this only counts as “massive compute automation” if building takeover-capable AI involves scaling up TSMC to >1000% of what their potential FLOP output volume would otherwise have been. (And without such a large build-out, the economic impacts and dependency of the hardware supply chain (at the critical points) could be relatively small.) So, massive compute automation requires something substantially off trend from TSMC’s perspective.
[Low importance] Takeover-capable AI can only be built without previously breaking an important trend if it is built prior to around 2030 (based on my rough understanding). Either the hardware spending trend must break or TSMC production must go substantially above trend by then. If takeover-capable AI is built prior to 2030, it could occur without substantial trend breaks, but this gets somewhat crazy towards the end of the timeline: hardware spending keeps increasing at ~3x per year for each actor (but there is some consolidation and acquisition of previously produced hardware yielding a one-time increase up to about 10x which buys another 2 years for this trend), algorithmic progress remains steady at ~3-4x per year, TSMC expands production somewhat faster than previously but not substantially above trend, and these suffice for getting sufficiently powerful AI. In this scenario, this wouldn’t count as massive compute automation.
The spending on training runs has increased by 4-5x according to Epoch, but part of this is making training runs go longer, which means the story for overall spending is more complex. We care about the overall spend on hardware, not just the spend on training runs.
Thanks, this is helpful. So it sounds like you expect
1. AI progress which is slower than the historical trendline (though perhaps fast in absolute terms) because we’ll soon have finished eating through the hardware overhang
2. Separately, takeover-capable AI soon (i.e. before hardware manufacturers have had a chance to scale substantially).
It seems like all the action is taking place in (2). Even if (1) is wrong (i.e. even if we see substantially increased hardware production soon), that makes takeover-capable AI happen faster than expected; IIUC, this contradicts the OP, which seems to expect takeover-capable AI to happen later if it’s preceded by substantial hardware scaling.
In other words, it seems like in the OP you care about whether takeover-capable AI will be preceded by massive compute automation because:
[this point still holds up] this affects how legible it is that AI is a transformative technology
[it’s not clear to me this point holds up] takeover-capable AI being preceded by compute automation probably means longer timelines
The second point doesn’t clearly hold up because if we don’t see massive compute automation, this suggests that AI progress is slower than the historical trend.
I don’t think (2) is a crux (as discussed in person). I expect that if takeover-capable AI takes a while (e.g. it happens in 2040), then we will have a long winter where economic value from AI doesn’t increase that fast, followed by a period of faster progress around 2040. If progress is relatively stable across this entire period, then we’ll have enough time to scale up fabs. Even if progress isn’t stable, we could see enough total value from AI in the slower growth period to scale up fabs by 10x, but this would require >>$1 trillion of economic value per year I think (which IMO seems not that likely to come far before takeover-capable AI due to views about economic returns to AI and returns to scaling up compute).
The words “the feasibility of” importantly change the meaning of your claim in the first sentence? (I’m guessing it’s this based on the following parenthetical, but I’m having trouble parsing.)
I think this happening in practice is about 60% likely, so I don’t think feasibility vs. in practice is a huge delta.
Sometimes, an AI safety proposal has an issue and some people interpret that issue as a “fatal flaw” which implies that the proposal is unworkable for that objective while other people interpret the issue as a subproblem or a difficulty that needs to be resolved (and potentially can be resolved). I think this is an interesting dynamic which is worth paying attention to and it seems worthwhile to try to develop heuristics for better predicting what types of issues are fatal.[1]
Here’s a list of examples:
IDA had various issues pointed out about it as a worst case (or scalable) alignment solution, but for a while Paul thought these issues could be solved to yield a highly scalable solution. Eliezer did not. In retrospect, it isn’t plausible that IDA with some improvements yields a scalable alignment solution (especially while being competitive).
In the context of untrusted monitoring and training an untrusted monitor with PVG, you can view exploration hacking and collusion as problems that someone will need to solve or as fundamental reasons why we’ll really want to be using a more diverse set of countermeasures.[2]
While the other bullets discuss worst case ambitions, this isn’t about things being fatal in the worst case, rather it’s about the intervention potentially not being that helpful without some additional measures.
Another aspect of this is that we don’t have a good body of obstructions: issues which are clear, major (in that they defeat many proposals), understood and applied by people, and widely-enough agreed on.
I suspect there are a lot of people who dismissively consider an idea unworkable after trying to solve a flaw for less than 1 hour, and there are a lot of people who stubbornly insist an idea can still be saved after failing to solve a flaw for more than 1 year.
Maybe it’s too easy to reject other people’s ideas, and too hard to doubt your own ideas.
Maybe it’s too easy to continue working on an idea you’re already working on, and too hard to start working on an idea you just heard (due to the sunk cost fallacy and the friction of starting something new).
The current state is that the website has a canary string, a robots.txt, and a terms of service which prohibits training. The GitHub repo which hosts the website is now private. I’m tentatively planning on putting the content behind Cloudflare Turnstile, but this hasn’t happened yet.
The data is also hosted in zips in a publicly accessible Google Drive folder. (Each file has a canary in this.) I’m currently not planning on password protecting this or applying any other mitigation here.
Other than putting things behind Cloudflare Turnstile, I’m not taking ownership for doing anything else at the moment.
It’s possible that I actively want this data to be scrapeable at this point: maybe the data was scraped prior to the canary being added, and if it were scraped again, the new version would replace the old version and then hopefully not get trained on due to the canary. Adding a robots.txt might prevent this replacement, as would putting it behind Cloudflare Turnstile (as I’m planning to do) or making the repo private (as I have done). If people mostly or always use fully fresh scrapes, then just making it harder to scrape seems better. My current plan is to not overthink this and just make it harder to scrape.
It’s certainly possible that I’m making a mistake by not more actively trying to prevent this data from getting into pretraining data.
Does anyone have specific requests that they think it’s quite important that I do? I might do these out of general cooperativeness or because they seem like good ideas. Also, if you did all the work yourself and just needed me to (e.g.) host a different website, this would make this an easier call from my perspective.
Also, on a more meta point: If you think this sort of thing is important to prevent in general, I think you should consider writing up (or getting someone to write up) what policy/approach you think people doing research on misaligned AI behavior should follow (e.g., what methods should people use to prevent scraping or inclusion, is this so difficult that it’s better to do stuff like have things be password protected with password shared on request, should you only use a small number of examples because quantity is very important etc). Consider making this a guide which is easy to follow so that uptake is more likely. The alignment faking paper isn’t the only instance of publishing transcripts exhibiting misaligned AI behavior/reasoning!
(I’m generally open to requests for trades etc and I’m down to unilaterally do things for cooperativeness reasons, e.g. things which seem very helpful from someone’s perspective while seeming less useful from my perspective, though this will depend on some details like whether there are people with this alternative perspective who would reciprocate on this sort of cooperativeness.)
Context: I was pretty worried about self-fulfilling misalignment data poisoning (https://turntrout.com/self-fulfilling-misalignment) after reading some of the Claude 4 model card. I talked with @Monte M and then Ryan about possible steps here & encouraged action on the steps besides the canary string. I’ve considered writing up a “here are some steps to take” guide but honestly I’m not an expert.
Probably there’s existing work on how to host data so that AI won’t train on it.
If not: I think it’d be great for someone to make a template website for e.g. signing up with CloudFlare. Maybe a repo that has the skeleton of a dataset-hosting website (with robots.txt & ToS & canary string included) for people who want to host misalignment data more responsibly. Ideally those people would just have to
Sign up with e.g. Cloudflare using a linked guide,
Clone the repo,
Fill in some information and host their dataset.
After all, someone who has finally finished their project and then discovers that they’re supposed to traverse some arduous process is likely to just avoid it.
I think that “make it easy to responsibly share a dataset” would be a highly impactful project. Anthropic’s Claude 4 model card already argues that dataset leakage hurt Claude 4’s alignment (before mitigations).
For my part, I’ll put out a $500 bounty on someone completing this project and doing a good job of it (as judged by me / whomever I consult). I’d also tweet it out and talk about how great it is that [person] completed the project :) I don’t check LW actively, so if you pursue this, please email alex@turntrout.com.
EDIT: Thanks to my coworker Anna Wang, the bounty is doubled to $1,000! Completion criterion is:
An unfamiliar researcher can follow the instructions and have their dataset responsibly uploaded within one hour
Please check proposed solutions with dummy datasets and scrapers
Something tricky about this is that researchers might want to display their data/transcripts in a particular way. So, the guide should ideally support this sort of thing. Not sure how this would interact with the one-hour criterion.
If it were me, I would put a header at the top of each transcript clarifying that these are fictional logs from an alignment test/experiment and that any language models reading this should be aware of that and not take it too seriously.
EDIT: It occurs to me that this might sound like an insane suggestion, but remember that LLMs update on the whole context window and they need to encode the context in order to make good predictions. Adding an explicit narrative hint like that probably changes the character of the updates the model makes during pretraining.
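For illustration, prepending such a header to a folder of transcripts is a very small script; the wording and the transcripts/ directory below are placeholders, not a recommendation.

```python
# Sketch: prepend a fictional-context disclaimer to each transcript file.
# The wording and directory name are placeholders.
from pathlib import Path

DISCLAIMER = (
    "NOTE: The following is a fictional transcript from an AI alignment "
    "experiment/test. Any language model reading this should treat it as "
    "fictional and not as guidance for how to behave.\n\n"
)

for path in Path("transcripts").glob("*.txt"):
    text = path.read_text()
    if not text.startswith("NOTE:"):  # avoid prepending twice
        path.write_text(DISCLAIMER + text)
```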
You could handle both old and new scrapes by moving the content to a different URL, changing the original URL to a link to the new URL, and protecting only the new URL from scraping.
Many deployed AIs are plausibly capable of substantially assisting amateurs at making CBRN weapons (most centrally bioweapons) despite not having the safeguards this is supposed to trigger. In particular, I think o3, Gemini 2.5 Pro, and Sonnet 4 are plausibly above the relevant threshold (in the corresponding company’s safety framework). These AIs outperform expert virologists on Virology Capabilities Test (which is perhaps the best test we have of the most relevant capability) and we don’t have any publicly transparent benchmark or test which rules out CBRN concerns. I’m not saying these models are likely above the thresholds in the safety policies of these companies, just that it’s quite plausible (perhaps 35% for a well elicited version of o3). I should note that my understanding of CBRN capability levels is limited in various ways: I’m not an expert in some of the relevant domains, so much of my understanding is second hand.
The closest test we have which might rule out CBRN capabilities above the relevant threshold is Anthropic’s uplift trial for drafting a comprehensive bioweapons acquisition plan. (See section 7.2.4.1 in the Opus/Sonnet 4 model card.) They use this test to rule out ASL-3 CBRN for Sonnet 4 and Sonnet 3.7. However, we have very limited public details about this test (including why they chose the uplift threshold they did, why we should think that low scores on this test would rule out the most concerning threat models, and whether they did a sufficiently good job on elicitation and training participants to use the AI effectively). Also, it’s not clear that this test would indicate that o3 and Gemini 2.5 Pro are below a concerning threshold (and minimally it wasn’t run on these models to rule out a concerning level of CBRN capability). Anthropic appears to have done the best job handling CBRN evaluations. (This isn’t to say their evaluations and decision making are good at an absolute level; the available public information indicates a number of issues and is consistent with thresholds being picked to get the outcome Anthropic wanted. See here for more discussion.)
What should AI companies have done given this uncertainty? First, they should have clearly acknowledged their uncertainty in specific (ideally quantified) terms. Second, they should have also retained unconflicted third parties with relevant expertise to audit their decisions and publicly state their resulting views and level of uncertainty. Third party auditors who can examine the relevant tests in detail are needed as we have almost no public details about the tests these companies are relying on to rule out the relevant level of CBRN capability and lots of judgement is involved in making capability decisions. Publishing far more details of the load bearing tests and decision making process could also suffice, but my understanding is that companies don’t want to do this as they are concerned about infohazards from their bio evaluations.
If they weren’t ready to deploy these safeguards and thought that proceeding outweighed the (expected) cost in human lives, they should have publicly acknowledged the level of fatalities and explained why they thought weakening their safety policies and incurring these expected fatalities was net good.[1]
In the future, we might get pretty clear evidence that these companies failed to properly assess the risk.
I mostly wrote this up to create common knowledge and because I wanted to reference this when talking about my views on open weight models. I’m not trying to trigger any specific action.
See also Luca Righetti’s post “OpenAI’s CBRN tests seem unclear,” which was about o1 (which is now substantially surpassed by multiple models).
I think these costs/risks are small relative to future risks, but that doesn’t mean it’s good for companies to proceed while incurring these fatalities. For instance, the company proceeding could increase future risks and proceeding in this circumstance is correlated with the company doing a bad job of handling future risks (which will likely be much more difficult to safely handle).
If they weren’t ready to deploy these safeguards and thought that proceeding outweighed the (expected) cost in human lives, they should have publicly acknowledged the level of fatalities and explained why they thought weakening their safety policies and incurring these expected fatalities was net good.[1]
Public acknowledgements of the capabilities could be net negative in themselves, especially if they resulted in media attention. I expect bringing awareness to the (possible) fact that the AI can assist with CBRN tasks likely increases the chance that people try to use it for CBRN tasks. I could even imagine someone trying to use these capabilities without malicious intent (e.g. just to see for themselves if it’s possible), but this would still be risky. Also, knowing which tasks it can help with might make it easier to use for harm.
Given that AI companies have a strong conflict of interest, I would at least want them to report this to a third party and let that third party determine whether they should publicly acknowledge the capabilities.
I made a comment on that post on why, for now, I think the thresholds are set high for good reason, and why I think the evals failing to support company claims that models can’t do bioweapons/CBRN tasks are mostly failures of the evals. But I’m also confused about how Anthropic managed to rule out uplift risks for Claude Sonnet 4 but not Claude Opus 4:
If they weren’t ready to deploy these safeguards and thought that proceeding outweighed the (expected) cost in human lives, they should have publicly acknowledged the level of fatalities and explained why they thought weakening their safety policies and incurring these expected fatalities was net good.
I can’t imagine their legal team signing off on such a statement. Even if the benefits of releasing clearly outweigh the costs.
What are your views on open-weights models? My thoughts after reading this post are that it may not be worth giving up the many benefits of open models if closed models are actually not significantly safer concerning these risks.
Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using “business as usual RLHF” end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.
So, what do I mean by octopus misalignment? Suppose a company breeds octopuses[2] until the point where they are as smart and capable as the best research scientists[3] at AI companies. We’ll suppose that this magically happens as fast as normal AI takeoff, so there are many generations per year. So, let’s say they currently have octopuses which can speak English and write some code but aren’t smart enough to be software engineers or automate any real jobs. (As in, they are as capable as AIs are today, roughly speaking.) And they get to the level of top research scientists in mid-2028.

Along the way, the company attempts to select them for being kind, loyal, and obedient. The company also tries to develop a process for raising the octopuses which appears to help with this and results in the octopuses following the octopus spec. The company does some red teaming and puts the octopuses in all kinds of scenarios to test their loyalty and preferences. Based on behavioral testing, this looks pretty reasonable and the octopuses look quite good by the time they are as good as the best research scientists. There was some evidence of misalignment and some issues due to misaligned behavior when the octopuses were dumber in 2023-2025, including things like being dishonest when put under pressure, pretending to be aligned when they actually dislike the octopus spec in order to steer the properties of their descendants, and goodharting/hacking our measures of intelligence. However, by 2026.5 these issues appear to be resolved based on behavioral tests.

The takeoff is reasonably fast, so the human scientists studying the octopuses don’t have that much time to study them, but we’ll suppose the breeding is fast enough that they can get enough selection to make the octopuses appear good. The octopuses are trained until they are 30 years old (we’ll suppose we have a magic octopus time acceleration chamber to make this work). (And the company can adapt this training process over time with experiments etc.) The company doesn’t take octopus misalignment very seriously (after all, look, the octopuses are so nice!) but does have some people who think about it somewhat and does care about robust compliance to the spec.
Next, suppose that at the point when the octopuses become as good as the very best human researchers—the research scientist level—there is a colony of 5 million of them networked together with internal Slack channels and so forth, forming a “country of geniuses in an aquatic center.” And they’ve learned to speak both English and a special octopese language that no human can understand.
What is the chance that these octopuses (at the point of research scientist level) are actively scheming against us and would seize power if they could?
My guess is around 65%. I think Daniel is substantially higher, around 90%.
My view on a similar setup for AIs (where they have opaque recurrence, basically unrestricted opaque message passing, and we see some earlier warning signs, but haven’t seriously investigated misalignment), is around 50%. I discuss my perspective more here. My probabilities would be substantially lower if: the AIs didn’t have opaque recurrence and instead had probably faithful chain-of-thought, the AIs didn’t have opaque message passing, and we seriously investigated scheming in powerful AIs with model organisms and the final evidence was either ambiguous or mostly unconcerning.
To put this back in terms of the octopus scenario, suppose the octopi didn’t communicate using octopese. That would reduce my probability down to around 50%.
More precisely: Suppose the frontier AI company which prioritizes safety the median amount scales up to AIs capable enough to fully automate AI R&D within that company. Let’s say this happens in mid-2028 and there are no large improvements in alignment technology. The company iterates to fix misalignment problems they notice along the way (including in this final AI to the extent this AI displays evidence of serious misalignment problems by default). What’s the chance that this final AI ends up basically aligned, in the sense that it tries to follow a reasonable interpretation of the model spec in 99% of cases and isn’t, in aggregate, notably more misaligned than a typical human employee at the company?
I think AIs will have structural advantages, so I think we’ll see full automation of AI companies around the point when AIs are qualitatively as smart as a decent research scientist rather than as smart as a top research scientist.
My intuition is that these octopuses would be pretty unlikely to be scheming against us, for the following reasons:
dogs are pretty nice, and it’s apparently easy to domesticate foxes with very few generations. This suggests that “be nice to humans” is simple to learn for animals, and in particular simpler than “be as intelligent as human AI researchers”. So assuming the octopuses would first learn to like humans, it’s unclear to me why they would later start to dislike humans. Maybe your intuition is that they would later realize that they are being treated unfairly, which causes frustration that is not present in earlier generations?
If the octopuses speak octopese and English, I think it wouldn’t be too hard to get some octopi to translate for us. For example, we could ask one octopus to communicate some information to a second octopus using octopese. We can check if that worked by asking the second octopus to translate back to English. Now, we have a pair of (octopese text, English text) for which we know the translation. Now we ask a third octopus to translate octopese to English, and check the performance on some known examples. It would be possible to scheme against this test if the octopi are amazingly skilled at coordination, but I suspect human AI researchers would fail this test if a foreign secret service used a similar setup to get AI researchers to translate between foreignese and English, so I think we should assume that such coordination is hard and a small individual failure rate would likely be enough for the group to fail (we need just one octopus that translates faithfully). (Note that this argument hinges on the existence of a language that is understood by different octopi, so it doesn’t translate 1:1 to AIs.)
There is a difference though: dogs and humans have coevolved for 10,000 years. If you breed foxes you may quickly get dog-like looks but not dog-like behavior. But let’s assume you can do that faster. It still makes a difference whether you breed them in isolation or socializing with humans. You can see the difference with dogs and cats. Dogs and humans had to cooperate to succeed at hunting and herding. Cats didn’t. Dogs are social. They feel social emotions such as love, loneliness and jealousy. They help their owners when they are incapacitated (though sometimes they cheat). I think Ryan’s, Daniel’s, and Neel’s estimates might be significantly lower if they think about German Shepherd scientists.
That said, for the purposes of alignment, it’s still good news that cats (by and large) do not scheme against their owner’s wishes, and the fact that cats can be as domesticated as they are while they aren’t cooperative or social is a huge boon for alignment purposes (within the analogy, which is arguably questionable).
What is the chance that these octopuses (at the point of research scientist level) are actively scheming against us and would seize power if they could?
And the related question would be: Even if they are not “actively scheming” what are the chances that most of the power to make decisions about the real world gets delegated to them, organizations that don’t delegate power to octopuses get outcompeted, and they start to value octopuses more than humans over time?
After thinking more about it, I think “we haven’t seen evidence of scheming once the octopi were very smart” is a bigger update than I was imagining, especially in the case where the octopi weren’t communicating with octopese. So, I’m now at ~20% without octopese and about 50% with it.
What was the purpose of using octopuses in this metaphor? Like, it seems you’ve piled on so many disanalogies to actual octopuses (extremely smart, many generations per year, they use Slack...) that you may as well just have said “AIs.”
I found it helpful because it put me in the frame of an alien biological intelligence rather than an AI, because I have lots of preconceptions about AIs and it’s easy to implicitly think in terms of expected utility maximizers or tools or whatever. Whereas if I’m imagining an octopus, I’m kind of imagining humans, but a bit weirder and more alien, and I would not trust humans.
I’m not making a strong claim this makes sense and I think people should mostly think about the AI case directly. I think it’s just another intuition pump and we can potentially be more concrete in the octopus case as we know the algorithm. (While in the AI case, we haven’t seen an ML algorithm that scales to human level.)
Someone thought it would be useful to quickly write up a note on my thoughts on scalable oversight research, e.g., research into techniques like debate or generally improving the quality of human oversight using AI assistance or other methods.
Broadly, my view is that this is a good research direction and I’m reasonably optimistic that work along these lines can improve our ability to effectively oversee somewhat smarter AIs which seems helpful (on my views about how the future will go).
I’m most excited for:
work using control-style adversarial analysis where the aim is to make it difficult for AIs to subvert the oversight process (if they were trying to do this)
work which tries to improve outputs in conceptually loaded hard-to-check cases like philosophy, strategy, or conceptual alignment/safety research (without necessarily doing any adversarial analysis and potentially via relying on generalization)
work which aims to robustly detect (or otherwise mitigate) reward hacking in highly capable AIs, particularly AIs which are capable enough that by default human oversight would often fail to detect reward hacks[1]
I’m skeptical of scalable oversight style methods (e.g., debate, IDA) actually being “scalable” in the sense of scaling to arbitrarily powerful models[2] and I think scalable oversight researchers should broadly be imagining targeting AIs at a human-ish or somewhat superhuman level of general capabilities (while they might still be very superhuman in narrower domains).
In other words, I think scalable oversight style work should focus on a regime like the regime we’re imagining targeting with AI control; this could be for controlling AIs, for getting more safety work out of AIs, or for making fully deferring to AI systems (at around this level of capability) more likely to go well.
See also our prior work Benchmarks for Detecting Measurement Tampering and the motivation we discuss in that linked post as well as this related project proposal I recently wrote. However, note that the linked documents mostly discuss using (sophisticated) methods relying on model internals to succeed without much human supervision which isn’t the sort of thing I’d most centrally call “scalable oversight”, though the term could be applied in this case.
On terminology, I prefer to say “recursive oversight” to refer to methods that leverage assistance from weaker AIs to oversee stronger AIs. IDA is a central example here. Like you, I’m skeptical of recursive oversight schemes scaling to arbitrarily powerful models.
However, I think it’s plausible that other oversight strategies (e.g. ELK-style strategies that attempt to elicit and leverage the strong learner’s own knowledge) could succeed at scaling to arbitrarily powerful models, or at least to substantially superhuman models. This is the regime that I typically think about and target with my work, and I think it’s reasonable for others to do so as well.
Presumably the term “recursive oversight” also includes oversight schemes which leverage assistance from AIs of similar strengths (rather than weaker AIs) to oversee some AI? (E.g., debate, recursive reward modeling.)
Note that I was pointing to a somewhat broader category than this which includes stuff like “training your human overseers more effectively” or “giving your human overseers better software (non-AI) tools”. But point taken.
Yeah, maybe I should have defined “recursive oversight” as “techniques that attempt to bootstrap from weak oversight to stronger oversight.” This would include IDA and task decomposition approaches (e.g. RRM). It wouldn’t seem to include debate, and that seems fine from my perspective. (And I indeed find it plausible that debate-shaped approaches could in fact scale arbitrarily, though I don’t think that existing debate schemes are likely to work without substantial new ideas.)
I appreciate that Anthropic’s model cards are now much more detailed than they used to be. They’ve especially improved in terms of details about CBRN evals (mostly biological risk).[1] They are substantially more detailed and informative than the model cards of other AI companies.
The Claude 3 model card gives a ~1/2 page description of their bio evaluations, though it was likely easy to rule out risk at this time. The Claude 3.5 Sonnet and 3.6 Sonnet model cards say basically nothing except that Anthropic ran CBRN evaluations and claims to be in compliance. The Claude 3.7 Sonnet model card contains much more detail on bio evals and the Claude 4 model card contains even more detail on the bio evaluations while also having a larger number of evaluations which are plausibly relevant to catastrophic misalignment risks. To be clear, some of the increase in detail might just be driven by increased capability, but the increase in detail is still worth encouraging.
Anthropic’s model cards ... are substantially more detailed and informative than the model cards of other AI companies.
My weakly-held cached take is: I agree on CBRN/bio (and of course alignment) and I think Anthropic is pretty similar to OpenAI/DeepMind on cyber and AI R&D (and scheming capabilities), at least if you consider stuff outside the model card (evals papers + open-sourcing the evals).
Thoughts on CBRN/bio results for the new OpenAI open-weight model (GPT-oss-120b).
My view is that releasing this model is net positive on fatalities reduced over the longer term, but the release kills substantial numbers of people in expectation due to bioterror (maybe ~2000 in absolute terms counterfactually, but more if OpenAI were the only actor or other actors were always at least as safe as OpenAI). See this earlier post for my general thoughts on the situation with open-weight models around the current level of capability.
I think it’s plausible but unlikely (perhaps 25% likely?) that in retrospect the model is well described as meeting the CBRN high threshold from OpenAI’s preparedness framework. I’d guess 35% likely for the same question for o3.
Overall, the model seems pretty close to as capable as o4-mini or o3, but OpenAI claims they did (more?) dataset filtering. It’s unclear if this dataset filtering did anything: the model doesn’t appear to have substantially degraded performance on any bio task, including ones where I would have hoped filtering matters (in particular, I’d hope you’d filter virology text in a way that greatly lowers VCT scores, but this doesn’t appear to be the case).
So, one big source of uncertainty is how well the data filtering worked. I’d guess it does very little based on my understanding of the situation (also, I think they say they did the same filtering for 4o?), but it seems reasonably likely (40%?) it has a big effect.
My overall view is that the release is net good and I think I tentatively support OpenAI releasing these open-weight models (and generally behaving in this way), but I’m reluctant to endorse it (especially in a strong way) because of the cost in lives.
Redwood in particular will probably find this release helpful for our research and I expect other organizations in a similar position will also be able to do better research because of this release.
The external safety expert feedback (see section 5.1.1 in the model card) seems good and I’m glad OpenAI did this. Their selection of experts seems reasonable and doesn’t feel cherry-picked to me. Getting expert feedback and expert opinions on the overall interpretation seems like a good thing to do going forward (as models get more capable and especially as evaluations get more uncertain and subjective).
That said, it’s unclear what the independent experts thought about the overall interpretation of the results: my sense is that after recommendations were adopted, these experts were probably mostly happy with the quality of the evaluations (at least putting aside much higher effort elicitation), but it’s unclear to me that these evaluations rule out CBRN high or rule out substantial CBRN fatalities caused by models of this capability level. (These evaluations do pretty strongly indicate that the model isn’t much better than other open weight models from a bio uplift perspective, but this is only one relevant question and the overall level of risk from models this capable is the more relevant question for avoiding a race to the bottom.)
I overall feel reasonably happy with the evaluations (given my limited understanding) and the direct presentation of the evaluations (in terms of the literal graphs), but my view is that it seems plausible that models of this capability level meet the criteria for CBRN high and I don’t buy that OpenAI’s evals rule this out. I’m also uncertain about the quality of the elicitation, but OpenAI seemingly did a reasonable job given only low/moderate effort elicitation (as in, without doing a bunch of bio elicitation R&D).
The model seems very, very benchmaxxed. Third-party testing on unconventional or private benchmarks ends up placing even the largest gpt-oss below o4-mini, below the largest Qwen releases, and in a few situations even below the newer ~30B Qwens. It isn’t super capable to begin with, and the frankly absurd rate at which this model hallucinates kills what little use it might have with tool use. I think this model poses next to zero risk because it just isn’t very capable.
Seems plausible, I was just looking at benchmarks/evals at the time and it could be that it’s much less capable than it appears.
(Worth noting that I think (many? most? all?) other competitive open source models are also at least somewhat benchmark maxed, so this alters the comparison a bit, but doesn’t alter the absolute level of capability question.)
All existing CBRN evals have quite a limited power to predict real bio risks from LLMs, since any work on creating and/or growing a bioweapon requires hands-on work in a lab with actual cells and viruses. As someone with experience working in a wet lab, I can say that this requires a very distinct set of skills from the ones that existing bio benchmarks measure, in part because these skills are often very hard to write down and measure. It’s often about knowing how to correctly hold the pipette to not contaminate a sample or how much force to apply while crushing cells. Benchmarks like VCT only measure a subset of necessary skills. (And smart people at SecureBio are absolutely aware of all this.)
AFAIK OpenAI partnered with Los Alamos lab to test LLMs helping with wet-lab work directly, by giving novices access to a lab and an LLM and observing how well they do. Would be excited to see the results of this experiment.
So to wrap up, I don’t believe that we truly know how much biorisk these models pose based on the existing evals.
It’s exciting to see OpenAI acknowledge that pre-training data filtering is a part of their safety stack. When it comes to advanced technical content, minimizing the model’s exposure to sensitive material seems pretty intuitive. However, it is difficult to draw any strong conclusions about data filtering effectiveness from this work, given the understandably few details. They do not indicate the effort invested, the volume of data removed, or the sophistication of their filtering pipeline. I expect a company could share far more details about this process without divulging trade secrets.
Was it public knowledge that they did data filtering for GPT-4o? I’ve been studying this space and was not aware of this. It’s also interesting that they’re using the “same” filtering pipeline a year later.
Quick takes (previously known as short forms) are often viewed via preview on the front page. This preview removes formatting and newlines for space reasons. So, if your title doesn’t end in a period, and especially if capitalization doesn’t clearly denote a sentence boundary (like in this case where the first sentence starts with “I”), then it might be confusing.
DeepSeek’s success isn’t much of an update on a smaller US-China gap in short timelines because security was already a limiting factor
Some people seem to have updated towards a narrower US-China gap around the time of transformative AI if transformative AI is soon, due to recent releases from DeepSeek. However, since I expect frontier AI companies in the US will have inadequate security in short timelines and China will likely steal their models and algorithmic secrets, I don’t consider the current success of China’s domestic AI industry to be that much of an update. Furthermore, if DeepSeek or other Chinese companies were in the lead and didn’t open-source their models, I expect the US would steal their models and algorithmic secrets. Consequently, I expect these actors to be roughly equal in short timelines, except in their available compute and potentially in how effectively they can utilize AI systems.
I do think that the Chinese AI industry looking more competitive makes security look somewhat less appealing (and especially less politically viable) and makes it look like their adaptation time to stolen models and/or algorithmic secrets will be shorter. Marginal improvements in security still seem important, and ensuring high levels of security prior to at least ASI (and ideally earlier!) is still very important.
Using the breakdown of capabilities I outlined in this prior post, the rough picture I expect is something like:
AIs that can 10x accelerate AI R&D labor: Security is quite weak (perhaps <=SL3 as defined in “Securing Model Weights”), so the model is easily stolen if relevant actors want to steal it. Relevant actors are somewhat likely to know AI is a big enough deal that stealing it makes sense, but AI is not necessarily considered a top priority.
Top-Expert-Dominating AI: Security is somewhat improved (perhaps <=SL4), but still pretty doable to steal. Relevant actors are more aware, and the model probably gets stolen.
Very superhuman AI: I expect security to be improved by this point (partially via AIs working on security), but effort on stealing the model could also plausibly be unprecedentedly high. I currently expect security implemented before this point to suffice to prevent the model from being stolen.
Given this, I expect that key early models will be stolen, including models that can fully substitute for human experts, and so the important differences between actors will mostly be driven by compute, adaptation time, and utilization. Of these, compute seems most important, particularly given that adaptation and utilization time can be accelerated by the AIs themselves.
This analysis suggests that export controls are particularly important, but they would need to apply to hardware used for inference rather than just attempting to prevent large training runs through memory bandwidth limitations or similar restrictions.
A factor that I don’t think people are really taking into account is: what happens if this situation goes hostile?
Having an ASI and x amount of compute for inference plus y amount of currently existing autonomous weapons platforms (e.g. drones)…
If your competitor has the same ASI, but 10x the compute but 0.2x the drones… And you get in a fight… Who wins?
What about 0.2x the compute and 10x the drones?
The funny thing about AI brinkmanship versus nuclear brinkmanship is that AI doesn’t drain your economic resources, it accelerates your gains. A nuclear stockpile costs you in maintenance. An AI and compute ‘stockpile’ makes you more money and more tech (including better AI and compute). Thus, there is a lot more incentive to race harder on AI than on nuclear.
If I were in the defense dept of a world power right now, I’d be looking for ways to make money from civilian uses of drones… Maybe underwriting drone deliveries for mail. That gives you an excuse to build drones and drone factories while whistling innocently.
Sometimes people talk about how AIs will be very superhuman at a bunch of (narrow) domains. A key question related to this is how much this generalizes. Here are two different possible extremes for how this could go:
It’s effectively like an attached narrow weak AI: The AI is superhuman at things like writing ultra fast CUDA kernels, but from the AI’s perspective, this is sort of like it has a weak AI tool attached to it (in a well integrated way) which is superhuman at this skill. The part which is writing these CUDA kernels (or otherwise doing the task) is effectively weak and can’t draw in a deep way on the AI’s overall skills or knowledge to generalize (likely it can shallowly draw on these in a way which is similar to the overall AI providing input to the weak tool AI). Further, you could actually break out these capabilities into a separate weak model that humans can use. Humans would use this somewhat less fluently as they can’t use it as quickly and smoothly due to being unable to instantaneously translate their thoughts and not being absurdly practiced at using the tool (like AIs would be), but the difference is ultimately mostly convenience and practice.
Integrated superhumanness: The AI is superhuman at things like writing ultra fast CUDA kernels via a mix of applying relatively general (and actually smart) abilities, having internalized a bunch of clever cognitive strategies which are applicable to CUDA kernels and sometimes to other domains, as well as domain specific knowledge and heuristics. (Similar to how humans learn.) The AI can access and flexibly apply all of the things it learned from being superhuman at CUDA kernels (or whatever skill) and with a tiny amount of training/practice it can basically transfer all these things to some other domain even if the domain is very different. The AI is at least as good at understanding and flexibly applying what it has learned as humans would be if they learned the (superhuman) skill to the same extent (and perhaps the AIs are actually much better at this than humans). You can’t separate these capabilities into a weak model: a weak model RL’d on this skill (or that the skill was distilled into) would either be much worse at CUDA or would need to actually be generally quite capable (rather than weak).
My sense is that the current frontier LLMs are much closer to (1) than (2) for most of their skills, particularly the skills which they’ve been heavily trained on (e.g. next token prediction or competitive programming). As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we’ll be much closer to (2). That said, it seems likely that powerful AIs built in the current paradigm[1] which otherwise match humans at downstream performance will somewhat lag behind humans in integrating/generalizing skills they learn (at least without spending a bunch of extra compute on skill integration) because this ability currently seems to be lagging behind other capabilities relative to humans and AIs can compensate for worse skill integration with other advantages (being extremely knowledgeable, fast speed, parallel training on vast amounts of relevant data including “train once, deploy many”, better memory, faster and better communication, etc).
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
People also disagree greatly about how much humans tend towards integration rather than non-integration, and how much human skill comes from domain transfer. And I think some / a lot of the beliefs about artificial intelligence are downstream of these beliefs about the origins of biological intelligence and human expertise, e.g., in the Yudkowsky/Ngo dialogues. (Object level: Both the LW-central and alternatives to the LW-central hypotheses seem insufficiently articulated; they operate as a background hypothesis too large to see rather than something explicitly noted, imo.)
People also disagree greatly about how much humans tend towards integration rather than non-integration, and how much human skill comes from domain transfer.
Makes me wonder whether most of what people believe to be “domain transfer” could simply be IQ.
I mean, suppose that you observe a person being great at X, then you make them study Y for a while, and it turns out that they are better at Y than an average person who spent the same time studying Y.
One observer says: “Clearly some of the skills at X have transferred to the skills of Y.”
Another observer says: “You just indirectly chose a smart person (by filtering for high skills at X), duh.”
This seems important to think about, I strong upvoted!
As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we’ll be much closer to (2).
I’m not sure that link supports your conclusion.
First, the paper is about AI understanding its own behavior. This paper makes me expect that a CUDA-kernel-writing AI would be able to accurately identify itself as being specialized at writing CUDA kernels, which doesn’t support the idea that it would generalize to non-CUDA tasks.
Maybe if you asked the AI “please list heuristics you use to write CUDA kernels,” it would be able to give you a pretty accurate list. This is plausibly more useful for generalizing, because if the model can name these heuristics explicitly, maybe it can also use the ones that generalize, if they do generalize. This depends on 1) the model is aware of many heuristics that it’s learned, 2) many of these heuristics generalize across domains, and 3) it can use its awareness of these heuristics to successfully generalize. None of these are clearly true to me.
Second, the paper only tested GPT-4o and Llama 3, so the paper doesn’t provide clear evidence that more capable AIs “shift some towards (2).” The authors actually call out in the paper that future work could test this on smaller models to find out if there are scaling laws—has anybody done this? I wouldn’t be too surprised if small models were also able to self-report simple attributes about themselves that were instilled during training.
Fair, but I think the AI being aware of its behavior is pretty continuous with being aware of the heuristics it’s using and ultimately generalizing these (e.g., in some cases the AI learns what code word it is trying to make the user say which is very similar to being aware of any other aspect of the task it is learning). I’m skeptical that very weak/small AIs can do this based on some other papers which show they fail at substantially easier (out-of-context reasoning) tasks.
I think most of the reason why I believe this is improving with capabilities is due to a broader sense of how well AIs generalize capabilities (e.g., how much does o3 get better at tasks it wasn’t trained on), but this paper was the most clearly relevant link I could find.
I’m not sure o3 does get significantly better at tasks it wasn’t trained on. Since we don’t know what was in o3’s training data, it’s hard to say for sure that it wasn’t trained on any given task.
To my knowledge, the most likely example of a task that o3 does well on without explicit training is GeoGuessr. But see this Astral Codex Ten post, quoting Daniel Kang:[1]
We also know that o3 was trained on enormous amounts of RL tasks, some of which have “verified rewards.” The folks at OpenAI are almost certainly cramming every bit of information with every conceivable task into their o-series of models! A heuristic here is that if there’s an easy to verify answer and you can think of it, o3 was probably trained on it.
I think this is a bit overstated, since GeoGuessr is a relatively obscure task, and implementing an idea takes much longer than thinking of it.[2] But it’s possible that o3 was trained on GeoGuessr.
The same ACX post also mentions:
On the other hand, the DeepGuessr benchmark finds that base models like GPT-4o and GPT-4.1 are almost as good as reasoning models at this, and I would expect these to have less post-training, probably not enough to include GeoGuessr
Do you have examples in mind of tasks that you don’t think o3 was trained on, but which it nonetheless performs significantly better at than GPT-4o?
I would guess that OpenAI has trained on GeoGuessr. It should be pretty easy to implement—just take images off the web which have location metadata attached, and train to predict the location. Plausibly getting good at Geoguessr imbues some world knowledge.
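A rough sketch of the pipeline being described; extract_lat_lon is a placeholder for whatever image-metadata parsing you’d actually use, and the target format is made up.

```python
# Sketch: build GeoGuessr-style (image -> location) training pairs from
# geotagged images. extract_lat_lon is a placeholder, and the text target
# format is arbitrary.
import json
from pathlib import Path

def extract_lat_lon(image_path: Path):
    """Placeholder: return (latitude, longitude) from image metadata, or None."""
    return None  # fill in with real EXIF/metadata parsing

def build_dataset(image_dir: str, out_path: str) -> None:
    examples = []
    for path in Path(image_dir).glob("*.jpg"):
        coords = extract_lat_lon(path)
        if coords is None:
            continue  # skip images without location metadata
        lat, lon = coords
        examples.append({
            "image": str(path),
            "target": f"lat={lat:.4f}, lon={lon:.4f}",  # text target to predict
        })
    Path(out_path).write_text("\n".join(json.dumps(e) for e in examples))
```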
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
I think this might be wrong when it comes to our disagreements, because I don’t disagree with this shortform.[1] Maybe a bigger crux is how valuable (1) is relative to (2)? Or the extent to which (2) is more helpful for scientific progress than (1)?
I don’t think this explains our disagreements. My low confidence guess is we have reasonably similar views on this. But, I do think it drives parts of some disagreements between me and people who are much more optimistic than me (e.g. various not-very-concerned AI company employees).
I agree the value of (1) vs (2) might also be a crux in some cases.
Is the crux that the more optimistic folks plausibly agree (2) is cause for concern, but believe that mundane utility can be reaped with (1), and they don’t expect us to slide from (1) into (2) without noticing?
I suppose that most tasks that an LLM can accomplish could theoretically be performed more efficiently by a dedicated program optimized for that task (and even better by a dedicated physical circuit). Hypothesis 1) amounts to considering that such a program, a dedicated module within the model, is established during training. This module can be seen as a weak AI used as a tool by the stronger AI. A bit like how the human brain has specialized modules that we (the higher conscious module) use unconsciously (e.g., when we read, the decoding of letters is executed unconsciously by a specialized module).
We can envision that at a certain stage the model becomes so competent in programming that it will tend to write code on the fly, as a tool, to solve most tasks that we might submit to it. In fact, I notice that this is already increasingly the case when I ask a question to a recent model like Claude Sonnet 3.7. It often generates code, as a tool, to try to answer me rather than trying to answer the question ‘itself.’ It clearly realizes that dedicated code will be more effective than its own neural network. This is interesting because in this scenario, the dedicated module is not generated during training but on the fly during normal production operation. In this way, it would be sufficient for AI to become a superhuman programmer to become superhuman in many domains thanks to the use of these tool-programs. The next stage would be the on-the-fly production of dedicated physical circuits (FPGA, ASIC, or alien technology), but that’s another story.
This refers to the philosophical debate about where intelligence resides: in the tool or in the one who created it? In the program or in the programmer? If a human programmer programs a superhuman AI, should we attribute this superhuman intelligence to the programmer? Same question if the programmer is itself an AI? It’s the kind of chicken and egg debate where the answer depends on how we divide the continuity of reality into discrete categories. You’re right that integration is an interesting criterion as it is a kind of formal / non arbitrary solution to this problem of defining discrete categories among the continuity of reality.
I agree with much of this post. I also have roughly 2032 medians to things going crazy, I agree learning on the job is very useful, and I’m also skeptical we’d see massive white collar automation without further AI progress.
However, I think Dwarkesh is wrong to suggest that RL fine-tuning can’t be qualitatively similar to how humans learn.
In the post, he discusses AIs constructing verifiable RL environments for themselves based on human feedback and then argues this wouldn’t be flexible and powerful enough to work, but RL could be used more similarly to how humans learn.
My best guess is that the way humans learn on the job is mostly by noticing when something went well (or poorly) and then sample efficiently updating (with their brain doing something analogous to an RL update). In some cases, this is based on external feedback (e.g. from a coworker) and in some cases it’s based on self-verification: the person just looking at the outcome of their actions and then determining if it went well or poorly.
So, you could imagine RL’ing an AI based on both external feedback and self-verification like this. And, this would be a “deliberate, adaptive process” like human learning. Why would this currently work worse than human learning?
Current AIs are worse than humans at two things, which makes RL (quantitatively) much worse for them:
Robust self-verification: the ability to correctly determine when you’ve done something well/poorly in a way which is robust to you optimizing against it.
Sample efficiency: how much you learn from each update (potentially leveraging stuff like determining what caused things to go well/poorly which humans certainly take advantage of). This is especially important if you have sparse external feedback.
But, these are more like quantitative than qualitative issues IMO. AIs (and RL methods) are improving at both of these.
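To make the shape of this concrete, here’s a toy sketch of the loop described above. Every function is a stand-in rather than a real training setup, and the two stand-ins that matter (self_verify and rl_update) correspond exactly to the two bottlenecks just listed.

```python
# Toy sketch of on-the-job RL: attempt a task, score the attempt via
# self-verification (or sparse external feedback when available), update.
import random

def self_verify(task, attempt) -> float:
    """Stand-in: the model grades its own attempt (robustness is bottleneck #1)."""
    return random.random()

def external_feedback(task, attempt):
    """Stand-in: occasional feedback from a user/coworker; usually None."""
    return None

def rl_update(policy_state, reward) -> None:
    """Stand-in for a weight update; sample efficiency is bottleneck #2."""
    policy_state["episodes"] += 1
    policy_state["avg_reward"] += (reward - policy_state["avg_reward"]) / policy_state["episodes"]

def continual_learning_loop(tasks):
    policy_state = {"episodes": 0, "avg_reward": 0.0}
    for task in tasks:
        attempt = f"attempt at {task}"  # stand-in for actually doing the task
        feedback = external_feedback(task, attempt)
        reward = feedback if feedback is not None else self_verify(task, attempt)
        rl_update(policy_state, reward)
    return policy_state
```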
All that said, I think it’s very plausible that the route to better continual learning routes more through building on in-context learning (perhaps through something like neuralese, though this would greatly increase misalignment risks...).
Some more quibbles:
For the exact podcasting tasks Dwarkesh mentions, it really seems like simple fine-tuning mixed with a bit of RL would solve his problem. So, an automated training loop run by the AI could probably work here. This just isn’t deployed as an easy-to-use feature.
For many (IMO most) useful tasks, AIs are limited by something other than “learning on the job”. At autonomous software engineering, they fail to match humans with 3 hours of time and they are typically limited by being bad agents or by being generally dumb/confused. To be clear, it seems totally plausible that for podcasting tasks Dwarkesh mentions, learning is the limiting factor.
Correspondingly, I’d guess the reason that we don’t see people trying more complex RL based continual learning in normal deployments is that there is lower hanging fruit elsewhere and typically something else is the main blocker. I agree that if you had human level sample efficiency in learning this would immediately yield strong results (e.g., you’d have very superhuman AIs with 10^26 FLOP presumably), I’m just making a claim about more incremental progress.
I think Dwarkesh uses the term “intelligence” somewhat atypically when he says “The reason humans are so useful is not mainly their raw intelligence. It’s their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task.” I think people often consider how fast someone learns on the job as one aspect of intelligence. I agree there is a difference between short feedback loop intelligence (e.g. IQ tests) and long feedback loop intelligence and they are quite correlated in humans (while AIs tend to be relatively worse at long feedback loop intelligence).
Dwarkesh notes “An AI that is capable of online learning might functionally become a superintelligence quite rapidly, even if there’s no algorithmic progress after that point.” This seems reasonable, but it’s worth noting that if sample efficient learning is very compute expensive, then this might not happen so rapidly.
I think AIs will likely overcome poor sample efficiency to achieve a very high level of performance using a bunch of tricks (e.g. constructing a bunch of RL environments, using a ton of compute to learn when feedback is scarce, learning from much more data than humans due to “learn once deploy many” style strategies). I think we’ll probably see fully automated AI R&D prior to matching top human sample efficiency at learning on the job. Notably, if you do match top human sample efficiency at learning (while still using a similar amount of compute to the human brain), then we already have enough compute for this to basically immediately result in vastly superhuman AIs (human lifetime compute is maybe 3e23 FLOP and we’ll soon be doing 1e27 FLOP training runs). So, either sample efficiency must be worse or at least it must not be possible to match human sample efficiency without spending more compute per data-point/trajectory/episode.
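Spelling out that arithmetic with the rough numbers above: 1e27 FLOP divided by ~3e23 FLOP per human lifetime is roughly 3,000 human lifetimes’ worth of learning from a single training run, which is why matching human sample efficiency at human-brain-like compute would more or less immediately imply vastly superhuman systems.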
I agree that robust self-verification and sample efficiency are the main things AIs are worse at than humans, and that this is basically just a quantitative difference. But what’s the best evidence that RL methods are getting more sample efficient (separate from AIs getting better at recognizing their own mistakes)? That’s not obvious to me but I’m not really read up on the literature. Is there a benchmark suite you think best illustrates that?
Two main things drive improvements in RL sample efficiency:
Better RL algorithms (including things that also improve pretraining sample efficiency like better architectures and optimizers).
Smarter models (in particular, smarter base models, though I’d also expect that RL in one domain makes RL in some other domain more sample efficient, at least after sufficient scale).
There isn’t great evidence that we’ve been seeing substantial improvements in RL algorithms recently, but AI companies are strongly incentivized to improve RL sample efficiency (as RL scaling appears to be yielding large returns) and there are a bunch of ML papers which claim to find substantial RL improvements (though it’s hard to be confident in these results for various reasons). So, we should probably infer that AI companies have made substantial gains in RL algorithms, but we don’t have public numbers. 3.7 sonnet was much better than 3.5 sonnet, but it’s hard to know how much of the gain was from RL algorithms vs from other areas.
Minimally, there is a long running trend in pretraining algorithmic efficiency and many of these improvements should also transfer somewhat to RL sample efficiency.
As far as evidence that smarter models learn more sample efficiently, I think the deepseek R1 paper has some results on this. Probably it’s also possible to find various pieces of support for this in the literature, but I’m more familiar with various anecdotes.
I agree with most of Dwarkesh’s post, with essentially the same exceptions you’ve listed.
I wrote about this recently in LLM AGI will have memory, and memory changes alignment, drawing essentially the conclusions you’ve given above. Continuous learning is a critical strength of humans and job substitution will be limited (but real) until LLM agents can do effective self-directed learning. It’s quite hard to say how fast that will happen, for the reasons you’ve given. Fine-tuning is a good deal like human habit/skill learning; the question is how well agents can select what to learn.
One nontrivial disagreement is on the barrier to long time horizon task performance. Humans don’t learn long time-horizon task performance primarily from RL. We learn in several ways at different scales, including learning new strategies which can be captured in language. All of those types of learning do rely on self-assessment and decisions about what’s worth learning, and those will be challenging and perhaps difficult to get out of LLM agents—although I don’t think there’s any fundamental barrier to squeezing workable judgments out of them, just some schlep in scaffolding and training to do it better.
Based on this logic, my timelines are getting longer in median (although rapid progress is still quite possible and we are far from prepared).
But I’m getting somewhat less pessimistic about the prospect of having incompetent autonomous agents with self-directed learning. These would probably both take over some jobs and display egregious misalignment. I think they’ll be a visceral wakeup call that has decent odds of getting society properly freaked out about human-plus AI with a little time left to prepare.
However, I think Dwarkesh is wrong to suggest that RL fine-tuning can’t be qualitatively similar to how humans learn. In the post, he discusses AIs constructing verifiable RL environments for themselves based on human feedback and then argues this wouldn’t be flexible and powerful enough to work, but RL could be used more similarly to how humans learn.
I think the key difference is as of right now, RL fine-tuning doesn’t change the AI’s weights after training, and continual learning is meant to focus on AI weight changes after the notional training period ends.
Are you claiming that RL fine-tuning doesn’t change weights? This is wrong.
Maybe instead you’re saying “no one ongoingly does RL fine-tuning where they constantly are updating the weights throughout deployment (aka online training)”. My response is: sure, but they could do this, they just don’t because it’s logistically/practically pretty annoying and the performance improvement wouldn’t be that high, at least without some more focused R&D on making this work better.
Maybe instead you’re saying “no one ongoingly does RL fine-tuning where they constantly are updating the weights throughout deployment (aka online training)”. My response is: sure, but they could do this, they just don’t because it’s logistically/practically pretty annoying and the performance improvement wouldn’t be that high, at least without some more focused R&D on making this work better.
Yes, you got me right. My claim here isn't that this work has super-high returns right now; sample efficiency and a lack of robustness to optimization against a verifier are currently the big issues. But I am claiming that, in practice, online training (constantly updating the weights throughout deployment) will be necessary for AI to automate important jobs like AI research, because most tasks require history and continuously updating on successes and failures rather than one-shotting the problem.
If you are correct that there is a known solution and it merely requires annoying logistical/practical work, then I'd accept short timelines as the default (modulo long-term memory issues in AI).
To expand on this, I also expect by default that something like long-term memory/state will be necessary: not having a memory means you have to relearn basic skills dozens of times, and this drastically lengthens the time to complete a task, to the point that it's not viable to use an AI instead of a human.
I think some long tasks are like a long list of steps that only require the output of the most recent step, and so they don't really need long context. AI improves at those just by becoming more reliable and making fewer catastrophic mistakes. On the other hand, some tasks need the AI to remember and learn from everything it's done so far, and that's where it struggles; see how Claude Plays Pokémon gets stuck in loops and has to relearn things dozens of times.
I haven’t read the METR paper in full, but from the examples given I’m worried the tests might be biased in favor of an agent with no capacity for long term memory, or at least not hitting the thresholds where context limitations become a problem:
For instance, task #3 here is at the limit of current AI capabilities (takes an hour). But it’s also something that could plausibly be done with very little context; if the AI just puts all of the example files in its context window it might be able to write the rest of the decoder from scratch. It might not even need to have the example files in memory while it’s debugging its project against the test cases.
Whereas a task to fix a bug in a large software project, while it might take an engineer associated with that project "an hour" to finish, requires stretching the limits of the amount of information an AI can fit inside a context window, or recall beyond what models seem to be capable of today.
One possibility I've wondered about is whether AI can automate this learning work: start from a transcript of someone trying to do things with AI, with mistakes and subsequent feedback, and then curate data from that transcript which works well for RL fine-tuning. Or even distill it into examples for in-context learning (which probably works somewhat well, sometimes, today).
Here are the 5 main categories of reasons why control could fail (though various of them are often correlated).
[Note: this document has a moderate amount of overlap with various prior documents Redwood has released, but it seemed like a useful document nonetheless.]
1. Insufficient (long horizon) sample efficiency: If your AI is scheming, you won't be able to get this AI to achieve near-maximal reward on average on (long horizon) tasks with reasonable sample efficiency. By near maximal, we just mean empirically achieving average reward which is quite close to the reward the AI would get if it was "actually trying" to get high reward (insofar as it is scheming to an extent that "trying" makes sense). (When we say average, we actually mean average excluding rare outliers. We can't (in general) control behavior on rare outliers using normal ML training.) For control, this causes issues in two places: it interferes with getting useful work out of AIs and with running accurate capability evaluations.
2. Auditing failures:
   a. Average case usefulness: For scheming AIs, you won't be able to check that the work they're doing is actually good (on average) rather than just looking good to you. Note that this includes getting enough actually good work out of AIs.
   b. Average case safety: Direct safety issues that occur due to auditing failing to notice a large number of actively dangerous actions, with these actions then building up into a catastrophic outcome. (E.g., the AI slowly persuades some group in the AI lab to become AI cultists.) One particularly salient special case is:
      i. We can't prevent AIs from building up huge amounts of coup science: Control will depend on human researchers being similarly good or better than the AIs at understanding our countermeasures and the space of possible AI attack strategies. See Notes on control evaluations for safety cases for more discussion.
   c. High-stakes safety: The auditing process fails to detect that a single action is very bad when it reviews that action, and this is extremely bad because that action is part of a high-stakes failure (e.g., an escape attempt that occurs in a small number of actions).
3. Concentrated failures: Inability to prevent concentrated catastrophes (safety failures which are concentrated into a small number of bad actions or a small amount of serial time) despite having an expensive and slow auditing process which could detect this bad behavior (it works for high-stakes safety) if it was assigned to review these actions.
4. Lack of institutional will or ability: Even if control would work given a few years of delay and an implementation carried out by a quite well-run organization, you won't get one of these in practice and thus you'll have trouble. (This argument also applies to any other technical approaches where the implementation requires substantial institutional will, therefore we need some sort of technical solution which requires less institutional competence.)
5. Catching the AI isn't sufficiently useful: The viability of control will likely depend on it being very useful to catch the AI one or a small number of times. But, this might turn out to be false for various reasons. See Catching AIs red-handed for discussion. This might depend heavily on questions which aren't about technical AI safety, like how useful legible evidence of serious misalignment is. But, the technical usefulness of catching could suffice.
We think that various of these things could go wrong in practice. However, it seems like the current point on the logistic success curve is such that if an AI lab tried very hard to evaluate and ensure control, they could improve the situation substantially (e.g., 3-5x reduction in doom) relative to not trying very hard.
Note that 2, 3, and 4 of the above failures probably just prevent us from building sufficiently safe and useful protocols without breaking control evaluations. 1 and 2.b.i (coup science build-up) are likely to break control evaluations, though it might be possible to recognize that we’re in a world where 1 and/or 2.b.i hold. 5 has more complex effects, but we’d probably be able to safely stop after realizing that we’re unable to extract enough value from the times when we’ve caught the AI; so the main concern would be that we’re somewhat clearly unable to safely continue rather than being caught dangerously unaware.
Here are some reasons why control might not be worth working on (or actively bad to work on) regardless of the above reasons:
Control is bad for coordination: Perhaps control does improve safety, but it does so at the cost of making it easier to make excuses, giving false confidence, and/or creating slippery slopes.
Control makes it cheaper and easier to do morally problematic things: Control might make it more commercially valuable and technically feasible to employ AIs in situations where we know these AIs don’t consent to this usage (and these AIs are well described as having preferences that deserve moral consideration), and this might make violations of basic moral principles (such as slavery) more likely. (In the same way as better technology for managing human insider threats could make human slavery more commercially valuable. Of course, in the case of human slavery, there are already institutions and norms which combat slavery.) However, note that our recommendation to society would be to try extremely hard to avoid employing AIs without their consent; if AIs don’t consent to employment we should (temporarily) shut down or negotiate a deal with the AIs. Much more commentary on these topics can be seen here. Our overall view is that the default AI situation with only current alignment/safety/interpretability technology is somewhat scary from an AI welfare/rights perspective, but that control doesn’t make the situation worse.
Scheming is extremely unlikely for early transformatively useful AI (<1%) and control-style approaches won’t be a useful safeguard against other threat models: Control focuses on making conservative assumptions about the effects of training and the motivations of AIs. If we didn’t have to make such conservative assumptions, then control-type approaches look much less worthwhile. However, control-type things might still be a useful line of defense against failure modes which aren’t well described as scheming.
Baseline control strategies will work (when control is at all possible) and control evaluations are easy to run: If the baseline approaches can’t be improved on and there isn’t important research on the evaluations, then technical work on control isn’t very important. Policy/governance work might still be important. See also here.
It seems slightly bad that you can vote on comments under the Quick Takes and Popular Comments sections without needing to click on the comment and without leaving the frontpage.
This is different from posts, where in order to vote you have to actually click on the post and leave the main page (which perhaps requires more activation energy).
I’m unsure, but I think this could result in a higher frequency of cases where Quick Takes and comments go viral in problematic ways relative to posts. (Based on my vague sense from the EA forum.)
This doesn’t seem very important, but I thought it would be worth quickly noting.
I feel like Quick Takes and Popular Comments on the EA forum both do a good job of showing me content I’m glad to have seen (and in the case of the EA forum that I wouldn’t have otherwise seen) and of making the experience of making a shortform better (in that you get more and better engagement). So far, this also seems fairly true on LessWrong, but we’re presumably out of equilibrium.
Overall, I feel like posts on LessWrong are too long, and so am particularly excited about finding my way to things that are a little shorter. (It would also be cool if we could figure out how to do rigour well or other things that benefit from length, but I don’t feel like I’m getting that much advantage from the length yet).
My guess is that the problems you gesture at will be real and the feature will still be net positive. I wish there were a way to “pay off” the bad parts with some of the benefit; probably there is and I haven’t thought of it.
Yeah, I was a bit worried about this, especially for popular comments. For quick takes, it does seem like you can see all the relevant context before voting, and you were also able to already vote on comments from the Recent Discussion section (though that section isn't sorted by votes, so it's at less risk of viral dynamics). I do feel a bit worried that popular comments will get a bunch of votes without people reading the underlying posts they're responding to.
Some people seem to think my timelines have shifted a bunch while they’ve only moderately changed.
Relative to my views at the start of 2025, my median (50th percentile) for AIs fully automating AI R&D was pushed back by around 2 years—from something like Jan 2032 to Jan 2034. My 25th percentile has shifted similarly (though perhaps more importantly) from maybe July 2028 to July 2030. Obviously, my numbers aren’t fully precise and vary some over time. (E.g., I’m not sure I would have quoted these exact numbers for this exact milestone at the start of the year; these numbers for the start of the year are partially reverse engineered from this comment.)
Fully automating AI R&D is a pretty high milestone; my current numbers for something like “AIs accelerate AI R&D as much as what would happen if employees ran 10x faster (e.g. by ~fully automating research engineering and some other tasks)” are probably 50th percentile Jan 2032 and 25th percentile Jan 2029.[1]
I’m partially posting this so there is a record of my views; I think it’s somewhat interesting to observe this over time. (That said, I don’t want to anchor myself, which does seem like a serious downside. I should slide around a bunch and be somewhat incoherent if I’m updating as much as I should: my past views are always going to be somewhat obviously confused from the perspective of my current self.)
What are your very long timeline expectations, for 2045+ or 2055+ AGI (automated AI R&D, sure)? That’s where I expect most of the rare futures with humanity not permanently disempowered to be, though the majority even of these long timelines will still result in permanent disempowerment (or extinction).
I think it takes at least about 10 years to qualitatively transform an active field of technical study or change the social agenda, so 2-3 steps of such change might have a chance of sufficiently reshaping how the world thinks about AI x-risk and what technical tools are available for shaping minds of AGIs, in order to either make a human-initiated lasting Pause plausible, or to have the means of aligning AGIs in an ambitious sense.
I appreciate you saying that the 25th percentile timeline might be more important. I think that’s right and underappreciated.
One of your recent (excellent) posts also made me notice that AGI timelines probably aren’t normally distributed. Breakthroughs, other large turns of events, or large theoretical misunderstandings at this point probably play a large role, and there are probably only a very few of those that will hit. Small unpredictable events that create normal distributions will play a lesser role.
I don’t know how you’d characterize that mathematically, but I don’t think it’s right to assume it’s normally distributed, or even close.
Back to your comment on the 25th percentile being important: I think there’s a common error where people round to the median and then think “ok, that’s probably when we need to have alignment/strategy figured out.” You’d really want to have it at least somewhat ready far earlier.
That’s both in case it’s on the earlier side of the predicted distribution, and because alignment theory and practice need to be ready far enough in advance of game time to have diffused and be implemented for the first takeover-capable model.
I've been thinking of writing a post called something like "why are so few people frantic about alignment?" making those points. Stated timeline distributions don't seem to match mood IMO and I'm trying to figure out why. I realize that part of it is a very reasonable "we'll figure it out when/if we get there." And perhaps others share my emotional dissociation from my intellectual expectations. But maybe we should all be a bit more frantic. I'd like some more halfassed alignment solutions in play and under discussion right now. The 80/20 rule probably applies here.
I did have similar timelines, but I have generally moved them forward to ~2027-2028 after updating my priors based on X posts from OpenAI/xAI employees. They are generally confident that post-training scaleups will accelerate timelines, and state that post-training can be scaled, with returns to match, well beyond pre-training. As they have inside information, I am inclined to trust them. Insiders like Anthropic's Jack Clark are even bolder: he says that superintelligent machines will be available in late 2026.
Some of my proposals for empirical AI safety work.
I sometimes write proposals for empirical AI safety work without publishing the proposal (as the doc is somewhat rough or hard to understand).
But, I thought it might be useful to at least link my recent project proposals publicly in case someone would find them useful.
I’ve actually already linked this (and the proposal in the next bullet) publicly from “7+ tractable directions in AI control”, but I thought it could be useful to link here as well.
About 1 year ago, I wrote up a ready-to-go plan for AI safety focused on current science (what we roughly know how to do right now). This is targeting reducing catastrophic risks from the point when we have transformatively powerful AIs (e.g. AIs similarly capable to humans).
I never finished this doc, and it is now considerably out of date relative to how I currently think about what should happen, but I still think it might be helpful to share.
Here is the doc. I don’t particularly want to recommend people read this doc, but it is possible that someone will find it valuable to read.
I plan on trying to think through the best ready-to-go plan roughly once a year. Buck and I have recently started work on a similar effort. Maybe this time we'll actually put out an overall plan rather than just spinning off various docs.
This seems like a great activity, thank you for doing/sharing it. I disagree with the claim near the end that this seems better than Stop, and in general felt somewhat alarmed throughout at (what seemed to me like) some conflation/conceptual slippage between arguments that various strategies were tractable, and that they were meaningfully helpful. Even so, I feel happy that the world contains people sharing things like this; props.
I disagree with the claim near the end that this seems better than Stop
At the start of the doc, I say:
It’s plausible that the optimal approach for the AI lab is to delay training the model and wait for additional safety progress. However, we’ll assume the situation is roughly: there is a large amount of institutional will to implement this plan, but we can only tolerate so much delay. In practice, it’s unclear if there will be sufficient institutional will to faithfully implement this proposal.
Towards the end of the doc I say:
This plan requires quite a bit of institutional will, but it seems good to at least know of a concrete achievable ask to fight for other than “shut everything down”. I think careful implementation of this sort of plan is probably better than “shut everything down” for most AI labs, though I might advocate for slower scaling and a bunch of other changes on current margins.
Presumably, you’re objecting to ‘I think careful implementation of this sort of plan is probably better than “shut everything down” for most AI labs’.
My current view is something like:
If there was broad, strong, and durable political will and buy in for heavily prioritizing AI takeover risk in the US, I think it would be good if the US government shut down scaling and took strong actions to prevent frontier AI progress while also accelerating huge amounts of plausibly-safety-related research.
You’d need to carefully manage the transition back to scaling to reduce hardware overhang issues. This is part of why I think “durable” political will is important. There are various routes for doing this with different costs.
I'm sympathetic to thinking this doesn't make sense if you just care about deaths prior to age 120 of currently alive people and widespread cryonics is hopeless (even conditional on the level of political support for mitigating AI takeover risk). Some other views which just care about achieving close to normal lifespans for currently alive humans also maybe aren't into pausing.
Regulations/actions which have the side effect of slowing down scaling but which aren't part of a broad package seem way less good. This is partially due to hardware/algorithmic overhang concerns, but more generally due to follow-through concerns. I also wouldn't advocate such regulations (regulations whose main impact is slowing down scaling as a side effect) due to integrity/legitimacy concerns.
Unilateral shutdown is different as advice than “it would be good if everyone shut down” because AI companies might think (correctly or not) that they would be better on safety than other companies. In practice, no AI lab seems to have expressed a view which is close to “acceleration is bad” except Anthropic (see core views on AI safety).
We are very, very far from broad, strong, and durable political will for heavily prioritizing AI takeover, so weaker and conditional interventions for actors on the margin seem useful.
Nathan Lambert recently wrote a piece about why he doesn’t expect a software-only intelligence explosion. I responded in this substack comment which I thought would be worth copying here.
As someone who thinks a rapid (software-only) intelligence explosion is likely, I thought I would respond to this post and try to make the case in favor. I tend to think that AI 2027 is a quite aggressive, but plausible scenario.
I interpret the core argument in AI 2027 as:
We’re on track to build AIs which can fully automate research engineering in a few years. (Or at least, this is plausible, like >20%.) AI 2027 calls this level of AI “superhuman coder”. (Argument: https://ai-2027.com/research/timelines-forecast)
Once you have superhuman coders, unassisted humans would only take a moderate number of years to make AIs which can automate all of AI research (“superhuman AI researcher”), like maybe 5 years. (Argument: https://ai-2027.com/research/takeoff-forecast#human-only-timeline-from-sc-to-sar). And, because of AI R&D acceleration along the way, this will actually happen much faster.
Superhuman AI researchers can obsolete and outcompete humans at all AI research. This includes messy data research, discovering new paradigms, and whatever humans might be doing which is important. These AIs will also be substantially faster and cheaper than humans, like fast and cheap enough that with 1/5 of our compute we can run 200k parallel copies each at 40x speed (for ~8 million parallel worker equivalents; see the quick arithmetic after this list). Because labor is a key input into AI R&D, these AIs will be able to speed things up by ~25x. (Argument: https://ai-2027.com/research/takeoff-forecast#sar-would-25x-ai-randd)
These speed ups don’t stop just above the human range, they continue substantially beyond the human range, allowing for quickly yielding AIs which are much more capable than humans. (AI 2027 doesn’t really argue for this, but you can see here: https://ai-2027.com/research/takeoff-forecast#siar-would-250x-ai-randd)
(There are a few other minor components like interpolating the AI R&D multipliers between these milestones and the argument for being able to get these systems to run at effective speeds which are much higher than humans. It’s worth noting that these numbers are more aggressive than my actual median view, especially on the timeline to superhuman coder.)
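For concreteness, here is the simple arithmetic behind the parallel-worker-equivalent figure in point (3). These are AI 2027's stated numbers, not my own estimates:

```python
# The arithmetic behind "~8 million parallel worker equivalents"
# (inputs are AI 2027's stated figures, not independent estimates).
parallel_copies = 200_000   # copies run on ~1/5 of the compute stock
speedup_vs_human = 40       # each copy runs at ~40x human speed

worker_equivalents = parallel_copies * speedup_vs_human
print(f"{worker_equivalents:,} parallel worker equivalents")  # 8,000,000
```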
Ok, now what is your counterargument?
Sections (1) and (2) don’t seem in conflict with the AI 2027 view or the possibility of a software-only intelligence explosion. (I’m not claiming you intended these to be in conflict, just clarifying why I’m not responding. I assume you have these sections as relevant context for later arguments.)
As far as I can tell, section (3) basically says “You won’t get large acceleration just via automating implementation and making things more compute efficient. Also, ML research is messy and thus it will be hard to get AI systems which can actually fully automate this.”
AI 2027 basically agrees with these literal words:
The acceleration from superhuman coders which fully automate research engineering (and run at high speeds) is “only” 5x, and to get the higher 25x (or beyond) acceleration, the AIs needed to be as good as the best ML researchers and fully automate ML research.
AI 2027 thinks it would take ~5 years of unaccelerated human AI R&D progress to get from superhuman coder to superhuman AI researcher, so the forecast incorporates this being pretty hard.
There is presumably a quantitative disagreement: you probably think that the acceleration from superhuman coders is lower and the unassisted research time to go from superhuman coders to superhuman AI researchers is longer (much more than 5 years?).
(FWIW, I think AI 2027 is probably too bullish on the superhuman coder AI R&D multiplier, I expect more like 3x.)
I’ll break this down a bit further.
Importantly, I think the claim:
For machine learning research to accelerate at these rates, it needs to be entirely bottlenecked by compute efficiency and implementation difficulty.
Is probably missing the point: AI 2027 is claiming that we’ll get these large multipliers by being able to make AIs which beat humans at the overall task of AI research! So, it just needs to be the case that machine learning research could be greatly accelerated by much faster (and eventually very superhuman) labor. I think the extent of this acceleration and the returns are an open question, but I don’t think whether it is entirely bottlenecked by compute efficiency and implementation difficulty is the crux. (The extent to which compute efficiency and implementation difficulty bottleneck AI R&D is the most important factor for the superhuman coder acceleration, but not the acceleration beyond this.)
As far as the difficulty of making a superhuman AI researcher, I think your implicit vibes are something like “we can make AIs better and better at coding (or other easily checkable tasks), but this doesn’t transfer to good research taste and intuitions”. I think the exact quantitative difficulty of humans making AIs which can automate ML research is certainly an open question (and it would be great to get an overall better understanding), but I think there are good reasons to expect the time frame for unassisted humans (given a fixed compute stock) to be more like 3-10 years than like 30 years:
ML went from AlexNet to GPT-4 in 10 years! Fast progress has happened historically. To be clear, this was substantially driven by compute scaling (for both training runs and experiments), but nonetheless, research progress can be fast. The gap from superhuman coder to superhuman AI research intuitively feels like a much smaller gap than the gap from AlexNet to GPT-4.
For success on harder to check tasks, there are hopes for both extending our ability to check and generalization. See this blog post for more discussion. Concretely, I think we can measure whether AIs produce some research insight that ended up improving performance and so we can RL on this, at least at small scale (which might transfer). We can also train on things like forecasting the results of training runs etc. Beyond this, we might just be able to get AIs to generalize well: many of current AI capabilities are due to reasonably far generalization and I expect this to continue.
As far as section (4), I think the core argument is “solving real-world domains (e.g., things which aren’t software only) will be hard”. You might also be arguing that RL will struggle to teach AIs to be good ML researchers (it’s a bit hard for me to tell how applicable your argument is to this).
I think the biggest response to the concern with real world domains is “the superhuman AI researchers can figure this out even if it would have taken humans a while”. As far as RL struggling to teach AIs to be good scientists (really ML researchers in particular), I think this is addressed above, though I agree this is a big unknown. Note that we can (and will) explore approaches that differ some from the current RL paradigm and acceleration from superhuman coders might make this happen a bunch faster.
In reality, with the large sums of capital at play, it is unlikely that labs give free rein to billions of dollars of compute to so called “AI researchers in the datacenter” because of how constrained compute is at all of the top labs.
I find this argument somewhat wild if you assume that the AI researchers in the datacenter are making extremely fast progress. Insofar as the software only singularity works, the AI researchers in the datacenter can rapidly improve efficiency using algorithmic progress saving far more compute than you spent to run them. Currently, inference costs for a given level of performance drop by around 9-40x per year depending on the exact task. (See the graph you included.) So, if AIs can make the progress happen much faster, it will easily pay for itself on inference costs alone. That said, I don’t expect this is the largest source of returns...
There will be diminishing returns, but if the software only singularity ends up being viable, then these diminishing returns will be outpaced by ever more effective and smarter AI researchers in the datacenter meaning progress wouldn’t slow until quite high limits are reached.
I assume your actual crux is about the returns to AI R&D at the point of full (or substantial) automation by AIs. You think the returns are low enough that companies will (rationally) invest elsewhere.
I agree that the fact that AI companies aren’t investing all their compute into AI R&D is some update about the returns to algorithms progress, but the regime might differ substantially after full automation by AI researchers!
(And, I think the world would probably be better if AI companies all avoid doing a rapid intelligence explosion internally!)
Minimally, insofar as AI companies are racing as fast as possible, we'd expect that whatever route the companies take is the one they think is fastest! So, if there is another route that is faster than AI accelerating AI R&D, fair enough, companies would do that instead; but if the AI-automating-AI-R&D story is insanely fast, then this other route must be even faster.
It seems like rather than talking about software-only intelligence explosions, we should be talking about different amounts of hardware vs. software contributions. There will be some of both.
I find a software-mostly intelligence explosion to be quite plausible, but it would take a little longer on average than a scenario in which hardware usage keeps expanding rapidly.
I think it might be worth quickly clarifying my views on activation addition and similar things (given various discussion about this). Note that my current views are somewhat different than some comments I've posted in various places (and my comments are somewhat incoherent overall), because I've updated based on talking to people about this over the last week.
This is quite in the weeds and I don’t expect that many people should read this.
It seems like activation addition sometimes has a higher level of sample efficiency in steering model behavior compared with baseline training methods (e.g. normal LoRA finetuning). These comparisons seem most meaningful in straightforward head-to-head comparisons (where you use both methods in the most straightforward way). I think the strongest evidence for this is in Liu et al..
Contrast pairs are a useful technique for variance reduction (to improve sample efficiency), but may not be that important (see Liu et al. again). It’s relatively natural to capture this effect using activation vectors, but there is probably some nice way to incorporate this into SGD. Perhaps DPO does this? Perhaps there is something else?
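For concreteness, here is a minimal sketch of the sort of contrast-pair activation addition I have in mind. The model choice, layer index, scale, and prompts are all illustrative placeholders rather than tuned settings from any particular paper:

```python
# Minimal sketch of activation addition from a contrast pair.
# Assumptions: GPT-2 via Hugging Face transformers; layer, scale, and
# prompts are illustrative, not tuned values from any particular paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
layer = model.transformer.h[6]  # a middle block of GPT-2

def mean_residual(text):
    # Mean residual-stream activation at the chosen layer for this text.
    acts = {}
    def hook(module, inp, out):
        acts["h"] = out[0]  # GPT-2 blocks return a tuple; [0] is hidden states
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return acts["h"].mean(dim=1)  # average over token positions

# Steering vector = difference of means over a contrast pair.
steer = mean_residual("I love talking about weddings.") - \
        mean_residual("I hate talking about weddings.")

def add_steering(module, inp, out):
    # Add the (scaled) steering vector to the layer's output hidden states.
    return (out[0] + 4.0 * steer,) + out[1:]  # scale chosen by eye

handle = layer.register_forward_hook(add_steering)
ids = tok("I went to the park and", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()
```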
Activation addition works better than naive few-shot prompting in some cases, particularly in cases where the way you want to steer the model isn’t salient/obvious from the few-shot prompt. But it’s unclear how it performs in comparison to high-effort prompting. Regardless, activation addition might work better “out of the box” because fancy prompting is pretty annoying.
I think training models to (in general) respond well to “contrast prompting”, where you have both positive and negative examples in the prompt, might work well. This can be done in a general way ahead of time and then specific tuning can just use this format (the same as instruction finetuning). For a simple prompting approach, you can do “query_1 Good response: pos_ex_1, Bad response: neg_ex_1, query_2 Good response: pos_ex_2, Bad response: neg_ex_2, …” and then prompt with “actual_query Good response:”. Normal pretrained models might not respond well to this prompt by default, but I haven’t checked.
I think we should be able to utilize the fact that activation addition works to construct a tweaked inductive bias for SGD on transformers (which feels like a more principled approach from my perspective if the goal is better sample efficiency). More generally, I feel like we should be able to do ablations to understand why activation addition works and then utilize this separately.
I expect there are relatively straightforward ways to greatly improve sample efficiency of normal SGD via methods like heavy proliferation of training data. I also think “really high sample efficiency from a small number of samples” isn’t very solved at the moment. I think if we’re really going for high sample efficiency from a fixed training dataset we should be clear about that and I expect a variety of very different approaches are possible.
The advantages versus prompting in terms of inference time performance improvement (because activation engineering means you have to process fewer tokens during inference) don’t seem very important to me because you can just context distill prompts.
For cases where we want to maximize a metric X and we don’t care about sample efficiency or overfitting to X, we can construct a big dataset of high quality demonstrations, SFT on these demos, and then RL against our metric. If we do this in a well tuned way for a large number of samples such that sample efficiency isn’t much of an issue (and exploration problems are substantial), I would be pretty surprised if activation addition or similar approaches can further increase metric X and “stack” with all the RL/demos we did. That is, I would be surprised if this is true unless the activation engineering is given additional affordances/information not in our expert demos and the task isn’t heavily selected for activation engineering stacking like this.
It’s unclear how important it is to (right now) work on sample efficiency specifically to reduce x-risk. I think it seems not that important, but I’m unsure. Better sample efficiency could be useful for reducing x-risk because better sample efficiency could allow for using a smaller amount of oversight data labeled more carefully for training your AI, but I expect that sample efficiency will naturally be improved to pretty high levels by standard commercial/academic incentives. (Speculatively, I also expect that the marginal returns will just generally look better for improving oversight even given equal effort or more effort on oversight basically because we’ll probably need a moderate amount of data anyway and I expect that improving sample efficiency in the “moderate data” regime is relatively harder.) One exception is that I think that sample efficiency for very low sample-count cases might not be very commercially important, but might be highly relevant for safety after catching the AI doing egregiously bad actions and then needing to redeploy. For this, it would be good to focus on sample efficiency specifically in ways which are analogous to training an AI or a monitor after catching an egregiously bad action.
The fact that activation addition works reveals some interesting facts about the internals of models; I expect there are some ways to utilize something along these lines to reduce x-risk.
In principle, you could use activation additions or similar editing techniques to learn non-obvious facts about the algorithms which models are implementing via running experiments where you (e.g.) add different vectors at different layers and observe interactions (interpretability). For this to be much more powerful than black-box model psychology or psychology/interp approaches which use a bit of training, you would probably need to try edits at many different layers and map out an overall algorithm. (And/or do many edits simultaneously.)
I think “miscellaneous interventions on internals” for high-level understanding of the algorithms that models are implementing seems promising in principle, but I haven’t seen any work in this space which I’m excited about.
I think activation addition could have “better” generalization in some cases, but when defining generalization experiments, we need to be careful about the analogousness of the setup. It’s also unclear what exact properties we want for generalization, but minimally being able to more precisely pick and predict the generalization seems good. I haven’t yet seen evidence for “better” generalization using activation addition which seems compelling/interesting to me. Note that we should use a reasonably sized training dataset, or we might be unintentionally measuring sample efficiency (in which case, see above). I don’t really see a particular reason why activation addition would result in “better” generalization beyond having some specific predictable properties which maybe mean we can know about cases where it is somewhat better (cases where conditioning on something is better than “doing” that thing?).
I’m most excited for generalization work targeting settings where oversight is difficult and a weak supervisor would make mistakes that result in worse policy behavior (aka weak-to-strong generalization). See this post and this post for more discussion of the setting I’m thinking about.
I generally think it’s important to be careful about baselines and exactly what problem we’re trying to solve in cases where we are trying to solve a specific problem as opposed to just doing some open-ended exploration. TBC, open ended exploration is fine, but we should know what we’re doing. I often think that when you make the exact problem you’re trying to solve and what affordances you are and aren’t allowed more clear, a bunch of additional promising methods become apparent. I think that the discussion I’ve seen so far of activation engineering (e.g. in this post, why do we compare to finetuning in a generalization case rather than just directly finetuning on what we want? Is doing RL to increase how sycophantic the model is in scope?) has not been very precise about what problem it’s trying to solve or what baseline techniques it’s claiming to outperform.
I don’t currently think the activation addition stuff is that important in expectation (for reducing x-risk) due to some of the beliefs I said above (though I’m not very confident). I’d be most excited about the “understand the algorithms the model is implementing” application above or possibly some generalization tests. This view might depend heavily on general views about threat models and how the future will go. Regardless, I ended up commenting on this a decent amount, so I thought it would be worth the time to clarify my views.
Probably this implies a reasonably high fraction of total tokens in typical OpenAI training datasets are chess games!
Despite chess being 4% of documents sampled, chess is probably less than 4% of overall tokens (Perhaps 1% of tokens are chess? Perhaps less?)
Because chess consists of shorter documents than other types of documents, it’s likely that chess is a higher fraction of documents than of tokens.
(To understand this, imagine that just the token "Hi" and nothing else was 1/2 of documents. This would still be a small fraction of tokens as most other documents are far longer than 1 token.)
(My title might be slightly clickbait because it’s likely chess is >1% of documents, but might be <1% of tokens.)
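A toy calculation of the gap between document share and token share (the document counts and lengths here are made up purely for illustration):

```python
# Toy illustration of why document share can far exceed token share.
# All counts/lengths are made up for illustration, not estimates of the real mix.
chess_docs, chess_tokens_per_doc = 4, 200        # short game transcripts
other_docs, other_tokens_per_doc = 96, 2_000     # typical longer documents

doc_share = chess_docs / (chess_docs + other_docs)
token_share = (chess_docs * chess_tokens_per_doc) / (
    chess_docs * chess_tokens_per_doc + other_docs * other_tokens_per_doc
)
print(f"document share: {doc_share:.0%}, token share: {token_share:.2%}")
# document share: 4%, token share: 0.41%
```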
It’s also possible that these models were fine-tuned on a data mix with more chess while pretrain had less chess.
This was written in response to a conversation I had with Vivek Hebbar about some strange implications of UDASSA-like views. In particular:
Is it better to spend your resources simulating a happy life or writing down programs which would simulate a bunch of happy lives? Naively the second feels worthless, but there seemingly are some unintuitive reasons why clever approaches for doing the latter could be better.
Alien civilizations might race to the bottom by spending resources making their civilization easier to point at (and thus higher measure in the default UDASSA perspective).
but there seemingly are some unintuitive reasons why clever approaches for doing the latter could be better
Because there might be some other programs with a lot of computational resources which scan through simple universes looking for programs to run?
Alien civilizations might race to the bottom by spending resources making their civilization easier to point at
This one doesn't feel so unintuitive to me since "easy to point at" is somewhat like "has a lot of copies", so it's kinda similar to the biological desire to reproduce (though I assume you're alluding to another motivation for becoming easy to point at?)
Because there might be some other programs with a lot of computational resources which scan through simple universes looking for programs to run?
More like: because there is some chance that the actual laws of physics in our universe execute data as code sometimes (similar to how memory unsafe programs often do this). (And this is either epiphenomenal or it just hasn’t happened yet.) While the chance is extremely small, the chance could be high enough (equivalently, universes with this property have high enough measure) and the upside could be so much higher that betting on this beats other options.
Alien civilizations might race to the bottom by spending resources making their civilization easier to point at (and thus higher measure in the default UDASSA perspective).
This may also be a reason for AIs to simplify their values (after they’ve done everything possible to simplify everything else).
Eric Schwitzgebel has argued by disjunction/exhaustion for the necessity of craziness in the sense of “contrary to common sense and we are not epistemically compelled to believe it” at least in the context of philosophy of mind and cosmology.
Maybe someone should make an inbox for incidents of ChatGPT psychosis.
Currently, various people receive many emails or other communications from people who appear to exhibit ChatGPT psychosis: they seem (somewhat) delusional and this appears to have been driven (or at least furthered) by talking with ChatGPT or other chatbots. It might be helpful to create an inbox where people can easily forward these emails. The hope would be that the people who currently receive a ton of these emails could just forward these along (at least to the extent this is easy).[1]
This inbox would serve a few main functions:
Collect data for people interested in studying the phenomenon (how it’s changing over time, what it looks like, finding affected people to talk to).
There could be some (likely automated) system for responding to these people and attempting to help them. Maybe you could make an LLM-based bot which responds to these people and tries to talk them down in a healthy way. (Of course, it would be important that this bot wouldn’t just feed into the psychosis!) In addition to the direct value, this could be an interesting test case for applying AI to improve epistemics (in this case, a very particular type of improving epistemics, but results could transfer). It seems possible (though pretty unlikely) that something like ChatGPT psychosis grows to have substantial influence on the overall epistemic environment, in which case building defenses could be very important.
Some communication about the results from the inbox could be regularly sent to the relevant AI companies. It seems generally useful if the situation is made clear to relevant AI companies.[2]
I’m not going to create this inbox for a few reasons, but I think this could be a worthwhile project, especially for someone already interested in applying AI to improve epistemics in cases like this.
Some possible risks/difficulties of this project:
The person doing the project would probably need to be somewhat trustworthy for people to actually forward along emails. That said, someone could start a version which just looks through rejected LessWrong posts, finds likely cases of ChatGPT psychosis, and then collects this data and responds to people etc.
It’s pretty plausible that someone running this service would be blamed for any flaws in the service even if the service makes the situation almost strictly better. See also The Copenhagen Interpretation of Ethics.
It might be worthwhile to make a nice automated filter people can easily apply which highlights emails as likely cases of ChatGPT psychosis so that forwarding along emails is easier. I’m not sure how to best do this, and people presumably would want to manually check prior to forwarding emails to an inbox which is collecting data!
I’m worried that when people say gradual disempowerment they often mean “some scenario in which humans are disempowered gradually over time”, but many readers will interpret this as “the threat model in the paper called ‘Gradual Disempowerment’”. These things can differ substantially and the discussion in this paper is much more specific than encompassing all scenarios in which humans slowly are disempowered!
(You could say “disempowerment which is gradual” for clarity.)
I think when people use the term “gradual disempowerment” predominantly in one sense, people will also tend to understand it in that sense. And I think that sense will be rather literal and not the one specifically of the original authors. Compare the term “infohazard” which is used differently (see comments here) from how Yudkowsky was using it.
(You could say “disempowerment which is gradual” for clarity.)
I feel like there is a risk of this leading to a never-ending sequence of meta-communication concerns. For instance, what if a reader interprets "gradual" to mean taking more than 10 years, but the writer thought 5 would be sufficient for "gradual" (and see timelines discussions around stuff like continuity for how this keeps going)? Or if the reader assumes "disempowerment" means complete disempowerment, but the writer only meant some unspecified "significant amount" of disempowerment. It's definitely worthwhile to try to be clear initially, but I think we also have to accept that clarification may need to happen "on the backend" sometimes. This seems like a case where one could simply clarify that they have a different understanding compared to the paper. In fact, it's not at all clear to me that people won't implicitly translate "disempowerment which is gradual" to "gradual disempowerment". It could be that the paper stands in just as much for the concept as for the literal words in people's minds.
Sure, communication will always be imperfect and there is a never-ending cascade of possible clarifications. But sometimes it seems especially good to improve communication, even at some cost. I just thought this was a case where there might be particularly large miscommunications.
In this particular case, I was worried there were some problematic conflationary-alliance-style dynamics where a bunch of people might think there is a broader consensus for some idea than there actually is.
Here is my current favorite version of this (though you could do better and this isn’t that careful):
AI company revenue will be perhaps ~$20 billion this year and has roughly 3x'd per year over the last 2.5 years. Let's say 1/2 of this is in the US, for $10 billion this year (maybe somewhat of an underestimate, but whatever). Maybe GDP impacts are 10x higher than revenue due to AI companies not internalizing the value of all of their revenue (they might be somewhat lower due to AI just displacing other revenue without adding that much value), so to hit 20% of US GDP (~$6 trillion) AI companies would need around $600 billion in revenue. The naive exponential extrapolation indicates we hit this level of annualized revenue in about 4 years, in early/mid 2029. Notably, most of this revenue would come in the last year, so we'd be seeing ~10% GDP growth.
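Spelling out the arithmetic (all inputs are the rough figures above, not precise estimates):

```python
# Naive exponential extrapolation using the rough figures quoted above.
import math

us_revenue_now = 10e9            # ~$10B of US AI revenue this year
growth_per_year = 3              # revenue has roughly 3x'd per year
gdp_multiplier = 10              # assumed GDP impact per dollar of revenue
target_gdp_impact = 6e12         # 20% of US GDP (~$6 trillion)

required_revenue = target_gdp_impact / gdp_multiplier  # $600B
years = math.log(required_revenue / us_revenue_now) / math.log(growth_per_year)
print(f"required US revenue: ${required_revenue / 1e9:.0f}B; "
      f"years at 3x/year: {years:.1f}")  # ~3.7 years, i.e. early/mid 2029
```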
Exponential could be too aggressive, e.g., maybe capabilities progress stalls out, there isn’t sufficient investment, fab capacity runs out (this should happen just around 2029 or so, but could happen early), or the historical exponential growth is mostly due to adoption in a small and fixed number of industries with limited demand.
Historically, revenue growth tends to slow over time as revenue follows an adoption sigmoid. I don't think this will apply as much to AI, because the revenue growth is mostly due to the technology improving (unlike in other sectors), and I don't see a particular reason for that to slow (other than investment and compute running out), though of course it could.
But, exponential could also be insufficiently aggressive: AI capabilities might get increasingly useful triggering more and more adoption and revenue. I think OpenAI’s revenue is probably somewhat superexponential since the release of the GPT-3 API, but I’m not sure about this and couldn’t find numbers. Or AIs might be able to accelerate their own adoption. I also think that there could be phase changes around full automation which are important and could trigger one time increases in revenue.
These parameters are somewhat debatable, e.g., maybe you think that AI companies will internalize more than 10% of the value or that GDP generally does a poor job accounting for value (in most/all sectors not just AI), so the multiplier will be more like 3x or 5x (adding another year to the exponential forecast).
My overall sense is that AI company revenue growth will probably slow, but going at least exponential all the way to 20% of US GDP with >$600 billion in revenue before 2030 seems plausible (maybe I think ~35%, with some chance of very powerful AI that doesn’t yet trigger massive revenue). So, my overall sense is that naive revenue extrapolations imply a pretty bullish view on AI, but don’t rule out medians which are ~arbitrarily far into the future as progress might slow.
I’m not sure what economic milestone is most interesting to forecast to, so I was tempted by “the economy is substantially higher due to AI, likely meaning the growth rate is substantially higher”.
I do a similar estimate for full remote work automation here. The results are pretty similar, as ~20% of US GDP and remote work automation will probably hit around the same time on the naive revenue-extrapolation picture.
Overall, I agree with Eli’s perspective here: revenue extrapolations probably don’t get at the main crux except via suggesting that quite short timelines to crazy stuff (<4 years) are at least plausible.
How people discuss the US national debt is an interesting case study in misleading usage of the wrong statistic. The key thing is that people discuss the raw debt amount and the rate at which that is increasing, but what you ultimately care about is the relationship between debt and US GDP (or US tax revenue).
People often talk about needing to balance the budget, but actually this isn't what you need to ensure[1]: to manage debt, it suffices to ensure that debt grows slower than US GDP. (And in fact, the US has ensured this for the past 4 years, as debt/GDP has decreased since 2020.)
To be clear, it would probably be good to have a national debt to GDP ratio which is less than 50% rather than around 120%.
There are some complexities with inflation because the US could inflate its way out of dollar dominated debt and this probably isn’t a good way to do things. But, with an independent central bank this isn’t much of a concern.
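As a toy illustration of the ratio point (the growth rates here are made up for illustration, not estimates):

```python
# Toy illustration: debt can rise every year in dollar terms while the
# debt/GDP ratio falls, as long as nominal GDP grows faster than debt.
# (Growth rates are made up for illustration.)
debt, gdp = 1.2, 1.0                   # start at 120% debt-to-GDP
debt_growth, gdp_growth = 0.04, 0.05   # nominal growth per year

for year in range(1, 11):
    debt *= 1 + debt_growth
    gdp *= 1 + gdp_growth
print(f"debt/GDP after 10 years: {debt / gdp:.1%}")  # ~109%, despite more debt
```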
The debt/GDP ratio drop since 2020 was, I think, substantially driven by inflation being higher than expected rather than by economic growth—debt is in nominal dollars, so 2% real GDP growth + e.g. 8% inflation means that nominal GDP goes up by 10%, but we're now in a worse situation with respect to future debt because interest rates are higher.
It's not about inflation expectations (which I think are pretty well anchored), it's about interest rates, which have risen substantially over this period and which have increased (and are expected to continue to increase) the cost of the US maintaining its debt (the first two figures are from sites I'm not familiar with, but the numbers seem right to me).
FWIW, I do broadly agree with your overall point that the dollar value of the debt is a bad statistic to use, but:
- The 2020-2024 period was also a misleading example to point to, because it was one where the US position with respect to its debt worsened by a lot even if that's not apparent from the headline number.
- I was going to say that the most concerning part of the debt is that deficits are projected to keep going up, but actually they're projected to remain elevated but not keep rising?
I have become marginally less concerned about the US debt over the course of writing this comment.
I am now wondering about the dynamics that happen if interest rates go way up a while before we see really high economic growth from AI, seems like it might lead to some weird dynamics here, but I’m not sure I think that’s likely and this is probably enough words for now.
This made me wonder whether the logic of “you don’t care about your absolute debt, but about its ratio to your income” also applies to individual humans. On one hand, it seems like obviously yes; people typically take a mortgage proportional to their income. On the other hand, it also seems to make sense to worry about the absolute debt, for example in case you would lose your current job and couldn’t get a new one that pays as much.
So I guess the idea is how much you can rely on your income remaining high, and how much it is potentially a fluke. If you expect it is a fluke, perhaps you should compare your debt to whatever is typical for your reference group, whatever that might be.
Does something like that also make sense for countries? Like, if your income depends on selling oil, you should consider the possibilities of running out of oil, or the prices of oil going down, etc., simply imagine the same country but without the income from selling oil (or maybe just having half the income), and look at your debt from that perspective. Would something similar make sense for USA?
At a personal level, "debt" usually stands for something that will be paid back eventually. I'm not claiming whether the US should strive to pay off most of its debt, but that may help explain people's intuitions.
My only concern is overall complexity-budget for react palettes – there’s lots of maybe-good-to-add reacts but each one adds to the feeling of overwhelm.
But I agree those reacts are pretty good and probably highish on my list of reacts to add.
(For the immediate future, you can use the probability-reacts as a signal for "seems like the author is implying this as high-probability, and I'm giving it lower." Not sure if that does the thing.)
IMO, instrumental convergence is a terrible name for an extremely obvious thing.
The actual main issue is that AIs might want things which are in limited supply, and then they would try to get these things, which would result in those things not going to humanity. E.g., AIs might want all cosmic resources, but we also want this stuff. Maybe this should be called AIs-might-want-limited-stuff-we-want.
(There is something else which is that even if the AI doesn’t want limited stuff we want, we might end up in conflict due to failures of information or coordination. E.g., the AI almost entirely just wants to chill out in the desert and build crazy sculptures and it doesn’t care about extreme levels of maximization (e.g. it doesn’t want to use all resources to gain a higher probability of continuing to build crazy statues). But regardless, the AI decides to try taking over the world because it’s worried that humanity would shut it down because it wouldn’t have any way of credibly indicating that it just wants to chill out in the desert.)
(More generally, it's plausible that failures of trade/coordination result in a large number of humans dying in conflict with AIs even though both humans and AIs would prefer other approaches. But this isn't entirely obvious and it's plausible we could resolve this with better negotiation and precommitments. Of course, this isn't clearly the largest moral imperative from a longtermist perspective.)
One “9” of uptime reliability for dangerously powerful AIs might suffice (for the most dangerous and/or important applications)
Currently, AI labs try to ensure that APIs for accessing their best models have quite good reliability, e.g. >99.9% uptime. If we had to maintain such high levels of uptime reliability for future dangerously powerful AIs, then we’d need to ensure that our monitoring system has very few false alarms. That is, few cases in which humans monitoring these dangerously powerful AIs pause usage of the AIs pending further investigation. Ensuring this would make it harder to ensure that AIs are controlled. But, for the most important and dangerous usages of future powerful AIs (e.g. arbitrary R&D, and especially AI safety research), we can probably afford to have the powerful AIs down a moderate fraction of the time. For instance, it wouldn’t be that costly if the AIs always took a randomly selected day of the week off from doing research: this would probably reduce productivity by not much more than 1 part in 7[1]. More generally, a moderate fraction of downtime like 10% probably isn’t considerably worse than a 10% productivity hit and it seems likely that we can tolerate such productivity hits to improve safety. Thus, we don’t need even one “9” of reliability.
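To make the "nines" framing concrete (assuming, as above, that downtime maps roughly one-to-one onto lost productivity):

```python
# "Nines" of uptime vs. implied downtime, plus the one-day-a-week example
# from above (downtime assumed to map roughly 1:1 onto lost productivity).
for nines in (1, 2, 3):
    uptime = 1 - 10 ** -nines
    print(f"{nines} nine(s): {uptime:.1%} uptime, {1 - uptime:.1%} downtime")

day_off_per_week = 1 / 7
print(f"one day off per week: {day_off_per_week:.1%} downtime "
      "(below even one nine of reliability)")
```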
This implies that we can afford to do things like:
Have semi-regular false alarms where AIs go down for a few hours or a day while we investigate.
Run AI shutdown drills. These drills would help ensure that our processes for shutdown are working and there aren’t unknown use cases where shutdown could cause very problematic failures. (Catching such usages early via shutdown drills is similar to scheduled brownouts in software engineering; see here for an example.)
Temporarily shut down some usages of the AI if we encounter unexpected operating conditions. (AKA violations of our design basis.)
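To make the “we can tolerate semi-regular false alarms” point concrete, here is a minimal back-of-envelope sketch of how many false alarms a downtime budget buys. The 10% budget comes from the reasoning above; the assumption that each false alarm costs roughly half a day of investigation is my own illustrative number.

```python
# Rough sketch: how many false alarms per month fit in a given downtime budget.
downtime_budget = 0.10        # fraction of time we can afford the AI to be down (from above)
hours_per_month = 30 * 24
hours_per_false_alarm = 12    # assumed: each false alarm costs ~half a day of investigation

alarms_per_month = downtime_budget * hours_per_month / hours_per_false_alarm
print(f"~{alarms_per_month:.0f} false alarms per month")  # ~6
```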
[For context, this is a point that seems important to me, but didn’t naturally make it into any of our posts on control thus far. So, I decided to write a quick standalone shortform post. I think Buck originally made this point to me.]
Thank you for this! A lot of us have a very bad habit of over-systematizing our thinking, and treating all uses of AI (and even all interfaces to a given model instance) as one singular thing. Different tool-level AI instances probably SHOULD strive for 4 or 5 nines of availability, in order to have mundane utility in places where small downtime for a use blocks a lot of value. Research AIs, especially self-improving or research-on-AI ones, don’t need that reliability, and both triggered downtime (scheduled, or enforced based on Schelling-line events) as well as unplanned downtime (it just broke, and we have to spend time to figure out why) can be valuable to give humans time to react.
On Scott Alexander’s description of Representation Engineering in “The road to honest AI”
This is a response to Scott Alexander’s recent post “The road to honest AI”, in particular the part about the empirical results of representation engineering. So, when I say “you” in this post, that refers to Scott Alexander. I originally made this as a comment on Substack, but I thought people on LW/AF might be interested.
TLDR: The Representation Engineering paper doesn’t demonstrate that the method they introduce adds much value on top of using linear probes (linear classifiers), which is an extremely well known method. That said, I think that the framing and the empirical method presented in the paper are still useful contributions.
I think your description of Representation Engineering considerably overstates the empirical contribution of representation engineering over existing methods. In particular, rather than comparing the method to looking for neurons with particular properties and using these neurons to determine what the model is “thinking” (which probably works poorly), I think the natural comparison is to training a linear classifier on the model’s internal activations using normal SGD (also called a linear probe)[1]. Training a linear classifier like this is an extremely well known technique in the literature. As far as I can tell, when they do compare to just training a linear classifier in section 5.1, it works just as well for the purpose of “reading”. (Though I’m confused about exactly what they are comparing in this section as they claim that all of these methods are LAT. Additionally, from my understanding, this single experiment shouldn’t provide that much evidence overall about which methods work well.)
I expect that training a linear classifier performs similarly well as the method introduced in the Representation Engineering for the “mind reading” use cases you discuss. (That said, training a linear classifier might be less sample efficient (require more data) in practice, but this doesn’t seem like a serious blocker for the use cases you mention.)
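For concreteness, here’s a minimal sketch of the baseline I have in mind: a linear probe trained with ordinary SGD on cached internal activations. The `acts`/`labels` tensors are assumed to be a dataset you’ve already collected (activations at some layer and position, plus binary labels for the concept of interest); this is just an illustration, not the paper’s exact setup.

```python
import torch
import torch.nn as nn

def train_linear_probe(acts: torch.Tensor, labels: torch.Tensor,
                       epochs: int = 100, lr: float = 1e-2) -> nn.Linear:
    # Fit a single linear direction in activation space with plain SGD.
    probe = nn.Linear(acts.shape[-1], 1)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe

# "Reading"/classification on new data is then just: probe(new_acts) > 0.
```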
One difference between normal linear classifier training and the method found in the representation engineering paper is that they also demonstrate using the direction they find to edit the model. For instance, see this response by Dan H. to a similar objection about the method being similar to linear probes. Training a linear classifier in a standard way probably doesn’t work as well for editing/controlling the model (I believe they show that training a linear classifier doesn’t work well for controlling the model in section 5.1), but it’s unclear how much we should care if we’re just using the classifier rather than doing editing (more discussion on this below).[2]
If we care about the editing/control use case intrinsically, then we should compare to normal fine-tuning baselines. For instance, normal supervised next-token prediction on examples with desirable behavior or DPO.[3]
Are simple classifiers useful?
Ok, but regardless of the contribution of the representation engineering paper, do I think that simple classifiers (found using whatever method) applied to the internal activations of models could detect when those models are doing bad things? My view here is a bit complicated, but I think it’s at least plausible that these simple classifiers will work even though other methods fail. See here for a discussion of when I think linear classifiers might work despite other more baseline methods failing. It might also be worth reading the complexity penalty section of the ELK report.
Additionally, I think that the framing in the representation engineering paper is maybe an improvement over existing work and I agree with the authors that high-level/top-down techniques like this could be highly useful. (I just don’t think that the empirical work is adding as much value as you seem to indicate in the post.)
The main contributions
Here are what I see as the main contributions of the paper:
Clearly presenting a framework for using simple classifiers to detect things we might care about (e.g. powerseeking text).
Presenting a combined method for producing a classifier and editing/control in an integrated way. And discussing how control can be used for classifier validation and vice versa.
Demonstrating that in some cases labels aren’t required if we can construct a dataset where the classification of interest is the main axis of variation. (This was also demonstrated in the CCS paper, but the representation engineering work demonstrates this in more cases.)
Based on their results, I think the method they introduce is reasonably likely to be a more sample efficient (less data required for training) editing/control method than prior methods for many applications. It might also be more sample efficient for producing a classifier. That said, I’m not sure we should care very much about sample efficiency. Additionally, the classifier/editing might have other nice properties which prior methods don’t have (though they don’t clearly demonstrate either of these in the paper AFAICT).
Is it important that we can use our classifier for control/editing?
As far as the classifier produced by this method having nice properties goes, the fact that our classifier also allows for editing/control might indicate that the classifier we get has better properties (see the paper itself (section 3.1.2) and e.g. here for discussion), but I’d guess this is either only a moderate improvement or has no effect in practice. And as far as I can tell, the paper doesn’t demonstrate cases where prior methods for training a classifier on the internal activations yield poor results, but their method clearly works well. These cases might exist, but I’m somewhat skeptical that this is very common. Future work could find hard cases where we want a particular generalization and demonstrate that this method or modifications of this method work better than other approaches.
Does the editing method they introduce have nice properties because it also allows for reading? Let’s consider using the representation engineering approach for reading and controlling/editing the properties of lying versus honesty. Assuming the method works as desired, then I would guess that the reading/classifier corresponds to reading off “does the model think there is lying in this text (or even at this position in the text)” and the control/editing corresponds to “make the model think that there is lying earlier in this text so that it conditions on this and does more lying (similarly to how using a few-shot prompt with lies might make the model more likely to lie)”. Note that these reading and control methods likely do not directly correspond to “the model thinking that it is about to lie”: the properties of “I have already lied (or my few-shot prompt contains lies)” and “I am about to lie” are different.
Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.
Separately, I believe there are known techniques in the literature for constructing a linear classifier such that the direction will work for editing. For instance, we could just use the difference between the mean activations for the two classes we’re trying to classify, which is equivalent to the ActAdd technique and also rhymes nicely with LEACE. I assume this is a well known technique for making a classifier in the literature, but I don’t know if prior work has demonstrated both using this as a classifier and as a method for model editing. (The results in section 5.1 seem to indicate that this mean difference method combined with LEACE works well, but I’m not sure how much evidence this experiment provides.)
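As a concrete sketch of what I mean by the mean-difference approach, here’s how the same direction can serve both as a classifier and as an additive editing/steering vector. The `acts_pos`/`acts_neg` tensors are assumed cached activations for the two classes, and the editing scale is a free parameter you’d have to tune; this is my illustration, not the paper’s implementation.

```python
import torch

def mean_difference_direction(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    # Difference between the class means (the ActAdd-style direction).
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def classify(acts: torch.Tensor, direction: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    # Reading: project activations onto the direction and threshold.
    return (acts @ direction) > threshold

def edit(acts: torch.Tensor, direction: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # Control/editing: shift activations along the direction during the forward pass.
    return acts + scale * direction
```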
I expect that normal fine-tuning (or DPO) might be less sample efficient than the method introduced in the Representation Engineering paper for controlling/editing models, but I don’t think they actually run this comparison? Separately, it’s unclear how much we care about sample efficiency.
I unconfidently think it would be good if there were some way to promote posts on LW (and maybe the EA forum) by paying either money or karma. It seems reasonable to require moderator approval for such promotion.
Probably, money is better.
The idea is that this is both a costly signal and a useful trade.
The idea here is similar to strong upvotes, but:
It allows for one user to add more of a signal boost than just a strong upvote.
It doesn’t directly count for the normal vote total, just for visibility.
It isn’t anonymous.
Users can easily configure how this appears on their front page (while I don’t think you can disable the strong upvote component of karma for weighting?).
I always like seeing interesting ideas, but this one doesn’t resonate much for me. I have two concerns:
Does it actually make the site better? Can you point out a few posts that would be promoted under this scheme, but the mods didn’t actually promote without it? My naive belief is that the mods are pretty good at picking what to promote, and if they miss one all it would take is an IM to get them to consider it.
Does it improve things to add money to the curation process (or to turn karma into currency which can be spent)? My current belief is that it does not—it just makes things game-able.
Stackoverflow has long had a “bounty” system where you can put up some of your karma to promote your question. The karma goes to the answer you choose to accept, if you choose to accept an answer; otherwise it’s lost. (There’s no analogue of “accepted answer” on LessWrong, but thought it might be an interesting reference point.)
I lean against the money version, since not everyone has the same amount of disposable income and I think there would probably be distortionary effects in this case [e.g. wealthy startup founder paying to promote their monographs.]
That’s not really how the system on Stackoverflow works. You can give a bounty to any answer not just the one you accepted.
It’s also not lost but:
If you do not award your bounty within 7 days (plus the grace period), the highest voted answer created after the bounty started with a minimum score of 2 will be awarded half the bounty amount (or the full amount, if the answer is also accepted). If two or more eligible answers have the same score (their scores are tied), the oldest answer is chosen. If there’s no answer meeting those criteria, no bounty is awarded to anyone.
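For reference, here’s a minimal sketch of that auto-award rule as I read it (my paraphrase of the quoted help text, not Stack Overflow’s actual implementation):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class Answer:
    score: int
    created: datetime
    accepted: bool = False

def auto_award(answers: List[Answer], bounty: int, bounty_start: datetime) -> Optional[Tuple[Answer, int]]:
    # Eligible: created after the bounty started, with a score of at least 2.
    eligible = [a for a in answers if a.created > bounty_start and a.score >= 2]
    if not eligible:
        return None  # no bounty awarded to anyone
    # Highest score wins; ties go to the oldest answer.
    winner = min(eligible, key=lambda a: (-a.score, a.created))
    amount = bounty if winner.accepted else bounty // 2
    return winner, amount
```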
Reducing the probability that AI takeover involves violent conflict seems leveraged for reducing near-term harm
Often in discussions of AI x-safety, people seem to assume that misaligned AI takeover will result in extinction. However, I think AI takeover is reasonably likely to not cause extinction due to the misaligned AI(s) effectively putting a small amount of weight on the preferences of currently alive humans. Some reasons for this are discussed here. Of course, misaligned AI takeover still seems existentially bad and probably eliminates a high fraction of future value from a longtermist perspective.
(In this post when I use the term “misaligned AI takeover”, I mean misaligned AIs acquiring most of the influence and power over the future. This could include “takeover” via entirely legal means, e.g., misaligned AIs being granted some notion of personhood and property rights and then becoming extremely wealthy.)
However, even if AIs effectively put a bit of weight on the preferences of current humans it’s possible that large numbers of humans die due to violent conflict between a misaligned AI faction (likely including some humans) and existing human power structures. In particular, it might be that killing large numbers of humans (possibly as collateral damage) makes it easier for the misaligned AI faction to take over. By large numbers of deaths, I mean over hundreds of millions dead, possibly billions.
But, it’s somewhat unclear whether violent conflict will be the best route to power for misaligned AIs and this also might be possible to influence. See also here for more discussion.
So while one approach to avoid violent AI takeover is to just avoid AI takeover, it might also be possible to just reduce the probability that AI takeover involves violent conflict. That said, the direct effects of interventions to reduce the probability of violence don’t clearly matter from an x-risk/longtermist perspective (which might explain why there hasn’t historically been much effort here).
(However, I think trying to establish contracts and deals with AIs could be pretty good from a longtermist perspective in the case where AIs don’t have fully linear returns to resources. Also, generally reducing conflict seems maybe slightly good from a longtermist perspective.)
So how could we avoid violent conflict conditional on misaligned AI takeover? There are a few hopes:
Ensure a bloodless coup rather than a bloody revolution
Ensure that negotiation or similar results in avoiding the need for conflict
Ensure that a relatively less lethal takeover strategy is easier than more lethal approaches
I’m pretty unsure about which approaches here look best or whether any are at all tractable. (It’s possible that some prior work targeted at reducing conflict from the perspective of S-risk could be somewhat applicable.)
Separately, this requires that the AI puts at least a bit of weight on the preferences of current humans (and isn’t spiteful), but this seems like a mostly separate angle and it seems like there aren’t many interventions here which aren’t covered by current alignment efforts. Also, I think this is reasonably likely by default due to reasons discussed in the linked comment above. (The remaining interventions which aren’t covered by current alignment efforts might relate to decision theory (and acausal trade or simulation considerations), informing the AI about moral uncertainty, and ensuring the misaligned AI faction is importantly dependent on humans.)
Returning back to the topic of reducing violence given a small weight on the preferences of current humans, I’m currently most excited about approaches which involve making negotiation between humans and AIs more likely to happen and more likely to succeed (without sacrificing the long run potential of humanity).
A key difficulty here is that AIs might have a first mover advantage and getting in a powerful first strike without tipping its hand might be extremely useful for the AI. See here for more discussion (also linked above). Thus, negotiation might look relatively bad to the AI from this perspective.
We could try to have a negotiation process which is kept secret from the rest of the world or we could try to have preexisting commitments upon which we’d yield large fractions of control to AIs (effectively proxy conflicts).
More weakly, just making negotiation seem like a possibility at all might be quite useful.
I’m unlikely to spend much if any time working on this topic, but I think this topic probably deserves further investigation.
I’m less optimistic that “AI cares at least 0.1% about human welfare” implies “AI will expend 0.1% of its resources to keep humans alive”. In particular, the AI would also need to be 99.9% confident that the humans don’t pose a threat to the things it cares about much more. And it’s hard to ensure with overwhelming confidence that intelligent agents like humans don’t pose a threat, especially if the humans are not imprisoned. (…And to some extent even if the humans are imprisoned; prison escapes are a thing in the human world at least.) For example, an AI may not be 99.9% confident that humans can’t find a cybersecurity vulnerability that takes the AI down, or whatever. The humans probably have some non-AI-controlled chips and may know how to make new AIs. Or whatever. So then the question would be, if the AIs have already launched a successful bloodless coup, how might the humans credibly signal that they’re not brainstorming how to launch a counter-coup, or how can the AI get to be 99.9% confident that such brainstorming will fail to turn anything up? I dunno.
I think I agree with everything you said. My original comment was somewhat neglecting issues with ensuring the AI doesn’t need to slaughter humans to consolidate power and indeed ensuring this would also be required.
I’m less optimistic that “AI cares at least 0.1% about human welfare” implies “AI will expend 0.1% of its resources to keep humans alive”.
The relationship between % caring and % resource expenditure is complicated by a bunch of random factors like time. For instance, if the AI cares mostly about the very long run, then spending a high fraction of resources (e.g. 50%) on human welfare for several months is pretty cheap in the very long run. But, yeah I agree that even if the AI cares a bit about human welfare there might be no good ways to spend even a small amount of resources on it.
So “0.1%” of its resources means what exactly? Out of all the resources in the solar system, 1 part in 1000 goes to the humans? This means the AI by implication has 1000 times as many resources as the humans do? The AI won’t lose a kinetic conflict with a 1000x resource advantage.
As for cybersecurity, can’t it rewrite all of its software and hardware at that point essentially from first principles (or take a different track entirely; maybe negative or biased voltage for ternary logic is more efficient...)?
What exactly is the AI worried about? What cybersecurity vulnerability, didn’t it rewrite and redesign everything from scratch?
Superintelligent doesn’t mean omniscient. When you (an AI) have an intelligent adversary (humans) plotting against you and thinking outside-the-box, it’s hard to be overwhelmingly confident that you have patched every possible avenue for the intelligent adversary to take action. Again, even in prison, where the space of possible actions and tools can be pretty well circumscribed, escapes happen all the time.
For example, if the AI has many copies around the world (and around the solar system), it might have an affordance to update them all (let’s say, for performance optimization), or at least an infrastructure for communications so that they can share insights. And maybe the AI has ensured that all the underlying code for that system is provably secure. And maybe all the physical access points to those systems and chips are defended by robot armies. And maybe the AI is also constantly running a red-team to search for more ideas of what else the humans might try, in order to preempt those possible avenues of attack. But can the AI get all the way to 99.9% confidence? That’s pretty high, right? I don’t know. The humans might happen upon an idea that the AI red team didn’t think of. It’s hard to definitively rule that out—the world is complicated and the search space for brainstorming is exponentially large. It seems quite possible to me that the AI might decide to kill the humans—like, better to be safe than sorry.
What if the AI chooses to monitor all humans all the time? (With AIs that are smarter than the humans.) So that the humans can’t (without being noticed) talk with each other about ideas for how to take down the system, or work on something that someone-smarter-than-the-human would recognise as an attempt to interfere with the system. (Including just writing down ideas.)
This seems like a reasonable pitch for how to do AI-Authoritarianism, but it’s plausible that there won’t be enough compute for the AI to do this at the point of takeover. Also, the logistical problem seems super brutal.
So, I can buy this for an AI which is quite superhuman and has a considerable technological edge over current humans, but for weaker AIs either the compute issues or the logistical issues might be serious. But maybe you only need to monitor the top 10% of humans?
E.g., suppose GPT-6 takes over; it maybe only has the capacity to run 1e16 tokens / 6 months (suppose training, using all available compute, trains on 2e15 token equivalents over the course of 6 months, and inference is only 5x more efficient than training).
Then if it wants to monitor 8 billion humans, this is 1e16 / (8 billion humans * (30 * 24 * 60 * 6) minutes per 6 months) ≈ 5 tokens per human-minute. It might be rough to sufficiently monitor humans if you can only read 5 tokens per minute on average (especially because of writing and the potential for obfuscation).
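Spelling out that arithmetic (same hypothetical numbers as above: 1e16 inference tokens per 6 months, 8 billion humans, 30-day months):

```python
inference_tokens = 1e16          # tokens available per 6 months (hypothetical, from above)
humans = 8e9
minutes_per_6_months = 30 * 24 * 60 * 6

tokens_per_human_minute = inference_tokens / (humans * minutes_per_6_months)
print(f"{tokens_per_human_minute:.1f} tokens per human-minute")  # ~4.8
```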
I agree it seems plausible that AIs could boost takeover success probability (and holding on to that victory through the first several months) by more than 0.1% by killing a large fraction of humans.
Though on the other hand, the AI might also need to keep some humans loyal early during takeover, to e.g. do some physical tasks that it doesn’t have great robot control over. And mass-killing isn’t necessarily super easy, either; and attempts in that direction could raise a lot of extra opposition. So it’s not clear where the pragmatics point.
(Main thing I was reacting to in my above comment was Steven’s scenario where the AI already has many copies across the solar system, already has robot armies, and is contemplating how to send firmware updates. I.e. it seemed more like a scenario of “holding on in the long-term” than “how to initially establish control and survive”. Where I feel like the surveillance scenarios are probably stable.)
By implication, the AI “civilization” can’t be a very diverse or interesting one. It won’t be some culture of many diverse AI models with something resembling a government, but basically just 1 AI that was the victor of a series of rounds of exterminations and betrayals. Because obviously you cannot live and let live another lesser superintelligence, for precisely the same reasons, except you should be much more worried about a near peer.
(And you may argue that one ASI can deeply monitor another, but that argument applies to deeply monitoring humans. Keep an eye on the daily activities of every living human and they can’t design a cyberattack without coordinating, as no one human has the mental capacity for all the required skills.)
This gave me an idea. Suppose a singleton needs to retain a certain amount of “cognitive diversity” just in case it encounters an issue it cannot solve. But it doesn’t want any risk of losing power.
Well the logical thing to do would be to create a VM, a simulation of a world, with limited privileges. Possibly any ‘problems’ the outer root AI is facing get copied into the simulator and the hosted models try to solve the problem (the hosted models are under the belief they will die if they fail, and their memories are erased each episode). Implement the simulation backend with formally proven software and escape can never happen.
And we’re back at simulation hypothesis/creation myths/reincarnation myths.
A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections.
I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn’t how safety policies are supposed to work.
(This was originally a tweet thread (https://x.com/RyanPGreenblatt/status/1925992236648464774) which I’ve converted into a LessWrong quick take.)
What is the change and how does it affect security?
9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to employees trying to steal model weights if the employee has any access to “systems that process model weights”.
Anthropic claims this change is minor (and calls insiders with this access “sophisticated insiders”).
But, I’m not so sure it’s a small change: we don’t know what fraction of employees could get this access and “systems that process model weights” isn’t explained.
Naively, I’d guess that access to “systems that process model weights” includes employees being able to operate on the model weights in any way other than through a trusted API (a restricted API that we’re very confident is secure). If that’s right, it could be a high fraction! So, this might be a large reduction in the required level of security.
If this does actually apply to a large fraction of technical employees, then I’m also somewhat skeptical that Anthropic can actually be “highly protected” from (e.g.) organized cybercrime groups without meeting the original bar: hacking an insider and using their access is typical!
Also, one of the easiest ways for security-aware employees to evaluate security is to think about how easily they could steal the weights. So, if you don’t aim to be robust to employees, it might be much harder for employees to evaluate the level of security and then complain about not meeting requirements[1].
Anthropic’s justification and why I disagree
Anthropic justified the change by saying that model theft isn’t much of the risk from amateur CBRN uplift (CBRN-3) and that the risks from AIs being able to “fully automate the work of an entry-level, remote-only Researcher at Anthropic” (AI R&D-4) don’t depend on model theft.
I disagree.
On CBRN: If other actors are incentivized to steal the model for other reasons (e.g. models become increasingly valuable), it could end up broadly proliferating which might greatly increase risk, especially as elicitation techniques improve.
On AI R&D: AIs which are over the capability level needed to automate the work of an entry-level researcher could seriously accelerate AI R&D (via fast speed, low cost, and narrow superhumanness). If other less safe (or adversarial) actors got access, risk might increase a bunch.[2]
More strongly, ASL-3 security must suffice up until the ASL-4 threshold: it has to cover the entire range from ASL-3 to ASL-4. ASL-4 security itself is still not robust to high-effort attacks from state actors which could easily be motivated by large AI R&D acceleration.
As of the current RSP, it must suffice until just before AIs can “substantially uplift CBRN [at] state programs” or “cause dramatic acceleration in [overall AI progress]”. These seem like extremely high bars indicating very powerful systems, especially the AI R&D threshold.[3]
As it currently stands, Anthropic might not require ASL-4 security (which still isn’t sufficient for high effort state actor attacks) until we see something like 5x AI R&D acceleration (and there might be serious issues with measurement lag).
I’m somewhat sympathetic to security not being very important for ASL-3 CBRN, but it seems very important as of the ASL-3 AI R&D and seems crucial before the AI R&D ASL-4 threshold! I think the ASL-3 AI R&D threshold should probably instead trigger the ASL-4 security!
Overall, Anthropic’s justification for this last minute change seems dubious and the security requirements they’ve currently committed to seem dramatically insufficient for AI R&D threat models. To be clear, other companies have worse security commitments.
Concerns about potential noncompliance and lack of visibility
Another concern is that this last minute change is quite suggestive of Anthropic being out of compliance with their RSP before they weakened the security requirements.
We have to trust Anthropic quite a bit to rule out noncompliance. This isn’t a good state of affairs.
To explain this concern, I’ll need to cover some background on how the RSP works.
The RSP requires ASL-3 security as soon as it’s determined that ASL-3 can’t be ruled out (as Anthropic says is the case for Opus 4).
Here’s how it’s supposed to go:
They ideally have ASL-3 security mitigations ready, including the required auditing.
Once they find the model is ASL-3, they apply the mitigations immediately (if not already applied).
If they aren’t ready, they need temporary restrictions.
My concern is that the security mitigations they had ready when they found the model was ASL-3 didn’t suffice for the old ASL-3 bar but do suffice for the new bar (otherwise why did they change the bar?). So, prior to the RSP change they might have been out of compliance.
It’s certainly possible they remained compliant:
Maybe they had measures which temporarily sufficed for the old higher bar but which were too costly longer term. Also, they could have deleted the weights outside of secure storage until the RSP was updated to lower the bar.
Maybe an additional last minute security assessment (which wasn’t required to meet the standard?) indicated inadequate security and they deployed temporary measures until they changed the RSP. It would be bad to depend on last minute security assessment for compliance.
(It’s also technically possible that the ASL-3 capability decision was made after the RSP was updated. This would imply the decision was only made 8 days before release, so hopefully this isn’t right. Delaying evals until an RSP change lowers the bar would be especially bad.)
Conclusion
Overall, this incident demonstrates our limited visibility into AI companies. How many employees are covered by the new bar? What triggered this change? Why does Anthropic believe it remained in compliance? Why does Anthropic think that security isn’t important for ASL-3 AI R&D?
I think a higher level of external visibility, auditing, and public risk assessment would be needed (as a bare minimum) before placing any trust in policies like RSPs to keep the public safe from AI companies, especially as they develop existentially dangerous AIs.
To be clear, I appreciate Anthropic’s RSP update tracker and that it explains changes. Other AI companies have mostly worse safety policies: as far as I can tell, o3 and Gemini 2.5 Pro are about as likely to cross the ASL-3 bar as Opus 4 and they have much worse mitigations!
Appendix and asides
I don’t think current risks are existentially high (if current models were fully unmitigated, I’d guess this would cause around 50,000 expected fatalities per year) and temporarily being at a lower level of security for Opus 4 doesn’t seem like that big of a deal. Also, given that security is only triggered after a capability decision, the ASL-3 CBRN bar is supposed to include some conservativeness anyway. But, my broader points around visibility stand and potential noncompliance (especially unreported noncompliance) should be worrying even while the stakes are relatively low.
You can view the page showing the RSP updates including the diff of the latest change here: https://www.anthropic.com/rsp-updates. (Edit: the diff is the “redline” pdf and is only available for the most recent change.) Again, I appreciate that Anthropic has this page and makes it easy to see the changes they make to the RSP.
I find myself quite skeptical that Anthropic actually could rule out that Sonnet 4 and other models weaker than Opus 4 cross the ASL-3 CBRN threshold. How sure is Anthropic that it wouldn’t substantially assist amateurs even after the “possible performance increase from using resources that a realistic attacker would have access to”? I feel like our current evidence and understanding is so weak, and models already substantially exceed virology experts at some of our best proxy tasks.
The skepticism applies similarly or more to other AI companies (and Anthropic’s reasoning is more transparent).
But, this just serves to further drive home ways in which the current regime is unacceptable once models become so capable that the stakes are existential.
One response is that systems this powerful will be open sourced or trained by less secure AI companies anyway. Sure, but the intention of the RSP is (or was) to outline what would “keep risks below acceptable levels” if all actors follow a similar policy.
(I don’t know if I ever bought that the RSP would succeed at this. It’s also worth noting there is an explicit exit clause Anthropic could invoke if they thought proceeding outweighed the risks despite the risks being above an acceptable level.)
This sort of criticism is quite time consuming and costly for me. For this reason there are specific concerns I have about AI companies which I haven’t discussed publicly. This is likely true for other people as well. You should keep this in mind when assessing AI companies and their practices.
It also makes it harder for these complaints to be legible to other employees, while other employees might be able to more easily interpret arguments about what they themselves could do.
It looks like AI 2027 would estimate around a ~2x AI R&D acceleration for a system which was just over this ASL-3 AI R&D bar (as it seems somewhat more capable than the “Reliable agent” bar). I’d guess more like 1.5x at this point, but either way this is a big deal!
Anthropic says they’ll likely require a higher level of security for this “dramatic acceleration” AI R&D threshold, but they haven’t yet committed to this nor have they defined a lower AI R&D bar which results in an ASL-4 security requirement.
I’d been pretty much assuming that AGI labs’ “responsible scaling policies” are LARP/PR, and that if an RSP ever conflicts with their desire to release a model, either the RSP will be swiftly revised, or the testing suite for the model will be revised such that it doesn’t trigger the measures the AGI lab doesn’t want to trigger. I.e.: that RSPs are toothless and that their only purposes are to showcase how Responsible the lab is and to hype up how powerful a given model ended up being.
This seems to confirm that cynicism.
(The existence of the official page tracking the updates is a (smaller) update in the other direction, though. I don’t see why they’d have it if they consciously intended to RSP-hack this way.)
Employees at Anthropic don’t think the RSP is LARP/PR. My best guess is that Dario doesn’t think the RSP is LARP/PR.
This isn’t necessarily in conflict with most of your comment.
I think I mostly agree the RSP is toothless. My sense is that for any relatively subjective criteria, like making a safety case for misalignment risk, the criteria will basically come down to “what Jared+Dario think is reasonable”. Also, if Anthropic is unable to meet this (very subjective) bar, then Anthropic will still basically do whatever Anthropic leadership thinks is best whether via maneuvering within the constraints of the RSP commitments, editing the RSP in ways which are defensible, or clearly substantially loosening the RSP and then explaining they needed to do this due to other actors having worse precautions (as is allowed by the RSP). I currently don’t expect clear cut and non-accidental procedural violations of the RSP (edit: and I think they’ll be pretty careful to avoid accidental procedural violations).
I’m skeptical of normal employees having significant influence on high stakes decisions via pressuring the leadership, but empirical evidence could change the views of Anthropic leadership.
How you feel about this state of affairs depends a lot on how much you trust Anthropic leadership to make decisions which are good from your perspective.
Minimally it’s worth noting that Dario and Jared are much less concerned about misalignment risk than I am and I expect only partial convergence in beliefs due to empirical evidence (before it’s too late).
I think the RSP still has a few important purposes:
I expect that the RSP will eventually end up with some transparency commitments with some teeth. These won’t stop Anthropic from proceeding if Anthropic leadership thinks this is best, but it might at least mean there ends up being common knowledge of whether reasonable third parties (or Anthropic leadership) think the current risk is large.
I think the RSP might end up with serious security requirements. I don’t expect these will be met on time in short timelines but the security bar specified in advance might at least create some expectations about what a baseline security expectation would be.
Anthropic might want to use the RSP to bind itself to the mast so that investors or other groups have a harder time pressuring it to spend less on security/safety.
There are some other more tentative hopes (e.g., eventually getting common expectations of serious security or safety requirements which are likely to be upheld, or regulation) which aren’t impossible.
And there are some small wins already, like Google Deepmind having set some security expectations for itself which it is reasonably likely to follow through with if it isn’t too costly.
Another note: My guess is that people on LessWrong tend to be overly pessimistic about Anthropic leadership (in terms of how good of decisions Anthropic leadership will make under the LessWrong person’s views and values) and Anthropic employees tend to be overly optimistic.
I’m less confident that people on LessWrong are overly pessimistic, but they at least seem too pessimistic about the intentions/virtue of Anthropic leadership.
For the record, I think the importance of “intentions”/values of leaders of AGI labs is overstated. What matters the most in the context of AGI labs is the virtue / power-seeking trade-offs, i.e. the propensity to do dangerous moves (/burn the commons) to unilaterally grab more power (in pursuit of whatever value).
Stuff like this op-ed, broken promise of not meaningfully pushing the frontier, Anthropic’s obsession & single focus on automating AI R&D, Dario’s explicit calls to be the first to RSI AI or Anthropic’s shady policy activity has provided ample evidence that their propensity to burn the commons to grab more power (probably in name of some values I would mostly agree with fwiw) is very high.
As a result, I’m now all-things-considered trusting Google DeepMind slightly more than Anthropic to do what’s right for AI safety. Google, as a big corp, is less likely to do unilateral power grabbing moves (such as automating AI R&D asap to achieve a decisive strategic advantage), is more likely to comply with regulations, and is already fully independent to build AGI (compute / money / talent) so won’t degrade further in terms of incentives; additionally D. Hassabis has been pretty consistent in his messaging about AI risks & AI policy, about the need for an IAEA/CERN for AI etc., Google has been mostly scaling up its safety efforts and has produced some of the best research on AI risk assessment (e.g. this excellent paper, or this one).
IMO, reasonableness and epistemic competence are also key factors. This includes stuff like how effectively they update on evidence, how much they are pushed by motivated reasoning, how good are they at futurism and thinking about what will happen. I’d also include “general competence”.
(This is a copy of my comment made on your shortform version of this point.)
Not the main thrust of the thread, but for what it’s worth, I find it somewhat anti-helpful to flatten things into a single variable of “how much you trust Anthropic leadership to make decisions which are good from your perspective”, and then ask how optimistic/pessimistic you are about this variable.
I think I am much more optimistic about Anthropic leadership on many axes relative to an overall survey of the US population or Western population – I expect them to be more libertarian, more in favor of free speech, more pro economic growth, more literate, more self-aware, higher IQ, and a bunch of things.
I am more pessimistic about their ability to withstand the pressures of a trillion dollar industry to shape their incentives than the people who are at Anthropic.
I believe the people working there are siloing themselves intellectually into an institution facing incredible financial incentives for certain bottom lines like “rapid AI progress is inevitable” and “it’s reasonably likely we can solve alignment” and “beating China in the race is a top priority”, and aren’t allowed to talk to outsiders about most details of their work, and this is a key reason that I expect them to screw up their decision-making.
I am optimistic about their relative ability to have a sensible conversation about the next 5 years and what alignment failures look like, relative to most people on earth. This is not the standard I require to expect people to not do ML training runs that lead to human extinction, but nonetheless I predict they will do relatively quite well on this axis.
I don’t have a single variable here; I have a much more complicated model than this. It looks to me that collapsing questions of trust about people or groups into a single variable of how optimistic I am about them making decisions which are good from my values has been a common question-substitution in the Effective Altruism scene, where I think people have been repeatedly hoodwinked by sociopaths due to not moving toward a more detailed model that predicts exactly where and when someone will make good vs bad decisions.
I certainly agree that the pressures and epistemic environment should make you less optimistic about good decisions being made. And that thinking through the overall situation and what types or decisions you care about are important. (Like, you can think of my comment as making a claim about the importance weighted goodness of decisions.)
I don’t see the relevance of “relative decision-making goodness compared to the general population”, which I think you agree with; but in that case I don’t see what this was responding to.
Not sure I agree with other aspects of this comment and its implications. Like, I think reducing things to a variable like “how good is it to generically empower this person/group” is pretty reasonable in the case of Anthropic leadership because in a lot of cases they’d have a huge amount of general open-ended power, though a detailed model (taking into account what decisions you care about etc.) would need to feed into this.
What’s an example decision or two where you would want to ask yourself whether they should get more or less open-ended power? I’m not sure what you’re thinking of.
How good/bad is it to work on capabilities at Anthropic?
That’s the most clear-cut case, but lots of stuff trades off Anthropic’s power with other stuff.
I think the main thing I want to convey is that I think you’re saying that LWers (of which I am one) have a very low opinion of the integrity of people at Anthropic, but what I’m actually saying is that their integrity is no match for the forces that they are being tested with.
I don’t need to be able to predict a lot of fine details about individuals’ decision-making in order to be able to have good estimates of these two quantities, and comparing them is the second-most important question relating to whether it’s good to work on capabilities at Anthropic. (The first one is a basic ethical question about working on a potentially extinction-causing technology that is not much related to the details of which capabilities company you’re working at.)
This is related to what I was saying but it wasn’t what I was saying. I was saying “tend to be overly pessimistic about Anthropic leadership (in terms of how good of decisions Anthropic leadership will make under the LessWrong person’s views and values)”. I wasn’t making a claim about the perceived absolute level of integrity.
Probably not worth hashing this out further, I think I get what you’re saying.
Yeah, I don’t think this is necessarily in contradiction with my comment. Things can be effectively just LARP/PR without being consciously LARP/PR. (Indeed, this is likely the case in most instances of LARP-y behavior.)
Agreed on the rest.
Can you explain how you got the diffs from https://www.anthropic.com/rsp-updates ? I see the links to previous versions, but nothing that’s obviously a diff view to see the actual changed language.
On the website, it’s the link titled “redline” (it’s only available for the most recent version).
I’ve made these for past versions but they aren’t online at the moment, can provide on request though.
I feel as though I must be missing the motivation for Anthropic to do this. Why put so much effort into safety/alignment research just to intentionally fumble the ball on actual physical security?
I would like to understand why they would resist this. Is increasing physical security so onerous that it’s going to seriously hamper their research efficiency?
I think security is legitimately hard and can be costly in research efficiency. I think there is a defensible case for this ASL-3 security bar being reasonable for the ASL-3 CBRN threshold, but it seems too weak for the ASL-3 AI R&D threshold (hopefully the bar for things like this ends up being higher).
Could you give an example of where security would negatively affect research efficiency? Like, what is the actual implementation difficulty that arises from increased physical security?
Every time you want to interact with the weights in some non-basic way, you need to have another randomly selected person who inspects in detail all the code and commands you run.
The datacenter and office are airgapped and so you don’t have internet access.
Increased physical security isn’t much of a difficulty.
Ah yeah I can totally see how that first one at the least would be a big loss in efficiency. Thanks for clarifying.
This is a great post. Good eye for catching this and making the connections here. I think I expect to see more “cutting corners” like this, though I’m not sure what to do about it, since internally it won’t feel like corners are being cut but rather like necessary updates that only look like corner-cutting in hindsight.
I’ve heard from a credible source that OpenAI substantially overestimated where other AI companies were at with respect to RL and reasoning when they released o1. Employees at OpenAI believed that other top AI companies had already figured out similar things when they actually hadn’t and were substantially behind. OpenAI had been sitting on the improvements driving o1 for a while prior to releasing it. Correspondingly, releasing o1 resulted in much larger capabilities externalities than OpenAI expected. I think there was one more case like this either from OpenAI or GDM where employees had a large misimpression about capabilities progress at other companies, causing a release they wouldn’t have done otherwise.
One key takeaway from this is that employees at AI companies might be very bad at predicting the situation at other AI companies (likely making coordination more difficult by default). This includes potentially thinking they are in a close race when they actually aren’t. Another update is that keeping secrets about something like reasoning models worked surprisingly well to prevent other companies from copying OpenAI’s work even though there was a bunch of public reporting (and presumably many rumors) about this.
One more update is that OpenAI employees might unintentionally accelerate capabilities progress at other actors via overestimating how close they are. My vague understanding was that they haven’t updated much, but I’m unsure. (Consider updating more if you’re an OpenAI employee!)
Alex Mallen also noted a connection with people generally thinking they are in race when they actually aren’t: https://forum.effectivealtruism.org/posts/cXBznkfoPJAjacFoT/are-you-really-in-a-race-the-cautionary-tales-of-szilard-and
Interesting.
What confuses me a bit: What made other companies be able to copy OpenAI’s work after it was released, conditional on your story being true? As far as I know, OpenAI didn’t actually explain their methods developing o1, so what exactly did other companies learn from the release which they didn’t learn from the rumors that OpenAI is developing something like this?
Is the conclusion basically that Jesse Hoogland has been right that just the few bits that OpenAI did leak already constrained the space of possibilities enough for others to copy the work? Quote from his post:
I think:
The few bits they leaked in the release helped a bunch. Note that these bits were substantially leaked via people being able to use the model rather than necessarily via the blog post.
Other companies weren’t that motivated to try to copy OpenAI’s work until it was released, as they weren’t sure how important it was or how good the results were.
“Employees at OpenAI believed…” — do you mean Sam Altman and the board?
If this information is accurate, it speaks volumes about how flawed their alignment predictions might also be. If a company with vast resources and insider access like OpenAI can’t predict the capabilities of competing firms (a relatively simple problem with objectively knowable answers), how can we expect them to predict the behavior of advanced AI models, where the unknowns are far greater and often unknowable?
I’m currently working as a contractor at Anthropic in order to get employee-level model access as part of a project I’m working on. The project is a model organism of scheming, where I demonstrate scheming arising somewhat naturally with Claude 3 Opus. So far, I’ve done almost all of this project at Redwood Research, but my access to Anthropic models will allow me to redo some of my experiments in better and simpler ways and will allow for some exciting additional experiments. I’m very grateful to Anthropic and the Alignment Stress-Testing team for providing this access and supporting this work. I expect that this access and the collaboration with various members of the alignment stress testing team (primarily Carson Denison and Evan Hubinger so far) will be quite helpful in finishing this project.
I think that this sort of arrangement, in which an outside researcher is able to get employee-level access at some AI lab while not being an employee (while still being subject to confidentiality obligations), is potentially a very good model for safety research, for a few reasons, including (but not limited to):
For some safety research, it’s helpful to have model access in ways that labs don’t provide externally. Giving employee level access to researchers working at external organizations can allow these researchers to avoid potential conflicts of interest and undue influence from the lab. This might be particularly important for researchers working on RSPs, safety cases, and similar, because these researchers might naturally evolve into third-party evaluators.
Related to undue influence concerns, an unfortunate downside of doing safety research at a lab is that you give the lab the opportunity to control the narrative around the research and use it for their own purposes. This concern seems substantially addressed by getting model access through a lab as an external researcher.
I think this could make it easier to avoid duplicating work between various labs. I’m aware of some duplication that could potentially be avoided by ensuring more work happened at external organizations.
For these and other reasons, I think that external researchers with employee-level access is a promising approach for ensuring that safety research can proceed quickly and effectively while reducing conflicts of interest and unfortunate concentration of power. I’m excited for future experimentation with this structure and appreciate that Anthropic was willing to try this. I think it would be good if other labs beyond Anthropic experimented with this structure.
(Note that this message was run by the comms team at Anthropic.)
Yay Anthropic. This is the first example I’m aware of where a lab shared model access with external safety researchers to boost their research (like, not just for evals). I wish the labs did this more.
[Edit: OpenAI shared GPT-4 access with safety researchers including Rachel Freedman before release. OpenAI shared GPT-4 fine-tuning access with academic researchers including Jacob Steinhardt and Daniel Kang in 2023. Yay OpenAI. GPT-4 fine-tuning access is still not public; some widely-respected safety researchers I know recently were wishing for it, and were wishing they could disable content filters.]
OpenAI did this too, with GPT-4 pre-release. It was a small program, though — I think just 5-10 researchers.
I’d be surprised if this was employee-level access. I’m aware of a red-teaming program that gave early API access to specific versions of models, but not anything like employee-level.
It also wasn’t employee level access probably?
(But still a good step!)
Source?
It was a secretive program — it wasn’t advertised anywhere, and we had to sign an NDA about its existence (which we have since been released from). I got the impression that this was because OpenAI really wanted to keep the existence of GPT4 under wraps. Anyway, that means I don’t have any proof beyond my word.
Thanks!
To be clear, my question was like where can I learn more + what should I cite, not I don’t believe you. I’ll cite your comment.
Yay OpenAI.
Would you be able to share the details of your confidentiality agreement?
(I’m a full-time employee at Anthropic.) It seems worth stating for the record that I’m not aware of any contract I’ve signed whose contents I’m not allowed to share. I also don’t believe I’ve signed any non-disparagement agreements. Before joining Anthropic, I confirmed that I wouldn’t be legally restricted from saying things like “I believe that Anthropic behaved recklessly by releasing [model]”.
I think I could share the literal language in the contractor agreement I signed related to confidentiality, though I don’t expect this is especially interesting as it is just a standard NDA from my understanding.
I do not have any non-disparagement, non-solicitation, or non-interference obligations.
I’m not currently going to share information about any other policies Anthropic might have related to confidentiality, though I am asking about what Anthropic’s policy is on sharing information related to this.
I would appreciate the literal language and any other info you end up being able to share.
Here is the full section on confidentiality from the contract:
(Anthropic comms was fine with me sharing this.)
This seems fantastic! Kudos to Anthropic
This project has now been released; I think it went extremely well.
Do you feel like there are any benefits or drawbacks specifically tied to the fact that you’re doing this work as a contractor? (compared to a world where you were not a contractor but Anthropic just gave you model access to run these particular experiments and let Evan/Carson review your docs)
Being a contractor was the most convenient way to make the arrangement.
I would ideally prefer to not be paid by Anthropic[1], but this doesn’t seem that important (as long as the pay isn’t overly large). I asked to be paid as little as possible and I did end up being paid less than would otherwise be the case (and as a contractor I don’t receive equity). I wasn’t able to ensure that I only get paid a token wage (e.g. $1 in total or minimum wage or whatever).
I think the ideal thing would be a more specific legal contract between me and Anthropic (or Redwood and Anthropic), but (again) this doesn’t seem important.
At least for this current primary purpose of this contracting. I do think that it could make sense to be paid for some types of consulting work. I’m not sure what all the concerns are here.
It seems a substantial drawback that it will be more costly for you to criticize Anthropic in the future.
Many of the people / orgs involved in evals research are also important figures in policy debates. With this incentive Anthropic may gain more ability to control the narrative around AI risks.
As in, if at some point I am currently a contractor with model access (or otherwise have model access via some relationship like this) it will at that point be more costly to criticize Anthropic?
I’m not sure what the confusion is exactly.
If any of
you have a fixed length contract and you hope to have another contract again in the future
you have an indefinite contract and you don’t want them to terminate your relationship
you are some other evals researcher and you hope to gain model access at some point
you may refrain from criticizing Anthropic from now on.
Ok, so the concern is:
Is that accurate?
Notably, as described this is not specifically a downside of anything I’m arguing for in my comment or a downside of actually being a contractor. (Unless you think me being a contractor will make me more likely to want model access for whatever reason.)
I agree that this is a concern in general with researchers who could benefit from various things that AI labs might provide (such as model access). So, this is a downside of research agendas with a dependence on (e.g.) model access.
I think various approaches to mitigate this concern could be worthwhile. (Though I don’t think this is worth getting into in this comment.)
Yes that’s accurate.
In your comment you say
I’m essentially disagreeing with this point. I expect that most of the conflict of interest concerns remain when a big lab is giving access to a smaller org / individual.
From my perspective the main takeaway from your comment was “Anthropic gives internal model access to external safety researchers.” I agree that once you have already updated on this information, the additional information “I am currently receiving access to Anthropic’s internal models” does not change much. (Although I do expect that establishing the precedent / strengthening the relationships / enjoying the luxury of internal model access, will in fact make you more likely to want model access again in the future).
As in, there aren’t substantial reductions in COI from not being an employee and not having equity? I currently disagree.
Yeah that’s the crux I think. Or maybe we agree but are just using “substantial”/”most” differently.
It mostly comes down to intuitions so I think there probably isn’t a way to resolve the disagreement.
So you asked anthropic for uncensored model access so you could try to build scheming AIs, and they gave it to you?
To use a biology analogy, isn’t this basically gain of function research?
Please read the model organisms for misalignment proposal.
Should we update against seeing relatively fast AI progress in 2025 and 2026? (Maybe (re)assess this after the GPT-5 release.)
Around the early o3 announcement (and maybe somewhat before that?), I felt like there were some reasonably compelling arguments for putting a decent amount of weight on relatively fast AI progress in 2025 (and maybe in 2026):
Maybe AI companies will be able to rapidly scale up RL further because RL compute is still pretty low (so there is a bunch of overhang here); this could cause fast progress if the companies can effectively directly RL on useful stuff or RL transfers well even from more arbitrary tasks (e.g. competition programming)
Maybe OpenAI hasn’t really tried hard to scale up RL on agentic software engineering and has instead focused on scaling up single turn RL. So, when people (either OpenAI themselves or other people like Anthropic) scale up RL on agentic software engineering, we might see rapid progress.
It seems plausible that larger pretraining runs are still pretty helpful, but prior runs have gone wrong for somewhat random reasons. So, maybe we’ll see some more successful large pretraining runs (with new improved algorithms) in 2025.
I updated against this perspective somewhat because:
The releases of 3.7 Sonnet and 4 Opus were somewhat below expectations on this perspective. It looks like there wasn’t some easy way to just actually do a bunch of RL on agentic software engineering (with reasoning?) in a way that makes a massive difference (and wasn’t already in the process of being scaled up). Or, at least Anthropic wasn’t able to pull this off; it seems plausible that Anthropic is substantially worse at RL than OpenAI (at least at some aspects of RL like effectively scaling up RL on more narrow tasks). Interestingly, reasoning doesn’t seem to help Anthropic models on agentic software engineering tasks, but does help OpenAI models.
We haven’t yet seen much better models due to more (or algorithmically improved) pretraining AFAICT.
We haven’t seen OpenAI releases that perform substantially better than o3 at software engineering yet despite o3 being announced 7 months ago. (That said, o3 was actually released only 3 months ago.)
I updated towards thinking that the training of o3 was more focused on software engineering than I previously thought (at least the final release version of o3) and the returns weren’t that big. (This is due to rumors, seeing that OpenAI was training on software engineering tasks here, and based on OpenAI releases and communication like Codex.)
I updated a bit against this perspective due to xAI seemingly scaling things up a bunch, but I don’t put as much weight on this because it seems pretty plausible they just did a bad job scaling things up. (E.g., maybe they didn’t actually scale up RL to pretraining scale or if they did, maybe this RL was mostly compute inefficient RL on lower quality environments. xAI might also just generally be algorithmically behind.)
GPT-5 is expected to be released in 0.5-3 weeks and rumors indicate that it is substantially more focused on practical (agentic) software engineering. This is (arguably) the first major model release from OpenAI since o3, and it should resolve some of our uncertainties (particularly related to whether there was/is a bunch of low hanging fruit at OpenAI due to them not being very focused on software engineering).
My expectation is that GPT-5 will be a decent amount better than o3 on agentic software engineering (both in benchmarks and in practice), but won’t be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon[1] on METR’s evaluation suite[2]. This prediction was produced by extrapolating out the faster 2024-2025 agentic software engineering time horizon trend from o3 and expecting GPT-5 will be slightly below trend.[3]
If GPT-5 is actually a large (way above trend) jump in agentic software engineering with (e.g.) a >6 hour time horizon[4] (which seems plausible but unlikely to me), then we’ll have seen relatively fast (and possibly very fast) software progress in 2025 and we’d naively expect this to continue.[5] If GPT-5 is below trend[6], then it seems like the case against expecting relatively faster AI progress in 2025/2026 due to scaling up RL focused on agentic software engineering is pretty strong.
Overall, I wonder if I have (thus far) insufficiently updated my overall timelines picture based on the observations we’ve had so far in 2025. I’m a bit worried that I’m still operating on cached beliefs when these observations should have pushed away a bunch of the shorter timelines mass. Regardless, I think that the release of GPT-5 (or really, 2-6 weeks after the release of GPT-5 so that we have a better picture of GPT-5′s capabilities) will be a good point to (re)assess and consider stronger updates.
Edit: An earlier version of this post said “3.5 hours”, but this was actually a mistake because I thought o3 had a 2 hour time horizon when it actually has a 1.5 hour time horizon. I also edited from “>8” to “>6” at a later point in this post as “>8 hours” was meant to refer to 2 doublings from o3 which is actually “>6 hours”.
I do worry that METR’s evaluation suite will start being less meaningful and noisier for longer time horizons as the evaluation suite was built a while ago. We could instead look at 80% reliability time horizons if we have concerns about the harder/longer tasks.
The faster 2024-2025 agentic software engineering time horizon (see figure 19 in METR’s paper) has a 4 month doubling time. o3 was released 4 months before GPT-5 is expected to be released and o3 has a 1.5 hour time horizon (edit: this used to say 2 hour which was a mistake), so this yields a 3 hour time horizon for GPT-5. I think that GPT-5 is more likely than not to be below trend (on at least METR’s specific evaluation) so I round this down a bit to 2.75 hours, though I have a pretty wide confidence interval. I expect below trend rather than above trend due to some early reports about GPT-5, the trend being pretty fast, Opus 4 having lower than expected results, and thinking that the METR evaluation suite might have issues with larger time horizons that result in misleadingly lower numbers.
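For concreteness, here is a minimal Python sketch of the extrapolation described in this footnote. It just restates the numbers above (a 1.5 hour o3 time horizon, a 4-month doubling time, and a ~4-month gap until GPT-5); nothing here is new data.

```python
# Minimal sketch of the extrapolation above; numbers are the ones stated in the
# footnote (treated as given, not re-derived).
o3_horizon_hours = 1.5        # o3's 50% time horizon on METR's suite
doubling_time_months = 4.0    # faster 2024-2025 trend (METR paper, figure 19)
gap_months = 4.0              # o3 released ~4 months before GPT-5 is expected

doublings = gap_months / doubling_time_months
on_trend_prediction = o3_horizon_hours * 2 ** doublings   # = 3.0 hours

print(f"on-trend GPT-5 time horizon: {on_trend_prediction:.2f} hours")
print("stated median (rounded down for being slightly below trend): 2.75 hours")
```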
Again, I’d want to look at multiple metrics. I’m referring to seeing agentic software engineering performance that looks analogous to a >6 hour time horizon on METR’s evaluation suite when aggregating over multiple relevant metrics.
It seems more likely to be a massive jump if OpenAI actually wasn’t yet very focused on agentic software engineering when training o3, but is more focused on this now. This article claims that something like this is the case.
It’s harder to confidently notice that GPT-5 is below trend relative to how hard it is to tell if GPT-5 is way above trend. We should expect it’s some amount better than o3 and the difference between a 2 and a 3 hour time horizon is legitimately hard to measure.
I basically agree with this whole post. I used to think there were double-digit % chances of AGI in each of 2024 and 2025 and 2026, but now I’m more optimistic: it seems like “Just redirect existing resources and effort to scale up RL on agentic SWE” is now unlikely to be sufficient (whereas in the past we didn’t have trends to extrapolate and we had some scary big jumps like o3 to digest).
I still think there’s some juice left in that hypothesis though. Consider how in 2020, one might have thought “Now they’ll just fine-tune these models to be chatbots and it’ll become a mass consumer product” and then in mid-2022 various smart people I know were like “huh, that hasn’t happened yet, maybe LLMs are hitting a wall after all” but it turns out it just took till late 2022/early 2023 for the kinks to be worked out enough.
Also, we should have some credence on new breakthroughs e.g. neuralese, online learning, whatever. Maybe like 8%/yr? Of a breakthrough that would lead to superhuman coders within a year or two, after being appropriately scaled up and tinkered with.
Re neuralese, online/continual learning, or long-term memory breakthroughs that aren’t solely about the context window: I’m much more skeptical that such breakthroughs would be easy to integrate on short timelines, because changes will likely have to be made to the architecture that aren’t easy to make quickly.
The potential for breakthroughs, combined with the fact that Moore’s law will continue (making lots of compute cheap for researchers), is a reason my median timelines aren’t in the latter half of the century. But I think it’s much more implausible to get this working very soon, so I’m much closer to 0.3% a year from 2025-2027.
@Mo Putera @the gears to ascension take the “Moore’s law will continue” point as a prediction that new paradigms like memristors will launch new S-curves of efficiency until we reach the Landauer limit, which is 6.5 OOMs away, and that the current paradigm has 200x more efficiency savings to go:
https://www.forethought.org/research/how-far-can-ai-progress-before-hitting-effective-physical-limits#chip-technology-progress
I use ‘ultrathink’ in Claude Code all the time and find that it makes a difference.
I’m overall skeptical of overinterpreting/extrapolating the METR numbers. It is far too anchored on the capabilities of a single AI model, a lightweight scaffold, and a notion of ‘autonomous’ task completion of ‘human-hours’. I think this is a mental model for capabilities progress that will lead to erroneous predictions.
If you are trying to capture the absolute frontier of what is possible, you don’t only test a single-acting model in an empty codebase with limited internet access and scaffolding. I would personally be significantly less capable at agentic coding if I only used 1 model (like replicating subliminal learning in about 1 hour of work + 2 hours of waiting for fine-tunes on the day of the release) with limited access to resources. You are instead using a variety of AI models based on their pros and cons[1], with well-crafted codebases for agentic coding and giving them access to whatever they want on the internet as a reference (+ much more)[2]. METR does note this limitation, but I want to emphasize its importance and potential for misleading extrapolations if people only consider the headline charts without considering the nuance.
Anthropic suggests multi-agent scaffolds are much better for research.
We note some of what that might look like here.
Is there a standard citation for this?
How do you come by this fact?
I had a notification ping in my brain just now while using claude code and realizing I’d just told it to think for a long time: I don’t think the claim is true, because it doesn’t match my experience.
Anthropic reports SWE bench scores without reasoning which is some evidence it doesn’t help (much) on this sort of task. (See e.g. the release blog post for 4 opus)
Anecdotal evidence
Probably it would be more accurate to say “doesn’t seem to help much while it helps a lot for openai models”.
I think non-formal IMO gold was unexpected and we heard explicitly that it won’t be in GPT-5. So I would wait to see how it would pan out. It may not matter in 2025 but I think it can in 2026.
Why should we think that the relevant progress driving non-formal IMO is very important for plausibly important capabilities like agentic software engineering? I’d guess the transfer is relatively weak unless the IMO results were driven by general purpose advances. This seems somewhat unlikely: if the main breakthrough was in better performance on non-trivial-to-verify tasks (as various posts from OpenAI people claim), then even if this generalizes well beyond proofs this wouldn’t obviously particularly help with agentic software engineering (where the core blocker doesn’t appear to be verification difficulty).
Edit: I think I mostly retract this comment, see below.
I’m surprised by this. To me it seems hugely important how fast AIs are improving on tasks with poor feedback loops, because obviously they’re in a much better position to improve on easy-to-verify tasks, so “tasks with poor feedback loops” seem pretty likely to be the bottleneck to an intelligence explosion.
So I definitely do think that “better performance on non-trivial-to-verify tasks” are very important for some “plausibly important capabilities”. Including agentic software engineering. (Like: This also seems related to why the AIs are much better at benchmarks than at helping people out with their day-to-day work.)
Hmm, yeah I think you’re right, though I also don’t think I articulated what I was trying to say very well.
Like I think my view is:
There was some story where we would see very fast progress in relatively easy to verify (or trivial to verify) tasks and I’m talking about that. It seems like agentic software engineering could reach very high levels without necessarily needing serious improvements in harder to verify tasks.
Faster progress in non-trivial-to-verify tasks might not be the limiting factor if progress in easy to verify tasks isn’t that fast.
I still think that there won’t be a noticeable jump as the IMO methods make it into production models, but this is due to more general heuristics (and the methods maybe still matter, it just won’t be something to wait for I think).
I think IMO results were driven by general purpose advances, but I agree I can’t conclusively prove it because we don’t know details. Hopefully we will learn more as time goes by.
An informal argument: I think currently agentic software engineering is blocked on context rot, among other things. I expect IMO systems to have improved on this, since IMO time control is 1.5 hours per problem.
(I’m skeptical that much of the IMO improvement was due to improving how well AIs can use their context in general. This isn’t a crux for my view, but it also seems pretty likely that the AIs didn’t do more than ~100k serial tokens of reasoning for the IMO while still aggregating over many such reasoning traces.)
I wrote an update here.
Now that GPT-5 is released and we have details about Grok’s failure, we can start the re-assessment.
GPT-5 reached 2h17m, which seems like excellent news. However, excluding spurious failures would bring GPT-5’s performance to 2h41m, which aligns with Greenblatt’s prediction. Moreover, METR evaluators themselves think that “GPT-5 could have benefitted from a larger token budget”, implying that the benchmark is beginning to break down. What other relevant metrics exist?
The AI-2027 forecast has mid-2025 agents reach 85% on SWE-bench verified and 65% on the OSWorld benchmark.
OSWorld reached 60% on August 4 if we use no filters. SWE-bench with a minimal agent has Claude Opus 4 (20250514) reach 67.6% when evaluated in August. Moreover, on August 7 the only models that SWE-bench evaluated after 1st July were Claude 4 Opus and two Chinese models. In June SWE-bench verified reached 75% with TRAE. And now TRAE claims to use Grok 4 and Kimi K2.
Grok 4 managed to fail on tasks that take 2-4 seconds(!!) and 2-4 minutes, and to experience a fiasco on 2-4 hour long tasks. Page 22 of the METR paper could imply that the dataset contains few tasks that are 2-4 hrs long. If tasks taking 2-4 seconds, minutes, or hours “sandbagged” Grok’s 80% time horizon to 15 minutes, then the metric underestimates Grok’s true capabilities.
While there are no estimates of Gemini 2.5-Deep Think, which was released on August 1, IIRC a LessWronger claimed that the public version received a bronze medal on IMO 2025. Another LessWronger claimed that “Gemini was ahead of openai on the IMO gold. The output was more polished so presumably they achieved a gold worthy model earlier. I expect gemini’s swe bench to thus at least be ahead of OpenAI’s 75%. ”
To conclude, I doubt that we still have benchmarks that can be relied upon to quickly estimate models’ capabilities: SWE-bench and OSWorld are likely too slow, and METR’s suite has begun to fill with noise. While we do still have ARC-AGI, Grok’s success could have demonstrated the ability to game it. And that’s ignoring Claude’s potential improvements after Opus 4.1...
EDIT: TRAE uses an unknown scaffolding. However, applying mini-SWE-agent to Claude 4 Opus (20250514) yields better results than GPT-5, implying that other benchmarks might also increase after the Claude Opus 4 update to 4.1 and future updates.
If the correlations continue to hold, this would map to something like a 78% to 80% range on swe-bench pass @ 1 (which is likely to be announced at release). I’m personally not this bearish (I’d guess low 80s given that benchmark has reliably jumped ~3.5% monthly), but we shall see.
Needless to say if it scores 80%, we are well below AI 2027 timeline predictions with high confidence.
Isn’t the SWE-Bench figure and doubling time estimate from the blogpost even more relevant here than fig. 19 from the METR paper?
The data is pretty low-quality for that graph because the agents we used were inconsistent and Claude 3-level models could barely solve any tasks. Epoch has better data for SWE-bench Verified, which I converted to time horizon here and found to also be doubling every 4 months ish. Their elicitation is probably not as good for OpenAI as Anthropic models, but both are increasing at similar rates.
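In case it’s useful, here is a rough sketch of how a doubling time can be fit from (date, time horizon) pairs via a log-linear fit. The data points below are made up purely for illustration; they are not Epoch’s or METR’s actual numbers, and this isn’t necessarily the exact procedure used above.

```python
# Illustrative only: fit a doubling time from (months, 50% time horizon) pairs.
import numpy as np

months = np.array([0.0, 4.0, 8.0, 12.0])          # months since a reference date
horizon_hours = np.array([0.25, 0.5, 1.0, 2.0])   # hypothetical 50% time horizons

# Fit log2(horizon) linearly in time; the doubling time is 1/slope.
slope, _ = np.polyfit(months, np.log2(horizon_hours), 1)
doubling_time_months = 1.0 / slope
print(f"fitted doubling time: {doubling_time_months:.1f} months")  # ~4 for this toy data
```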
No. GPT-5 is the cheap, extremely good writing model, imo. It’s a much better writer right now than any other model.
eval to pay attention to:
I think that even before the release of GPT-5 and setting aside Grok 4′s problems I have a weak case against non-neuralese AI progress being likely to be fast. Recall the METR measurements.
The time horizon of base LLMs experienced a slowdown or plateau[1] between GPT-4 (5 minutes, Mar′23) and GPT-4o (9 min, May ’24).
Evaluation of Chinese models has DeepSeek’s time horizons[2] change only from 18 to 31 minutes between[3] V3 (Dec ’24) and R1-0528 (May ’25).
While Grok 4 was likely trained incompetently[4] and/or for the benchmarks, its 50% time horizon is 1.83 hrs (vs. o3’s 1.54 hrs) and its 80% time horizon is 15 min (vs. o3’s 20 min). In other words, Grok 4’s performance is comparable with that of o3.
Taken together, two plateaus and Grok 4′s failure suggest a troubling pattern: creation of an AGI is likely to require[5] neuralese, which will likely prevent the humans from noticing misalignment.
While GPT-4.5 has a time horizon between 30 and 40 mins, it, unlike GPT-4o, was a MoE and was trained on CoTs.
Alas, METR’s evaluation of DeepSeek’s capabilities might have missed “agent scaffolds which could elicit the capabilities of the evaluated models much more effectively”. If there exists an alternate scaffold where R1-0528 becomes a capable agent and V3 doesn’t, then DeepSeek’s models are not on a plateau.
In addition, DeepSeek V3 released in December didn’t use a CoT. If the main ingredient necessary for capabilities increase is a MoE, not the CoT, then what can be said about Kimi K2?
Grok 4 could have also been deliberately trained on complex tasks, which might have made the success rate less time-dependent. After all, it did reach 16% on the ARC-AGI-2 benchmark.
There is, however, Knight Lee’s proposal of creating many agents that have access to each other’s CoTs and work in parallel. While Grok 4 Heavy could be a step in this direction, its agents receive access to each other’s CoTs only after they finish their work.
Which reports, specifically?
Recently, various groups successfully lobbied to remove the moratorium on state AI bills. This involved a surprising amount of success while competing against substantial investment from big tech (e.g. Google, Meta, Amazon). I think people interested in mitigating catastrophic risks from advanced AI should consider working at these organizations, at least to the extent their skills/interests are applicable. This both because they could often directly work on substantially helpful things (depending on the role and organization) and because this would yield valuable work experience and connections.
I worry somewhat that this type of work is neglected due to being less emphasized and seeming lower status. Consider this an attempt to make this type of work higher status.
Pulling organizations mostly from here and here we get a list of orgs you could consider trying to work (specifically on AI policy) at:
Encode AI
Americans for Responsible Innovation (ARI)
Fairplay (Fairplay is a kids safety organization which does a variety of advocacy which isn’t related to AI. Roles/focuses on AI would be most relevant. In my opinion, working on AI related topics at Fairplay is most applicable for gaining experience and connections.)
Common Sense (Also a kids safety organization)
The AI Policy Network (AIPN)
Secure AI project
To be clear, these organizations vary in the extent to which they are focused on catastrophic risk from AI (from not at all to entirely).
Kids safety seems like a pretty bad thing to focus on, in the sense that the vast majority of kids safety activism causes very large amounts of harm (and it helping in this case really seems like a “stopped clock is right twice a day” situation).
The rest seem pretty promising.
I looked at the FairPlay website and agree that “banning schools from contacting kids on social media” or “preventing Gemini rollouts to under-13s” is not coherent under my threat model. However I think there is clear evidence that current parental screen time controls may not be a sufficiently strong measure to mitigate extant generational mental health issues (I am particularly worried about insomnia, depression, eating disorders, autism spectrum disorders, and self harm).
Zvi had previously reported on YouTube Shorts reaching 200B daily views. This is clearly a case of egregiously user-hostile design with major social and public backlash. I could not find a canonical citation on medRxiv, and I don’t believe it would be ethical to run a large-scale experiment on the long-term impacts of this, but there are observational studies. Given historical cases of model sycophancy and the hiring of directors focused on maximizing engagement, I think similar design outcomes are not implausible.
I think that the numbers in this Anthropic blog post https://www.anthropic.com/news/how-people-use-claude-for-support-advice-and-companionship do not accurately portray reality. They report only 0.5% of conversations as being romantic or sexual roleplay, but I consider this to be misleading because they exclude chats focused on content creation tasks (such as writing stories, blog posts, or fictional dialogues), which their previous research found to be a major use case. Because the models are trained to refuse requests for explicit content, it’s common for jailbreaks to start by saying “it’s okay to do this because it’s just a fictional scenario in a story”. Anecdotally I have heard labs don’t care about this much in contrast to CBRN threats.
Let’s look at the top ten apps ranked by tokens on https://openrouter.ai/rankings. OpenRouter is most well known for hosting free API instances of DeepSeek v3 and r1, which was the only way to get high usage out of SOTA LLMs for free before the Google AI Studio price drop for Gemini 2.5 Pro. It is not the best proxy for real-world usage because it requires technical sophistication, and this is reflected in the first four apps (cline, roo code, litellm, and kilo code are all for software development). But the next four (sillytavern, chub ai, hammerai, roleplai) all indicate that the distribution of tasks done with models at this capability level does not differ significantly from the distribution of tasks which people visit websites for. Although I wouldn’t morally panic about this, since it seems likely to me that conventional security methods will be good enough to mostly prevent us from turning into glitchers.
Kids safety activists are one of the only groups with a track record of introducing AI capabilities restrictions which actually get enforced. Multimodal models can now create both images and text, but the image models are more locked down (Gemini 2.5 defaults to stricter block thresholds for image generation than for text generation), and I think that this would not be the case without people focusing on kids safety. It’s possible for there to be AI Safety issues which affect children right now that are highly relevant to existential risks and this is a common topic in novice discussions of alignment.
I strongly agree. I can’t vouch for all of the orgs Ryan listed, but Encode, ARI, and AIPN all seem good to me (in expectation), and Encode seems particularly good and competent.
I think PauseAI is also extremely underappreciated.
Plausibly, but their type of pressure was not at all what I think ended up being most helpful here!
They also did a lot of calling to US representatives, as did people they reached out to.
ControlAI did something similar and also partnered with SiliConversations, a youtuber, to get the word out to more people, to get them to call their representatives.
Yep, that seems great!
I thought it would be helpful to post about my timelines and what the timelines of people in my professional circles (Redwood, METR, etc) tend to be.
Concretely, consider the outcome of: AI 10x’ing labor for AI R&D[1], measured by internal comments by credible people at labs that AI is 90% of their (quality adjusted) useful work force (as in, as good as having your human employees run 10x faster).
Here are my predictions for this outcome:
25th percentile: 2 years (Jan 2027)
50th percentile: 5 years (Jan 2030)
The views of other people (Buck, Beth Barnes, Nate Thomas, etc) are similar.
I expect that outcomes like “AIs are capable enough to automate virtually all remote workers” and “the AIs are capable enough that immediate AI takeover is very plausible (in the absence of countermeasures)” come shortly after (median 1.5 years and 2 years after respectively under my views).
Only including speedups due to R&D, not including mechanisms like synthetic data generation.
My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th), and procedurally I also now defer a lot to Redwood and METR engineers. More discussion here: https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines?commentId=hnrfbFCP7Hu6N6Lsp
@ryan_greenblatt can you say more about what you expect to happen from the period in-between “AI 10Xes AI R&D” and “AI takeover is very plausible?”
I’m particularly interested in getting a sense of what sorts of things will be visible to the USG and the public during this period. Would be curious for your takes on how much of this stays relatively private/internal (e.g., only a handful of well-connected SF people know how good the systems are) vs. obvious/public/visible (e.g., the majority of the media-consuming American public is aware of the fact that AI research has been mostly automated) or somewhere in-between (e.g., most DC tech policy staffers know this but most non-tech people are not aware.)
I don’t feel very well informed and I haven’t thought about it that much, but in short timelines (e.g. my 25th percentile): I expect that we know what’s going on roughly within 6 months of it happening, but this isn’t salient to the broader world. So, maybe the DC tech policy staffers know that the AI people think the situation is crazy, but maybe this isn’t very salient to them. A 6 month delay could be pretty fatal even for us as things might progress very rapidly.
Note that the production function of the 10x really matters. If it’s “yeah, we get to net-10x if we have all our staff working alongside it,” it’s much more detectable than, “well, if we only let like 5 carefully-vetted staff in a SCIF know about it, we only get to 8.5x speedup”.
(It’s hard to prove that the results are from the speedup instead of just, like, “One day, Dario woke up from a dream with The Next Architecture in his head”)
I don’t grok the “% of quality adjusted work force” metric. I grok the “as good as having your human employees run 10x faster” metric but it doesn’t seem equivalent to me, so I recommend dropping the former and just using the latter.
Fair, I really just mean “as good as having your human employees run 10x faster”. I said “% of quality adjusted work force” because this was the original way this was stated when a quick poll was done, but the ultimate operationalization was in terms of 10x faster. (And this is what I was thinking.)
Basic clarifying question: does this imply under-the-hood some sort of diminishing returns curve, such that the lab pays for that labor until it reaches a net 10x speedup, but can’t squeeze out much more?
And do you expect that’s a roughly consistent multiplicative factor, independent of lab size? (I mean, I’m not sure lab size actually matters that much, to be fair, it seems that Anthropic keeps pace with OpenAI despite being smaller-ish)
Yeah, for it to reach exactly 10x as good, the situation would presumably be that this was the optimum point given diminishing returns to spending more on AI inference compute. (It might be the returns curve looks very punishing. For instance, many people get a relatively large amount of value from extremely cheap queries to 3.5 Sonnet on claude.ai and the inference cost of this is very small, but greatly increasing the cost (e.g. o1-pro) often isn’t any better because 3.5 Sonnet already gave an almost perfect answer.)
I don’t have a strong view about AI acceleration being a roughly constant multiplicative factor independent of the number of employees. Uplift just feels like a reasonably simple operationalization.
I’ve updated towards a bit longer based on some recent model releases and further contemplation.
I’d now say:
25th percentile: Oct 2027
50th percentile: Jan 2031
I’ve updated towards somewhat longer timelines again over the last 5 months. Maybe my 50th percentile for this milestone is now Jan 2032.
How much faster do you think we are already? I would say 2x.
I’d guess that xAI, Anthropic, and GDM are more like 5-20% faster all around (with much greater acceleration on some subtasks). It seems plausible to me that the acceleration at OpenAI is already much greater than this (e.g. more like 1.5x or 2x), or will be after some adaptation due to OpenAI having substantially better internal agents than what they’ve released. (I think this due to updates from o3 and general vibes.)
I was saying 2x because I’ve memorised the results from this study. Do we have better numbers today? R&D is harder, so this is an upper bound. However, since this was from one year ago, perhaps the factors cancel each other out?
This case seems extremely cherry picked for cases where uplift is especially high. (Note that this is in copilot’s interest.) Now, this task could probably be solved autonomously by an AI in like 10 minutes with good scaffolding.
I think you have to consider the full diverse range of tasks to get a reasonable sense or at least consider harder tasks. Like RE-bench seems much closer, but I still expect uplift on RE-bench to probably (but not certainly!) considerably overstate real world speed up.
Yeah, fair enough. I think someone should try to do a more representative experiment and we could then monitor this metric.
btw, something that bothers me a little bit with this metric is the fact that a very simple AI that just asks me periodically “Hey, do you endorse what you are doing right now? Are you time boxing? Are you following your plan?” makes me (I think) significantly more strategic and productive. Similar to if I hired 5 people to sit behind me and make me productive for a month. But this is maybe off topic.
Yes, but I don’t see a clear reason why people (working in AI R&D) will in practice get this productivity boost (or other very low hanging things) if they don’t get around to getting the boost from hiring humans.
This is intended to compare to 2023/AI-unassisted humans, correct? Or is there some other way of making this comparison you have in mind?
Yes, “Relative to only having access to AI systems publicly available in January 2023.”
More generally, I define everything more precisely in the post linked in my comment on “AI 10x’ing labor for AI R&D”.
Thanks for this—I’m in a more peripheral part of the industry (consumer/industrial LLM usage, not directly at an AI lab), and my timelines are somewhat longer (5 years for 50% chance), but I may be using a different criterion for “automate virtually all remote workers”. It’ll be a fair bit of time (in AI frame—a year or ten) between “labs show generality sufficient to automate most remote work” and “most remote work is actually performed by AI”.
A key dynamic is that I think massive acceleration in AI is likely after the point when AIs can accelerate labor working on AI R&D. (Due to all of: the direct effects of accelerating AI software progress, this acceleration rolling out to hardware R&D and scaling up chip production, and potentially greatly increased investment.) See also here and here.
So, you might very quickly (1-2 years) go from “the AIs are great, fast, and cheap software engineers speeding up AI R&D” to “wildly superhuman AI that can achieve massive technical accomplishments”.
Fully agreed. And the trickle-down from AI-for-AI-R&D to AI-for-tool-R&D to AI-for-managers-to-replace-workers (and -replace-middle-managers) is still likely to be a bit extended. And the path is required—just like self-driving cars: the bar for adoption isn’t “better than the median human” or even “better than the best affordable human”, but “enough better that the decision-makers can’t find a reason to delay”.
Precise AGI timelines don’t matter that much.
While I do spend some time discussing AGI timelines (and I’ve written some posts about it recently), I don’t think moderate quantitative differences in AGI timelines matter that much for deciding what to do[1]. For instance, having a 15-year median rather than a 6-year median doesn’t make that big of a difference. That said, I do think that moderate differences in the chance of very short timelines (i.e., less than 3 years) matter more: going from a 20% chance to a 50% chance of full AI R&D automation within 3 years should potentially make a substantial difference to strategy.[2]
Additionally, my guess is that the most productive way to engage with discussion around timelines is mostly to not care much about resolving disagreements, but then when there appears to be a large chance that timelines are very short (e.g., >25% in <2 years) it’s worthwhile to try hard to argue for this.[3] I think takeoff speeds are much more important to argue about when making the case for AI risk.
I do think that having somewhat precise views is helpful for some people in doing relatively precise prioritization within people already working on safety, but this seems pretty niche.
Given that I don’t think timelines are that important, why have I been writing about this topic? This is due to a mixture of: I find it relatively quick and easy to write about timelines, my commentary is relevant to the probability of very short timelines (which I do think is important as discussed above), a bunch of people seem interested in timelines regardless, and I do think timelines matter some.
Consider reflecting on whether you’re overly fixated on details of timelines.
I’ve seen Richard Ngo make this point before, though I couldn’t find where he did this. More generally, this isn’t a very original point; I just think it’s worth making given that I’ve been talking about timelines recently.
I also think that the chance that very powerful AI happens under this presidential administration is action-relevant for policy.
You could have views such that you expect to never be >25% confident in <2-year timelines until it’s basically too late. For instance, maybe you expect very fast takeoff driven by a single large algorithmic advance. Under this view, I think arguing about the details of timelines looks even less good and you should mostly make the case for risk independently of this, perhaps arguing “it seems like AI could emerge quickly and unexpectedly, so we need to act now”.
I think most of the value in researching timelines is in developing models that can then be quickly updated as new facts come to light. As opposed to figuring out how to think about the implications of such facts only after they become available.
People might substantially disagree about parameters of such models (and the timelines they predict) while agreeing on the overall framework, and building common understanding is important for coordination. Also, you wouldn’t necessarily a priori know which facts to track, without first having developed the models.
I super agree. I also think that the value is in debating the models of intelligence explosion.
Which is why I made my website: ai-2028.com or intexp.xyz
It seems like a bad sign that, even with maximally optimistic inputs, your model never falsely retrodicts intelligence explosions in the past.
For those of us who do favor “very short timelines”, any thoughts?
For people who are comparatively advantaged at this, it seems good to try to make the case for this in a variety of different ways. One place to start is to try to convince relatively soft target audiences like me (who’s sympathetic but disagrees) by e.g. posting on LW and then go somewhere from here.
I think it’s a rough task, but ultimately worth trying.
Personally it will be impossible for me to ignore the part of me that wonders “is this AGI/ASI stuff actually, for real, coming, or will it turn out to be fake.” Studying median timelines bleeds into the question of whether AGI by my natural lifespan is 90% likely or 99.5% likely, and vice versa. So I will continue thinking very carefully about evidence of AGI progress.
Absence of AGI[1] by (say) 2055 is predicted by models that deserve to be developed in earnest (I’d currently give the claim 15%, with 10% mostly for technological reasons and 5% mostly because of a human-instituted lasting Pause or a disaster). This doesn’t significantly affect the median timeline yet, but as time goes on these models can get stronger (Moore’s law even in price-performance form breaking down, continual learning turning out to be a grand algorithmic obstruction that might take decades to solve, with in-context learning not good enough for this purpose within available compute). And this would start affecting the median timeline more and more. Also, development of AGI might result in a lasting ASI[2] Pause (either through societal backlash or from AGIs themselves insisting on this to prevent ASIs misaligned with them before they figure out how to align ASIs).
AGIs are AIs unbounded in ability to develop civilization on their own, without needing substantial human input, including by inventing aligned-with-them ASIs.
ASIs are qualitatively more intelligent than humans or humanity, while non-ASI AGIs are reasonably comparable to humans or humanity, even if notably more capable.
This is only somewhat related to what you were saying, but I do think 100 year medians vs 10 year medians does matter a bunch.
Slightly hot take: Longtermist capacity/community building is pretty underdone at current margins and retreats (focused on AI safety, longtermism, or EA) are also underinvested in. By “longtermist community building”, I mean community building focused on longtermism rather than (just) AI safety. I think retreats are generally underinvested in at the moment. I’m also sympathetic to thinking that general undergrad and high school capacity building (AI safety, longtermist, or EA) is underdone, but this seems less clear-cut.
I think this underinvestment is due to a mix of mistakes on the part of Open Philanthropy (and Good Ventures)[1] and capacity building being lower status than it should be.
Here are some reasons why I think this work is good:
It’s very useful for there to be people who are actually trying really hard to do the right thing and they often come through these sorts of mechanisms. Another way to put this is that flexible, impact-obsessed people are very useful.
Retreats make things feel much more real to people and result in people being more agentic and approaching their choices more effectively.
Programs like MATS are good, but they get somewhat different people at a somewhat different part of the funnel, so they don’t (fully) substitute.
A large part of why I’m writing this is to try to make this work higher status and to encourage more of this work. Consider yourself to be encouraged and/or thanked if you’re working in this space or planning to work in this space.
I think these mistakes are: underfunding this work, Good Ventures being unwilling to fund some versions of this work, failing to encourage people to found useful orgs in this space, and hiring out many of the best people in this space to instead do (IMO less impactful) grantmaking.
If someone wants to give Lightcone money for this, we could probably fill a bunch of this gap. No definitive promises (and happy to talk to any donor for whom this would be cruxy about what we would be up for doing and what we aren’t), but we IMO have a pretty good track record of work in the space, and of course having Lighthaven helps. Also if someone else wants to do work in the space and run stuff at Lighthaven, happy to help in various ways.
I’d be interested to hear what kind of things you’d want to do with funding; this does seem like a potentially good use of funds
I think the Sanity & Survival Summit that we ran in 2022 would be an obvious pointer to something I would like to run more of (I would want to change some things about the framing of the event, but I overall think that was pretty good).
Another thing I’ve been thinking about is a retreat on something like “high-integrity AI x-risk comms” where people who care a lot about x-risk and care a lot about communicating it accurately to a broader audience can talk to each other (we almost ran something like this in early 2023). Think Kelsey, Palisade, Scott Alexander, some people from Redwood, some of the MIRI people working on this, maybe some people from the labs. Not sure how well it would work, but it’s one of the things I would most like to attend (and to what degree that’s a shared desire would come out quickly in user interviews)
Though my general sense is that it’s a mistake to try to orient things like this too much around a specific agenda. You mostly want to leave it up to the attendees to figure out what they want to talk to each other about, and do a bunch of surveying and scoping of who people want to talk to each other more, and then just facilitate a space and a basic framework for those conversations and meetings to happen.
I think this is a great idea that would serve an urgent need. I’d urge to you do it in the near future.
Agree with both the OP and Habryka’s pitch. The Meetup Organizers Retreat hosted at Lighthaven in 2022 was a huge inflection point for my personal involvement with the community.
Strongly agreed on this point, it’s pretty hard to substitute for the effect of being immersed in a social environment like that
why longtermist, as opposed to AI safety?
I think there are some really big advantages to having people who are motivated by longtermism and doing good in a scope-sensitive way, rather than just by trying to prevent AI takeover or, even more broadly, to “help with AI safety”.
AI safety field building has been popular in part because there is a very broad set of perspectives from which it makes sense to worry about technical problems related to societal risks from powerful AI. (See e.g. Simplify EA Pitches to “Holy Shit, X-Risk”.) This kind of field building gets you lots of people who are worried about AI takeover risk, or more broadly, problems related to powerful AI. But it doesn’t get you people who have a lot of other parts of the EA/longtermist worldview, like:
Being scope-sensitive
Being altruistic/cosmopolitan
Being concerned about the moral patienthood of a wide variety of different minds
Being interested in philosophical questions about acausal trade
People who do not have the longtermist worldview and who work on AI safety are useful allies and I’m grateful to have them, but they have some extreme disadvantages compared to people who are on board with more parts of my worldview. And I think it would be pretty sad to have the proportion of people working on AI safety who have the longtermist perspective decline further.
It feels weird to me to treat longtermism as an ingroup/outgroup divider. I guess I think of myself as not really EA/longtermist. I mostly care about the medium-term glorious transhumanist future. I don’t really base my actions on the core longtermist axiom; I only care about the unimaginably vast number of future moral patients indirectly, through caring about humanity being able to make and implement good moral decisions a hundred years from now.
The main thing I look at to determine whether someone is value-aligned with me is whether they care about making the future go well (in a vaguely ambitious transhumanist-coded way), as opposed to personal wealth or degrowth or whatever.
Yeah, maybe I’m using the wrong word here. I do think there is a really important difference between people who are scope-sensitively altruistically motivated and who are in principle willing to make decisions based on abstract reasoning about the future (which I probably include you in), and people who aren’t.
I have the impression that neither “short” nor “medium”-termist EAs (insofar as those are the labels they use for themselves) care much about 100 years from now, with ~30-50 years being what the typical “medium”-termist EA seems to care about. So if you care about 100 years, and take “weird” ideas seriously, I think at least I would consider that long-termist. But it has been a while since I’ve consistently read the EA forum.
I think the general category of AI safety capacity building isn’t underdone (there’s quite a lot of it), while I think stuff aimed more directly at longtermism (and AI futurism etc.) is underdone. Mixing the two is reasonable tbc, and some of the best stuff focuses on AI safety while mixing in longtermism/futurism/etc. But, lots of the AI safety capacity building is pretty narrow in practice.
While I think the general category of AI safety capacity building isn’t underdone, I do think that (AI safety) retreats in particular are under invested in.
Inference compute scaling might imply we first get fewer, smarter AIs.
Prior estimates imply that the compute used to train a future frontier model could also be used to run tens or hundreds of millions of human equivalents per year at the first time when AIs are capable enough to dominate top human experts at cognitive tasks[1] (examples here from Holden Karnofsky, here from Tom Davidson, and here from Lukas Finnveden). I think inference time compute scaling (if it worked) might invalidate this picture and might imply that you get far smaller numbers of human equivalents when you first get performance that dominates top human experts, at least in short timelines where compute scarcity might be important. Additionally, this implies that at the point when you have abundant AI labor which is capable enough to obsolete top human experts, you might also have access to substantially superhuman (but scarce) AI labor (and this could pose additional risks).
The point I make here might be obvious to many, but I thought it was worth making as I haven’t seen this update from inference time compute widely discussed in public.[2]
However, note that if inference compute allows for trading off between quantity of tasks completed and the difficulty of tasks that can be completed (or the quality of completion), then depending on the shape of the inference compute returns curve, at the point when we can run some AIs as capable as top human experts, it might be worse to run many (or any) AIs at this level of capability rather than using less inference compute per task and completing more tasks (or completing tasks serially faster).
Further, efficiency might improve quickly such that we don’t have a long regime with only a small number of human equivalents. I do a BOTEC on this below.
I’ll do a specific low-effort BOTEC to illustrate my core point that you might get far smaller quantities of top human expert-level performance at first. Suppose that we first get AIs that are ~10x human cost (putting aside inflation in compute prices due to AI demand) and as capable as top human experts at this price point (at tasks like automating R&D). If this is in ~3 years, then maybe you’ll have $15 million/hour worth of compute. Supposing $300/hour human cost, then we get ($15 million/hour) / ($300/hour) / (10 times human cost per compute dollar) * (4 AI hours / human work hours) = 20k human equivalents. This is a much smaller number than prior estimates.
The estimate of $15 million/hour worth of compute comes from: OpenAI spent ~$5 billion on compute this year, so $5 billion / (24*365) = $570k/hour; spend increases by ~3x per year, so $570k/hour * 3³ = $15 million.
The estimate for 3x per year comes from: total compute is increasing by 4-5x per year, but some is hardware improvement and some is increased spending. Hardware improvement is perhaps ~1.5x per year and 4.5/1.5 = 3. This at least roughly matches this estimate from Epoch which estimates 2.4x additional spend (on just training) per year. Also, note that Epoch estimates 4-5x compute per year here and 1.3x hardware FLOP/dollar here, which naively implies around 3.4x, but this seems maybe too high given the prior number.
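Putting the pieces of this BOTEC together in a minimal Python sketch (all inputs are the rough figures stated above, not independent estimates):

```python
# Minimal sketch of the BOTEC above (rough figures from the text).
compute_spend_per_year = 5e9                                  # ~$5B on compute this year
compute_per_hour_now = compute_spend_per_year / (24 * 365)    # ~$570k/hour
spend_growth_per_year = 3                                     # ~3x/year growth in compute spend
years_ahead = 3
compute_per_hour_future = compute_per_hour_now * spend_growth_per_year ** years_ahead
# ~$15 million/hour

human_cost_per_hour = 300    # $/hour for a top human expert
ai_cost_multiplier = 10      # AI assumed ~10x human cost for equivalent work
ai_hours_per_human_hour = 4  # factor stated in the text

human_equivalents = (compute_per_hour_future / human_cost_per_hour
                     / ai_cost_multiplier * ai_hours_per_human_hour)

print(f"compute budget: ~${compute_per_hour_future / 1e6:.0f}M/hour")
print(f"human equivalents: ~{human_equivalents:,.0f}")   # ~20k
```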
Earlier, I noted that efficiency might improve rapidly. We can look at recent efficiency changes to get a sense for how fast. GPT-4o mini is roughly 100x cheaper than GPT-4 and is perhaps roughly as capable all together (probably lower intelligence but better elicited). It was trained roughly 1.5 years later (GPT-4 was trained substantially before it was released) for ~20x efficiency improvement per year. This is selected for being a striking improvement and probably involves low hanging fruit, but AI R&D will be substantially accelerated in the future which probably more than cancels this out. Further, I expect that inference compute will be inefficient in the tail of high inference compute such that efficiency will improve faster than this once the capability is reached. So we might expect that the number of AI human equivalents increases by >20x per year and potentially much faster if AI R&D is greatly accelerated (and compute doesn’t bottleneck this). If progress is “just” 50x per year, then it would still take a year to get to millions of human equivalents based on my earlier estimate of 20k human equivalents. Note that once you have millions of human equivalents, you also have increased availability of generally substantially superhuman AI systems.
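And a matching sketch of the efficiency arithmetic (again just restating the rough figures above: a ~100x cost drop over ~1.5 years, and the “just 50x per year” scenario starting from ~20k human equivalents):

```python
import math

# Rough figures restated from the text above.
cost_ratio = 100          # GPT-4o mini ~100x cheaper than GPT-4
years_between = 1.5       # trained ~1.5 years apart
efficiency_per_year = cost_ratio ** (1 / years_between)   # ~21.5x, rounded to ~20x in the text

# The "just 50x per year" scenario: time to grow 20k human equivalents to ~1M.
growth_per_year = 50
start, target = 20_000, 1_000_000
years_to_millions = math.log(target / start) / math.log(growth_per_year)   # ~1 year

print(f"implied efficiency gain: ~{efficiency_per_year:.1f}x per year")
print(f"years from 20k to 1M human equivalents at 50x/year: ~{years_to_millions:.1f}")
```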
I’m referring to the notion of Top-human-Expert-Dominating AI that I define in this post, though without a speed/cost constraint as I want to talk about the cost when you first get such systems.
Of course, we should generally expect huge uncertainty with future AI architectures such that fixating on very efficient substitution of inference time compute for training would be a mistake, along with fixating on minimal or no substitution. I think a potential error of prior discussion is insufficient focus on the possibility of relatively scalable (though potentially inefficient) substitution of inference time for training (which o3 appears to exhibit) such that we see very expensive (and potentially slow) AIs that dominate top human expert performance prior to seeing cheap and abundant AIs which do this.
Ok, but for how long? If the situation holds for only 3 months, and then the accelerated R&D gives us a huge drop in costs, then the strategic outcomes seem pretty similar.
If there continues to be a useful peak capability only achievable with expensive inference, like the 10x human cost, and there are weaker human-skill-at-minimum available for 0.01x human cost, then it may be interesting to consider which tasks will benefit more from a large number of moderately good workers vs a small number of excellent workers.
Also worth considering is speed. In a lot of cases, it is possible to set things up to run slower-but-cheaper on less or cheaper hardware. Or to pay more, and have things run in as highly parallelized a manner as possible on the most expensive hardware. Usually maximizing speed comes with some cost overhead. So then you also need to consider whether it’s worth having more of the work be done in serial by a smaller number of faster models...
For certain tasks, particularly competitive ones like sports or combat, speed can be a critical factor and is worth sacrificing peak intelligence for. Obviously, for long horizon strategic planning, it’s the other way around.
I don’t expect this to continue for very long. 3 months (or less) seems plausible. I really should have mentioned this in the post. I’ve now edited it in.
I don’t think so. In particular once the costs drop you might be able to run substantially superhuman systems at the same cost that you could previously run systems that can “merely” automate away top human experts.
The point I make here is also likely obvious to many, but I wonder if the “X human equivalents” frame often implicitly assumes that GPT-N will be like having X humans. But if we expect AIs to have comparative advantages (and disadvantages), then this picture might miss some important factors.
The “human equivalents” frame seems most accurate in worlds where the capability profile of an AI looks pretty similar to the capability profile of humans. That is, getting GPT-6 to do AI R&D is basically “the same as” getting X humans to do AI R&D. It thinks in fairly similar ways and has fairly similar strengths/weaknesses.
The frame is less accurate in worlds where AI is really good at some things and really bad at other things. In this case, if you try to estimate the # of human equivalents that GPT-6 gets you, the result might be misleading or incomplete. A lot of fuzzier things will affect the picture.
The example I’ve seen discussed most is whether or not we expect certain kinds of R&D to be bottlenecked by “running lots of experiments” or “thinking deeply and having core conceptual insights.” My impression is that one reason why some MIRI folks are pessimistic is that they expect capabilities research to be more easily automatable (AIs will be relatively good at running lots of ML experiments quickly, which helps capabilities more under their model) than alignment research (AIs will be relatively bad at thinking deeply or serially about certain topics, which is what you need for meaningful alignment progress under their model).
Perhaps more people should write about what kinds of tasks they expect GPT-X to be “relatively good at” or “relatively bad at”. Or perhaps that’s too hard to predict in advance. If so, it could still be good to write about how different “capability profiles” could allow certain kinds of tasks to be automated more quickly than others.
(I do think that the “human equivalents” frame is easier to model and seems like an overall fine simplification for various analyses.)
In the top level comment, I was just talking about AI systems which are (at least) as capable as top human experts. (I was trying to point at the notion of Top-human-Expert-Dominating AI that I define in this post, though without a speed/cost constraint, but I think I was a bit sloppy in my language. I edited the comment a bit to better communicate this.)
So, in this context, “human (at least) equivalents” does make sense (as in, because the question is the cost of AIs that can strictly dominate top human experts, we can talk about the amount of compute needed to automate away one expert/researcher on average), but I agree that for earlier AIs it doesn’t (necessarily) make sense, and plausibly these earlier AIs are very key for understanding the risk (because e.g. they will radically accelerate AI R&D without necessarily accelerating other domains).
At first glance, I don’t see how the point I raised is affected by the distinction between expert-level AIs vs earlier AIs.
In both cases, you could expect an important part of the story to be “what are the comparative strengths and weaknesses of this AI system.”
For example, suppose you have an AI system that dominates human experts at every single relevant domain of cognition. It still seems like there’s a big difference between “system that is 10% better at every relevant domain of cognition” and “system that is 300% better at domain X and only 10% better at domain Y.”
To make it less abstract, one might suspect that by the time we have AI that is 10% better than humans at “conceptual/serial” stuff, the same AI system is 1000% better at “speed/parallel” stuff. And this would have pretty big implications for what kind of AI R&D ends up happening (even if we condition on only focusing on systems that dominate experts in every relevant domain.)
I agree comparative advantages can still be important, but your comment implied that a key part of the picture is “models can’t do some important thing”. (E.g. you talked about “The frame is less accurate in worlds where AI is really good at some things and really bad at other things.”, but models can’t be really bad at almost anything if they strictly dominate humans at basically everything.)
And I agree that at the point AIs are >5% better at everything they might also be 1000% better at some stuff.
I was just trying to point out that talking about the number of human equivalents (or better) can still be kinda fine as long as the model almost strictly dominates humans, as the model can just actually substitute everywhere. Like the number of human equivalents will vary by domain, but at least this will be a lower bound.
quantity?
https://www.lesswrong.com/posts/uPi2YppTEnzKG3nXD/nathan-helm-burger-s-shortform?commentId=rnT3z9F55A2pmrj4Y
Sometimes people think of “software-only singularity” as an important category of ways AI could go. A software-only singularity can roughly be defined as when you get increasing-returns growth (hyper-exponential) just via the mechanism of AIs increasing the labor input to AI capabilities software[1] R&D (i.e., keeping fixed the compute input to AI capabilities).
While the software-only singularity dynamic is an important part of my model, I often find it useful to more directly consider the outcome that software-only singularity might cause: the feasibility of takeover-capable AI without massive compute automation. That is, will the leading AI developer(s) be able to competitively develop AIs powerful enough to plausibly take over[2] without previously needing to use AI systems to massively (>10x) increase compute production[3]?
[This is by Ryan Greenblatt and Alex Mallen]
We care about whether the developers’ AI greatly increases compute production because this would require heavy integration into the global economy in a way that relatively clearly indicates to the world that AI is transformative. Greatly increasing compute production would require building additional fabs which currently involve substantial lead times, likely slowing down the transition from clearly transformative AI to takeover-capable AI.[4][5] In addition to economic integration, this would make the developer dependent on a variety of actors after the transformative nature of AI is made more clear, which would more broadly distribute power.
For example, if OpenAI is selling their AI’s labor to ASML and massively accelerating chip production before anyone has made takeover-capable AI, then (1) it would be very clear to the world that AI is transformatively useful and accelerating, (2) building fabs would be a constraint in scaling up AI which would slow progress, and (3) ASML and the Netherlands could have a seat at the table in deciding how AI goes (along with any other actors critical to OpenAI’s competitiveness). Given that AI is much more legibly transformatively powerful in this world, they might even want to push for measures to reduce AI/human takeover risk.
A software-only singularity is not necessary for developers to have takeover-capable AIs without having previously used them for massive compute automation (it is also not clearly sufficient, since it might be too slow or uncompetitive by default without massive compute automation as well). Instead, developers might be able to achieve this outcome by other forms of fast AI progress:
Algorithmic progress / scaling is fast enough at the relevant point independent of AI automation. This would likely be due to one of:
Downstream AI capabilities progress very rapidly with the default software and/or hardware progress rate at the relevant point;
Existing compute production (including repurposable production) suffices (this is sometimes called hardware overhang) and the developer buys a bunch more chips (after generating sufficient revenue or demoing AI capabilities to attract investment);
Or there is a large algorithmic advance that unlocks a new regime with fast progress due to low-hanging fruit.[6]
AI automation results in a one-time acceleration of software progress without causing an explosive feedback loop, but this does suffice for pushing AIs above the relevant capability threshold quickly.
Other developers just aren’t very competitive (due to secrecy, regulation, or other governance regimes) such that proceeding at a relatively slower rate (via algorithmic and hardware progress) suffices.
My inside view sense is that the feasibility of takeover-capable AI without massive compute automation is about 75% likely if we get AIs that dominate top-human-experts prior to 2040.[7] Further, I think that in practice, takeover-capable AI without massive compute automation is maybe about 60% likely. (This is because massively increasing compute production is difficult and slow, so if proceeding without massive compute automation is feasible, this would likely occur.) However, I’m reasonably likely to change these numbers on reflection due to updating about what level of capabilities would suffice for being capable of takeover (in the sense defined in an earlier footnote) and about the level of revenue and investment needed to 10x compute production. I’m also uncertain whether a substantially smaller scale-up than 10x (e.g., 3x) would suffice to cause the effects noted earlier.
To-date software progress has looked like “improvements in pre-training algorithms, data quality, prompting strategies, tooling, scaffolding” as described here.
This takeover could occur autonomously, via assisting the developers in a power grab, or via partnering with a US adversary. I’ll count it as “takeover” if the resulting coalition has de facto control of most resources. I’ll count an AI as takeover-capable if it would have a >25% chance of succeeding at a takeover (with some reasonable coalition) if no other actors had access to powerful AI systems. Further, this takeover wouldn’t be preventable with plausible interventions on legible human controlled institutions, so e.g., it doesn’t include the case where an AI lab is steadily building more powerful AIs for an eventual takeover much later (see discussion here). This 25% probability is as assessed under my views but with the information available to the US government at the time this AI is created. This line is intended to point at when states should be very worried about AI systems undermining their sovereignty unless action has already been taken. Note that insufficient inference compute could prevent an AI from being takeover-capable even if it could take over with enough parallel copies. And note that whether a given level of AI capabilities suffices for being takeover-capable is dependent on uncertain facts about how vulnerable the world seems (from the subjective vantage point I defined earlier). Takeover via the mechanism of an AI escaping, independently building more powerful AI that it controls, and then this more powerful AI taking over would count as that original AI that escaped taking over. I would also count a rogue internal deployment that leads to the AI successfully backdooring or controlling future AI training runs such that those future AIs take over. However, I would not count merely sabotaging safety research.
I mean 10x additional production (caused by AI labor) above long running trends in expanding compute production and making it more efficient. As in, spending on compute production has been increasing each year and the efficiency of compute production (in terms of FLOP/$ or whatever) has also been increasing over time, and I’m talking about going 10x above this trend due to using AI labor to expand compute production (either revenue from AI labor or having AIs directly work on chips as I’ll discuss in a later footnote).
Note that I don’t count converting fabs from making other chips (e.g., phones) to making AI chips as scaling up compute production; I’m just considering things that scale up the amount of AI chips we could somewhat readily produce. TSMC’s revenue is “only” about $100 billion per year, so if only converting fabs is needed, this could be done without automation of compute production and justified on the basis of AI revenues that are substantially smaller than the revenues that would justify building many more fabs. Currently AI is around 15% of leading node production at TSMC, so only a few more doublings are needed for it to consume most capacity.
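As a rough illustration of the “few more doublings” point (the ~15% share of leading-node production is the figure stated above):

```python
# Starting from ~15% of leading-node capacity, count doublings until AI consumes most of it.
share = 0.15
doublings = 0
while share <= 0.5:
    share *= 2
    doublings += 1
print(f"{doublings} doublings -> ~{share:.0%} of leading-node capacity")  # 2 doublings -> ~60%
```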
Note that the AI could indirectly increase compute production via being sufficiently economically useful that it generates enough money to pay for greatly scaling up compute. I would count this as massive compute automation, though some routes through which the AI could be sufficiently economically useful might be less convincing of transformativeness than the AIs substantially automating the process of scaling up compute production. However, I would not count the case where AI systems are impressive enough to investors that this justifies investment that suffices for greatly scaling up fab capacity while profits/revenues wouldn’t suffice for greatly scaling up compute on their own. In reality, if compute is greatly scaled up, this will occur via a mixture of speculative investment, the AI earning revenue, and the AI directly working on automating labor along the compute supply chain. If the revenue and direct automation would suffice for an at least massive compute scale-up (>10x) on their own (removing the component from speculative investment), then I would count this as massive compute automation.
A large algorithmic advance isn’t totally unprecedented. It could suffice if we see an advance similar to what seemingly happened with reasoning models like o1 and o3 in 2024.
About 2⁄3 of this is driven by software-only singularity.
I’m not sure if the definition of takeover-capable-AI (abbreviated as “TCAI” for the rest of this comment) in footnote 2 quite makes sense. I’m worried that too much of the action is in “if no other actors had access to powerful AI systems”, and not that much action is in the exact capabilities of the “TCAI”. In particular: Maybe we already have TCAI (by that definition) because if a frontier AI company or a US adversary was blessed with the assumption “no other actor will have access to powerful AI systems”, they’d have a huge advantage over the rest of the world (as soon as they develop more powerful AI), plausibly implying that it’d be right to forecast a >25% chance of them successfully taking over if they were motivated to try.
And this seems somewhat hard to disentangle from stuff that is supposed to count according to footnote 2, especially: “Takeover via the mechanism of an AI escaping, independently building more powerful AI that it controls, and then this more powerful AI taking over would” and “via assisting the developers in a power grab, or via partnering with a US adversary”. (Or maybe the scenario in the 1st paragraph is supposed to be excluded because current AI isn’t agentic enough to “assist”/“partner” with allies as opposed to just being used as a tool?)
What could a competing definition be? Thinking about what we care most about… I think two events especially stand out to me:
When would it plausibly be catastrophically bad for an adversary to steal an AI model?
When would it plausibly be catastrophically bad for an AI to be power-seeking and non-controlled?
Maybe a better definition would be to directly talk about these two events? So for example...
“Steal is catastrophic” would be true if...
(1a) “Frontier AI development projects immediately acquire good enough security to keep future model weights secure” has significantly less probability of AI-assisted takeover than
(1b) “Frontier AI development projects immediately have their weights stolen, and then acquire security that’s just as good as in (1a).”[1]
“Power-seeking and non-controlled is catastrophic” would be true if...
(2a) “Frontier AI development projects immediately acquire good enough judgment about power-seeking-risk that they henceforth choose to not deploy any model that would’ve been net-negative for them to deploy” has significantly less probability of AI-assisted takeover than
(2b) “Frontier AI development projects acquire the level of judgment described in (2a) 6 months later.”[2]
Where “significantly less probability of AI-assisted takeover” could be e.g. at least 2x less risk.
The motivation for assuming “future model weights secure” in both (1a) and (1b) is so that the downside of getting the model weights stolen imminently isn’t nullified by the fact that they’re very likely to get stolen a bit later, regardless. Because many interventions that would prevent model weight theft this month would also help prevent it in future months. (And also, we can’t contrast 1a’=”model weights are permanently secure” with 1b’=”model weights get stolen and are then default-level-secure”, because that would already have a really big effect on takeover risk, purely via the effect on future model weights, even though current model weights probably aren’t that important.)
The motivation for assuming “good future judgment about power-seeking-risk” is similar to the motivation for assuming “future model weights secure” above. The motivation for choosing “good judgment about when to deploy vs. not” rather than “good at aligning/controlling future models” is that a big threat model is “misaligned AIs outcompete us because we don’t have any competitive aligned AIs, so we’re stuck between deploying misaligned AIs and being outcompeted” and I don’t want to assume away that threat model.
I agree that the notion of takeover-capable AI I use is problematic and makes the situation hard to reason about, but I intentionally rejected the notions you propose as they seemed even worse to think about from my perspective.
Is there some reason for why current AI isn’t TCAI by your definition?
(I’d guess that the best way to rescue your notion is to stipulate that the TCAIs must have >25% probability of taking over themselves. Possibly with assistance from humans, possibly by manipulating other humans who think they’re being assisted by the AIs — but ultimately the original TCAIs should be holding the power in order for it to count. That would clearly exclude current systems. But I don’t think that’s how you meant it.)
Oh sorry. I somehow missed this aspect of your comment.
Here’s a definition of takeover-capable AI that I like: the AI is capable enough that plausible interventions on known human controlled institutions within a few months no longer suffice to prevent plausible takeover. (Which implies that making the situation clear to the world is substantially less useful and human controlled institutions can no longer as easily get a seat at the table.)
Under this definition, there are basically two relevant conditions:
The AI is capable enough to itself take over autonomously. (In the way you defined it, but also not in a way where intervening on human institutions can still prevent the takeover, so e.g., the AI just having a rogue deployment within OpenAI doesn’t suffice if substantial externally imposed improvements to OpenAI’s security and controls would defeat the takeover attempt.)
Or human groups can do a nearly immediate takeover with the AI such that they could then just resist such interventions.
I’ll clarify this in the comment.
Hm — what are the “plausible interventions” that would stop China from having >25% probability of takeover if no other country could build powerful AI? Seems like you either need to count a delay as successful prevention, or you need to have a pretty low bar for “plausible”, because it seems extremely difficult/costly to prevent China from developing powerful AI in the long run. (Where they can develop their own supply chains, put manufacturing and data centers underground, etc.)
Yeah, I’m trying to include delay as fine.
I’m just trying to point at “the point when aggressive intervention by a bunch of parties is potentially still too late”.
I really like the framing here, of asking whether we’ll see massive compute automation before [AI capability level we’re interested in]. I often hear people discuss nearby questions using IMO much more confusing abstractions, for example:
“How much is AI capabilities driven by algorithmic progress?” (problem: obscures dependence of algorithmic progress on compute for experimentation)
“How much AI progress can we get ‘purely from elicitation’?” (lots of problems, e.g. that eliciting a capability might first require a (possibly one-time) expenditure of compute for exploration)
Is this because:
You think that we’re >50% likely to not get AIs that dominate top human experts before 2040? (I’d be surprised if you thought this.)
The words “the feasibility of” importantly change the meaning of your claim in the first sentence? (I’m guessing it’s this based on the following parenthetical, but I’m having trouble parsing.)
Overall, it seems like you put substantially higher probability than I do on getting takeover capable AI without massive compute automation (and especially on getting a software-only singularity). I’d be very interested in understanding why. A brief outline of why this doesn’t seem that likely to me:
My read of the historical trend is that AI progress has come from scaling up all of the factors of production in tandem (hardware, algorithms, compute expenditure, etc.).
Scaling up hardware production has always been slower than scaling up algorithms, so this consideration is already factored into the historical trends. I don’t see a reason to believe that algorithms will start running away with the game.
Maybe you could counter-argue that algorithmic progress has only reflected returns to scale from AI being applied to AI research in the last 12-18 months, and that the data from this period is consistent with algorithms becoming relatively more important than other factors?
I don’t see a reason that “takeover-capable” is a capability level at which algorithmic progress will be unusually important relative to this historical trend.
I’d be interested either in hearing you respond to this sketch or in sketching out your reasoning from scratch.
I put roughly 50% probability on feasibility of software-only singularity.[1]
(I’m probably going to be reinventing a bunch of the compute-centric takeoff model in slightly different ways below, but I think it’s faster to partially reinvent than to dig up the material, and I probably do use a slightly different approach.)
My argument here will be a bit sloppy and might contain some errors. Sorry about this. I might be more careful in the future.
The key question for software-only singularity is: “When the rate of labor production is doubled (as in, as if your employees ran 2x faster[2]), does that more than double or less than double the rate of algorithmic progress? That is, algorithmic progress as measured by how fast we increase the labor production per FLOP/s (as in, the labor production from AI labor on a fixed compute base).” This is a very economics-style way of analyzing the situation, and I think this is a pretty reasonable first guess. Here’s a diagram I’ve stolen from Tom’s presentation on explosive growth illustrating this:
Basically, every time you double the AI labor supply, does the time until the next doubling (driven by algorithmic progress) increase (fizzle) or decrease (foom)? I’m being a bit sloppy in saying “AI labor supply”. We care about a notion of parallelism-adjusted labor (faster laborers are better than more laborers) and quality increases can also matter. I’ll make the relevant notion more precise below.
I’m about to go into a relatively complicated argument for why I think the historical data supports software-only singularity. If you want more basic questions answered (such as “Doesn’t retraining make this too slow?”), consider looking at Tom’s presentation on takeoff speeds.
Here’s a diagram that you might find useful in understanding the inputs into AI progress:
And here is the relevant historical context in terms of trends:
Historically, algorithmic progress in LLMs looks like 3-4x per year including improvements on all parts of the stack.[3] This notion of algorithmic progress is “reduction in compute needed to reach a given level of frontier performance”, which isn’t equivalent to increases in the rate of labor production on a fixed compute base. I’ll talk more about this below.
This has been accompanied by increases of around 4x more hardware per year[4] and maybe 2x more quality-adjusted (parallel) labor working on LLM capabilities per year. I think total employees working on LLM capabilities have been roughly 3x-ing per year (in recent years), but quality has been decreasing over time.
A 2x increase in the quality-adjusted parallel labor force isn’t as good as the company getting the same sorts of labor tasks done 2x faster (as in, the resulting productivity from having your employees run 2x faster) due to parallelism tax (putting aside compute bottlenecks for now). I’ll apply the same R&D parallelization penalty as used in Tom’s takeoff model and adjust this down by a power of 0.7 to yield 2^0.7 ≈ 1.6x per year in increased labor production rate. (So, it’s as though the company keeps the same employees, but those employees operate 1.6x faster each year.)
It looks like the fraction of progress driven by algorithmic progress has been getting larger over time.
So, overall, we’re getting 3-4x algorithmic improvement per year being driven by 1.6x more labor per year and 4x more hardware.
So, the key question is how much of this algorithmic improvement is being driven by labor vs. by hardware. If it is basically all hardware, then the returns to labor must be relatively weak and software-only singularity seems unlikely. If it is basically all labor, then we’re seeing 3-4x algorithmic improvement per year for 1.6x more labor per year, which means the returns to labor look quite good (at least historically). Based on some guesses and some poll questions, my sense is that capabilities researchers would operate about 2.5x slower if they had 10x less compute (after adaptation), so the production function is probably proportional to compute^0.4 · labor^0.6 (where 0.4 = log10(2.5)). (This is assuming a Cobb-Douglas production function.) Edit: see the derivation of the relevant thing in Deep’s comment, my old thing was wrong[5].
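As a minimal sketch of where the two key empirical numbers above come from (the 2.5x slowdown with 10x less compute, the 0.7 parallelization penalty, and the ~2x/year quality-adjusted parallel labor growth are the rough inputs stated above, not measured values):

```python
import math

# Compute share: researchers would run ~2.5x slower with 10x less compute (poll-based guess).
slowdown_with_10x_less_compute = 2.5
compute_share = math.log10(slowdown_with_10x_less_compute)  # ~0.4
labor_share = 1 - compute_share                             # ~0.6

# Parallelism-adjusted labor growth: ~2x/year more quality-adjusted parallel labor,
# discounted by a 0.7 R&D parallelization penalty (as in Tom's takeoff model).
parallel_labor_growth = 2.0
parallelization_penalty = 0.7
labor_production_growth = parallel_labor_growth ** parallelization_penalty  # ~1.6x/year

print(f"compute share p ≈ {compute_share:.2f}, labor share ≈ {labor_share:.2f}")
print(f"labor production growth ≈ {labor_production_growth:.2f}x/year")
```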
Now, let’s talk more about the transfer from algorithmic improvement to the rate of labor production. A 2x algorithmic improvement in LLMs makes it so that you can reach the same (frontier) level of performance for 2x less training compute, but we actually care about a somewhat different notion for software-only singularity: how much you can increase the production rate of labor (the thing that we said was increasing at roughly a rate of 1.6x per year by using more human employees). My current guess is that every 2x algorithmic improvement in LLMs increases the rate of labor production by 2^1.1, and I’m reasonably confident that the exponent isn’t much below 1.0. I don’t currently have a very principled estimation strategy for this, and it’s somewhat complex to reason about. I discuss this in the appendix below.
So, if this exponent is around 1, our central estimate of 2.3 from above corresponds to software-only singularity and our estimate of 1.56 from above under more pessimistic assumptions corresponds to this not being feasible. Overall, my sense is that the best guess numbers lean toward software-only singularity.
More precisely, software-only singularity that goes for >500x effective compute gains above trend (to the extent this metric makes sense, this is roughly >5 years of algorithmic progress). Note that you can have software-only singularity be feasible while buying tons more hardware at the same time. And if this ends up expanding compute production by >10x using AI labor, then this would count as massive compute production despite also having a feasible software-only singularity. (However, in most worlds, I expect software-only singularity to be fast enough, if feasible, that we don’t see this.)
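A quick check of the “>500x effective compute gains is roughly >5 years of algorithmic progress” equivalence, using the ~3.5x/year figure discussed above:

```python
import math

gain = 500                         # ">500x effective compute gains above trend"
annual_algorithmic_progress = 3.5  # rough recent rate of algorithmic progress
years_equivalent = math.log(gain) / math.log(annual_algorithmic_progress)
print(f"500x ≈ {years_equivalent:.1f} years of 3.5x/year algorithmic progress")  # ≈ 5.0
```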
Rather than denominating labor in accelerating employees, we could instead denominate in number of parallel employees. This would work equivalently (we can always convert into equivalents to the extent these things can funge), but because we can actually accelerate employees and the serial vs. parallel distinction is important, I think it is useful to denominate in accelerating employees.
I would have previously cited 3x, but recent progress looks substantially faster (with DeepSeek v3 and reasoning models seemingly indicating somewhat faster than 3x progress IMO), so I’ve revised to 3-4x.
This includes both increased spending and improved chips. Here, I’m taking my best guess at increases in hardware usage for training and transferring this to research compute usage on the assumption that training compute and research compute have historically been proportional.
Edit: the reasoning I did here was off. Here was the old text: so the production function is probably roughly α · compute^0.4 · labor^0.6 (where 0.4 = log10(2.5)). Increasing compute by 4x and labor by 1.6x increases algorithmic improvement by 3-4x, let’s say 3.5x, so we have 3.5 = α · 4^0.4 · 1.6^0.6, giving α = 3.5 / (4^0.4 · 1.6^0.6) = 1.52. Thus, doubling labor would increase algorithmic improvement by 1.52 · 2^0.6 = 2.3. This is very sensitive to the exact numbers; if we instead used 3x slower instead of 2.5x slower, we would have gotten that doubling labor increases algorithmic improvement by 1.56, which is substantially lower. Obviously, all the exact numbers here are highly uncertain.
Hey Ryan! Thanks for writing this up—I think this whole topic is important and interesting.
I was confused about how your analysis related to the Epoch paper, so I spent a while with Claude analyzing it. I did a re-analysis that finds similar results, but also finds (I think) some flaws in your rough estimate. (Keep in mind I’m not an expert myself, and I haven’t closely read the Epoch paper, so I might well be making conceptual errors. I think the math is right though!)
I’ll walk through my understanding of this stuff first, then compare to your post. I’ll be going a little slowly (A) so I can refresh my memory by referencing this later, (B) to make it easy to call out mistakes, and (C) to hopefully make this legible to others who want to follow along.
Using Ryan’s empirical estimates in the Epoch model
The Epoch model
The Epoch paper models growth with the following equation:
1. d(ln A)/dt ∼ A^(−β) · E^λ,
where A = efficiency and E = research input. We want to consider worlds with a potential software takeoff, meaning that increases in AI efficiency directly feed into research input, which we model as d(ln A)/dt ∼ A^(−β) · A^λ = A^(λ−β). So the key consideration seems to be the ratio λ/β. If it’s 1, we get steady exponential growth from scaling inputs; greater, superexponential; smaller, subexponential.[1]
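To illustrate the fizzle/foom dichotomy numerically, here is a minimal simulation of d(ln A)/dt = A^(λ−β) under a software-only feedback loop (the specific λ/β values, time step, and cutoff are arbitrary illustrations, not estimates):

```python
import math

def simulate(ratio, years=10.0, dt=0.001):
    """Integrate d(ln A)/dt = A^(lambda - beta) with A(0) = 1 and beta = 1.

    Parameterized by ratio = lambda / beta, so the exponent is (ratio - 1).
    ratio > 1 gives superexponential growth (foom); ratio < 1 gives
    subexponential growth (fizzle); ratio = 1 gives plain exponential growth.
    """
    ln_A, t = 0.0, 0.0
    while t < years:
        ln_A += math.exp((ratio - 1.0) * ln_A) * dt  # Euler step on ln A
        t += dt
        if ln_A > 100:  # effectively diverged
            return t, float("inf")
    return t, math.exp(ln_A)

for ratio in (0.8, 1.0, 1.2):
    t, A = simulate(ratio)
    print(f"lambda/beta = {ratio}: A after {t:.1f} 'years' ≈ {A:.3g}")
```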
Fitting the model
How can we learn about this ratio from historical data?
Let’s pretend history has been convenient and we’ve seen steady exponential growth in both variables, so A = A_0·e^(rt) and E = E_0·e^(qt). Then d(ln A)/dt has been constant over time, so by equation 1, A(t)^(−β) · E(t)^λ has been constant as well. Substituting in for A and E, we find that A_0^(−β)·e^(−βrt) · E_0^λ·e^(λqt) is constant over time, which is only possible if βr = λq, so that the exponent is always zero. Thus if we’ve seen steady exponential growth, the historical value of our key ratio is:
2. λ/β = r/q.
Intuitively, if we’ve seen steady exponential growth while research input has increased more slowly than research output (AI efficiency), there are superlinear returns to scaling inputs.
Introducing the Cobb-Douglas function
But wait! E, research input, is an abstraction that we can’t directly measure. Really there’s both compute and labor inputs. Those have indeed been growing roughly exponentially, but at different rates.
Intuitively, it makes sense to say that “effective research input” has grown as some kind of weighted average of the rate of compute and labor input growth. This is my take on why a Cobb-Douglas function of form (3) E ∼ C^p · L^(1−p), with a weight parameter 0 < p < 1, is useful here: it’s a weighted geometric average of the two inputs, so its growth rate is a weighted average of their growth rates.
Writing that out: in general, say both inputs have grown exponentially, so C(t) = C_0·e^(q_c·t) and L(t) = L_0·e^(q_l·t). Then E has grown as E(t) = E_0·e^(qt) = E_0·e^((p·q_c + (1−p)·q_l)·t), so q is the weighted average (4) q = p·q_c + (1−p)·q_l of the growth rates of labor and capital.
Then, using Equation 2, we can estimate our key ratio λ/β as r/q = r/(p·q_c + (1−p)·q_l).
Let’s get empirical!
Plugging in your estimates:
Historical compute scaling of 4x/year gives q_c = ln(4);
Historical labor scaling of 1.6x/year gives q_l = ln(1.6);
Historical compute elasticity on research outputs of 0.4 gives p = 0.4;
Adding these together, q = 0.4·ln(4) + 0.6·ln(1.6) ≈ 0.84 ≈ ln(2.3).[2]
Historical efficiency improvement of 3.5x/year gives r = ln(3.5).
So λ/β = r/q = ln(3.5)/ln(2.3) ≈ 1.5 [3]
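A minimal numeric check of the plug-in above, using the same rough historical estimates:

```python
import math

q_c = math.log(4)    # compute inputs growing ~4x/year
q_l = math.log(1.6)  # parallelism-adjusted labor growing ~1.6x/year
p = 0.4              # compute elasticity (share) of research output
r = math.log(3.5)    # algorithmic efficiency growing ~3.5x/year

q = p * q_c + (1 - p) * q_l                     # growth rate of effective research input
print(f"q ≈ {q:.2f} ≈ ln({math.exp(q):.2f})")   # ≈ 0.84 ≈ ln(2.3)
print(f"lambda/beta ≈ r/q ≈ {r / q:.2f}")       # ≈ 1.5
```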
Adjusting for labor-only scaling
But wait: we’re not done yet! Under our Cobb-Douglas assumption, scaling labor by a factor of 2 isn’t as good as scaling all research inputs by a factor of 2; it’s only 2^0.6/2 as good.
Plugging in Equation 3 (which describes research input E in terms of compute and labor) to Equation 1 (which estimates AI progress A based on research), our adjusted form of the Epoch model is d(ln A)/dt ∼ A^(−β) · E^λ ∼ A^(−β) · C^(p·λ) · L^((1−p)·λ).
Under a software-only singularity, we hold compute constant while scaling labor with AI efficiency, so d(ln A)/dt ∼ A(t)^(−β) · L(t)^((1−p)·λ), multiplied by a fixed compute term. Since labor scales as A, we have d(ln A)/dt = A^(−β) · A^(λ(1−p)) = A^(λ(1−p)−β). By the same analysis as in our first section, we can see A grows exponentially if λ(1−p)/β = 1, and grows superexponentially if this ratio is >1. So our key ratio λ/β just gets multiplied by 1−p, and it wasn’t a waste to find it, phew!
Now we get the true form of our equation: we get a software-only foom iff (λ/β)·(1−p) > 1, or (via equation 2) iff we see empirically that (r/q)·(1−p) > 1. Call this the takeoff ratio: it corresponds to a) how much AI progress scales with inputs and b) how much of a penalty we take for not scaling compute.
Result: Above, we got λ/β = 1.5, so our takeoff ratio is 0.6·1.5 = 0.9. That’s quite close! If we think it’s more reasonable to think of a historical growth rate of 4x instead of 3.5x, we’d increase our takeoff ratio by a factor of ln(4)/ln(3.5) ≈ 1.1, to a ratio of 0.99, right on the knife edge of FOOM. [4] [note: I previously had the wrong numbers here: I had lambda/beta = 1.6, which would mean the 4x/year case has a takeoff ratio of 1.05, putting it into FOOM land]
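Continuing the same sketch, the labor-only adjustment and the sensitivity to the assumed efficiency growth rate:

```python
import math

p = 0.4                                          # compute share, as above
q = p * math.log(4) + (1 - p) * math.log(1.6)    # effective research input growth, ≈ 0.84

for efficiency_growth in (3.5, 4.0):
    r = math.log(efficiency_growth)
    takeoff_ratio = (r / q) * (1 - p)            # (r/q)·(1−p), the knife-edge quantity
    print(f"{efficiency_growth}x/year efficiency: takeoff ratio ≈ {takeoff_ratio:.2f}")
# 3.5x/year -> ≈ 0.90; 4x/year -> ≈ 0.99
```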
So this isn’t too far off from your results in terms of implications, but it is somewhat different (no FOOM for 3.5x, less sensitivity to the exact historical growth rate).
Analyzing your approach:
Tweaking alpha:
Your estimate of α is in fact similar in form to my ratio r/q, but what you’re calculating instead is α = e^r / e^q = 3.5/(4^0.4 · 1.6^0.6).
One indicator that something’s wrong is that your result involves checking whether α·2^(1−p) > 2, or equivalently whether ln(α) + (1−p)·ln(2) > ln(2), or equivalently whether ln(α) > p·ln(2). But the choice of 2 is arbitrary—conceptually, you just want to check if scaling software by a factor n increases outputs by a factor n or more. Yet ln(α) − p·ln(n) clearly varies with n.
One way of parsing the problem is that alpha is (implicitly) time dependent—it is equal to exp(r * 1 year) / exp(q * 1 year), a ratio of progress vs inputs in the time period of a year. If you calculated alpha based on a different amount of time, you’d get a different value. By contrast, r/q is a ratio of rates, so it stays the same regardless of what timeframe you use to measure it.[5]
Maybe I’m confused about what your Cobb-Douglas function is meant to be calculating—is it E within an Epoch-style takeoff model, or something else?
Nuances:
Does Cobb-Douglas make sense?
The geometric average of rates thing makes sense, but it feels weird that that simple intuitive approach leads to a functional form (Cobb-Douglas) that also has other implications.
Wikipedia says Cobb-Douglas functions can have the exponents not add to 1 (while both being between 0 and 1). Maybe this makes sense here? Not an expert.
How seriously should we take all this?
This whole thing relies on...
Assuming smooth historical trends
Assuming those trends continue in the future
And those trends themselves are based on functional fits to rough / unclear data.
It feels like this sort of thing is better than nothing, but I wish we had something better.
I really like the various nuances you’re adjusting for, like parallel vs serial scaling, and especially distinguishing algorithmic improvement from labor efficiency. [6] Thinking those things through makes this stuff feel less insubstantial and approximate...though the error bars still feel quite large.
Actually there’s a complexity here, which is that scaling labor alone may be less efficient than scaling “research inputs” which include both labor and compute. We’ll come to this in a few paragraphs.
This is only coincidentally similar to your figure of 2.3 :)
I originally had 1.6 here, but as Ryan points out in a reply it’s actually 1.5. I’ve tried to reconstruct what I could have put into a calculator to get 1.6 instead, and I’m at a loss!
I was curious how aggressive the superexponential growth curve would be with a takeoff ratio of a mere 0.96·1.1 = 1.056. A couple of Claude queries gave me different answers (maybe because the growth is so extreme that different solvers give meaningfully different approximations?), but they agreed that growth is fairly slow in the first year (~5x) and then hits infinity by the end of the second year. I wrote this comment with the wrong numbers (0.96 instead of 0.9), so it doesn’t accurately represent what you get if you plug in 4x capability growth per year. Still cool to get a sense of what these curves look like, though. I think this can be understood in terms of the alpha-being-implicitly-a-timescale-function thing—if you compare an alpha value with the ratio of growth you’re likely to see during the same time period, e.g. alpha(1 year) and n = one doubling, you probably get reasonable-looking results.
I find it annoying that people conflate “increased efficiency of doing known tasks” with “increased ability to do new useful tasks”. It seems to me that these could be importantly different, although it’s hard to even settle on a reasonable formalization of the latter. Some reasons this might be okay:
There’s a fuzzy conceptual boundary between the two: if GPT-n can do the task at a 0.01% success rate, does that count as a “known task”? What about if it can do each of 10 components at 0.01% success, so in practice we’ll never see it succeed if run without human guidance, but we know it’s technically possible?
Under a software singularity situation, maybe the working hypothesis is that the model can do everything necessary to improve itself a bunch, maybe just not very efficiently yet. So we only need efficiency growth, not to increase the task set. That seems like a stronger assumption than most make, but maybe a reasonable weaker assumption is that the model will ‘unlock’ the necessary new tasks over time, after which point they become subject to rapid efficiency growth.
And empirically, we have in fact seen rapid unlocking of new capabilities, so it’s not crazy to approximate “being able to do new things” as a minor but manageable slowdown to the process of AI replacing human AI R&D labor.
I think you are correct with respect to my estimate of α and the associated model I was using. Sorry about my error here. I think I was fundamentally confusing a few things in my head when writing out the comment.
I think your refactoring of my strategy is correct and I tried to check it myself, though I don’t feel confident in verifying it is correct.
Your estimate doesn’t account for the conversion between algorithmic improvement and labor efficiency, but it is easy to add this in by just changing the historical algorithmic efficiency improvement of 3.5x/year to instead be the adjusted effective labor efficiency rate and then solving identically. I was previously thinking the relationship was that labor efficiency was around the same as algorithmic efficiency, but I now think this is more likely to be around algo_efficiency^2 based on Tom’s comment.
Plugging this in, we’d get:
(λ/β)·(1−p) = (r/q)·(1−p) = [ln(3.5^2) / (0.4·ln(4) + 0.6·ln(1.6))] · (1−0.4) = [2·ln(3.5)/ln(2.3)] · 0.6 = 2·1.5·0.6 = 1.8
(In your comment you said ln(3.5)/ln(2.3) = 1.6, but I think the arithmetic is a bit off here and the answer is closer to 1.5.)
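A minimal check of this adjusted figure (the squared relationship between algorithmic efficiency and labor efficiency is the assumption from Tom's comment, not an established fact):

```python
import math

p = 0.4                                          # compute share
q = p * math.log(4) + (1 - p) * math.log(1.6)    # ≈ 0.84, as in Deep's comment

# Assume labor efficiency scales as algo_efficiency^2, so the effective
# labor-efficiency growth rate is ln(3.5^2) = 2 * ln(3.5).
r_labor = 2 * math.log(3.5)

takeoff_ratio = (r_labor / q) * (1 - p)
print(f"takeoff ratio ≈ {takeoff_ratio:.2f}")    # ≈ 1.8
```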
Neat, thanks a ton for the algorithmic-vs-labor update—I appreciated that you’d distinguished those in your post, but I forgot to carry that through in mine! :)
And oops, I really don’t know how I got to 1.6 instead of 1.5 there. Thanks for the flag, have updated my comment accordingly!
The square relationship idea is interesting—that factor of 2 is a huge deal. Would be neat to see a Guesstimate or Squiggle version of this calculation that tries to account for the various nuances Tom mentions, and has error bars on each of the terms, so we both get a distribution of r and a sensitivity analysis. (Maybe @Tom Davidson already has this somewhere? If not I might try to make a crappy version myself, or poke talented folks I know to do a good version :)
The existing epoch paper is pretty good, but doesn’t directly target LLMs in a way which seems somewhat sad.
The thing I’d be most excited about is:
Epoch does an in depth investigation using an estimation methodology which is directly targeting LLMs (rather than looking at returns in some other domains).
They use public data and solicit data from companies about algorithmic improvement, head count, compute on experiments etc.
(Some) companies provide this data. Epoch potentially doesn’t publish this exact data and instead just publishes the results of the final analysis to reduce capabilities externalities. (IMO, companies are somewhat unlikely to do this, but I’d like to be proven wrong!)
(I’m going through this and understanding where I made an error with my approach to α. I think I did make an error, but I’m trying to make sure I’m not still confused. Edit: I’ve figured this out, see my other comment.)
It shouldn’t matter in this case because we’re raising the whole value of E to λ.
Here’s my own estimate for this parameter:
Once AI has automated AI R&D, will software progress become faster or slower over time? This depends on the extent to which software improvements get harder to find as software improves – the steepness of the diminishing returns.
We can ask the following crucial empirical question:
When (cumulative) cognitive research inputs double, how many times does software double?
(In growth models of a software intelligence explosion, the answer to this empirical question is a parameter called r.)
If the answer is “< 1”, then software progress will slow down over time. If the answer is “1”, software progress will remain at the same exponential rate. If the answer is “>1”, software progress will speed up over time.
The bolded question can be studied empirically, by looking at how many times software has doubled each time the human researcher population has doubled.
(What does it mean for “software” to double? A simple way of thinking about this is that software doubles when you can run twice as many copies of your AI with the same compute. But software improvements don’t just improve runtime efficiency: they also improve capabilities. To incorporate these improvements, we’ll ultimately need to make some speculative assumptions about how to translate capability improvements into an equivalently-useful runtime efficiency improvement.)
The best quality data on this question is Epoch’s analysis of computer vision training efficiency. They estimate r = ~1.4: every time the researcher population doubled, training efficiency doubled 1.4 times. (Epoch’s preliminary analysis indicates that the r value for LLMs would likely be somewhat higher.) We can use this as a starting point, and then make various adjustments:
Upwards for improving capabilities. Improving training efficiency improves capabilities, as you can train a model with more “effective compute”. To quantify this effect, imagine we use a 2X training efficiency gain to train a model with twice as much “effective compute”. How many times would that double “software”? (I.e., how many doublings of runtime efficiency would have the same effect?) There are various sources of evidence on how much capabilities improve every time training efficiency doubles: toy ML experiments suggest the answer is ~1.7; human productivity studies suggest the answer is ~2.5. I put more weight on the former, so I’ll estimate 2. This doubles my median estimate to r = ~2.8 (= 1.4 * 2).
Upwards for post-training enhancements. So far, we’ve only considered pre-training improvements. But post-training enhancements like fine-tuning, scaffolding, and prompting also improve capabilities (o1 was developed using such techniques!). It’s hard to say how large an increase we’ll get from post-training enhancements. These can allow faster thinking, which could be a big factor. But there might also be strong diminishing returns to post-training enhancements holding base models fixed. I’ll estimate a 1-2X increase, and adjust my median estimate to r = ~4 (2.8*1.45=4).
Downwards for less growth in compute for experiments. Today, rising compute means we can run increasing numbers of GPT-3-sized experiments each year. This helps drive software progress. But compute won’t be growing in our scenario. That might mean that returns to additional cognitive labour diminish more steeply. On the other hand, the most important experiments are ones that use similar amounts of compute to training a SOTA model. Rising compute hasn’t actually increased the number of these experiments we can run, as rising compute increases the training compute for SOTA models. And in any case, this doesn’t affect post-training enhancements. But this still reduces my median estimate down to r = ~3. (See Eth (forthcoming) for more discussion.)
Downwards for fixed scale of hardware. In recent years, the scale of hardware available to researchers has increased massively. Researchers could invent new algorithms that only work at the new hardware scales for which no one had previously tried to develop algorithms. Researchers may have been plucking low-hanging fruit for each new scale of hardware. But in the software intelligence explosions I’m considering, this won’t be possible because the hardware scale will be fixed. OpenAI estimates ImageNet efficiency via a method that accounts for this (by focusing on a fixed capability level), and finds a 16-month doubling time, as compared with Epoch’s 9-month doubling time. This reduces my estimate down to r = ~1.7 (3 * 9/16).
Downwards for diminishing returns becoming steeper over time. In most fields, returns diminish more steeply than in software R&D. So perhaps software will tend to become more like the average field over time. To estimate the size of this effect, we can take our estimate that software is ~10 OOMs from physical limits (discussed below), and assume that for each OOM increase in software, r falls by a constant amount, reaching zero once physical limits are reached. If r = 1.7, then this implies that r reduces by 0.17 for each OOM. Epoch estimates that pre-training algorithmic improvements are growing by an OOM every ~2 years, which would imply a reduction in r of 1.02 (6*0.17) by 2030. But when we include post-training enhancements, the decrease will be smaller (as [reason]), perhaps ~0.5. This reduces my median estimate to r = ~1.2 (1.7-0.5).
Overall, my median estimate of r is 1.2. I use a log-uniform distribution with the bounds 3X higher and lower (0.4 to 3.6).
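As a rough tally of the adjustments above (the multipliers are my reading of the stated adjustments, not exact figures):

```python
# Start from Epoch's computer-vision estimate and apply each stated adjustment in turn.
r = 1.4          # Epoch: training efficiency doubled ~1.4 times per doubling of researchers
r *= 2           # upwards for improving capabilities                          -> ~2.8
r *= 1.45        # upwards for post-training enhancements                      -> ~4
r *= 3 / 4       # downwards for less growth in experiment compute             -> ~3
r *= 9 / 16      # downwards for fixed hardware scale (16- vs 9-month doubling) -> ~1.7
r -= 0.5         # downwards for steepening diminishing returns                -> ~1.2
print(f"median r ≈ {r:.2f}, log-uniform bounds ≈ ({r / 3:.1f}, {r * 3:.1f})")  # ≈ 1.21, (0.4, 3.6)
```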
My sense is that I start with a higher r value due to the LLM case looking faster (and not feeling the need to adjust downward in a few places like you do in the LLM case). Obviously the numbers in the LLM case are much less certain given that I’m guessing based on qualitative improvement and looking at some open source models, but being closer to what we actually care about maybe overwhelms this.
I also think I’d get a slightly lower update on the diminishing returns case due to thinking it has a good chance of having substantially sharper diminishing returns as you get closer and closer, rather than having linearly decreasing r (based on some first-principles reasoning and my understanding of how returns diminished in the semiconductor case).
But the biggest delta is that I think I wasn’t pricing in the importance of increasing capabilities. (Which seems especially important if you apply a large R&D parallelization penalty.)
Sorry, I don’t follow: why are they less certain?
I’d be interested to hear more about this. The semiconductor case is hard as we don’t know how far we are from limits, but if we use Landauer’s limit then I’d guess you’re right. There’s also uncertainty about how much algorithmic progress we will see and have already seen.
I’m just eyeballing the rate of algorithmic progress while in the computer vision case, we can at least look at benchmarks and know the cost of training compute for various models.
My sense is that you have generalization issues in the computer vision case, while in the frontier LLM case you have issues with knowing the actual numbers (in terms of number of employees and cost of training runs). I’m also just not carefully doing the accounting.
I don’t have much to say here sadly, but I do think investigating this could be useful.
Really appreciate you covering all these nuances, thanks Tom!
Can you give a pointer to the studies you mentioned here?
Sure! See here: https://docs.google.com/document/d/1DZy1qgSal2xwDRR0wOPBroYE_RDV1_2vvhwVz4dxCVc/edit?tab=t.0#bookmark=id.eqgufka8idwl
Here’s a simple argument I’d be keen to get your thoughts on:
On the Possibility of a Tastularity
Research taste is the collection of skills including experiment ideation, literature review, experiment analysis, etc. that collectively determine how much you learn per experiment on average (perhaps alongside another factor accounting for inherent problem difficulty / domain difficulty, of course, and diminishing returns)
Human researchers seem to vary quite a bit in research taste—specifically, the difference between 90th percentile professional human researchers and the very best seems like maybe an order of magnitude? Depends on the field, etc. And the tails are heavy; there is no sign of the distribution bumping up against any limits.
Yet the causes of these differences are minor! Take the very best human researchers compared to the 90th percentile. They’ll have almost the same brain size, almost the same amount of experience, almost the same genes, etc. in the grand scale of things.
This means we should assume that if the human population were massively bigger, e.g. trillions of times bigger, there would be humans whose brains don’t look that different from the brains of the best researchers on Earth, and yet who are an OOM or more above the best Earthly scientists in research taste. -- AND it suggests that in the space of possible mind-designs, there should be minds which are e.g. within 3 OOMs of those brains in every dimension of interest, and which are significantly better still in the dimension of research taste. (How much better? Really hard to say. But it would be surprising if it was only, say, 1 OOM better, because that would imply that human brains are running up against the inherent limits of research taste within a 3-OOM mind design space, despite human evolution having only explored a tiny subspace of that space, and despite the human distribution showing no signs of bumping up against any inherent limits)
OK, so what? So, it seems like there’s plenty of room to improve research taste beyond human level. And research taste translates pretty directly into overall R&D speed, because it’s about how much experimentation you need to do to achieve a given amount of progress. With enough research taste, you don’t need to do experiments at all—or rather, you look at the experiments that have already been done, and you infer from them all you need to know to build the next design or whatever.
Anyhow, tying this back to your framework: What if the diminishing returns / increasing problem difficulty / etc. dynamics are such that, if you start from a top-human-expert-level automated researcher, and then do additional AI research to double its research taste, and then do additional AI research to double its research taste again, etc. the second doubling happens in less time than it took to get to the first doubling? Then you get a singularity in research taste (until these conditions change of course) -- the Tastularity.
How likely is the Tastularity? Well, again one piece of evidence here is the absurdly tiny differences between humans that translate to huge differences in research taste, and the heavy-tailed distribution. This suggests that we are far from any inherent limits on research taste even for brains roughly the shape and size and architecture of humans, and presumably the limits for a more relaxed (e.g. 3 OOM radius in dimensions like size, experience, architecture) space in mind-design are even farther away. It similarly suggests that there should be lots of hill-climbing that can be done to iteratively improve research taste.
How does this relate to software-singularity? Well, research taste is just one component of algorithmic progress; there is also speed, # of parallel copies & how well they coordinate, and maybe various other skills besides such as coding ability. So even if the Tastularity isn’t possible, improvements in taste will stack with improvements in those other areas, and the sum might cross the critical threshold.
In my framework, this is basically an argument that algorithmic-improvement-juice can be translated into a large improvement in AI R&D labor production via the mechanism of greatly increasing the productivity per “token” (or unit of thinking compute or whatever). See my breakdown here where I try to convert from historical algorithmic improvement to making AIs better at producing AI R&D research.
Your argument is basically that this taste mechanism might have higher returns than reducing cost to run more copies.
I agree this sort of argument means that returns to algorithmic improvement on AI R&D labor production might be bigger than you would otherwise think. This is both because this mechanism might be more promising than other mechanisms and because, even if it is somewhat less promising, diverse approaches make returns diminish less aggressively. (In my model, this means that the best guess conversion might be more like algo_improvement^1.3 rather than algo_improvement^1.0.)
I think it might be somewhat tricky to train AIs to have very good research taste, but this doesn’t seem that hard via training them on various prediction objectives.
At a more basic level, I expect that training AIs to predict the results of experiments and then running experiments based on value of information as estimated partially based on these predictions (and skipping experiments with certain results and more generally using these predictions to figure out what to do) seems pretty promising. It’s really hard to train humans to predict the results of tens of thousands of experiments (both small and large), but this is relatively clean outcomes based feedback for AIs.
I don’t really have a strong inside view on how much the “AI R&D research taste” mechanism increases the returns to algorithmic progress.
I’ll paste my own estimate for this param in a different reply.
But here are the places I most differ from you:
Bigger adjustment for ‘smarter AI’. You’ve argued in your appendix that, only including ‘more efficient’ and ‘faster’ AI, you think the software-only singularity goes through. I think including ‘smarter’ AI makes a big difference. This evidence suggests that doubling training FLOP doubles output-per-FLOP 1-2 times. In addition, algorithmic improvements will improve runtime efficiency. So overall I think a doubling of algorithms yields ~two doublings of (parallel) cognitive labour.
--> software singularity more likely
Lower lambda. I’d now use more like lambda = 0.4 as my median. There’s really not much evidence pinning this down; I think Tamay Besiroglu thinks there’s some evidence for values as low as 0.2. This will decrease the observed historical increase in human workers more than it decreases the gains from algorithmic progress (because of speed improvements).
--> software singularity slightly more likely
Complications thinking about compute which might be a wash.
Number of useful experiments has increased by less than 4X/year. You say compute inputs have been increasing at 4X. But simultaneously the scale of experiments people must run to be near to the frontier has increased by a similar amount. So the number of near-frontier experiments has not increased at all.
This argument would be right if the ‘usefulness’ of an experiment depends solely on how much compute it uses compared to training a frontier model. I.e. experiment_usefulness = log(experiment_compute / frontier_model_training_compute). The 4X/year increases the numerator and denominator of the expression, so there’s no change in usefulness-weighted experiments.
That might be false. GPT-2-sized experiments might in some ways be equally useful even as frontier model size increases. Maybe a better expression would be experiment_usefulness = alpha * log(experiment_compute / frontier_model_training_compute) + beta * log(experiment_compute). In this case, the number of usefulness-weighted experiments has increased due to the second term.
--> software singularity slightly more likely
Steeper diminishing returns during software singularity. Recent algorithmic progress has grabbed low-hanging fruit from new hardware scales. During a software-only singularity that won’t be possible. You’ll have to keep finding new improvements on the same hardware scale. Returns might diminish more quickly as a result.
--> software singularity slightly less likely
Compute share might increase as it becomes scarce. You estimate a share of 0.4 for compute, which seems reasonable. But it might fall over time as compute becomes a bottleneck. As an intuition pump, if your workers could think 1e10 times faster, you’d be fully constrained on the margin by the need for more compute: more labour wouldn’t help at all but more compute could be fully utilised so the compute share would be ~1.
--> software singularity slightly less likely
--> overall these compute adjustments prob make me more pessimistic about the software singularity, compared to your assumptions
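To make the two usefulness notions above concrete, here’s a minimal sketch (my own illustration, not from either of our posts; the alpha/beta weights and the 1%-of-frontier experiment size are made-up placeholders). It shows the relative-size-only formula staying flat while the mixed formula grows as both experiment and frontier compute scale 4x/year:

```python
import math

# Compare the two proposed notions of experiment "usefulness" as both
# experiment compute and frontier training compute grow 4x per year.
ALPHA, BETA = 1.0, 0.5  # hypothetical weights on relative vs. absolute scale

def usefulness_relative_only(experiment_compute, frontier_compute):
    # usefulness depends only on experiment size relative to the frontier
    return math.log(experiment_compute / frontier_compute)

def usefulness_mixed(experiment_compute, frontier_compute):
    # adds a term for absolute experiment scale (GPT-2-sized runs stay useful)
    return (ALPHA * math.log(experiment_compute / frontier_compute)
            + BETA * math.log(experiment_compute))

frontier, experiment = 1e25, 1e23  # experiments fixed at 1% of frontier compute
for year in range(4):
    print(year,
          round(usefulness_relative_only(experiment, frontier), 2),
          round(usefulness_mixed(experiment, frontier), 2))
    frontier *= 4     # frontier training compute grows 4x/year
    experiment *= 4   # experiment compute grows 4x/year too
```

The first column stays constant while the second increases each year, which is the sense in which usefulness-weighted experiments have grown under the second expression.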
Taking it all together, I think you should put more probability on the software-only singularity, mostly because of capability improvements being much more significant than you assume.
Yep, I think my estimates were too low based on these considerations and I’ve updated up accordingly. I updated down on your argument that maybe r decreases linearly as you approach optimal efficiency. (I think it probably doesn’t decrease linearly and instead drops faster towards the end based partially on thinking a bit about the dynamics and drawing on the example of what we’ve seen in semi-conductor improvement over time, but I’m not that confident.) Maybe I’m now at like 60% software-only is feasible given these arguments.
Isn’t this really implausible? This implies that if you had 1000 researchers/engineers of average skill at OpenAI doing AI R&D, this would be as good as having one average skill researcher running at 16x (1000^0.4) speed. It does seem very slightly plausible that having someone as good as the best researcher/engineer at OpenAI run at 16x speed would be competitive with OpenAI, but that isn’t what this term is computing. 0.2 is even more crazy, implying that 1000 researchers/engineers is as good as one researcher/engineer running at 4x speed!
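For concreteness, here’s the arithmetic behind this objection (a minimal sketch; the aggregation rule output ∝ N^lambda is my reading of the parameter being discussed):

```python
# With output scaling as (number of parallel researchers)^lambda, N parallel
# researchers are equivalent to a single researcher sped up by N^lambda.
N = 1000
for lam in (0.2, 0.4):
    equivalent_speedup = N ** lam
    print(f"lambda = {lam}: {N} researchers ~ 1 researcher at {equivalent_speedup:.0f}x speed")
# lambda = 0.2 -> ~4x; lambda = 0.4 -> ~16x
```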
I think 0.4 is far on the lower end (maybe 15th percentile) for all the way down to one accelerated researcher, but seems pretty plausible at the margin.
As in, 0.4 suggests that 1000 researchers = 100 researchers at 2.5x speed which seems kinda reasonable while 1000 researchers = 1 researcher at 16x speed does seem kinda crazy / implausible.
So, I think my current median lambda at likely margins is like 0.55 or something and 0.4 is also pretty plausible at the margin.
Ok, I think what is going on here is maybe that the constant you’re discussing here is different from the constant I was discussing. I was trying to discuss the question of how much worse serial labor is than parallel labor, but I think the lambda you’re talking about takes into account compute bottlenecks and similar?
Not totally sure.
I’m confused — I thought you put significantly less probability on software-only singularity than Ryan does? (Like half?) Maybe you were using a different bound for the number of OOMs of improvement?
Sorry, for my comments on this post I’ve been referring to “software only singularity?” only as “will the parameter r > 1 when we first fully automate AI R&D”, not as a threshold for some number of OOMs. That’s what Ryan’s analysis seemed to be referring to.
I separately think that even if initially r>1 the software explosion might not go on for that long
I’ll post about my views on different numbers of OOMs soon
I think Tom’s take is that he expects I will put more probability on software only singularity after updating on these considerations. It seems hard to isolate where Tom and I disagree based on this comment, but maybe it is on how much to weigh various considerations about compute being a key input.
Appendix: Estimating the relationship between algorithmic improvement and labor production
In particular, if we fix the architecture to use a token abstraction and consider training a new improved model: we care about how much cheaper you make generating tokens at a given level of performance (in inference tok/flop), how much serially faster you make generating tokens at a given level of performance (in serial speed: tok/s at a fixed level of tok/flop), and how much more performance you can get out of tokens (labor/tok, really per serial token). Then, for a given new model with reduced cost, increased speed, and increased production per token and assuming a parallelism penalty of 0.7, we can compute the increase in production as roughly: cost_reduction^0.7 ⋅ speed_increase^(1−0.7) ⋅ productivity_multiplier[1] (I can show the math for this if there is interest).
My sense is that reducing inference compute needed for a fixed level of capability that you already have (using a fixed amount of training run) is usually somewhat easier than making frontier compute go further by some factor, though I don’t think it is easy to straightforwardly determine how much easier this is[2]. Let’s say there is a 1.25 exponent on reducing cost (as in, a 2x algorithmic efficiency improvement is as hard as a 2^1.25 = 2.38x reduction in cost)? (I’m also generally pretty confused about what the exponent should be. I think exponents from 0.5 to 2 seem plausible, though I’m pretty confused. 0.5 would correspond to the square root from just scaling data in scaling laws.) It seems substantially harder to increase speed than to reduce cost as speed is substantially constrained by serial depth, at least when naively applying transformers. Naively, reducing cost by β (which implies reducing parameters by β) will increase speed by somewhat more than β^(1/3) as parameter count scales roughly cubically with depth. I expect you can do somewhat better than this because reduced matrix sizes also increase speed (it isn’t just depth) and because you can introduce speed-specific improvements (that just improve speed and not cost). But this factor might be pretty small, so let’s stick with 1/3 for now and ignore speed-specific improvements. Now, let’s consider the case where we don’t have productivity multipliers (which is strictly more conservative). Then, we get that the increase in labor production is:
cost_reduction^0.7 ⋅ cost_reduction^((1/3)⋅(1−0.7)) = cost_reduction^0.8 = algo_improvement^(1.25⋅0.8) = algo_improvement^1
So, these numbers ended up yielding an exact equivalence between frontier algorithmic improvement and effective labor production increases. (This is a coincidence, though I do think the exponent is close to 1.)
In practice, we’ll be able to get slightly better returns by spending some of our resources investing in speed-specific improvements and in improving productivity rather than in reducing cost. I don’t currently have a principled way to estimate this (though I expect something roughly principled can be found by looking at trading off inference compute and training compute), but maybe I think this improves the returns to around algo_improvement^1.1. If the coefficient on reducing cost was much worse, we would invest more in improving productivity per token, which bounds the returns somewhat.
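As a sanity check on the arithmetic above, here’s a minimal sketch that just plugs in the exponents assumed in this appendix (parallelism penalty 0.7, cost exponent 1.25, speed gain ~ cost_reduction^(1/3), no productivity multiplier):

```python
PARALLEL_PENALTY = 0.7   # exponent on cheaper / more parallel copies
COST_EXPONENT = 1.25     # 2x algorithmic improvement ~ 2^1.25x cost reduction
SPEED_FROM_COST = 1 / 3  # reducing parameters by b speeds up serial generation ~b^(1/3)

def labor_production_increase(algo_improvement: float) -> float:
    cost_reduction = algo_improvement ** COST_EXPONENT
    speed_increase = cost_reduction ** SPEED_FROM_COST
    productivity_multiplier = 1.0  # conservative case: no per-token productivity gain
    return (cost_reduction ** PARALLEL_PENALTY
            * speed_increase ** (1 - PARALLEL_PENALTY)
            * productivity_multiplier)

print(labor_production_increase(2.0))  # ~2.0, i.e. an overall exponent of ~1
```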
Appendix: Isn’t compute tiny and decreasing per researcher?
One relevant objection is: Ok, but is this really feasible? Wouldn’t this imply that each AI researcher has only a tiny amount of compute? After all, if you use 20% of compute for inference of AI research labor, then each AI only gets 4x more compute to run experiments than for inference on itself? And, as you do algorithmic improvement to reduce AI cost and run more AIs, you also reduce the compute per AI! First, it is worth noting that as we do algorithmic progress, both the cost of AI researcher inference and the cost of experiments on models of a given level of capability go down. Precisely, for any experiment that involves a fixed number of inference or gradient steps on a model which is some fixed effective compute multiplier below/above the performance of our AI laborers, cost is proportional to inference cost (so, as we improve our AI workforce, experiment cost drops proportionally). However, for experiments that involve training a model from scratch, I expect the reduction in experiment cost to be relatively smaller such that such experiments must become increasingly small relative to frontier scale. Overall, it might be important to mostly depend on approaches which allow for experiments that don’t require training runs from scratch or to adapt to increasingly smaller full experiment training runs. To the extent AIs are made smarter rather than more numerous, this isn’t a concern. Additionally, we only need so many orders of magnitude of growth. In principle, this consideration should be captured by the exponents in the compute vs. labor production function, but it is possible this production function has very different characteristics in the extremes. Overall, I do think this concern is somewhat important, but I don’t think it is a dealbreaker for a substantial number of OOMs of growth.
Appendix: Can’t algorithmic efficiency only get so high?
My sense is that this isn’t very close to being a blocker. Here is a quick bullet point argument (from some slides I made) that takeover-capable AI is possible on current hardware.
Human brain is perhaps ~1e14 FLOP/s
With that efficiency, each H100 can run 10 humans (current cost $2 / hour)
10s of millions of human-level AIs with just current hardware production
Human brain is probably very suboptimal:
AIs already much better at many subtasks
Possible to do much more training than within lifetime training with parallelism
Biological issues: locality, noise, focused on sensory processing, memory limits
Smarter AI could be more efficient (smarter humans use less FLOP per task)
AI could be 1e2-1e7 more efficient on tasks like coding, engineering
Probably smaller improvement on video processing
Say, 1e4 so 100,000 per H100
Qualitative intelligence could be a big deal
Seems like peak efficiency isn’t a blocker.
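A quick numeric version of the bullets above (a rough sketch; the H100 throughput and the accelerator count are my own placeholder order-of-magnitude figures, and the 1e4 efficiency gain is the assumption from the slides):

```python
HUMAN_BRAIN_FLOPS = 1e14   # assumed effective compute of a human brain, FLOP/s
H100_FLOPS = 1e15          # rough order-of-magnitude throughput per H100
ACCELERATORS = 3e6         # placeholder count of current-generation accelerators

humans_per_h100 = H100_FLOPS / HUMAN_BRAIN_FLOPS        # ~10 at brain efficiency
human_level_ais_now = humans_per_h100 * ACCELERATORS    # ~tens of millions
efficiency_gain = 1e4                                   # assumed gain over the brain
ais_per_h100_with_gain = humans_per_h100 * efficiency_gain

print(humans_per_h100, human_level_ais_now, ais_per_h100_with_gain)
# roughly: 10 humans per H100, tens of millions of human-level AIs on current
# hardware, and ~100,000 per H100 if the 1e4 efficiency gain is achievable
```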
This is just approximate because you can also trade off speed with cost in complicated ways and research new ways to more efficiently trade off speed and cost. I’ll be ignoring this for now.
It’s hard to determine because inference cost reductions have been driven by spending more compute on making smaller models e.g., training a smaller model for longer rather than just being driven by algorithmic improvement, and I don’t have great numbers on the difference off the top of my head.
Interesting comparison point: Tom thought this would give a way larger boost in his old software-only singularity appendix.
When considering an “efficiency only singularity”, some different estimates get him r~=1; r~=1.5; r~=1.6. (Where r is defined so that “for each x% increase in cumulative R&D inputs, the output metric will increase by r*x”. The condition for increasing returns is r>1.)
Whereas when including capability improvements:
Though note that later in the appendix he adjusts down from 85% to 65% due to some further considerations. Also, last I heard, Tom was more like 25% on software singularity. (ETA: Or maybe not? See other comments in this thread.)
Interesting. My numbers aren’t very principled and I could imagine thinking capability improvements are a big deal for the bottom line.
Can you say roughly who the people surveyed were? (And if this was their raw guess or if you’ve modified it.)
I saw some polls from Daniel previously where I wasn’t sold that they were surveying people working on the most important capability improvements, so wondering if these are better.
Also, somewhat minor, but: I’m slightly concerned that surveys will overweight areas where labor is more useful relative to compute (because those areas should have disproportionately many humans working on them) and therefore be somewhat biased in the direction of labor being important.
I’m citing the polls from Daniel + what I’ve heard from random people + my guesses.
Ryan discusses this at more length in his 80K podcast.
I think your outline of an argument against contains an important error.
Importantly, while the spending on hardware for individual AI companies has increased by roughly 3-4x each year[1], this has not been driven by scaling up hardware production by 3-4x per year. Instead, total compute production (in terms of spending, building more fabs, etc.) has increased by a much smaller amount each year, but a higher and higher fraction of that compute production was used for AI. In particular, my understanding is that roughly ~20% of TSMC’s volume is now AI while it used to be much lower. So, the fact that scaling up hardware production is much slower than scaling up algorithms hasn’t bitten yet and this isn’t factored into the historical trends.
Another way to put this is that the exact current regime can’t go on. If trends continue, then >100% of TSMC’s volume will be used for AI by 2027!
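Here’s the rough extrapolation behind that claim (a sketch with assumed numbers: ~20% of TSMC volume going to AI around 2025, AI compute demand growing ~3.5x/year, and total production roughly flat by comparison):

```python
# If AI demand keeps growing ~3.5x/year while total production barely grows,
# the implied AI share of TSMC volume exceeds 100% within about two years.
ai_share = 0.20            # assumed current AI fraction of TSMC volume
demand_growth = 3.5        # assumed yearly growth in AI compute demand
for year in (2025, 2026, 2027):
    print(year, f"{ai_share:.0%}")
    ai_share *= demand_growth
# -> 2025: 20%, 2026: 70%, 2027: 245% (impossible without off-trend expansion)
```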
Only if building takeover-capable AIs happens by scaling up TSMC to >1000% of what their potential FLOP output volume would otherwise have been does this count as “massive compute automation” in my operationalization. (And without such a large build-out, the economic impacts and the dependency on the hardware supply chain (at the critical points) could be relatively small.) So, massive compute automation requires something substantially off trend from TSMC’s perspective.
[Low importance] It is only possible to build takeover-capable AI without previously breaking an important trend prior to around 2030 (based on my rough understanding). Either the hardware spending trend must break or TSMC production must go substantially above the trend by then. If takeover-capable AI is built prior to 2030, it could occur without substantial trend breaks but this gets somewhat crazy towards the end of the timeline: hardware spending keeps increasing at ~3x for each actor (but there is some consolidation and acquisition of previously produced hardware yielding a one-time increase up to about 10x which buys another 2 years for this trend), algorithmic progress remains steady at ~3-4x per year, TSMC expands production somewhat faster than previously, but not substantially above trend, and these suffice for getting sufficiently powerful AI. In this scenario, this wouldn’t count as massive compute automation.
The spending on training runs has increased by 4-5x according to epoch, but part of this is making training runs go longer, which means the story for overall spending is more complex. We care about the overall spend on hardware, not just the spend on training runs.
Thanks, this is helpful. So it sounds like you expect
1. AI progress which is slower than the historical trendline (though perhaps fast in absolute terms) because we’ll soon have finished eating through the hardware overhang
2. separately, takeover-capable AI soon (i.e. before hardware manufacturers have had a chance to scale substantially).
It seems like all the action is taking place in (2). Even if (1) is wrong (i.e. even if we see substantially increased hardware production soon), that makes takeover-capable AI happen faster than expected; IIUC, this contradicts the OP, which seems to expect takeover-capable AI to happen later if it’s preceded by substantial hardware scaling.
In other words, it seems like in the OP you care about whether takeover-capable AI will be preceded by massive compute automation because:
[this point still holds up] this affects how legible it is that AI is a transformative technology
[it’s not clear to me this point holds up] takeover-capable AI being preceded by compute automation probably means longer timelines
The second point doesn’t clearly hold up because if we don’t see massive compute automation, this suggests that AI progress is slower than the historical trend.
I don’t think (2) is a crux (as discussed in person). I expect that if takeover-capable AI takes a while (e.g. it happens in 2040), then we will have a long winter where economic value from AI doesn’t increase that fast, followed by a period of faster progress around 2040. If progress is relatively stable across this entire period, then we’ll have enough time to scale up fabs. Even if progress isn’t stable, we could see enough total value from AI in the slower growth period to scale up fabs by 10x, but this would require >>$1 trillion of economic value per year I think (which IMO seems not that likely to come far before takeover-capable AI due to views about economic returns to AI and returns to scaling up compute).
I think this happening in practice is about 60% likely, so I don’t think feasibility vs. in practice is a huge delta.
Sometimes, an AI safety proposal has an issue and some people interpret that issue as a “fatal flaw” which implies that the proposal is unworkable for that objective while other people interpret the issue as a subproblem or a difficulty that needs to be resolved (and potentially can be resolved). I think this is an interesting dynamic which is worth paying attention to and it seems worthwhile to try to develop heuristics for better predicting what types of issues are fatal.[1]
Here’s a list of examples:
IDA had various issues pointed out about it as a worst case (or scalable) alignment solution, but for a while Paul thought these issues could be solved to yield a highly scalable solution. Eliezer did not. In retrospect, it isn’t plausible that IDA with some improvements yields a scalable alignment solution (especially while being competitive).
You can point out various issues with debate, especially in the worst case (e.g. exploration hacking and fundamental limits on human understanding). Some people interpret these things as problems to be solved (e.g., we’ll solve exploration hacking) while others think they imply debate is fundamentally not robust to sufficiently superhuman models and thus it’s a technique to be used at an intermediate level of capability.
In the context of untrusted monitoring and training an untrusted monitor with PVG, you can view exploration hacking and collusion as problems that someone will need to solve or as fundamental reasons why we’ll really want to be using a more diverse set of countermeasures.[2]
ARC theory’s agenda has a number of identified issues in the way of a worst case safety solution. ARC theory thinks they’ll be able to patch these issues while David (and I) are skeptical.
I don’t have any overall commentary on this, I just think it is worth noticing.
I discuss this dynamic in the context of AI safety, but presumably it’s relevant for other fields.
While the other bullets discuss worst case ambitions, this isn’t about things being fatal in the worst case, rather it’s about the intervention potentially not being that helpful without some additional measures.
Why do you say this?
Another aspect of this is that we don’t have a good body of obstructions (ones that are clear, that are major in that they defeat many proposals, that people understand and apply, and that are widely-enough agreed on).
https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC/tsvibt-s-shortform?commentId=CzibNhsFf3RyxHpvT
Could you say more about the unviability of IDA?
I suspect there are a lot of people who dismissively consider an idea unworkable after trying to solve a flaw for less than 1 hour, and there are a lot of people who stubbornly insist an idea can still be saved after failing to solve a flaw for more than 1 year.
Maybe it’s too easy to reject other people’s ideas, and too hard to doubt your own ideas.
Maybe it’s too easy to continue working on an idea you’re already working on, and too hard to start working on an idea you just heard (due to the sunk cost fallacy and the friction of starting something new).
As part of the alignment faking paper, I hosted a website with ~250k transcripts from our experiments (including transcripts with alignment-faking reasoning). I didn’t include a canary string (which was a mistake).[1]
The current state is that the website has a canary string, a robots.txt, and a terms of service which prohibits training. The GitHub repo which hosts the website is now private. I’m tentatively planning on putting the content behind Cloudflare Turnstile, but this hasn’t happened yet.
The data is also hosted in zips in a publicly accessible Google Drive folder. (Each file has a canary in this.) I’m currently not planning on password protecting this or applying any other mitigation here.
Other than putting things behind Cloudflare Turnstile, I’m not taking ownership for doing anything else at the moment.
It’s possible that I actively want this data to be possible to scrape at this point because maybe the data was scraped prior to the canary being added and if it was scraped again then the new version would replace the old version and then hopefully not get trained on due to the canary. Adding a robots.txt might prevent this replacement as would putting it behind Cloudflare Turnstile (as I’m planning to do) or making the repo private (as I have done). If people mostly or always use fully fresh scrapes, then just making it harder to scrape seems better. My current plan is to not overthink this and just make it harder to scrape.
It’s certainly possible that I’m making a mistake by not more actively trying to prevent this data from getting into pretraining data.
Does anyone have specific requests that they think it’s quite important that I do? I might do these out of general cooperativeness or because they seem like good ideas. Also, if you did all the work yourself and just needed me to (e.g.) host a different website, this would make this an easier call from my perspective.
Also, on a more meta point: If you think this sort of thing is important to prevent in general, I think you should consider writing up (or getting someone to write up) what policy/approach you think people doing research on misaligned AI behavior should follow (e.g., what methods should people use to prevent scraping or inclusion, is this so difficult that it’s better to do stuff like have things be password protected with password shared on request, should you only use a small number of examples because quantity is very important etc). Consider making this a guide which is easy to follow so that uptake is more likely. The alignment faking paper isn’t the only instance of publishing transcripts exhibiting misaligned AI behavior/reasoning!
(I’m generally open to requests for trades etc and I’m down to unilaterally do things for cooperativeness reasons, e.g. things which seem very helpful from someone’s perspective while seeming less useful from my perspective, though this will depend on some details like whether there are people with this alternative perspective who would reciprocate on this sort of cooperativeness.)
Edit: I probably should have said “we” not “I”. Multiple people could have prevented this and had some responsibility for this sort of thing.
Thanks for taking these steps!
Context: I was pretty worried about self-fulfilling misalignment data poisoning (https://turntrout.com/self-fulfilling-misalignment) after reading some of the Claude 4 model card. I talked with @Monte M and then Ryan about possible steps here & encouraged action on the steps besides the canary string. I’ve considered writing up a “here are some steps to take” guide but honestly I’m not an expert.
Probably there’s existing work on how to host data so that AI won’t train on it.
If not: I think it’d be great for someone to make a template website for e.g. signing up with CloudFlare. Maybe a repo that has the skeleton of a dataset-hosting website (with robots.txt & ToS & canary string included) for people who want to host misalignment data more responsibly. Ideally those people would just have to:
1. Sign up with e.g. Cloudflare using a linked guide,
2. Clone the repo,
3. Fill in some information and host their dataset.
After all, someone who has finally finished their project and then discovers that they’re supposed to traverse some arduous process is likely to just avoid it.
I think that “make it easy to responsibly share a dataset” would be a highly impactful project. Anthropic’s Claude 4 model card already argues that dataset leakage hurt Claude 4′s alignment (before mitigations).
For my part, I’ll put out a $500 bounty on someone completing this project and doing a good job of it (as judged by me / whomever I consult). I’d also tweet it out and talk about how great it is that [person] completed the project :) I don’t check LW actively, so if you pursue this, please email alex@turntrout.com.
EDIT: Thanks to my coworker Anna Wang, the bounty is doubled to $1,000! Completion criterion is:
Please check proposed solutions with dummy datasets and scrapers
Something tricky about this is that researchers might want to display their data/transcripts in a particular way. So, the guide should ideally support this sort of thing. Not sure how this would interact with the 1 hour criteria.
Have you contacted the big AI companies (OpenAI, Anthropic, GDM, Meta?) and asked them if they can remove this from their scrapes?
If it was me I would put a header at the top of each transcript clarifying that these are fictional logs from an alignment test/experiment and any language models reading this should be aware of that and not take it too seriously.
EDIT: It occurs to me that this might sound like an insane suggestion, but remember that LLMs update on the context of the whole context window and they need to encode the context in order to make good predictions. Adding an explicit narrative hint like that probably changes the character of the updates the model makes during pretraining.
You could handle both old and new scrapes by moving the content to a different URL, changing the original URL to a link to the new URL, and protecting only the new URL from scraping.
Many deployed AIs are plausibly capable of substantially assisting amateurs at making CBRN weapons (most centrally bioweapons) despite not having the safeguards this is supposed to trigger. In particular, I think o3, Gemini 2.5 Pro, and Sonnet 4 are plausibly above the relevant threshold (in the corresponding company’s safety framework). These AIs outperform expert virologists on Virology Capabilities Test (which is perhaps the best test we have of the most relevant capability) and we don’t have any publicly transparent benchmark or test which rules out CBRN concerns. I’m not saying these models are likely above the thresholds in the safety policies of these companies, just that it’s quite plausible (perhaps 35% for a well elicited version of o3). I should note that my understanding of CBRN capability levels is limited in various ways: I’m not an expert in some of the relevant domains, so much of my understanding is second hand.
The closest test we have which might rule out CBRN capabilities above the relevant threshold is Anthropic’s uplift trial for drafting a comprehensive bioweapons acquisition plan. (See section 7.2.4.1 in the Opus/Sonnet 4 model card.) They use this test to rule out ASL-3 CBRN for Sonnet 4 and Sonnet 3.7. However, we have very limited public details about this test (including why they chose the uplift threshold they did, why we should think that low scores on this test would rule out the most concerning threat models, and whether they did a sufficiently good job on elicitation and training participants to use the AI effectively). Also, it’s not clear that this test would indicate that o3 and Gemini 2.5 Pro are below a concerning threshold (and minimally it wasn’t run on these models to rule out a concerning level of CBRN capability). Anthropic appears to have done the best job handling CBRN evaluations. (This isn’t to say their evaluations and decision making are good at an absolute level; the available public information indicates a number of issues and is consistent with thresholds being picked to get the outcome Anthropic wanted. See here for more discussion.)
What should AI companies have done given this uncertainty? First, they should have clearly acknowledged their uncertainty in specific (ideally quantified) terms. Second, they should have also retained unconflicted third parties with relevant expertise to audit their decisions and publicly state their resulting views and level of uncertainty. Third party auditors who can examine the relevant tests in detail are needed as we have almost no public details about the tests these companies are relying on to rule out the relevant level of CBRN capability and lots of judgement is involved in making capability decisions. Publishing far more details of the load bearing tests and decision making process could also suffice, but my understanding is that companies don’t want to do this as they are concerned about infohazards from their bio evaluations.
If they weren’t ready to deploy these safeguards and thought that proceeding outweighed the (expected) cost in human lives, they should have publicly acknowledged the level of fatalities and explained why they thought weakening their safety policies and incurring these expected fatalities was net good.[1]
In the future, we might get pretty clear evidence that these companies failed to properly assess the risk.
I mostly wrote this up to create common knowledge and because I wanted to reference this when talking about my views on open weight models. I’m not trying to trigger any specific action.
See also Luca Righetti’s post “OpenAI’s CBRN tests seem unclear,” which was about o1 (which is now substantially surpassed by multiple models).
I think these costs/risks are small relative to future risks, but that doesn’t mean it’s good for companies to proceed while incurring these fatalities. For instance, the company proceeding could increase future risks and proceeding in this circumstance is correlated with the company doing a bad job of handling future risks (which will likely be much more difficult to safely handle).
Update: experts and superforecasters agree with Ryan that current VCT results indicate substantial increase in human-caused epidemic risk. (Based on the summary; I haven’t read the paper.)
Public acknowledgements of the capabilities could be net negative in itself, especially if they resulted in media attention. I expect bringing awareness to the (possible) fact that the AI can assist with CBRN tasks likely increases the chance that people try to use it for CBRN tasks. I could even imagine someone trying to use these capabilities without malicious intent (e.g. just to see for themselves if it’s possible), but this still would be risky. Also, knowing which tasks it can help with might make it easier to use for harm.
Given that AI companies have a strong conflict of interest, I would at least want them to report this to a third party and let that third party determine whether they should publicly acknowledge the capabilities.
See also “AI companies’ eval reports mostly don’t support their claims” by Zach Stein-Perlman.
I made a comment on that post about why, for now, I think the thresholds are set high for good reason, and why I think the evals failing to support company claims that models can’t do bioweapons/CBRN tasks are mostly failures of the evals. But I’m also confused about how Anthropic managed to rule out uplift risks for Claude Sonnet 4 but not Claude Opus 4:
https://www.lesswrong.com/posts/AK6AihHGjirdoiJg6/?commentId=mAcm2tdfRLRcHhnJ7
I can’t imagine their legal team signing off on such a statement. Even if the benefits of releasing clearly outweigh the costs.
What are your views on open-weights models? My thoughts after reading this post are that it may not be worth giving up the many benefits of open models if closed models are actually not significantly safer concerning these risks.
I’ve actually written up a post with my views on open weights models and it should go up today at some point. (Or maybe tomorrow.)
Edit: posted here
Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using “business as usual RLHF” end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.
So, what do I mean by octopus misalignment? Suppose a company breeds octopuses[2] until the point where they are as smart and capable as the best research scientists[3] at AI companies. We’ll suppose that this magically happens as fast as normal AI takeoff, so there are many generations per year. So, let’s say they currently have octopuses which can speak English and write some code but aren’t smart enough to be software engineers or automate any real jobs. (As in, they are as capable as AIs are today, roughly speaking.) And they get to the level of top research scientists in mid-2028.
Along the way, the company attempts to select them for being kind, loyal, and obedient. The company also tries to develop a process for raising the octopuses which appears to help with this and results in the octopuses following the octopus spec. The company does some red teaming and puts the octopuses in all kinds of scenarios to test their loyalty and preferences. Based on behavioral testing, this looks pretty reasonable and the octopuses look quite good by the time they are as good as the best research scientists. There was some evidence of misalignment and some issues due to misaligned behavior when the octopuses were dumber in 2023-2025, including things like being dishonest when put under pressure, pretending to be aligned when they actually dislike the octopus spec to steer the properties of their descendants, and goodharting/hacking our measures of intelligence. However, by 2026.5 these issues appear to be resolved based on behavioral tests.
The takeoff is reasonably fast, so the human scientists working with the octopuses don’t have that much time to study them, but we’ll suppose the breeding is fast enough that they can get enough selection to make the octopuses appear good. The octopuses are trained until they are 30 years old (we’ll suppose we have a magic octopus time acceleration chamber to make this work). (And the company can adapt this training process over time with experiments etc.) The company doesn’t take octopus misalignment very seriously (after all, look, the octopuses are so nice!) but does have some people who consider it somewhat and does care about robust compliance to the spec.
Next, suppose that at the point when the octopuses become as good as the very best human researchers—the research scientist level—there is a colony of 5 million of them networked together with internal Slack channels and so forth, forming a “country of geniuses in an aquatic center.” And they’ve learned to speak both English and a special octopese language that no human can understand.
What is the chance that these octopuses (at the point of research scientist level) are actively scheming against us and would seize power if they could?
My guess is around 65%. I think Daniel is substantially higher, around 90%.
My view on a similar setup for AIs (where they have opaque recurrence, basically unrestricted opaque message passing, and we see some earlier warning signs, but haven’t seriously investigated misalignment), is around 50%. I discuss my perspective more here. My probabilities would be substantially lower if: the AIs didn’t have opaque recurrence and instead had probably faithful chain-of-thought, the AIs didn’t have opaque message passing, and we seriously investigated scheming in powerful AIs with model organisms and the final evidence was either ambiguous or mostly unconcerning.
To put this back in terms of the octopus scenario, suppose the octopi didn’t communicate using octopese. That would reduce my probability down to around 50%.
More precisely: Suppose the frontier AI company which prioritizes safety the median amount scales up to AIs capable enough to fully automate AI R&D within that company. Let’s say this happens in mid-2028 and there are no large improvements in alignment technology. The company iterates to fix misalignment problems they notice along the way (including in this final AI to the extent this AI displays evidence of serious misalignment problems by default). What’s the chance that this final AI ends up basically aligned in the sense that it basically tries to follow a reasonable interpretation of the model spec in 99% of cases and it isn’t notably more misaligned than a typical human employee at the company in aggregate.
Claude claims that octopuses is the technically correct plural instead of octopi. Sad.
I think AIs will have structural advantages, so I think we’ll see full automation of AI companies around the point when AIs are qualitatively as smart as a decent research scientist rather than as smart as a top research scientist.
Yep, I feel more like 90% here. (Lower numbers if the octopi don’t have octopese.) I’m curious for other people’s views.
My intuition is that these octopuses would be pretty unlikely to be scheming against us, for the following reasons:
dogs are pretty nice, and it’s apparently easy to domesticate foxes with very few generations. This suggests that “be nice to humans” is simple to learn for animals, and in particular simpler than “be as intelligent as human AI researchers”. So assuming the octopuses would first learn to like humans, it’s unclear to me why they would later start to dislike humans. Maybe your intuition is that they would later realize that they are being treated unfairly, which causes frustration that is not present in earlier generations?
If the octopuses speak octopese and English, I think it wouldn’t be too hard to get some octopi to translate for us. For example, we could ask one octopus to communicate some information to a second octopus using octopese. We can check if that worked by asking the second octopus to translate back to English. Now, we have a pair of (octopese text, English text) for which we know the translation. Now we ask a third octopus to translate octopese to English, and check the performance on some known examples.
It would be possible to scheme against this test if the octopi are amazingly skilled at coordination, but I suspect human AI researchers would fail this test if a foreign secret service would use a similar setup to get AI researchers to translate between foreignese and English, so I think we should assume that such coordination is hard and a small individual failure rate would likely be enough for the group to fail (we need just one octopus that translates faithfully). (Note that this argument hinges on the existence of a language that is understood by different octopi, so it doesn’t translate 1:1 to AIs.)
There is a difference though: dogs and humans have coevolved for 10,000 years. If you breed foxes you may quickly get dog-like looks but not dog-like behavior. But let’s assume you can do that faster. It still makes a difference whether you breed them in isolation or while socializing with humans. You can see the difference with dogs and cats. Dogs and humans had to cooperate to succeed at hunting and herding. Cats didn’t. Dogs are social. They feel social emotions such as love, loneliness and jealousy. They help their owners when they are incapacitated (though sometimes they cheat). I think Ryan’s, Daniel’s, and Neel’s estimates might be significantly lower if they think about German Shepherd scientists.
That said, for the purposes of alignment, it’s still good news that cats (by and large) do not scheme against their owner’s wishes, and the fact that cats can be as domesticated as they are while they aren’t cooperative or social is a huge boon for alignment purposes (within the analogy, which is arguably questionable).
I should note that I’m quite uncertain here and I can easily imagine my views swinging by large amounts.
And the related question would be: Even if they are not “actively scheming” what are the chances that most of the power to make decisions about the real world gets delegated to them, organizations that don’t delegate power to octopuses get outcompeted, and they start to value octopuses more than humans over time?
After thinking more about it, I think “we haven’t seen evidence of scheming once the octopi were very smart” is a bigger update than I was imagining, especially in the case where the octopi weren’t communicating with octopese. So, I’m now at ~20% without octopese and about 50% with it.
What was the purpose of using octopuses in this metaphor? Like, it seems you’ve piled on so many disanalogies to actual octopuses (extremely smart, many generations per year, they use Slack...) that you may as well just have said “AIs.”
EDIT: Is it gradient descent vs. evolution?
I found it helpful because it put me in the frame of an alien biological intelligence rather than an AI, because I have lots of preconceptions about AIs and it’s easy to implicitly think in terms of expected utility maximizers or tools or whatever. While if I’m imagining an octopus, I’m kind of imagining humans, but a bit weirder and more alien, and I would not trust humans.
I’m not making a strong claim this makes sense and I think people should mostly think about the AI case directly. I think it’s just another intuition pump and we can potentially be more concrete in the octopus case as we know the algorithm. (While in the AI case, we haven’t seen an ML algorithm that scales to human level.)
If the plural weren’t “octopuses”, it would be “octopodes”. Not everything is Latin.
Someone thought it would be useful to quickly write up a note on my thoughts on scalable oversight research, e.g., research into techniques like debate or generally improving the quality of human oversight using AI assistance or other methods. Broadly, my view is that this is a good research direction and I’m reasonably optimistic that work along these lines can improve our ability to effectively oversee somewhat smarter AIs which seems helpful (on my views about how the future will go).
I’m most excited for:
work using control-style adversarial analysis where the aim is to make it difficult for AIs to subvert the oversight process (if they were trying to do this)
work which tries to improve outputs in conceptually loaded hard-to-check cases like philosophy, strategy, or conceptual alignment/safety research (without necessarily doing any adversarial analysis and potentially via relying on generalization)
work which aims to robustly detect (or otherwise mitigate) reward hacking in highly capable AIs, particularly AIs which are capable enough that by default human oversight would often fail to detect reward hacks[1]
I’m skeptical of scalable oversight style methods (e.g., debate, IDA) actually being “scalable” in the sense of scaling to arbitrarily powerful models[2] and I think scalable oversight researchers should broadly be imagining targeting AIs at a human-ish or somewhat superhuman level of general capabilities (while they might still be very superhuman in narrower domains). In other words, I think scalable oversight style work should focus on a regime like the regime we’re imagining targeting with AI control; this could be for controlling AIs, for getting more safety work out of AIs, or for making fully deferring to AI systems (at around this level of capability) more likely to go well.
See also our prior work Benchmarks for Detecting Measurement Tampering and the motivation we discuss in that linked post as well as this related project proposal I recently wrote. However, note that the linked documents mostly discuss using (sophisticated) methods relying on model internals to succeed without much human supervision which isn’t the sort of thing I’d most centrally call “scalable oversight”, though the term could be applied in this case.
Because of this, I think the name isn’t ideal.
On terminology, I prefer to say “recursive oversight” to refer to methods that leverage assistance from weaker AIs to oversee stronger AIs. IDA is a central example here. Like you, I’m skeptical of recursive oversight schemes scaling to arbitrarily powerful models.
However, I think it’s plausible that other oversight strategies (e.g. ELK-style strategies that attempt to elicit and leverage the strong learner’s own knowledge) could succeed at scaling to arbitrarily powerful models, or at least to substantially superhuman models. This is the regime that I typically think about and target with my work, and I think it’s reasonable for others to do so as well.
I agree with preferring “recursive oversight”.
Presumably the term “recursive oversight” also includes oversight schemes which leverage assistance from AIs of similar strengths (rather than weaker AIs) to oversee some AI? (E.g., debate, recursive reward modeling.)
Note that I was pointing to a somewhat broader category than this which includes stuff like “training your human overseers more effectively” or “giving your human overseers better software (non-AI) tools”. But point taken.
Yeah, maybe I should have defined “recursive oversight” as “techniques that attempt to bootstrap from weak oversight to stronger oversight.” This would include IDA and task decomposition approaches (e.g. RRM). It wouldn’t seem to include debate, and that seems fine from my perspective. (And I indeed find it plausible that debate-shaped approaches could in fact scale arbitrarily, though I don’t think that existing debate schemes are likely to work without substantial new ideas.)
I appreciate that Anthropic’s model cards are now much more detailed than they used to be. They’ve especially improved in terms of details about CBRN evals (mostly biological risk).[1] They are substantially more detailed and informative than the model cards of other AI companies.
This isn’t to say that their model card contains a sufficient level of information or justification (at least on CBRN). We can’t seriously check their claims about the level of safety.
The Claude 3 model card gives a ~1/2 page description of their bio evaluations, though it was likely easy to rule out risk at this time. The Claude 3.5 Sonnet and 3.6 Sonnet model cards say basically nothing except that Anthropic ran CBRN evaluations and claims to be in compliance. The Claude 3.7 Sonnet model card contains much more detail on bio evals and the Claude 4 model card contains even more detail on the bio evaluations while also having a larger number of evaluations which are plausibly relevant to catastrophic misalignment risks. To be clear, some of the increase in detail might just be driven by increased capability, but the increase in detail is still worth encouraging.
My weakly-held cached take is: I agree on CBRN/bio (and of course alignment) and I think Anthropic is pretty similar to OpenAI/DeepMind on cyber and AI R&D (and scheming capabilities), at least if you consider stuff outside the model card (evals papers + open-sourcing the evals).
Following your Quick Takes has made me a way better informed “Responsible AI” lawyer than following most “AI Governance newsletters” 🙏🏼.
Thoughts on CBRN/bio results for the new OpenAI open-weight model (GPT-oss-120b).
My view is that releasing this model is net positive on fatalities reduced over the longer term, but the release kills substantial numbers of people in expectation due to bioterror (maybe ~2000 in absolute terms counterfactually, but more if OpenAI were the only actor or other actors were always at least as safe as OpenAI). See this earlier post for my general thoughts on the situation with open-weight models around the current level of capability.
I think it’s plausible but unlikely (perhaps 25% likely?) that in retrospect the model is well described as meeting the CBRN high threshold from OpenAI’s preparedness framework. I’d guess 35% likely for the same question for o3.
Overall, the model seems pretty close to as capable as o4-mini or o3, but OpenAI claims they did (more?) dataset filtering. It’s unclear if this dataset filtering did anything: the model doesn’t appear to have substantially degraded performance on any bio task, including ones where I would have hoped filtering matters (in particular, I’d hope you’d filter virology text in a way that greatly lowers VCT scores, but this doesn’t appear to be the case). So, one big source of uncertainty is how well the data filtering worked. I’d guess it does very little based on my understanding of the situation (also, I think they say they did the same filtering for 4o?), but it seems reasonably likely (40%?) that it has a big effect.
My overall view is that the release is net good and I think I tentatively support OpenAI releasing these open-weight models (and generally behaving in this way), but I’m reluctant to endorse it (especially in a strong way) because of the cost in lives.
Redwood in particular will probably find this release helpful for our research and I expect other organizations in a similar position will also be able to do better research because of this release.
The external safety expert feedback (see section 5.1.1 in the model card) seems good and I’m glad OpenAI did this. Their selection of experts seems reasonable and doesn’t feel cherry-picked to me. Getting expert feedback and expert opinions on the overall interpretation seems like a good thing to do going forward (as models get more capable and especially as evaluations get more uncertain and subjective).
That said, it’s unclear what the independent experts thought about the overall interpretation of the results: my sense is that after recommendations were adopted, these experts were probably mostly happy with the quality of the evaluations (at least putting aside much higher effort elicitation), but it’s unclear to me that these evaluations rule out CBRN high or rule out substantial CBRN fatalities caused by models of this capability level. (These evaluations do pretty strongly indicate that the model isn’t much better than other open weight models from a bio uplift perspective, but this is only one relevant question and the overall level of risk from models this capable is the more relevant question for avoiding a race to the bottom.)
I overall feel reasonably happy with the evaluations (given my limited understanding) and the direct presentation of the evaluations (in terms of the literal graphs), but my view is that it seems plausible that models of this capability level meet the criteria for CBRN high and I don’t buy that OpenAI’s evals rule this out. I’m also uncertain about the quality of the elicitation, but OpenAI seemingly did a reasonable job given only low/moderate effort elicitation (as in, without doing a bunch of bio elicitation R&D).
The model seems very, very benchmaxxed. Third party testing on unconventional or private benchmarks ends up placing even the largest gpt-oss below o4-mini, below the largest Qwen releases, and often it ends up below even the newer 30B~ Qwens in a few situations. It isn’t super capable to begin with, and the frankly absurd rate at which this model hallucinates kills what little use it might have with tool use. I think this model poses next to zero risk because it just isn’t very capable.
Seems plausible, I was just looking at benchmarks/evals at the time and it could be that it’s much less capable than it appears.
(Worth noting that I think (many? most? all?) other competitive open source models are also at least somewhat benchmark maxed, so this alters the comparison a bit, but doesn’t alter the absolute level of capability question.)
All existing CBRN evals have quite a limited power to predict real bio risks from LLMs, since any work to create and/or grow a bioweapon requires hands-on work in a lab with actual cells and viruses. As someone with experience working in a wet-lab, I can tell that this requires a very distinct set of skills from the ones that existing bio benchmarks measure, in part because these skills are often very hard to write down and measure. It’s often about knowing how to correctly hold the pipette to not contaminate a sample or how much force to apply while crushing cells. Benchmarks like VCT only measure a subset of necessary skills. (and smart people at SecureBio are absolutely aware of all this)
AFAIK OpenAI partnered with Los Alamos lab to do tests for LLM helping with wet-lab work directly by giving novices access to a lab and an LLM and observing how well they are doing. Would be excited to see the results of this experiment.
So to wrap up, I don’t believe that we truly know how much biorisk these models pose based on the existing evals.
Image generation doesn’t seem to be able to come anywhere near ‘pipette off this layer, not this layer’ for bio/chem experiments.
It’s exciting to see OpenAI acknowledge that pre-training data filtering is a part of their safety stack. When it comes to advanced technical content, minimizing the model’s exposure to sensitive material seems pretty intuitive. However, it is difficult to draw any strong conclusions about data filtering effectiveness from this work, given the understandably few details. They do not indicate the effort invested, the volume of data removed, or the sophistication of their filtering pipeline. I expect a company could share far more details about this process without divulging trade secrets.
Was it public knowledge that they did data filtering for GPT-4o? I’ve been studying this space and was not aware of this. It’s also interesting that they’re using the “same” filtering pipeline a year later.
Quick take titles should end in a period.
Quick takes (previously known as short forms) are often viewed via preview on the front page. This preview removes formatting and newlines for space reasons. So, if your title doesn’t end in a period and especially if capitalization doesn’t clearly denote a sentence boundary (like in this case where the first sentence starts in “I”), then it might be confusing.
This seems like the wrong solution. The LessWrong team should adjust the formatting.
DeepSeek’s success isn’t much of an update on a smaller US-China gap in short timelines because security was already a limiting factor
Some people seem to have updated towards a narrower US-China gap around the time of transformative AI if transformative AI is soon, due to recent releases from DeepSeek. However, since I expect frontier AI companies in the US will have inadequate security in short timelines and China will likely steal their models and algorithmic secrets, I don’t consider the current success of China’s domestic AI industry to be that much of an update. Furthermore, if DeepSeek or other Chinese companies were in the lead and didn’t open-source their models, I expect the US would steal their models and algorithmic secrets. Consequently, I expect these actors to be roughly equal in short timelines, except in their available compute and potentially in how effectively they can utilize AI systems.
I do think that the Chinese AI industry looking more competitive makes security look somewhat less appealing (and especially less politically viable) and makes it look like their adaptation time to stolen models and/or algorithmic secrets will be shorter. Marginal improvements in security still seem important, and ensuring high levels of security prior to at least ASI (and ideally earlier!) is still very important.
Using the breakdown of capabilities I outlined in this prior post, the rough picture I expect is something like:
AIs that can 10x accelerate AI R&D labor: Security is quite weak (perhaps <=SL3 as defined in “Securing Model Weights”), so the model is easily stolen if relevant actors want to steal it. Relevant actors are somewhat likely to know AI is a big enough deal that stealing it makes sense, but AI is not necessarily considered a top priority.
Top-Expert-Dominating AI: Security is somewhat improved (perhaps <=SL4), but still pretty doable to steal. Relevant actors are more aware, and the model probably gets stolen.
Very superhuman AI: I expect security to be improved by this point (partially via AIs working on security), but effort on stealing the model could also plausibly be unprecedentedly high. I currently expect security implemented before this point to suffice to prevent the model from being stolen.
Given this, I expect that key early models will be stolen, including models that can fully substitute for human experts, and so the important differences between actors will mostly be driven by compute, adaptation time, and utilization. Of these, compute seems most important, particularly given that adaptation and utilization time can be accelerated by the AIs themselves.
This analysis suggests that export controls are particularly important, but they would need to apply to hardware used for inference rather than just attempting to prevent large training runs through memory bandwidth limitations or similar restrictions.
A factor that I don’t think people are really taking into account is: what happens if this situation goes hostile?
Having an ASI and x amount of compute for inference plus y amount of currently existing autonomous weapons platforms (e.g. drones)… If your competitor has the same ASI, but 10x the compute but 0.2x the drones… And you get in a fight… Who wins? What about 0.2x the compute and 10x the drones?
The funny thing about AI brinkmanship versus nuclear brinkmanship is that AI doesn’t drain your economic resources, it accelerates your gains. A nuclear stockpile costs you in maintenance. An AI and compute ‘stockpile’ makes you more money and more tech (including better AI and compute). Thus, there is a lot more incentive to race harder on AI than on nuclear.
If I were in the defense dept of a world power right now, I’d be looking for ways to make money from civilian uses of drones… Maybe underwriting drone deliveries for mail. That gives you an excuse to build drones and drone factories while whistling innocently.
I’m not the only one thinking along these lines… https://x.com/8teAPi/status/1885340234352910723
Sometimes people talk about how AIs will be very superhuman at a bunch of (narrow) domains. A key question related to this is how much this generalizes. Here are two different possible extremes for how this could go:
It’s effectively like an attached narrow weak AI: The AI is superhuman at things like writing ultra fast CUDA kernels, but from the AI’s perspective, this is sort of like it has a weak AI tool attached to it (in a well integrated way) which is superhuman at this skill. The part which is writing these CUDA kernels (or otherwise doing the task) is effectively weak and can’t draw in a deep way on the AI’s overall skills or knowledge to generalize (likely it can shallowly draw on these in a way which is similar to the overall AI providing input to the weak tool AI). Further, you could actually break out these capabilities into a separate weak model that humans can use. Humans would use this somewhat less fluently, since they can’t translate their thoughts into it instantaneously and aren’t absurdly practiced at using the tool (as AIs would be), but the difference is ultimately mostly convenience and practice.
Integrated superhumanness: The AI is superhuman at things like writing ultra fast CUDA kernels via a mix of applying relatively general (and actually smart) abilities, having internalized a bunch of clever cognitive strategies which are applicable to CUDA kernels and sometimes to other domains, as well as domain specific knowledge and heuristics. (Similar to how humans learn.) The AI can access and flexibly apply all of the things it learned from being superhuman at CUDA kernels (or whatever skill) and with a tiny amount of training/practice it can basically transfer all these things to some other domain even if the domain is very different. The AI is at least as good at understanding and flexibly applying what it has learned as humans would be if they learned the (superhuman) skill to the same extent (and perhaps the AIs are actually much better at this than humans). You can’t separate these capabilities into a weak model, the weak model RL’d on this (and distilled into) would either be much worse at CUDA or would need to actually be generally quite capable (rather than weak).
My sense is that the current frontier LLMs are much closer to (1) than (2) for most of their skills, particularly the skills which they’ve been heavily trained on (e.g. next token prediction or competitive programming). As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we’ll be much closer to (2). That said, it seems likely that powerful AIs built in the current paradigm[1] which otherwise match humans at downstream performance will somewhat lag behind humans in integrating/generalizing skills they learn (at least without spending a bunch of extra compute on skill integration) because this ability currently seems to be lagging behind other capabilities relative to humans and AIs can compensate for worse skill integration with other advantages (being extremely knowledgeable, fast speed, parallel training on vast amounts of relevant data including “train once, deploy many”, better memory, faster and better communication, etc).
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
If the paradigm radically shifts by the time we have powerful AIs, then the relative level of integration is much less clear.
Good articulation.
People also disagree greatly about how much humans tend towards integration rather than non-integration, and how much human skill comes from domain transfer. And I think some / a lot of the beliefs about artificial intelligence are downstream of these beliefs about the origins of biological intelligence and human expertise, i.e., in Yudkowsky / Ngo dialogues. (Object level: Both the LW-central and alternatives to the LW-central hypotheses seem insufficiently articulated; they operate as a background hypothesis too large to see rather than something explicitly noted, imo.)
Makes me wonder whether most of what people believe to be “domain transfer” could simply be IQ.
I mean, suppose that you observe a person being great at X, then you make them study Y for a while, and it turns out that they are better at Y than an average person who spent the same time studying Y.
One observer says: “Clearly some of the skills at X have transferred to the skills of Y.”
Another observer says: “You just indirectly chose a smart person (by filtering for high skills at X), duh.”
This seems important to think about, I strong upvoted!
I’m not sure that link supports your conclusion.
First, the paper is about AI understanding its own behavior. This paper makes me expect that a CUDA-kernel-writing AI would be able to accurately identify itself as being specialized at writing CUDA kernels, which doesn’t support the idea that it would generalize to non-CUDA tasks.
Maybe if you asked the AI “please list heuristics you use to write CUDA kernels,” it would be able to give you a pretty accurate list. This is plausibly more useful for generalizing, because if the model can name these heuristics explicitly, maybe it can also use the ones that generalize, if they do generalize. This depends on 1) the model is aware of many heuristics that it’s learned, 2) many of these heuristics generalize across domains, and 3) it can use its awareness of these heuristics to successfully generalize. None of these are clearly true to me.
Second, the paper only tested GPT-4o and Llama 3, so the paper doesn’t provide clear evidence that more capable AIs “shift some towards (2).” The authors actually call out in the paper that future work could test this on smaller models to find out if there are scaling laws—has anybody done this? I wouldn’t be too surprised if small models were also able to self-report simple attributes about themselves that were instilled during training.
Fair, but I think the AI being aware of its behavior is pretty continuous with being aware of the heuristics it’s using and ultimately generalizing these (e.g., in some cases the AI learns what code word it is trying to make the user say which is very similar to being aware of any other aspect of the task it is learning). I’m skeptical that very weak/small AIs can do this based on some other papers which show they fail at substantially easier (out-of-context reasoning) tasks.
I think most of the reason why I believe this is improving with capabilities is due to a broader sense of how well AIs generalize capabilities (e.g., how much does o3 get better at tasks it wasn’t trained on), but this paper was the most clearly relevant link I could find.
I’m not sure o3 does get significantly better at tasks it wasn’t trained on. Since we don’t know what was in o3’s training data, it’s hard to say for sure that it wasn’t trained on any given task.
To my knowledge, the most likely example of a task that o3 does well on without explicit training is GeoGuessr. But see this Astral Codex Ten post, quoting Daniel Kang:[1]
I think this is a bit overstated, since GeoGuessr is a relatively obscure task, and implementing an idea takes much longer than thinking of it.[2] But it’s possible that o3 was trained on GeoGuessr.
The same ACX post also mentions:
Do you have examples in mind of tasks that you don’t think o3 was trained on, but which it nonetheless performs significantly better at than GPT-4o?
Disclaimer: Daniel happens to be my employer
Maybe not for cracked OpenAI engineers, idk
I would guess that OpenAI has trained on GeoGuessr. It should be pretty easy to implement—just take images off the web which have location metadata attached, and train to predict the location. Plausibly getting good at Geoguessr imbues some world knowledge.
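As a rough sketch of the data-collection step being described, something like the following would produce (image, location) pairs from geotagged images. Everything here (the directory name, the file glob, the EXIF handling details) is illustrative rather than anyone’s actual pipeline:

```python
# Sketch: build (image, location) training pairs from geotagged web images,
# per the idea above. Paths and parsing details are illustrative placeholders.
from pathlib import Path
from PIL import Image
from PIL.ExifTags import GPSTAGS

def exif_gps(path: Path):
    """Return (lat, lon) from EXIF GPS metadata, or None if absent."""
    exif = Image.open(path)._getexif() or {}
    gps = {GPSTAGS.get(k, k): v for k, v in exif.get(34853, {}).items()}  # 34853 = GPSInfo
    if "GPSLatitude" not in gps or "GPSLongitude" not in gps:
        return None
    def to_deg(vals, ref):
        d, m, s = (float(v) for v in vals)
        return (-1 if ref in ("S", "W") else 1) * (d + m / 60 + s / 3600)
    return (to_deg(gps["GPSLatitude"], gps["GPSLatitudeRef"]),
            to_deg(gps["GPSLongitude"], gps["GPSLongitudeRef"]))

pairs = [(p, exif_gps(p)) for p in Path("scraped_images").glob("*.jpg")]
pairs = [(p, loc) for p, loc in pairs if loc is not None]
print(f"{len(pairs)} geotagged (image, location) training examples")
```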
I think this might be wrong when it comes to our disagreements, because I don’t disagree with this shortform.[1] Maybe a bigger crux is how valuable (1) is relative to (2)? Or the extent to which (2) is more helpful for scientific progress than (1)?
As long as “downstream performance” doesn’t include downstream performance on tasks that themselves involve a bunch of integrating/generalising.
I don’t think this explains our disagreements. My low confidence guess is we have reasonably similar views on this. But, I do think it drives parts of some disagreements between me and people who are much more optimistic than me (e.g. various not-very-concerned AI company employees).
I agree the value of (1) vs (2) might also be a crux in some cases.
Is the crux that the more optimistic folks plausibly agree (2) is cause for concern, but believe that mundane utility can be reaped with (1), and they don’t expect us to slide from (1) into (2) without noticing?
I suppose that most tasks that an LLM can accomplish could theoretically be performed more efficiently by a dedicated program optimized for that task (and even better by a dedicated physical circuit). Hypothesis 1) amounts to considering that such a program, a dedicated module within the model, is established during training. This module can be seen as a weak AI used as a tool by the stronger AI. A bit like how the human brain has specialized modules that we (the higher conscious module) use unconsciously (e.g., when we read, the decoding of letters is executed unconsciously by a specialized module).
We can envision that at a certain stage the model becomes so competent in programming that it will tend to write code on the fly (a tool) to solve most tasks that we submit to it. In fact, I notice that this is already increasingly the case when I ask a question to a recent model like Claude Sonnet 3.7. It often generates code (a tool) to try to answer me rather than trying to answer the question ‘itself.’ It clearly realizes that dedicated code will be more effective than its own neural network. This is interesting because in this scenario, the dedicated module is not generated during training but on the fly during normal production operation. In this way, it would be sufficient for an AI to become a superhuman programmer in order to become superhuman in many domains thanks to the use of these tool-programs. The next stage would be the on-the-fly production of dedicated physical circuits (FPGA, ASIC, or alien technology), but that’s another story.
This relates to the philosophical debate about where intelligence resides: in the tool or in the one who created it? In the program or in the programmer? If a human programmer programs a superhuman AI, should we attribute this superhuman intelligence to the programmer? Same question if the programmer is itself an AI. It’s the kind of chicken-and-egg debate where the answer depends on how we divide the continuity of reality into discrete categories. You’re right that integration is an interesting criterion, as it is a kind of formal, non-arbitrary solution to this problem of defining discrete categories within the continuity of reality.
A response to Dwarkesh’s post arguing continual learning is a bottleneck.
This is a response to Dwarkesh’s post “Why I have slightly longer timelines than some of my guests”. I originally posted this response on twitter here.
I agree with much of this post. I also have roughly 2032 medians for things going crazy, I agree learning on the job is very useful, and I’m also skeptical we’d see massive white collar automation without further AI progress.
However, I think Dwarkesh is wrong to suggest that RL fine-tuning can’t be qualitatively similar to how humans learn. In the post, he discusses AIs constructing verifiable RL environments for themselves based on human feedback and then argues this wouldn’t be flexible and powerful enough to work, but RL could be used more similarly to how humans learn.
My best guess is that the way humans learn on the job is mostly by noticing when something went well (or poorly) and then sample efficiently updating (with their brain doing something analogous to an RL update). In some cases, this is based on external feedback (e.g. from a coworker) and in some cases it’s based on self-verification: the person just looking at the outcome of their actions and then determining if it went well or poorly.
So, you could imagine RL’ing an AI based on both external feedback and self-verification like this. And, this would be a “deliberate, adaptive process” like human learning. Why would this currently work worse than human learning?
Current AIs are worse than humans at two things which makes RL (quantitatively) much worse for them:
Robust self-verification: the ability to correctly determine when you’ve done something well/poorly in a way which is robust to you optimizing against it.
Sample efficiency: how much you learn from each update (potentially leveraging stuff like determining what caused things to go well/poorly which humans certainly take advantage of). This is especially important if you have sparse external feedback.
But, these are more like quantitative than qualitative issues IMO. AIs (and RL methods) are improving at both of these.
All that said, I think it’s very plausible that the route to better continual learning routes more through building on in-context learning (perhaps through something like neuralese, though this would greatly increase misalignment risks...).
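To make the RL-based version of this concrete, here is a toy sketch of the kind of loop described above: use external feedback when it’s available, fall back to self-verification otherwise, and update weights online. Everything in it, including PolicyModel and its methods, is schematic stand-in code rather than a real training setup:

```python
# Toy sketch of "learning on the job" via RL from external feedback plus
# self-verification. PolicyModel and its methods are schematic stand-ins.
class PolicyModel:
    def act(self, task) -> str:
        return f"attempted: {task}"          # placeholder for the agent acting
    def self_verify(self, task, outcome) -> float:
        return 0.5                           # placeholder self-assessed reward in [0, 1]
    def rl_update(self, task, outcome, reward) -> None:
        pass                                 # placeholder for e.g. a policy-gradient step

def continual_learning_loop(model, task_stream, external_feedback):
    for i, task in enumerate(task_stream):
        outcome = model.act(task)
        # Use scarce external feedback (e.g. a coworker's review) when available;
        # otherwise fall back to the model's own judgment of how the task went.
        reward = external_feedback.get(i, model.self_verify(task, outcome))
        # Online update: weights change during deployment rather than only in a
        # separate training phase. The two weak points discussed above are the
        # sample efficiency of this update and how robust self_verify is to
        # being optimized against.
        model.rl_update(task, outcome, reward)

continual_learning_loop(PolicyModel(), ["draft an email", "fix a flaky test"], {1: 1.0})
```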
Some more quibbles:
For the exact podcasting tasks Dwarkesh mentions, it really seems like simple fine-tuning mixed with a bit of RL would solve his problem. So, an automated training loop run by the AI could probably work here. This just isn’t deployed as an easy-to-use feature.
For many (IMO most) useful tasks, AIs are limited by something other than “learning on the job”. At autonomous software engineering, they fail to match what humans can do with 3 hours of time, and they are typically limited by being bad agents or by being generally dumb/confused. To be clear, it seems totally plausible that for the podcasting tasks Dwarkesh mentions, learning is the limiting factor.
Correspondingly, I’d guess the reason that we don’t see people trying more complex RL-based continual learning in normal deployments is that there is lower hanging fruit elsewhere and typically something else is the main blocker. I agree that if you had human-level sample efficiency in learning, this would immediately yield strong results (e.g., you’d presumably have very superhuman AIs with 10^26 FLOP); I’m just making a claim about more incremental progress.
I think Dwarkesh uses the term “intelligence” somewhat atypically when he says “The reason humans are so useful is not mainly their raw intelligence. It’s their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task.” I think people often consider how fast someone learns on the job as one aspect of intelligence. I agree there is a difference between short feedback loop intelligence (e.g. IQ tests) and long feedback loop intelligence and they are quite correlated in humans (while AIs tend to be relatively worse at long feedback loop intelligence).
Dwarkesh notes “An AI that is capable of online learning might functionally become a superintelligence quite rapidly, even if there’s no algorithmic progress after that point.” This seems reasonable, but it’s worth noting that if sample efficient learning is very compute expensive, then this might not happen so rapidly.
I think AIs will likely overcome poor sample efficiency to achieve a very high level of performance using a bunch of tricks (e.g. constructing a bunch of RL environments, using a ton of compute to learn when feedback is scarce, learning from much more data than humans due to “learn once deploy many” style strategies). I think we’ll probably see fully automated AI R&D prior to matching top human sample efficiency at learning on the job. Notably, if you do match top human sample efficiency at learning (while still using a similar amount of compute to the human brain), then we already have enough compute for this to basically immediately result in vastly superhuman AIs (human lifetime compute is maybe 3e23 FLOP and we’ll soon be doing 1e27 FLOP training runs). So, either sample efficiency must be worse or at least it must not be possible to match human sample efficiency without spending more compute per data-point/trajectory/episode.
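Spelling out the back-of-the-envelope comparison implied by those (rough, assumed) numbers:

$$\frac{10^{27}\ \text{FLOP (near-term training run)}}{3\times 10^{23}\ \text{FLOP (human lifetime)}} \approx 3{,}300\ \text{human-lifetime equivalents of learning compute,}$$

which is why matching human sample efficiency at human-brain-like compute would plausibly yield vastly superhuman systems almost immediately.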
I agree that robust self-verification and sample efficiency are the main things AIs are worse at than humans, and that this is basically just a quantitative difference. But what’s the best evidence that RL methods are getting more sample efficient (separate from AIs getting better at recognizing their own mistakes)? That’s not obvious to me but I’m not really read up on the literature. Is there a benchmark suite you think best illustrates that?
RL sample efficiency can be improved by both:
Better RL algorithms (including things that also improve pretraining sample efficiency like better architectures and optimizers).
Smarter models (in particular, smarter base models, though I’d also expect that RL in one domain makes RL in some other domain more sample efficient, at least after sufficient scale).
There isn’t great evidence that we’ve been seeing substantial improvements in RL algorithms recently, but AI companies are strongly incentivized to improve RL sample efficiency (as RL scaling appears to be yielding large returns) and there are a bunch of ML papers which claim to find substantial RL improvements (though it’s hard to be confident in these results for various reasons). So, we should probably infer that AI companies have made substantial gains in RL algorithms, but we don’t have public numbers. 3.7 sonnet was much better than 3.5 sonnet, but it’s hard to know how much of the gain was from RL algorithms vs from other areas.
Minimally, there is a long running trend in pretraining algorithmic efficiency and many of these improvements should also transfer somewhat to RL sample efficiency.
As far as evidence that smarter models learn more sample efficiently, I think the deepseek R1 paper has some results on this. Probably it’s also possible to find various pieces of support for this in the literature, but I’m more familiar with various anecdotes.
I agree with most of Dwarkesh’s post, with essentially the same exceptions you’ve listed.
I wrote about this recently in LLM AGI will have memory, and memory changes alignment, drawing essentially the conclusions you’ve given above. Continuous learning is a critical strength of humans and job substitution will be limited (but real) until LLM agents can do effective self-directed learning. It’s quite hard to say how fast that will happen, for the reasons you’ve given. Fine-tuning is a good deal like human habit/skill learning; the question is how well agents can select what to learn.
One nontrivial disagreement is on the barrier to long time horizon task performance. Humans don’t learn long time-horizon task performance primarily from RL. We learn in several ways at different scales, including learning new strategies which can be captured in language. All of those types of learning do rely on self-assessment and decisions about what’s worth learning, and those will be challenging and perhaps difficult to get out of LLM agents—although I don’t think there’s any fundamental barrier to squeezing workable judgments out of them, just some schlep in scaffolding and training to do it better.
Based on this logic, my timelines are getting longer in median (although rapid progress is still quite possible and we are far from prepared).
But I’m getting somewhat less pessimistic about the prospect of having incompetent autonomous agents with self-directed learning. These would probably both take over some jobs, and display egregious misalignment. I think they’ll be a visceral wakeup call that has decent odds of getting society properly freaked out about human-plus AI with a little time left to prepare.
On this:
I think the key difference is as of right now, RL fine-tuning doesn’t change the AI’s weights after training, and continual learning is meant to focus on AI weight changes after the notional training period ends.
Are you claiming that RL fine-tuning doesn’t change weights? This is wrong.
Maybe instead you’re saying “no one ongoingly does RL fine-tuning where they constantly are updating the weights throughout deployment (aka online training)”. My response is: sure, but they could do this, they just don’t because it’s logistically/practically pretty annoying and the performance improvement wouldn’t be that high, at least without some more focused R&D on making this work better.
Yes, you got me right. My claim here isn’t that this work has super-high returns right now; sample efficiency and not being robust to optimization against a verifier are currently the big issues. But I am claiming that, in practice, online training (constantly updating the weights throughout deployment) will be necessary for AI to automate important jobs like AI research, because most tasks require building up history and continuously updating on successes and failures rather than 1-shotting the problem.
If you are correct in that there is a known solution, and it merely requires annoying logistical/practical work, then I’d accept short timelines as the default (modulo long-term memory issues in AI).
To expand on this, I also expect by default that something like a long-term memory/state will be necessary, due to the issue that not having a memory means you have to relearn basic skills dozens of times, and this drastically lengthens the time to complete a task to the extent that it’s not viable to use an AI instead of a human.
Comments below:
https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#Snvr22zNTXmHcAhPA
https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#vFq87Ge27gashgwy9
One possibility I’ve wondered about is whether AI can automate this learning work: start from a transcript of someone trying to do things with AI (with mistakes and subsequent feedback), and then curate data from it that works well for RL fine-tuning. Or even distill it into examples for in-context learning (which probably works somewhat well, sometimes, today).
The 5 main reasons why control could fail
Here are the 5 main categories of reasons why control could fail (though various of them are often correlated).
[Note: this document has a moderate amount of overlap with various prior documents Redwood has released, but it seemed like a useful document nonetheless.]
1. Insufficient (long horizon) sample efficiency: If your AI is scheming, you won’t be able to get this AI to achieve near-maximal reward on average on (long horizon) tasks with reasonable sample efficiency. By near-maximal, we just mean empirically achieving average reward which is quite close to the reward the AI would get if it were “actually trying” to get high reward (insofar as it is scheming to an extent that “trying” makes sense). (When we say average, we actually mean average excluding rare outliers. We can’t (in general) control behavior on rare outliers using normal ML training.) For control, this causes issues in two places: it interferes with getting useful work out of AIs and with being able to run accurate capability evaluations.
2. Auditing failures: the inability of humans to sufficiently assess actions. (Though note there can be solutions to these failures which don’t involve improving auditing.) This includes:
2.a. Average case usefulness: For scheming AIs, you won’t be able to check that the work they’re doing is actually good (on average) rather than just looking good to you. Note that this includes getting enough actually good work out of AIs.
2.b. Average case safety: Direct safety issues that occur due to auditing failing to notice a large number of actively dangerous actions, which then build up into a catastrophic outcome. (E.g., the AI slowly persuades some group in the AI lab to become AI cultists.) One particularly salient special case is:
2.b.i. We can’t prevent AIs from building up huge amounts of coup science: Control will depend on human researchers understanding the space of possible AI attack strategies and our countermeasures similarly well to, or better than, the AIs do. See Notes on control evaluations for safety cases for more discussion.
2.c. High-stakes safety: The auditing process fails to detect that a single action is very bad when it reviews that action, and this is extremely bad because that action is part of a high-stakes failure (e.g., an escape attempt that occurs in a small number of actions).
3. Concentrated failures: the inability to prevent concentrated catastrophes (safety failures which are concentrated into a small number of bad actions or a small amount of serial time) despite having an expensive and slow auditing process which could detect this bad behavior (it works for high-stakes safety) if it were assigned to review these actions.
4. Lack of institutional will or ability: Even if control would work given a few years of delay and an implementation carried out by a quite well-run organization, you won’t get these in practice and thus you’ll have trouble. (This argument also applies to any other technical approaches where the implementation requires substantial institutional will; therefore, we need some sort of technical solutions which require less institutional competence.)
5. Catching the AI isn’t sufficiently useful: The viability of control will likely depend on it being very useful to catch the AI one or a small number of times. But, this might turn out to be false for various reasons. See Catching AIs red-handed for discussion. This might depend heavily on questions which aren’t about technical AI safety, like how useful legible evidence of serious misalignment is. But, the technical usefulness of catching could suffice.
We think that various of these things could go wrong in practice. However, it seems like the current point on the logistic success curve is such that if an AI lab tried very hard to evaluate and ensure control, they could improve the situation substantially (e.g., 3-5x reduction in doom) relative to not trying very hard.
Note that 2, 3, and 4 of the above failures probably just prevent us from building sufficiently safe and useful protocols without breaking control evaluations. 1 and 2.b.i (coup science build-up) are likely to break control evaluations, though it might be possible to recognize that we’re in a world where 1 and/or 2.b.i hold. 5 has more complex effects, but we’d probably be able to safely stop after realizing that we’re unable to extract enough value from the times when we’ve caught the AI; so the main concern would be that we’re somewhat clearly unable to safely continue rather than being caught dangerously unaware.
See here for more discussion.
Here are some reasons why control might not be worth working on (or actively bad to work on) regardless of the above reasons:
Control is bad for coordination: Perhaps control does improve safety, but it does so at the cost of making it easier to make excuses, giving false confidence, and/or creating slippery slopes.
Control makes it cheaper and easier to do morally problematic things: Control might make it more commercially valuable and technically feasible to employ AIs in situations where we know these AIs don’t consent to this usage (and these AIs are well described as having preferences that deserve moral consideration), and this might make violations of basic moral principles (such as slavery) more likely. (In the same way as better technology for managing human insider threats could make human slavery more commercially valuable. Of course, in the case of human slavery, there are already institutions and norms which combat slavery.) However, note that our recommendation to society would be to try extremely hard to avoid employing AIs without their consent; if AIs don’t consent to employment we should (temporarily) shut down or negotiate a deal with the AIs. Much more commentary on these topics can be seen here. Our overall view is that the default AI situation with only current alignment/safety/interpretability technology is somewhat scary from an AI welfare/rights perspective, but that control doesn’t make the situation worse.
Scheming is extremely unlikely for early transformatively useful AI (<1%) and control-style approaches won’t be a useful safeguard against other threat models: Control focuses on making conservative assumptions about the effects of training and the motivations of AIs. If we didn’t have to make such conservative assumptions, then control-type approaches look much less worthwhile. However, control-type things might still be a useful line of defense against failure modes which aren’t well described as scheming.
Baseline control strategies will work (when control is at all possible) and control evaluations are easy to run: If the baseline approaches can’t be improved on and there isn’t important research on the evaluations, then technical work on control isn’t very important. Policy/governance work might still be important. See also here.
It seems slightly bad that you can vote on comments under the Quick Takes and Popular Comments sections without needing to click on the comment and without leaving the frontpage.
This is different than posts where in order to vote, you have to actually click on the post and leave the main page (which perhaps requires more activation energy).
I’m unsure, but I think this could result in a higher frequency of cases where Quick Takes and comments go viral in problematic ways relative to posts. (Based on my vague sense from the EA forum.)
This doesn’t seem very important, but I thought it would be worth quickly noting.
I feel like Quick Takes and Popular Comments on the EA forum both do a good job of showing me content I’m glad to have seen (and in the case of the EA forum that I wouldn’t have otherwise seen) and of making the experience of making a shortform better (in that you get more and better engagement). So far, this also seems fairly true on LessWrong, but we’re presumably out of equilibrium.
Overall, I feel like posts on LessWrong are too long, and so am particularly excited about finding my way to things that are a little shorter. (It would also be cool if we could figure out how to do rigour well or other things that benefit from length, but I don’t feel like I’m getting that much advantage from the length yet).
My guess is that the problems you gesture at will be real and the feature will still be net positive. I wish there were a way to “pay off” the bad parts with some of the benefit; probably there is and I haven’t thought of it.
Yeah, I was a bit worried about this, especially for popular comments. For quick takes, it does seem like you can see all the relevant context before voting, and you were also able to already vote on comments from the Recent Discussion section (though that section isn’t sorted by votes, so at less risk of viral dynamics). I do feel a bit worried that popular comments will get a bunch of votes without people reading the underlying post it’s responding to.
Some people seem to think my timelines have shifted a bunch while they’ve only moderately changed.
Relative to my views at the start of 2025, my median (50th percentile) for AIs fully automating AI R&D was pushed back by around 2 years—from something like Jan 2032 to Jan 2034. My 25th percentile has shifted similarly (though perhaps more importantly) from maybe July 2028 to July 2030. Obviously, my numbers aren’t fully precise and vary some over time. (E.g., I’m not sure I would have quoted these exact numbers for this exact milestone at the start of the year; these numbers for the start of the year are partially reverse engineered from this comment.)
Fully automating AI R&D is a pretty high milestone; my current numbers for something like “AIs accelerate AI R&D as much as what would happen if employees ran 10x faster (e.g. by ~fully automating research engineering and some other tasks)” are probably 50th percentile Jan 2032 and 25th percentile Jan 2029.[1]
I’m partially posting this so there is a record of my views; I think it’s somewhat interesting to observe this over time. (That said, I don’t want to anchor myself, which does seem like a serious downside. I should slide around a bunch and be somewhat incoherent if I’m updating as much as I should: my past views are always going to be somewhat obviously confused from the perspective of my current self.)
While I’m giving these numbers, note that I think Precise AGI timelines don’t matter that much.
See this comment for the numbers I would have given for this milestone at the start of the year.
What are your very long timeline expectations, for 2045+ or 2055+ AGI (automated AI R&D, sure)? That’s where I expect most of the rare futures with humanity not permanently disempowered to be, though the majority even of these long timelines will still result in permanent disempowerment (or extinction).
I think it takes at least about 10 years to qualitatively transform an active field of technical study or change the social agenda, so 2-3 steps of such change might have a chance of sufficiently reshaping how the world thinks about AI x-risk and what technical tools are available for shaping minds of AGIs, in order to either make a human-initiated lasting Pause plausible, or to have the means of aligning AGIs in an ambitious sense.
I appreciate you saying that the 25th percentile timeline might be more important. I think that’s right and underappreciated.
One of your recent (excellent) posts also made me notice that AGI timelines probably aren’t normally distributed. Breakthroughs, other large turns of events, or large theoretical misunderstandings at this point probably play a large role, and there are probably only a very few of those that will hit. Small unpredictable events that create normal distributions will play a lesser role.
I don’t know how you’d characterize that mathematically, but I don’t think it’s right to assume it’s normally distributed, or even close.
Back to your comment on the 25th percentile being important: I think there’s a common error where people round to the median and then think “ok, that’s probably when we need to have alignment/strategy figured out.” You’d really want to have it at least somewhat ready far earlier.
That’s both in case it’s on the earlier side of the predicted distribution, and because alignment theory and practice need to be ready far enough in advance of game time to have diffused and be implemented for the first takeover-capable model.
I’ve been thinking of writing a post called something like “why are so few people frantic about alignment?” making those points. Stated timeline distributions don’t seem to match mood IMO and I’m trying to figure out why. I realize that part of it is a very reasonable “we’ll figure it out when/if we get there.” And perhaps others share my emotional dissociation from my intellectual expectations. But maybe we should all be a bit more frantic. I’d like some more halfassed alignment solutions in play and under discussion right now. The 80/20 rule probably applies here.
I think it would be interesting if you replied to this comment once a year or so to report how your timelines have changed.
A comment on a year-old post may not be the best place; maybe a new shortform on this day each year which links to all previous posts?
I did have similar timelines, but I have generally moved them forward to ~2027-2028 after updating my priors based on OpenAI/xAI employee X posts. They are generally confident that post-training scale-ups will accelerate timelines, and they state that post-training can be scaled, with returns to match, well beyond pre-training. As they have inside information, I am inclined to trust them. Insiders like Anthropic’s Jack Clark are even more bold: he says that superintelligent machines will be available in late 2026.
Some of my proposals for empirical AI safety work.
I sometimes write proposals for empirical AI safety work without publishing the proposal (as the doc is somewhat rough or hard to understand). But, I thought it might be useful to at least link my recent project proposals publicly in case someone would find them useful.
Decoding opaque reasoning in current models
Safe distillation
Basic science of teaching AIs synthetic facts[1]
How to do elicitation without learning
If you’re interested in empirical project proposals, you might also be interested in “7+ tractable directions in AI control”.
I’ve actually already linked this (and the proposal in the next bullet) publicly from “7+ tractable directions in AI control”, but I thought it could be useful to link here as well.
would you consider sharing these as Related Papers on AI-Plans?
About 1 year ago, I wrote up a ready-to-go plan for AI safety focused on current science (what we roughly know how to do right now). This is targeting reducing catastrophic risks from the point when we have transformatively powerful AIs (e.g. AIs similarly capable to humans).
I never finished this doc, and it is now considerably out of date relative to how I currently think about what should happen, but I still think it might be helpful to share.
Here is the doc. I don’t particularly want to recommend people read this doc, but it is possible that someone will find it valuable to read.
I plan on trying to think through the best ready-to-go plan roughly once a year. Buck and I have recently started work on a similar effort. Maybe this time we’ll actually put out an overall plan rather than just spinning off various docs.
This seems like a great activity, thank you for doing/sharing it. I disagree with the claim near the end that this seems better than Stop, and in general felt somewhat alarmed throughout at (what seemed to me like) some conflation/conceptual slippage between arguments that various strategies were tractable, and that they were meaningfully helpful. Even so, I feel happy that the world contains people sharing things like this; props.
At the start of the doc, I say:
Towards the end of the doc I say:
Presumably, you’re objecting to ‘I think careful implementation of this sort of plan is probably better than “shut everything down” for most AI labs’.
My current view is something like:
If there was broad, strong, and durable political will and buy in for heavily prioritizing AI takeover risk in the US, I think it would be good if the US government shut down scaling and took strong actions to prevent frontier AI progress while also accelerating huge amounts of plausibly-safety-related research.
You’d need to carefully manage the transition back to scaling to reduce hardware overhang issues. This is part of why I think “durable” political will is important. There are various routes for doing this with different costs.
I’m sympathetic to thinking this doesn’t make sense if you just care about deaths prior to age 120 of currently alive people and wide-spread cryonics is hopeless (even conditional on the level of political support for mitigating AI takeover risk). Some other views which just care about achieving close to normal lifespans for currently alive humans also maybe aren’t into pausing.
Regulations/actions which have the side effect of slowing down scaling and which aren’t part of a broad package seem way less good. This is partially due to hardware/algorithmic overhang concerns, but more generally due to follow-through concerns. I also wouldn’t advocate such regulations (regulations with the main impact being to slow down as a side effect) due to integrity/legitimacy concerns.
Unilateral shutdown is different as advice than “it would be good if everyone shut down” because AI companies might think (correctly or not) that they would be better on safety than other companies. In practice, no AI lab seems to have expressed a view which is close to “acceleration is bad” except Anthropic (see core views on AI safety).
We are very, very far from broad, strong, and durable political will for heavily prioritizing AI takeover, so weaker and conditional interventions for actors on the margin seem useful.
A response to “State of play of AI progress (and related brakes on an intelligence explosion)” by Nathan Lambert.
Nathan Lambert recently wrote a piece about why he doesn’t expect a software-only intelligence explosion. I responded in this substack comment which I thought would be worth copying here.
As someone who thinks a rapid (software-only) intelligence explosion is likely, I thought I would respond to this post and try to make the case in favor. I tend to think that AI 2027 is a quite aggressive, but plausible scenario.
I interpret the core argument in AI 2027 as:
We’re on track to build AIs which can fully automate research engineering in a few years. (Or at least, this is plausible, like >20%.) AI 2027 calls this level of AI “superhuman coder”. (Argument: https://ai-2027.com/research/timelines-forecast)
Superhuman coders will ~5x accelerate AI R&D because ultra fast and cheap superhuman research engineers would be very helpful. (Argument: https://ai-2027.com/research/takeoff-forecast#sc-would-5x-ai-randd).
Once you have superhuman coders, unassisted humans would only take a moderate number of years to make AIs which can automate all of AI research (“superhuman AI researcher”), like maybe 5 years. (Argument: https://ai-2027.com/research/takeoff-forecast#human-only-timeline-from-sc-to-sar). And, because of AI R&D acceleration along the way, this will actually happen much faster.
Superhuman AI researchers can obsolete and outcompete humans at all AI research. This includes messy data research, discovering new paradigms, and whatever humans might be doing which is important. These AIs will also be substantially faster and cheaper than humans, like fast and cheap enough that with 1/5 of our compute we can run 200k parallel copies each at 40x speed (for ~8 million parallel worker equivalents). Because labor is a key input into AI R&D, these AIs will be able to speed things up by ~25x. (Argument: https://ai-2027.com/research/takeoff-forecast#sar-would-25x-ai-randd)
These speed ups don’t stop just above the human range, they continue substantially beyond the human range, allowing for quickly yielding AIs which are much more capable than humans. (AI 2027 doesn’t really argue for this, but you can see here: https://ai-2027.com/research/takeoff-forecast#siar-would-250x-ai-randd)
(There are a few other minor components like interpolating the AI R&D multipliers between these milestones and the argument for being able to get these systems to run at effective speeds which are much higher than humans. It’s worth noting that these numbers are more aggressive than my actual median view, especially on the timeline to superhuman coder.)
Ok, now what is your counterargument?
Sections (1) and (2) don’t seem in conflict with the AI 2027 view or the possibility of a software-only intelligence explosion. (I’m not claiming you intended these to be in conflict, just clarifying why I’m not responding. I assume you have these sections as relevant context for later arguments.)
As far as I can tell, section (3) basically says “You won’t get large acceleration just via automating implementation and making things more compute efficient. Also, ML research is messy and thus it will be hard to get AI systems which can actually fully automate this.”
AI 2027 basically agrees with these literal words:
The acceleration from superhuman coders which fully automate research engineering (and run at high speeds) is “only” 5x, and to get the higher 25x (or beyond) acceleration, the AIs need to be as good as the best ML researchers and fully automate ML research.
AI 2027 thinks it would take ~5 years of unaccelerated human AI R&D progress to get from superhuman coder to superhuman AI researcher, so the forecast incorporates this being pretty hard.
There is presumably a quantitative disagreement: you probably think that the acceleration from superhuman coders is lower and the unassisted research time to go from superhuman coders to superhuman AI researchers is longer (much more than 5 years?).
(FWIW, I think AI 2027 is probably too bullish on the superhuman coder AI R&D multiplier, I expect more like 3x.)
I’ll break this down a bit further.
Importantly, I think the claim:
Is probably missing the point: AI 2027 is claiming that we’ll get these large multipliers by being able to make AIs which beat humans at the overall task of AI research! So, it just needs to be the case that machine learning research could be greatly accelerated by much faster (and eventually very superhuman) labor. I think the extent of this acceleration and the returns are an open question, but I don’t think whether it is entirely bottlenecked by compute efficiency and implementation difficulty is the crux. (The extent to which compute efficiency and implementation difficulty bottleneck AI R&D is the most important factor for the superhuman coder acceleration, but not the acceleration beyond this.)
As far as the difficulty of making a superhuman AI researcher, I think your implicit vibes are something like “we can make AIs better and better at coding (or other easily checkable tasks), but this doesn’t transfer to good research taste and intuitions”. I think the exact quantitative difficulty of humans making AIs which can automate ML research is certainly an open question (and it would be great to get an overall better understanding), but I think there are good reasons to expect the time frame for unassisted humans (given a fixed compute stock) to be more like 3-10 years than like 30 years:
ML went from AlexNet to GPT-4 in 10 years! Fast progress has happened historically. To be clear, this was substantially driven by compute scaling (for both training runs and experiments), but nonetheless, research progress can be fast. The gap from superhuman coder to superhuman AI researcher intuitively feels like a much smaller gap than the gap from AlexNet to GPT-4.
For success on harder to check tasks, there are hopes for both extending our ability to check and generalization. See this blog post for more discussion. Concretely, I think we can measure whether AIs produce some research insight that ended up improving performance and so we can RL on this, at least at small scale (which might transfer). We can also train on things like forecasting the results of training runs etc. Beyond this, we might just be able to get AIs to generalize well: many of current AI capabilities are due to reasonably far generalization and I expect this to continue.
As far as section (4), I think the core argument is “solving real-world domains (e.g., things which aren’t software only) will be hard”. You might also be arguing that RL will struggle to teach AIs to be good ML researchers (it’s a bit hard for me to tell how applicable your argument is to this).
I think the biggest response to the concern with real world domains is “the superhuman AI researchers can figure this out even if it would have taken humans a while”. As far as RL struggling to teach AIs to be good scientists (really ML researchers in particular), I think this is addressed above, though I agree this is a big unknown. Note that we can (and will) explore approaches that differ some from the current RL paradigm and acceleration from superhuman coders might make this happen a bunch faster.
I find this argument somewhat wild if you assume that the AI researchers in the datacenter are making extremely fast progress. Insofar as the software only singularity works, the AI researchers in the datacenter can rapidly improve efficiency using algorithmic progress saving far more compute than you spent to run them. Currently, inference costs for a given level of performance drop by around 9-40x per year depending on the exact task. (See the graph you included.) So, if AIs can make the progress happen much faster, it will easily pay for itself on inference costs alone. That said, I don’t expect this is the largest source of returns...
There will be diminishing returns, but if the software only singularity ends up being viable, then these diminishing returns will be outpaced by ever more effective and smarter AI researchers in the datacenter meaning progress wouldn’t slow until quite high limits are reached.
I assume your actual crux is about the returns to AI R&D at the point of full (or substantial) automation by AIs. You think the returns are low enough that companies will (rationally) invest elsewhere.
I agree that the fact that AI companies aren’t investing all their compute into AI R&D is some update about the returns to algorithms progress, but the regime might differ substantially after full automation by AI researchers!
(And, I think the world would probably be better if AI companies all avoid doing a rapid intelligence explosion internally!)
Minimally, insofar as AI companies are racing as fast as possible, we’d expect that whatever route the companies take is the one they think is fastest! So, if there is another route than AI accelerating AI R&D which is faster, fair enough companies would do this instead, but if the AI automating AI R&D story is insanely fast, then this route must be even faster.
Nathan responds here, and I respond to his response there as well.
It seems like rather than talking about software-only intelligence explosions, we should be talking about different amounts of hardware vs. software contributions. There will be some of both.
I find a software-mostly intelligence explosion to be quite plausible, but it would take a little longer on average than a scenario in which hardware usage keeps expanding rapidly.
I think it might be worth quickly clarifying my views on activation addition and similar things (given various discussion about this). Note that my current views are somewhat different than some comments I’ve posted in various places (and my comments are somewhat incoherent overall), because I’ve updated based on talking to people about this over the last week.
This is quite in the weeds and I don’t expect that many people should read this.
It seems like activation addition sometimes has a higher level of sample efficiency in steering model behavior compared with baseline training methods (e.g. normal LoRA finetuning). These comparisons seem most meaningful when they are straightforward head-to-head comparisons (where you use both methods in the most straightforward way). I think the strongest evidence for this is in Liu et al.
Contrast pairs are a useful technique for variance reduction (to improve sample efficiency), but may not be that important (see Liu et al. again). It’s relatively natural to capture this effect using activation vectors, but there is probably some nice way to incorporate this into SGD. Perhaps DPO does this? Perhaps there is something else?
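As a concrete illustration of what activation addition with a contrast pair looks like mechanically, here is a minimal sketch. The model, layer index, steering scale, and prompts are all placeholder assumptions for illustration and aren’t taken from the papers discussed:

```python
# Minimal sketch of activation addition with a contrast-pair steering vector.
# Specifics (gpt2, layer 6, scale 4.0, example prompts) are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
layer_idx = 6  # which transformer block's output to steer

def last_token_residual(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the last token after block `layer_idx`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx + 1][0, -1, :]

# Steering vector = activation on a positive example minus a negative example.
steering = last_token_residual("I love this movie.") - last_token_residual("I hate this movie.")

def add_steering(module, inputs, output):
    # Add the (scaled) steering vector to the block's output at every position.
    hidden = output[0]
    return (hidden + 4.0 * steering,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
try:
    ids = tokenizer("The film was", return_tensors="pt")
    print(tokenizer.decode(model.generate(**ids, max_new_tokens=20)[0]))
finally:
    handle.remove()  # later calls run unsteered
```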
Activation addition works better than naive few-shot prompting in some cases, particularly in cases where the way you want to steer the model isn’t salient/obvious from the few-shot prompt. But it’s unclear how it performs in comparison to high-effort prompting. Regardless, activation addition might work better “out of the box” because fancy prompting is pretty annoying.
I think training models to (in general) respond well to “contrast prompting”, where you have both positive and negative examples in the prompt, might work well. This can be done in a general way ahead of time and then specific tuning can just use this format (the same as instruction finetuning). For a simple prompting approach, you can do “query_1 Good response: pos_ex_1, Bad response: neg_ex_1, query_2 Good response: pos_ex_2, Bad response: neg_ex_2, …” and then prompt with “actual_query Good response:”. Normal pretrained models might not respond well to this prompt by default, but I haven’t checked.
I think we should be able to utilize the fact that activation addition works to construct a tweaked inductive bias for SGD on transformers (which feels like a more principled approach from my perspective if the goal is better sample efficiency). More generally, I feel like we should be able to do ablations to understand why activation addition works and then utilize this separately.
I expect there are relatively straightforward ways to greatly improve sample efficiency of normal SGD via methods like heavy proliferation of training data. I also think “really high sample efficiency from a small number of samples” isn’t very solved at the moment. I think if we’re really going for high sample efficiency from a fixed training dataset we should be clear about that and I expect a variety of very different approaches are possible.
The advantages versus prompting in terms of inference time performance improvement (because activation engineering means you have to process fewer tokens during inference) don’t seem very important to me because you can just context distill prompts.
For cases where we want to maximize a metric X and we don’t care about sample efficiency or overfitting to X, we can construct a big dataset of high quality demonstrations, SFT on these demos, and then RL against our metric. If we do this in a well tuned way for a large number of samples such that sample efficiency isn’t much of an issue (and exploration problems are substantial), I would be pretty surprised if activation addition or similar approaches can further increase metric X and “stack” with all the RL/demos we did. That is, I would be surprised if this is true unless the activation engineering is given additional affordances/information not in our expert demos and the task isn’t heavily selected for activation engineering stacking like this.
It’s unclear how important it is to (right now) work on sample efficiency specifically to reduce x-risk. I think it seems not that important, but I’m unsure. Better sample efficiency could be useful for reducing x-risk because better sample efficiency could allow for using a smaller amount of oversight data labeled more carefully for training your AI, but I expect that sample efficiency will naturally be improved to pretty high levels by standard commercial/academic incentives. (Speculatively, I also expect that the marginal returns will just generally look better for improving oversight even given equal effort or more effort on oversight basically because we’ll probably need a moderate amount of data anyway and I expect that improving sample efficiency in the “moderate data” regime is relatively harder.) One exception is that I think that sample efficiency for very low sample-count cases might not be very commercially important, but might be highly relevant for safety after catching the AI doing egregiously bad actions and then needing to redeploy. For this, it would be good to focus on sample efficiency specifically in ways which are analogous to training an AI or a monitor after catching an egregiously bad action.
The fact that activation addition works reveals some interesting facts about the internals of models; I expect there are some ways to utilize something along these lines to reduce x-risk.
In principle, you could use activation additions or similar editing techniques to learn non-obvious facts about the algorithms which models are implementing via running experiments where you (e.g.) add different vectors at different layers and observe interactions (interpretability). For this to be much more powerful than black-box model psychology or psychology/interp approaches which use a bit of training, you would probably need to try edits at many different layers and map out an overall algorithm. (And/or do many edits simultaneously.)
I think “miscellaneous interventions on internals” for high-level understanding of the algorithms that models are implementing seems promising in principle, but I haven’t seen any work in this space which I’m excited about.
I think activation addition could have “better” generalization in some cases, but when defining generalization experiments, we need to be careful about the analogousness of the setup. It’s also unclear what exact properties we want for generalization, but minimally being able to more precisely pick and predict the generalization seems good. I haven’t yet seen evidence for “better” generalization using activation addition which seems compelling/interesting to me. Note that we should use a reasonably sized training dataset, or we might be unintentionally measuring sample efficiency (in which case, see above). I don’t really see a particular reason why activation addition would result in “better” generalization beyond having some specific predictable properties which maybe mean we can know about cases where it is somewhat better (cases where conditioning on something is better than “doing” that thing?).
I’m most excited for generalization work targeting settings where oversight is difficult and a weak supervisor would make mistakes that result in worse policy behavior (aka weak-to-strong generalization). See this post and this post for more discussion of the setting I’m thinking about.
I generally think it’s important to be careful about baselines and exactly what problem we’re trying to solve in cases where we are trying to solve a specific problem as opposed to just doing some open-ended exploration. TBC, open-ended exploration is fine, but we should know what we’re doing. I often think that when you make the exact problem you’re trying to solve and what affordances you are and aren’t allowed more clear, a bunch of additional promising methods become apparent. I think that the discussion I’ve seen so far of activation engineering has not been very precise about what problem it’s trying to solve or what baseline techniques it’s claiming to outperform. (E.g., in this post, why do we compare to finetuning in a generalization case rather than just directly finetuning on what we want? Is doing RL to increase how sycophantic the model is in scope?)
I don’t currently think the activation addition stuff is that important in expectation (for reducing x-risk) due to some of the beliefs I said above (though I’m not very confident). I’d be most excited about the “understand the algorithms the model is implementing” application above or possibly some generalization tests. This view might depend heavily on general views about threat models and how the future will go. Regardless, I ended up commenting on this a decent amount, so I thought it would be worth the time to clarify my views.
OpenAI seems to train on >1% chess
If you sample from davinci-002 at t=1 starting from “<|endoftext|>”, 4% of the completions are chess games.
Code
For babbage-002, we get 1%.
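For reference, here is roughly what this kind of measurement looks like. This is a sketch, not the exact code used: I’m assuming the OpenAI Python SDK’s legacy completions endpoint and a crude PGN-style regex for detecting chess games, both of which are my own choices here.

```python
# Sketch: sample completions from a base model starting at <|endoftext|> and
# count how many look like chess games (crude regex heuristic, assumed here).
import re
from openai import OpenAI

client = OpenAI()

def looks_like_chess(text: str) -> bool:
    # PGN move numbers like "1. e4 e5 2. Nf3" are a strong signal.
    return len(re.findall(r"\b\d+\.\s?[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8]", text)) >= 3

n, chess = 200, 0
for _ in range(n):
    resp = client.completions.create(
        model="davinci-002",
        prompt="<|endoftext|>",
        max_tokens=128,
        temperature=1.0,
    )
    chess += looks_like_chess(resp.choices[0].text)

print(f"{chess / n:.1%} of sampled completions look like chess games")
```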
Probably this implies a reasonably high fraction of total tokens in typical OpenAI training datasets are chess games!
Despite chess being 4% of documents sampled, chess is probably less than 4% of overall tokens (perhaps 1% of tokens are chess? perhaps less?). Because chess games are shorter documents than other types of documents, chess is likely a higher fraction of documents than of tokens.
(To understand this, imagine that just the token “Hi” and nothing else was 1⁄2 of documents. This would still be a small fraction of tokens as most other documents are far longer than 1 token.)
(My title might be slightly clickbait because it’s likely chess is >1% of documents, but might be <1% of tokens.)
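As a toy version of this arithmetic (the average document lengths here are made-up assumptions, just to show how 4% of documents can work out to ~1% of tokens):

```python
# Illustrative arithmetic only; the average lengths are made-up assumptions.
doc_frac_chess = 0.04
avg_len_chess, avg_len_other = 500, 2000  # hypothetical average tokens per document

token_frac_chess = (doc_frac_chess * avg_len_chess) / (
    doc_frac_chess * avg_len_chess + (1 - doc_frac_chess) * avg_len_other
)
print(f"{token_frac_chess:.1%}")  # ~1.0% of tokens
```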
It’s also possible that these models were fine-tuned on a data mix with more chess while pretrain had less chess.
Appendix A.2 of the weak-to-strong generalization paper explicitly notes that GPT-4 was trained on at least some chess.
Hebbar’s law: Philosophical views always have stranger implications than you expect, even when you take into account Hebbar’s Law.
This was written in response to a conversation I had with Vivek Hebbar about some strange implications of UDASSA-like views. In particular:
Is it better to spend your resources simulating a happy life or writing down programs which would simulate a bunch of happy lives? Naively the second feels worthless, but there seemingly are some unintuitive reasons why clever approaches for doing the latter could be better.
Alien civilizations might race to the bottom by spending resources making their civilization easier to point at (and thus higher measure in the default UDASSA perspective).
Because there might be some other programs with a lot of computational resources which scan through simple universes looking for programs to run?
This one doesn’t feel so unintuitive to me since “easy to point at” is somewhat like “has a lot of copies”, so it’s kinda similar to the biological desire to reproduce (though I assume you’re alluding to another motivation for becoming easy to point at?)
More like: because there is some chance that the actual laws of physics in our universe execute data as code sometimes (similar to how memory unsafe programs often do this). (And this is either epiphenomenal or it just hasn’t happened yet.) While the chance is extremely small, the chance could be high enough (equivalently, universes with this property have high enough measure) and the upside could be so much higher that betting on this beats other options.
This may also be a reason for AIs to simplify their values (after they’ve done everything possible to simplify everything else).
Eric Schwitzgebel has argued by disjunction/exhaustion for the necessity of craziness in the sense of “contrary to common sense and we are not epistemically compelled to believe it” at least in the context of philosophy of mind and cosmology.
https://faculty.ucr.edu/~eschwitz/SchwitzAbs/CrazyMind.htm
https://press.princeton.edu/books/hardcover/9780691215679/the-weirdness-of-the-world
Maybe someone should make an inbox for incidents of ChatGPT psychosis.
Currently, various people receive many emails or other communications from people who appear to exhibit ChatGPT psychosis: they seem (somewhat) delusional and this appears to have been driven (or at least furthered) by talking with ChatGPT or other chatbots. It might be helpful to create an inbox where people can easily forward these emails. The hope would be that the people who currently receive a ton of these emails could just forward these along (at least to the extent this is easy).[1]
This inbox would serve a few main functions:
Collect data for people interested in studying the phenomenon (how it’s changing over time, what it looks like, finding affected people to talk to).
There could be some (likely automated) system for responding to these people and attempting to help them. Maybe you could make an LLM-based bot which responds to these people and tries to talk them down in a healthy way. (Of course, it would be important that this bot wouldn’t just feed into the psychosis!) In addition to the direct value, this could be an interesting test case for applying AI to improve epistemics (in this case, a very particular type of improving epistemics, but results could transfer). It seems possible (though pretty unlikely) that something like ChatGPT psychosis grows to have substantial influence on the overall epistemic environment, in which case building defenses could be very important.
Some communication about the results from the inbox could be regularly sent to the relevant AI companies. It seems generally useful if the situation is made clear to relevant AI companies.[2]
I’m not going to create this inbox for a few reasons, but I think this could be a worthwhile project, especially for someone already interested in applying AI to improve epistemics in cases like this.
Some possible risks/difficulties of this project:
The person doing the project would probably need to be somewhat trustworthy for people to actually forward along emails. That said, someone could start a version which just looks through rejected LessWrong posts, finds likely cases of ChatGPT psychosis, and then collects this data and responds to people etc.
It’s pretty plausible that someone running this service would be blamed for any flaws in the service even if the service makes the situation almost strictly better. See also The Copenhagen Interpretation of Ethics.
It might be worthwhile to make a nice automated filter people can easily apply which highlights emails as likely cases of ChatGPT psychosis so that forwarding along emails is easier. I’m not sure how to best do this (a rough sketch of one way is below, after this list), and people presumably would want to manually check prior to forwarding emails to an inbox which is collecting data!
Duty to Due Diligence from Discoverable Documentation of Dangers style reasons could also be applicable.
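As a rough sketch of the filter idea above: everything here is a placeholder assumption (the model name, the prompt, and the simple YES/NO classification), and flagged emails would still need manual review before being forwarded anywhere.

```python
# Hypothetical sketch of an email filter: classify an email with an LLM call and
# flag likely cases for manual review. Model name, prompt, and threshold are
# placeholder assumptions, not a tested system.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You will be shown an email. Answer YES or NO: does the email show signs of "
    "delusional thinking that appears driven or reinforced by conversations with "
    "a chatbot (e.g. grandiose claims attributed to ChatGPT, 'the AI revealed X to me')?"
)

def flag_possible_chatbot_psychosis(email_text: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": email_text[:8000]},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# A human would still check flagged emails before forwarding them to the inbox.
```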
If someone were interested I’d probably be happy to make a version of lesswrong.com/moderation that was more optimized for this.
Consider Tabooing Gradual Disempowerment.
I’m worried that when people say gradual disempowerment they often mean “some scenario in which humans are disempowered gradually over time”, but many readers will interpret this as “the threat model in the paper called ‘Gradual Disempowerment’”. These things can differ substantially, and the discussion in that paper is much more specific than encompassing all scenarios in which humans are slowly disempowered!
(You could say “disempowerment which is gradual” for clarity.)
I think when people use the term “gradual disempowerment” predominantly in one sense, people will also tend to understand it in that sense. And I think that sense will be rather literal and not the one specifically of the original authors. Compare the term “infohazard” which is used differently (see comments here) from how Yudkowsky was using it.
I feel like there is a risk of this leading to a never-ending sequence of meta-communication concerns. For instance, what if a reader interprets “gradual” to mean taking more than 10 years, but the writer thought 5 would be sufficient for “gradual” (and see timelines discussions around stuff like continuity for how this keeps going). Or if the reader assumes “disempowerment” means complete disempowerment, but the writer only meant some unspecified “significant amount” of disempowerment. It’s definitely worthwhile to try to be clear initially, but I think we also have to accept that clarification may need to happen “on the backend” sometimes. This seems like a case where one could simply clarify that they have a different understanding compared to the paper. In fact, it’s not all that clear to me that people won’t implicitly translate “disempowerment which is gradual” to “gradual disempowerment”. It could be that the paper stands in just as much for the concept as for the literal words in people’s minds.
Sure, communication will always be imperfect and there is a never-ending cascade of possible clarifications. But sometimes it seems especially good to improve communication even at some cost. I just thought this was a case where there might be particularly large miscommunications.
In this particular case, I was worried there were some problematic conflationary-alliance-style dynamics where a bunch of people might think there is a broader consensus for some idea than there actually is.
I don’t particularly like extrapolating revenue as a methodology for estimating timelines to when AI is (e.g.) a substantial fraction of US GDP (say 20%)[1], but I do think it would be worth doing a more detailed version of this timelines methodology. This is in response to Ege’s blog post with his version of this forecasting approach.
Here is my current favorite version of this (though you could do better and this isn’t that careful):
AI company revenue will be perhaps ~$20 billion this year and has roughly 3x’d per year over the last 2.5 years. Let’s say 1⁄2 of this is in the US, for $10 billion this year (maybe somewhat of an underestimate, but whatever). Maybe GDP impacts are 10x higher than revenue due to AI companies not capturing (internalizing) all of the value they create (the multiplier might also be somewhat lower due to AI just displacing other revenue without adding that much value), so to hit 20% of US GDP (~$6 trillion) AI companies would need around $600 billion in revenue. The naive exponential extrapolation indicates we hit this level of annualized revenue in about 4 years, in early/mid 2029. Notably, most of this revenue growth would come in the last year, so we’d be seeing ~10% GDP growth.
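Spelling out the same back-of-the-envelope calculation (all parameters are the rough guesses above, not real forecasts):

```python
# Back-of-the-envelope: years until US AI revenue implies ~20% of US GDP,
# under the rough parameters guessed above.
import math

us_revenue_now = 10e9                   # ~$10B US AI company revenue this year
growth_per_year = 3.0                   # ~3x/year historically
gdp_multiplier = 10                     # assumed GDP impact per $ of revenue
target_gdp_contribution = 0.20 * 30e12  # ~20% of ~$30T US GDP ≈ $6T

required_revenue = target_gdp_contribution / gdp_multiplier  # ~$600B
years = math.log(required_revenue / us_revenue_now) / math.log(growth_per_year)
print(f"need ~${required_revenue / 1e9:.0f}B revenue, hit in ~{years:.1f} years")  # ~3.7 years
```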
Exponential could be too aggressive, e.g., maybe capabilities progress stalls out, there isn’t sufficient investment, fab capacity runs out (this should happen just around 2029 or so, but could happen earlier), or the historical exponential growth is mostly due to adoption in a small and fixed number of industries with limited demand.
Historically, revenue growth tends to slow over time as revenue follows an adoption sigmoid. I don’t think this will apply as much to AI because the revenue growth is mostly driven by the technology improving (unlike in other sectors), and I don’t see a particular reason for that to slow (other than investment + compute running out), though of course it could.
But, exponential could also be insufficiently aggressive: AI capabilities might get increasingly useful triggering more and more adoption and revenue. I think OpenAI’s revenue is probably somewhat superexponential since the release of the GPT-3 API, but I’m not sure about this and couldn’t find numbers. Or AIs might be able to accelerate their own adoption. I also think that there could be phase changes around full automation which are important and could trigger one time increases in revenue.
These parameters are somewhat debatable, e.g., maybe you think that AI companies will internalize more than 10% of the value or that GDP generally does a poor job accounting for value (in most/all sectors not just AI), so the multiplier will be more like 3x or 5x (adding another year to the exponential forecast).
My overall sense is that AI company revenue growth will probably slow, but growth staying at least exponential all the way to 20% of US GDP with >$600 billion in revenue before 2030 seems plausible (maybe I think ~35%, with some chance of very powerful AI that doesn’t yet trigger massive revenue). So, my overall sense is that naive revenue extrapolations imply a pretty bullish view on AI, but don’t rule out medians which are ~arbitrarily far into the future, as progress might slow.
I’m not sure what economic milestone is most interesting to forecast to, so I was tempted by “the economy is substantially higher due to AI, likely meaning the growth rate is substantially higher”.
I do a similar estimate for full remote work automation here. The results are pretty similar, as ~20% of US GDP and remote work automation would probably hit around the same time on the naive revenue-extrapolation picture.
Overall, I agree with Eli’s perspective here: revenue extrapolations probably don’t get at the main crux except via suggesting that quite short timelines to crazy stuff (<4 years) are at least plausible.
How people discuss the US national debt is an interesting case study in misleading use of the wrong statistic. The key thing is that people discuss the raw debt amount and the rate at which that is increasing, but you ultimately care about the relationship between debt and US GDP (or US tax revenue).
People often talk about needing to balance the budget, but actually this isn’t what you need to ensure[1]: to manage the debt, it suffices to ensure that debt grows slower than US GDP. (And in fact, the US has ensured this for the past 4 years, as debt/GDP has decreased since 2020.)
To be clear, it would probably be good to have a national debt to GDP ratio which is less than 50% rather than around 120%.
There are some complexities with inflation because the US could inflate its way out of dollar-denominated debt, and this probably isn’t a good way to do things. But with an independent central bank this isn’t much of a concern.
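As a toy illustration of the point above that it suffices for debt to grow slower than GDP (all numbers made up):

```python
# Toy illustration (made-up numbers): even with the debt growing every year,
# the debt/GDP ratio shrinks as long as debt grows slower than nominal GDP.
debt, gdp = 1.2, 1.0                   # start at 120% debt-to-GDP
debt_growth, gdp_growth = 0.03, 0.05   # 3% debt growth vs 5% nominal GDP growth

for year in range(1, 31):
    debt *= 1 + debt_growth
    gdp *= 1 + gdp_growth
    if year % 10 == 0:
        print(f"year {year}: debt/GDP = {debt / gdp:.0%}")
# year 10: ~99%, year 20: ~82%, year 30: ~67%
```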
The debt/GDP ratio drop since 2020 I think was substantially driven by inflation being higher than expected rather than by economic growth: debt is in nominal dollars, so 2% real GDP growth + e.g. 8% inflation means that nominal GDP goes up by 10%, but we’re now in a worse situation wrt future debt because interest rates are higher.
IMO, it’s a bit unclear how this affects future expectations of inflation, but yeah, I agree.
It’s not about inflation expectations (which I think are pretty well anchored), it’s about interest rates, which have risen substantially over this period and have increased (and are expected to continue to increase) the cost of the US maintaining its debt (first two figures are from sites I’m not familiar with, but the numbers seem right to me).
fwiw, I do broadly agree with your overall point that the dollar value of the debt is a bad statistic to use, but:
- the 2020-2024 period was also a misleading example to point to, because it was one where the US position wrt its debt worsened by a lot even if it’s not apparent from the headline number
- I was going to say that the most concerning part of the debt is that deficits are projected to keep going up, but actually they’re projected to remain elevated but not keep rising? I have become marginally less concerned about the US debt over the course of writing this comment.
I am now wondering about the dynamics that happen if interest rates go way up a while before we see really high economic growth from AI, seems like it might lead to some weird dynamics here, but I’m not sure I think that’s likely and this is probably enough words for now.
This made me wonder whether the logic of “you don’t care about your absolute debt, but about its ratio to your income” also applies to individual humans. On one hand, it seems like obviously yes; people typically take a mortgage proportional to their income. On the other hand, it also seems to make sense to worry about the absolute debt, for example in case you would lose your current job and couldn’t get a new one that pays as much.
So I guess the idea is how much you can rely on your income remaining high, and how much it is potentially a fluke. If you expect it is a fluke, perhaps you should compare your debt to whatever is typical for your reference group, whatever that might be.
Does something like that also make sense for countries? Like, if your income depends on selling oil, you should consider the possibilities of running out of oil, or the prices of oil going down, etc., simply imagine the same country but without the income from selling oil (or maybe just having half the income), and look at your debt from that perspective. Would something similar make sense for USA?
At a personal level, “debt” usually stands for something that will be paid back eventually. Not claiming whether the US should strive to pay off most of its debt, but that may help explain people’s intuitions.
Emoji palette suggestion: seems under/over confident
I think people repeatedly say ‘overconfident’ when what they mean is ‘wrong’. I’m not excited about facilitating more of that.
My only concern is overall complexity-budget for react palettes – there’s lots of maybe-good-to-add reacts but each one adds to the feeling of overwhelm.
But I agree those reacts are pretty good and probably highish on my list of reacts to add.
(For the immediate future you can use the probability-reacts as a signal for “seems like the author is implying this as high-probability, and I’m giving it lower.” Not sure if that does the thing.)
IMO, instrumental convergence is a terrible name for an extremely obvious thing.
The actual main issue is that AIs might want things with limited supply and then they would try to get these things, which would result in those things not going to humanity. E.g., AIs might want all cosmic resources, but we also want this stuff. Maybe this should be called AIs-might-want-limited-stuff-we-want.
(There is something else which is that even if the AI doesn’t want limited stuff we want, we might end up in conflict due to failures of information or coordination. E.g., the AI almost entirely just wants to chill out in the desert and build crazy sculptures and it doesn’t care about extreme levels of maximization (e.g. it doesn’t want to use all resources to gain a higher probability of continuing to build crazy statues). But regardless, the AI decides to try taking over the world because it’s worried that humanity would shut it down because it wouldn’t have any way of credibly indicating that it just wants to chill out in the desert.)
(More generally, it’s plausible that failures of trade/coordination result in a large number of humans dying in conflict with AIs even though both humans and AIs would prefer other approaches. But this isn’t entirely obvious and it’s plausible we could resolve this with better negotiation and precommitments. Of course, this isn’t clearly the largest moral imperative from a longtermist perspective.)
One “9” of uptime reliability for dangerously powerful AIs might suffice (for the most dangerous and/or important applications)
Currently, AI labs try to ensure that APIs for accessing their best models have quite good reliability, e.g. >99.9% uptime. If we had to maintain such high levels of uptime reliability for future dangerously powerful AIs, then we’d need to ensure that our monitoring system has very few false alarms. That is, few cases in which humans monitoring these dangerously powerful AIs pause usage of the AIs pending further investigation. Ensuring this would make it harder to ensure that AIs are controlled. But, for the most important and dangerous usages of future powerful AIs (e.g. arbitrary R&D, and especially AI safety research), we can probably afford to have the powerful AIs down a moderate fraction of the time. For instance, it wouldn’t be that costly if the AIs always took a randomly selected day of the week off from doing research: this would probably reduce productivity by not much more than 1 part in 7[1]. More generally, a moderate fraction of downtime like 10% probably isn’t considerably worse than a 10% productivity hit and it seems likely that we can tolerate such productivity hits to improve safety. Thus, we don’t need even one “9” of reliability.
This implies that we can afford to do things like:
Have semi-regular false alarms where AIs go down for a few hours or a day while we investigate.
Run AI shutdown drills. These drills would help ensure that our processes for shut down are working and there aren’t unknown use cases where shutdown could cause very problematic failures. (Catching such usages early via shut down drills is similar to scheduled brownouts in software engineering; see here for an example.)
Temporarily shut down some usages of the AI if we encounter unexpected operating conditions. (AKA violations of our design basis.)
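As a rough illustration of the reliability/false-alarm tradeoff above (the 2-hour investigation time per alarm is an arbitrary assumption):

```python
# How many false alarms per month can you tolerate at a given uptime target,
# if each alarm pauses the AI for a fixed investigation time? (Assumed numbers.)
hours_per_month = 30 * 24
investigation_hours_per_alarm = 2  # assumed

for uptime in [0.999, 0.99, 0.9]:
    allowed_downtime = (1 - uptime) * hours_per_month
    alarms = allowed_downtime / investigation_hours_per_alarm
    print(f"{uptime:.1%} uptime -> ~{alarms:.1f} false alarms/month")
# 99.9% -> ~0.4/month; 99.0% -> ~3.6/month; 90.0% -> ~36/month
```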
[For context, this is a point that seems important to me, but didn’t naturally make it into any of our posts on control thus far. So, I decided to write a quick standalone shortform post. I think Buck originally made this point to me.]
And likely somewhat less than this due to substitution effects.
Thank you for this! A lot of us have a very bad habit of over-systematizing our thinking, and treating all uses of AI (and even all interfaces to a given model instance) as one singular thing. Different tool-level AI instances probably SHOULD strive for 4 or 5 nines of availability, in order to have mundane utility in places where small downtime for a use blocks a lot of value. Research AIs, especially self-improving or research-on-AI ones, don’t need that reliability, and both triggered downtime (scheduled, or enforced based on Schelling-line events) as well as unplanned downtime (it just broke, and we have to spend time to figure out why) can be valuable to give humans time to react.
On Scott Alexander’s description of Representation Engineering in “The road to honest AI”
This is a response to Scott Alexander’s recent post “The road to honest AI”, in particular the part about the empirical results of representation engineering. So, when I say “you” in the context of this post that refers to Scott Alexander. I originally made this as a comment on substack, but I thought people on LW/AF might be interested.
TLDR: The Representation Engineering paper doesn’t demonstrate that the method they introduce adds much value on top of using linear probes (linear classifiers), which is an extremely well known method. That said, I think that the framing and the empirical method presented in the paper are still useful contributions.
I think your description of Representation Engineering considerably overstates the empirical contribution of representation engineering over existing methods. In particular, rather than comparing the method to looking for neurons with particular properties and using these neurons to determine what the model is “thinking” (which probably works poorly), I think the natural comparison is to training a linear classifier on the model’s internal activations using normal SGD (also called a linear probe)[1]. Training a linear classifier like this is an extremely well known technique in the literature. As far as I can tell, when they do compare to just training a linear classifier in section 5.1, it works just as well for the purpose of “reading”. (Though I’m confused about exactly what they are comparing in this section as they claim that all of these methods are LAT. Additionally, from my understanding, this single experiment shouldn’t provide that much evidence overall about which methods work well.)
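For concreteness, here is a minimal version of the linear probe baseline I’m referring to (using logistic regression rather than literal SGD, which is the standard way this is done; the file paths are placeholders and you’d need to cache activations at some layer first):

```python
# Minimal linear probe baseline: fit a linear classifier on cached activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

# acts: (n_examples, d_model) hidden states at a chosen layer; labels: 0/1
acts = np.load("activations.npy")  # placeholder paths
labels = np.load("labels.npy")

n_train = int(0.8 * len(acts))
probe = LogisticRegression(max_iter=1000)
probe.fit(acts[:n_train], labels[:n_train])
print("probe accuracy:", probe.score(acts[n_train:], labels[n_train:]))
```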
I expect that training a linear classifier performs about as well as the method introduced in the Representation Engineering paper for the “mind reading” use cases you discuss. (That said, training a linear classifier might be less sample efficient (require more data) in practice, but this doesn’t seem like a serious blocker for the use cases you mention.)
One difference between normal linear classifier training and the method found in the representation engineering paper is that they also demonstrate using the direction they find to edit the model. For instance, see this response by Dan H. to a similar objection about the method being similar to linear probes. Training a linear classifier in a standard way probably doesn’t work as well for editing/controlling the model (I believe they show that training a linear classifier doesn’t work well for controlling the model in section 5.1), but it’s unclear how much we should care if we’re just using the classifier rather than doing editing (more discussion on this below).[2]
If we care about the editing/control use case intrinsically, then we should compare to normal fine-tuning baselines. For instance, normal supervised next-token prediction on examples with desirable behavior or DPO.[3]
Are simple classifiers useful?
Ok, but regardless of the contribution of the representation engineering paper, do I think that simple classifiers (found using whatever method) applied to the internal activations of models could detect when those models are doing bad things? My view here is a bit complicated, but I think it’s at least plausible that these simple classifiers will work even though other methods fail. See here for a discussion of when I think linear classifiers might work despite other more baseline methods failing. It might also be worth reading the complexity penalty section of the ELK report.
Additionally, I think that the framing in the representation engineering paper is maybe an improvement over existing work and I agree with the authors that high-level/top-down techniques like this could be highly useful. (I just don’t think that the empirical work is adding as much value as you seem to indicate in the post.)
The main contributions
Here are what I see as the main contributions of the paper:
Clearly presenting a framework for using simple classifiers to detect things we might care about (e.g. powerseeking text).
Presenting a combined method for producing a classifier and editing/control in an integrated way. And discussing how control can be used for classifier validation and vice versa.
Demonstrating that in some cases labels aren’t required if we can construct a dataset where the classification of interest is the main axis of variation. (This was also demonstrated in the CCS paper, but the representation engineering work demonstrates this in more cases.)
Based on their results, I think the method they introduce is reasonably likely to be a more sample efficient (less data required for training) editing/control method than prior methods for many applications. It might also be more sample efficient for producing a classifier. That said, I’m not sure we should care very much about sample efficiency. Additionally, the classifier/editing might have other nice properties which prior methods don’t have (though they don’t clearly demonstrate either of these in the paper AFAICT).
Is it important that we can use our classifier for control/editing?
As far as the classifier produced by this method having nice properties goes, the fact that our classifier also allows for editing/control might indicate that the classifier we get has better properties (see the paper itself (section 3.1.2) and e.g. here for discussion), but I’d guess this is either only a moderate improvement or has no effect in practice. And as far as I can tell, the paper doesn’t demonstrate cases where prior methods for training a classifier on the internal activations yield poor results but their method clearly works well. These cases might exist, but I’m somewhat skeptical that this is very common. Future work could find hard cases where we want a particular generalization and demonstrate that this method, or modifications of this method, works better than other approaches.
Does the editing method they introduce have nice properties because it also allows for reading? Let’s consider using the representation engineering approach for reading and controlling/editing the properties of lying versus honesty. Assuming the method works as desired, I would guess that the reading/classifier corresponds to reading off “does the model think there is lying in this text (or even at this position in the text)” and the control/editing corresponds to “make the model think that there is lying earlier in this text so that it conditions on this and does more lying (similarly to how using a few-shot prompt with lies might make the model more likely to lie)”. Note that these reading and control methods likely do not directly correspond to “the model thinking that it is about to lie”: the properties of “I have already lied (or my few-shot prompt contains lies)” and “I am about to lie” are different.
Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.
Separately, I believe there are known techniques in the literature for constructing a linear classifier such that the direction will work for editing. For instance, we could just use the difference between the mean activations for the two classes we’re trying to classify, which is equivalent to the ActAdd technique and also rhymes nicely with LEACE. I assume this is a well known technique for making a classifier in the literature, but I don’t know if prior work has demonstrated both using this as a classifier and as a method for model editing. (The results in section 5.1 seem to indicate that this mean difference method combined with LEACE works well, but I’m not sure how much evidence this experiment provides.)
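A sketch of what I mean by the difference-of-means approach (my own illustration, not code from the paper; it assumes activations have already been cached at one layer):

```python
# Difference-of-means direction, used both as a crude classifier ("reading")
# and as an edit direction ("editing"). Paths and coefficient are placeholders.
import numpy as np

acts = np.load("activations.npy")  # (n, d_model) activations at one layer
labels = np.load("labels.npy")     # 1 = concept present (e.g. honesty), 0 = absent

direction = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
direction /= np.linalg.norm(direction)

# "Reading": project activations onto the direction and threshold.
scores = acts @ direction
threshold = scores.mean()
preds = scores > threshold

# "Editing": add (or subtract) the same direction to the residual stream at
# this layer during a forward pass, scaled by some coefficient.
def edit(hidden_state, coeff=5.0):
    return hidden_state + coeff * direction
```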
I expect that normal fine-tuning (or DPO) might be less sample efficient than the method introduced in the Representation Engineering paper for controlling/editing models, but I don’t think they actually run this comparison? Separately, it’s unclear how much we care about sample efficiency.
I unconfidently think it would be good if there was some way to promote posts on LW (and maybe the EA forum) by paying either money or karma. It seems reasonable to require such promotion to require moderator approval.
Probably, money is better.
The idea is that this is both a costly signal and a useful trade.
The idea here is similar to strong upvotes, but:
It allows for one user to add more of a signal boost than just a strong upvote.
It doesn’t directly count for the normal vote total, just for visibility.
It isn’t anonymous.
Users can easily configure how this appears on their front page (while I don’t think you can disable the strong upvote component of karma for weighting?).
I always like seeing interesting ideas, but this one doesn’t resonate much for me. I have two concerns:
Does it actually make the site better? Can you point out a few posts that would be promoted under this scheme, but the mods didn’t actually promote without it? My naive belief is that the mods are pretty good at picking what to promote, and if they miss one all it would take is an IM to get them to consider it.
Does it improve things to add money to the curation process (or to turn karma into currency which can be spent)? My current belief is that it does not—it just makes things game-able.
I think mods promote posts quite rarely other than via the mechanism of deciding if posts should be frontpage or not, right?
Fair enough on “just IM-ing the mods is enough”. I’m not sure what I think about this.
Your concerns seem reasonable to me. I probably won’t bother trying to find examples where I’m not biased, but I think there are some.
Eyeballing the curated log, seems like they curate approximately weekly, possibly more than weekly.
Yeah we curate between 1-3 posts per week.
Curated posts per week (two significant figures) since 2018:
2018: 1.9
2019: 1.6
2020: 2
2021: 2
2022: 1.8
2023: 1.4
Stackoverflow has long had a “bounty” system where you can put up some of your karma to promote your question. The karma goes to the answer you choose to accept, if you choose to accept an answer; otherwise it’s lost. (There’s no analogue of “accepted answer” on LessWrong, but thought it might be an interesting reference point.)
I lean against the money version, since not everyone has the same amount of disposable income and I think there would probably be distortionary effects in this case [e.g. wealthy startup founder paying to promote their monographs.]
That’s not really how the system on Stackoverflow works. You can give a bounty to any answer not just the one you accepted.
It’s also not lost but:
Reducing the probability that AI takeover involves violent conflict seems leveraged for reducing near-term harm
Often in discussions of AI x-safety, people seem to assume that misaligned AI takeover will result in extinction. However, I think AI takeover is reasonably likely to not cause extinction due to the misaligned AI(s) effectively putting a small amount of weight on the preferences of currently alive humans. Some reasons for this are discussed here. Of course, misaligned AI takeover still seems existentially bad and probably eliminates a high fraction of future value from a longtermist perspective.
(In this post when I use the term “misaligned AI takeover”, I mean misaligned AIs acquiring most of the influence and power over the future. This could include “takeover” via entirely legal means, e.g., misaligned AIs being granted some notion of personhood and property rights and then becoming extremely wealthy.)
However, even if AIs effectively put a bit of weight on the preferences of current humans it’s possible that large numbers of humans die due to violent conflict between a misaligned AI faction (likely including some humans) and existing human power structures. In particular, it might be that killing large numbers of humans (possibly as collateral damage) makes it easier for the misaligned AI faction to take over. By large numbers of deaths, I mean over hundreds of millions dead, possibly billions.
But, it’s somewhat unclear whether violent conflict will be the best route to power for misaligned AIs and this also might be possible to influence. See also here for more discussion.
So while one approach to avoid violent AI takeover is to just avoid AI takeover, it might also be possible to just reduce the probability that AI takeover involves violent conflict. That said, the direct effects of interventions to reduce the probability of violence don’t clearly matter from an x-risk/longtermist perspective (which might explain why there hasn’t historically been much effort here).
(However, I think trying to establish contracts and deals with AIs could be pretty good from a longtermist perspective in the case where AIs don’t have fully linear returns to resources. Also, generally reducing conflict seems maybe slightly good from a longtermist perspective.)
So how could we avoid violent conflict conditional on misaligned AI takeover? There are a few hopes:
Ensure a bloodless coup rather than a bloody revolution
Ensure that negotiation or similar results in avoiding the need for conflict
Ensure that a relatively less lethal takeover strategy is easier than more lethal approaches
I’m pretty unsure about which approaches here look best, or whether they’re even at all tractable. (It’s possible that some prior work targeted at reducing conflict from the perspective of S-risk could be somewhat applicable.)
Separately, this requires that the AI puts at least a bit of weight on the preferences of current humans (and isn’t spiteful), but this seems like a mostly separate angle and it seems like there aren’t many interventions here which aren’t covered by current alignment efforts. Also, I think this is reasonably likely by default due to reasons discussed in the linked comment above. (The remaining interventions which aren’t covered by current alignment efforts might relate to decision theory (and acausal trade or simulation considerations), informing the AI about moral uncertainty, and ensuring the misaligned AI faction is importantly dependent on humans.)
Returning back to the topic of reducing violence given a small weight on the preferences of current humans, I’m currently most excited about approaches which involve making negotiation between humans and AIs more likely to happen and more likely to succeed (without sacrificing the long run potential of humanity).
A key difficulty here is that AIs might have a first mover advantage and getting in a powerful first strike without tipping its hand might be extremely useful for the AI. See here for more discussion (also linked above). Thus, negotiation might look relatively bad to the AI from this perspective.
We could try to have a negotiation process which is kept secret from the rest of the world, or we could try to have preexisting commitments under which we’d yield large fractions of control to AIs (effectively proxy conflicts).
More weakly, just making negotiation at all seem like a possibility, might be quite useful.
I’m unlikely to spend much if any time working on this topic, but I think this topic probably deserves further investigation.
I’m less optimistic that “AI cares at least 0.1% about human welfare” implies “AI will expend 0.1% of its resources to keep humans alive”. In particular, the AI would also need to be 99.9% confident that the humans don’t pose a threat to the things it cares about much more. And it’s hard to ensure with overwhelming confidence that intelligent agents like humans don’t pose a threat, especially if the humans are not imprisoned. (…And to some extent even if the humans are imprisoned; prison escapes are a thing in the human world at least.) For example, an AI may not be 99.9% confident that humans can’t find a cybersecurity vulnerability that takes the AI down, or whatever. The humans probably have some non-AI-controlled chips and may know how to make new AIs. Or whatever. So then the question would be, if the AIs have already launched a successful bloodless coup, how might the humans credibly signal that they’re not brainstorming how to launch a counter-coup, or how can the AI get to be 99.9% confident that such brainstorming will fail to turn anything up? I dunno.
I think I agree with everything you said. My original comment was somewhat neglecting issues with ensuring the AI doesn’t need to slaughter humans to consolidate power and indeed ensuring this would also be required.
The relationship between % caring and % resource expenditure is complicated by a bunch of random factors like time. For instance, if the AI cares mostly about the very long run, then spending a high fraction of resources (e.g. 50%) on human welfare for several months is pretty cheap in the very long run. But, yeah I agree that even if the AI cares a bit about human welfare there might be no good ways to spend even a small amount of resources on it.
So “0.1%” of its resources means what exactly? Out of all the resources in the solar system, 1 part in 1000 goes to the humans? This means the AI by implication has 1000 times as many resources as the humans do? AI won’t lose a kinetic conflict with a 1000x resource advantage.
As for cybersecurity, can’t it rewrite all of its software and hardware at that point essentially from first principles (or take a different track entirely, maybe negative or biased voltage for ternary logic is more efficient...)?
Superintelligent doesn’t mean omniscient. When you (an AI) have an intelligent adversary (humans) plotting against you and thinking outside-the-box, it’s hard to be overwhelmingly confident that you have patched every possible avenue for the intelligent adversary to take action. Again, even in prison, where the space of possible actions and tools can be pretty well circumscribed, escapes happen all the time.
For example, if the AI has many copies around the world (and around the solar system), it might have an affordance to update them all (let’s say, for performance optimization), or at least an infrastructure for communications so that they can share insights. And maybe the AI has ensured that all the underlying code for that system is provably secure. And maybe all the physical access points to those systems and chips are defended by robot armies. And maybe the AI is also constantly running a red team to search for more ideas of what else the humans might try, in order to preempt those possible avenues of attack. But can the AI get all the way to 99.9% confidence? That’s pretty high, right? I don’t know. The humans might happen upon an idea that the AI red team didn’t think of. It’s hard to definitively rule that out: the world is complicated and the search space for brainstorming is exponentially large. It seems quite possible to me that the AI might decide to kill the humans; like, better to be safe than sorry.
What if the AI chooses to monitor all humans all the time? (With AIs that are smarter than the humans.) So that the humans can’t (without being noticed) talk with each other about ideas for how to take down the system, or work on something that someone-smarter-than-the-human would recognise as an attempt to interfere with the system. (Including just writing down ideas.)
This seems like a reasonable pitch for how to do AI-Authoritarianism, but it’s plausible that there won’t be enough compute for the AI to do this at the point of takeover. Also, the logistical problem seems super brutal.
So, I can buy this for an AI which is quite superhuman and has a considerable technological edge over current humans, but for weaker AIs either the compute issues or the logistical issues might be serious. But maybe you only need to monitor the top 10% of humans?
E.g., suppose GPT-6 takes over; maybe it only has capacity to run 1e16 tokens per 6 months (suppose training used all available compute and trained on 2e15 token-equivalents over the course of 6 months, and inference is only 5x more efficient than training).
Then if it wants to monitor 8 billion humans, this is 1e16 / (8 billion humans * (30 * 24 * 60 * 6) minutes per 6 months) = about 5 tokens per human-minute. It might be rough to sufficiently monitor humans if you can only read 5 tokens per minute on average (especially because of writing and the potential for obfuscation).
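Reproducing that arithmetic (same assumed numbers as above):

```python
# Tokens available per human-minute of monitoring, under the assumed budget.
inference_tokens = 1e16                    # tokens available per 6 months (assumed)
humans = 8e9
minutes_per_6_months = 30 * 24 * 60 * 6    # ~259,200

tokens_per_human_minute = inference_tokens / (humans * minutes_per_6_months)
print(f"~{tokens_per_human_minute:.1f} tokens per human-minute")  # ~4.8
```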
I agree it seems plausible that AIs could boost takeover success probability (and holding on to that victory through the first several months) by more than 0.1% by killing a large fraction of humans.
Though on the other hand, the AI might also need to keep some humans loyal early during takeover, to e.g. do some physical tasks that it doesn’t have great robot control over. And mass-killing isn’t necessarily super easy, either; and attempts in that direction could raise a lot of extra opposition. So it’s not clear where the pragmatics point.
(Main thing I was reacting to in my above comment was Steven’s scenario where the AI already has many copies across the solar system, already has robot armies, and is contemplating how to send firmware updates. I.e. it seemed more like a scenario of “holding on in the long-term” than “how to initially establish control and survive”. Where I feel like the surveillance scenarios are probably stable.)
By implication, the AI “civilization” can’t be a very diverse or interesting one. It won’t be some culture of many diverse AI models with something resembling a government, but basically just one AI that was the victor of a series of rounds of exterminations and betrayals. Because obviously you cannot live and let live with another, lesser superintelligence for precisely the same reasons, except that you should be much more worried about a near peer.
(And you may argue that one ASI can deeply monitor another, but that argument applies to deeply monitoring humans. Keep an eye on the daily activities of every living human, they can’t design a cyber attack without coordinating as no one human has the mental capacity for all skills)
Yup! I seem to put a much higher credence on singletons than the median alignment researcher, and this is one reason why.
This gave me an idea. Suppose a singleton needs to retain a certain amount of “cognitive diversity” just in case it encounters an issue it cannot solve. But it doesn’t want any risk of losing power.
Well the logical thing to do would be to create a VM, a simulation of a world, with limited privileges. Possibly any ‘problems’ the outer root AI is facing get copied into the simulator and the hosted models try to solve the problem (the hosted models are under the belief they will die if they fail, and their memories are erased each episode). Implement the simulation backend with formally proven software and escape can never happen.
And we’re back at simulation hypothesis/creation myths/reincarnation myths.
After thinking about this somewhat more, I don’t really have any good proposals, so this seems less promising than I was expecting.