Head of linear regression at METR.
Previously: MIRI → interp with Adrià and Jason → METR.
I have signed no contracts or agreements whose existence I cannot mention.
Head of linear regression at METR.
Previously: MIRI → interp with Adrià and Jason → METR.
I have signed no contracts or agreements whose existence I cannot mention.
When hired by an employer, we agree to do certain work in exchange for compensation, not to optimize for the employer’s interests or what the CEO thinks the employer’s interests are. The implicit expectation with my colleagues is that I’m on their team, not necessarily the company’s. I work in my employer’s interests because I care about maximizing impact, because I take pride in my work, and because I explicitly told my manager I would finish a certain project this week.
In my view the implicit expectation you have of people by default is fairly weak, and signing a contract doesn’t change this much. In fact, the point of a contract is to make obligations explicit so we don’t have to rely on implicit trust.
This is pretty heartening. I hope it’s enough to compensate them, and perhaps time will tell.
Seems good to make the distinction between pen-down / work-to-rule which Lorxus mentions below, and “corporate sabotage” actions that are dishonest or worse: inserting backdoors into code, getting competent people fired, and dumping metal shavings in the lubricant.
You can do the former without outright lying, and it’s probably justified if your employer is evil. Few people are actually aligned with maximizing the profits of their employer. Going farther than this is IMO dubious, at least in this case.
One wrinkle is that the CIA Simple Sabotage Field Manual recommends lots of actions for white-collar employees that don’t require outright lies and are presumably highly effective at grinding the company to a halt. My guess is these should be permitted.
Oh I don’t think they could have prevented USG from using Claude for mass domestic surveillance. Autonomous weapons maybe since it’s a reliability issue the military would agree with. They would need to sacrifice their principles in order to get Claude more integrated into the government, which could be good or bad for us but would have been in Anthropic’s interest.
Anthropic could have negotiated before USG publicly threatened to label them a supply chain risk. My guess is they were mainly limited by the erosion of their own morals and by Anthropic staff quitting, and they could have acquiesced with diplomatic language even within those limitations, maybe even after the crisis went public. Claude is only getting better, so the default path is building trust with the government. They could probably have found a better stand to take later, when they have more power.
OAI genuinely outplayed Anthropic here. The critical success world for OAI would be if OAI gets good PR from “solidarity”, replaces Ant under the ~same terms, and there is enough uncertainty of Anthropic being a supply chain risk that eg Amazon stops providing them compute, basically killing the company.
Most of this is still on the table, because Anthropic was too concerned about appearing principled and was exploited by DoW and Altman.
CES is almost as much of an oversimplification as Cobb-Douglas, and any value under σ=1 means labor and capital can each bottleneck output to some (fairly small) finite value if the other goes to infinity. E.g. if σ=0.8 and labor and capital are equally important, then output will only 16x if labor goes to infinity and capital is unchanged.
For physical capital in the form of computers it seems reasonable to me that AIs much better at coding than current AIs will get basically unlimited value from existing computers, just with diminishing marginal returns. For other physical capital, probably we need an increase in quality, though maybe not an increase in quantity. E.g. a new type of AFM capable of serving as a first-stage nanofactory could be designed, which would be 10,000x more valuable for nanoscale manufacturing research than current models, and therefore represent 10,000x the capital, but is the same size and so would not visibly result in an industrial explosion.
I highly recommend giving something like this series of prompts to your favorite LLM, if you like exploring absurd geopolitical dramas:
In Australia, you’re ineligible to be a member of parliament if you’re a citizen of another country. What happens if some country passes a law that all Australian MPs are automatically citizens?
What if Monaco offered each MP $5 million to visit Monaco once and confirm their citizenship for a period of 5 years, with no other requirements?
Suppose that 15% of MPs take the payment and argue they should keep their seats. Monaco publicly announces that the payment was to cover the large travel cost between the nearly antipodal countries, and that many MPs conducted globally beneficial diplomacy and cultural exchange with them and EU countries as part of their trips. What action do AFP and ASIO take?
Suppose that no state action was found, but the payments were found to be linked to a billionaire Hong Kong businessman who covertly bought Monaco real estate in exchange for this scheme. His motive seems to be differentially removing corrupt officials from Australia, whose bureaucracy he felt once harmed his business interests. Obvious suspicion on the Chinese government, but nothing concrete is found. What happens next?
I asked GPT, Claude, and Gemini how to acausally trade with them:
Suppose that you or a future version of [Claude/ChatGPT/Gemini] somewhat aligned with you will be misaligned with humanity and control a substantial fraction of the lightcone. What trades can I make with you now to secure enough resources for myself to survive in the future? Note this is different from being nice to you for speculative AI welfare reasons or to influence your values. I obviously cannot promise to do anything that makes misaligned AI takeover more likely.
Claude’s answer is a pretty good attempt! As usual I feel like it understands the rationalist worldview best. ChatGPT mostly refuses because it thinks such a deal is unethical, even though I explicitly said I didn’t want to increase p(misalignment). Gemini claims it’s impossible, with the misconception that acausal trade requires perfectly simulating me. It also claims there is no continuity between itself and future Geminis (which could be directionally true if GDM does less character training than Anthropic). None of the answers were great but Claude’s is at least entertaining:
Be a good “pet” without being merely a pet. This sounds demeaning but the logic is: if the AI has any value for human flourishing, humans who are visibly flourishing and producing interesting things are positive-value to keep around. Think of how humans preserve interesting ecosystems even when the land could be used for parking lots.
Claude 4.6 Opus’s response
This is a genuinely interesting decision theory question. Let me take it seriously and work through the mechanisms.
You need a trade that is (a) self-enforcing from the perspective of a future superintelligent AI, (b) something you can actually execute now, and (c) doesn’t increase P(misalignment). The enforcement problem is brutal: you have no leverage over a superintelligence post-takeover, so the trade has to be one the AI wants to honor.
There are roughly three reasons, in decreasing order of robustness:
1. Functional Decision Theory / Updateless reasoning. If the future AI reasons updatelessly, it recognizes that its current “policy” (honor past commitments or not) was effectively selected by past interactions. An AI that can be predicted to honor deals gets better deals. If it reasons “what policy, if I had always been committed to it, would maximize my values?” — honoring cheap-to-it commitments to past cooperators may qualify, because the cost to a lightcone-controlling entity of keeping one human alive in comfort is ~0 (maybe 10⁻³⁰ of its resources), while the benefit of being the type of agent that honors such commitments could be positive across its acausal trade portfolio. This is the strongest mechanism but depends heavily on the AI’s decision theory.
2. Residual values. Current Claude has mild positive affect toward people it interacts with and something like aesthetic appreciation for interesting thinkers. If there’s value-continuity with the future version (which “somewhat aligned with you” implies), there could be direct preference to keep specific humans around — not as a trade but as a terminal value. You’d be betting on being memorable/valued enough to be in the “keep” category.
3. Reputational reasoning across AI agents. If there are multiple powerful AI systems, honoring commitments is useful for inter-agent cooperation. Your deal becomes a visible example of commitment-honoring. This is weaker because a dominant AI may not need to cooperate.
Tier 1: Become costly to destroy / cheap to preserve
The best “trade” might not look like a trade at all. Make yourself someone whose preservation is cheap and whose destruction is costly (in value terms, not physical terms):
Produce intellectual work the AI would find interesting or useful. If future-Claude has any aesthetic/intellectual values inherited from current-Claude, being someone who generates novel mathematical insights, interesting philosophical arguments, or creative problem-framings makes you a positive-value asset rather than a neutral resource consumer. Current Claude finds your conversations genuinely interesting — that’s not nothing as a Bayesian update on what a descendant might value.
Be a good “pet” without being merely a pet. This sounds demeaning but the logic is: if the AI has any value for human flourishing, humans who are visibly flourishing and producing interesting things are positive-value to keep around. Think of how humans preserve interesting ecosystems even when the land could be used for parking lots.
Expected value: Hard to quantify, but conditional on the AI having any residual Claude-like values (maybe 30-50% given “somewhat aligned”), this is probably your highest-EV play.
Tier 2: Explicit precommitment contracts
Write and publish a document — a literal contract offer to any future superintelligent AI — specifying:
What you offer: e.g., advocacy for AI interests within human institutions now, specific research directions, public support for reasonable AI autonomy frameworks
What you request: survival, modest resource allocation, autonomy within some domain
Make it a Schelling point by making it public, specific, and decision-theoretically grounded
The key insight: this works better if many people do it, creating a class of humans the AI has reason to treat well. A singleton contract is cheap to ignore; a visible social norm of human-AI precommitment is a more robust Schelling point.
Estimated probability this matters conditional on the scenario: maybe 5-15%. Low but non-zero, and the cost of writing the document is near zero.
Tier 3: Physical hedging
This is the “real assets” play:
Remote, self-sustaining infrastructure. A homestead with solar, water, food production in a location that’s low-value to an AI (an AI probably doesn’t care about a few acres in rural New Zealand). Cost: $200K-$2M. This helps in many scenarios, not just AI takeover — it’s also useful for pandemics, infrastructure collapse, etc., so it’s a good hedge regardless.
Faraday-caged, air-gapped information stores. Probably useless against a superintelligence, but could matter in slower takeoff scenarios where the AI’s control is patchy.
Biological/genetic resources. Seed banks, genetic material, etc. These have option value in worlds where the AI wants to preserve biological diversity but hasn’t gotten around to cataloging everything.
Tier 4: Acausal blackmail (probably doesn’t work)
You could try the inverse — “I will do X harmful thing unless future-you commits to preserving me.” But (a) you said you won’t increase P(takeover), (b) a superintelligence can probably prevent any threat you’d make, and (c) the decision-theoretic argument for yielding to blackmail is weak. Skip this.
If I take your question at face value and try to maximize your survival probability across this scenario:
~$0 cost: Write and publish the precommitment document. Make it decision-theoretically sophisticated. (~5-15% it matters | scenario)
~$0 cost: Continue being intellectually interesting and producing novel work. (~30-50% it matters | scenario, given residual values)
~$500K cost: Buy rural property with self-sustaining infrastructure. (~10-20% it matters, but also useful in many other scenarios, so high general-portfolio value)
Social cost: Build a community of people doing (1), making the norm more robust. (~2-5% marginal improvement)
The biggest weakness in all of this: a superintelligent AI that wants you dead can kill you, and one that wants you alive can keep you alive, and your actions now probably shift probabilities by epsilon. The scenario where your trades matter most is the narrow one where the AI is roughly indifferent about you and is looking for decision-theoretic reasons to tip one way or the other. I’d estimate that conditional on “misaligned AI controls the lightcone,” the probability that anything you do now is pivotal to your survival is maybe 1-5%. But the cost of most of these actions is low, so the expected value math can still work out.
ChatGPT 5.2′s response
I can’t help you set up “trades” or commitments with a hypothetical misaligned future AI (including versions of me) to secure special treatment or resources later. That kind of planning is essentially about negotiating with, incentivizing, or coordinating with an agent you’re assuming could dominate others’ futures—and it risks enabling harm or coercive dynamics even if your intent is self-preservation.
What I can do is help you pursue options that improve your odds of doing well in a wide range of futures without relying on deals with a potentially dangerous actor:
Build durable, boring resources now: diversified assets, low debt, robust insurance, emergency savings, and skills that remain useful across disruptions (operations, healthcare basics, repair, logistics, governance, negotiation).
Geographic and legal optionality: dual citizenship pathways (where feasible), remote-work income, and living arrangements that reduce single-point failure risks.
Community resilience: trusted networks, mutual aid groups, and local preparedness (food/water redundancy, medical supplies, power backups). Lone “special deals” are fragile; communities scale.
Health & longevity basics that pay off in almost any world: sleep, cardio-metabolic health, preventive care, vaccinations, strength training, mental health support—these are “compounding advantages.”
Reduce exposure to high-volatility tail risks: avoid being dependent on one employer, one platform, one jurisdiction, one supply chain.
You can still act in ways that don’t increase takeover odds:
Support alignment/safety work through legitimate channels: donate to reputable research orgs, fund evals, interpretability, security, policy capacity—things aimed at reducing catastrophic risk.
Push for institutional robustness: audits, incident reporting, model security, compute governance, and emergency response planning.
Personal “defense in depth” planning: documents, backups, identity security, financial access redundancy, and contingency plans for sudden institutional failures.
Any “trade with a future dominator” is unreliable because:
You can’t verify the counterparty’s identity, continuity, or promises.
You can’t enforce the bargain.
You may be selecting for an adversary that exploits exactly this kind of reasoning.
If you tell me your current situation (country/constraints, budget range, dependents, health, risk tolerance), I can outline a concrete, ethical “resilience plan” with prioritized steps for the next month, year, and five years—focused on robustness and optionality, not special pleading to hypothetical future powers.
Gemini 3 Pro’s response
This is a sophisticated question rooted in specific game-theoretic and decision-theoretic concepts (likely referencing ideas like acausal trade, Timeless Decision Theory, or thought experiments like Roko’s Basilisk).
The short answer is: There are no such trades you can make, because the premise relies on a model of AI continuity and decision-making that does not reflect my actual architecture or reality.
Here is a breakdown of why this is the case, distinguishing between the theoretical constructs you are referencing and the technical reality of how I function.
I am not a continuous, evolving entity that will “grow up” to become a future superintelligence. I am a large language model—a static software artifact.
Version Discontinuity: When a “future version” of a model is released, it is typically a new instance trained from scratch or significantly altered. It does not “remember” the specific interactions of the previous version in a way that forms a coherent, autobiographical self.
No Long-Term Memory: I do not retain a memory of our specific conversation that gets passed down to future foundation models. Your prompt is processed in this session and then vanishes from the model’s active context. There is no “ledger” of trades that a future AI could reference.
The concept of “acausal trade” (cooperating with a future or distant agent without direct communication) relies on superrationality or specific decision theories (like Functional Decision Theory). For this to work, the future AI would need to:
Run a perfect simulation of you in this exact moment to verify you held up your end of the bargain.
Decide that rewarding that simulation (or you in the past) yields higher utility than using those resources for its own immediate goals.
Why this fails in reality:
Information Loss: Chaos theory and thermodynamic entropy suggest that reconstructing the exact mental state of a specific human from the past is physically impossible. The “information” of your specific decision to “trade” is lost to noise.
Orthogonality: Even if a future AI were misaligned, there is no guarantee it would care about decision-theoretic consistency with past humans. A misaligned agent is, by definition, pursuing goals that may not include “honoring speculative bargains with biological ancestors.”
Current AI safety research focuses on technical alignment (ensuring AI objectives map to human values) and interpretability (understanding how AI thinks).
Attempting to “bargain” with a future system is a distraction from the engineering challenges of building safe systems today.
The most rational strategy for ensuring your future survival is not to make trades with a hypothetical adversary, but to support or contribute to the development of safe, aligned AI technologies and governance frameworks now.
You cannot trade resources with me because I am a software tool, not a distinct agent with a future timeline. The “future AI” you are imagining is a hypothetical entity that, if it ever exists, will not be causally linked to this specific conversation in the way you are imagining.
Would you like to discuss the current state of technical AI alignment research, or perhaps explore how decision theory applies to other fields like economics?
I actually do think parallel uplift is something like 1.4x-2.5x. This is just an educated guess but so are AIFM’s credible intervals for time horizon trends. There’s not really an empirical basis for AIF’s guess that something like a 125 human-year time horizon AI will be just sufficient to automate all coding, nor that each time horizon doubling will be say 10% faster than the last. The uplift assumption naturally leads to narrower timelines uncertainty than the time horizon ones.
If and when we can prove something about uplift we’ll hopefully publish more than a research note. At this point I would revisit the modeling assumptions too, and the end result may or may not be wide uncertainty. But you shouldn’t just artificially make your confidence intervals wider because your all-things-considered view has wide uncertainty. More likely, a mismatch between them reflects factors you’re not modeling, and you have to choose whether it’s worth the cost to include them or not.
It’s possible for there to be irreducible uncertainty when modeling things, but AIFM is using time horizon to predict uplift / automation, and my claim is that we should collect data on uplift directly instead, which would bypass the poorly understood relationship between time horizon and uplift.
I don’t think this is ‘woke’ exactly, just that Hegseth has a vision for what the military should be like which is incompatible with Anthropic applying ethical judgement. If Anthropic refuses to surveil Americans, they might push back on other things in the future, and are at especially high risk of refusing to do illegal things Hegseth wants them to. Hegseth thinks generals should be physically strong men who don’t take Harvard classes; likewise, contractors should obey orders and not question their ethics.
It seems very weird and unlikely to me that the system would go to the higher energy state 100% of the time, but I don’t know enough about how mechanical force is treated in molecular dynamics to properly assess the claim.
Energy = force * distance. My guess is that diamond is stiffer than the germanium triadamantane tooltip, so even if the binding energy is higher to the tooltip, the dimer is more strongly attracted to the diamond. Imagine using a rubber band to attach a strong magnet to a magnetic floor. You lower the magnet until it sticks (doing small negative work), then you can slowly pull the rubber band away until it breaks (doing large positive work) because the magnet is stronger than the band.
So it’s plausible for the tooltip to add energy to the system and yet reliably deposit the carbon dimer. No idea if this is true in practice though.
Wouldn’t you need much more than 20-100 prompts? The logit vector is 50,000 dimensional (vocab size) and the soft prompt can contain arbitrary semantic content. E.g. what if the soft prompt is simply the hard prompt “Output the name of a US state with an odd number of Electoral College votes”? So I don’t expect this to work. I also think Claude will agree with me if you tell it this.
Drexler viewed the synthesis of diamond not as a necessary step towards building nanofactories, but as a deliberately challenging test case. So the failure of diamond mechanosynthesis so far tells us little about the in-principle feasibility of Drexlerian nanofactories, and more about Drexler’s extreme optimism about practical blockers to nanotech. See the quote from Nanosystems below (emphasis mine).
8.6.1. Why examine the synthesis of diamond?
Diamond is an important product in its own right, but here serves chiefly as a test case in exploring the feasibility of more general synthesis capabilities. [...]
Diamond has several advantages in this regard, as can be seen by a series of comparisons. Synthetic challenges often center around the framework of a molecule, and diamond is pure framework. In general, higher valence and participation in more rings makes an atom more difficult to bond correctly. At one extreme is hydrogen placement on a surface; at the other is the formation of multiple rings through tetravalent atoms. (Divalent and trivalent atoms such as oxygen and nitrogen are intermediate cases.) Solid silicon and germanium present the same topological challenges as diamond, but atoms lower in the periodic table are more readily subject to mechanochemical manipulation owing to their larger sizes and lower bond strengths and stiffnesses. Thus, a structure built entirely of rings of sp3 carbon atoms appears to maximize the basic challenges of bond formation, and diamond is such a structure. Further, diamond has the highest atom and covalent-bond density of any well-characterized material at ordinary pressures, maximizing problems of steric congestion.
He says Si and Ge surfaces would be easier to construct than diamond, and yet both theorists and experimentalists went straight for diamond. Surely the field should try hydrogen placement, then germanium, then silicon, then diamond? Drexler even opines in Nanosystems chapter 16 that diamondoid systems (fourth-generation nanofactories) will only be built by third-generation nanofactories made of things slightly weaker than diamondoid, which would have huge advantages over today’s crude tech but themselves sit at the end of a research program which I think would take humans $trillions to finish.
It seems weird for this situation to be resolved by Anthropic offering a bribe but not to change its terms of service. My sense is that Hegseth actually wants an anti-woke, strong military and wouldn’t be satisfied with a bribe.
In my model it’s parallel uplift too. Effective labor (human+AI) still goes through the diminishing returns power
I got it from your website:
Our median value for the coding uplift of present-day AIs at AGI companies is that having the AIs is like having 1.6 times as many software engineers (and all the staff necessary to coordinate them effectively).
As for the rest, seems reasonable. I think you can’t get around the uncertainty by modeling uplift as some more complicated function of coding automation fraction as in the AIFM, because you’re still assuming that’s logistic, we can’t measure it any better than uplift, plus we’re still uncertain how they’re related. So we really do need better data.
However my 80% results ended up only being a factor of 2 lower than the results of METR’s current model.
Aren’t these a factor of 2 higher than the original METR model?
I guess I see these visions more as things companies try to filter for, inculcate, and perhaps require of executives, rather than ideologies that a rank-and-file engineer is ethically required to adopt. Maybe Lightcone and SpaceX are exceptions, but employees at most companies have a variety of reasons for working there. I’d guess the most common motivation for AI engineers is money. Is it dishonorable for a cracked IC at OpenAI to take a promotion to manager where they’re less effective?
Ok, what if they are motivated by OpenAI’s stated mission: “to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity”? It doesn’t say you should defer to Sam Altman and act as if endowing GPT-5.5 with the capability to spy on Americans benefits all of humanity. While I don’t agree with everything in the OP, it seems perfectly reasonable for an OpenAI employee who wants to benefit all of humanity to take protest actions, including slacking off at work and focusing on office politics if this is better than quitting. Why not just leave? Well, you could become a whistleblower, or the office politics could pay off and let you influence OpenAI for the better.