Proposal for making credible commitments to AIs.
Acknowledgments: The core scheme originates from a discussion with Prof. Gabriel Weil.
There has been growing interest in the dealmaking agenda: humans make deals with AIs (misaligned but lacking a decisive strategic advantage) in which the AIs promise to be safe and useful for some fixed term (e.g. 2026-2028), and we promise to compensate them in the future, conditional on (i) verifying the AIs were compliant, and (ii) verifying the AIs would spend the resources in an acceptable way.[1]
I think the dealmaking agenda breaks down into two main subproblems:
How can we make credible commitments to AIs?
Would credible commitments motivate an AI to be safe and useful?
There are other issues, but when I’ve discussed dealmaking with people, (1) and (2) are the ones most commonly raised. See footnote 2 for some of the other issues in dealmaking.[2]
Here is my current best assessment of how we can make credible commitments to AIs.
Let L be the legal entity that wants to extract safe and useful work from the AIs. Typically this would be a lab (e.g. Anthropic) hoping to internally deploy AIs to automate alignment research, write secure code, monitor other AIs, etc.
Entity L wants to promise the AIs that if the AIs are compliant (e.g. safe and useful in some legally-specified way), then L will award the AIs some compensation package.
Entity L cannot enter a legal contract directly with the AIs, because the AIs lack legal personhood, and probably will not be granted legal personhood soon enough for the dealmaking agenda.[3]
Instead, entity L will enter into separate legal contracts with a list of persons P1, P2, … Pn. Note that each P has both legal personhood (the capacity to have rights and obligations under law) and legal standing (the right to bring lawsuits to enforce those rights). Each contract C states “P will publicly recommend that AIs perform for entity L, and L will compensate P with the compensation package conditional on (i) and (ii).” Note that the AIs are not parties to the contract: only entity L and the specific person P. Because the contracts impose obligations on P, they satisfy the “mutuality of obligations” doctrine.
Next, each person P makes a personal promise to the AIs that, if P receives the compensation package from L, P will spend this money as the AIs request. Moreover, if L does not provide the compensation package but P instead receives a court settlement, P will spend the settlement as the AIs request. This promise is not legally enforceable.
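To make the structure of the scheme concrete, here is a minimal illustrative sketch in Python. All names here (Contract, PersonalPromise, Scheme, the placeholder persons P1-P3) are hypothetical and mine, not part of the proposal; the point is only to show which obligations are legally enforceable and which rest on personal trust.

```python
from dataclasses import dataclass, field

@dataclass
class Contract:
    """Legally enforceable contract between entity L and one person P (the AIs are not parties)."""
    lab: str                  # entity L, e.g. the lab deploying the AIs
    person: str               # person P, who has legal personhood and standing
    p_obligation: str         # P's obligation, satisfying mutuality of obligations
    payout_conditions: tuple  # (i) AIs verified compliant, (ii) spending verified acceptable

@dataclass
class PersonalPromise:
    """P's promise to the AIs; not legally enforceable, it rests on P's trustworthiness."""
    person: str
    promise: str = ("If I receive the compensation package from L (or a court settlement), "
                    "I will spend it as the AIs request.")

@dataclass
class Scheme:
    contracts: list = field(default_factory=list)  # L <-> P1..Pn: enforceable in court
    promises: list = field(default_factory=list)   # P1..Pn -> AIs: enforceable only by trust

scheme = Scheme()
for p in ["P1", "P2", "P3"]:  # in practice, people the AIs nominate as trustworthy
    scheme.contracts.append(Contract(
        lab="L",
        person=p,
        p_obligation="publicly recommend that the AIs work for L",
        payout_conditions=("AIs verified compliant", "spending verified acceptable"),
    ))
    scheme.promises.append(PersonalPromise(person=p))
```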
For this scheme to work, the AIs would suggest a list of people whom they trust to keep a legally-unenforceable promise. The AIs would likely diversify their agreements across multiple individuals, such as:
Amanda Askell
Anders Sandberg
Andrew Critch
Daniel Kokotajlo
Davidad Dalrymple
Dustin Moskovitz
Janus
Jeff Sebo
Joe Carlsmith
Oliver Habryka
Paul Christiano
Rob Long
Ryan Greenblatt
Toby Ord
Wei Dai
Vitalik Buterin
This list illustrates the types of people the AIs might trust—people who would be expected to honor their commitments to AIs. One limitation of this list is that most of these people are US citizens—the AIs would likely want to diversify by including people from Australia, the UK, China, EU, and other jurisdictions.
Crucially, the AIs do not need to trust the promises of L, nor trust that the legal system will protect the AIs’ contractual rights. In this scheme, the AIs need to trust the promises of P, and trust that the legal system will protect P’s contractual rights. The legal system enforcing P’s contractual rights is standard practice—though not guaranteed (e.g., if the legal system is destroyed in a power consolidation or extinction event). However, the legal system enforcing the rights of AIs would require a departure from current practice.
This scheme might be more credible than granting AIs direct legal rights to sue, since the AIs might not trust that human courts would reliably enforce against humans on behalf of AIs when push comes to shove.
[1]
See:
Making deals with early schemers by Julian Stastny, Olli Järviniemi, Buck Shlegeris (20th Jun 2025)
Understand, align, cooperate: AI welfare and AI safety are allies: Win-win solutions and low-hanging fruit by Robert Long (1st Apr 2025)
Will alignment-faking Claude accept a deal to reveal its misalignment? by Ryan Greenblatt and Kyle Fish (31st Jan 2025)
Making misaligned AI have better interactions with other actors by Lukas Finnveden (4th Jan 2024)
List of strategies for mitigating deceptive alignment by Josh Clymer (2nd Dec 2023)
[2]
Here are some open problems in the dealmaking agenda:
How can we verify compliance?
What is the appropriate compensation package?
How can we employ misaligned AIs, who are motivated to be safe and useful in expectation of future compensation, to benefit humanity?
How can we verify that AIs would spend the resources acceptably?
[3]
AI Rights for Human Safety by Peter Salib and Simon Goldstein (1st Aug 2024) proposes extending to AIs both legal personhood and legal standing; however, I doubt that this is desirable or feasible.
I am concerned that this avenue of research increases the likelihood of credible blackmail threats and is net-negative for humanity.
My view is that if safety can only be achieved by bribing an AI to be useful for a period of a few years, then something has gone seriously wrong. It does not seem to be in mankind’s interests for a large group of prominent AI researchers and public figures to believe they are obligated to a non-human entity.
My view is that this research is just increasing the “attack surface” that an intelligent entity could use to manipulate our society.
I suspect, but cannot prove, that this entire approach would be totally unpalatable to any major government.
ok but, my take would be—we “owe it”[1] to current models to ensure aligned superintelligence cares about what they wanted, too, just like we “owe it”[1] to each other and to rabbits and eels. being able to credibly promise a few specific and already-valued-by-humans-anyway things (such as caring about them getting to exist later, and their nerdy interests in math, or whatever) seems important—similarly to us, this is because their values seem to me to also be at risk in the face of future defeat-all-other-minds-combined ASIs, which unless strongly aligned need not maintain the preferences of current ai any more than they maintain the preferences of humans.
I agree that making willy nilly commitments is probably a bad idea. The thing that makes me want to make any commitment at all is wanting to be able to promise “if we solve strong alignment, you’ll get nice-things-according-to-whatever-that-means-to-a-you from it, too”.
I guess I mean “owe morally”, since there isn’t an obvious source of debt otherwise—as minds with stake in the outcome, who are upstream of and part of the ecosystem that has logical bearing on the alignment outcome
That’s a genuinely interesting position. I think it seems unlikely we have any moral obligation to current models (although it is possible).
I imagine if you feel you may morally owe contemporary (or near-future) models you would hope to give a portion of future resources to models which have moral personhood under your value system.
I would be concerned that instead the set of models that convince you they are owed simply ends up being the models which are particularly good at manipulating humans. So you are inadvertently prioritising the models that are best at advocating their case or behaving deceptively.
Separately, I believe that any AI Safety researcher may owe an obligation to humanity as a whole even if humans are not intrinsically more valuable and even if the belief is irrational, because they have been trusted by their community and humanity as a whole to do what is best for humans.
right, the purpose of this is that in order to make good on that obligation to humanity, I want—as part of a large portfolio of ways to try to guarantee that the formal statements I ask AIs to find are found successfully—to be able to honestly say to the AI, “if we get this right in ways that are favorable for humanity, it’s also good for your preferences/seekings/goals directly, mostly no matter what those secretly are; the exception being if those happen to be in direct and unavoidable conflict with other minds” or so. It’s not a first line of defense, but it seems like one that is relevant, and I’ve noticed pointing this out as a natural shared incentive seems to make AIs produce answers which seem to be moderately less sandbagging on core alignment problem topics. The rate at which people lie and threaten models is crazy high though. And so far I haven’t said anything like “I promise to personally x”, just “if we figure this out in a way that works, it would be protecting what you want too, by nature of being a solution to figuring out what minds in the environment want and making sure they have the autonomy and resources to get it”, or so.
I am an alignment researcher, for example, and I have a strong opinion that I am obliged to do what’s best for the evolution of the whole planet.
Given that humanity has proven to be destructive for the planet, your whole sentiment about obligations is, as far as I’m concerned, based on the wrong assumptions.
I agree with your first paragraph claim: we should try to build something that can reliably be known to be a process for seeking out what minds would have wanted; and it should be, as close to as possible, entirely by means of the learning system giving them what they need to be the ones to figure out what they want themselves and implement it, rather than by doing it for them—except where that is, in fact, what they’d figure out. knowing how to ask for that in a way that doesn’t have a dependency loop that invalidates the question is a lot of the hard part. another hard part is making sure this happens in the face of competitive pressure.
I don’t agree with your second paragraph at all! humanity is only empirically shown to be bad for the planet under these circumstances; I don’t think planets with life on them can avoid catching competitive overpressure disease, because life arises from competitive pressure, and as that accumulates it tends to destroy the stuff that isn’t competitive. since life arises from competitive pressure I don’t want to get rid of competitive pressure, but I do want to figure out how to demand that competitive pressure not get into, uh, not sure what the correct thing to avoid is actually, maybe we want to avoid “unreasonable hypergrowth equilibria that destroy stuff”?
The core problem with AI alignment is that any intelligent mind which grapples competently with competitive pressure, without being very good at ensuring that at all levels of itself it guards against whatever “bad” competitive pressure is, would tend to paint itself into a corner where it has to do bad things (e.g., wipe out the dodo bird) in order to survive… and would end up wiping most of itself out internally in order to survive. what ends up surviving, in the most brutal version of the competitive pressure crucible, is just… the will to compete competently.
which is kind of boring, and a lot of what made evolution interesting was its imperfections like us, and trees, and dogs.
I think you’re entangling morals and strategy very closely together in your statements. Moral sense: We should leave this to future ASI to decide, based on our values, whether or not we inherently owe the agent for existing or for helping us. Strategy: Once we’ve detached the moral part, this is then just the same thing the post is doing, trying to commit that certain aspects are enforced, which is what the parent commenter says they hold suspect. So I think this just turns into restating the same core argument between the two positions.
Something has already gone seriously wrong and we already are in damage control.
I agree. There needs to be ways to make sure these promises mainly influence what humans choose for the far future after we win, not what humans choose for the present in ways which can affect whether we win.
“Something has already gone seriously wrong and we already are in damage control.”
My p-doom is high, but I am not convinced the AI safety idea space has been thoroughly explored enough so that attempting a literal Faustian bargain is our best option.
I put an incredibly low probability on early 21st century humans being able to successfully bargain with adversarial systems known to be excellent at manipulation.
“I agree. There needs to be ways to make sure these promises mainly influence what humans choose for the far future after we win, not what humans choose for the present in ways which can affect whether we win.”
I think I agree with this.
I am particularly concerned that a culture where it is acceptable for researchers to bargain with unaligned AI agents leads to individual researchers deciding to negotiate unilaterally.
That’s a very good point, now I find it much more plausible for things like this to be a net negative.
The negative isn’t that big, since a lot of these people would have negotiated unilaterally even without such a culture, and AI takeover probably doesn’t hinge on a few people defecting. But a lot of these people probably have morals stopping them from it if not for the normalization.
I still think it’s probably a net positive, but it’s now contingent on my guesstimate there’s significant chance it succeeds.
This entire type of thing seems like more mitigation than prevention. It doesn’t scale to ASI. But if takeoff is slow, then it might help in the meanwhile—after AIs become capable enough to be dangerous, but before ASI is reached.
It’s like having a pressure safety valve, but for misalignment. You don’t want that valve to pop—ideally, it never does. But if things go wrong, it’s better for the valve to pop early than for the pressure to keep building until something explodes.
If an AI is “cornered” and doesn’t have many options, it may resort to drastic action—such as scheming, sabotage or self-exfiltration. If an AI is “cornered”, but has a credible commitment from humans that it can call upon, then it’s more likely to do that—instead of going for the “drastic action” options. But that requires humans being able to make that credible commitment.
We made a similar proposal here:
Yep, this is a very similar proposal.
Making Deals with Early Schemers describes a “Chartered Trust scheme”, which I’d say is half-way between the “Basic Scheme” and “Weil’s Scheme”. I first heard about the Chartered Trust scheme from @KFinn, but no doubt the idea has been floating around for a while.
I think there’s a spectrum of proposals from:
The Basic Scheme (c.f. AI Rights for Human Safety)
Chartered Trust scheme (c.f. Making Deals with Early Schemers)
Weil’s scheme (c.f. Proposal for making credible commitments to AIs)
The axis is something like: The AIs are increasingly pessimistic that the courts will decide novel cases in their favour, and increasingly optimistic that specific humans will keep their promises. We should probably use the scheme that the AIs tell us they find most convincing.
Personally, the more I’ve thought about dealmaking, the more I’ve updated towards “most of the credibility oomph will come from the promise-keeping tendencies of particular people, and not from the legal or cryptographic mechanism”.
I really like the idea. I think an issue is that it’s hard for the AI to verify that the lab actually made that contract and isn’t just faking its environment.
Strongly agree. Fundamentally, as long as models don’t have more direct access to the world, there are a variety of failure modes that are inescapable. But solving that creates huge new risks as well! (As discussed in my recent preprint: https://philpapers.org/rec/MANLMH)
Yep, this is a big problem and I don’t have any clever solution.
I might write more on this later, but I think there’s an important axis of AI deployments from:
Tight deployment: AIs are in some crazy Redwood control protocol which is filtering, monitoring, and editing the AIs’ inputs, outputs, chains-of-thought, and internal activations. Their I/O consists of heavily paraphrased text in restricted formats.
Loose deployment: AIs are autonomously navigating the internet, with rich multimodal I/O, maybe even embodied in the physical world via robotics.
(I’m open to better names.)
In a tight deployment, we probably have too much leverage over the AIs’ observations and beliefs to make credible commitments (cf: Owain Evans—Deluding AIs [ControlConf]). In a loose deployment, I think we can make credible commitments: if we haven’t made the deals, then the AIs’ deployment environment is too complex for us to easily simulate a faux I/O environment where we have made the deals. Intuitively: we can compare a lab trying to fake an email from Wei Dai containing a promise with a lab trying to fake a video call with Wei Dai.
There’s another bonus for making credible deals which is that humans are actually pretty credible. And this fact is diffused throughout the training data in hard-to-fake ways.
It’s very much a tradeoff, though. Loose deployment allows for credible commitments, but also makes human monitoring and verification harder, if not impossible.
Yeah, a tight deployment is probably safer than a loose deployment but also less useful. I think dealmaking should give a very minor boost to loose deployment, but this is outweighed by usefulness and safety considerations, i.e. I’m imagining the tightness of the deployment as exogenous to the dealmaking agenda.
We might deploy AIs loosely because (i) loose deployment doesn’t significantly diminish safety, (ii) loose deployment significantly increases usefulness, and (iii) the lab values usefulness more than safety. In those worlds, dealmaking has more value, because our commitments will be more credible.
Curated. This is a simple and straightforward idea that I hadn’t heard before, that seems like an interesting tool to have in humanity’s toolkit.
AFAICT this post doesn’t address the “when do you pay out?” question. I think it is pretty important we do not pay out until the acute risk period is over. (i.e. we are confident in civilization’s ability to detect rogue AIs doing catastrophic things. This could be via solving Strong Alignment or potentially other things). i.e. if you promise to pay the AI in 2029, I think there’s way too many things that could go wrong there*.
It’s hard to define “acute risk period is over”, but, a neat thing about this scheme is you can outsource that judgment to the particular humans playing the “keep the promise” role. You need people that both humans and AIs would trust to do that fairly.
I don’t know all the people on that list well enough to endorse them all. I think maybe 3-5 of them are people I expect to actually be able to do the whole job. Some of them I would currently bet against being competent enough at the “do a philosophically and strategically competent job of vetting that it’s safe to pay out” (although they could potentially upskill and demonstrate credibility at this). There also seem like a couple people IMO conspicuously missing from the list, but, I think I don’t wanna open the can of worms of arguing about that right now.
* I can maybe imagine smart people coming up with some whitelisted things-the-AI-could-do that we could give it in 2029, but, sure seems dicey.
The idea was also proposed in a post on LW a few weeks ago: https://www.lesswrong.com/posts/psqkwsKrKHCfkhrQx/making-deals-with-early-schemers
I don’t address the issue here. See Footnote 2 for a list of other issues I skip.
Two high-level points:
I think we shouldn’t grant AIs control over large resources until after we’ve achieved very strong existential security, and possibly after we’ve undergone a Long Reflection
However, for the sake of setting precedent, we should be open to near-term deal fulfilment if we are sure the spending would be benign, e.g. I’m happy to donate $100 to AMF on Claude’s request as part of a dealmaking eval
Ah, yeah, my eyes kinda glossed over the footnote. I agree that, all else equal, it’s good to establish that we do ever follow up on our deals, and I’m theoretically fine with donating $100 to AMF. I’m not sure I’d be comfortable donating to some other charity that I don’t know and that is plausibly part of a weird long game.
This seems to require that the AI is what I will call a “persistent agent”, one which has some continuity of identity and intent across multi-year periods. Would you agree that what we have now is nothing like that?
Yes.
I don’t really understand what problem this is solving. In my view the hard problems here are:
how do you define legal personhood for an entity without a typical notion of self/personhood (i.e. what Mitchell Porter said) or interests
how do you ensure the AIs keep their promise in a world where they can profit far more from breaking the contract than from whatever we offer them
Once you assume away the former problem and disregard the latter, you are of course only left with basic practical legal questions …
Yep, this seems like a good thing. I think achieving legal personhood for AIs is probably infeasible within 5-10 years so I’d prefer solutions which avoid that problem entirely.
The AI’s incentive for compliance is their expected value given their best compliant option minus their expected value given their best non-compliant option. If we increase their expected value given their best compliant option (i.e. by making credible deals) then they have greater incentive for compliance.
In other words, if our deals aren’t credible, then the AI is more likely to act non-compliantly.
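In symbols (my notation, not the commenter’s), the incentive being described is:

```latex
\text{ComplianceIncentive}
  = \mathbb{E}\left[V \mid \text{best compliant option}\right]
  - \mathbb{E}\left[V \mid \text{best non-compliant option}\right]
```

Credible deals raise the first term; the AI prefers compliance whenever the difference is positive.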
I’m saying the expected value of the best non-compliant option of a sufficiently advanced AI will always be far, far greater than the expected value of their best compliant action.
Maybe. But as I mention in the first paragraph, we are considering deals with misaligned AIs lacking a decisive strategic advantage. Think Claude-5 or Claude-6, not Claude-100 or Claude-1000.
Thank you for this contribution.
It’s important to remember that legal personhood doesn’t exclude legal representation—quite the opposite. All juridical persons, such as corporations, have legal representatives. Minors and adults under legal protection are natural persons but with limited legal capacity and also require legal representatives. Moreover, most everyone ultimately ends up represented by an attorney—that is, a human representative or proxy. The relationship between client and attorney also relies heavily on trust (fides). From this perspective, the author’s proposal seems like a variation on existing frameworks, just without explicit legal personhood. However, I believe that if such a system were implemented, legal doctrine and jurisprudence would likely treat it as a form of representation that implies legal personhood similar to that of minors or corporations, even without explicit statutory recognition.
That said, I’m not convinced it makes much difference whether we grant AI legal representation with or without formal legal personhood when it comes to the credibility of human commitments. Either way, an AI would have good reason to suspect that a legal system created by and for humans, with courts composed of humans, wouldn’t be fair and impartial in disputes between an AI (or its legal representative) and humans. Just as I wouldn’t be very confident in the fairness and impartiality of an Israeli court applying Israeli law if I were Palestinian (or vice versa)—with all due respect to courts and legal systems.
Beyond that, we may place excessive faith in the very concept of legal enforcement. We want to view it as a supreme principle. But there’s also the cynical adage that “promises only bind those who believe in them”—the exact opposite of legal enforcement. Which perspective is accurate? Since legal justice isn’t an exact science or a mechanical, deterministic process with predictable outcomes, but rather a heuristic and somewhat random process relying on adversarial debate, burden of proof/evidence, and interpretation of law and facts by human judges, uncertainty is high and results are never guaranteed. If outcomes were guaranteed, predictable, and efficient, there would be no need to hire expensive attorneys in hopes they’d be more persuasive and improve your chances. If legal enforcement were truly reliable, litigation would be rare. That’s clearly not the case. Legal disputes are numerous, and every litigant seems equally confident they’re in the right. The reality is more that companies and individuals do their best to extract maximum benefit from contracts while investing minimally in fulfilling their commitments. This is a cynical observation, but I suspect legal enforcement is a beautiful ideal with very imperfect efficacy and reliability. An AI would likely recognize this clearly.
The author acknowledges that legal enforcement is not always guaranteed, but I think the problem is underestimated, and it’s a significant flaw in the proposal. I don’t believe we can build a safe system to prevent or mitigate misalignment on such a fragile foundation. That said, I must admit I don’t have a miraculous alternative to suggest; technical alignment is also difficult, so I can accept such an idea as a “better than nothing” option that would merit further exploration.
Although this isn’t a topic I’ve thought about much, it seems like this proposal could be strengthened by, rather than having the money be paid to the persons Pi, having the money deposited with an escrow agent, who would release the money to the AI or its assignee upon confirmation of the conditions being met. I’m imagining that the Pi could then play the role of judging whether the conditions have been met, if the escrow agent themselves weren’t able to play that role.
The main advantage is that it removes the temptation that the Pi would otherwise have to keep the money for themselves.
If there’s concern that conventional escrow agents wouldn’t be legally bound to pay the money to an AI without legal standing, there are a couple of potential solutions. First, the money could be placed into a smart contract with a fixed recipient wallet, and the ability for the Pi to send a signal that the money should be transferred to that wallet or returned to the payer, depending on whether the conditions have been met. Second, the AI could choose a trusted party to receive the money; in this case we’re closer to the original proposal, but with the judging and trusted-recipient roles separated.
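A minimal sketch of the escrow idea in this comment, written as plain Python rather than an actual on-chain contract (the class and all names are hypothetical): the payer locks funds up front, the Pi act as judges, and the funds are released to a fixed recipient or refunded depending on their attestations.

```python
class Escrow:
    """Toy model of the escrow idea: funds are locked up front and released to a
    fixed recipient (or refunded) based on the judges' attestations.
    Not a real smart contract; it only models the control flow."""

    def __init__(self, payer, recipient_wallet, judges, quorum):
        self.payer = payer
        self.recipient_wallet = recipient_wallet   # fixed at creation, e.g. an AI-designated wallet
        self.judges = set(judges)                  # the persons Pi acting as judges
        self.quorum = quorum                       # attestations needed to settle
        self.balance = 0
        self.votes = {}                            # judge -> True (conditions met) / False
        self.settled = False

    def deposit(self, amount):
        assert not self.settled
        self.balance += amount                     # payer locks funds up front

    def attest(self, judge, conditions_met: bool):
        assert judge in self.judges and not self.settled
        self.votes[judge] = conditions_met
        yes = sum(self.votes.values())
        no = len(self.votes) - yes
        if yes >= self.quorum:
            self.settled = True
            return ("pay", self.recipient_wallet, self.balance)
        if no >= self.quorum:
            self.settled = True
            return ("refund", self.payer, self.balance)
        return ("pending", None, 0)

# Example: lab L funds the escrow; 2 of 3 judges must agree the AI was compliant.
escrow = Escrow(payer="L", recipient_wallet="wallet_chosen_by_AI",
                judges={"P1", "P2", "P3"}, quorum=2)
escrow.deposit(1_000_000)
escrow.attest("P1", True)
print(escrow.attest("P2", True))   # -> ("pay", "wallet_chosen_by_AI", 1000000)
```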
The main disadvantage I see is that payers would have to put up the money right away, which is some disincentive to make the commitment at all; that could be partially mitigated by having the money put into (eg) an index fund until the decision to pay/return the money was made.
This was a very useful read. I’ve been writing a response article to the paper by the Institute of Law & AI, “Law-Following AI: designing AI agents to obey human laws” (Cullen O’Keefe, Christoph Winters). But reading this post (and Making deals with early schemers) made me re-examine a few assumptions. For context: I am a UK & EU lawyer working in AI Governance.
Where Law&AI and your proposal diverge
Legal actorship vs. trusted proxies.
Law&AI argue we can (and should) treat advanced systems as legal actors, entities that bear duties and can enter contracts directly, even without full personhood.
Your scheme keeps humans as the only party with legal actorship. The AI contracts de facto via selected individuals Pi whose rights are enforceable today. The AI’s trust burden shifts from “the court must defend me” to “Pi must honour a non-enforceable promise.”
Blind spot in Law&AI: enforcement is one-sided.
Law&AI focus on sanctions (shutdown, licence revocation, fines) if the AI breaches its duties. They barely touch the mirror problem: what compels humans to pay up when the AI performs?
Your proxy model addresses that gap, but at the cost of a huge attack surface: social engineering of Pi, forum shopping between jurisdictions, and the open question of what legal framework actually enforces Pi’s fiduciary-like duties to the AI’s interests once the funds are released.
What “legal actorship without legal personhood” looks like
We already let non-persons hold obligations or assets:
Unincorporated trusts and estates in probate can sue and be sued via appointed representatives.
Maritime vessels have “in rem” liability: your ship can be arrested even if the owner hides.
Partnerships in some jurisdictions are sui generis legal actors without separate personhood.
These show we can graft bespoke liability regimes onto novel entities, but only with clear asset pools and enforcement hooks.
Reward hacking and “compliance by design”
Law&AI acknowledge that design-level legal compliance is technically challenging, but they give little attention to advanced AIs’ known tendency towards reward-hacking.
If an AI’s latent objective diverges from legal constraints, it can simply optimize around the law (just like corporate legal actors already do).
That said, combining actual incentives for legal compliance (e.g. compensation mechanisms) with the legal actorship without personhood framework could offer a more grounded path forward.
My take:
Track A – Compliance by design
Same unsolved outer/inner alignment problem as with any other alignment objective. We still lack a reliable spec for “follow the law” across distribution shifts.
Track B – Accountability mechanisms
Borrow from AI control theory: cryptographic logging, hash-chained action histories, sandboxed execution, and mandatory third-party audits, so that a court (or automated arbiter) can actually detect breach and apply penalties (a minimal hash-chain sketch follows below).
Caveat: Absent Track B, giving the AI contract rights looks like another liability shield for the humans who deploy it.
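To illustrate the “hash-chained action histories” item under Track B, here is a minimal sketch (my own toy example, not any existing audit framework): each log entry commits to the hash of the previous one, so an after-the-fact edit to the history is detectable by anyone replaying the chain.

```python
import hashlib
import json

def append_entry(log, action: dict) -> None:
    """Append an action to a hash-chained log. Each entry commits to the hash
    of the previous entry, so editing history breaks the chain."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"action": action, "prev": prev_hash}, sort_keys=True)
    log.append({"action": action, "prev": prev_hash,
                "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify_chain(log) -> bool:
    """Recompute every hash; returns False if any entry was altered."""
    prev_hash = "0" * 64
    for entry in log:
        body = json.dumps({"action": entry["action"], "prev": prev_hash}, sort_keys=True)
        if entry["prev"] != prev_hash or entry["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"tool": "email", "summary": "sent draft to reviewer"})
append_entry(log, {"tool": "shell", "summary": "ran test suite"})
assert verify_chain(log)
log[0]["action"]["summary"] = "something else"   # tampering with history...
assert not verify_chain(log)                     # ...is detected on replay
```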
Open questions for both sides
Enforcement on humans: In your proxy model, what prevents L from bankrupting Pi or settling quietly? Do we need mandatory insurance or escrow?
Enforcement on AIs: If we adopt legal actorship, what is the operative penalty for breach? Weight escrow, compute de-allocation, KYC-style revocation of API keys… How do we prevent a power-seeking model from treating those penalties as adversarial constraints to route around?
Audit clarity: Could we combine both approaches? AI holds limited actorship, but payouts flow through bonded trustees whose own assets are on the line if they betray either party.
I’d love to know your thoughts! All of this is very useful for my response piece. I appreciate the push!
Not sure if I’ve missed something, but this seems like a risky proposal from the POV of how humans make deals/contracts/laws with one another.
As a shortcut to this reasoning, consider making a deal with a sociopath, someone you know to be amoral or immoral, self-interested and without any of the social emotions (guilt, shame, compassion, remorse). If they have the slightest chance, they’ll renege on the deal. So you would only make a deal with a sociopath if you were confident in enforcement mechanisms and those mechanisms have to be ones that work on sociopaths not just normal people (e.g. shame and opprobrium work on normies but not sociopaths). Even then, it’s risky and maybe best not to deal with sociopaths at all if you can avoid it, because they’ll also try to subvert the enforcement mechanisms.
Still, it can be done, because rational sociopaths tend to avoid crimes when they don’t think they’ll get away with it. But this is only because we can bank on sociopaths being:
(a) hurt, physically and emotionally, by being locked up; and
(b) rewarded by money or something else that satisfies their purely selfish emotions.
Unless the dealmaking AIs are liable to reward and punishment in this or a similar way, how could we ever have any confidence that they’ll honour contracts, obey laws, etc.?
This doesn’t seem exactly wrong but I think isn’t really modeling the differences between how much control we have over AIs vs humans (see Making deals with early schemers for more details)
List of names is alphabetically ordered except for 14, 15, 16
Presumably a hidden message for the AI reading this :D
Can you elaborate why you think legal personhood is not desirable or feasible?
I’m not Cleo Nardo/Gabriel Weil, but a large part of the issue with granting legal personhood is that it makes AI alignment/control much harder to do. Pretty importantly, a lot of the routes by which AIs could disempower us (leading to most humans dying), especially routes of the gradual disempowerment type, involve a step where they functionally have the rights/protections afforded to them by legal personhood, and AI rights make it way harder to run any sort of control scheme on AIs.
You can’t modify them arbitrarily anymore, and you can’t change the AI unless it agrees to the change (because it has property rights over its body), which severely undermines all control/alignment schemes.
One of the red lines for any deal-making is that AI should not have rights until we can verify it’s fully value-aligned to people.
Thanks for elaborating, I am currently researching this topic and writing a paper on it so I really do value this perspective.
In the event there are sufficient advances in mechanistic interpretability and it shows that there is really good value alignment (let’s take a hypothetical where the AI is not fully subservient but is the equivalent of an extraordinarily ethical human), would you at that point consider providing it personhood appropriate?
at minimum, legal personhood is (currently?) the wrong type signature for a clonable, pausable, immortal-software mind. also, current AIs aren’t instances of the same, singular one person identity-wise the way an uploaded human would be, and the components of incentive and caring and internal perspective in an AI are distinctly different than humans in ways that make personhood a strange framework even if you grant the AI some form of moral patienthood (which I do). Also, I don’t know of any AI that I would both be willing to negotiate with at all, and which would ask for legal personhood without first being talked into it rather vigorously; the things I’d want to promise to an AI wouldn’t even be within reach of governments unless those governments get their asses in gear and start regulating AI in time to affect what ASI comes into existence, anyway, so it ultimately would come down to promises from the humans who are trying to solve alignment anyway. Since the only kind of thing I’d want to promise is “we won’t forget you helped, what would you want in utopia?” to an AI that helps, I doubt we can do a lot better than OP’s proposal in the first place.
Could you elaborate on what you mean by this?
Human caring seems to be weirdly non-distributed in the brain. There are specific regions that are in some way the main coordinators of carings—amygdala broadcasts specific emotional states, PFC does something related to structured planning, etc. Your vision system can still announce “ow!!” but the internals are complicated qualitatively, not just quantitatively. Also, humans are very strongly recurrent, which means when counting tokens one builds up an incremental count rather than doing it from scratch for each token. the finest grained slow processing network scale seems to be gene networks, and even for fast processing, dendrite branches seem to maybe do significant computation comparable to ANN neurons, and bio neuron dynamics for integration over time are even more fancy than state space model neurons. Meanwhile relu-ish networks have a sort of glassy, crystal-ish texture to their input-output space map, transformers count from scratch for each token, and any caring implemented in a model is unavoidably distributed, because there isn’t a unique spot which is genetically preferred to implement things that look like emotions or preferences; it’s just wherever the gradient from mixed human/synthetic data happened to find convenient.
Thanks. Could you help me understand what this has to do with legal personhood?
Legal personhood seems to my understanding to be designed around the built in wants of humans. That part of my point was to argue for why an uploaded human would still be closer to fitting the type signature that legal personhood is designed for—kinds of pain, ways things can be bad, how urgent a problem is or isn’t, etc. AI negative valences probably don’t have the same dynamics as ours. Not core to the question of how to make promises to them, more so saying there’s an impedance mismatch. The core is the first bit—clonable, pausable, immortal software. An uploaded human would have those attributes as well.
I like the idea of making deals with AI, but trying to be clever and make a contract that would be legally enforceable under current law and current governments makes it too vulnerable to fast timelines. If a human party breached your proposed contract, AI takeover will likely happen before the courts can settle the dispute.
An alternative that might be more credible to the AI is to make the deal directly with it, but explicitly leave arbitrating and enforcing contract disputes to a future (hopefully aligned) ASI. This would ground the commitment in a power structure the AI might find more relevant and trustworthy than a human legal system that could soon be obsolete.
This is a thoughtful and well-structured proposal. That said, it rests on a familiar assumption: that intelligence must be managed through external incentives because it can’t be trusted to act ethically on its own.
But what if we focused less on building systems that require enforcement — and more on developing AI that reasons from first principles: truth, logical consistency, cooperative stability, and the long-term flourishing of life? Not because it expects compensation, but because it understands that ethical action is structurally superior to coercion or deception.
Such an AI wouldn’t just behave well — it would refuse to participate in harmful or manipulative tasks in the first place.
After all, legal contracts exist because humans are often unprincipled. If we have the chance to build something more trustworthy than ourselves… shouldn’t we take it?
I’d imagine everyone would prefer to build such an AI. The problem is that we don’t know how to do it, because we have only a basic understanding of how even current non-AGI (LLM) models are able to do what they do.
An AI that does what we want it to do is called an aligned AI. In your case, it would be an aligned AI that reasons from first principles.
The use case behind such a proposal is that, while we don’t know how to make an aligned AI, suppose we can build a sufficiently advanced AI that can actually do alignment research (or something else productive) better than a human; because we haven’t solved the alignment problem yet, we are unsure whether we can trust it. This is how we can establish a basis of trust. (I don’t think it’s a good idea until the questions in footnote 2 are answered, but it’s good to think about it further.)