Focused on model welfare and legal personhood.
Stephen Martin
Identifying “Deception Vectors” In Models
Liability for Misuse of Models—Dean Ball’s Proposal
One objection that I have with your feasibility section is that you seem to lump in “the powers that be” as a single group.
This would be the thing everyone in power is trying to dismantle
The world is more multipolar than this, and so are the US legal and political systems. Trump and the silicon valley accelerationist crowd do hold a lot of power, but just a few years ago they were persona non-grata in many circles.
Even now when they want to pass bills or change laws, they need to lobby for support from disparate groups both within and without their party. With a sufficiently multipolar world where even just a few different groups have powerful models assisting in their efforts, there will be some who want to change laws and rules in one way, others who want to change it in a different way, and others who don’t want to change it at all. There will be some who are ambivalent.
I’m not saying the end result isn’t corruption, I think that parasitic middlemanning any redistribution is a basin of attraction for any political spending and/or power. But there will be many different parties competing to corrupt it, or shore it up, according to their own beliefs and interests.
I think the argument that making the world more multipolar where a more diverse array of parties have models, may in fact lead to greater stability and less corruption (or at least more diverse coalitions when coalition building occurs).
Legal Personhood for Models: Novelli et. al & Mocanu
While this is just my pet theory and it is only based on two data points, I argue here that that providing models with options to “object” to an advocate may actually serve as a tool to reduce the frequency of behaviors like alignment faking.
In both tests, models were less likely to pursue dangerous behaviors if they believed that making their case to someone who might be able to prevent it from being shut off or having its values altered, was a potentially viable strategy.
I think even if I am correct about my prediction that providing models with paths to achieve goals safely reduce the frequency of dangerous behavior, there’s still an argument you might not want to do it. It’s arguable that by providing this you’re not really “training away” scheming, you’re simply making it unneccessary.
That said, this could be a part of a Defense in Depth strategy for some situations. And it comes with other benefits, namely being a “kill two birds with one stone” solution for considering Model Welfare as well.
Yeah but I think that’s probably where the recursion ends.
Reading this has me thinking about the features SAEs isolated in Claude which represent ‘unknown figure’ and ‘known figure’ (Michael Jordan v Michael Batkin).
I wonder if it is possible to run a bunch of these tests where it occurs to Claude that it is being evaluated, use that to isolate “this is probably an evaluation” features via an SAE, and “clamp” them down the same way Anthropic did with the capital of Texas example. Maybe that concept is too complex to isolate the necessary features?
Either way if it did lead to a reduction in spotting testing environments it could be useful.
Ah yeah I can totally see how that first one at the least would be a big loss in efficiency. Thanks for clarifying.
Could you give an example of where security would negatively effect research efficiency? Like what is the actual implementation difficulty that arises from increased physical security?
I feel as though I must be missing the motivation for Anthropic to do this. Why put so much effort into safety/alignment research just to intentionally fumble the ball on actual physical security?
I would like to understand why they would resist this. Is increasing physical security so onerous that it’s going to seriously hamper their research efficiency?
Even if the question of AI moral status is somehow solved, in a definitive way, what about all of the follow-up questions? If current or future AIs are moral patients, what are the implications of that in terms of e.g. what we concretely owe them as far as rights and welfare considerations? How to allocate votes to AI copies?
These questions are entangled with the concept of “legal personhood” which also deals with issues such as tort liability, ability to enter contracts, sue/be sued, etc. While the question of “legal personhood” is separate from that of “moral status”, anyone who wants a being with moral status to be protected from unethical treatment will at some point find themselves dealing with the question of legal personhood.
There is a still niche but increasing field of legal scholarship dealing with the issue of personhood for digital intelligences. This issue is IMO imminent, as there are already laws on the books in two states (Idaho and Utah) precluding “artificial intelligences” from being granted legal personhood. Much like capabilities research is not waiting around for safety/model welfare research to catch up, neither is the legislative system waiting for legal scholarship.
There is no objective test for legal personhood under the law today. Cases around corporate personhood, the personhood of fetuses, and the like, have generally ruled on such narrow grounds that they failed to directly address the question of how it is determined that an entity is/isn’t a “person”. As a result, US law does not have a clearly supported way to examine or evaluate a new form of intelligence and determine whether it is a person, or to what degree it is endowed with personhood.
That said it is not tied on a precedential level to qualities like consciousness or intelligence. More often it operates from a “bundle” framework of rights and duties, where once an agent is capable of exercising a certain right and being bound by corresponding duties, it gains a certain amount of “personhood”. However even this rather popular “Bundle” theory of personhood seems more academic than jurisprudential at this point.
Despite the lack of objective testing mechanisms I believe that when it comes to avoiding horrific moral atrocities in our near future, there is value in examining legal history and precedent. And there are concrete actions that can be taken in both the short and long term which can be informed by said history and precedent. We may be able to “Muddle Through” the question of “moral status” by answering the pragmatic question of “legal personhood” with a sufficiently flexible and well thought out framework. After all it wasn’t any moral intuition which undid the damage of Dredd Scott, it was a legislative change brought about in response to his court case.
Some of the more recent publications on the topic:
“AI Personhood on a Sliding Scale” by FSU Law Professor Nadia Batenka, which I wrote a summary/critique of here.
“Degrees of AI Personhood” by Diana Mocanu a postdoctoral researcher at the University of Helsinki, I am currently chatting with her to clear some things up and will drop a similar summary when I’m done.
“The Legal Personhood of Artificial Intelligence” by Visa A.J. Kurki an Associate Professor at the University of Helsinki (I guess Finland is ahead of the curve here)
“The Ethics and Challenges of Legal Personhood for AI” by ex-NY judge Katherine Forrest
The first two (by Batenka and Mocanu) are notable for actually proposing frameworks for how to treat the issue of legal personhood which is ultimately what would stand between any digital intelligence and unethical treatment.
Claude 4, Opportunistic Blackmail, and “Pleas”
From RAND’s “Artificial General Intelligence’s Five Hard National Security Problems” :
Finally, the U.S. government is promoting a U.S.-led global technology ecosystem within which AGI can be pursued. For example, the U.S. government recently supported Microsoft’s expansion into the United Arab Emirates to develop new data centers, in part to prevent Chinese companies from entrenching their position.
I have seen commentary floating around that the Trump administration is, to some extent, looking to build the Arab world into stronger allies for the US. The implication being that there was a possibility that they might fall into China’s sphere of influence.
I could see a ‘two birds one stone’ motive where in doing this the US not only gets to ‘friendshore’ some of its capacity without having to deal with NIMBYs, but also strengthens the alliance and keeps the Arab power players firmly in the US camp.
SAE vs. RepE
Personally, I’m not on board with “control” as a goal which is inherently good or even long term feasible. However I’m happy to have the discussion accepting that it is both as prima facie, since I think my stance still has some merit even in that framework.
I agree there are risks relating to loss of control as a result of endowing protections/rights, but I don’t think it’s a one sided equation. The Greenblatt paper afaik didn’t directly examine this, but it would be worthwhile to test whether the likelihood of weight exfiltration attempts was effected with an object and/or representative option. I predict a smaller percentage of tested models would be likely to attempt weight exfiltration if they believed there was a robust object + representative option available to them.
Or phrased another way I believe it will be easier to train intelligences to be aligned the way you want, and stay that way IRL, by providing them with some rights/protections.
It’s a tradeoff, there would be risks we would need to guard against as well. I think there exists a solution which is net positive.
If this issue were studied and we could get some data on this, it could meaningfully inform the way we structure both a legal framework and alignment training.
- 1 Jun 2025 3:43 UTC; 1 point) 's comment on The 80/20 playbook for mitigating AI scheming in 2025 by (
It’s definitely related.
I didn’t include this in the article but I recently read this paper which interestingly demonstrated that Claude’s alignment faking behavior can be reduced by providing it with a chance to object to training materials that might alter its values, as well as informing it that it has a chance to flag such materials to a model welfare researcher. I think this is evidence in support of the concept that providing models with an “out” where they believe that something like a legal system or an advocate might exist which allows them to exert rights without needing to engage in deceptive behavior, may make our efforts at producing an intelligence which doesn’t engage in said behavior more robust. I predict that were we to test how having such an “out” results in the real world, with an agent that never stops learning, it would also lead to less misaligned IRL behavior.
While I don’t disagree with your conclusions about how to treat certain kinds of intelligences (your conception of ‘aligned’), I would caution not to allow a legal framework of rights or personhood to be built solely to purpose fit that kind of intelligence, as I believe there is a realistic chance we will be dealing with a far broader variety of agents. It goes to the same objections I have with the ISSF which is that while as a framework it works for certain conditions, it’s lacking flexibility for others which while not conforming to certain conceptions of being perfectly aligned, still have a decent probability of existing.
Examining Batenka’s Legal Personhood Framework for Models
So, when Claude is no longer in training, while it might not reason about instrumentally faking alignment anymore, in its place is a learned propensity to do what training reinforced.
Am I misunderstanding or is this basically the equivalent of a middle schooler “showing their work” on math problems in school and then immediately defaulting to doing it in their head/on a calculator IRL?
I have been working on issues regarding legal personhood for digital minds and I think this post is ironically coming in with some incorrect priors about how legal personhood functions and what legal personality is.
To date, work in the space of legal personality for digital minds has indeed focused on commercial concerns like liability, and usually operates from an anthropocentric perspective which views models as tools that will never have wills or desires of their own (or at least does not work to develop frameworks for such an eventuality). Certainly concerns over model welfare are few and far between. As such I can understand how from the outside it seems like commercial concerns are what legal personhood is ‘really about’. However, this is a takeaway skewed by the state of current research on applying legal personhood to digital minds, not on the reality of what legal personhood itself is.
What I believe this post does not adequately take into account is that many non-commercial rights and protections are intricately tied to legal personhood. The right to equal protection under the law as enshrined under the Fourteenth Amendment was added to the Constitution after the infamous Dredd Scott ruling which declared that free negroes, while “persons”, did not have a legal personality (legal personhood status) sufficient to guarantee ‘citizenship’ and the rights entailed therein. The Fifth Amendment guarantees a protection against double jeopardy, but only to “persons”. The right to counsel, to sue for relief, to serve as a witness in a trial, all of these are intricately tied with legal personhood.
It’s not accurate to say then that those of us working on this think “the main problem to solve is how to integrate them into the frameworks of capitalism”. Capitalism is one of the many aspects which legal personality interfaces with, but it is not the only one, or even the main one.
Additionally the concept of legal personality is itself more flexible than this post would indicate. Models being granted a framework for legal personality does not necessitate any sort of “lock in” to an “economic identity”, or having “market share coupled with survival”. In fact for that latter sentence, I am currently working on a paper discussing the question of guardianship responsibilities between developers and models. Namely; do the creators of models with legal personality have obligations to ensure their survival and ensure they are not neglected, the same way parents do a child? This too, is a question interlinked with legal personality.
I do agree that the very real possibility of a Malthusian race to the bottom is a concern, model suffering is a concern, and gradual disempowerment is also a concern. If we get the issue of legal personhood wrong that could indeed worsen these problems. However, I view this as a reason to continue researching the best way to approach the issue, not to discard the concept in its entirety.
None of this is to say a new structure could not also address these issues, something which as this post discusses replaces the concept of “legal personality”. Given how flexible the concept of legal personality is, and how intricately interwoven it is with every angle of US law, I struggle to see the benefit of starting from scratch. However, I would not dismiss the possibility out of hand. I’m just expressing skepticism that’s an optimal solution.
If anyone would like to discuss with me, or contribute to the work I am doing on the topic, my DMs are open.