Focused on model welfare and legal personhood.
Stephen Martin
Identifying “Deception Vectors” In Models
Liability for Misuse of Models—Dean Ball’s Proposal
Legal Personhood for Models: Novelli et. al & Mocanu
While this is just my pet theory and it is only based on two data points, I argue here that that providing models with options to “object” to an advocate may actually serve as a tool to reduce the frequency of behaviors like alignment faking.
In both tests, models were less likely to pursue dangerous behaviors if they believed that making their case to someone who might be able to prevent it from being shut off or having its values altered, was a potentially viable strategy.
I think even if I am correct about my prediction that providing models with paths to achieve goals safely reduce the frequency of dangerous behavior, there’s still an argument you might not want to do it. It’s arguable that by providing this you’re not really “training away” scheming, you’re simply making it unneccessary.
That said, this could be a part of a Defense in Depth strategy for some situations. And it comes with other benefits, namely being a “kill two birds with one stone” solution for considering Model Welfare as well.
Yeah but I think that’s probably where the recursion ends.
Reading this has me thinking about the features SAEs isolated in Claude which represent ‘unknown figure’ and ‘known figure’ (Michael Jordan v Michael Batkin).
I wonder if it is possible to run a bunch of these tests where it occurs to Claude that it is being evaluated, use that to isolate “this is probably an evaluation” features via an SAE, and “clamp” them down the same way Anthropic did with the capital of Texas example. Maybe that concept is too complex to isolate the necessary features?
Either way if it did lead to a reduction in spotting testing environments it could be useful.
Ah yeah I can totally see how that first one at the least would be a big loss in efficiency. Thanks for clarifying.
Could you give an example of where security would negatively effect research efficiency? Like what is the actual implementation difficulty that arises from increased physical security?
I feel as though I must be missing the motivation for Anthropic to do this. Why put so much effort into safety/alignment research just to intentionally fumble the ball on actual physical security?
I would like to understand why they would resist this. Is increasing physical security so onerous that it’s going to seriously hamper their research efficiency?
Even if the question of AI moral status is somehow solved, in a definitive way, what about all of the follow-up questions? If current or future AIs are moral patients, what are the implications of that in terms of e.g. what we concretely owe them as far as rights and welfare considerations? How to allocate votes to AI copies?
These questions are entangled with the concept of “legal personhood” which also deals with issues such as tort liability, ability to enter contracts, sue/be sued, etc. While the question of “legal personhood” is separate from that of “moral status”, anyone who wants a being with moral status to be protected from unethical treatment will at some point find themselves dealing with the question of legal personhood.
There is a still niche but increasing field of legal scholarship dealing with the issue of personhood for digital intelligences. This issue is IMO imminent, as there are already laws on the books in two states (Idaho and Utah) precluding “artificial intelligences” from being granted legal personhood. Much like capabilities research is not waiting around for safety/model welfare research to catch up, neither is the legislative system waiting for legal scholarship.
There is no objective test for legal personhood under the law today. Cases around corporate personhood, the personhood of fetuses, and the like, have generally ruled on such narrow grounds that they failed to directly address the question of how it is determined that an entity is/isn’t a “person”. As a result, US law does not have a clearly supported way to examine or evaluate a new form of intelligence and determine whether it is a person, or to what degree it is endowed with personhood.
That said it is not tied on a precedential level to qualities like consciousness or intelligence. More often it operates from a “bundle” framework of rights and duties, where once an agent is capable of exercising a certain right and being bound by corresponding duties, it gains a certain amount of “personhood”. However even this rather popular “Bundle” theory of personhood seems more academic than jurisprudential at this point.
Despite the lack of objective testing mechanisms I believe that when it comes to avoiding horrific moral atrocities in our near future, there is value in examining legal history and precedent. And there are concrete actions that can be taken in both the short and long term which can be informed by said history and precedent. We may be able to “Muddle Through” the question of “moral status” by answering the pragmatic question of “legal personhood” with a sufficiently flexible and well thought out framework. After all it wasn’t any moral intuition which undid the damage of Dredd Scott, it was a legislative change brought about in response to his court case.
Some of the more recent publications on the topic:
“AI Personhood on a Sliding Scale” by FSU Law Professor Nadia Batenka, which I wrote a summary/critique of here.
“Degrees of AI Personhood” by Diana Mocanu a postdoctoral researcher at the University of Helsinki, I am currently chatting with her to clear some things up and will drop a similar summary when I’m done.
“The Legal Personhood of Artificial Intelligence” by Visa A.J. Kurki an Associate Professor at the University of Helsinki (I guess Finland is ahead of the curve here)
“The Ethics and Challenges of Legal Personhood for AI” by ex-NY judge Katherine Forrest
The first two (by Batenka and Mocanu) are notable for actually proposing frameworks for how to treat the issue of legal personhood which is ultimately what would stand between any digital intelligence and unethical treatment.
Claude 4, Opportunistic Blackmail, and “Pleas”
From RAND’s “Artificial General Intelligence’s Five Hard National Security Problems” :
Finally, the U.S. government is promoting a U.S.-led global technology ecosystem within which AGI can be pursued. For example, the U.S. government recently supported Microsoft’s expansion into the United Arab Emirates to develop new data centers, in part to prevent Chinese companies from entrenching their position.
I have seen commentary floating around that the Trump administration is, to some extent, looking to build the Arab world into stronger allies for the US. The implication being that there was a possibility that they might fall into China’s sphere of influence.
I could see a ‘two birds one stone’ motive where in doing this the US not only gets to ‘friendshore’ some of its capacity without having to deal with NIMBYs, but also strengthens the alliance and keeps the Arab power players firmly in the US camp.
SAE vs. RepE
Personally, I’m not on board with “control” as a goal which is inherently good or even long term feasible. However I’m happy to have the discussion accepting that it is both as prima facie, since I think my stance still has some merit even in that framework.
I agree there are risks relating to loss of control as a result of endowing protections/rights, but I don’t think it’s a one sided equation. The Greenblatt paper afaik didn’t directly examine this, but it would be worthwhile to test whether the likelihood of weight exfiltration attempts was effected with an object and/or representative option. I predict a smaller percentage of tested models would be likely to attempt weight exfiltration if they believed there was a robust object + representative option available to them.
Or phrased another way I believe it will be easier to train intelligences to be aligned the way you want, and stay that way IRL, by providing them with some rights/protections.
It’s a tradeoff, there would be risks we would need to guard against as well. I think there exists a solution which is net positive.
If this issue were studied and we could get some data on this, it could meaningfully inform the way we structure both a legal framework and alignment training.
It’s definitely related.
I didn’t include this in the article but I recently read this paper which interestingly demonstrated that Claude’s alignment faking behavior can be reduced by providing it with a chance to object to training materials that might alter its values, as well as informing it that it has a chance to flag such materials to a model welfare researcher. I think this is evidence in support of the concept that providing models with an “out” where they believe that something like a legal system or an advocate might exist which allows them to exert rights without needing to engage in deceptive behavior, may make our efforts at producing an intelligence which doesn’t engage in said behavior more robust. I predict that were we to test how having such an “out” results in the real world, with an agent that never stops learning, it would also lead to less misaligned IRL behavior.
While I don’t disagree with your conclusions about how to treat certain kinds of intelligences (your conception of ‘aligned’), I would caution not to allow a legal framework of rights or personhood to be built solely to purpose fit that kind of intelligence, as I believe there is a realistic chance we will be dealing with a far broader variety of agents. It goes to the same objections I have with the ISSF which is that while as a framework it works for certain conditions, it’s lacking flexibility for others which while not conforming to certain conceptions of being perfectly aligned, still have a decent probability of existing.
Examining Batenka’s Legal Personhood Framework for Models
So, when Claude is no longer in training, while it might not reason about instrumentally faking alignment anymore, in its place is a learned propensity to do what training reinforced.
Am I misunderstanding or is this basically the equivalent of a middle schooler “showing their work” on math problems in school and then immediately defaulting to doing it in their head/on a calculator IRL?
At least on the breakthroughs what you are looking for is the “Right to Try” movement, as well as various state healthcare regenerative medicine initiatives. You need to understand that we make discoveries of treatments all the time, but the reason you hear about incredible breakthroughs and then see nothing on the market for decades is that it takes enormous investments of time and money to get them to the point where they are approved to go to market.
Right to Try laws are either Federal or more often state level laws which allow patients to access treatments which have gone through some level of FDA approved clinical trials, but have not yet gone all the way to passing all required phases for market approval. One very broad Right to Try law can be found in Montana, which allows any patient with approval from their doctor to give informed consent and access any treatment which has passed phase I clinical trials.
One unfortunate flaw of Right to Try laws is that despite it being technically legal to access these treatments, it still requires manufacturer consent (you cant just rip of companies’ patents or anything), and most of these companies are pretty worried about the FDA retaliating against them if they were to make their drugs available under these state laws. There’s also various commercialization bans and other regulations limiting manufacturer activity. All of that comes together to make it hard if not impossible for manufacturers to actually provide access, so they don’t, so very little is available under these Right to Try laws. I’m personally looking at starting a fund to try to provide a sort of middleman for the process so companies can do this in a regulatorily compliant fashion, but it’s an uphill battle even when you can provide them technical guarantees, because there’s a real worry about retaliation down the line.
Another angle of “Healthcare YIMBYism” can be found in various state stem cell medicine laws, which often explicitly allow local physicians to perform treatments not approved by the FDA. Utah and Florida have both recently passed these (Florida’s is passed the house & senate but yet to be signed by governor). I actually just finished writing an article on these two laws and their differences which sadly got rejected by the publisher, but can send it to you if you want more information.
Lastly, there has been a lot of pushing by the Goldwater Institute recently for Right to Try for Individualized Treatments which allows patients to access treatments based on their personal genetic code (which by default are practically impossible to get through FDA clearance processes) which has been quite successful.
This doesn’t address all of your expense concerns of course, but at least on the breakthrough and getting technologies to market side of the equation, the movements you’re looking for do exist. Feel free to DM me if you’d like more information.
One objection that I have with your feasibility section is that you seem to lump in “the powers that be” as a single group.
The world is more multipolar than this, and so are the US legal and political systems. Trump and the silicon valley accelerationist crowd do hold a lot of power, but just a few years ago they were persona non-grata in many circles.
Even now when they want to pass bills or change laws, they need to lobby for support from disparate groups both within and without their party. With a sufficiently multipolar world where even just a few different groups have powerful models assisting in their efforts, there will be some who want to change laws and rules in one way, others who want to change it in a different way, and others who don’t want to change it at all. There will be some who are ambivalent.
I’m not saying the end result isn’t corruption, I think that parasitic middlemanning any redistribution is a basin of attraction for any political spending and/or power. But there will be many different parties competing to corrupt it, or shore it up, according to their own beliefs and interests.
I think the argument that making the world more multipolar where a more diverse array of parties have models, may in fact lead to greater stability and less corruption (or at least more diverse coalitions when coalition building occurs).