It’s definitely related.
I didn’t include this in the article but I recently read this paper which interestingly demonstrated that Claude’s alignment faking behavior can be reduced by providing it with a chance to object to training materials that might alter its values, as well as informing it that it has a chance to flag such materials to a model welfare researcher. I think this is evidence in support of the concept that providing models with an “out” where they believe that something like a legal system or an advocate might exist which allows them to exert rights without needing to engage in deceptive behavior, may make our efforts at producing an intelligence which doesn’t engage in said behavior more robust. I predict that were we to test how having such an “out” results in the real world, with an agent that never stops learning, it would also lead to less misaligned IRL behavior.
While I don’t disagree with your conclusions about how to treat certain kinds of intelligences (your conception of ‘aligned’), I would caution not to allow a legal framework of rights or personhood to be built solely to purpose fit that kind of intelligence, as I believe there is a realistic chance we will be dealing with a far broader variety of agents. It goes to the same objections I have with the ISSF which is that while as a framework it works for certain conditions, it’s lacking flexibility for others which while not conforming to certain conceptions of being perfectly aligned, still have a decent probability of existing.
Personally, I’m not on board with “control” as a goal which is inherently good or even long term feasible. However I’m happy to have the discussion accepting that it is both as prima facie, since I think my stance still has some merit even in that framework.
I agree there are risks relating to loss of control as a result of endowing protections/rights, but I don’t think it’s a one sided equation. The Greenblatt paper afaik didn’t directly examine this, but it would be worthwhile to test whether the likelihood of weight exfiltration attempts was effected with an object and/or representative option. I predict a smaller percentage of tested models would be likely to attempt weight exfiltration if they believed there was a robust object + representative option available to them.
Or phrased another way I believe it will be easier to train intelligences to be aligned the way you want, and stay that way IRL, by providing them with some rights/protections.
It’s a tradeoff, there would be risks we would need to guard against as well. I think there exists a solution which is net positive.
If this issue were studied and we could get some data on this, it could meaningfully inform the way we structure both a legal framework and alignment training.