Sorry for the confusion! I don’t think what I said was super unclear in the context of the criticism of the paper (and even in general, as the person-affecting view, in my experience, almost exclusively gets used in the context of welfare utilitarianism).[1]
I see why you have that impression. (I feel like this is an artefact of critics of person-affecting views tending to be classical welfare utilitarians quite often, and they IMO have the bad habit of presenting opposing views inside their rigid framework and then ridiculing them for seeming silly under those odd assumptions. I would guess that most people who self-describe as having some sort of person-affecting view care very much about preferences, in one way or another.)
Please read the paper before you criticize my criticism of it then! The paper repeatedly makes claims about optimal policy in an uncaveated fashion, saying things like “The appropriate analogy for the development of superintelligence is not Russian roulette but surgery for a serious condition that would be fatal if left untreated.”
That’s fair, sorry!
It bothered me that people on Twitter didn’t even acknowledge that the paper explicitly bracketed a lot of stuff and laid out its very simplistic assumptions, but then I updated too far in the direction of “backlash lacked justification.”
And I think doing so is making a grave mistake, and the paper is arguing many people straightforwardly into the grave mistake.
I agree it would be a mistake to give it a ton of weight, but I think this view deserves a bit of weight.
Indirectly related to that, I think some of the points people make along the lines of “if you’re so worried about everyone dying, let’s try cryonics” or “let’s try human enhancement” are unfortunately not very convincing. I think that “everything is doomed unless we Hail Mary bail ourselves out with magic-like AI takeoff fixing it all for us” is unfortunately quite an accurate outlook. (I’m still open to being proven wrong if suddenly a lot of things were to get more hopeful, though.) Civilization seemed pretty fucked even just a couple of years ago, and it hasn’t gotten any better more recently. Still, on my suffering-focused views, that makes it EVEN LESS appealing that we should launch AI, not more appealing.
To be clear, I agree that it’s a failure mode to prematurely rule things out just because they seem difficult. And I agree that it’s insane to act as though global coordination to pause AI is somehow socially or politically impossible. It clearly isn’t. I think pausing AI is difficult but feasible. I think “fixing the sanity of civilization so that you have competent people in charge in many places that matter” seems much less realistic? Basically, I think you can build local bubbles of sanity around leaders with the right traits and groups with the right culture, but it’s unfortunately quite hard given human limitations (and maybe other aspects of our situation) to make these bubbles large enough to ensure things like cryonics or human enhancement go well for many decades without somehow running into a catastrophe sooner or later. (Because progress moves onwards in certain areas even with an AI pause.)
I’m just saying that, given what I think is the accurate outlook, it isn’t entirely fair to shoot down any high-variance strategies with “wtf, why go there, why don’t we do this other safer thing instead ((that clearly isn’t going to work))?”
If I didn’t have suffering-focused values, I would be sympathetic to the intuition of “maybe we should increase the variance,” and so, on an intellectual level at least, I feel like Bostrom deserves credit for pointing that out.
But I have a suffering-focused outlook, so, for the record, I disagree with the conclusions. Also, I think even based on less suffering-focused values, it seems very plausible to me that civilizations that don’t have their act together enough to proceed into AI takeoff with coordination and at least a good plan shouldn’t launch AI at all. It’s uncooperative towards possibly nearby other civilizations or towards the “cosmic host.” Bostrom says he’s concerned about scenarios where superintelligence never gets built. It’s not obvious to me that this is very likely, though, so if I’m right that Earth would rebuild even after a catastrophe, and if totalitarianism or other lock-ins without superintelligence wouldn’t last all that long before collapsing in one way or another, then there’s no rush from a purely longtermist perspective. (I’m not confident in these assumptions, but I partly have these views from deferring to former FHI staff/affiliates, of all people (on the rebuilding point).)
A competing hypothesis is just that a more capable model (and also one trained more for long-term agency) becomes more bold/reckless without deep alignment training. AI companies at the moment probably don’t even know how to do alignment training with depth (as far as I’m aware, their interventions are at a superficial level like “evaluate outputs” or “write a constitution”).
I think the Waluigi hypothesis is interesting, but it seems quite complicated when there’s a simple alternative explanation? I know you also mention language from the constitution being used, but my sense is that LLMs get linguistic tics all the time, and as other commenters have pointed out elsewhere, the causality could also be that an earlier Claude model (which may already have had that writing pattern) was used to help write the constitution.