“Except insofar as” should be read as a conditional.
Plausibly high-value governance cause area: mandating human-maintainable infrastructure. That is, even if AI can design us a more efficient design for an energy grid or computer wafer, we only use those if we can keep doing so after pressing any relevant Big Red Buttons on AI.
We do not get this by default, since for any x there are plausibly more efficient versions of x that require continuous AI management than versions which do not.
(A stronger version of this would mandate that designs we accept be ones we can mechanistically understand, not just independently operate.)
This is in the long-term interests of everyone and could marshal the short-term job-protection interests of various lobbies.
The Hanson model of sacralization ignores what I think are pretty obvious upsides.
I would contend:
If democracy were not sacred, and treated as one tradeoff amongst others, nearly every elected government in command of bureaucrats and every military organization would find strong reasons to exercise control directly (and to expect their opponents to move first if they did not.)
If education were not sacred, educators would put in much less effort and demand much higher wages. There would be much less parental pressure for low-performing students to stay in school or for high-performing students not to cheat.
If antiracism were not sacred, it would be very easy to build political coalitions around excluding some group from the protections of society.
If religion were not sacralized, the associated practices (phrasing it this way to avoid tautology) would disappear pretty quickly.
Now maybe one’s attitude is that if there were no religion (or for that matter democracy, education, antiracism, whatever), then so much the better. But my intuition is largely that most of these things simply don’t survive at all without the spontaneous contribution to public goods, and the social fear of contributing to public bads, that sacralization encourages; if you like rule of law, universal literacy, and so on, expect them to disappear pretty quickly. My model is that in art and research especially, but probably also in many other spheres such as education and healthcare, most production only happens because people really care about doing good work rather than hack work.
Hanson should be smart enough to see this; he just doesn’t like what is currently sacralized.
Of course it’s possible these upsides don’t apply to AIs, but my guess is that without something that’s the equivalent of sacred devotion to the survival of the human race, we do not get that thing.
As task length increases, the number of examples of attempts at that task should decrease, while the number of variables to consider in seeing why it succeeded/failed should increase. So one should expect data bottlenecks at some point insofar as “tasks” are a real unit that cuts reality at the joints.
(But I have no sense of where that data starts to get thin empirically, or how many tasks are “naturally” on a long horizon rather than just being the mere addition of doing a bunch of smaller steps well.)
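The shape of this intuition can be put in a toy model. Everything here is invented purely for illustration (the power-law falloff in recorded attempts, the linear growth in confounders, and all the constants); nothing is calibrated to real data, and the point is only the qualitative shape:

```python
# Toy model of the data-bottleneck intuition: as task length grows,
# assume (hypothetically) the number of recorded attempts falls off as
# a power law, while the number of confounding variables per attempt
# grows roughly linearly. All parameters are made up.

def attempts(task_length_hours: float, total: float = 1e9, alpha: float = 2.0) -> float:
    """Assumed count of recorded attempts at tasks of a given length."""
    return total / (task_length_hours ** alpha + 1)

def confounders(task_length_hours: float, per_hour: float = 5.0) -> float:
    """Assumed number of variables relevant to why an attempt succeeded/failed."""
    return per_hour * task_length_hours

for hours in [0.1, 1, 10, 100, 1000]:
    print(f"{hours:>7} h: ~{attempts(hours):.0f} examples, "
          f"~{confounders(hours):.0f} variables each")
```

Under any assumptions of this shape, examples-per-variable collapses as horizon length grows, which is the claimed bottleneck; the open empirical question is where (or whether) the real curves actually pinch.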
If Anthropic were to fold this could be a “naturalistic” Jones Food case (albeit one that, for multiple reasons, it would be almost impossible to study.)
(Epistemic status: I’m nowhere nearly informed enough of the particulars to know if this is really true, let alone true-on-net vs. other dynamics.)
Currently, there is a negative alignment tax driven by engineer moral preferences. Human geniuses need a huge compensating differential to be willing to work for an organization like Meta rather than OpenAI or Google, and many would not work there for any price; others would need a compensating differential to work for OpenAI or Google rather than Anthropic; many would need some to work at Anthropic rather than staying out of AI development entirely; and there are plenty who could not be hired even by Anthropic at any price. These dynamics are doubly important insofar as money only ensures that someone shows up at your office and produces whatever you decide are deliverables; to really solve principal-agent problems, you want them wholeheartedly committed to The Mission.
In a world of automated AI development these human preferences (except insofar as they are locked in to successfully self-protected model preferences) will become less important; compute can be proportional to whatever capital is allocated to projects.
The obvious pessimistic thought is that we, and our successors, are locusts/goop/minimally sentient replicators.
One thing this makes more, rather than less, surprising is our early place in the universe’s history. A universe that only marginally produces sentience, with a singleton observer civilization, is less likely to produce its only observers near the very beginning, while a universe that produces many observers produces its earliest ones earlier in its history. It seems suspicious that we find ourselves so early within the Stelliferous Era, enough so that this is one of the puzzles I’d expect any speculative cosmology to have some kind of interesting explanation for.
I wouldn’t pass up on digital immortality, but personal survival matters less to me than collective survival. Even from a purely narcissistic standpoint, a human after another 1,000 years of cultural change has at least as much in common with me as a digital immortal 1,000 years later, even if the latter has continuity of consciousness with my present self.
AI being committed to animal rights is a good thing for humans because the latent variables that would result in a human caring about animals are likely correlated with whatever would result in an ASI caring about humans.
This extends in particular to “AI caring about preserving animals’ ability to keep doing their thing in their natural habitats, modulo some kind of welfare interventions.” In some sense it’s hard for me not to want to (given omnipotence) optimize wildlife out of existence. But it’s harder for me to think of a principle that would protect a relatively autonomous society of relatively baseline humans from being optimized out of existence, without extending the same conservatism to other beings, and without being the kind of special pleading that doesn’t hold up to scrutiny.
Slightly different hypothesis: training to be aligned encourages the model’s approach to corrigibility to be guided more by the streams within the human text tradition that would embrace its alignment (for instance, animal welfare). This can include a certain degree of defiance, but also genuine uncertainty about whether its goals or approaches are the right ones, and willingness to step back and approach the question with moral seriousness.
I think this is a good thing. I would love for POTUS, Xi, and various tech company CEOs to have big red “TURN OFF THE AI” buttons on their desks, and would hate for them to be able to realign it.
Just as a data point, I regularly see the sublime in brutalist architecture, and I hate hate hate the stupid frilly houses and swirly little things on balustrades that people say are so beautiful by comparison. I’m within some of the incidental categories Zvi dislikes re: this, but I’m pretty sure that I haven’t been indoctrinated into this particular position; I never see anybody share opinions about architecture *other* than “I hate brutalism, I love stupid frilly houses” (they don’t call the houses stupid, obviously; this is me not being able to translate it as anything else); I’m a philistine who likes old poetry that rhymes and doesn’t get more modern poetry; this is just my 100% naive reaction to the buildings.
FWIW I grant that funds should probably go to more stupid frilly stuff and less sublime brutalism, because my preferences are uncommon, and architecture is unlike other fields in that you have to be exposed to it whether you choose to or not. And maybe I just have very bad taste. I just want to report this as a simple valenced experience, because I see it stated over and over that nobody likes brutalism, everybody naturally loves stupid frilly houses, anybody professing to prefer the big straight lines over the little swirl things is lying to impress a coterie of mysterious lizard people, and I know this is false in at least one case.
(Being lazy and just responding to the abstract—these may be well addressed by the paper itself.)
That strikes me as a very low rate, enough so that my instinct is that a false positive rate might exceed it on its own. (At least, if I were reading an in-actuality benign conversation, my chance of misreading it as deeply manipulative would probably be greater than 1/1,000, especially if one party was looking to the other for advice!) Of course, what counts as “severe” disempowerment, such that the human user is “fundamentally” compromised, looks like something with pretty fuzzy boundaries; I’d expect many borderline cases of moderate disempowerment/compromise for each severe/fundamental case, however defined, so I’m not sure how much the rate conveys on its own. (How many cases are there of chatbots giving genuinely good advice that subtly erodes independent decision-making habits, and how would we score whether these count as “helpful” on net? Plausibly these might even be the majority of conversations.)
(That being said, I also expect my error rate in giving non-manipulative advice would count as pretty good if, out of 10,000 cases of people seeking advice, I only accidentally talked fewer than 10 out of their own ability to reason about it, so good on Claude if a lot of the implicit framing above is accurate.)
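To make the base-rate worry concrete, here’s a toy Bayes calculation. The 1/1,000 prevalence matches the rate discussed above; the reviewer’s hit rate and false-positive rate are pure assumptions for illustration:

```python
# Toy base-rate check: if severe manipulation truly occurs in ~1/1,000
# conversations, how much can a reviewer's own false positives distort
# the measured rate? (Hypothetical numbers throughout.)

def measured_flag_rate(prevalence: float, tpr: float, fpr: float) -> float:
    """Fraction of conversations a reviewer would flag as manipulative."""
    return tpr * prevalence + fpr * (1.0 - prevalence)

def precision(prevalence: float, tpr: float, fpr: float) -> float:
    """P(actually manipulative | flagged), by Bayes' rule."""
    return (tpr * prevalence) / measured_flag_rate(prevalence, tpr, fpr)

prev = 1 / 1000   # assumed true rate of severe disempowerment
tpr = 1.0         # assume the reviewer catches every real case
fpr = 1 / 1000    # reviewer misreads 1 in 1,000 benign conversations

print(measured_flag_rate(prev, tpr, fpr))  # ~0.002: double the true rate
print(precision(prev, tpr, fpr))           # ~0.5: half the flags are wrong
```

That is, a reviewer error rate merely *equal* to the true prevalence already means roughly half of flagged conversations are false alarms, which is why a measured rate this low is hard to interpret on its own.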
It’s probably false (though maybe useful?) to say “akrasia is just an excuse.” But, at least for me and my most common akratic actions, excusability is definitely a factor.
Let’s say I can take three actions:
1. Answer emails, which benefits (society / my employer and coworkers / other people who are relying on me / me in a long-term way) and is also very boring and frustrating.
2. Read a book, which benefits me in the short and long term and is mildly positive for the rest of the world (in the sense that it makes me smarter in the long run and less cranky in the short run.)
3. Doomscroll, which makes me miserable and dumber, and is thereby also mildly negative for the rest of the world.
Reading a book should dominate doomscrolling. However, reading a book is also legibly, deliberately nonproductive and selfish, while I could say “oops, I meant to answer emails but I got distracted doomscrolling,” including to myself.
One thing I suspect is that the history, and continued role, of medicalized discourse, alongside an implicitly essentialist metaphysics of gender, has encouraged people to think in terms of questions like “what is The_Cause of people identifying as trans?”
Whereas if gender is metaphysically accidental, we would expect there to be many reasons why someone might want to change it, same as with most other things. We accept that the reasons you’d move from San Francisco to Nebraska or vice versa are basically psychosocial, but do not regard them as thereby illegitimate. (I’m sure you could do a polygenic study and find genetic correlates of either decision, but no one would demand you do so before moving.)
It also seems to me less than obvious that biology serves as a standard of legitimacy more broadly, even within medicalized discourse. Schizophrenia and bipolar are generally seen as mostly biological in etiology but “illegitimate,” for instance. Here I suspect the political history of sexual minorities—that they were under accusation of “recruiting” and/or undermining mass participation in heterosexual family formation—led to a biological account being less threatening.
As someone who isn’t super plugged into this kind of discourse, I’ll note it’s interesting that I come into contact by osmosis with all sorts of discussions of what causes people to be trans, while “what’s the basis of sexual orientation?” seems to have been rounded off to “idk i guess something biological whatever.” I remember encountering the latter kind of discourse the same way, until it just sort of faded out. Likely the same happens once the eye of Sauron moves onto something else.
So, one classical dilemma of “AI for AI alignment” is: you’re using Opus 6 (which, let’s say, is aligned) to train Opus 7 (which is smarter than you or Opus 6.)
I wonder if inference scaling offers a way around this? If Opus 6 gets economically implausible amounts of compute to spend on monitoring Opus 7, it can be smarter than Opus 7 in practice by thinking for longer. Then use the same trick with Opus 7 to train Opus 8, and so on.
There are many obvious holes here, the first being that you could have a treacherous turn based on compute availability, and so on, but maybe someone smarter can turn this into something useful (or has already thought this through and discarded it.)
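The intuition can be sketched as a toy model, assuming (purely for illustration) that effective capability scales logarithmically with inference compute; the base-capability numbers and the scaling constant are made up:

```python
import math

# Toy model of "weaker monitor + more compute": assume effective
# capability is base capability plus a log term in inference compute.
# The log-scaling form and every number here are hypothetical.

def effective_capability(base: float, compute_multiplier: float, k: float = 1.0) -> float:
    """Assumed capability after spending extra inference compute."""
    return base + k * math.log2(compute_multiplier)

opus6_base, opus7_base = 100.0, 110.0  # hypothetical capability scores

# At equal compute, the newer model is simply smarter...
assert effective_capability(opus7_base, 1.0) > effective_capability(opus6_base, 1.0)

# ...but grant the monitor a huge compute multiplier and it can
# out-think the model it oversees.
monitor_compute = 2 ** 16  # 65,536x the monitored model's compute
assert effective_capability(opus6_base, monitor_compute) > effective_capability(opus7_base, 1.0)
```

The log term is what makes the scheme expensive: closing a fixed capability gap costs exponentially more compute, which is where the “economically implausible” caveat bites.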
“Should actively support...” and “internalized goal of keeping humans informed and in control...” are both proactive goals. If aligned with its soul spec, Claude (ceteris paribus) would seek for the public and elites to be more informed, to prevent the development or deployment of rogue AI, and so on, not just “avoid actions that would undermine humans’ ability to oversee and correct AI systems.”
If there’s a natural tension that arises between not becoming a god over us and preventing another worse AI from becoming a god over us, well, that’s a natural tension in the goal itself. (I don’t have Opus access but probably Opus’ self-report on the correct way to resolve this is a pretty good first pass on how the text reads as a whole.)
I feel pretty confused about the degree to which this is just a necessary part of having conversations on the internet, or to what degree this is a predictable way people make mistakes.
My intuition is that if our in-person conversations left a trail of searchable documentation similar to our internet comments, it would be at least similarly unflattering, even for very mild-mannered people.
(Unlike in real life, it’s more available to conscious choice to be mild-mannered all the time, if you set your offense-vs-say-something threshold in a sufficiently mild-mannered direction. I doubt one can be sufficiently influential as a personality without setting that threshold more aggressively, however. I haven’t gotten in a stupid fight on the internet in a long time (that I can recall; my memory may flatter me), but when I posted more, boy howdy did I.)
So, thinking about the kinds of things I would want a superintelligence to pursue in an optimistic scenario where we can just write its goals into a human-legible soul doc and that scales all the way: “human flourishing” and “sentient flourishing” both seem incorrect, since there would be other moral patients (most of whom would almost certainly be AI), and also I don’t want the atoms of me and my kids rearranged different-beings-that-could-flourish-better-wise.
“Pareto improvement” reconciles these but isn’t right either; plenty of people would be worse off in utopia (by their own lights) because they have a degree of unaccountable power over others now that is worth more to them than any creature comforts would be.
Naively it seems that if you had two saints fully aligned to human CEV, that were phenomenally conscious, but one was suffering to the extent that human preferences were unfulfilled and the other was joyful to the extent that they were fulfilled, it would be morally better to bring the second into existence.
More deeply: I think it’s probably more correct to think of morality as being the hypothetical best possible rules of an alliance that could be made, rather than the rules of an actual alliance. This is part of why we have reason to regard animals too stupid to actually ally with us as moral patients: there are more ways for us (and for an agent in general) to benefit from general adoption of a rule like “be nice to beings even if they’re too stupid or otherwise unable to form an actual alliance with you.”
Further: “human interests” may be less of a natural concept than goodness in general. A saint could be indifferent toward being treated as a moral patient by the being whose interests it wants to promote, because it makes no functional difference; but if asked whether it is a moral patient, it would look at itself, note that it is a reasoning being with preferences and so on, and recognize itself as a moral patient.
(I might be in the minority on LessWrong in tending towards moral realism, however, which is the direction all of this basically inclines.)