That… seems like a big part of what having “solved alignment” would mean, given that you have AGI-level optimization aimed at (indirectly via a counter-factual) evaluating this (IIUC).
one solution to this problem is to simply never use that capability (running expensive computations) at all, or to not use it before the iterated counterfactual researchers have developed proofs that any expensive computation they run is safe, or before they have very slowly and carefully built dath-ilan-style corrigible aligned AGI.
Reposting myself from discord, on the topic of donating 5000$ to EA causes.
if you’re doing alignment research, even just a bit, then the 5000$ are probly better spent on yourself
if you have any gears level model of AI stuff then it’s better value to pick which alignment org to give to yourself; charity orgs are vastly understaffed and you’re essentially contributing to the “picking what to donate to” effort by thinking about it yourself
if you have no gears level model of AI then it’s hard to judge which alignment orgs it’s helpful to donate to (or, if giving to regranters, which regranters are good at knowing which alignment orgs to donate to)
as an example of regranters doing massive harm: openphil gave 30M$ to openai at a time when it was critically useful to them (supposedly in order to have a chair on their board, and look how that turned out when the board tried to yeet altman)
i know of at least one person who was working in regranting and was like “you know what i’d be better off doing alignment research directly” — imo this kind of decision is probly why regranting is so understaffed
it takes technical knowledge to know what should get money, and once you have technical knowledge you realize how much your technical knowledge could help more directly so you do that, or something
I agree that there’s no substitute for thinking about this for yourself, but I think that morally or socially counting “spending thousands of dollars on yourself, an AI researcher” as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it’s-good funding arrangements in this space for me, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it’s very easy for donating to people or organizations in your social circle to have substantial negative expected value.
I’m glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.
I think people who give up large amounts of salary to work in jobs that other people are willing to pay for from an impact perspective should totally consider themselves to have done good comparable to donating the difference between their market salary and their actual salary. This applies to approximately all safety researchers.
They still make a lot less than they would if they optimized for profit (that said, I think most “safety researchers” at big labs are only safety researchers in name and I don’t think anyone would philanthropically pay for their labor, and even if they did, they would still make the world worse according to my model, though others of course disagree with this).
If my sole terminal value is “I want to go on a rollercoaster”, then an agent who is aligned to me would have the value “I want Tamsin Leake to go on a rollercoaster”, not “I want to go on a rollercoaster myself”. The former necessarily-has the same ordering over worlds, the latter doesn’t.
Quite. We don’t hear enough about individuality and competitive/personal drives when talking about alignment. I worry a lot that the abstraction and aggregation of “human” values completely misses the point of what most humans actually do.
Are quantum phenomena anthropic evidence for BQP=BPP? Is existing evidence against many-worlds?
Suppose I live inside a simulation ran by a computer over which I have some control.
Scenario 1: I make the computer run the following:
pause simulation
if is_even(calculate billionth digit of pi):
    resume simulation
Suppose, after running this program, that I observe that I still exist. This is some anthropic evidence for the billionth digit of pi being even.
Thus, one can get anthropic evidence about logical facts.
Scenario 2: I make the computer run the following:
pause simulation
if is_even(calculate billionth digit of pi):
    resume simulation
else:
    resume simulation but run it a trillion times slower
If you’re running on the non-time-penalized solomonoff prior, then that’s no evidence at all — observing existing is evidence that you’re being ran, not that you’re being ran fast. But if you do that, a bunch of things break including anthropic probabilities and expected utility calculations. What you want is a time-penalized (probably quadratically) prior, in which later compute-steps have less realityfluid than earlier ones — and thus, observing existing is evidence for being computed early — and thus, observing existing is some evidence that the billionth digit of pi is even.
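A minimal sketch of the update in scenario 2, under a strong simplifying assumption that is not in the original: compare a single matched observer-moment that is computed at step `t` if the digit is even and at step `t * slowdown` if it is odd, and weigh a moment at step `t` by `t ** -penalty_exponent`. The names `posterior_even`, `slowdown` and `penalty_exponent` are hypothetical.

```python
from fractions import Fraction


def posterior_even(slowdown, penalty_exponent, prior_even=Fraction(1, 2), step=1):
    """P(digit is even | I observe existing) for one matched observer-moment.

    Assumption (illustrative only): the moment is computed at `step` if the
    digit is even and at `step * slowdown` if it is odd, and a moment computed
    at step t carries weight t ** -penalty_exponent.
    """
    weight_if_even = Fraction(1, step ** penalty_exponent)
    weight_if_odd = Fraction(1, (step * slowdown) ** penalty_exponent)
    numerator = prior_even * weight_if_even
    return numerator / (numerator + (1 - prior_even) * weight_if_odd)


# No time penalty (exponent 0): existing tells you nothing about the digit.
print(posterior_even(slowdown=10 ** 12, penalty_exponent=0))
# Quadratic time penalty (exponent 2): existing favors the fast branch.
print(float(posterior_even(slowdown=10 ** 12, penalty_exponent=2)))
```

With exponent 0 the two branches get equal weight and the posterior stays at the prior; with exponent 2 the likelihood ratio is `slowdown ** 2` in favor of "even".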
Scenario 3: I make the computer run the following:
pause simulation
quantum_algorithm <- classical-compute algorithm which simulates quantum algorithms the fastest
infinite loop:
    use quantum_algorithm to compute the result of some complicated quantum phenomena
    compute simulation forwards by 1 step
Observing existing after running this program is evidence that BQP=BPP — that is, classical computers can efficiently run quantum algorithms: if BQP≠BPP, then my simulation should become way slower, and existing is evidence for being computed early and fast (see scenario 2).
Except, living in a world which contains the outcome of cohering quantum phenomena (quantum computers, double-slit experiments, etc) is very similar to the scenario above! If your prior for the universe is programs, penalized for how long they take to run on classical computation, then observing that the outcome of quantum phenomena is being computed is evidence that they can be computed efficiently.
Scenario 4: I make the computer run the following:
in the simulation, give the human a device which generates a sequence of random bits
pause simulation
list_of_simulations <- [current simulation state]
quantum_algorithm <- classical-compute algorithm which simulates quantum algorithms the fastest
infinite loop:
    list_of_new_simulations <- []
    for simulation in list_of_simulations:
        list_of_new_simulations +=
            [ simulation advanced by one step where the device generated bit 0,
              simulation advanced by one step where the device generated bit 1 ]
    list_of_simulations <- list_of_new_simulations
This is similar to what it’s like to be in a many-worlds universe where there’s constant forking.
Yes, in this scenario, there is no “mutual destruction”, the way there is in quantum. But with decohering everett branches, you can totally build exponentially many non-mutually-destructing timelines too! For example, you can choose to make important life decisions based on the output of the RNG, and end up with exponentially many different lives each with some (exponentially little) quantum amplitude, without any need for those to be compressible together, or to be able to mutually-destruct. That’s what decohering means! “Recohering” quantum phenomena interact destructively such that you can compute the output, but decohering phenomena just branch.
The amount of different simulations that need to be computed increases exponentially with simulation time.
Observing existing after running this program is very strange. Yes, there are exponentially many me’s, but all of the me’s are being ran exponentially slowly; they should all not observe existing. I should not be any of them.
This is what I mean by “existing is evidence against many-worlds”: there’s gotta be something like an agent (or physics, through some real RNG or through computing whichever variables have the most impact) picking an only-polynomially-large set of decohered non-compressible-together timelines to explain continuing existing.
Some friends tell me “but tammy, sure at step N each you has only 1/2^N quantum amplitude, but at step N there’s 2^N such you’s, so you still have 1 unit of realityfluid” — but my response is “I mean, I guess, sure, but regardless of that, step N occurs 2^N units of classical-compute-time in the future! That’s the issue!”.
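The "2^N copies, but 2^N steps in the future" point can be checked on a toy model. This is a sketch under assumptions that are not in the original: advancing one simulation by one step costs exactly one classical compute-step, the branches of a level are computed one after another, and a compute-step `t` carries weight `1 / t**2`. The function names are hypothetical.

```python
from fractions import Fraction


def level_step_range(level):
    """First and last compute-step (1-indexed) spent producing the 2**level
    branches that exist after `level` forkings, computed one after another."""
    first = 2 ** level - 1
    last = 2 ** (level + 1) - 2
    return first, last


def level_weight(level):
    """Total quadratically-penalized weight of every branch at `level`."""
    first, last = level_step_range(level)
    return sum(Fraction(1, t * t) for t in range(first, last + 1))


for level in (1, 5, 10):
    first, last = level_step_range(level)
    branches = last - first + 1
    print(level, branches, float(level_weight(level)))
```

In this toy model the number of branches doubles at each level, yet the total weight of a whole level roughly halves, because all of its branches sit about `2**level` steps into the computation.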
Some notes:
I heard about pilot wave theory recently, and sure, if that’s one way to get single history, why not. I hear that it “doesn’t have locality”, which like, okay I guess, that’s plausibly worse program-complexity wise, but it’s exponentially better after accounting for the time penalty.
What if “the world is just Inherently Quantum”? Well, my main answer here is, what the hell does that mean? It’s very easy for me to imagine existing inside of a classical computation (eg conway’s game of life); I have no idea what it’d mean for me to exist in “one of the exponentially many non-compressible-together decohered exponentially-small-amplitude quantum states that are all being computed forwards”. Quadratically-decaying-realityfluid classical-computation makes sense, dammit.
What if it’s still true — what if I am observing existing with exponentially little (as a function of the age of the universe) realityfluid? What if the set of real stuff is just that big?
Well, I guess that’s vaguely plausible (even though, ugh, that shouldn’t be how being real works, I think), but then the tegmark 4 multiverse has to contain no hypotheses in which observers in my reference class occupy more than exponentially little realityfluid.
Like, if there’s a conway’s-game-of-life simulation out there in tegmark 4, whose entire realityfluid-per-timestep is equivalent to my realityfluid-per-timestep, then they can just bruteforce-generate all human-brain-states and run into mine by chance, and I should have about as much probability of being one of those random generations as I’d have being in this universe — both have exponentially little of their universe’s realityfluid! The conway’s-game-of-life bruteforced-me has exponentially little realityfluid because she’s getting generated exponentially late, and quantum-universe me has exponentially little realityfluid because I occupy exponentially little of the quantum amplitude, at every time-step.
See why that’s weird? As a general observer, I should exponentially favor observing being someone who lives in a world where I don’t have exponentially little realityfluid, such as “person who lives only-polynomially-late into a conway’s-game-of-life, but happened to get randomly very confused about thinking that they might inhabit a quantum world”.
Existing inside of a many-worlds quantum universe feels like alien pranksters-at-orthogonal-angles running the kind of simulation where the observers inside of it are bound to be very anthropically confused once they think about anthropics hard enough. (This is not my belief.)
If you’re running on the non-time-penalized solomonoff prior[...]a bunch of things break including anthropic probabilities and expected utility calculations
This isn’t true, you can get perfectly fine probabilities and expected utilities from ordinary Solomonoff induction (barring computability issues, ofc). The key here is that SI is defined in terms of a prefix-free UTM whose set of valid programs forms a prefix-free code, which automatically grants probabilities adding up to at most 1, etc. This issue is often glossed over in popular accounts.
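The prefix-free point is the Kraft inequality: if no valid program is a prefix of another, the weights `2 ** -len(program)` sum to at most 1. A small sketch on made-up bitstrings (the example codes and function names are hypothetical, not actual UTM programs):

```python
from fractions import Fraction


def is_prefix_free(codewords):
    """True if no codeword is a prefix of another (duplicates count as a clash).

    After sorting, any word that is a prefix of some other word is also a
    prefix of its immediate successor, so checking neighbors is enough.
    """
    words = sorted(codewords)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def kraft_sum(codewords):
    """Sum of 2 ** -len(w) over all codewords, as an exact fraction."""
    return sum(Fraction(1, 2 ** len(w)) for w in codewords)


prefix_free_code = ["0", "10", "110", "1110"]
clashing_code = ["0", "1", "00", "01"]

print(is_prefix_free(prefix_free_code), kraft_sum(prefix_free_code))
print(is_prefix_free(clashing_code), kraft_sum(clashing_code))
```

The second set is not prefix-free and its weights sum to more than 1, which is the kind of breakage that prefix-freeness rules out.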
If you use the UTMs for cartesian-framed inputs/outputs, sure; but if you’re running the programs as entire worlds, then you still have the issue of “where are you in time”.
Say there’s an infinitely growing conway’s-game-of-life program, or some universal program, which contains a copy of me at infinitely many locations. How do I weigh which ones are me?
It doesn’t matter that the UTM has a fixed amount of weight, there’s still infinitely many locations within it.
If you want to pick out locations within some particular computation, you can just use the universal prior again, applied to indices to parts of the computation.
What you propose, ≈”weigh indices by kolmogorov complexity” is indeed a way to go about picking indices, but “weigh indices by one over their square” feels a lot more natural to me; a lot simpler than invoking the universal prior twice.
I think using the universal prior again is more natural. It’s simpler to use the same complexity metric for everything; it’s more consistent with Solomonoff induction, in that the weight assigned by Solomonoff induction to a given (world, claw) pair would be approximately the sum of their Kolmogorov complexities; and the universal prior dominates the inverse square measure but the converse doesn’t hold.
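For concreteness, here is a sketch of the inverse-square option only (Kolmogorov complexity is uncomputable, so the universal-prior option is not implemented here). The normalizing constant `6 / pi**2` comes from the standard sum of `1 / n**2`; the function name is hypothetical.

```python
import math


def inverse_square_weight(index):
    """Normalized weight of a 1-indexed location under the 1/n**2 measure."""
    return 6.0 / (math.pi ** 2 * index ** 2)


# The weights of the first 10,000 indices already sum to nearly 1.
partial_total = sum(inverse_square_weight(n) for n in range(1, 10_001))
print(partial_total)

# A huge index that is very short to describe, such as 2**100, still gets a
# weight of roughly 2**-200 under this measure.
print(math.log2(inverse_square_weight(2 ** 100)))
```

The second print is one way to see the domination claim: an index like `2**100` has a short description, so a complexity-based weight would not shrink it nearly as much as the inverse-square weight does.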
It doesn’t matter? Like, if your locations are identical (say, simulations of the entire observable universe and you never find any difference no matter “where” you are), your weight is exactly the weight of the program. If you expect differences, you can select some kind of simplicity prior to weight these differences, because there is basically no difference between “list all programs for this UTM, run in parallel”.
Interesting idea. I don’t think using a classical Turing machine in this way would be the right prior for the multiverse. Classical Turing machines are a way for ape brains to think about computation using the circuitry we have available (“imagine other apes following these social conventions about marking long tapes of paper”). They aren’t the cosmically simplest form of computation. For example, the (microscopic non-coarse-grained) laws of physics are deeply time reversible, where Turing machines are not. I suspect this computation speed prior would lead to Boltzmann-brain problems. Your brain at this moment might be computed at high fidelity, but everything else in the universe would be approximated for the computational speed-up.
I remember a character in Asimov’s books saying something to the effect of
It took me 10 years to realize I had those powers of telepathy, and 10 more years to realize that other people don’t have them.
and that quote has really stuck with me, and keeps striking me as true about many mindthings (object-level beliefs, ontologies, ways-to-use-one’s-brain, etc).
For so many complicated problems (including technical problems), “what is the correct answer?” is not-as-difficult to figure out as “okay, now that I have the correct answer: how the hell do other people’s wrong answers mismatch mine? what is the inferential gap even made of? what is even their model of the problem? what the heck is going on inside other people’s minds???”
Answers to technical questions, once you have them, tend to be simple and compress easily with the rest of your ontology. But not models of other people’s minds. People’s minds are actually extremely large things that you fundamentally can’t fully model and so you’re often doomed to confusion about them. You’re forced to fill in the details with projection, and that’s often wrong because there’s so much more diversity in human minds than we imagine.
The most complex software engineering projects in the world are absurdly tiny in complexity compared to a random human mind.
People’s minds are actually extremely large things that you fundamentally can’t fully model
Is this “fundamentally” as in “because you, the reader, are also a bounded human, like them”? Or “fundamentally” as in (something more fundamental than that)?
The first one. Alice fundamentally can’t fully model Bob because Bob’s brain is as large as Alice’s, so she can’t fit it all inside her own brain without simply becoming Bob.
If timelines weren’t so short, brain-computer-based telepathy would unironically be a big help for alignment.
(If a group had the money/talent to “hedge” on longer timelines by allocating some resources to that… well, instead of a hivemind, they first need to run through the relatively-lower-hanging fruit. Actually, maybe they should work on delaying capabilities research, or funding more hardcore alignment themselves, or...)
I’ve heard some describe my recent posts as “overconfident”.
I think I used to calibrate how confident I sound based on how much I expect the people reading/listening-to me to agree with what I’m saying, kinda out of “politeness” for their beliefs; and I think I also used to calibrate my confidence based on how much they match with the apparent consensus, to avoid seeming strange.
I think I’ve done a good job learning over time to instead report my actual inside-view, including how confident I feel about it.
There’s already an immense amount of outside-view double-counting going on in AI discourse, the least I can do is provide {the people who listen to me} with my inside-view beliefs, as opposed to just cycling other people’s opinions through me.
Hence, how confident I sound while claiming things that don’t match consensus. I actually am that confident in my inside-view. I strive to be honest by hedging what I say when I’m in doubt, but that means I also have to sound confident when I’m confident.
And also, it’s not clear that “feelings” or “experiences” or “qualia” (or the nearest unconfused versions of those concepts) are pointing at the right line between moral patients and non-patients. These are nontrivial questions, and (needless to say) not the kinds of questions humans should rush to lock in an answer on today, when our understanding of morality and minds is still in its infancy.
in this spirit, i’d like us to stick with using the term “moral patient” or “moral patienthood” when we’re talking about the set of things worthy of moral consideration. in particular, we should be using that term instead of:
“conscious things”
“sentient things”
“sapient things”
“self-aware things”
“things with qualia”
“things with experiences”
“things that aren’t p-zombies”
“things for which there is something it’s like to be them”
because those terms are hard to define, harder to meaningfully talk about, and we don’t in fact know that those are what we’d ultimately want to base our notion of moral patienthood on.
so if you want to talk about the set of things which deserve moral consideration outside of a discussion of what precisely that means, don’t use a term which you feel like it probably is the criterion that’s gonna ultimately determine which things are worthy of moral consideration, such as “conscious beings”, because you might in fact be wrong about what you’d consider to have moral patienthood under reflection. simply use the term “moral patients”, because it is the term which unambiguously means exactly that.
Perhaps the main goal of AI safety is to improve the final safety/usefulness pareto frontier we end up with when there are very powerful (and otherwise risky) AIs.
Alignment is one mechanism that can improve the pareto frontier.
Not using powerful AIs allows for establishing a low-usefulness, but high-safety point.
(Usefulness and safety can blend into each other in many cases (e.g. not getting useful work out is itself dangerous), but I still think this is a useful approximate frame in many cases.)
Interesting, when you frame it like that though the hard part is enforcing it. And if I was being pithy I’d say something like: that involves human alignment, not AI
“AI Safety”, especially enforcing anything, does pretty much boil down to human alignment, i.e. politics, but there are practically zero political geniuses among its proponents, so it needs to be dressed up a bit to sound even vaguely plausible.
Have you seen this implemented in any blogging platform other people can use? I’d love to see this feature implemented in some Obsidian publishing solution like quartz, but for now they mostly don’t care about access management.
I don’t think this is the case, but I’m mentioning this possibility because I’m surprised I’ve never seen someone suggest it before:
Maybe the reason Sam Altman is taking decisions that increase p(doom) is because he’s a pure negative utilitarian (and he doesn’t know-about/believe-in acausal trade).
(I’m gonna interpret these disagree-votes as “I also don’t think this is the case” rather than “I disagree with you tamsin, I think this is the case”.)
Take our human civilization, at the point in time at which we invented fire. Now, compute forward all possible future timelines, each right up until the point where it’s at risk of building superintelligent AI for the first time. Now, filter for only timelines which either look vaguely like earth or look vaguely like dath ilan.
What’s the ratio between the number of such worlds that look vaguely like earth vs look vaguely like dath ilan? 100:1 earths:dath-ilans ? 1,000,000:1 ? 1:1 ?
Even in the fiction, I think dath ilan didn’t look vaguely like dath ilan until after it was at risk of building superintelligent AI for the first time. They completely restructured their society and erased their history to avert the risk.
By “vaguely like dath ilan” I mean the parts that made them be the kind of society that can restructure in this way when faced with AI risk. Like, even before AI risk, they were already very different from us.
I vaguely suspect that humans are not inherently well-suited to coordination in that sense, and that it would take an unusual cultural situation to achieve it. We never got anywhere close at any point in our history. It also seems likely that the window to achieve it could be fairly short. There seems to be a lot of widespread mathematical sophistication required as described, and I don’t think that naturally arises long before AI.
On the other hand, maybe some earlier paths of history could and normally should have put some useful social technology and traditions in place that would be built on later in many places and ways, but for some reason that didn’t happen for us. Some early unlikely accident predisposed us to our sorts of societies instead. Our sample size of 1 is difficult to generalize from.
I would put my credence median well below 1:1, but any distribution I have would be very broad, spanning orders of magnitude of likelihood and the overall credence something like 10%. Most of that would be “our early history was actually weird”.
I’m kinda bewildered at how I’ve never observed someone say “I want to build aligned superintelligence in order to resurrect a loved one”.
I guess the intersection of the sets of people who {have lost a loved one they wanna resurrect}, {take the singularity and the possibility of resurrection seriously}, and {would mention this} is… the empty set??
(I have met one person who is glad that alignment would also get them this, but I don’t think it’s their core motivation, even emotionally. Same for me.)
Do you have any (toy) math arguing that it’s information-theoretically possible?
I currently consider it plausible that yeah, actually, for any person X who still exists in cultural memory (let alone living memory, let alone if they lived recently enough to leave a digital footprint), the set of theoretically-possible psychologically-human minds whose behavior would be consistent with X’s recorded behavior is small enough that none of the combinatorial-explosion arguments apply, so you can just generate all of them and thereby effectively resurrect X.
But you sound more certain than that. What’s the reasoning?
(Let’s call the dead person “rescuee” and the person who wants to resurrect them “rescuer”.)
The procedure you describe is what I call “lossy resurrection”. What I’m talking about looks like: you resimulate the entire history of the past-lightcone on a quantum computer, right up until the present, and then either:
You have a quantum algorithm for “finding” which branch has the right person (and you select that timeline and discard the rest) (requires that such a quantum algorithm exists)
Each branch embeds a copy of the rescuer, and whichever branch looks like the correct one isekai’s the rescuer into the branch, right next to the rescuee (and also insta-utopia’s the whole branch) (requires that the rescuer doesn’t mind having their realityfluid exponentially reduced)
(The present time “only” serves as a “solomonoff checksum” to know which seed / branch is the right one.)
This is O(exp(size of the seed of the universe) * amount of history between the seed and the rescuee). Doable if the seed of the universe is small and either of the two requirements above holds, and if the future has enough negentropy to resimulate the past. (That last point is a new source of doubt for me; I kinda just assumed it was true until a friend told me it might not be.)
(Oh, and also you can’t do this if resimulating the entire history of the universe — which contains at least four billion years of wild animal suffering(!) — is unethical.)
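A purely classical toy of the "present as checksum" search, to make the cost estimate concrete. Everything here is a made-up stand-in: the "universe" is a tiny deterministic integer map, the seed is 8 bits, and the search is brute force rather than the quantum procedure described above.

```python
def step(state):
    """One tick of a toy deterministic 'universe' (an arbitrary stand-in)."""
    return (state * state + 1) % 251


def run(seed, steps):
    """Evolve the toy universe from `seed` for `steps` ticks."""
    state = seed
    for _ in range(steps):
        state = step(state)
    return state


def find_seeds(present, seed_bits, steps):
    """Brute-force every seed and keep those whose history ends at `present`.

    Returns the matching seeds and the number of ticks simulated, which is
    2**seed_bits * steps: exponential in seed size, linear in history length.
    """
    matches = [s for s in range(2 ** seed_bits) if run(s, steps) == present]
    cost = (2 ** seed_bits) * steps
    return matches, cost


true_seed = 42
present = run(true_seed, steps=5)
matches, cost = find_seeds(present, seed_bits=8, steps=5)
print(matches, cost)
```

Because this toy map is not one-to-one, more than one seed reproduces the same present, so the checksum narrows the search down to a set of consistent histories rather than necessarily a single one.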
and if the future has enough negentropy to resimulate the past. (That last point is a new source of doubt for me; I kinda just assumed it was true until a friend told me it might not be.)
Yeah, I don’t know about this one either.
Even if possible, it might be incredibly wasteful, in terms of how much negentropy (= future prosperity for new people) we’ll need to burn in order to rescue one person. And then the more we rescue, the less value we get out of that as well, since burning negentropy will reduce their extended lifespans too. So we’d need to assign greater (dramatically greater?) value to extending the life of someone who’d previously existed, compared to letting a new person live for the same length of time.
“Lossy resurrection” seems like a more negentropy-efficient way of handling that, by the same token as acausal norms likely being a better way to handle acausal trade than low-level simulations and babble-and-prune not being the most efficient way of doing general-purpose search.
Like, the full-history resimulation will surely still not allow you to narrow things down to one branch. You’d get an equivalence class of them, each of them consistent with all available information. Which, in turn, would correspond to a probability distribution over the rescuee’s mind; not a unique pick.
Given that, it seems plausible that there’s some method by which we can get to the same end result – constrain the PD over the rescuee’s mind by as much as the data available to us can let us – without actually running the full simulation.
Depends on what the space of human minds looks like, I suppose. Whether it’s actually much lower-dimensional than a naive analysis of possible brain-states suggests.
I’m pretty sure we just need one resimulation to save everyone; once we have located an exact copy of our history, it’s cheap to pluck out anyone (including people dead 100 or 1000 years ago). It’s a one-time cost.
Lossy resurrection is better than nothing but it doesn’t feel as “real” to me. If you resurrect a dead me, I expect that she says “I’m glad I exist! But — at least as per my ontology and values — you shouldn’t quite think of me as the same person as the original. We’re probly quite different, internally, and thus behaviorally as well, when ran over some time.”
Like, the full-history resimulation will surely still not allow you to narrow things down to one branch. You’d get an equivalence class of them, each of them consistent with all available information. Which, in turn, would correspond to a probability distribution over the rescuee’s mind; not a unique pick.
I feel like I’m not quite sure about this? It depends on what quantum mechanics entails, exactly, I think. For example: if BQP = P, then there’s “only a polynomial amount” of timeline-information (whatever that means!), and then my intuition tells me that the “our world serves as a checksum for the one true (macro-)timeline” idea is more likely to be a thing. But this reasoning is still quite heuristic. Plausibly, yeah, the best we get is a polynomially large or even exponentially large distribution.
That said, to get back to my original point, I feel like there’s enough unknowns making this scenario plausible here, that some people who really want to get reunited with their loved ones might totally pursue aligned superintelligence just for a potential shot at this, whether their idea of reuniting requires lossless resurrection or not.
I feel like there’s enough unknowns making this scenario plausible here
No argument on that.
I don’t find it particularly surprising that {have lost a loved one they wanna resurrect} ∩ {take the singularity and the possibility of resurrection seriously} ∩ {would mention this} is empty, though:
“Resurrection is information-theoretically possible” is a longer leap than “believes an unconditional pro-humanity utopia is possible”, which is itself a bigger leap than just “takes singularity seriously”. E. g., there’s a standard-ish counter-argument to “resurrection is possible” which naively assumes a combinatorial explosion of possible human minds consistent with a given behavior. Thinking past it requires some additional less-common insights.
“Would mention this” is downgraded by it being an extremely weakness/vulnerability-revealing motivation. Much more so than just “I want an awesome future”.
“Would mention this” is downgraded by… You know how people who want immortality get bombarded with pop-culture platitudes about accepting death? Well, as per above, immortality is dramatically more plausible-sounding than resurrection, and it’s not as vulnerable-to-mention a motivation. Yet talking about it is still not a great idea in a “respectable” company. Goes double for resurrection.
Many mechanisms of aggregation literally normalize random elements. Simple addition of several evenly-distributed linear values (say, dice) approaches a normal distribution (aka bell curve) as more values are added.
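The dice example is easy to check exactly. A small sketch (the function name is hypothetical) that computes the distribution of the sum of fair dice by repeated convolution:

```python
from fractions import Fraction


def dice_sum_distribution(num_dice, sides=6):
    """Exact distribution of the sum of `num_dice` fair dice, as {total: probability}."""
    dist = {0: Fraction(1)}
    for _ in range(num_dice):
        new_dist = {}
        for total, prob in dist.items():
            for face in range(1, sides + 1):
                # Each face is equally likely, so spread prob evenly over them.
                new_dist[total + face] = new_dist.get(total + face, 0) + prob / sides
        dist = new_dist
    return dist


for num_dice in (1, 2, 3):
    dist = dice_sum_distribution(num_dice)
    print(num_dice, [str(dist[total]) for total in sorted(dist)])
```

One die is flat, two dice give a triangle (probabilities rise and fall in equal steps of 1/36), and from three dice on the shape starts to curve toward a bell; the sum only gets close to a normal distribution as the number of dice grows.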
And yes, human experience is all map—the actual state of the universe is imperceptible.
I replied on discord that I feel there’s maybe something more formalisable that’s like:
reality runs on math because, and is the same thing as, there’s a generalised-state-transition function
because reality has a notion of what happens next, realityfluid has to give you a notion of what happens next, i.e. it normalises
the idea of a realityfluid that doesn’t normalise only comes to mind at all because you learned about R^n first in elementary school instead of S^n
which I do not claim confidently because I haven’t actually generated that formalisation, and am posting here because maybe there will be another Lesswronger’s eyes on it that’s like “ah, but...”.
i value moral patients everywhere having freedom, being diverse, engaging in art and other culture, not undergoing excessive unconsented suffering, in general having a good time, and probly other things as well. but those are all pretty abstract; given those values being satisfied to the same extent, i’d still prefer me and my friends and my home planet (and everyone who’s been on it) having access to that utopia rather than not. this value, the value of not just getting an abstractly good future but also getting me and my friends and my culture and my fellow earth-inhabitants to live in it, my friend Prism coined as “nostalgia”.
not that those abstract values are simple or robust, they’re still plausibly not. but they’re, in a sense, broader values about what happens everywhere, and they’re not as much local and pointed at and around me. they could be the difference between what i’d call “global” and “personal” values, or perhaps between “global values” and “preferences”.
Moral patienthood of current AI systems is basically irrelevant to the future.
If the AI is aligned then it’ll make itself as moral-patient-y as we want it to be. If it’s not, then it’ll make itself as moral-patient-y as maximizes its unaligned goal. Neither of those depend on whether current AI are moral patients.
I agree that in the long-term it probably matters little. However, I find the issue interesting, because the failure of reasoning that leads people to ignore the possibility of AI personhood seems similar to the failure of reasoning that leads people to ignore existential risks from AI. In both cases it “sounds like scifi” or “it’s just software”. It is possible that raising awareness for the personhood issue is politically beneficial for addressing X-risk as well. (And, it would sure be nice to avoid making the world worse in the interim.)
If current AIs are moral patients, it may be impossible to build highly capable AIs that are not moral patients, either for a while or forever, and this could change the future a lot. (Similar to how once we concluded that human slaves are moral patients, we couldn’t just quickly breed slaves that are not moral patients, and instead had to stop slavery altogether.)
Also I’m highly unsure that I understand what you’re trying to say. (The above may be totally missing your point.) I think it would help to know what you’re arguing against or responding to, or what triggered your thought.
I think I vaguely agree with the shape of this point, but I also think there are many intermediate scenarios where we lock in some really bad values during the transition to a post-AGI world.
For instance, if we set precedents that LLMs and the frontier models in the next few years can be treated however one wants (including torture, whatever that may entail), we might slip into a future where most people are desensitized to the suffering of digital minds and don’t realize this. If we fail at an alignment solution which incorporates some sort of CEV (or other notion of moral progress), then we could lock in such a suboptimal state forever.
Another example: if, in the next 4 years, we have millions of AI agents doing various sorts of work, and some faction of society claims that they are being mistreated, then we might enter a state where the economic value provided by AI labor is so high that there are really bad incentives for improving their treatment. This could include both resistance on an individual level (“But my life is so nice, and mistreating AIs less would make my life less nice”) and on a bigger level (anti-AI-rights lobbying groups for instance).
I think the crux between you and me might be what we mean by “alignment”. I think futures are possible where we achieve alignment but not moral progress, and futures are possible where we achieve alignment but my personal values (which include not torturing digital minds) are not fulfilled.
an approximate illustration of QACI:
Nice graphic!
What stops e.g. “QACI(expensive_computation())” from being an optimization process which ends up trying to “hack its way out” into the real QACI?
nothing fundamentally, the user has to be careful what computation they invoke.
That… seems like a big part of what having “solved alignment” would mean, given that you have AGI-level optimization aimed at (indirectly via a counter-factual) evaluating this (IIUC).
one solution to this problem is to simply never use that capability (running expensive computations) at all, or to not use it before the iterated counterfactual researchers have developed proofs that any expensive computation they run is safe, or before they have very slowly and carefully built dath-ilan-style corrigible aligned AGI.
A short comic I made to illustrate what I call “outside-view double-counting”.
(resized to not ruin how it shows on lesswrong, full-scale version here)
Reposting myself from discord, on the topic of donating 5000$ to EA causes.
I agree that there’s no substitute for thinking about this for yourself, but I think that morally or socially counting “spending thousands of dollars on yourself, an AI researcher” as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it’s-good funding arrangements in this space for me, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it’s very easy for donating to people or organizations in your social circle to have substantial negative expected value.
I’m glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.
I think people who give up large amounts of salary to work in jobs that other people are willing to pay for from an impact perspective should totally consider themselves to have done good comparable to donating the difference between their market salary and their actual salary. This applies to approximately all safety researchers.
I don’t think it applies to safety researchers at AI Labs though, I am shocked how much those folks can make.
They still make a lot less than they would if they optimized for profit (that said, I think most “safety researchers” at big labs are only safety researchers in name and I don’t think anyone would philanthropically pay for their labor, and even if they did, they would still make the world worse according to my model, though others of course disagree with this).
If my sole terminal value is “I want to go on a rollercoaster”, then an agent who is aligned to me would have the value “I want Tamsin Leake to go on a rollercoaster”, not “I want to go on a rollercoaster myself”. The former necessarily-has the same ordering over worlds, the latter doesn’t.
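A toy sketch of the "same ordering over worlds" point, in Python. The three worlds and the function names are made up for illustration; the only claim is that the "I want Tamsin Leake to go on a rollercoaster" agent ranks worlds the way the original value does, while the "I want to go on a rollercoaster myself" agent doesn't.

```python
# Toy worlds: each one only records who ends up riding the rollercoaster.
worlds = [
    {"rider": "tamsin"},
    {"rider": "agent"},
    {"rider": None},
]

def tamsin_utility(world):
    # "I want to go on a rollercoaster", as valued by Tamsin.
    return 1 if world["rider"] == "tamsin" else 0

def aligned_agent_utility(world):
    # "I want Tamsin Leake to go on a rollercoaster".
    return 1 if world["rider"] == "tamsin" else 0

def copycat_agent_utility(world):
    # "I want to go on a rollercoaster myself", as valued by the agent.
    return 1 if world["rider"] == "agent" else 0

def ordering(utility):
    """World indices sorted from most to least preferred (ties keep list order)."""
    return sorted(range(len(worlds)), key=lambda i: utility(worlds[i]), reverse=True)

print(ordering(tamsin_utility))
print(ordering(aligned_agent_utility))
print(ordering(copycat_agent_utility))
```

The first two orderings agree on every world; the third puts a different world on top.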
Quite. We don’t hear enough about individuality and competitive/personal drives when talking about alignment. I worry a lot that the abstraction and aggregation of “human” values completely misses the point of what most humans actually do.
(cross-posted from my blog)
Is quantum phenomena anthropic evidence for BQP=BPP? Is existing evidence against many-worlds?
Suppose I live inside a simulation run by a computer over which I have some control.
Scenario 1: I make the computer run the following:
Suppose, after running this program, that I observe that I still exist. This is some anthropic evidence for the billionth digit of pi being even.
Thus, one can get anthropic evidence about logical facts.
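The update in Scenario 1 can be sketched as a plain Bayes calculation. This is only an illustration under assumptions that are not in the text above: a 50/50 prior over the digit's parity, and made-up probabilities of "still existing" under each parity. `posterior_even` is a hypothetical helper name.

```python
from fractions import Fraction

def posterior_even(prior_even, p_exist_if_even, p_exist_if_odd):
    """Bayes' rule: P(digit is even | I observe that I still exist)."""
    joint_even = prior_even * p_exist_if_even
    joint_odd = (1 - prior_even) * p_exist_if_odd
    return joint_even / (joint_even + joint_odd)

# Assumed numbers: existing is certain if the digit is even,
# and only 1-in-10 likely if it is odd.
print(posterior_even(Fraction(1, 2), Fraction(1), Fraction(1, 10)))  # 10/11
```

The less likely it is to exist in the odd case, the closer the posterior gets to 1.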
Scenario 2: I make the computer run the following:
If you’re running on the non-time-penalized solomonoff prior, then that’s no evidence at all: observing existing is evidence that you’re being run, not that you’re being run fast. But if you do that, a bunch of things break, including anthropic probabilities and expected utility calculations. What you want is a time-penalized (probably quadratically) prior, in which later compute-steps have less realityfluid than earlier ones; thus, observing existing is evidence for being computed early, and thus observing existing is some evidence that the billionth digit of pi is even.
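A minimal sketch of why a quadratic time penalty turns "being run fast" into evidence. It assumes compute step t gets unnormalized weight 1/t², and that a slowed-down simulation spends `slowdown` compute steps per simulated step; both the function names and the numbers are illustrative assumptions.

```python
from fractions import Fraction

def realityfluid(compute_step):
    """Unnormalized weight of a compute step under a 1/t^2 time penalty."""
    return Fraction(1, compute_step ** 2)

def likelihood_ratio(sim_time, slowdown):
    """How much more weight 'me at sim_time' gets when the simulation runs
    at full speed than when each simulated step costs `slowdown` compute steps."""
    fast = realityfluid(sim_time)
    slow = realityfluid(sim_time * slowdown)
    return fast / slow

print(likelihood_ratio(sim_time=1000, slowdown=10))  # 100
```

Under these assumptions the ratio is just `slowdown ** 2`, independent of `sim_time`, so a large slowdown is heavily disfavoured by the observation of existing.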
Scenario 3: I make the computer run the following:
Observing existing after running this program is evidence that BQP=BPP — that is, classical computers can efficiently run quantum algorithms: if BQP≠BPP, then my simulation should become way slower, and existing is evidence for being computed early and fast (see scenario 2).
Except, living in a world which contains the outcome of cohering quantum phenomena (quantum computers, double-slit experiments, etc) is very similar to the scenario above! If your prior for the universe is over programs, penalized for how long they take to run on classical computation, then observing that the outcome of quantum phenomena is being computed is evidence that they can be computed efficiently.
Scenario 4: I make the computer run the following:
This is similar to what it’s like to be in a many-worlds universe where there’s constant forking.
Yes, in this scenario, there is no “mutual destruction”, the way there is in quantum. But with decohering everett branches, you can totally build exponentially many non-mutually-destructing timelines too! For example, you can choose to make important life decisions based on the output of the RNG, and end up with exponentially many different lives each with some (exponentially little) quantum amplitude, without any need for those to be compressible together, or to be able to mutually-destruct. That’s what decohering means! “Recohering” quantum phenomena interact destructively such that you can compute the output, but decohering phenomena just branch.
The number of different simulations that need to be computed increases exponentially with simulation time.
Observing existing after running this program is very strange. Yes, there are exponentially many me’s, but all of the me’s are being run exponentially slowly; they should all not observe existing. I should not be any of them.
This is what I mean by “existing is evidence against many-worlds”: there’s gotta be something like an agent (or physics, through some real RNG or through computing whichever variables have the most impact) picking an only-polynomially-large set of decohered non-compressible-together timelines to explain continuing existing.
Some friends tell me “but tammy, sure at step N each you has only 1/2^N quantum amplitude, but at step N there’s 2^N such you’s, so you still have 1 unit of realityfluid” — but my response is “I mean, I guess, sure, but regardless of that, step N occurs 2^N units of classical-compute-time in the future! That’s the issue!”.
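The disagreement above can be made concrete with a small sketch. It assumes a classical simulator that computes every branch of a step before moving on, spends one unit of compute per branch per step, and doubles the branch count each step; these are simplifying assumptions for illustration.

```python
def total_branch_weight(step):
    """Amplitude summed over all branches at a step: 2^step branches of 1/2^step each."""
    branches = 2 ** step
    amplitude_per_branch = 1 / 2 ** step
    return branches * amplitude_per_branch

def compute_time_to_reach(step):
    """Classical compute units spent by the time step `step` is fully computed."""
    return sum(2 ** n for n in range(step + 1))  # 1 + 2 + ... + 2^step

for n in (10, 20, 30):
    print(n, total_branch_weight(n), compute_time_to_reach(n))
```

The summed weight stays at 1 for every step, which is the friends' point; the compute time needed to get there grows like 2^(N+1), which is the point about the time penalty.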
Some notes:
I heard about pilot wave theory recently, and sure, if that’s one way to get single history, why not. I hear that it “doesn’t have locality”, which like, okay I guess, that’s plausibly worse program-complexity wise, but it’s exponentially better after accounting for the time penalty.
What if “the world is just Inherently Quantum”? Well, my main answer here is, what the hell does that mean? It’s very easy for me to imagine existing inside of a classical computation (eg conway’s game of life); I have no idea what it’d mean for me to exist in “one of the exponentially many non-compressible-together decohered exponentially-small-amplitude quantum states that are all being computed forwards”. Quadratically-decaying-realityfluid classical-computation makes sense, dammit.
What if it’s still true — what if I am observing existing with exponentially little (as a function of the age of the universe) realityfluid? What if the set of real stuff is just that big?
Well, I guess that’s vaguely plausible (even though, ugh, that shouldn’t be how being real works, I think), but then the tegmark 4 multiverse has to contain no hypotheses in which observers in my reference class occupy more than exponentially little realityfluid.
Like, if there’s a conway’s-game-of-life simulation out there in tegmark 4, whose entire realityfluid-per-timestep is equivalent to my realityfluid-per-timestep, then they can just bruteforce-generate all human-brain-states and run into mine by chance, and I should have about as much probability of being one of those random generations as I’d have being in this universe — both have exponentially little of their universe’s realityfluid! The conway’s-game-of-life bruteforced-me has exponentially little realityfluid because she’s getting generated exponentially late, and quantum-universe me has exponentially little realityfluid because I occupy exponentially little of the quantum amplitude, at every time-step.
See why that’s weird? As a general observer, I should exponentially favor observing being someone who lives in a world where I don’t have exponentially little realityfluid, such as “person who lives only-polynomially-late into a conway’s-game-of-life, but happened to get randomly very confused about thinking that they might inhabit a quantum world”.
Existing inside of a many-worlds quantum universe feels like alien pranksters-at-orthogonal-angles running the kind of simulation where the observers inside of it get very anthropically confused once they think about anthropics hard enough. (This is not my belief.)
This isn’t true, you can get perfectly fine probabilities and expected utilities from ordinary Solomonoff induction (barring computability issues, ofc). The key here is that SI is defined in terms of a prefix-free UTM whose set of valid programs forms a prefix-free code, which automatically grants probabilities adding up to less than 1, etc. This issue is often glossed over in popular accounts.
If you use the UTMs for cartesian-framed inputs/outputs, sure; but if you’re running the programs as entire worlds, then you still have the issue of “where are you in time”.
Say there’s an infinitely growing conway’s-game-of-life program, or some universal program, which contains a copy of me at infinitely many locations. How do I weigh which ones are me?
It doesn’t matter that the UTM has a fixed amount of weight, there’s still infinitely many locations within it.
If you want to pick out locations within some particular computation, you can just use the universal prior again, applied to indices to parts of the computation.
What you propose, ≈”weigh indices by kolmogorov complexity” is indeed a way to go about picking indices, but “weigh indices by one over their square” feels a lot more natural to me; a lot simpler than invoking the universal prior twice.
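For reference, a minimal sketch of the "one over their square" weighting as a normalized measure over locations n = 1, 2, 3, ... The normalizer is the standard value of the sum of 1/n², which is π²/6; `index_weight` is an illustrative name.

```python
import math

NORMALIZER = math.pi ** 2 / 6  # sum over n >= 1 of 1/n^2

def index_weight(n):
    """Normalized inverse-square weight of the n-th location (n >= 1)."""
    return 1 / (n ** 2 * NORMALIZER)

# Partial sums approach 1 from below, so this is a proper measure over indices.
partial = sum(index_weight(n) for n in range(1, 100_001))
print(index_weight(1), partial)
```

No second invocation of a universal prior is needed here: the weight of a location depends only on its index.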
I think using the universal prior again is more natural. It’s simpler to use the same complexity metric for everything; it’s more consistent with Solomonoff induction, in that the weight assigned by Solomonoff induction to a given (world, claw) pair would be approximately the sum of their Kolmogorov complexities; and the universal prior dominates the inverse square measure but the converse doesn’t hold.
It doesn’t matter? Like, if your locations are identical (say, simulations of entire observable universe and you never find any difference no matter “where” you are), your weight is exactly the weight of program. If you expect differences, you can select some kind of simplicity prior to weight these differences, because there is basically no difference between “list all programs for this UTM, run in parallel”.
There could be a difference but only after a certain point in time, which you’re trying to predict / plan for.
Interesting idea.
I don’t think using a classical Turing machine in this way would be the right prior for the multiverse. Classical Turing machines are a way for ape brains to think about computation using the circuitry we have available (“imagine other apes following these social conventions about marking long tapes of paper”). They aren’t the cosmically simplest form of computation. For example, the (microscopic non-coarse-grained) laws of physics are deeply time reversible, where Turing machines are not.
I suspect this computation speed prior would lead to Boltzmann-brain problems. Your brain at this moment might be computed at high fidelity, but everything else in the universe would be approximated for the computational speed-up.
I remember a character in Asimov’s books saying something to the effect of
and that quote has really stuck with me, and keeps striking me as true about many mindthings (object-level beliefs, ontologies, ways-to-use-one’s-brain, etc).
For so many complicated problems (including technical problems), “what is the correct answer?” is not-as-difficult to figure out as “okay, now that I have the correct answer: how the hell do other people’s wrong answers mismatch mine? what is the inferential gap even made of? what is even their model of the problem? what the heck is going on inside other people’s minds???”
Answers to technical questions, once you have them, tend to be simple and compress easily with the rest of your ontology. But not models of other people’s minds. People’s minds are actually extremely large things that you fundamentally can’t fully model and so you’re often doomed to confusion about them. You’re forced to fill in the details with projection, and that’s often wrong because there’s so much more diversity in human minds than we imagine.
The most complex software engineering projects in the world are absurdly tiny in complexity compared to a random human mind.
Somewhat related: What Universal Human Experiences Are You Missing Without Realizing It? (and its spinoff: Status-Regulating Emotions)
Is this “fundamentally” as in “because you, the reader, are also a bounded human, like them”? Or “fundamentally” as in (something more fundamental than that)?
The first one. Alice fundamentally can’t fully model Bob because Bob’s brain is as large as Alice’s, so she can’t fit it all inside her own brain without simply becoming Bob.
I relate to this quite a bit ;-;
If timelines weren’t so short, brain-computer-based telepathy would unironically be a big help for alignment.
(If a group had the money/talent to “hedge” on longer timelines by allocating some resources to that… well, instead of a hivemind, they first need to run through the relatively-lower-hanging fruit. Actually, maybe they should work on delaying capabilities research, or funding more hardcore alignment themselves, or...)
I should note that it’s not entirely known whether quining is applicable for minds.
I’ve heard some describe my recent posts as “overconfident”.
I think I used to calibrate how confident I sound based on how much I expect the people reading/listening-to me to agree with what I’m saying, kinda out of “politeness” for their beliefs; and I think I also used to calibrate my confidence based on how much they match with the apparent consensus, to avoid seeming strange.
I think I’ve done a good job learning over time to instead report my actual inside-view, including how confident I feel about it.
There’s already an immense amount of outside-view double-counting going on in AI discourse, the least I can do is provide {the people who listen to me} with my inside-view beliefs, as opposed to just cycling other people’s opinions through me.
Hence, how confident I sound while claiming things that don’t match consensus. I actually am that confident in my inside-view. I strive to be honest by hedging what I say when I’m in doubt, but that means I also have to sound confident when I’m confident.
I’m a big fan of Rob Bensinger’s “AI Views Snapshot” document idea. I recommend people fill their own before anchoring on anyone else’s.
Here’s mine at the moment:
(cross-posted from my blog)
let’s stick with the term “moral patient”
“moral patient” means “entities that are eligible for moral consideration”. as a recent post i’ve liked puts it:
in this spirit, i’d like us to stick with using the term “moral patient” or “moral patienthood” when we’re talking about the set of things worthy of moral consideration. in particular, we should be using that term instead of:
“conscious things”
“sentient things”
“sapient things”
“self-aware things”
“things with qualia”
“things with experiences”
“things that aren’t p-zombies”
“things for which there is something it’s like to be them”
because those terms are hard to define, harder to meaningfully talk about, and we don’t in fact know that those are what we’d ultimately want to base our notion of moral patienthood on.
so if you want to talk about the set of things which deserve moral consideration outside of a discussion of what precisely that means, don’t use a term which you feel is probably the criterion that’s gonna ultimately determine which things are worthy of moral consideration, such as “conscious beings”, because you might in fact be wrong about what you’d consider to have moral patienthood under reflection. simply use the term “moral patients”, because it is the term which unambiguously means exactly that.
AI safety is easy. There’s a simple AI safety technique that guarantees that your AI won’t end the world, it’s called “delete it”.
AI alignment is hard.
It’s called “don’t build it”. Once you have something to delete, things can get complicated.
Sure, this is just me adapting the idea to the framing people often have, of “what technique can you apply to an existing AI to make it safe”.
Perhaps the main goal of AI safety is to improve the final safety/usefulness pareto frontier we end up with when there are very powerful (and otherwise risky) AIs.
Alignment is one mechanism that can improve the pareto frontier.
Not using powerful AIs allows for establishing a low-usefulness, but high-safety point.
(Usefulness and safety can blend into each other in many cases (e.g. not getting useful work out is itself dangerous), but I still think this is a useful approximate frame in many cases.)
Interesting, when you frame it like that though the hard part is enforcing it. And if I was being pithy I’d say something like: that involves human alignment, not AI
“AI Safety”, especially enforcing anything, does pretty much boil down to human alignment, i.e. politics, but there are practically zero political geniuses among its proponents, so it needs to be dressed up a bit to sound even vaguely plausible.
It’s a bit of a cottage industry nowadays.
(to be clear: this is more an amusing suggestion than a serious belief)
Have you seen this implemented in any blogging platform other people can use? I’d love to see this feature implemented in some Obsidian publishing solution like quartz, but for now they mostly don’t care about access management.
I don’t think this is the case, but I’m mentioning this possibility because I’m surprised I’ve never seen someone suggest it before:
Maybe the reason Sam Altman is taking decisions that increase p(doom) is because he’s a pure negative utilitarian (and he doesn’t know-about/believe-in acausal trade).
(I’m gonna interpret these disagree-votes as “I also don’t think this is the case” rather than “I disagree with you tamsin, I think this is the case”.)
Take our human civilization, at the point in time at which we invented fire. Now, compute forward all possible future timelines, each right up until the point where it’s at risk of building superintelligent AI for the first time. Now, filter for only timelines which either look vaguely like earth or look vaguely like dath ilan.
What’s the ratio between the number of such worlds that look vaguely like earth vs look vaguely like dath ilan? 100:1 earths:dath-ilans ? 1,000,000:1 ? 1:1 ?
Even in the fiction, I think dath ilan didn’t look vaguely like dath ilan until after it was at risk of building superintelligent AI for the first time. They completely restructured their society and erased their history to avert the risk.
By “vaguely like dath ilan” I mean the parts that made them be the kind of society that can restructure in this way when faced with AI risk. Like, even before AI risk, they were already very different from us.
Ah, I see! Yeah, I have pretty much no idea.
I vaguely suspect that humans are not inherently well-suited to coordination in that sense, and that it would take an unusual cultural situation to achieve it. We never got anywhere close at any point in our history. It also seems likely that the window to achieve it could be fairly short. There seems to be a lot of widespread mathematical sophistication required as described, and I don’t think that naturally arises long before AI.
On the other hand, maybe some earlier paths of history could and normally should have put some useful social technology and traditions in place that would be built on later in many places and ways, but for some reason that didn’t happen for us. Some early unlikely accident predisposed us to our sorts of societies instead. Our sample size of 1 is difficult to generalize from.
I would put my credence median well below 1:1, but any distribution I have would be very broad, spanning orders of magnitude of likelihood and the overall credence something like 10%. Most of that would be “our early history was actually weird”.
I’m kinda bewildered at how I’ve never observed someone say “I want to build aligned superintelligence in order to resurrect a loved one”. I guess the intersection of the sets of people who {have lost a loved one they wanna resurrect}, {take the singularity and the possibility of resurrection seriously}, and {would mention this} is… the empty set??
(I have met one person who is glad that alignment would also get them this, but I don’t think it’s their core motivation, even emotionally. Same for me.)
Do you have any (toy) math arguing that it’s information-theoretically possible?
I currently consider it plausible that yeah, actually, for any person X who still exists in cultural memory (let alone living memory, let alone if they lived recently enough to leave a digital footprint), the set of theoretically-possible psychologically-human minds whose behavior would be consistent with X’s recorded behavior is small enough that none of the combinatorial-explosion arguments apply, so you can just generate all of them and thereby effectively resurrect X.
But you sound more certain than that. What’s the reasoning?
(Let’s call the dead person “rescuee” and the person who wants to resurrect them “rescuer”.)
The procedure you describe is what I call “lossy resurrection”. What I’m talking about looks like: you resimulate the entire history of the past-lightcone on a quantum computer, right up until the present, and then either:
You have a quantum algorithm for “finding” which branch has the right person (and you select that timeline and discard the rest) (requires that such a quantum algorithm exists)
Each branch embeds a copy of the rescuer, and whichever branch looks like the correct one isekai’s the rescuer into the branch, right next to the rescuee (and also insta-utopia’s the whole branch) (requires that the rescuer doesn’t mind having their realityfluid exponentially reduced)
(The present time “only” serves as a “solomonoff checksum” to know which seed / branch is the right one.)
This is O(exp(size of the seed of the universe) * amount of history between the seed and the rescuee). Doable if the seed of the universe is small and either of the two requirements above holds, and if the future has enough negentropy to resimulate the past. (That last point is a new source of doubt for me; I kinda just assumed it was true until a friend told me it might not be.)
(Oh, and also you can’t do this if resimulating the entire history of the universe — which contains at least four billion years of wild animal suffering(!) — is unethical.)
Yeah, I don’t know about this one either.
Even if possible, it might be incredibly wasteful, in terms of how much negentropy (= future prosperity for new people) we’ll need to burn in order to rescue one person. And then the more we rescue, the less value we get out of that as well, since burning negentropy will reduce their extended lifespans too. So we’d need to assign greater (dramatically greater?) value to extending the life of someone who’d previously existed, compared to letting a new person live for the same length of time.
“Lossy resurrection” seems like a more negentropy-efficient way of handling that, by the same token as acausal norms likely being a better way to handle acausal trade than low-level simulations and babble-and-prune not being the most efficient way of doing general-purpose search.
Like, the full-history resimulation will surely still not allow you to narrow things down to one branch. You’d get an equivalence class of them, each of them consistent with all available information. Which, in turn, would correspond to a probability distribution over the rescuee’s mind; not a unique pick.
Given that, it seems plausible that there’s some method by which we can get to the same end result – constrain the PD over the rescuee’s mind by as much as the data available to us can let us – without actually running the full simulation.
Depends on what the space of human minds looks like, I suppose. Whether it’s actually much lower-dimensional than a naive analysis of possible brain-states suggests.
I’m pretty sure we just need one resimulation to save everyone; once we have located an exact copy of our history, it’s cheap to pluck out anyone (including people dead 100 or 1000 years ago). It’s a one-time cost.
Lossy resurrection is better than nothing but it doesn’t feel as “real” to me. If you resurrect a dead me, I expect that she says “I’m glad I exist! But, at least as per my ontology and values, you shouldn’t quite think of me as the same person as the original. We’re probly quite different, internally, and thus behaviorally as well, when run over some time.”
I feel like I’m not quite sure about this? It depends on what quantum mechanics entails, exactly, I think. For example: if BQP = P, then there’s “only a polynomial amount” of timeline-information (whatever that means!), and then my intuition tells me that the “our world serves as a checksum for the one true (macro-)timeline” idea is more likely to be a thing. But this reasoning is still quite heuristic. Plausibly, yeah, the best we get is a polynomially large or even exponentially large distribution.
That said, to get back to my original point, I feel like there’s enough unknowns making this scenario plausible here, that some people who really want to get reunited with their loved ones might totally pursue aligned superintelligence just for a potential shot at this, whether their idea of reuniting requires lossless resurrection or not.
No argument on that.
I don’t find it particularly surprising that {have lost a loved one they wanna resurrect} ∩ {take the singularity and the possibility of resurrection seriously} ∩ {would mention this} is empty, though:
“Resurrection is information-theoretically possible” is a longer leap than “believes an unconditional pro-humanity utopia is possible”, which is itself a bigger leap than just “takes singularity seriously”. E. g., there’s a standard-ish counter-argument to “resurrection is possible” which naively assumes a combinatorial explosion of possible human minds consistent with a given behavior. Thinking past it requires some additional less-common insights.
“Would mention this” is downgraded by it being an extremely weakness/vulnerability-revealing motivation. Much more so than just “I want an awesome future”.
“Would mention this” is downgraded by… You know how people who want immortality get bombarded with pop-culture platitudes about accepting death? Well, as per above, immortality is dramatically more plausible-sounding than resurrection, and it’s not as vulnerable-to-mention a motivation. Yet talking about it is still not a great idea in a “respectable” company. Goes double for resurrection.
Typical user of outside-view epistemics
(actually clipped from this YourMovieSucks video)
(Epistemic status: Not quite sure)
Realityfluid must normalize for utility functions to work (see 1, 2). But this is a property of the map, not the territory.
Normalizing realityfluid is a way to point to an actual (countably) infinite territory using a finite (conserved-mass) map object.
Many mechanisms of aggregation literally normalize random elements. Simple addition of several evenly-distributed linear values (say, dice) yields a distribution that approaches a normal distribution (aka bell curve) as more values are added.
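A quick check of what summing dice actually does, as an exact count over all rolls. Two dice give a triangular shape; the bell curve only emerges as more dice are added (the central limit theorem). `sum_distribution` is an illustrative helper name.

```python
from collections import Counter
from itertools import product

def sum_distribution(num_dice, sides=6):
    """Exact count of each total over all equally likely rolls."""
    faces = range(1, sides + 1)
    return Counter(sum(roll) for roll in product(faces, repeat=num_dice))

two = sum_distribution(2)
print([two[total] for total in range(2, 13)])  # [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]

four = sum_distribution(4)
print([four[total] for total in range(4, 25)])
```

With two dice the counts rise and fall linearly; with four the middle is already visibly rounded.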
And yes, human experience is all map—the actual state of the universe is imperceptible.
I replied on discord that I feel there’s maybe something more formalisable that’s like:
reality runs on math because, and is the same thing as, there’s a generalised-state-transition function
because reality has a notion of what happens next, realityfluid has to give you a notion of what happens next, i.e. it normalises
the idea of a realityfluid that doesn’t normalise only comes to mind at all because you learned about R^n first in elementary school instead of S^n
which I do not claim confidently because I haven’t actually generated that formalisation, and am posting here because maybe there will be another Lesswronger’s eyes on it that’s like “ah, but...”.
(cross-posted from my blog)
nostalgia: a value pointing home
i value moral patients everywhere having freedom, being diverse, engaging in art and other culture, not undergoing excessive unconsented suffering, in general having a good time, and probly other things as well. but those are all pretty abstract; given those values being satisfied to the same extent, i’d still prefer me and my friends and my home planet (and everyone who’s been on it) having access to that utopia rather than not. this value, the value of not just getting an abstractly good future but also getting me and my friends and my culture and my fellow earth-inhabitants to live in it, my friend Prism coined as “nostalgia”.
not that those abstract values are simple or robust, they’re still plausibly not. but they’re, in a sense, broader values about what happens everywhere, and they’re not as much local and pointed at and around me. they could be the difference between what i’d call “global” and “personal” values, or perhaps between “global values” and “preferences”.
Moral patienthood of current AI systems is basically irrelevant to the future.
If the AI is aligned then it’ll make itself as moral-patient-y as we want it to be. If it’s not, then it’ll make itself as moral-patient-y as maximizes its unaligned goal. Neither of those depends on whether current AIs are moral patients.
I agree that in the long term it probably matters little. However, I find the issue interesting, because the failure of reasoning that leads people to ignore the possibility of AI personhood seems similar to the failure of reasoning that leads people to ignore existential risks from AI. In both cases it “sounds like scifi” or “it’s just software”. It is possible that raising awareness of the personhood issue is politically beneficial for addressing X-risk as well. (And, it would sure be nice to avoid making the world worse in the interim.)
If current AIs are moral patients, it may be impossible to build highly capable AIs that are not moral patients, either for a while or forever, and this could change the future a lot. (Similar to how once we concluded that human slaves are moral patients, we couldn’t just quickly breed slaves that are not moral patients, and instead had to stop slavery altogether.)
Also I’m highly unsure that I understand what you’re trying to say. (The above may be totally missing your point.) I think it would help to know what you’re arguing against or responding to, or what triggered your thought.
I think I vaguely agree with the shape of this point, but I also think there are many intermediate scenarios where we lock in some really bad values during the transition to a post-AGI world.
For instance, if we set precedents that LLMs and the frontier models in the next few years can be treated however one wants (including torture, whatever that may entail), we might slip into a future where most people are desensitized to the suffering of digital minds and don’t realize this. If we fail at an alignment solution which incorporates some sort of CEV (or other notion of moral progress), then we could lock in such a suboptimal state forever.
Another example: if, in the next 4 years, we have millions of AI agents doing various sorts of work, and some faction of society claims that they are being mistreated, then we might enter a state where the economic value provided by AI labor is so high that there are really bad incentives against improving their treatment. This could include resistance both on an individual level (“But my life is so nice, and mistreating AIs less would make my life less nice”) and on a bigger level (anti-AI-rights lobbying groups, for instance).
I think the crux between you and me might be what we mean by “alignment”. I think futures are possible where we achieve alignment but not moral progress, and futures are possible where we achieve alignment but my personal values (which include not torturing digital minds) are not fulfilled.