Yes, that’s a background assumption of the conjecture; I think making that assumption and exploring the consequences is helpful.
Self-modify into something that’s basically an eldritch abomination from a human perspective, either deliberately or as part of a self-modification process gone wrong.
Right, totally, then all bets are off. The scenario is underspecified. My default imagination of “aligned” AGI is corrigible AGI. (In fact, I’m not even totally sure that it makes much sense to talk of aligned AGI that’s not corrigible.) Part of corrigibility would be that if:
the human asks you to do X,
and X would have irreversible consequences,
and the human is not aware of / doesn’t understand those consequences,
and the consequences would make the human unable to notice or correct the change,
and the human, if aware, would have really wanted to not do X or at least think about it a bunch more before doing it,
then you DEFINITELY don’t just go ahead and do X lol!
In other words, a corrigible AGI is supposed to use its intelligence to possibilize self-alignment for the human.
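For concreteness, here is a minimal sketch of that conjunctive veto rule in Python. This is purely an illustration of the structure of the conditions above; the class and function names are hypothetical, and nothing here is a claim about how a corrigible AGI would actually be built.

```python
# A minimal sketch (illustration only, not anyone's proposed implementation) of the
# conjunctive veto rule described above: the corrigible move is to refuse and flag
# the request only when *all* of the listed conditions hold. All names are hypothetical.
from dataclasses import dataclass


@dataclass
class RequestAssessment:
    requested_by_human: bool             # the human asks you to do X
    irreversible: bool                   # X would have irreversible consequences
    human_unaware_of_consequences: bool  # the human doesn't understand those consequences
    blocks_later_correction: bool        # the consequences would prevent noticing/correcting
    human_would_object_if_aware: bool    # if aware, the human would want to pause or refuse


def should_refuse_and_flag(a: RequestAssessment) -> bool:
    """True when every condition in the list holds, i.e. you don't just go ahead and do X."""
    return (
        a.requested_by_human
        and a.irreversible
        and a.human_unaware_of_consequences
        and a.blocks_later_correction
        and a.human_would_object_if_aware
    )


# An irreversible self-modification the human doesn't understand and would, on
# reflection, want to think more about -> veto and surface the issue.
assert should_refuse_and_flag(RequestAssessment(True, True, True, True, True))
# A fully understood, reversible request -> this particular rule doesn't trigger.
assert not should_refuse_and_flag(RequestAssessment(True, False, False, False, False))
```

The only point of writing it out is that the five conditions are conjoined; a system that vetoes on much weaker grounds, or that executes X despite all five holding, is doing something other than what I mean by corrigible here.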
Make some minimal self-modifications to avoid value drift, precisely not to let the sort of stuff you’re talking about happen.
I think this notion of values, and hence of value drift, is probably mistaken as applied to humans. Human values are meta and open—part of the core argument of my OP (the bullet point about communing).
Stick to behavioral patterns that would lead to never changing their mind/never value-drifting, either as an “accidental” emergent property of their behavior
So first they carefully construct an escape-proof cage for all the other humans, and then they become a perma-zombie? Not implausible, like they could for some reason specifically ask the AGI to do this, but IDK why they would.
or as an implicit preference they never tell their aligned ASI to satisfy, but which it infers and carefully ensures the satisfaction of.
Doesn’t sound very corrigible? Not sure.
Immediately disassemble the rest of humanity for raw resources, like any good misaligned agent would, and never think about it again. Edit out their social instincts or satisfy them by interacting with each other/with constructs.
Right, certainly they could. Who actually would? (Not rhetorical.)
Overall, I think all hopeful scenarios about “even a not-very-good person elevated to godhood would converge to goodness over time!” fail to feel the Singularity. It’s not going to be basically business as usual for any prolonged length of time; things are going to get arbitrarily weird essentially immediately.
I think you’re failing to feel the Singularity, and instead you’re extrapolating to like “what would a really really bad serial killer / dictator do if they were being an extra bad serial killer / dictator times 1000”. Or IDK, I don’t know what you think. What do you think would actually happen if a random person were put in the corrigible AGI control seat?
Things can get weird, but for the person to cut out a bunch of their core humanity, kinda seems like either the AGI isn’t really corrigible or isn’t really AGI (such that the emperor-AGI system is being dumb by its own lights), or else the person really wanted to do that. Why do you think people want to do that? Do you want to do that? I don’t.
If they don’t cut out a bunch of their core humanity, then my question and conjecture are live.
Human values are meta and open—part of the core argument of my OP (the bullet point about communing).
Unless the human, on reflection, doesn’t want some specific subset of their current values to be open to change / has meta-level preferences to freeze some object-level values. Which I think is common. (Source: I have meta-preferences to freeze some of my object-level values at “eudaimonia”, and I take specific deliberate actions to avoid or refuse value-drift on that.)
Not implausible, like they could for some reason specifically ask the AGI to do this, but IDK why they would.
Callousness. “We probably need to do something about the rest of humanity, probably shouldn’t just wipe them all out, lemme draft some legislation, alright looks good, rubber-stamp it and let’s move on”. Tons of bureaucracies and people in power seem to act this way today, including decisions that impact the fates of millions.
Right, certainly they could. Who actually would? (Not rhetorical.)
I don’t know that Genghis Khan or Stalin wouldn’t have. Some clinical psychopaths or philosophical extremists (e.g., the human successionists) certainly would.
What do you think would actually happen if a random person were put in the corrigible AGI control seat?
Mm...
First, I think “corrigibility to a human” is underdefined. A human is not, themselves, a coherent agent with a specific value/goal-slot to which an AI can be corrigible.
Like, is it corrigible to a human’s momentary impulses? Or to the command the human would give if they thought for five minutes? For five days? Or perhaps to the command they’d give if the AI taught them more wisdom? But then which procedure should the AI choose for teaching them more wisdom? The outcome is likely path-dependent on that: on the choice between curriculum A and curriculum B. And if so, what procedure should the AI use to decide what curriculum to use? Or should the AI perhaps basically ignore the human in front of them, and simply interpret them as a rough pointer to CEV? Well, that assumes the conclusion, and isn’t really “corrigibility” at all, is it?
The underlying issue here is that “a human’s values” are themselves underdefined. They’re derived in a continual, path-dependent fashion, by an unstable process with lots of recursions and meta-level interference. There’s no unique ground-true set of values which the AI should take care not to step onto. This leaves three possibilities:
1. The AI acts as a tool that does what the human knowingly instructs it to do, with the wisdom by-default outsourced to the human.
   But then it is possible to use it unwisely. For example, if the human operator is smart enough to foresee issues with self-modification, they could ask the AI to watch out for that. They could also ask it to watch out for that whole general class of unwise-on-the-part-of-the-human decisions. But they can also fail to do so, or unwisely ignore a warning in a fit of emotion, or have some beliefs about how decisions Ought to be Done that they’re unwilling to even discuss with the AI.
2. The AI never does anything, because it knows that any of its actions can step onto one of the innumerable potential endpoints of a human’s self-reflection process.
   But then it is useless.
3. The AI isn’t corrigible at all, it just optimizes for some fixed utility function, if perhaps with an indirect pointer to it (“this human’s happiness”, “humanity’s CEV”, etc.).
(1) is the only possibility worth examining here, I think.
And what I expect to happen if an untrained, philosophically median human is put in control of a tool ASI, is some sort of catastrophe. They would have various cached thoughts about how the story ought to go, what the greater good is, who the villains are, how the society ought to be set up. These thoughts would be endorsed at the meta-level, and not open to debate. The human would not want to ask the ASI to examine those; if the ASI attempts to challenge them as part of some other request, the human would tell it to shut up.[1]
In addition, the median human is not, really, a responsible person. If put in control of an ASI, they would not suddenly become appropriately responsible. It wouldn’t by default occur to them to ask the ASI to make them more responsible, either, because that’s itself a very responsible thing to do. The way it would actually go is that they’d be impulsive, emotional, callous, rash, unwise, cognitively lazy.
Some sort of stupid and callous outcome is likely to result. Maybe not specifically “self-modifying into a monster/zombie and trapping humanity in a dystopian prison”, but something in that reference class of outcomes.
Not to mention if the human has some extant prejudices: racism or any other manner of “assigning different moral worth to different sapient beings based on arbitrary features”. The stupid-callous-impulsive process would spit out some not-very-pleasant fate for the undesirables, and this would be reflectively endorsed on some level, so a genuine tool-like corrigible ASI[2] wouldn’t say a word of protest.
Maybe I am being overly cynical about this, that’s definitely possible. Still, that’s my current model.
Source: I would not ask the ASI to search for arguments against eudaimonia-maximization, or ask it to check if there’s something else that “I” “should” be pursuing instead, because I do not want to be argued out of that even if there’s some coherent, true, and compelling sense in which it is not what “I” “actually” “want”. If the ASI asks whether it should run that check as part of some other request, I would tell it to shut up.
(Note that it’s different from examining whether my idea of eudaimonia/human flourishing/the-thing-I-mean-when-I-say-human-flourishing is correct/good, or whether my fundamental assumptions about how the world works are correct, etc.)
As opposed to a supposedly corrigible but secretly eudaimonic ASI which, in one’s imagination, always happens to gently question the human’s decisions when the human orders it to do something bad, and then happens to pick the specific avenues of questioning that make the human “realize” they wanted good things all along.
How about for example:

The AGI helps out with increasing the human’s ability to follow through on attempts at internal organization (e.g. thinking, problem solving, reflecting, coherentifying) that normally the human would try a bit and then give up on.
Not saying this is some sort of grand solution to corrigibility, but it’s obviously better than the nonsense you listed. If a human were going to try to help me out, I’d want this, for example, more than the things you listed, and it doesn’t seem especially incompatible with corrigible behavior.
First, I think “corrigibility to a human” is underdefined. A human is not, themselves, a coherent agent with a specific value/goal-slot to which an AI can be corrigible.
I mean, yes, but you wrote a lot of stuff after this that seems weird / missing the point, to me. A “corrigible AGI” should do at least as well as—really, much better than—you would do, if you had a huge team of researchers under you and your full-time, 100,000x-speed job were to do a really good job at “being corrigible, whatever that means” to the human in the driver’s seat. (In the hypothetical you’re on board with this for some reason.)
(Source: I have meta-preferences to freeze some of my object-level values at “eudaimonia”, and I take specific deliberate actions to avoid or refuse value-drift on that.)
I would guess fairly strongly that you’re mistaken or confused about this, in a way that an AGI would understand and be able to explain to you. (An example of how that would be the case: the version of “eudaimonia” that would not horrify you, if you understood it very well, has to involve meta+open consciousness (of a rather human flavor).)
Source: I have meta-preferences to freeze some of my object-level values at “eudaimonia”, and I take specific deliberate actions to avoid or refuse value-drift on that.
I’m curious to hear more about those specific deliberate actions.
Some sort of stupid and callous outcome is likely to result. Maybe not specifically “self-modifying into a monster/zombie and trapping humanity in a dystopian prison”, but something in that reference class of outcomes.
Your and my beliefs/questions don’t feel like they’re even much coming into contact with each other… Like, you (and also other people) just keep repeating “something bad could happen”. And I’m like “yeah obviously something extremely bad could happen; maybe it’s even likely, IDK; and more likely, something very bad at the beginning of the reign would happen (Genghis spends his first 200 years doing more killing and raping); but what I’m ASKING is, what happens then?”.
If you’re saying
There is a VERY HIGH CHANCE that the emperor would PERMANENTLY put us into a near-zero value state or a negative-value state.
then, ok, you can say that, but I want to understand why; and I have some reasons (as presented) for thinking otherwise.
Your hypothesis is about the dynamics within human minds embedded in something like contemporary societies with lots of other diverse humans whom the rulers are forced to model for one reason or another.
My point is that evil, rash, or unwise decisions at the very start of the process are likely, and that those decisions are likely to irrevocably break the conditions in which the dynamics you hypothesize are possible. Make the minds in charge no longer human in the relevant sense, or remove the need to interact with/model other humans, etc.
In my view, it doesn’t strongly bear on the final outcome-distribution whether the “humans tend to become nicer to other humans over time” hypothesis is correct, because “the god-kings remain humans hanging around all the other humans in a close-knit society for millennia” is itself a very rare class of outcomes.
Your hypothesis is about the dynamics within human minds embedded in something like contemporary societies with lots of other diverse humans whom the rulers are forced to model for one reason or another.
Absolutely not, no. Humans want to be around (some) other people, so the emperor will choose to be so. Humans want to be [many core aspects of humanness, not necessarily per se, but individually], so the emperor will choose to be so. Yes, the emperor could want these insufficiently for my argument to apply, as I’ve said earlier. But I’m not immediately recalling anyone (you or others) making any argument that, with high or even substantial probability, the emperor would not want these things sufficiently for my question, about the long-run of these things, to be relevant.
Yes: some other people. The ideologically and morally aligned people, usually. Social/informational bubbles that screen away the rest of humanity, from which they only venture out if forced to (due to the need to earn money/control the populace, etc.). This problem seems to get worse as the ability to insulate yourself from others improves, as can be observed with modern internet-based informational bubbles or the surrounded-by-yes-men problem of dictators.
ASI would make this problem transcendental: there would truly be no need to ever bother with the people outside your bubble again, they could be wiped out or their management outsourced to AIs.
Past this point, you’re likely never returning to bothering about them. Why would you, if you can instead generate entire worlds of the kinds of people/entities/experiences you prefer? It seems incredibly unlikely that human social instincts can only be satisfied – or even can be best satisfied – by other humans.
It seems incredibly unlikely that human social instincts can only be satisfied – or even can be best satisfied – by other humans.
You’re 100% not understanding my argument, which is sorta fair because I didn’t lay it out clearly, but I think you should be doing better anyway.
Here’s a sketch:
1. Humans want to be human-ish and be around human-ish entities.
2. So the emperor will be human-ish and be around human-ish entities for a long time. (Ok, to be clear, I mean a lot of developmental / experiential time—the thing that’s relevant for thinking about how the emperor’s way of being trends over time.)
3. When being human-ish and around human-ish entities, core human shards continue to work.
4. When core human shards continue to work, MAYBE this implies EVENTUALLY adopting beneficence (or something else like cosmopolitanism), and hence good outcomes.
5. Since the emperor will be human-ish and be around human-ish entities for a long time, IF 4 obtains, then good outcomes.

And then I give two IDEAS about 4 (communing->[universalist democracy], and [information increases]->understanding->caring).
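(To make the conditional structure explicit, here is a rough propositional paraphrase of 1–5; the formalization is just my gloss on the sketch, not anything load-bearing.)

```latex
% Rough propositional skeleton of the sketch above (a paraphrase, not a precise
% formalization of the argument).
% H: the emperor stays human-ish, around human-ish entities, over a long
%    developmental/experiential time
% S: core human shards continue to work
% G: good outcomes (beneficence, or something else like cosmopolitanism, is
%    eventually adopted)
\begin{align*}
&\text{(1--2)} && H \\
&\text{(3)}    && H \rightarrow S \\
&\text{(4, the open conjecture)} && S \rightarrow G \\
&\text{(5)}    && \bigl(H \wedge (H \rightarrow S) \wedge (S \rightarrow G)\bigr) \rightarrow G
\end{align*}
```

That is, 4 is the load-bearing conjecture; 1–3 and 5 are the bookkeeping around it.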
I don’t know what’s making you think I don’t understand your argument. Also, I’ve never publicly stated that I’m opting into Crocker’s Rules, so while I happen not to particularly mind the rudeness, your general policy on that seems out of line here.
When being human-ish and around human-ish entities, core human shards continue to work.
My argument is that the process you’re hypothesizing would be sensitive to the exact way of being human-ish, the exact classes of human-ish entities around, and the exact circumstances in which the emperor has to be around them.
As a plain and down-to-earth example, if a racist surrounds themselves with a hand-picked group of racist friends, do you expect them to eventually develop universal empathy, solely through interacting with said racist friends? Addressing your specific ideas: nobody in that group would ever need to commune with non-racists, nor have to bother learning more about non-racists. And empirically, such groups don’t seem to undergo spontaneous deradicalizations.
As a plain and down-to-earth example, if a racist surrounds themselves with a hand-picked group of racist friends, do you expect them to eventually develop universal empathy, solely through interacting with said racist friends? Addressing your specific ideas: nobody in that group would ever need to commune with non-racists, nor have to bother learning more about non-racists. And empirically, such groups don’t seem to undergo spontaneous deradicalizations.

I expect they’d get bored with that.
So what do you think happens when they are hanging out together, and they are in charge, and it has been 1,000 years or 1,000,000 years?
One or both of:

They keep each other radicalized forever as part of some transcendental social dynamic.
They become increasingly non-human as time goes on, small incremental modifications and personality changes building on each other, until they’re no longer human in the senses necessary for your hypothesis to apply.
I assume your counter-model involves them getting bored of each other and seeking diversity/new friends, or generating new worlds to explore/communicate with, with the generating processes not constrained to only generate racists, leading to the extremists interacting with non-extremists and eventually incrementally adopting non-extremist perspectives?
If yes, this doesn’t seem like the overdetermined way for things to go:
The generating processes would likely be skewed towards only generating things the extremists would find palatable, meaning more people sharing their perspectives/not seriously challenging whatever deeply seated prejudices they have. They’re there to have a good time, not have existential/moral crises.
They may make any number of modifications to themselves to make them no longer human-y in the relevant sense. Including by simply letting human-standard self-modification algorithms run for 10^3-10^6 years, becoming superhumanly radicalized.
They may address the “getting bored” part instead, periodically wiping their memories (including by standard human forgetting) or increasing each other’s capacity to generate diverse interactions.
Ok so they only generate racists and racially pure people. And they do their thing. But like, there’s no other races around, so the racism part sorta falls by the wayside. They’re still racially pure of course, but it’s usually hard to tell that they’re racist; sometimes they sit around and make jokes to feel superior over lesser races, but this is pretty hollow since they’re not really engaged in any type of race relations. Their world isn’t especially about all that, anymore. Now it’s about… what? I don’t know what to imagine here, but the only things I do know how to imagine involve unbounded structure (e.g. math, art, self-reflection, self-reprogramming). So, they’re doing that stuff. For a very long time. And the race thing just is not a part of their world anymore. Or is it? I don’t even know what to imagine there. Instead of having tastes about ethnicity, they develop tastes about questions in math, or literature. In other words, [the differences between people and groups that they care about] migrate from race to features of people that are involved in unbounded stuff. If the AGI has been keeping the racially impure in an enclosure all this time, at some point the racists might have a glance back, and say, wait, all the interesting stuff about people is also interesting about these people. Why not have them join us as well.
Past this point, you’re likely never returning to bothering about them. Why would you, if you can instead generate entire worlds of the kinds of people/entities/experiences you prefer? It seems incredibly unlikely that human social instincts can only be satisfied – or even can be best satisfied – by other humans.
For the same reason that most people (if given the power to do so) wouldn’t just replace their loved ones with their altered versions that are better along whatever dimensions the person judged them as deficient/imperfect.
I don’t know that Genghis Khan or Stalin wouldn’t have. Some clinical psychopaths or philosophical extremists (e.g., the human successionists) certainly would.
Yeah I mean this is perfectly plausible, it’s just that even these cases are not obvious to me.