Anirandis

Karma: 131

Anirandis 4 Jun 2020 22:14 UTC
1 point
in reply to: lsusr’s comment on: Pessimism over AGI/ASI causing psychological distress?
The American government is shit, don’t get me wrong, but in modern times it’s much better than some others. The US government isn’t currently ethnically cleansing and hasn’t done for a while now. I’m a lot less worried about them doing anything than certain other governments.

[Question] Likelihood of hyperexistential catastrophe from a bug?

Anirandis 18 Jun 2020 16:23 UTC

14 points

27 comments · 1 min read · LW link

Anirandis 18 Jun 2020 21:22 UTC
1 point
in reply to: ESRogs’s comment on: Likelihood of hyperexistential catastrophe from a bug?
What do you think the difference would be between an AGI’s reward function, and that of GPT-2 during the error it experienced?

[Question] ‘Maximum’ level of suffering?

Anirandis 20 Jun 2020 14:05 UTC

6 points

16 comments · 1 min read · LW link

Anirandis 25 Jul 2020 2:46 UTC
3 points
on: Open & Welcome Thread—July 2020
As anyone could tell from my posting history, I’ve been obsessing & struggling psychologically recently when evaluating a few ideas surrounding AI (what if we make a sign error on the utility function, malevolent actors creating a sadistic AI, AI blackmail scenarios, etc.) It’s predominantly selfishly worrying about things like s-risks happening to me, or AI going wrong so I have to live in a dystopia and can’t commit suicide. I don’t worry about human extinction (although I don’t think that’d be a good outcome, either!)

I’m wondering if anyone’s gone through similar anxieties and has found a way to help control them? I’m diagnosed ASD and I wouldn’t consider it unlikely that I’ve got OCD or something similar on top of it, so it’s possibly just that playing up.

Anirandis 25 Jul 2020 14:37 UTC
3 points
in reply to: Zack_M_Davis’s comment on: Open & Welcome Thread—July 2020
Thanks for your response, just a few of my thoughts on your points:
If you *can* stop doing philosophy and futurism
To be honest, I’ve never really *wanted* to be involved with this. I only really made an account here *because* of my anxieties and wanted to try to talk myself through them.
If an atom-for-atom identical copy of you, *is* you, and an *almost* identical copy is *almost* you, then in a sufficiently large universe where all possible configurations of matter are realized, it makes more sense to think about the relative measure of different configurations rather than what happens to “you”.
I don’t buy that theory of personal-identity personally. It seems to me that if the biological me that’s sitting here right now isn’t *feeling* the pain, that’s not worth worrying about as much. Like, I can *imagine* that a version of me might be getting tortured horribly or experiencing endless bliss, but my consciousness doesn’t (as far as I can tell) “jump” over to those versions. Similarly, were *I* to get tortured it’d be unlikely that I care about what’s happening to the “other” versions of me. The “continuity of consciousness” theory *seems* stronger to me, although admittedly it’s not something I’ve put a lot of thought into. I wouldn’t want to use a teleporter for the same reasons.
*And* there are evolutionary reasons for a creature like you to be *more* unable to imagine the scope of the great things.
Yes, I agree that it’s possible that the future could be just as good as an infinite torture future would be bad. And that my intuitions are somewhat lopsided. But I do struggle to find that comforting. Were an infinite-torture future realised (whether it be a SignFlip error, an insane neuromorph, etc.) the fact that I could’ve ended up in a utopia wouldn’t console me one bit.

Anirandis 6 Aug 2020 14:18 UTC
10 points
on: Open & Welcome Thread—August 2020
Is it plausible that an AGI could have some sort of vulnerability (a buffer overflow, maybe?) that could be exploited (maybe by an optimization daemon…?) and cause a sign flip in the utility function?
How about an error during self-improvement that leads to the same sort of outcome? Should we expect an AGI to sanity-check its successors, even if it’s only at or below human intelligence?
Sorry for the dumb questions, I’m just still nervous about this sort of thing.

Anirandis 6 Aug 2020 15:45 UTC
2 points
in reply to: ChristianKl’s comment on: Open & Welcome Thread—August 2020
Would it be likely for the utility function to flip *completely*, though? There’s a difference between some drift in the utility function and the AI screwing up and designing a successor with the complete opposite of its utility function.

Anirandis 7 Aug 2020 17:33 UTC
4 points
in reply to: ChristianKl’s comment on: Open & Welcome Thread—August 2020
The scenario I’m imagining isn’t an AGI that merely “gets rid of” humans. See SignFlip.

Anirandis 15 Aug 2020 22:52 UTC
4 points
in reply to: gwern’s comment on: Open & Welcome Thread—August 2020
Do you think that this type of thing could plausibly occur *after* training and deployment?

Anirandis 16 Aug 2020 0:30 UTC
2 points
in reply to: gwern’s comment on: Open & Welcome Thread—August 2020
Interesting. Terrifying, but interesting.
Forgive me for my stupidity (I’m not exactly an expert in machine learning), but it seems to me that building an AGI linked to some sort of database like that in such a fashion (that some random guy’s screw-up can effectively reverse the utility function completely) is a REALLY stupid idea. Would there not be a safer way of doing things?

Anirandis 19 Aug 2020 0:51 UTC
2 points
in reply to: mako yass’s comment on: Open & Welcome Thread—August 2020
If we actually built an AGI that optimised to maximise a loss function, wouldn’t we notice long before deploying the thing?

I’d imagine that this type of thing would be sanity-checked and tested intensively, so signflip-type errors would predominantly be scenarios where the error occurs *after* deployment, like the one Gwern mentioned (“A programmer flips the meaning of a boolean flag in a database somewhere while not updating all downstream callers, and suddenly an online learner is now actively pessimizing their target metric.”)
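
To make the quoted scenario concrete, here is a minimal toy sketch of how a reinterpreted boolean flag can turn a learner that maximises a metric into one that pessimizes it. Everything in it (the record layout, `write_record_v1`, `write_record_v2`, `downstream_reward`, the toy metric) is a hypothetical assumption for illustration, not a description of any real system.

```python
# Toy illustration only: all names and the record layout are hypothetical.

def write_record_v1(score):
    # Original convention: flag=True means "higher score is better".
    return {"score": score, "flag": True}

def write_record_v2(score):
    # After the change, the flag means "lower score is better",
    # so the very same metric is now stored with flag=False.
    return {"score": score, "flag": False}

def downstream_reward(record):
    # Downstream caller that was never updated: it still assumes
    # flag=True means "higher is better" and negates otherwise.
    return record["score"] if record["flag"] else -record["score"]

def target_metric(action):
    # The metric the developers actually care about; it peaks at action 3.
    return -(action - 3) ** 2

def best_action(actions, writer):
    # A stand-in for the learner: pick the action with the highest reward.
    return max(actions, key=lambda a: downstream_reward(writer(target_metric(a))))

actions = [0, 1, 2, 3, 4]
before = best_action(actions, write_record_v1)  # optimises the metric
after = best_action(actions, write_record_v2)   # pessimizes the metric
print(before, after)
```

Under these assumptions the learner moves from the best available action to the worst one, and nothing in the learner's own code changed, which is why this kind of error could slip past checks that were run before deployment.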

Anirandis 19 Aug 2020 1:38 UTC
4 points
in reply to: gwern’s comment on: Open & Welcome Thread—August 2020
Wouldn’t any configuration errors or updates be caught with sanity-checking tools though? Maybe the way I’m visualising this is just too simplistic, but any developers capable of creating an *aligned* AGI are going to be *extremely* careful not to fuck up. Sure, it’s possible, but the most plausible cause of a hyperexistential catastrophe to me seems to be where a SignFlip-type error occurs once the system has been deployed.

Hopefully a system as crucially important as an AGI isn’t going to have just one guy watching it who “takes a quick bathroom break”. When the difference is literally Heaven and Hell (minimising human values), I’d consider only having one guy in a basement monitoring it to be gross negligence.

Anirandis 19 Aug 2020 2:18 UTC
4 points
in reply to: gwern’s comment on: Open & Welcome Thread—August 2020
Sure, but the *specific* type of error I’m imagining would surely be easier to pick up than most other errors. I have no idea what sort of sanity checking was done with GPT-2, but the fact that the developers were asleep when it trained is telling: they weren’t being as careful as they could’ve been.
For this type of bug (a sign error in the utility function) to occur *before* the system is deployed and somehow persist, it’d have to make it past all sanity-checking tools (which I imagine would be used extensively with an AGI) *and* somehow not be noticed at all while the model trains *and* whatever else. Yes, these sorts of conjunctions occur in the real world but the error is generally more subtle than “system does the complete opposite of what it was meant to do”.
I made a question post about this specific type of bug occurring before deployment a while ago and think my views have shifted significantly; it’s unlikely that a bug as obvious as one that flips the sign of the utility function won’t be noticed before deployment. Now I’m more worried about something like this happening *after* the system has been deployed.
I think a more robust solution to all of these sorts of errors would be something like the separation from hyperexistential risk article that I linked in my previous response. I optimistically hope that we’re able to come up with a utility function that doesn’t do anything worse than death when minimised, just in case.
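
As a rough illustration of why a full sign flip should be one of the easier bugs to catch before deployment, here is a minimal sketch of the kind of sanity check I have in mind. The function names and the toy outcomes are assumptions made up for this example; a real check would need reference outcomes that the developers are confident about.

```python
# Hypothetical sketch: names and toy outcomes are assumptions for illustration.

def sign_sanity_check(utility, good_outcome, bad_outcome):
    """Return True if the utility function ranks a known-good outcome
    strictly above a known-bad one."""
    return utility(good_outcome) > utility(bad_outcome)

def intended_utility(outcome):
    # Stand-in for the utility function the developers meant to write.
    return outcome["wellbeing"]

def flipped_utility(outcome):
    # The same function with a sign error.
    return -intended_utility(outcome)

good = {"wellbeing": 1.0}
bad = {"wellbeing": -1.0}

print(sign_sanity_check(intended_utility, good, bad))  # passes
print(sign_sanity_check(flipped_utility, good, bad))   # fails
```

A check like this only says something about the function at the moment it is run, so it does nothing for the case I’m more worried about, where the flip happens after the check has already passed.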

Anirandis 19 Aug 2020 2:53 UTC
4 points
in reply to: habryka’s comment on: Open & Welcome Thread—August 2020
I’m under the impression that an AGI would be monitored *during* training as well. So you’d effectively need the system to turn “evil” (utility function flipped) during the training process, and the system to be smart enough to conceal that the error occurred. So it’d need to happen a fair bit into the training process. I guess that’s possible, but IDK how likely it’d be.

Anirandis 21 Aug 2020 15:20 UTC
3 points
in reply to: mako yass’s comment on: Open & Welcome Thread—August 2020
I asked Rohin Shah about that possibility in a question thread about a month ago. I think he’s probably right that this type of thing would only plausibly make it through the training process if the system’s *already* smart enough to be able to think about this type of thing. And then on top of that there are still things like sanity checks which, while unlikely to pick up numerous errors, would probably notice a sign error. See also this comment:
Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious way that makes it easy for developers to recognize that deployment is a very bad idea.
IMO it’s incredibly important that we find a way to prevent this type of thing from occurring *after* the system has been trained, whether that be hyperexistential separation or something else. I think that a team that’s safety-conscious enough to come up with a (reasonably) aligned AGI design is going to put a considerable amount of effort into fixing bugs & one as obvious as a sign error would be unlikely to make it through. And hopefully, even better, they would have come up with a utility function that can’t be easily reversed by a single bit flip or doesn’t cause outcomes worse than death when minimised. That’d (hopefully?) solve the SignFlip issue *regardless* of what causes it.

Anirandis 22 Aug 2020 18:25 UTC
3 points
in reply to: gwern’s comment on: Open & Welcome Thread—August 2020
Do you think that this specific risk could be mitigated by some variant of Eliezer’s separation from hyperexistential risk or Stuart Armstrong’s idea here:
Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = −1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
Or, more usefully, let X be some trivial feature that the agent can easily set to −1 or 1, and let U be a utility function with values in [0, 1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
Or at least prevent sign flip errors from causing something worse than paperclipping?
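
Here is a small toy sketch of the second construction quoted above (maximising or minimising XU), to check that I’m reading it correctly. The world names and their utility values are invented for illustration; the only things taken from the quote are that U takes values in [0, 1] and that X can be set to −1 or 1.

```python
# Toy sketch of the quoted XU idea; worlds and utility values are made up.

# Each candidate world has a utility U in [0, 1].
WORLDS = {"paperclips": 0.0, "mediocre": 0.4, "best": 1.0}

def choose(sign):
    """Pick a (world, X) pair.

    sign=+1 models an agent that maximises X*U.
    sign=-1 models the sign-flipped agent that minimises X*U.
    """
    options = [(name, x) for name in WORLDS for x in (-1, 1)]
    return max(options, key=lambda option: sign * option[1] * WORLDS[option[0]])

maximiser_choice = choose(+1)
minimiser_choice = choose(-1)
print(maximiser_choice, minimiser_choice)
```

In this toy setup both agents end up in the same world and differ only in the value of X, which is the property the quote describes. It depends on U never going below 0; with a U that can be negative, the flipped agent would aim for the worst world again.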

Anirandis 24 Aug 2020 0:17 UTC
2 points
in reply to: ChristianKl’s comment on: Open & Welcome Thread—August 2020
Sure, but I’d expect that a system as important as this would have people monitoring it 24/7.

Anirandis 29 Aug 2020 20:23 UTC
5 points
on: Anirandis’s Shortform
It seems to me that ensuring we can separate an AI in design space from worse-than-death scenarios is perhaps the most crucial thing in AI alignment. I don’t at all feel comfortable with AI systems that are one cosmic ray, or, perhaps more plausibly, one human screw-up (e.g. this sort of thing), away from a fate far worse than death. Or maybe a human-level AI makes a mistake and creates a sign-flipped successor. Perhaps there’s some sort of black swan possibility that nobody realises. I think that it’s absolutely critical that we have a robust mechanism in place to prevent something like this from happening regardless of the cause; sure, we can sanity-check the system, but that won’t help when the issue is caused after we’ve sanity-checked it, as is the case with cosmic rays or some human errors (like Gwern’s example, which I linked). We need ways to prevent this sort of thing from happening *regardless* of the source.
Some proposals seem promising. One is Eliezer’s suggestion of assigning a sort of “surrogate goal” that the AI hates more than torture, but not enough to devote all of its energy to attempt to prevent. But this would only work when the *entire* reward is what gets flipped; with how much confidence can we rule out, say, a localised sign flip in some specific part of the AI that leads to the system terminally valuing something bad but that doesn’t change anything else (so the sign on the “surrogate” goal remains negative)? Can we even be confident that the AI’s development team is going to implement something like this, and that it will work as intended?
An FAI that’s one software bug or screw-up in a database away from AM is a far scarier possibility than a paperclipper, IMO.

Anirandis 30 Aug 2020 16:05 UTC
6 points
in reply to: Dach’s comment on: Open & Welcome Thread—August 2020
I don’t really know what the probability is. It seems somewhat low, but I’m not confident that it’s *that* low. I wrote a shortform about it last night (tl;dr it seems like this type of error could occur in a disjunction of ways and we need a good way of separating the AI in design space.)

I think I’d stop worrying about it if I were convinced that its probability is extremely low. But I’m not yet convinced of that. Something like the example Gwern provided elsewhere in this thread seems more worrying than the more frequently discussed cosmic ray scenarios to me.
