Anirandis

Karma: 131

Anirandis 20 Mar 2022 0:03 UTC
10 points
on: Anirandis’s Shortform
Since it’s my shortform, I’d quite like to just vent about some stuff.
I’m still pretty scared about a transhumanist future going quite wrong. It simply seems to me that there’s quite the conjunction of paths to “s-risk” scenarios: generally speaking, any future agent that wants to cause disvalue to us—or an empathetic agent—would bring about an outcome that’s Pretty Bad by my lights. Like, it *really* doesn’t seem impossible that some AI decides to pre-commit to doing Bad if we don’t co-operate with it; or our AI ends up in some horrifying conflict-type scenario, which could lead to Bad outcomes as hinted at here; etc. etc.
Naturally, this kind of outcome is going to be salient because it’s scary—but even then, I struggle to believe that I’m more than moderately biased. The distribution of possibilities seems somewhat trimodal: either we maintain control and create a net-positive world (hopefully we’d be able to deal with the issue of people abusing uploads of each other); we all turn to dust; or something grim happens. And the fact that some very credible people (within this community at least) also conclude that this kind of thing has reasonable probability further makes me conclude that I just need to somehow deal with these scenarios being plausible, rather than trying to convince myself that they’re unlikely. But I remain deeply uncomfortable trying to do that.
Some commentators who seem to consider such scenarios plausible, such as Paul Christiano, also subscribe to the naive view regarding energy-efficiency arguments over pleasure and suffering: that the worst possible suffering is likely no worse than the greatest possible pleasure is good. And that this may also be the case for humans. Even if this is the case, and I’m sceptical, I still feel that I’m too risk-averse. In that world I wouldn’t accept a 90% chance of eternal bliss with a 10% chance of eternal suffering. I don’t think I hold suffering-focused views; I think there’s a level of happiness that can “outweigh” even extreme suffering. But when you translate it to probabilities, I become deeply uncomfortable with even a 0.01% chance of bad stuff happening to me. Particularly when the only way to avoid this gamble is to permanently stop existing. Perhaps something an OOM or two lower and I’d be more comfortable.

I’m not immediately suicidal, to be clear. I wouldn’t classify myself as ‘at-risk’. But I nonetheless find it incredibly hard to find solace. There’s a part of me that hopes things get nuclear, just so that a worse outcome is averted. I find it incredibly hard to care about other aspects of my life; I’m totally apathetic. I started to improve and got mid-way through the first year of my computer science degree, but I’m starting to feel like it’s gotten worse. I’d quite like to finish my degree and actually meaningfully contribute to the EA movement, but I don’t know if I can at this stage. I’m guessing it’s a result of me becoming more pessimistic about the worst outcomes resulting in my personal torture, since that’s the only real change that’s occurred recently. Even before I became more pessimistic I still thought about these outcomes constantly, so I don’t think just a case of me thinking about them more.
I take sertraline but it’s beyond useless. Alcohol helps, so at least there’s that. I’ve tried quitting thinking about this kind of thing—I’ve spent weeks trying to shut down any instance where I thought about it. I failed.
I don’t want to hear any over-optimistic perspectives on these issues. I’d greatly appreciate any genuine, sincerely held opinions on them (good or bad), or advice on dealing with the anxiety. But I don’t necessarily need or expect a reply; I just wanted to get this out there. Even if nobody reads it. Also, thanks a fuckton to everyone who was willing to speak to me privately about this stuff.
Sorry if this type of post isn’t allowed here, I just wanted to articulate some stuff for my own sake somewhere that I’m not going to be branded a lunatic. Hopefully LW/singularitarian views are wrong, but some of these scenarios aren’t hugely dependent on an imminent & immediate singularity. I’m glad I’ve written all of this down. I’m probably going to down a bottle or two of rum and try to forget about it all now.

Anirandis 6 Aug 2020 14:18 UTC
10 points
on: Open & Welcome Thread—August 2020
Is it plausible that an AGI could have some sort of exploit (buffer overflow maybe?) that could be exploited (maybe by an optimization daemon…?) and cause a sign flip in the utility function?
How about an error during self-improvement that leads to the same sort of outcome? Should we expect an AGI to sanity-check its successors, even if it’s only at or below human intelligence?
Sorry for the dumb questions, I’m just still nervous about this sort of thing.

Anirandis 1 Apr 2023 18:33 UTC
6 points
5
in reply to: Droopyhammock’s comment on: AI: Practical Advice for the Worried
I’m a little confused by the agreement votes with this comment—it seems to me that the consensus around here is that s-risks in which currently-existing humans suffer maximally are very unlikely to occur. This seems an important practical question; could the people who agreement-upvoted elaborate on why they find this kind of thing plausible?
The examples discussed in e.g. the Kaj Sotala interview linked later down the chain tend to regard things like “suffering subroutines”, for example.

Anirandis 9 Jun 2022 0:21 UTC
6 points
on: There’s probably a tradeoff between AI capability and safety, and we should act like it
Related: alignment tax

Anirandis 12 Feb 2022 23:07 UTC
6 points
on: February Open Thread
I’m not sure if this is the right place to ask this, but does anyone know what point Paul’s trying to make in the following part of this podcast? (Relevant section starts around 1:44:00)
Suppose you have a P probability of the best thing you can do and a one-minus P probably the worst thing you can do, what does P have to be so it’s the difference between that and the barren universe. I think most of my probability is distributed between you would need somewhere between 50% and 99% chance of good things and then put some probability or some credence on views where that number is a quadrillion times larger or something in which case it’s definitely going to dominate. A quadrillion is probably too big a number, but very big numbers. Numbers easily large enough to swamp the actual probabilities involved
[ . . . ]
I think that those arguments are a little bit complicated, how do you get at these? I think to clarify the basic position, the reason that you end up concluding it’s worse is just like conceal your intuition about how bad the worst thing that can happen to a person is vs the best thing or damn, the worst thing seems pretty bad and then the like first-pass responses, sort of have this debunking understanding, or we understand causally how it is that we ended up with this kind of preference with respect to really bad stuff versus really good stuff.
If you look at what happens over evolutionary history. What is the range of things that can happen to an organism and how should an organism be trading off like best possible versus worst possible outcomes. Then you end up into well, to what extent is that a debunking explanation that explains why humans in terms of their capacity to experience joy and suffering are unbiased but the reality is still biased versus to what extent is this then fundamentally reflected in our preferences about good and bad things. I think it’s just a really hard set of questions. I could easily imagine maybe shifting on them with much more deliberation.
It seems like an important topic but I’m a bit confused by what he’s saying here. Is the perspective he’s discussing (and puts non-negligible probability on) one that states that the worst possible suffering is a bajillion times worse than the best possible pleasure, and wouldn’t that suggest every human’s life is net-negative (even if your credence on this being the case is ~.1%)? Or is this just discussing the energy-efficiency of ‘hedonium’ and ‘dolorium’, in which case it’s of solely altruistic concern & can be dealt with by strictly limiting compute?
Also, I’m not really sure if this set of views is more “a broken bone/waterboarding is a million times as morally pressing as making a happy person”, or along the more empirical lines of “most suffering (e.g. waterboarding) is extremely light, humans can experience far far far far far^99 times worse; and pleasure doesn’t scale to the same degree.” Even a tiny chance of the second one being true is awful to contemplate.
What links here?
- Anirandis's comment on Open Thread: Spring 2022 by Aaron Gertler (EA Forum; 24 Feb 2022 23:55 UTC; 1 point)

Anirandis 30 Aug 2020 16:05 UTC
6 points
in reply to: Dach’s comment on: Open & Welcome Thread—August 2020
I don’t really know what the probability is. It seems somewhat low, but I’m not confident that it’s *that* low. I wrote a shortform about it last night (tl;dr it seems like this type of error could occur in a disjunction of ways and we need a good way of separating the AI in design space.)

I think I’d stop worrying about it if I were convinced that its probability is extremely low. But I’m not yet convinced of that. Something like the example Gwern provided elsewhere in this thread seems more worrying than the more frequently discussed cosmic ray scenarios to me.

Anirandis 5 Feb 2022 14:53 UTC
5 points
on: Paperclippers, s-risks, hope
I’m way more scared about the electrode-produced smiley faces for eternity and the rest. That’s way, way worse than dying.
FWIW, it seems kinda weird to me that such an AI would keep you alive… if you had a “smile-maximiser” AI, wouldn’t it be indifferent to humans being braindead, as long as it’s able to keep them smiling?
I’d like to have Paul Christiano’s view that the “s-risk-risk” is ¹⁄₁₀₀ and that AGI is 30 years off
I think Paul’s view is along the lines of “1% chance of some non-insignificant amount of suffering being intentionally created”, not a 1% chance of this type of scenario.^[1]
Could AGI arrive tomorrow in its present state?
I guess. But we’d need to come up with some AI model tomorrow, and this model suddenly becomes agentive and rapidly grows in power, and this model is designed with a utility function that values keeping humans alive but does not value humans flourishing… and even then, there’d likely be better ways to e.g. maximise the number of smiles in the universe, by using artificially created minds.
Eliezer has written a bit about this, but I think he considers it a mostly solved problem.
What can I do as a 30 year old from Portugal with no STEM knowledge? Start learning math and work on alignment from home?
Probably get treatment for the anxiety and try to stop thinking about scenarios that are very unlikely, albeit salient in your mind. (I know, speaking from experience, that it’s hard to do so!)
1. ^
  I did, coincidentally, cold e-mail Paul a while ago to try to get his model on this type of stuff & got the following response:
  “I think these scenarios are plausible but not particularly likely. I don’t think that cryonics makes a huge difference to your personal probabilities, but I could imagine it increasing them a tiny bit. If you cared about suffering-maximizing outcomes a thousand times as much as extinction, then I think it would be plausible for considerations along these lines to tip the balance against cryonics (and if you cared a million times more I would expect them to dominate). I think these risks are larger if you are less scope sensitive since the main protection is the small expected fraction of resources controlled by actors who are inclined to make such threats.”
  TBH it’s difficult to infer a particular probability estimate for one’s individual probability without cryonics or voluntary uploading here; it’s not completely clear just how bad a scenario would have to be (for a typical biological human) in order to fall within the class of scenarios described as ‘plausible but not particularly likely’.

Anirandis 8 Aug 2021 3:41 UTC
5 points
on: Anirandis’s Shortform
Lurker here; I’m still very distressed after thinking about some futurism/AI stuff & worrying about possibilities of being tortured. If anyone’s willing to have a discussion on this stuff, please PM!

Anirandis 29 Aug 2020 20:23 UTC
5 points
on: Anirandis’s Shortform
It seems to me that ensuring we can separate an AI in design space from worse-than-death scenarios is perhaps the most crucial thing in AI alignment. I don’t at all feel comfortable with AI systems that are one cosmic ray: or, perhaps more plausibly, one human screw-up (e.g. this sort of thing) away from a fate far worse than death. Or maybe a human-level AI makes a mistake and creates a sign flipped successor. Perhaps there’s some sort of black swan possibility that nobody realises. I think that it’s absolutely critical that we have a robust mechanism in place to prevent something like this from happening regardless of the cause; sure, we can sanity-check the system, but that won’t help when the issue is caused after we’ve sanity-checked it, as is the case with cosmic rays or some human errors (like Gwern’s example, which I linked). We need ways to prevent this sort of thing from happening *regardless* of the source.
Some propositions seem promising. Eliezer’s suggestion of assigning a sort of “surrogate goal” that the AI hates more than torture, but not enough to devote all of its energy to attempt to prevent, seems promising. But this would only work when the *entire* reward is what gets flipped; with how much confidence can we rule out, say, a localised sign flip in some specific part of the AI that leads to the system terminally valuing something bad but that doesn’t change anything else (so the sign on the “surrogate” goal remains negative). Can we even be confident that the AI’s development team is going to implement something like this, and that it will work as intended?
An FAI that’s one software bug or screw-up in a database away from AM is a far scarier possibility than a paperclipper, IMO.
What links here?
- Anirandis's comment on Open & Welcome Thread—August 2020 by habryka (1 Sep 2020 14:21 UTC; 2 points)
- Anirandis's comment on Anirandis’s Shortform by Anirandis (9 Sep 2020 2:53 UTC; 2 points)

Anirandis 17 Aug 2023 13:12 UTC
4 points
1
on: My views on “doom”
What does the distribution of these non-death dystopias look like? There’s an enormous difference between 1984 and maximally efficient torture; for example, do you have a rough guess of what the probability distribution looks like if you condition on an irreversibly messed up but non-death future?

Anirandis 30 Mar 2022 21:04 UTC
4 points
on: Meta wants to use AI to write Wikipedia articles; I am Nervous™
Presumably it’d take less manpower to review each article that the AI’s written (i.e. read the citations & make sure the article accurately describes the subjects) than it would to write articles from scratch. I’d guess this is the case even if the claims seem plausible & fact-checking requires a somewhat detailed reading through of the sources.

Anirandis 20 Sep 2020 0:36 UTC
4 points
on: Open & Welcome Thread—September 2020
If anyone happens to be willing to privately discuss some potentially infohazardous stuff that’s been on my mind (and not in a good way) involving acausal trade, I’d appreciate it—PM me. It’d be nice if I can figure out whether I’m going batshit.

Anirandis 11 Sep 2020 16:02 UTC
4 points
in reply to: Zack_M_Davis’s comment on: How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?
My anxieties over this stuff tend not to be so bad late at night, TBH.

Anirandis 10 Sep 2020 13:36 UTC
4 points
in reply to: Steven Byrnes’s comment on: How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?
I see. I’m somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.

I don’t think that’s an example of (3), more like (1) or (2), or actually “none of the above because GPT-2 doesn’t have this kind of architecture”.
I just raised GPT-2 to indicate that flipping the goal sign suddenly can lead to optimising for bad behavior without the AI neglecting to consider new strategies. Presumably that’d suggest it’s also a possibility with cosmic ray/other errors.

Anirandis 3 Sep 2020 0:01 UTC
4 points
in reply to: Dach’s comment on: Open & Welcome Thread—August 2020
As an almost entirely inapplicable analogy . . . it’s just doing something weird.
If we inverted the utility function . . . tiling the universe with smiley faces, i.e. paperclipping.
Interesting analogy. I can see what you’re saying, and I guess it depends on what specifically gets flipped. I’m unsure about the second example; something like exploring new strategies doesn’t seem like something an AGI would terminally value. It’s instrumental to optimising the reward function/model, but I can’t see it getting flipped *with* the reward function/model.
Can you clarify what you mean by this? Also, I get what you’re going for, but paperclips is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.
My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren’t any humans), whereas the latter may produce a negligible amount. I’m not really sure if it makes sense tbh.
The reward modelling system would need to be very carefully engineered, definitely.
Even if we engineered it carefully, that doesn’t rule out screw-ups. We need robust failsafe measures *just in case*, imo.
I thought of this as well when I read the post. I’m sure there’s something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.
I wonder if you could feasibly make it a part of the reward model. Perhaps you could train the reward model itself to disvalue something arbitrary (like paperclips) even more than torture, which would hopefully mitigate it. You’d still need to balance it in a way such that the system won’t spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn’t seem too difficult. Although, once again, we can’t really have high confidence (>90%) that the AGI developers are going to think to implement something like this.
There was also an interesting idea I found in a Facebook post about this type of thing that got linked somewhere (can’t remember where). Stuart Armstrong suggested that a utility function could be designed as such:
Let B1 and B2 be excellent, bestest outcomes. Define U(B1)=1, U(B2)=-1, and U=0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes. Or, more usefully, let X be some trivial feature that the agent can easily set to −1 or 1, and let U be a utility function with values in [0,1]. Have the AI maximisise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
Even if we solve any issues with these (and actually bother to implement them), there’s still the risk of an error like this happening in a localised part of the reward function such that *only* the part specifying something bad gets flipped, although I’m a little confused about this one. It could very well be the case that the system’s complex enough that there isn’t just one bit indicating whether “pain” or “suffering” is good or bad. And we’d presumably (hopefully) have checksums and whatever else thrown in. Maybe this could be mitigated by assigning more positive utility to good outcomes than negative utility to bad outcomes? (I’m probably speaking out of my rear end on this one.)

Memory corruption seems to be another issue. Perhaps if we have more than one measure we’d be less vulnerable to memory corruption. Like, if we designed an AGI with a reward model that disvalues two arbitrary things rather than just one, and memory corruption screwed with *both* measures, then something probably just went *very* wrong in the AGI and it probably won’t be able to optimise for suffering anyway.

Anirandis 2 Sep 2020 15:53 UTC
4 points
in reply to: Dach’s comment on: Open & Welcome Thread—August 2020
Thanks for the detailed response. A bit of nitpicking (from someone who doesn’t really know what they’re talking about):
However, the vast majority of these mistakes would probably buff out or result in paper-clipping.
I’m slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be *no* human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at “I have no mouth, and I must scream”. So any sign-flipping error would be expected to land there.
If humans are making changes to the critical software/hardware of an AGI (And we’ll assume you figured out how to let the AGI allow you to do this in a way that has no negative side effects), *while that AGI is already running*, something bizarre and beyond my abilities of prediction is already happening.
In the example, the AGI was using online machine learning, which, as I understand it, would probably require the system to be hooked up to a database that humans have access to in order for it to learn properly. And I’m unsure as to how easy it’d be for things like checksums to pick up an issue like this (a boolean flag getting flipped) in a database.
Perhaps there’ll be a reward function/model intentionally designed to disvalue some arbitrary “surrogate” thing in an attempt to separate it from hyperexistential risk. So “pessimizing the target metric” would look more like paperclipping than torture. But I’m unsure as to (1) whether the AGI’s developers would actually bother to implement it, and (2) whether it’d actually work in this sort of scenario.
Also worth noting is that an AGI based on reward modelling is going to have to be linked to another neural network, which is going to have constant input from humans. If that reward model isn’t designed to be separated in design space from AM, someone could screw up with the model somehow. If we were to, say, have U = V + W (where V is the reward given by the reward model and W is some arbitrary thing that the AGI disvalues, as is the case in Eliezer’s Arbital post that I linked,) a sign flip-type error in V (rather than a sign flip in U) would lead to a hyperexistential catastrophe.
It will not be possible to flip the sign of the utility function or the direction of the updates to the reward model, even if several of the researchers on the project are actively trying to sabotage the effort and cause a hyperexistential disaster.
I think this is somewhat likely to be the case, but I’m not sure that I’m confident enough about it. Flipping the direction of updates to the reward model seems harder to prevent than a bit flip in a utility function, which could be prevent through error-correcting code memory (as you mentioned earlier.)

Despite my confusions, your response has definitely decreased my credence in this sort of thing from happening.

Anirandis 19 Aug 2020 2:53 UTC
4 points
in reply to: habryka’s comment on: Open & Welcome Thread—August 2020
I’m under the impression that an AGI would be monitored *during* training as well. So you’d effectively need the system to turn “evil” (utility function flipped) during the training process, and the system to be smart enough to conceal that the error occurred. So it’d need to happen a fair bit into the training process. I guess that’s possible, but IDK how likely it’d be.

Anirandis 19 Aug 2020 2:18 UTC
4 points
in reply to: gwern’s comment on: Open & Welcome Thread—August 2020
Sure, but the *specific* type of error I’m imagining would surely be easier to pick up than most other errors. I have no idea what sort of sanity checking was done with GPT-2, but the fact that the developers were asleep when it trained is telling: they weren’t being as careful as they could’ve been.
For this type of bug (a sign error in the utility function) to occur *before* the system is deployed and somehow persist, it’d have to make it past all sanity-checking tools (which I imagine would be used extensively with an AGI) *and* somehow not be noticed at all while the model trains *and* whatever else. Yes, these sort of conjunctions occur in the real world but the error is generally more subtle than “system does the complete opposite of what it was meant to do”.
I made a question post about this specific type of bug occurring before deployment a while ago and think my views have shifted significantly; it’s unlikely that a bug as obvious as one that flips the sign of the utility function won’t be noticed before deployment. Now I’m more worried about something like this happening *after* the system has been deployed.
I think a more robust solution to all of these sort of errors would be something like the separation from hyperexistential risk article that I linked in my previous response. I optimistically hope that we’re able to come up with a utility function that doesn’t do anything worse than death when minimised, just in case.

Anirandis 19 Aug 2020 1:38 UTC
4 points
in reply to: gwern’s comment on: Open & Welcome Thread—August 2020
Wouldn’t any configuration errors or updates be caught with sanity-checking tools though? Maybe the way I’m visualising this is just too simplistic, but any developers capable of creating an *aligned* AGI are going to be *extremely* careful not to fuck up. Sure, it’s possible, but the most plausible cause of a hyperexistential catastrophe to me seems to be where a SignFlip-type error occurs once the system has been deployed.

Hopefully a system as crucially important as an AGI isn’t going to have just one guy watching it who “takes a quick bathroom break”. When the difference is literally Heaven and Hell (minimising human values), I’d consider only having one guy in a basement monitoring it to be gross negligence.

Anirandis 15 Aug 2020 22:52 UTC
4 points
in reply to: gwern’s comment on: Open & Welcome Thread—August 2020
Do you think that this type of thing could plausibly occur *after* training and deployment?