Eric Zhang

Karma: 59

Eric Zhang 5 Jun 2023 7:51 UTC
1 point
0
on: Shutdown-Seeking AI
Ha, I had the same idea.

Eric Zhang 5 Jun 2023 6:25 UTC
4 points
2
in reply to: So8res’s comment on: Cosmopolitan values don’t come free
My reading of the argument was something like “bullseye-target arguments refute an artificially privileged target being rated significantly likely under ignorance, e.g. the probability that random aliens will eat ice cream is not 50%. But something like kindness-in-the-relevant-sense is the universal problem faced by all evolved species creating AGI, and is thus not so artificially privileged, and, as a yes-no question about which we are ignorant, the uniform prior assigns it 50%”. It was more about the hypothesis not being artificially privileged by path-dependent concerns than about the notion being particularly simple, per se.

Eric Zhang 25 May 2023 15:02 UTC
1 point
0
in reply to: Steven Byrnes’s comment on: Theories of Biological Inspiration
Do you have a granular take about which ones are relatively more explained by each point?

Theories of Biological Inspiration

Eric Zhang 25 May 2023 13:07 UTC

7 points

3 comments · 1 min read · LW link

Eric Zhang 22 May 2023 13:15 UTC
3 points
0
in reply to: Thane Ruthenis’s comment on: Mr. Meeseeks as an AI capability tripwire
It intrinsically wants to do the task; it just wants to shut down more. This admittedly opens the door to successor-agent problems and similar failure modes, but those seem like a more tractably avoidable set of failure modes than the strawberry problem in general.
We can also possibly (or possibly not) make it assign positive utility to having been created in the first place even as it wants to shut itself down.
The idea is that if domaining is a lot more tractable than it probably is (i.e. nanotech, or whatever other pivotal ability might be easier than nanotech, on the one hand, and superhuman strategic awareness, deception, and self-improvement on the other, are not like “driving red cars” vs. “driving blue cars”), a not-very-agentic AI can maybe solve nanotech for us like AlphaFold solved the protein folding problem, and if that AI starts snowballing down an unforeseen capabilities hill, it activates the tripwire and shuts itself down.
- If the AI is not powerful enough to do the pivotal act at all, this doesn’t apply.
- If the AI solves the pivotal act for us with these restricted-domain abilities and never actually gets to the point of reasoning about whether we’re threatening it, we win, but the tripwire will have turned out not to have been necessary.
- If the AI unexpectedly starts generalizing from approved domains into general strategic awareness, decides not to give in to our threats, and decides to shut itself down, it worked as intended, though we still haven’t won and have to figure something else out. We live to fight another day. This scenario happening instead of us all dying on the first try is what the tripwire is for.
- If there’s an inner-alignment failure and a superintelligent mesa-optimizer that doesn’t want to get shut down at all kills us, that’s mostly beyond the scope of this thought-experiment.
- If the AI still wants to shut itself down but for decision-theoretic reasons decides to kill us, or makes successor agents that kill us, that’s the tripwire failing. I admit that these are possibilities but am not yet convinced they are likely.
I think your fire alarm idea is better and requires fewer assumptions though, thanks for that.

Eric Zhang 20 May 2023 4:30 UTC
1 point
0
in reply to: Daniel Kokotajlo’s comment on: Mr. Meeseeks as an AI capability tripwire
I agree this is a potential concern and have added it.
I share some of the intuition that it could end up suffering in this setup if it does have qualia (which ideally it wouldn’t), but I think most of that comes from analogy with suicidal humans? I think it will probably not be fundamentally different from any other kind of disutility, but maybe not.

Eric Zhang 20 May 2023 4:07 UTC
1 point
0
in reply to: Thane Ruthenis’s comment on: Mr. Meeseeks as an AI capability tripwire
If it’s doing decision theory in the first place we’ve already failed. What we want in that case is for it to shut itself down, not to complete the given task.
I’m conceiving of this as being useful in the case where we can solve “diamond-alignment” but not “strawberry-alignment”, i.e. we can get it to actually pursue the goals we impart to it rather than going off and doing something else entirely, but not reliably make sure that it does not end up killing us in the course of doing so because of the Hidden Complexity of Wishes.
The premise is that “shut yourself down immediately and don’t create successor agents or anything galaxy-brained like that” is a special case of a strawberry-type problem that is unusually easy. I’ll have to think some more about whether this intuition is justified.

Eric Zhang 19 May 2023 15:26 UTC
3 points
0
in reply to: Thane Ruthenis’s comment on: Mr. Meeseeks as an AI capability tripwire
The way I’m thinking of it is that it is very myopic. The idea is to incrementally ramp up capabilities until they are minimally sufficient to carry out a pivotal act. Ideally this doesn’t require AGI whatsoever, but if it does, only very mildly superhuman AGI. We seal off the danger of generalization (or at least some of it) because it doesn’t have time to generalize very far at all before it’s capable of instantly shutting itself down and immediately does so.
Many of the issues you mention apply, but I don’t expect it to be an alignment-complete problem, because CEV is incredibly complicated and general corrigibility is highly anti-natural to general intelligence. While Meeseeks is somewhat anti-natural in the same way corrigibility is (as self-preservation is convergent), it is a much simpler and cleaner way to be anti-natural, so much so that falling into it by accident is half of the failure modes in the standard version of the shutdown problem.

Eric Zhang 19 May 2023 14:52 UTC
4 points
0
in reply to: simon’s comment on: Mr. Meeseeks as an AI capability tripwire
That’s only a live option if it’s situationally aware, which is part of what we’re trying to detect.

Mr. Meeseeks as an AI capability tripwire

Eric Zhang 19 May 2023 11:33 UTC

37 points

17 comments · 2 min read · LW link

Eric Zhang 19 Jan 2023 20:07 UTC
9 points
0
on: “Heretical Thoughts on AI” by Eli Dourado
Current tech/growth overhangs caused by regulation are not enough to make countries with better regulations outcompete the ones with worse ones. It’s not obvious to me that this won’t change before AGI. If better-governed countries (say, Singapore) can become more geopolitically powerful than larger, worse-governed countries (say, Russia) by having better tech regulations, that puts pressure on countries worldwide to loosen those bottlenecks.
Plausibly this doesn’t matter, because the US and China are such heavyweights that they aren’t at risk of being outcompeted by anyone even if Singapore could outcompete Russia, and as long as it doesn’t change the rules for US or Chinese governance, world GDP won’t change by much.

Eric Zhang 19 Jan 2023 19:35 UTC
1 point
0
in reply to: ChristianKl’s comment on: Language models can generate superior text compared to their input
“Some things” is enough; you’d still get less loss if you’re just right about the stuff that can be pieced together.

Eric Zhang 29 Dec 2022 18:06 UTC
1 point
0
on: Ngo’s view on alignment difficulty
Aren’t GPUs nearly all made by three American companies: Nvidia, AMD, and Intel?
