AI is not a matter of historical evidence. The correct models here are invasive species, new pathogens, and the Great Oxidation Event. 99% of species have ended up extinct.
They are made of repeating patterns
[Question] What specific thing would you do with AI Alignment Research Assistant GPT?
I think the difference between what you are describing and what is meant here is captured in this comment:
There’s a phenomenon where a gambler places their money on 32, and then the roulette wheel comes up 23, and they say “I’m such a fool; I should have bet 23”.
More useful would be to say “I’m such a fool; I should have noticed that the EV of this gamble is negative.” Now at least you aren’t asking for magic lottery powers.
Even more useful would be to say “I’m such a fool; I had three chances to notice that this bet was bad: when my partner was trying to explain EV to me; when I snuck out of the house and ignored a sense of guilt; and when I suppressed a qualm right before placing the bet. I should have paid attention in at least one of those cases and internalized the arguments about negative EV, before gambling my money.” Now at least you aren’t asking for magic cognitive powers.
The problem is not that you can “just meditate and come to good conclusions”; the problem is that “technical knowledge about actual machine learning results” doesn’t seem like a good path either.
Like, we can get from a NN trained to do modular addition the fact that it performs a Fourier transform, because we clearly know what a Fourier transform is; but I don’t see any clear path to getting from a neural network the fact that its output is both useful and safe, because we don’t have any practical operationalization of what “useful and safe” is. If we had a solution to the MIRI problem “which program, run on an infinitely large computer, produces an aligned outcome”, we could try to understand how good the NN is at approximating this program, using the aforementioned technical knowledge, and have substantial hope, for example.
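(As a toy illustration of the first kind of legibility, a minimal sketch: the modulus, the key frequencies, and the hand-built embedding below all stand in for a trained modular-addition network, they are not taken from any real model. The point is just that “performs a Fourier transform” is checkable because we know what sparsity in the Fourier basis looks like.)

```python
import numpy as np

p = 113            # modulus of the toy modular-addition task (assumed)
ks = [17, 25, 32]  # hypothetical "key frequencies" a trained net might use

# Hand-built embedding with Fourier structure, standing in for the learned
# embedding matrix of a modular-addition network: one cos/sin pair per frequency.
a = np.arange(p)
W_E = np.concatenate(
    [np.stack([np.cos(2 * np.pi * k * a / p),
               np.sin(2 * np.pi * k * a / p)], axis=1) for k in ks],
    axis=1,
)

# "Performs a Fourier transform" is checkable: the DFT of the embedding over
# the token axis is sparse, with almost all energy at the key frequencies.
spectrum = np.abs(np.fft.fft(W_E, axis=0))
energy_per_freq = spectrum[: p // 2].sum(axis=1)
print(sorted(np.argsort(energy_per_freq)[-3:]))  # recovers [17, 25, 32]
```

There is no analogous check for “useful and safe”, because we have no basis in which “useful and safe” is a crisp, measurable property.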
Where Does Adversarial Pressure Come From?
Training a superintelligence is secretly adversarial
It’s striking how well these black box alignment methods work
I should note that human alignment methods work only because no human in history could suddenly start designing nanotech in their head or treating other humans as buggy, manipulable machinery. I think there are plenty of humans around who would want to become mind-controlling dictators given the possibility, or who are generally nice but would give in to temptation.
“A source close to Altman” means “Altman”, and I’m pretty sure that he is not a very trustworthy party at the moment.
I think that you are missing what unique condition of the recent past has changed with deepfakes. In medieval times there were two levels of evidence: you could see something with your own eyes, or someone could tell you about it. A book saying there are people with brown skin in Africa was no different in evidential power from a book saying there are people with dog heads; you were left with your priors. When photo and video were invented, we got a new level of evidence, not as good as seeing something personally, but better than anything else. With deepfakes we are returning to the medieval situation, where “I’ve seen a source about event X” differs very little from “someone told me about event X”.
This is a meta-point, but I find it weird that you ask what “caring about something” is according to CS, but don’t ask what “corrigibility” is, despite the existence of multiple examples of goal-oriented systems and some relatively good formalisms (we disagree about whether expected utility maximization is a good model of real goal-oriented systems, but we all agree that if we met an expected utility maximizer we would find its behavior pretty much goal-oriented), while corrigibility is purely a product of the imagination of one particular Eliezer Yudkowsky, born of an attempt to imagine a system that doesn’t care about us but still behaves nicely under some vaguely restricted definition of niceness. We don’t have any examples of corrigible systems in nature, and we have a constant record of failed attempts to formalize even relatively simple instances of corrigibility, like shutdownability. I think the likely answer to “why should I expect corrigibility to be unlikely” sounds like “there is no simple description of corrigibility to which our learning systems can easily generalize, and there is no reason to expect a simple description to exist”.
The difficulty with preference fulfillment is that we need to target the AI’s altruistic utility function exactly at humans, not, for example, at all sentient/agentic beings. A superintelligent entity could use some acausal decision theory, discover some Tegmark-like theory of the multiverse, decide that there are far more paperclip maximizers than human-adjacent entities, and fulfill their preferences, not ours.
I feel like the whole “subagent” framework suffers from the homunculus problem: we fail to explain behavior using the abstraction of a coherent agent, so we move to the abstraction of multiple coherent agents, and while that can be useful, I don’t think it reflects the actual mechanistic truth about minds.
When I plan something and then fail to execute the plan, it’s mostly not a “failure to bargain”. It’s just that when I plan something, I usually have the good consequences of the plan in my imagination, and these consequences make me excited; then I start executing the plan and get hit by multiple unpleasant details of reality. A coherent structure emerges from multiple not-really-agentic pieces.
It’s literally point −2 in the List of Lethalities that we don’t need a “perfect” alignment solution; we just don’t have any.
Eliezer has clear beliefs about interpretability and bets on it: https://manifold.markets/EliezerYudkowsky/by-the-end-of-2026-will-we-have-tra
You can just have a model with the capabilities of the smartest human hacker that exfiltrates itself, hacks the 1-10% of the world’s computing power with the shittiest protection, distributes itself Rosetta@home-style, and bootstraps whatever takeover plan using sheer brute force. That said, I see no reason for capabilities to land exactly at the point “smartest human hacker”, because there is nothing special about this point; it can be 2x, 5x, 10x, without any necessity of becoming 1000000x within a second.
I’m pretty optimistic about our white box alignment methods generalizing fine.
And I still don’t get why! I would like to see your theory of generalization in DL that allows such a level of optimism; “gradient descent is powerful” simply doesn’t capture that.
I think the problem here is distinguishing between terminal and instrumental goals? Most people probably don’t run an apple pie business because they have terminal goals about the apple pie business. They probably want money and status, want to be useful, and want to provide for their families, and I expect these goals to be very persistent and self-preserving.
Scattered thoughts:
I think that the observed behavior is fairly consistent with non-linear functions that have sort-of-linear parts. Let’s take ReLU. If you subtract a large enough number, it doesn’t matter if you subtract more, because you will always get zero; but before that point you will observe a sort-of-linear change of behavior.
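A minimal numerical sketch of that point (just numpy, not tied to any particular model): the output falls roughly linearly as you subtract more, then saturates at zero.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = 3.0  # some fixed pre-activation value (made up for illustration)
for c in [0, 1, 2, 3, 4, 5, 10]:
    # sort-of-linear change while x - c > 0, then flat zero afterwards
    print(c, relu(x - c))
```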
Speculative part: neural networks learn linear representations and conditions for switching between them, which are expressed in the non-linear part of the internal mechanisms. If you add too large a number to some component, the model hits a region of state space that doesn’t have a linear representation and crumbles.
Predictions:
Algebraic value editing works (for at least one “X vector”) in LMs: 95%
Algebraic value editing works better for larger models, all else equal: 55%
If value edits work well, they are also composable: 60%
If value edits work at all, they are hard to make without substantially degrading capabilities: 25%
We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
“truth-telling” 25%
“love” 70%
“accepting death” 70%
“speaking French” 95%
The main obstacle, according to my model: if the model works by switching between different linear representations, it is possible that a “niceness” vector exists only for some specific layer of the model which decides whether the completion will be nice or not, so you can’t take a random layer in the middle and compute a “niceness” vector for it.
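To make that concrete, here is a minimal toy sketch of per-layer activation addition in the spirit of algebraic value editing; the random network, the layer index, and the “nice”/“rude” inputs are all invented for illustration, not taken from the post. The hypothesis above is that only some particular `layer` would actually yield a usable vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a network: a stack of random linear layers with ReLU.
weights = [rng.standard_normal((16, 16)) / 4 for _ in range(4)]

def forward(x, add_at_layer=None, vector=None):
    """Run the toy network, optionally adding a steering vector at one layer."""
    h = x
    for i, w in enumerate(weights):
        h = np.maximum(0.0, h @ w)
        if i == add_at_layer and vector is not None:
            h = h + vector
    return h

def activations_at(xs, layer):
    """Collect hidden states at a given layer for a batch of inputs."""
    acts = []
    for x in xs:
        h = x
        for i, w in enumerate(weights):
            h = np.maximum(0.0, h @ w)
            if i == layer:
                acts.append(h)
                break
    return np.stack(acts)

# Hypothetical "nice" and "rude" inputs (random here; in a real LM these would
# be activations from contrasting prompts).
nice_inputs = rng.standard_normal((32, 16))
rude_inputs = rng.standard_normal((32, 16))

layer = 2  # per the hypothesis, maybe only one specific layer works at all
niceness_vector = (activations_at(nice_inputs, layer).mean(0)
                   - activations_at(rude_inputs, layer).mean(0))

# Steer a fresh input by adding the (scaled) vector at that layer.
steered = forward(rng.standard_normal(16), add_at_layer=layer,
                  vector=2.0 * niceness_vector)
print(steered[:4])
```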
I’ve noticed that for a huge amount of reasoning about the nature of values, I just want to hand over a printed copy of “Three Worlds Collide” and run away, laughing nervously.
Basically, because the world where kidney selling is legal is not the world where mothers don’t have to watch their kids die; it’s the world where people are forced to sell their kidneys to pay their student loans.
A useful heuristic for deontology violations: this shit usually doesn’t have good consequences in the end.