AI is not a matter of historical evidence. The correct models here are invasive species, new pathogens, and the Great Oxidation Event. 99% of species have ended up extinct.
They are made of repeating patterns
[Question] What specific thing would you do with AI Alignment Research Assistant GPT?
I think the difference between what you are describing and what is meant here is captured in this comment:
There’s a phenomenon where a gambler places their money on 32, and then the roulette wheel comes up 23, and they say “I’m such a fool; I should have bet 23”.
More useful would be to say “I’m such a fool; I should have noticed that the EV of this gamble is negative.” Now at least you aren’t asking for magic lottery powers.
Even more useful would be to say “I’m such a fool; I had three chances to notice that this bet was bad: when my partner was trying to explain EV to me; when I snuck out of the house and ignored a sense of guilt; and when I suppressed a qualm right before placing the bet. I should have paid attention in at least one of those cases and internalized the arguments about negative EV, before gambling my money.” Now at least you aren’t asking for magic cognitive powers.
The problem is not that you can “just meditate and come to good conclusions”; the problem is that “technical knowledge about actual machine learning results” doesn’t seem like a good path either.
Like, we can get from a NN trained to do modular addition the fact that it performs a Fourier transform, because we clearly know what a Fourier transform is; but I don’t see any clear path to getting from a neural network the fact that its output is both useful and safe, because we don’t have any practical operationalization of what “useful and safe” is. If we had a solution to the MIRI problem “which program, run on an infinitely large computer, produces an aligned outcome”, we could try to understand how good the NN is at approximating this program, using the aforementioned technical knowledge, and have substantial hope, for example.
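(As a toy illustration of the first kind of legibility, a minimal sketch: the modulus, the key frequencies, and the hand-built embedding below all stand in for a trained modular-addition network, they are not taken from any real model. The point is just that “performs a Fourier transform” is checkable because we know what sparsity in the Fourier basis looks like.)

```python
import numpy as np

p = 113            # modulus of the toy modular-addition task (assumed)
ks = [17, 25, 32]  # hypothetical "key frequencies" a trained net might use

# Hand-built embedding with Fourier structure, standing in for the learned
# embedding matrix of a modular-addition network: one cos/sin pair per frequency.
a = np.arange(p)
W_E = np.concatenate(
    [np.stack([np.cos(2 * np.pi * k * a / p),
               np.sin(2 * np.pi * k * a / p)], axis=1) for k in ks],
    axis=1,
)

# "Performs a Fourier transform" is checkable: the DFT of the embedding over
# the token axis is sparse, with almost all energy at the key frequencies.
spectrum = np.abs(np.fft.fft(W_E, axis=0))
energy_per_freq = spectrum[: p // 2].sum(axis=1)
print(sorted(np.argsort(energy_per_freq)[-3:]))  # recovers [17, 25, 32]
```

There is no analogous check for “useful and safe”, because we have no basis in which “useful and safe” is a crisp, measurable property.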
Where Does Adversarial Pressure Come From?
Training a superintelligence is secretly adversarial
It’s striking how well these black box alignment methods work
I should note that human alignment methods work only because no human in history could suddenly start designing nanotech in their head or treating other humans as buggy, manipulable machinery. I think there are plenty of humans around who would want to become mind-controlling dictators given the possibility, or who are generally nice but would give in to temptation.
“A source close to Altman” means “Altman”, and I’m pretty sure that he is not a very trustworthy party at the moment.
I think that you are missing what unique condition of the recent past has changed with deepfakes. In medieval times there were two levels of evidence: you could see something with your own eyes, or someone could tell you about it. A book saying there are people with brown skin in Africa was no different in evidential power from a book saying there are people with dog heads; you were left with your priors. When photo and video were invented, we got a new level of evidence, not as good as seeing something personally, but better than anything else. With deepfakes we are returning to the medieval situation, where “I’ve seen a source about event X” differs very little from “someone told me about event X”.
This is a meta-point, but I find it weird that you ask what “caring about something” is according to CS, but don’t ask what “corrigibility” is, despite the existence of multiple examples of goal-oriented systems and some relatively good formalisms (we disagree about whether expected utility maximization is a good model of real goal-oriented systems, but we all agree that if we met an expected utility maximizer we would find its behavior pretty much goal-oriented), while corrigibility is purely a product of the imagination of one particular Eliezer Yudkowsky, born of an attempt to imagine a system that doesn’t care about us but still behaves nicely under some vaguely restricted definition of niceness. We don’t have any examples of corrigible systems in nature, and we have a constant record of failed attempts to formalize even relatively simple instances of corrigibility, like shutdownability. I think the likely answer to “why should I expect corrigibility to be unlikely” sounds like “there is no simple description of corrigibility to which our learning systems can easily generalize, and there is no reason to expect a simple description to exist”.
The difficulty with preference fulfillment is that we need to target the AI’s altruistic utility function exactly at humans, not, for example, at all sentient/agentic beings. A superintelligent entity could use some acausal decision theory, discover some Tegmark-like theory of the multiverse, decide that there are far more paperclip maximizers than human-adjacent entities, and fulfill their preferences, not ours.
I feel like the whole “subagent” framework suffers from the homunculus problem: we fail to explain behavior using the abstraction of a coherent agent, so we move to the abstraction of multiple coherent agents, and while that can be useful, I don’t think it reflects the actual mechanistic truth about minds.
When I plan something and then fail to execute the plan, it’s mostly not a “failure to bargain”. It’s just that when I plan something, I usually have the good consequences of the plan in my imagination, and these consequences make me excited; then I start executing the plan and get hit by multiple unpleasant details of reality. A coherent structure emerges from multiple not-really-agentic pieces.
It’s literally point −2 in the List of Lethalities that we don’t need a “perfect” alignment solution; we just don’t have any.
Eliezer has clear beliefs about interpretability and bets on it: https://manifold.markets/EliezerYudkowsky/by-the-end-of-2026-will-we-have-tra
You can just have a model with the capabilities of the smartest human hacker that exfiltrates itself, hacks the 1-10% of the world’s computing power with the shittiest protection, distributes itself Rosetta@home-style, and bootstraps whatever takeover plan using sheer brute force. That said, I see no reason for capabilities to land exactly at the point “smartest human hacker”, because there is nothing special about this point; it can be 2x, 5x, 10x, without any necessity of becoming 1000000x within a second.
I’m pretty optimistic about our white box alignment methods generalizing fine.
And I still don’t get why! I would like to see your theory of generalization in DL that allows such a level of optimism; “gradient descent is powerful” simply doesn’t capture that.
I think the problem here is distinguishing between terminal and instrumental goals? Most people probably don’t run an apple pie business because they have terminal goals about the apple pie business. They probably want money and status, want to be useful, and want to provide for their families, and I expect these goals to be very persistent and self-preserving.
Scattered thoughts:
I think that the observed behavior is fairly consistent with non-linear functions that have sort-of-linear parts. Let’s take ReLU. If you subtract a large enough number, it doesn’t matter if you subtract more, because you will always get zero; but before that point you will observe a sort-of-linear change of behavior.
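A minimal numerical sketch of that point (just numpy, not tied to any particular model): the output falls roughly linearly as you subtract more, then saturates at zero.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = 3.0  # some fixed pre-activation value (made up for illustration)
for c in [0, 1, 2, 3, 4, 5, 10]:
    # sort-of-linear change while x - c > 0, then flat zero afterwards
    print(c, relu(x - c))
```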
Speculative part: neural networks learn linear representations and conditions for switching between them, which are expressed in the non-linear part of the internal mechanisms. If you add too large a number to some component, the model hits a region of state space that doesn’t have a linear representation and crumbles.
Predictions:
Algebraic value editing works (for at least one “X vector”) in LMs: 95%
Algebraic value editing works better for larger models, all else equal: 55%
If value edits work well, they are also composable: 60%
If value edits work at all, they are hard to make without substantially degrading capabilities: 25%
We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
“truth-telling” 25%
“love” 70%
“accepting death” 70%
“speaking French” 95%
The main obstacle, according to my model: if the model works by switching between different linear representations, it is possible that a “niceness” vector exists only for some specific layer of the model which decides whether the completion will be nice or not, so you can’t take a random layer in the middle and compute a “niceness” vector for it.
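To make that concrete, here is a minimal toy sketch of per-layer activation addition in the spirit of algebraic value editing; the random network, the layer index, and the “nice”/“rude” inputs are all invented for illustration, not taken from the post. The hypothesis above is that only some particular `layer` would actually yield a usable vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a network: a stack of random linear layers with ReLU.
weights = [rng.standard_normal((16, 16)) / 4 for _ in range(4)]

def forward(x, add_at_layer=None, vector=None):
    """Run the toy network, optionally adding a steering vector at one layer."""
    h = x
    for i, w in enumerate(weights):
        h = np.maximum(0.0, h @ w)
        if i == add_at_layer and vector is not None:
            h = h + vector
    return h

def activations_at(xs, layer):
    """Collect hidden states at a given layer for a batch of inputs."""
    acts = []
    for x in xs:
        h = x
        for i, w in enumerate(weights):
            h = np.maximum(0.0, h @ w)
            if i == layer:
                acts.append(h)
                break
    return np.stack(acts)

# Hypothetical "nice" and "rude" inputs (random here; in a real LM these would
# be activations from contrasting prompts).
nice_inputs = rng.standard_normal((32, 16))
rude_inputs = rng.standard_normal((32, 16))

layer = 2  # per the hypothesis, maybe only one specific layer works at all
niceness_vector = (activations_at(nice_inputs, layer).mean(0)
                   - activations_at(rude_inputs, layer).mean(0))

# Steer a fresh input by adding the (scaled) vector at that layer.
steered = forward(rng.standard_normal(16), add_at_layer=layer,
                  vector=2.0 * niceness_vector)
print(steered[:4])
```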
I’ve noticed that for a huge amount of reasoning about the nature of values, I just want to hand over a printed copy of “Three Worlds Collide” and run away, laughing nervously.
Basically, because the world where kidney selling is legal is not the world where mothers don’t have to watch their kids die; it’s the world where people are forced to sell their kidneys to pay their student loans.
A useful heuristic for deontology violations: this shit usually doesn’t have good consequences in the end.