Also, there is a big difference between “calling for violence” and “calling for the establishment of an international treaty, which is to be enforced by violence if necessary”. I don’t understand why so many people are muddling this distinction.
Chris van Merwijk
Moloch games
Subspace optima
Are human imitators superhuman models with explicit constraints on capabilities?
Manhattan project for aligned AI
[Question] How are compute assets distributed in the world?
Straw-Steelmanning
A paradox of existence
What kinds of algorithms do multi-human imitators learn?
“For instance, personally I think the reason so few people take AI alignment seriously is that we haven’t actually seen anything all that scary yet.”
And if this “actually scary” thing happens, people will know that Yudkowsky wrote the article beforehand, and they will know who the people are that mocked it.
I suggest renaming this to “countably factored spaces”, since countability is a property of the factorization rather than of the space.
Also, I suggest adding an actual self-contained definition of a countably factored space, to make the post more readable.
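Something like the following would do (this is my own adaptation of the finite-factored-space definition to countably many factors, not a quote from the post, so the exact phrasing may not match the author's intent):

```latex
% Sketch of a self-contained definition (my adaptation of the finite case;
% not taken from the original post):
\textbf{Definition.} A \emph{countably factored space} is a pair $(S, B)$,
where $S$ is a set and $B$ is a countable set of nontrivial partitions of
$S$, such that the map
\[
  s \;\mapsto\; \big(b(s)\big)_{b \in B},
  \qquad b(s) := \text{the part of } b \text{ containing } s,
\]
is a bijection from $S$ onto the product $\prod_{b \in B} b$. Equivalently:
for every choice of one part from each partition in $B$, the intersection
of the chosen parts contains exactly one element of $S$.
```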
Datapoint: median 10% AI x-risk mentioned on Dutch public TV channel
Maybe you have made a gestalt-switch I haven’t made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.
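To make that concrete, here is a minimal sketch of the mechanism I mean by “reinforcement of pre-existing computations”, using a vanilla REINFORCE-style update. This is purely illustrative, not anyone’s proposed training setup, and `reward_fn` is a hypothetical stand-in for whatever function from states (and actions) to numbers you have:

```python
import torch

# Illustrative sketch only: in a REINFORCE-style update, the reward
# function never appears as a target the network is "aiming at". It
# enters only as a scalar weight on gradients of computations the
# network has already performed.

policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2),
)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

def reinforce_step(state, reward_fn):
    logits = policy(state)                    # the model's pre-existing computation
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = reward_fn(state, action)         # just a number; never a "goal" the net sees
    # The only role of `reward` is as a scalar weight on the gradient of
    # the computation that already produced `action`: high reward
    # strengthens that computation, low or negative reward weakens it.
    loss = -reward * dist.log_prob(action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call (hypothetical constant-reward function):
# reinforce_step(torch.randn(4), lambda s, a: 1.0)
```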
Is there a difference between saying:

1. “A reward function is an objective function, but the only way it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t in any way encode the ‘goal’ of the model itself.”
2. “A reward function is not an objective function, and the only way it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t in any way encode the ‘goal’ of the model itself.”
It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn’t actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we still talked of the base objective as an “objective”.
However, it might be that while RFLO pointed out the same mechanistic understanding you have in mind, calling it an “objective” tends in practice not to fully communicate that mechanistic understanding.
Or it might be that I am really not yet seeing that there is an actual difference in mechanistic understanding, or that my intuitions are still being misled by the wrong high-level concept even if I have the lower-level mechanistic understanding right.
(On the other hand, one reason to still call it an objective is that we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind.)
Very late reply, sorry.
Re “even though reward is not a kind of objective”: this is a terminological issue. In my view, calling an “antecedent-computation reinforcement criterion” an “objective” matches my definition of “objective”. The term “objective” is ill-defined enough that “even though reward is not a kind of objective” is a terminological claim about “objective”, not a claim about math/the world.
The idea that RL agents “reinforce antecedent computations” is completely core to our story of deception. You could not make sense of our argument for deception if you didn’t look at RL systems in this way. Viewing the base optimizer as “trying” to achieve an “objective” but “failing” because it is being “deceived” by the mesa optimizer is purely a metaphorical/terminological choice. It doesn’t negate the fact that we all understood that the base optimizer is just reinforcing “antecedent computations”. How else could you make sense of the story of deception, where an existing model, which represents the mesa optimizer, is being reinforced by the base optimizer because that existing model understands the base optimizer’s optimization process?
I am not claiming that RFLO communicated this point well, just that it was understood and absolutely was core to the paper; large parts of the paper wouldn’t even make sense if you didn’t have this insight. (Certainly the fact that we called it an objective doesn’t communicate the point, and it isn’t meant to.)
An AI defense-offense symmetry thesis
“But yeah, I wish this hadn’t happened.”
Who else is gonna write the article? My sense is that no one (including me) is starkly stating publicly the seriousness of the situation.

“Yudkowsky is obnoxious, arrogant, and most importantly, disliked, so the more he intertwines himself with the idea of AI x-risk in the public imagination, the less likely it is that the public will take those ideas seriously”
I’m worried about people making character attacks on Yudkowsky (or other alignment researchers) like this. I think the people who believe they can probably solve alignment by just going full-speed ahead and winging it are the arrogant ones; Yudkowsky’s arrogant-sounding comments about how we need to be very careful and slow are negligible in comparison. I’m guessing you agree with this (not sure), and we should be able to criticise him for his communication style, but I am a little worried about people publicly undermining Yudkowsky’s reputation in that context. This seems like not what we would do if we were trying to coordinate well.
“But I don’t think you even need Eliezer-levels-of-P(doom) to think the situation warrants that sort of treatment.”
Agreed. If a new state develops nuclear weapons, this isn’t even close to creating a 10% x-risk, yet the idea of airstrikes on nuclear enrichment facilities, even though it is very controversial, has for a long time very much been an option on the table.
You are muddling the meaning of “pre-emptive war”, or even “war”. I’m not trying to diminish the gravity of Yudkowsky’s proposal, but a missile strike on a specific compound known to contain WMD-developing technology is not a “pre-emptive war”, or even a “war”. Again, not to diminish the gravity, but this seems like an incorrect use of the term.
I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from memory it seems like they emerge sooner than you would expect if this were the only reason (given the size of GPT-3’s context window).
It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You’ve certainly phrased things differently and made some specific points that we didn’t, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seem to be presented as new) are implied by RFLO? If not, could you state briefly what is different?
(Note: I am still sometimes surprised that people think certain wireheading scenarios make sense despite having read RFLO, so it’s plausible to me that we really didn’t communicate everything that’s in my head about this.)