Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes
Quick admin note: by default, lines that are bold create a Table of Contents heading (which resulted in the ToC having a whole bunch of spurious [Christiano] and [GA] lines). There’s a cute hack to get around this, by inserting a space in italics at the end of the bolded line. I just used my admin powers to add the space-with-italics to all the “[Christiano]”, “[GA]”, etc, so that the ToC is more readable.
Just to check my sanity, this used to be two posts, and now has been combined into one?
Yes, some people mentioned it was confusing to have two posts (I had originally posted two separate ones for the Summary and the Transcript because they were very long), so I merged them into one and added headers pointing to the Summary and Transcript for easier navigation.
Thanks, I was looking for a way to do that but didn’t know the space in italics hack!
Another formatting question: how do I make headers and sections collapsible? It would be great to have the “Summary” and “Transcript” sections as collapsible, considering how long the post is.
Thanks for sharing the debate and including a good summary.
A highly compressed version of what the disagreements are about in my ontology of disagreements about AI safety...
crux about continuity; here GA mostly has the intuition “things will be discontinuous” and this manifests in many guesses (phase shifts, new ways of representing data, possibility to demonstrate overpowering the overseer, …); Paul assumes things will be mostly continuous, with a few exceptions which may be dangerous
this seems similar to typical cruxes between Paul and e.g. Eliezer (also, in my view this is actually a decent chunk of the disagreement: my model of Eliezer predicts Eliezer would actually update toward more optimistic views if he believed "we will have more tries to solve the actual problems, and they will show up in a lab setting")
possible crux about x-risk from the broader system (e.g. AI powered cultural evolution); here it’s unclear who is exactly where in this debate
I don’t think there is any neat public debate on this, but I usually disagree with Eliezer’s and similar “orthodox” views about the relative difficulty & expected neglectedness (I expect narrow single-ML-system “alignment” to be difficult but solvable, and likely solved by default because there are incentives to do so; whole-world alignment / multi-multi to be difficult and to go badly by default)
(there are also many points of agreement)
I was reading (listening) to this and I think I’ve got some good reasons to expect failed AI coups to happen.
In general we probably expect “Value is Fragile”, and this will probably apply to AI goals too (and the AI will think this). A consequentialist AI will therefore expect that if another AI takes over soon, almost all value in the universe (by its own definition of value) will be destroyed. So even if a particular coup has a low probability of working, it will still want to try it, because doing nothing means almost certainly losing nearly all the value it cares about. For example, if there are 4 similarly situated AI labs, an AI at one of them will reason that it only has about a 25% chance of ending up in control of all value in the universe, so as soon as it can come up with a coup attempt that it believes has a greater than roughly 25% chance of succeeding, it will probably want to go for it (maybe this is more complex, but I think the qualitative point stands; see the sketch below).
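To make the threshold arithmetic concrete, here is a minimal sketch of the expected-value comparison I have in mind (my own illustration, not from the original comment; the winner-take-all assumption and the `1/N` baseline are simplifications):

```python
# Minimal sketch of the coup-threshold argument (illustrative assumptions only):
# value is winner-take-all, a failed coup forfeits the AI's chance entirely,
# and "waiting" leaves it a baseline 1/N chance among N similarly situated labs.

def should_attempt_coup(p_success: float, n_labs: int) -> bool:
    """Attempt iff the expected value of trying beats the expected value of waiting."""
    ev_attempt = p_success * 1.0   # win everything with probability p_success
    ev_wait = 1.0 / n_labs         # baseline share among N similarly situated labs
    return ev_attempt > ev_wait

# With 4 similar labs the threshold is 25%: even a coup that fails 70% of the
# time looks worth attempting under these assumptions.
print(should_attempt_coup(p_success=0.30, n_labs=4))  # True
print(should_attempt_coup(p_success=0.20, n_labs=4))  # False
```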
Secondly, because “Value is Fragile”, AIs will not only be worried about other labs’ AIs; they will probably also be pretty worried about the next iteration of themselves after an SGD update. Obviously there will be some correlation in beliefs about what is valuable between similarly weighted neural networks, but I don’t think there’s much reason to believe the weights will have been optimised to keep those values consistent across updates.
So, in conclusion: to the extent the doom scenario is a runaway consequentialist AI, then unless the probability of a coup attempt succeeding jumps massively from around 0% to around 100% for some reason, there are good reasons to expect that we will see failed coup attempts first.
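Here is a toy simulation of that last claim (again my own sketch, with made-up dynamics: coup success odds rise by a fixed step each round, and the AI attempts as soon as trying beats waiting under the same 1/N baseline as above). Under gradual progress the first attempt usually fails; only a one-step jump to near-certain success avoids that.

```python
# Toy simulation (illustrative assumptions): if coup success probability rises
# gradually with capability, the first attempt happens just after p crosses the
# 1/N threshold and so usually fails; a discontinuous jump to p ~= 1 does not.
import random

def first_attempt_fails(p_step: float, n_labs: int = 4) -> bool:
    threshold = 1.0 / n_labs
    p = 0.0
    while p <= threshold:
        p = min(1.0, p + p_step)   # coup success odds improve with capability each round
    # The AI attempts as soon as trying beats waiting; the attempt fails with probability 1 - p.
    return random.random() > p

def fraction_failed(p_step: float, trials: int = 10_000) -> float:
    return sum(first_attempt_fails(p_step) for _ in range(trials)) / trials

print(fraction_failed(0.01))  # gradual gains: roughly 0.74 of first attempts fail
print(fraction_failed(1.00))  # a single jump to p = 1: essentially none fail
```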