Not necessarily, because you might also commit to stopping it in a non-escalatory way. For instance, you could work to make lab-grown meat economically viable as a replacement for animal products.
Hence the other key ingredient in Zizianism is the commitment to escalating all the way, which is what allows things to blow up as dramatically as this. (And escalating all the way has the potential to go wrong in most conflicts, not just over veganism, though veganism seems like the big one here; e.g. I doubt the landlord conflict was about veganism.)
As an analogy, if you were dealing with the Holocaust, you could try to directly destroy all Nazis, or you could try to counter the Holocaust in less escalatory ways (e.g. helping Jews emigrate from Nazi territories, which I imagine could be done either with the cooperation of Jews, as in the Danish case, or with the cooperation of Nazis, as in the Madagascar plan).
I don’t like either of their positions because they focus too much on large language models. I dislike Simplicia’s the least, but I think that’s because Doomimir is the one leading the conversation and therefore is the one who chooses to lead it to a sketchy place.
Large language models gain their capabilities from self-supervised learning on records of humans performing activities, or from reinforcement learning from human feedback about how to achieve things, or from internalizing that human-approved knowledge into their motivation. In all of these cases, you rely on humans having figured out how to do stuff in order to make the AI able to do stuff, so it is of course logical that this would tightly integrate capabilities and alignment in the way Simplicia says.
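To make that concrete, here is roughly how humans enter each training signal (these are the standard textbook objectives, not anything specific to the dialogue): pretraining fits the model to human-written text $x$,

$$\mathcal{L}_{\text{pre}}(\theta) = -\sum_t \log p_\theta(x_t \mid x_{<t}),$$

and RLHF optimizes against a reward model $r_\phi$ that was itself fit to human preference judgments, regularized toward the pretrained policy,

$$\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, D_{\mathrm{KL}}\left(\pi(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\right).$$

In both cases the optimization target is defined by human-produced data or human judgments, which is the sense in which capabilities and alignment come bundled.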
But I can’t help thinking this is limited compared to the potential that could be achieved if the AI learned to build capabilities independently of human feedback. We have tons of plausible-seeming ways to achieve this, and tons of people working on them, so I feel like people are going to make progress there and LLMs are going to become a secondary thing.
Not to understate the significance of LLMs: they are of course highly economically significant (they could transform the economy for decades even without any further AI revolutions; we have some projects at my job about integrating them into our product), and I use them daily, but I don’t expect that we will keep thinking of them as the forefront of AI. I just see them as a repository of human knowledge and skills, and considering their training method and feasible applications, I think I am justified in seeing them that way.
Large language models have gained a lot of factual knowledge and a usable amount of common sense. However, they have not gained this knowledge and common sense because we have learned how to make AIs derive it themselves, but because we have created powerful curve-fitters and rich datasets to fit them to.
The abilities demonstrated in the datasets are “pre-approved” by the humans who applied their own optimization to create the data (if you write a blog post explaining how to do something, you presumably approve of doing that thing), which limits the problems you’re going to see.
That said, it’s not like there are zero worries to have here. For instance, there are people working on alternate or additional methods to make AI more autonomous, and the success of LLMs drives a lot of investment into those areas. Also, while LLMs don’t pose existential risks, they do cause a lot of mundane problems (hopefully outweighed by the benefits they bring? but I’m guessing here and haven’t seen any serious pro/con analysis).
But I don’t know that the situation post-GPT is as massively different from pre-GPT as some people make it sound.
For LLMs gaining their capabilities from human information, yes. Not sure why we should expect all future AI to look like that.
LLMs aren’t really trained to be a genie, though. The agency arguments (coherence etc.) don’t really go through, because they assume that there is something out there in the world that the model optimizes. LLMs exhibit some instrumentally convergent behavior because humans put instrumentally convergent information into our texts to help each other, but that’s a really different scenario from the classical x-risk arguments.
Yep. But also, this ties into why I don’t think LLMs will be the final word in AI.
For instance, you might think you can build advanced AI-powered systems by wiring a bunch of prompts together. And to an extent it’s true you can do that, but you have to be very careful due to a phenomenon I call “transposons”.
Basically, certain kinds of text tend to replicate themselves in LLM outputs, not necessarily verbatim but sometimes just thematically. So if you’ve set up your LLM-prompt Rube Goldberg machine, sometimes it’s going to generate such a transposon, and then the transposon starts taking over your system, which often interferes with the system’s function.
There are some things you can do to keep it down, but handling it can be a whole job in itself. I think the reason it occurs is that the AI isn’t trying to achieve things; it’s just modelling text. If a human has some idea, they might try implementing it in a bunch of places, but if it goes wrong they would notice that it goes wrong and stop using it so much. So humans can basically cut down on transposons in ways that AIs struggle with. (… Egregores also struggle with cutting down on transposons, but that’s a different topic.)
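As a minimal sketch of the kind of pipeline I mean (call_llm and the repeated-phrase heuristic are made-up illustrations, not any real API or a method I’d claim is sufficient):

```python
# Toy prompt-chain with a crude check for "transposons": phrases that keep
# re-appearing in later stages' outputs and start to dominate the context.

from collections import Counter

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever model API you use."""
    return f"(model output for: {prompt[:40]}...)"

def ngrams(text: str, n: int = 5) -> list[str]:
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def run_pipeline(stages: list[str], task: str, repeat_threshold: int = 3) -> str:
    context = task
    seen = Counter()  # how many stages each phrase has appeared in
    for stage_prompt in stages:
        output = call_llm(f"{stage_prompt}\n\nContext so far:\n{context}")
        # Flag phrases that recur across many stages; such a phrase is
        # behaving like a transposon and may be crowding out the actual task.
        for gram in set(ngrams(output)):
            seen[gram] += 1
            if seen[gram] >= repeat_threshold:
                print(f"possible transposon: {gram!r}")
        context += "\n" + output
    return context

result = run_pipeline(
    stages=["Summarize the task.", "Propose a plan.", "Critique the plan."],
    task="Write a short report on solar panel maintenance.",
)
```

Exact-match filtering like this only catches the verbatim cases, of course; the thematic replication is the part that makes it a whole job to handle.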
I think future AI techniques that are more outwards-focused and consequentialist will have an easier time cutting down on transposons, and since it seems useful to design AIs that don’t require so much handholding, I expect people to pursue such techniques.
I find that I often think more clearly about instrumental convergence, coherence, etc. by rephrasing implications as disjunctions.
For instance, instead of “if an AI robustly achieves goals, then it will resist being shut down”, one can say “either an AI resists being shut down, or it doesn’t robustly achieve goals”.
One can then go further and ask “wait, why wouldn’t an AI robustly achieve goals just because it doesn’t resist being shut down?” and notice that one answer is “well, if it exists in an environment where people might shut it down, then it might get shut down and not achieve the goal”. But then a natural response is “actually, I think I like this disjunct; let’s not have the AI robustly achieve goals”.
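Spelled out, this is just the standard material-implication equivalence, with $P$ = “the AI robustly achieves goals” and $Q$ = “it resists being shut down”:

$$(P \Rightarrow Q) \;\equiv\; (\neg P \lor Q)$$

The move above is then deliberately choosing to make the $\neg P$ disjunct true rather than $Q$.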
This works fine for LLMs, which get their instrumental convergence via specific bits of information pre-approved by humans. It’s less clear for post-LLM AIs. For instance, if you create a transposon-remover, it might notice that the idea “You must allow your creators to shut you down” acts a lot like a transposon, and remove it. In practice this particular case would be kind of obvious and so maybe dodgeable, but I think once you leave the “all instrumental convergence is pre-approved by humans” regime and enter the “the AI makes tons of instrumental convergence by itself” regime, you get a lot more nearest-unblocked-strategy issues and so on. (Whereas nearest-unblocked strategy isn’t so much of an issue when your strategies have to be pre-approved by humans and you are therefore quite limited in the number of strategies you carry.)
But also… what about cases where “well, if it exists in an environment where people might shut it down, then it might get shut down and not achieve the goal” is not an acceptable disjunct? For instance, maybe you are the US army and you are creating robots to fight wars, or making tools to identify the enemy’s AI weapons and destroy them. An obvious answer is “wtf?! we shouldn’t just give the AI the goal of being destructive and indestructible!?”, but you do have to ensure global security somehow. Maybe you can get China, Russia, and the US to agree on a treaty banning AI capabilities research, but you also need to enforce that treaty somehow, so there are some real military questions here.
The key question is: how do you think it gets smarter? If you plan to do more of the usual (and lots of people do plan that), then I agree. But if you introduce new methods, then maybe the pattern won’t hold.
Do note that humans in the EEA did not have factory farming, did not destroy terrain to extract resources, did not have nuclear bombs, etc.
If you ask the superintelligence to learn the value of plans from humans, then you’d have the alignment/capabilities connection discussed earlier, because this relies on humans to extrapolate the consequences of the plans, and therefore hacking the human labeler would allow it to use plans that don’t work. (This is also why current RLHF is safe.) I take TurnTrout’s proof about u-AOH to be a formalization of this point, though IIRC he doesn’t actually like his instrumental convergence series, so he might disagree about this being relevant.
On the other hand, if you ask the superintelligence to learn the value of the outcomes from humans, then a lot of classical points apply.
Maybe. This could be an idea to develop further. I’m skeptical but it does seem interesting.
Idk...
So clearly this is still relying on the fallacy that coherence applies in the standard way. But at the same time, if you try to have the AI take over the world, then it’s going to need to develop a certain sort of robustness, which means that even if coherence didn’t originally apply, it has to now.
But the way this cashes out is something like “once you’ve decided that you’ve figured out how to align AI, you need to disable your alien actress army, but at the same time the adversaries you are trying to surveil must be unable to disable it”. Given how abstract the alien-actress-army idea is, I find the feasibility of that hard to think through.