Not necessarily, because you might also commit to stopping it in a non-escalatory way. For instance, you could work to make lab-grown meat economically viable as a replacement for animal products.
Hence the other key ingredient in Zizianism is the commitment to escalating all the way, which is what allows things to blow up as dramatically as this. (And escalating all the way has the potential to go wrong in most conflicts, not just over veganism, though veganism seems like the big one here; e.g. I doubt the landlord conflict was about veganism.)
As an analogy, if you were dealing with the Holocaust, you could try to directly destroy all Nazis, or you could try to counter the Holocaust in less escalatory ways (e.g. helping Jews emigrate from Nazi territories, which I imagine could be done either with the cooperation of Jews, as in the Danish case, or with the cooperation of Nazis, as in the Madagascar plan).
I don’t like either of their positions because they focus too much on large language models. I dislike Simplicia’s the least, but I think that’s because Doomimir is the one leading the conversation and therefore is the one who chooses to lead it to a sketchy place.
Large language models gain their capabilities from self-supervised learning on records of humans performing activities, or from reinforcement learning from human feedback about how to achieve things, or from internalizing that human-approved knowledge into their motivation. In all of these cases, you rely on humans having figured out how to do stuff in order to make the AI able to do stuff, so it is of course logical that this would tightly integrate capabilities and alignment in the way Simplicia says.
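To make that concrete, here is roughly how humans enter each training signal (these are the standard textbook objectives, not anything specific to the dialogue): pretraining fits the model to human-written text $x$,

$$\mathcal{L}_{\text{pre}}(\theta) = -\sum_t \log p_\theta(x_t \mid x_{<t}),$$

and RLHF optimizes against a reward model $r_\phi$ that was itself fit to human preference judgments, regularized toward the pretrained policy,

$$\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, D_{\mathrm{KL}}\left(\pi(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\right).$$

In both cases the optimization target is defined by human-produced data or human judgments, which is the sense in which capabilities and alignment come bundled.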
But I can’t help thinking this is limited compared to the potential that could be achieved if the AI learned to build capabilities independently of human feedback. We have tons of plausible-seeming ways to achieve this, and tons of people working on them, so I feel like people are going to make progress there and LLMs are going to become a secondary thing.
Not to understate the significance of LLMs: they are of course highly economically significant (they could transform the economy for decades even without any further AI revolutions; we have some projects at my job about integrating them into our product), and I use them daily, but I don’t expect that we will keep thinking of them as the forefront of AI. I just see them as a repository of human knowledge and skills, and considering their training method and feasible applications, I think I am justified in seeing them that way.
Large language models have gained a lot of factual knowledge and a usable amount of common sense. However, they have not gained this knowledge and common sense because we have learned how to make AIs derive it themselves, but because we have created powerful curve-fitters and rich datasets to fit them to.
The abilities demonstrated in the datasets are “pre-approved” by the humans who applied their own optimization to create the data (if you write a blog post explaining how to do something, you presumably approve of doing that thing), which limits the problems you’re going to see.
That said, it’s not like there are zero worries to have here. For instance, there are people working on alternate or additional methods to make AI more autonomous, and the success of LLMs drives a lot of investment into those areas. Also, while LLMs don’t pose existential risks, they do cause a lot of mundane problems (hopefully outweighed by the benefits they bring? but I’m guessing here and haven’t seen any serious pro/con analysis).
But I don’t know that the situation post-GPT is as massively different from pre-GPT as some people make it sound.
For LLMs gaining their capabilities from human information, yes. Not sure why we should expect all future AI to look like that.
LLMs aren’t really trained to be a genie, though. The agency arguments (coherence etc.) don’t really go through, because they assume that there is something out there in the world that the model optimizes. LLMs exhibit some instrumentally convergent behavior because humans put instrumentally convergent information into our texts to help each other, but that’s a really different scenario from the classical x-risk arguments.
Yep. But also, this ties into why I don’t think LLMs will be the final word in AI.
For instance, you might think you can build advanced AI-powered systems by wiring a bunch of prompts together. And to an extent it’s true you can do that, but you have to be very careful due to a phenomenon I call “transposons”.
Basically, certain kinds of text tend to replicate themselves in LLM outputs, not necessarily verbatim but sometimes just thematically. So if you’ve set up your LLM-prompt Rube Goldberg machine, sometimes it’s going to generate such a transposon, and then the transposon starts taking over your system, which often interferes with the system’s function.
There are some things you can do to keep it down, but handling it can be a whole job in itself. I think the reason it occurs is that the AI isn’t trying to achieve things; it’s just modelling text. If a human has some idea, they might try implementing it in a bunch of places, but if it goes wrong they would notice that it goes wrong and stop using it so much. So humans can basically cut down on transposons in ways that AIs struggle with. (… Egregores also struggle with cutting down on transposons, but that’s a different topic.)
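As a minimal sketch of the kind of pipeline I mean (call_llm and the repeated-phrase heuristic are made-up illustrations, not any real API or a method I’d claim is sufficient):

```python
# Toy prompt-chain with a crude check for "transposons": phrases that keep
# re-appearing in later stages' outputs and start to dominate the context.

from collections import Counter

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever model API you use."""
    return f"(model output for: {prompt[:40]}...)"

def ngrams(text: str, n: int = 5) -> list[str]:
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def run_pipeline(stages: list[str], task: str, repeat_threshold: int = 3) -> str:
    context = task
    seen = Counter()  # how many stages each phrase has appeared in
    for stage_prompt in stages:
        output = call_llm(f"{stage_prompt}\n\nContext so far:\n{context}")
        # Flag phrases that recur across many stages; such a phrase is
        # behaving like a transposon and may be crowding out the actual task.
        for gram in set(ngrams(output)):
            seen[gram] += 1
            if seen[gram] >= repeat_threshold:
                print(f"possible transposon: {gram!r}")
        context += "\n" + output
    return context

result = run_pipeline(
    stages=["Summarize the task.", "Propose a plan.", "Critique the plan."],
    task="Write a short report on solar panel maintenance.",
)
```

Exact-match filtering like this only catches the verbatim cases, of course; the thematic replication is the part that makes it a whole job to handle.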
I think future AI techniques that are more outwards-focused and consequentialist will have an easier time cutting down on transposons, and since it seems useful to design AIs that don’t require so much handholding, I expect people to pursue such techniques.
I find that I often think more clearly about instrumental convergence, coherence, etc. by rephrasing implications as disjunctions.
For instance, instead of “if an AI robustly achieves goals, then it will resist being shut down”, one can say “either an AI resists being shut down, or it doesn’t robustly achieve goals”.
One can then go further and ask “wait, why wouldn’t an AI robustly achieve goals just because it doesn’t resist being shut down?” and notice that one answer is “well, if it exists in an environment where people might shut it down, then it might get shut down and not achieve the goal”. But then a natural response is “actually, I think I like this disjunct; let’s not have the AI robustly achieve goals”.
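Spelled out, this is just the standard material-implication equivalence, with $P$ = “the AI robustly achieves goals” and $Q$ = “it resists being shut down”:

$$(P \Rightarrow Q) \;\equiv\; (\neg P \lor Q)$$

The move above is then deliberately choosing to make the $\neg P$ disjunct true rather than $Q$.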
This works fine for LLMs, which get their instrumental convergence via specific bits of information pre-approved by humans. It’s less clear for post-LLM AIs. For instance, if you create a transposon-remover, it might notice that the idea “You must allow your creators to shut you down” acts a lot like a transposon, and remove it. In practice this particular case would be kind of obvious and so maybe dodgeable, but I think once you leave the “all instrumental convergence is pre-approved by humans” regime and enter the “the AI makes tons of instrumental convergence by itself” regime, you get a lot more nearest-unblocked-strategy issues and so on. (Whereas nearest-unblocked strategy isn’t so much of an issue when your strategies have to be pre-approved by humans and you are therefore quite limited in the number of strategies you carry.)
But also… what about cases where “well, if it exists in an environment where people might shut it down, then it might get shut down and not achieve the goal” is not an acceptable disjunct? For instance, maybe you are the US army and you are creating robots to fight wars, or making tools to identify the enemy’s AI weapons and destroy them. An obvious answer is “wtf?! we shouldn’t just give the AI the goal of being destructive and indestructible!?”, but you do have to ensure global security somehow. Maybe you can get China, Russia, and the US to agree on a treaty banning AI capabilities research, but you also need to enforce that treaty somehow, so there are some real military questions here.
The key question is: how do you think it gets smarter? If you plan to do more of the usual (and lots of people do plan that), then I agree. But if you introduce new methods, then maybe the pattern won’t hold.
Do note that humans in the EEA did not have factory farming, did not destroy terrain to extract resources, did not have nuclear bombs, etc.
If you ask the superintelligence to learn the value of plans from humans, then you’d have the alignment/capabilities connection discussed earlier, because this relies on humans to extrapolate the consequences of the plans, and therefore hacking the human labeler would allow it to use plans that don’t work. (This is also why current RLHF is safe.) I take TurnTrout’s proof about u-AOH to be a formalization of this point, though IIRC he doesn’t actually like his instrumental convergence series, so he might disagree about this being relevant.
On the other hand, if you ask the superintelligence to learn the value of the outcomes from humans, then a lot of classical points apply.
Maybe. This could be an idea to develop further. I’m skeptical but it does seem interesting.
Idk...
So clearly this is still relying on the fallacy that coherence applies in the standard way. But at the same time, if you try to have the AI take over the world, then it’s going to need to develop a certain sort of robustness, which means that even if coherence didn’t originally apply, it has to now.
But the way this cashes out is something like “once you’ve decided that you’ve figured out how to align AI, you need to disable your alien actress army, but at the same time the adversaries you are trying to surveil must be unable to disable it”. Given how abstract the alien-actress-army idea is, I find the feasibility of that hard to think through.