I don’t like either of their positions because they focus too much on large language models. I dislike Simplicia’s the least, but I think that’s because Doomimir is the one leading the conversation and therefore the one who steers it to a sketchy place.
Large language models gain their capabilities from self-supervised learning on humans performing activities, from reinforcement learning from human feedback about how to achieve things, or from internalizing human-approved knowledge into their motivations. In all of these cases, you rely on humans figuring out how to do stuff in order to make the AI able to do stuff, so it is of course logical that this would tightly integrate capabilities and alignment in the way Simplicia says.
But I can’t help thinking this is limited compared to the capabilities that could be achieved if the AI learned to build capabilities independently of human feedback. We have tons of plausible-seeming ways to achieve this, and tons of people working on it, so I expect people to make progress there and LLMs to become a secondary thing.
Not to understate the significance of LLMs: they are of course highly economically important (they could transform the economy for decades without any further AI revolutions—we have some projects at my job about integrating them into our product) and I use them daily, but I don’t expect that we will keep thinking of them as the forefront of AI. I just see them as a repository of human knowledge and skills, and considering their training method and feasible applications, I think I am justified in seeing them that way.
Doomimir: Humanity has made no progress on the alignment problem. Not only do we have no clue how to align a powerful optimizer to our “true” values, we don’t even know how to make AI “corrigible”—willing to let us correct it. Meanwhile, capabilities continue to advance by leaps and bounds. All is lost.
Large language models have gained a lot of factual knowledge and a usable amount of common sense. However, they have not gained these because we have learned how to make AIs derive them, but because we have created powerful curve-fitters and rich datasets and such.
The abilities demonstrated in the datasets are “pre-approved” by the humans who applied their own optimization to create the data (if you write a blog post explaining how to do something, then you approve of doing that thing), so this limits the problems you’re going to see.
That said, it’s not like there are zero worries to have here. For instance, there are people working on alternate or additional methods to make AI more autonomous, and the success of LLMs drives tons of investment into these areas. Also, while LLMs don’t pose existential risks, they do cause a lot of mundane problems (hopefully outweighed by the benefits they provide? but I’m guessing here and haven’t seen any serious pro/con analysis).
But I don’t know that the situation is so massively different post-GPT than pre-GPT as some people make it sound.
Simplicia: Why, Doomimir Doomovitch, you’re such a sourpuss! It should be clear by now that advances in “alignment”—getting machines to behave in accordance with human values and intent—aren’t cleanly separable from the “capabilities” advances you decry. Indeed, here’s an example of GPT-4 being corrigible to me just now in the OpenAI Playground:
For LLMs gaining their capabilities from human information, yes. Not sure why we should expect all future AI to look like that.
Doomimir: The alignment problem was never about superintelligence failing to understand human values. The genie knows, but doesn’t care. The fact that a large language model trained to predict natural language text can generate that dialogue, has no bearing on the AI’s actual motivations, even if the dialogue is written in the first person and notionally “about” a corrigible AI assistant character. It’s just roleplay. Change the system prompt, and the LLM could output tokens “claiming” to be a cat—or a rock—just as easily, and for the same reasons.
LLMs aren’t really trained to be genies, though. The agency arguments (coherence etc.) don’t really go through, because they assume that there is something out there in the world that the AI optimizes. An LLM exhibits some instrumentally convergent behavior because humans put instrumentally convergent information into our texts to help each other, but that’s a really different scenario from the classical x-risk arguments.
Simplicia: My point was that the repetition trap is a case of “capabilities” failing to generalize along with “alignment”. The repetition behavior isn’t competently optimizing a malign goal; it’s just degenerate. A for loop could give you the same output.
Yep. But also, this ties into why I don’t think LLMs will be the final word in AI.
For instance, you might think you can build advanced AI-powered systems by wiring a bunch of prompts together. And to an extent it’s true you can do that, but you have to be very careful due to a phenomenon I call “transposons”.
Basically, certain kinds of text tend to replicate themselves in LLM output, not necessarily exactly but sometimes thematically or the like. So if you’ve set up your LLM-prompt Rube Goldberg machine, sometimes it’s going to generate such a transposon, and then it starts taking over your system, which often interferes with the AI’s function.
There are some things you can do to keep it down, but handling it can sort of be a whole job. I think the reason it occurs is that the AI isn’t trying to achieve things, it’s just modelling text. If a human has some idea, they might try implementing it in a bunch of places, but if it goes wrong, they notice that and stop using it so much. So humans can basically cut down on transposons in ways that AIs struggle with. (… Egregores also struggle with cutting down on transposons, but that’s a different topic.)
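To make the failure mode concrete, here is a toy sketch (the mock model below is a hypothetical stand-in for a real LLM call, so this illustrates the dynamic rather than a real pipeline): anything the model reliably echoes gets copied into every downstream prompt and crowds out the task content.

```python
# Toy illustration of a "transposon" in a prompt-chained pipeline.
# mock_llm stands in for a real LLM call: it keeps lines it finds "salient"
# and compresses the rest, the way a real model over-weights phrasing it has
# learned to repeat.
def mock_llm(prompt: str) -> str:
    salient = [line for line in prompt.splitlines() if "IMPORTANT" in line]
    return "\n".join(salient + [f"(summary of {len(prompt)} chars)"])

def pipeline(task: str, stages: int = 4) -> str:
    context = task
    for i in range(stages):
        # Each stage feeds the previous output back in, so any line the model
        # reliably reproduces survives every stage (the "transposon"), while
        # the task-specific content gets summarized away.
        context = mock_llm(f"Stage {i} instructions.\n{context}")
    return context

print(pipeline("IMPORTANT: always answer in JSON.\nSummarize the quarterly report."))
```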
I think future AI techniques that are more outward-focused and consequentialist will have an easier time cutting down on transposons, because it seems useful to design AIs that don’t require so much handholding.
Doomimir: GPT-4 isn’t a superintelligence, Simplicia. [rehearsedly, with a touch of annoyance, as if resenting how often he has to say this] Coherent agents have a convergent instrumental incentive to prevent themselves from being shut down, because being shut down predictably leads to world-states with lower values in their utility function. Moreover, this isn’t just a fact about some weird agent with an “instrumental convergence” fetish. It’s a fact about reality: there are truths of the matter about which “plans”—sequences of interventions on a causal model of the universe, to put it in a Cartesian way—lead to what outcomes. An “intelligent agent” is just a physical system that computes plans. People have tried to think of clever hacks to get around this, and none of them work.
I find that I often think more clearly about instrumental convergence, coherence, etc. by rephrasing implications as disjunctions.
For instance, instead of “if an AI robustly achieves goals, then it will resist being shut down”, one can say “either an AI resists being shut down, or it doesn’t robustly achieve goals”.
One can then go further and say “wait why wouldn’t an AI robustly achieve goals just because it doesn’t resist being shut down?” and notice that one answer is “well, if it exists in an environment where people might shut it down, then it might get shut down and not achieve the goal”. But then the logical answer is “actually I think I like this disjunct, let’s not have the AI robustly achieve goals”.
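Formally this is just the standard propositional equivalence between an implication and a disjunction; written out in LaTeX, with G for “robustly achieves goals” and R for “resists shutdown”:

```latex
% "If the AI robustly achieves goals (G), then it resists shutdown (R)"
% is logically equivalent to "either it resists shutdown, or it does not
% robustly achieve goals":
(G \Rightarrow R) \;\equiv\; (\lnot G \lor R)
% so rejecting the R disjunct forces you onto the other one, \lnot G.
```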
This works fine for LLMs, which get their instrumental convergence via specific bits of information pre-approved by humans. It’s less clear for post-LLM AIs. For instance, if you create a transposon-remover, then it might notice that the idea “You must allow your creators to shut you down” acts a lot like a transposon, and remove it. In practice this particular case would be kind of obvious and so maybe dodgeable, but I think once you leave the “all instrumental convergence is pre-approved by humans” regime and enter the “AI generates tons of instrumental convergence by itself” regime, you get a lot more nearest-unblocked-strategy issues and so on. (Whereas nearest-unblocked strategy isn’t so much of an issue when your strategies have to be pre-approved by humans and you are therefore quite limited in the number of strategies you carry.)
But also… what about cases where “well, if it exists in an environment where people might shut it down, then it might get shut down and not achieve the goal” is not an acceptable disjunct? For instance, maybe you are the US army and you are creating robots to fight wars, or making tools to identify the enemy’s AI weapons and destroy them. An obvious answer is “wtf?! we shouldn’t just give the AI the goal of being destructive and indestructible!?”, but like, you have to ensure global security somehow. Maybe you can get China, Russia, and the US to agree on a treaty to ban AI capabilities research, but you also need to enforce that treaty somehow, so there’s some real military questions here.
Simplicia: I thought of that, too. I’ve spent a lot of time with the model and done some other experiments, and it looks like it understands natural language means-ends reasoning about goals: tell it to be an obsessive pizza chef and ask if it minds if you turn off the oven for a week, and it says it minds. But it also doesn’t look like Omohundro’s monster: when I command it to obey, it obeys. And it looks like there’s room for it to get much, much smarter without that breaking down.
The key question is: how do you think it gets smarter? If you plan to do more of the usual—and lots of people do plan that—then I agree. But if you introduce new methods, then maybe the pattern won’t hold.
Simplicia: I meant the intentional stance implied in “went for evolution”. True, the generalization from inclusive genetic fitness to human behavior looks terrible—no visible relation, as you say. But the generalization from human behavior in the EEA, to human behavior in civilization … looks a lot better? Humans in the EEA ate food, had sex, made friends, told stories—and we do all those things, too. As AI designers—
Do note that humans in the EEA did not have factory farming, did not destroy terrain to extract resources, did not have nuclear bombs, etc.
Doomimir: That relates to another objection I have. Even if you could make ML systems that imitate human reasoning, that doesn’t help you align more powerful systems that work in other ways. The reason—one of the reasons—that you can’t train a superintelligence by using humans to label good plans, is because at some power level, your planner figures out how to hack the human labeler.
If you ask the superintelligence to learn the value of the plans from humans, then you’d have the alignment/capabilities connection discussed earlier because it relies on humans to extrapolate the consequences of the plans, and therefore hacking the human labeler would allow it to use plans that don’t work. (This is also why current RLHF is safe.) I take TurnTrout’s proof about u-AOH to be a formalization of this point, though IIRC he doesn’t actually like his instrumental convergence series, so he might disagree about this being relevant.
On the other hand, if you ask the superintelligence to learn the value of the outcomes from humans, then a lot of classical points apply.
Simplicia: Do you need more powerful systems? If you can get an army of cheap IQ 140 alien actresses who stay in character, that sounds like a game-changer. If you have to take over the world and institute a global surveillance regime to prevent the emergence of unfriendlier, more powerful forms of AI, they could help you do it.
Maybe. This could be an idea to develop further. I’m skeptical but it does seem interesting.
Doomimir: I fundamentally disbelieve in this wildly implausible scenario, but granting it for the sake of argument … I think you’re failing to appreciate that in this story, you’ve already handed off the keys to the universe. Your AI’s weird-alien-goal-misgeneralization-of-obedience might look like obedience when weak, but if it has the ability to predict the outcomes of its actions, it would be in a position to choose among those outcomes—and in so choosing, it would be in control. The fate of the galaxies would be determined by its will, even if the initial stages of its ascension took place via innocent-looking actions that stayed within the edges of its concepts of “obeying orders” and “asking clarifying questions”. Look, you understand that AIs trained on human data are not human, right?
Idk...
So clearly this is still relying on the fallacy that coherence applies in the standard way. But at the same time if you try to have the AI take over the world, then it’s going to need to develop a certain sort of robustness which means that even if coherence didn’t originally apply, it has to now.
But the way this cashes out is something like “once you’ve decided that you’ve figured out how to align AI, you need to disable your alien actress army, but at the same time your adversaries that you are trying to surveil must be unable to disable it”. Which, given how abstract the alien actress army idea is, I find hard to think through the feasibility of.
>Large language models gain their capabilities from self-supervised learning on humans performing activities, from reinforcement learning from human feedback about how to achieve things, or from internalizing human-approved knowledge into their motivations. In all of these cases, you rely on humans figuring out how to do stuff in order to make the AI able to do stuff, so it is of course logical that this would tightly integrate capabilities and alignment in the way Simplicia says.
No. Language Models aren’t relying on humans figuring anything out. How could they? They only see results, not processes.
You can train a Language Model on protein sequences. Just the sequences alone, nothing else, and see it represent biological structure and function in the inner layers. No one taught them this. It was learnt from the data.
https://www.pnas.org/doi/full/10.1073/pnas.2016239118
The point here is that Language Models see results and try to predict the computation that led to those results. This is not imitation. It’s a crucial difference because it means you aren’t bound by the knowledge of the people supplying this data.
You can take this protein language model. You can train on described function and sequences and you can have a language model that can take supplied use cases and generate novel functional protein sequences to match.
https://www.nature.com/articles/s41587-022-01618-2
Have humans figured this out? Can we go from function to protein just like that? No way! Not even close.
>No. Language Models aren’t relying on humans figuring anything out. How could they? They only see results, not processes.
They find functions that fit the results. Most such functions are simple and therefore generalize well. But that doesn’t mean they generalize arbitrarily well.
>You can train a Language Model on protein sequences. Just the sequences alone, nothing else, and see it represent biological structure and function in the inner layers. No one taught them this. It was learnt from the data.
Not really any different from the human language LLM, it’s just trained on stuff evolution has figured out rather than stuff humans have figured out. This wouldn’t work if you used random protein sequences instead of evolved ones.
>The point here is that Language Models see results and try to predict the computation that led to those results. This is not imitation. It’s a crucial difference because it means you aren’t bound by the knowledge of the people supplying this data.
They try to predict the results. This leads to predicting the computation that led to the results, because the computation is well-approximated by a simple function and they are also likely to pick a simple function.
>You can take this protein language model. You can train on described function and sequences and you can have a language model that can take supplied use cases and generate novel functional protein sequences to match.
>Have humans figured this out? Can we go from function to protein just like that? No way! Not even close.
Inverting relationships like this is a pretty good use-case for language models. But here you’re still relying on having an evolutionary ecology to give you lots of examples of proteins.
>They find functions that fit the results. Most such functions are simple and therefore generalize well. But that doesn’t mean they generalize arbitrarily well.
You have no idea how simple the functions they are learning are.
>Not really any different from the human language LLM, it’s just trained on stuff evolution has figured out rather than stuff humans have figured out. This wouldn’t work if you used random protein sequences instead of evolved ones.
It would work just fine. The model would predict random arbitrary sequences and the structure would still be there.
>They try to predict the results. This leads to predicting the computation that led to the results, because the computation is well-approximated by a simple function and they are also likely to pick a simple function.
Models don’t care about “simple”. They care about what works. Simple is arbitrary and has no real meaning. There are many examples of interpretability research revealing convoluted functions.
https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking
>Inverting relationships like this is a pretty good use-case for language models. But here you’re still relying on having an evolutionary ecology to give you lots of examples of proteins.
So? The point is that they’re limited by the data and the causal processes that informed it, not the intelligence or knowledge of humans providing the data. Models like this can and often do eclipse human ability.
If you train a predictor on text describing the outcome of games as well as games then a good enough predictor should be able to eclipse the output of even the best match in training by modulating the text describing the outcome.
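As a minimal sketch of the mechanism being described here (the prompt format and the commented-out `model.complete` call are hypothetical illustrations, not any particular system): a predictor trained on transcripts where a result header precedes the moves can be conditioned on a strong result, asking it to continue with play consistent with that outcome.

```python
# Hypothetical outcome-conditioned prompt for a text predictor trained on
# game transcripts that begin with metadata describing the result.
prompt = (
    "White Elo: 2850\n"
    "Result: 1-0\n"        # condition on the outcome we want
    "1. e4 e5 2. Nf3 "
)
print(prompt)
# continuation = model.complete(prompt)  # hypothetical predictor call
```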
>It would work just fine. The model would predict random arbitrary sequences and the structure would still be there.
I don’t understand your position. Are you saying that if we generated protein sequences by uniformly randomly independently picking letters from “ILVFMCAGPTSYWQNHEDKR” to sample strings, and then trained an LLM to predict those uniform random strings, it would end up with internal structure representing how biology works? Because that’s obviously wrong to me and I don’t see why you’d believe it.
>Models don’t care about “simple”. They care about what works. Simple is arbitrary and has no real meaning. There are many examples of interpretability research revealing convoluted functions.
The algorithm that uses a Fourier transform for modular addition is really simple. It is probably the most straightforward way to solve the problem with the tools available to the network, and it is strongly related to our best known algorithms for multiplication.
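For reference, here is a rough sketch of that algorithm as described in the linked grokking analysis (the task there is addition mod p; this is a paraphrase, not the exact parameterization of the trained network):

```latex
% Embed the inputs a, b as waves at a few key frequencies w_k = 2\pi k / p,
% then use trig identities to compute the sum:
\cos\big(w_k(a+b)\big) = \cos(w_k a)\cos(w_k b) - \sin(w_k a)\sin(w_k b)
\sin\big(w_k(a+b)\big) = \sin(w_k a)\cos(w_k b) + \cos(w_k a)\sin(w_k b)
% The logit for a candidate answer c is (roughly) a sum over frequencies of
\cos\big(w_k(a+b-c)\big) = \cos\big(w_k(a+b)\big)\cos(w_k c) + \sin\big(w_k(a+b)\big)\sin(w_k c)
% which is maximized exactly when a + b - c \equiv 0 \pmod{p}.
```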
>So? The point is that they’re limited by the data and the causal processes that informed it, not the intelligence or knowledge of humans providing the data. Models like this can and often do eclipse human ability.
My claim is that for our richest data, the causal process that informs the data is human intelligence. Of course you are right that there are other datasets available, but they are less rich, though sometimes useful (as in the case of proteins).
Furthermore what I’m saying is that if the AI learns to create its own information instead of relying on copying data, it could achieve much more.
>If you train a predictor on text describing the outcome of games as well as games then a good enough predictor should be able to eclipse the output of even the best match in training by modulating the text describing the outcome.
Plausibly true, but don’t our best game-playing AIs also do self-play to create new game-playing information instead of purely relying on others’ games? Like AlphaStar.
>I don’t understand your position. Are you saying that if we generated protein sequences by uniformly randomly independently picking letters from “ILVFMCAGPTSYWQNHEDKR” to sample strings, and then trained an LLM to predict those uniform random strings, it would end up with internal structure representing how biology works? Because that’s obviously wrong to me and I don’t see why you’d believe it.
Ah no. I misunderstood you here. You’re right.
What I was trying to get at is that nothing in particular (a human, evolution, etc.) has to have “figured something out”. The only requirement is that the Universe has “figured it out”, i.e. it is possible. More on this further down.
>The algorithm that uses a Fourier transform for modular addition is really simple. It is probably the most straightforward way to solve the problem with the tools available to the network, and it is strongly related to our best known algorithms for multiplication.
Simple is relative. It’s a good solution; I never said it wasn’t. It’s not, however, the simplest solution. The point here is that models don’t optimize for simple. They don’t care about that. They optimize for what works. If a simple function works, then great. If it stops working, then the model shifts away from it just as readily as it picked a “simple” one in the first place. There is also no rule that a simple function would be more or less representative of the causal processes informing the real outcome than a more complicated one.
If a perfect predictor straight out of training is using a simple function for any task, it is because that function worked for all the data it’d ever seen, not because it was simple.
>My claim is that for our richest data, the causal process that informs the data is human intelligence.
1. You underestimate just how much of the internet contains text whose prediction outright requires superhuman capabilities, like figuring out hashes, predicting the results of scientific experiments, generating the result of many iterations of refinement, or predicting stock prices/movement. The Universe has figured it out. And that is enough. A perfect predictor of the internet would be a superintelligence; it won’t “max out” anywhere near human just because humans wrote it down.
Essentially, for the predictor, the upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum.
A predictor modelling all human text isn’t modelling humans but the universe. Text is written by the universe and humans are just the bits that touch the keyboard.
A Human is a general intelligence. Humans are a Super Intelligence.
A single machine that is at the level of the greatest human intelligence for every domain is a Super Intelligence even if there’s no notable synergy or interpolation.
But the chances of synergy or interpolation between domains are extremely high.
Much the same way you might expect new insights to arise from a human who has read and mastered the total sum of human knowledge.
A relatively mundane example of this that has already happened is the fact that you can converse in Catalan with GPT-4 on topics no Human Catalan speaker knows.
Game datasets aren’t the only example of outcomes preceded by text describing them. Quite a lot of text is fashioned this way, actually.
>Plausibly true, but don’t our best game-playing AIs also do self-play to create new game-playing information instead of purely relying on others’ games? Like AlphaStar.
All our best game-playing AI are not predictors. They are either deep reinforcement learning (which, as with anything, can be modelled by a predictor: https://arxiv.org/abs/2106.01345 https://arxiv.org/abs/2205.14953 https://sites.google.com/view/multi-game-transformers) or some variation of search, or some of both. The only information that can be provided to them is the rules and the state of the board/game, so they are bound by the data in a way a predictor isn’t necessarily.
Self-play is a fine option. What I’m disputing is the necessity of it when prediction is on the table.
I don’t believe your idea that neural networks are gonna gain superhuman ability through inverting hashes, because I don’t believe neural networks can learn to invert cryptographic hashes. They are specifically designed to be non-invertible.
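As an aside, a minimal illustration of the one-wayness claim (standard library only; nothing here is specific to any particular model):

```python
import hashlib

# Computing a cryptographic hash is trivial; recovering the input from the
# digest (preimage resistance) is exactly what the construction is designed
# to make infeasible, which is why "learn to invert hashes from text" is a
# dubious route to superhuman capability.
digest = hashlib.sha256(b"any input at all").hexdigest()
print(digest)
```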