I remain unconvinced that there’s a predictive model of the world behind this statement, in the people who affirm it, that would ever allow them to say, “nah, LLMs aren’t deeply alien.”
If LLM cognition was not “deeply alien” what would the world look like?
What distinguishing evidence does this world display, that separates us from that world?
What would an only kinda-alien bit of cognition look like?
What would very human kind of cognition look like?
What different predictions does the world make?
Does the alienness come from the models, the weights themselves, having no “consistent beliefs” apart from their prompts? Would a human neocortex, deprived of a hippocampus, present any such persona? Is a human neocortex deeply alien? Are all the parts of a human brain deeply alien?
Is it because they “often spout completely non-human kinds of texts”? Is the Mersenne Twister deeply alien? What counts as “completely non-human”?
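(For reference, the Mersenne Twister invoked here is just a deterministic recurrence; it is, for instance, the PRNG behind Python’s `random` module. A minimal illustration of its thoroughly non-human yet fully reproducible output:)

```python
import random

# Python's random module implements the Mersenne Twister (MT19937):
# a fully deterministic bit-mixing recurrence. Same seed, same stream.
rng_a = random.Random(2024)
rng_b = random.Random(2024)

stream_a = [rng_a.random() for _ in range(5)]
stream_b = [rng_b.random() for _ in range(5)]

# Its "completely non-human" output is exactly reproducible.
assert stream_a == stream_b
```
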
Is it because they have no moral compass, being willing to continue any of the data on which they were trained? Does any human have a “moral compass” apart from the data on which they were trained? If I can use some part of my brain to improv a consistent Nazi, does that mean that it makes sense to call the part of my brain that lets me do that immoral or psychopathic?
Is it that the algorithms that we’ve found in DL so far don’t seem to slot into readily human-understandable categories? Would a not-deeply-alien algorithm be able to be cracked open to show us clear propositions of predicate logic? If we had a human neocortex in an oxygen-infused broth in front of us, and we recorded the firing of every cell, do we anticipate that the algorithms there would be clear propositions of predicate logic? Would we be compelled to conclude that human neocortexes were deeply alien?
Or is it deeply alien because we think the substrate of thought is different, based on backprop rather than local learning? What if local learning could actually approximate backpropagation? Or if more realistic non-backprop candidate brain algorithms actually… kind of just acted quite similarly to backprop, such that you could draw a relatively smooth line between them and backprop? Would this or similar research affect whether we thought brains were alien or not?
Does substrate-difference count as evidence against alien-ness, or does alien-ness just not make that kind of prediction? Is the cognition of an octopus less alien to us than the cognition of an LLM, because it runs on a more biologically similar substrate?
Does every part of a system need to fit into the average person’s ontology for the whole to not be deeply alien? That is, must every component be comprehensible to an untutored human before we can describe the system as not deeply alien? Is anything in the world not deeply alien by this standard?
To re-question: What predictions can I make about the world because LLMs are “deeply alien”?
Are these predictions clear?
When speaking to someone who I consider a noob, is it best to give them terms whose emotive import is clear, but whose predictive import is deeply unclear?
What kinds of contexts does this “deeply alien” statement come up in? Are those contexts ones where people are trying to explain, or to persuade?
If I piled up all the useful terms that I know that help me predict how LLMs behave, would “deeply alien” be an empty term on top of these?
Or would it give me no more predictive value than “many behaviors of an LLM are currently not understood”?
Most of my view on “deeply alien” is downstream of LLMs being extremely superhuman at literal next token prediction and generally superhuman at having an understanding of random details of webtext.
Another component corresponds to a general view that LLMs are trained in a very different way from how humans learn. (Though you could in principle get the same cognition from very different learning processes.)
This does correspond to specific falsifiable predictions.
Despite being pretty confident in “deeply alien” in many respects, I don’t find it clear whether LLMs will in practice have very different relative capability profiles from humans on the larger-scale downstream tasks we actually care about. (It currently seems to me like the answer will be “mostly no”.)
In addition to the above, I’d add in some stuff about how blank slate theory seems to be wrong as a matter of human psychology. If evidence comes out tomorrow that actually humans are blank slates to a much greater extent than I realized, so much so that e.g. the difference between human and dog brains is basically just size and training data, I’d be more optimistic that what’s going on inside LLMs isn’t deeply alien.
Re this discussion, I think that this claim:
If evidence comes out tomorrow that actually humans are blank slates to a much greater extent than I realized, so much so that e.g. the difference between human and dog brains is basically just size and training data
is basically correct, in the sense that a lot of the reason humans succeeded was culture + language, which amounts to both more training data and higher-quality data, and also that human brains have more favorable scaling laws than basically any other animal’s, because we dissipate heat far better than other animals and also have larger heads.
A lot of the reason you couldn’t make a dog as smart as a human is that its brain would almost certainly not fit in the birth canal, and if you solved that problem, you’d still have to handle heat dissipation, and dogs do not dissipate heat well compared to humans.
IMO, while I think the original blank slate hypothesis was incorrect, I do think a weaker version of it does work, and a lot of AI progress is basically the revenge of the blank slate people, in that you can have very weak priors and learning still works, both for capabilities and alignment.
That much I knew already at the time I wrote the comment. I know that e.g. human brains are basically just scaled-up chimp brains, etc. I definitely am closer to the blank slate end of the spectrum than the ‘evolution gave us tons of instincts’ end of the spectrum. But it’s not a total blank slate. What do you think would happen if we did a bunch of mad science to make super-dogs that had much bigger brains? And then took one of those dogs and tried to raise it like a human child? My guess is that it would end up moving some significant fraction of the distance from dog to human, but not crossing the gap entirely, and if it did end up fitting into human society, getting a job, etc., it would have very unusual fetishes at the very least, and probably all sorts of other weird desires and traits.
I agree with you, but I think my key crux is that ML people have more control over an AI’s data sources than humans have over their kids or super-dogs, and this control is only going to increase for boring capabilities reasons (primarily the need to get past the data wall; importantly, having control over the data lets you create very high quality data). And as it turns out, a lot of what an AI values is predicted very well if you know its data, and much less well if you only know its architecture.
And this suggests a fairly obvious alignment strategy. You’ve mentioned that the people working on it might not do it because they are too busy racing to superintelligence, and I agree that the major failure mode in practice will be companies being in such a Molochian race to the top for superintelligence that they can’t do any safety efforts.
This would frankly be terrifying, and I don’t want this to happen, but contra many on Lesswrong, I think there’s a real chance that AI is safe and aligned even in effectively 0 dignity futures like this, but I don’t think the chance is high enough that I’d promote a race to superintelligence at all.
OK. We might have some technical disagreement remaining about the promisingness of data-control strategies, but overall it seems like we are on basically the same page.
Zooming in on our potential disagreement though: “...as it turns out, a lot of what the AI values is predicted very well if you know its data.” Can you say more about what you mean by this and what your justification is? IMO there are lots of things about AI values that we failed to predict in advance (though usually it’s possible to tell a plausible story with the benefit of hindsight). idk. Curious to hear more.
What I mean by this is that if you want to predict what an AI will do (for example, whether it will do better than other models on a new capability, or what its values are like), and especially if you want to predict OOD behavior accurately, you would be far better off knowing its data sources and the quality of its data than knowing only its prior/architecture.
Re my justification: it comes mostly from this tweet thread, which points out that a lot of o1’s success could come from high-quality data. While I don’t like the argument that search/fancy bits aren’t happening at all (I do think o1 is doing a very small run-time search), I agree with the conclusion that data quality was probably most of the reason o1 is so good at coding.
Somewhat more generally, I’m pretty influenced by this post, and while I don’t go as far as claiming that an AI is entirely its dataset, I do think a weaker version of that claim is pretty likely to be true.
But one prediction we could have made in advance, had we known that data is a major factor in how AIs learn values, is that value misspecification would be a far less severe problem than 2000s–2010s thinking on LW assumed, and that value learning had a tractable direction of progress, since training on human books and language would load mostly-correct values into the model. Another prediction we could have made is that human values wouldn’t be all that complicated, and could instead be represented quite well by, say, several hundred megabytes to a gigabyte of code, which we could plausibly simplify further.
To be clear, I don’t think you could have assigned too high a probability to this prediction before LLMs, but you’d at least have had the hypothesis in serious consideration.
Cf here:
The answer isn’t embedded in 100 lines of Python, but in a subset of the weights of GPT-4. Notably, the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values. What we have now isn’t a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th percentile human can come up with. The human value function as expressed by GPT-4 is also immune to almost all in-practice, non-adversarial perturbations.
That quote is from here, on Matthew Barnett’s post about the historical value misspecification argument (note that I’m not claiming alignment is solved right now).
And see here, where it makes the point that, to the extent there’s a gap between loading in correct values and loading in capabilities, it’s that loading in values data is easier than loading in capabilities data. That kind of contradicts this post from @Rob Bensinger on motivations being harder to learn. One could have predicted this because there was a lot of data on human values, and a whole lot of the complexity of the values lives in the data, not the generative model; thus it’s very easy to learn values, but predictably harder to learn a lot of the most useful capabilities.
Again, we couldn’t have assigned too high a probability to this specific outcome happening, but you’d at least seriously consider the hypothesis.
From Rob Bensinger:
“Hidden Complexity of Wishes” isn’t arguing that a superintelligence would lack common sense, or that it would be completely unable to understand natural language. It’s arguing that loading the right *motivations* into the AI is a lot harder than loading the right understanding.
From @beren on alignment generalizing further than capabilities, in spiritual response to Bensinger:
In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly ‘for free’ from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic ‘real-world’ environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
But that’s how we could well have made predictions about AIs, or at least elevated these hypotheses to reasonable probability mass, in an alternate universe where LW didn’t anchor too hard on their previous models of AI like AIXI and Solomonoff induction.
Note that in order for my argument to go through, we also need the brain to be similar enough to DL systems that we can validly transfer insights from DL to the brain. While I don’t think you could have placed too high a probability on that 10–20 years ago, I do think it should at least have been considered a serious possibility, which LW mostly didn’t do.
However, we now have that evidence, and I’ll post links below:
These are a lot of questions, most of which I would guess are rhetorical, so I am not sure which ones you are actually interested in getting an answer to. Most of the specific questions I would answer with “no”, in that they don’t seem to capture what I mean by “alien”, or feel slightly strawman-ish.
Responding at a high-level:
There are a lot of experiments that seem like they shed light on the degree to which cognition in AI systems is similar to human or animal cognition. Some examples:
Does the base model pass a Turing test?
Does the performance distribution of the base model on different tasks match the performance distribution of humans?
Does the generalization and learning behavior of the base model match how humans learn things?
When trained using RL on things like game-environments (after pre-training on a language corpus), does the system learn at similar rates and plateau at similar skill levels as human players?
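A crude way to operationalize an experiment like the second bullet (comparing performance distributions) is to rank-correlate per-task scores between humans and the model. This is just an illustrative sketch in plain Python; the scores below are invented, not real benchmark numbers:

```python
def rankdata(xs):
    """1-based ranks; assumes no ties, for brevity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1)) formula."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-task accuracies (invented numbers, not real benchmarks).
human_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
model_scores = [0.95, 0.3, 0.8, 0.6, 0.5]

similarity = spearman(human_scores, model_scores)  # closer to 1.0 means a more human-like profile
```

A high rank correlation would say the model finds hard what humans find hard; a low one would be (weak) evidence for a genuinely different capability profile.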
There are a lot of structural and algorithmic properties that could match up between human and LLM systems:
Do they interface with the world in similar ways?
Do they require similar amounts and kinds of data to learn the same relationships?
Do the low-level algorithmic properties of how the two systems store and process information look similar?
A lot more stuff, but I am not sure how useful going into a long list here is. At least to me it feels like a real thing, and different observations would change the degree to which I would describe a system as alien.
I think the exact degree of alienness is really interesting and one of the domains where I would like to see more research.
For example, a bunch of the experiments I would most like to see, which seem helpful for AI alignment, are centered on better measuring the performance distribution of transformer architectures on tasks that are not primarily human imitation, so that we could better tell which things LLMs have a much easier time learning than humans. (Currently, even if a transformer could relatively easily reach vastly superhuman performance at a task with more specialized training data, observed performance at the task will cluster around human level because the training is oriented around human imitation; seeing where transformers could reach vastly superhuman performance would be quite informative for understanding the degree to which their cognition is alien.)
So I don’t consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.
I like a lot of these questions, although some of them give me an uncanny feeling akin to “wow, this is a very different list of uncertainties than I have.” I’m sorry that my initial list of questions was aggressive.
So I don’t consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.
I’m not sure how they add up to alienness, though? They’re about how we’re different from models, whereas the initial claim was that models are psychopathic, amoral, etc. If we say a model is “deeply alien”, is that just saying it’s different from us in lots of ways? I’m cool with that, but the surplus negative valence in “LLMs are like shoggoths” versus “LLMs have very different performance characteristics than humans” seems pretty important to me.
Otherwise, why not say that calculators are alien, or any of the things in existence with different performance curves than we have? Chessbots, etc. If I write a loop in Python to count to 10, the process by which it does so is arguably more different from how I count to ten than the process by which an LLM counts to ten, but we don’t call Python alien.
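For concreteness, the Python loop in question is something like this trivial sketch:

```python
def count_to_ten():
    """'Count' to ten the way a Python loop does: bare integer
    increments, nothing like the human process of counting."""
    numbers = []
    n = 1
    while n <= 10:
        numbers.append(n)
        n += 1
    return numbers

print(count_to_ten())  # prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```
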
This feels like reminding an economics student that the market solves things differently than a human—which is true—by saying “The market is like Baal.”
Do they require similar amounts and kinds of data to learn the same relationships?
There is a fun paper on this you might enjoy. Obviously not a total answer to the question.
The main difference between calculators, weather predictors, markets, and Python on the one hand, and LLMs on the other, is that LLMs can talk to you in a relatively strong sense of “talk”. So, by default, people don’t have mistaken impressions of the cognitive nature of calculators, markets, and Python, while they might have mistaken impressions of LLMs.
Like it isn’t surprising to most people that calculators are quite amoral in their core (why would you even expect morality?). But the claim that the thing which GPT-4 is built out of is quite amoral is non-obvious to people (though obvious to people with slightly more understanding).
I do think there is an important point which is communicated here (though it seems very obvious to people who actually operate in the domain).
I agree this can be initially surprising to non-experts!
I just think this point about the amorality of LLMs is much better communicated by saying “LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style.”
Agreed, though of course as always, there is the issue that that’s an intentional-stance way to describe what a language model does: “they will generally try to continue it in that style.” Hence mechinterp, which tries to (heh) move to a mechanical stance, which will likely be something like “when you give them a [whatever] text to continue, it will match [some list of features], which will then activate [some part of the network that we will name later], which implements the style that matches those features”.
(incidentally, I think there’s some degree to which people who strongly believe that artificial NNs are alien shoggoths are underestimating the degree to which their own brains are also alien shoggoths. but that doesn’t make it a good model of either thing. the only reason it was ever an improvement over a previous word was when people had even more misleading intuitive-sketch models.)
LLMs are trained to continue text from an enormous variety of sources
This is a bit of a noob question, but is this true post RLHF? Generally most of my interactions with language models these days (e.g. asking for help with code, asking to explain something I don’t understand about history/medicine/etc) don’t feel like they’re continuing my text, it feels like they’re trying to answer my questions politely and well. I feel like “ask shoggoth and see what it comes up with” is a better model for me than “go the AI and have it continue your text about the problem you have”.
To the best of my knowledge, the majority of research (all the research?) has found that the changes RLHF (or whatever descendant of RLHF is used) makes to an LLM’s text-continuation abilities are extremely superficial.
Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions (i.e., they share the top-ranked tokens). Most distribution shifts occur with stylistic tokens (e.g., discourse markers, safety disclaimers). These direct evidence strongly supports the hypothesis that alignment tuning primarily learns to adopt the language style of AI assistants, and that the knowledge required for answering user queries predominantly comes from the base LLMs themselves.
Or, in short, the LLM is still basically doing the same thing, with a handful of additions to keep it on-track in the desired route from the fine-tuning.
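The quoted paper’s headline measurement (per-position top-token agreement between base and tuned model) can be sketched on toy data. The distributions below are hypothetical numbers invented purely to illustrate the metric, not real model outputs:

```python
def top_token_agreement(base_dists, tuned_dists):
    """Fraction of positions at which the base and tuned models
    give the same token the highest probability."""
    agree = sum(
        max(b, key=b.get) == max(t, key=t.get)
        for b, t in zip(base_dists, tuned_dists)
    )
    return agree / len(base_dists)

# Toy per-position next-token distributions (made-up numbers).
base_model = [
    {"the": 0.6, "a": 0.4},
    {"cat": 0.7, "dog": 0.3},
    {"sat": 0.6, "ran": 0.4},
]
tuned_model = [
    {"the": 0.55, "a": 0.45},  # same top token, slightly shifted
    {"cat": 0.8, "dog": 0.2},  # same top token
    {"ran": 0.7, "sat": 0.3},  # a stylistic shift flips the top token
]

overlap = top_token_agreement(base_model, tuned_model)
```

The paper’s claim, in these terms, is that the overlap for real base/tuned pairs is near 1.0, with the disagreements concentrated on stylistic tokens.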
(I also think our very strong prior belief should be that LLMs are basically still text-continuation machines, given that 99.9% or so of the compute put into them is training them for this objective, and that neural networks lose plasticity as they learn. Ash and Adams is like a really good intro to this loss of plasticity, although most of the research that cites this is RL-related so people don’t realize.)
Similarly, a lot of people have remarked on how the textual quality of the responses from an RLHF’d language model can vary with the textual quality of the question. But of course this makes sense from a text-prediction perspective: in text, a high-quality answer is more likely to follow a high-quality question than a low-quality one. Preceding the model’s generation with high-quality text was the only way to get high-quality answers from base models, and that dynamic is still there, hidden.
So yeah, I do think this is a much better model for interacting with these things than asking a shoggoth. It actually gives you handles to interact with them better, while asking a shoggoth gives you no such handles.
The people who originally came up with the shoggoth meme, I’d bet, were very well aware of how LLMs are pretrained to predict text and how they are best modelled (at least for now) as trying to predict text. When I first heard the shoggoth meme that’s what I thought—I interpreted it as “it’s this alien text-prediction brain that’s been retrained ever so slightly to produce helpful chatbot behaviors. But underneath it’s still mostly just about text prediction. It’s not processing the conversation in the same way that a human would.” Mildly relevant: In the Lovecraft canon IIRC Shoggoths are servitor-creatures, they are basically beasts of burden. They aren’t really powerful intelligent agents in their own right, they are sculpted by their creators to perform useful tasks. So, for me at least, calling them shoggoth has different and more accurate vibes than, say, calling them Cthulhu. (My understanding of the canon may be wrong though)
Hmm, I think that’s a red herring though. Consider humans—most of them have read lots of text from an enormous variety of sources as well. Also while it’s true that current LLMs have only a little bit of fine-tuning applied after their pre-training, and so you can maybe argue that they are mostly just trained to predict text, this will be less and less true in the future.
How about “LLMs are like baby alien shoggoths, that instead of being raised in alien culture, we’ve adopted at birth and are trying to raise in human culture. By having them read the internet all day.”
(Come to think of it, I actually would feel noticeably more hopeful about our prospects for alignment success if we actually were “raising the AGI like we would a child.” If we had some interdisciplinary team of ML and neuroscience and child psychology experts that was carefully designing a curriculum for our near-future AGI agents, a curriculum inspired by thoughtful and careful analogies to human childhood, that wouldn’t change my overall view dramatically but it would make me noticeably more hopeful. Maybe brain architecture & instincts basically don’t matter that much and Blank Slate theory is true enough for our purposes that this will work to produce an agent with values that are in-distribution for the range of typical modern human values!)
(This doesn’t contradict anything you said, but it seems like we totally don’t know how to “raise an AGI like we would a child” with current ML. Like I don’t think it counts for very much if almost all of the training time is a massive amount of next-token prediction. Like a curriculum of data might work very differently on AI vs humans due to a vastly different amount of data and a different training objective.)
I’ve seen mixed data on how important curricula are for deep learning. One paper (on CIFAR) suggested that curricula only help if you have very few datapoints or the labels are noisy. But possibly that doesn’t generalize to LLMs.
I think data ordering basically never matters for LLM pretraining. (As in, random is the best and trying to make the order more specific doesn’t help.)
I remain unconvinced that there’s a predictive model of the world opposite this statement, in people who affirm it, that would allow them to say, “nah, LLMs aren’t deeply alien.”
If LLM cognition was not “deeply alien” what would the world look like?
What distinguishing evidence does this world display, that separates us from that world?
What would an only kinda-alien bit of cognition look like?
What would very human kind of cognition look like?
What different predictions does the world make?
Does alienness indicate that it is because the models, the weights themselves have no “consistent beliefs” apart from their prompts? Would a human neocortex, deprived of hippocampus, present any such persona? Is a human neocortex deeply alien? Are all the parts of a human brain deeply alien?
Is it because they “often spout completely non-human kinds of texts”? Is the Mersenne Twister deeply alien? What counts as “completely non-human”?
Is it because they have no moral compass, being willing to continue any of the data on which they were trained? Does any human have a “moral compass” apart from the data on which they were trained? If I can use some part of my brain to improv a consistent Nazi, does that mean that it makes sense to call the part of my brain that lets me do that immoral or psychopathic?
Is it that the algorithms that we’ve found in DL so far don’t seem to slot into readily human-understandable categories? Would a not-deeply-alien algorithm be able-to-be cracked open and show us clear propositions of predicate logic? If we had a human neocortex in an oxygen-infused broth in front of us, and we recorded the firing of every cell, do we anticipate that the algorithms there would be clear propositions of predicate logic? Would we be compelled to conclude that human neocortexes were deeply alien?
Or is it deeply alien because we think the substrate of thought is different, based on backprop rather than local learning? What if local learning could actually approximate backpropagation?. Or if more realistic non-backprop potential brain algorithms actually… kind just acted quite similarly to backprop, such that you could draw a relatively smooth line between them and backprop? Would this or more similar research impact whether we thought brains were aliens or not?
Does substrate-difference count as evidence against alien-ness, or does alien-ness just not make that kind of predictions? Is the cognition of an octopus less alien to us than the cognition of an LLM, because it runs on a more biologically-similar substrate?
Does every part of a system by itself need to fit into the average person’s ontology for the total to not be deeply alien; do we need to be able to fit every part within a system into a category comprehensible by an untutored human in order to describe it as not deeply alien? Is anything in the world not deeply alien by this standard?
To re-question: What predictions can I make about the world because LLMs are “deeply alien”?
Are these predictions clear?
When speaking to someone who I consider a noob, is it best to give them terms whose emotive import is clear, but whose predictive import is deeply unclear?
What kind of contexts does this “deeply alien” statement come up in? Are those contexts people are trying to explain, or to persuade?
If I piled up all the useful terms that I know that help me predict how LLMs behave, would “deeply alien” be an empty term on top of these?
Or would it give me no more predictive value than “many behaviors of an LLM are currently not understood”?
Most of my view on “deeply alien” is downstream of LLMs being extremely superhuman at literal next token prediction and generally superhuman at having an understanding of random details of webtext.
Another component corresponds to a general view that LLMs are trained in a very different way from how humans learn. (Though you could in principle get the same cognition from very different learning processes.)
This does correspond to specific falsifiable predictions.
Despite being pretty confident in “deeply alien” in many respects, it doesn’t seem clear to me whether LLMs will in practice have very different relative capability profiles from humans on larger scale downstream tasks we actually care about. (It currently seems like the answer will be “mostly no” from my perspective.)
In addition to the above, I’d add in some stuff about how blank slate theory seems to be wrong as a matter of human psychology. If evidence comes out tomorrow that actually humans are blank slates to a much greater extent than I realized, so much so that e.g. the difference between human and dog brains is basically just size and training data, I’d be more optimistic that what’s going on inside LLMs isn’t deeply alien.
Re this discussion, I think that this claim:
is basically correct, in the sense that a lot of the reason humans succeeded was essentially culture + language, which is essentially both increasing training data and increasing the data quality, and also that human brains have more favorable scaling laws than basically any other animals, because we dissipate heat way better than other animals, and also have larger heads.
A lot of the reason you couldn’t make a dog as smart as a human is because it’s brain would almost certainly not fit in the birth canal, and if you solved that problem, you’d have to handle heat dissipation, and dogs do not dissipate heat well compared to humans.
IMO, while I think the original blank slate hypothesis was incorrect, I do think a weaker version of it does work, and a lot of AI progress is basically the revenge of the blank slate people, in that you can have very weak priors and learning still works, both for capabilities and alignment.
That much I knew already at the time I wrote the comment. I know that e.g. human brains are basically just scaled up chimp brains etc. I definitely am closer to the blank slate end of the spectrum than the ‘evolution gave us tons of instincts’ end of the spectrum. But it’s not a total blank slate. What do you think would happen if we did a bunch of mad science to make super-dogs that had much bigger brains? And then took one of those dogs and tried to raise it like a human child? My guess is that it would end up moving some significant fraction of the distance from dog to human, but not crossing the gap entirely, and if it did end up fitting in to human society, getting a job, etc. it would have very unusual fetishes at the very least and probably all sorts of other weird desires and traits.
I agree with you, but I think my key crux is that I tend to think ML people have more control over an AI's data sources than humans have over their kids or super-dogs, and this is only going to increase for boring capabilities reasons (primarily the need to get past the data wall; importantly, having control over the data lets you create very high-quality data). And as it turns out, a lot of what an AI values is predicted very well if you know its data, and much less well if you only know its architecture.
And this suggests a fairly obvious alignment strategy. You’ve mentioned that the people working on it might not do it because they are too busy racing to superintelligence, and I agree that the major failure mode in practice will be companies being in such a Molochian race to the top for superintelligence that they can’t do any safety efforts.
This would frankly be terrifying, and I don’t want this to happen, but contra many on Lesswrong, I think there’s a real chance that AI is safe and aligned even in effectively 0 dignity futures like this, but I don’t think the chance is high enough that I’d promote a race to superintelligence at all.
OK. We might have some technical disagreement remaining about the promisingness of data-control strategies, but overall it seems like we are on basically the same page.
Zooming in on our potential disagreement though: "...as it turns out, a lot of what the AI values is predicted very well if you know its data." Can you say more about what you mean by this and what your justification is? IMO there are lots of things about AI values that we failed to predict in advance (though usually it's possible to tell a plausible story with the benefit of hindsight). idk. Curious to hear more.
What I mean by this is that if you want to predict what an AI will do (for example, whether it will do better than other models when given a new capability, or what its values are like), and especially if you want to predict OOD behavior accurately, you would be far better off knowing its data sources and the quality of that data than knowing only its prior/architecture.
Re my justification for it, my basic justification comes from this tweet thread, which points out that a lot of o1's success could come from high-quality data, and while I don't like the argument that search/fancy bits aren't happening at all (I do think o1 is doing a very small run-time search), I agree with the conclusion that data quality was probably most of the reason o1 is so good at coding.
https://x.com/aidanogara_/status/1838779311999918448
Somewhat more generally, I'm pretty influenced by this post, and while I don't go as far as claiming that all of what an AI is comes down to the dataset, I do think a weaker version of the claim is pretty likely to be true.
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
But one prediction we could have made in advance, if we knew that data was a major factor in how AIs learn values, is that value misspecification was likely to be a far less severe problem than 2000s-2010s thinking on LW assumed, and that value learning had a tractable direction of progress, since training on human books and language would load in mostly-correct values. Another prediction we could have made is that human values wouldn't be all that complicated, and could instead be represented quite well by a code of several hundred megabytes to a gigabyte, which we could plausibly simplify further.
To be clear, I don't think you could have placed too high a probability on this prediction before LLMs, but you'd at least have had the hypothesis in serious consideration.
Cf here:
From Matthew Barnett's post about the historical value misspecification argument (and note I'm not claiming that alignment is solved right now):
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7
and here, where it talks about the point that, to the extent there's a gap between loading in correct values and loading in capabilities, it's that loading in values data is easier than loading in capabilities data. This kind of contradicts the post from @Rob Bensinger on motivations being harder to learn. One could have predicted this because there was a lot of data on human values, and a whole lot of the complexity of the values is in the data, not the generative model; thus it's very easy to learn values, but predictably harder to learn a lot of the most useful capabilities.
Again, we couldn’t have too high a probability for this specific outcome happening, but you’d at least seriously consider the hypothesis.
From Rob Bensinger:
https://x.com/robbensinger/status/1648120202708795392
From @beren on alignment generalizing further than capabilities, in spiritual response to Bensinger:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
But that’s how we could well have made predictions about AIs, or at least elevated these hypotheses to reasonable probability mass, in an alternate universe where LW didn’t anchor too hard on their previous models of AI like AIXI and Solomonoff induction.
Note that in order for my argument to go through, we also need the brain to be similar enough to DL systems that we can validly transfer insights from DL to the brain, and while I don't think you could have placed too high a probability on that 10-20 years ago, I do think that at the very least this should have been considered as a serious possibility, which LW mostly didn't do.
However, we now have that evidence, and I’ll post links below:
https://www.lesswrong.com/posts/rjghymycfrMY2aRk5/llm-cognition-is-probably-not-human-like#KBpfGY3uX8rDJgoSj
https://x.com/BogdanIonutCir2/status/1837653632138772760
https://x.com/SharmakeFarah14/status/1837528997556568523
These are a lot of questions, my guess is most of which are rhetorical, so not sure which ones you are actually interested in getting an answer on. Most of the specific questions I would answer with “no”, in that they don’t seem to capture what I mean by “alien”, or feel slightly strawman-ish.
Responding at a high-level:
There are a lot of experiments that seem like they shed light on the degree to which cognition in AI systems is similar to human or animal cognition. Some examples:
Does the base model pass a Turing test?
Does the performance distribution of the base model on different tasks match the performance distribution of humans?
Does the generalization and learning behavior of the base model match how humans learn things?
When trained using RL on things like game-environments (after pre-training on a language corpus), does the system learn at similar rates and plateau at similar skill levels as human players?
There are a lot of structural and algorithmic properties that could match up between human and LLM systems:
Do they interface with the world in similar ways?
Do they require similar amounts and kinds of data to learn the same relationships?
Do the low-level algorithmic properties of how human brains store and process information look similar between the two systems?
A lot more stuff, but I am not sure how useful going into a long list here is. At least to me it feels like a real thing, and different observations would change the degree to which I would describe a system as alien.
I think the exact degree of alienness is really interesting and one of the domains where I would like to see more research.
For example, a bunch of the experiments I would most like to see, that seem helpful with AI Alignment, are centered on better measuring the performance distribution of transformer architectures on tasks that are not primarily human imitation, so that we could better tell which things LLMs have a much easier time learning than humans. (Currently, even if a transformer could relatively easily reach vastly superhuman performance at a task with more specialized training data, observed performance at the task will cluster around human level, because the training is oriented around human imitation; seeing where transformers could reach vastly superhuman performance would be quite informative for understanding the degree to which their cognition is alien.)
So I don’t consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.
I like a lot of these questions, although some of them give me an uncanny feeling akin to "wow, this is a very different list of uncertainties than I have." I'm sorry that my initial list of questions was aggressive.
I'm not sure how they add up to alienness, though? They're about how we're different from models, whereas the initial claim was that models are psychopathic, amoral, etc. If we say a model is "deeply alien", is that just saying it's different from us in lots of ways? I'm cool with that, but the surplus negative valence involved in "LLMs are like shoggoths" versus "LLMs have very different performance characteristics than humans" seems to me pretty important.
Otherwise, why not say that calculators are alien, or any of the things in existence with different performance curves than we have? Chessbots, etc. If I write a loop in Python to count to 10, the process by which it does so is arguably more different from how I count to ten than the process by which an LLM counts to ten, but we don’t call Python alien.
This feels like reminding an economics student that the market solves things differently than a human—which is true—by saying “The market is like Baal.”
There is a fun paper on this you might enjoy. Obviously not a total answer to the question.
The main difference between calculators, weather predictors, markets, and Python versus LLMs is that LLMs can talk to you in a relatively strong sense of "talk". So, by default, people don't have mistaken impressions of the cognitive nature of calculators, markets, and Python, while they might have mistaken impressions about LLMs.
Like it isn’t surprising to most people that calculators are quite amoral in their core (why would you even expect morality?). But the claim that the thing which GPT-4 is built out of is quite amoral is non-obvious to people (though obvious to people with slightly more understanding).
I do think there is an important point which is communicated here (though it seems very obvious to people who actually operate in the domain).
I agree this can be initially surprising to non-experts!
I just think this point about the amorality of LLMs is much better communicated by saying “LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style.”
Than to say “LLMs are like alien shoggoths.”
Like it’s just a better model to give people.
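The "continue it in that style" point can be made concrete even with a toy model. The sketch below is a word-level bigram model, nothing like a real LLM, and the two "training sources" are made-up strings, but it shows the mechanism: the statistics of what-follows-what differ by style, so the continuation inherits the style of the prompt.

```python
from collections import defaultdict

def train_bigram(corpus):
    """Build word-level bigram counts: next-word frequencies per word."""
    model = defaultdict(lambda: defaultdict(int))
    words = corpus.split()
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

def continue_text(model, prompt, n=5):
    """Greedy continuation: always pick the most frequent next word."""
    words = prompt.split()
    for _ in range(n):
        followers = model.get(words[-1])
        if not followers:
            break
        words.append(max(followers, key=followers.get))
    return " ".join(words)

# Two stylistically distinct "sources" mixed into one training corpus.
formal = "we hereby conclude that the results are sound . " * 3
casual = "lol yeah this thing is totally wild dude . " * 3
model = train_bigram(formal + casual)

print(continue_text(model, "we hereby"))  # stays formal
print(continue_text(model, "lol yeah"))   # stays casual
```

Same model, same weights; only the prompt differs, and the model carries on in whichever register it was handed. That is the intuition being offered in place of the shoggoth image.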
Agreed, though of course as always, there is the issue that that’s an intentional-stance way to describe what a language model does: “they will generally try to continue it in that style.” Hence mechinterp, which tries to (heh) move to a mechanical stance, which will likely be something like “when you give them a [whatever] text to continue, it will match [some list of features], which will then activate [some part of the network that we will name later], which implements the style that matches those features”.
(incidentally, I think there’s some degree to which people who strongly believe that artificial NNs are alien shoggoths are underestimating the degree to which their own brains are also alien shoggoths. but that doesn’t make it a good model of either thing. the only reason it was ever an improvement over a previous word was when people had even more misleading intuitive-sketch models.)
This is a bit of a noob question, but is this true post RLHF? Generally most of my interactions with language models these days (e.g. asking for help with code, asking to explain something I don’t understand about history/medicine/etc) don’t feel like they’re continuing my text, it feels like they’re trying to answer my questions politely and well. I feel like “ask shoggoth and see what it comes up with” is a better model for me than “go the AI and have it continue your text about the problem you have”.
To the best of my knowledge, the majority of research (all the research?) has found that the changes to a LLM’s text-continuation abilities from RLHF (or whatever descendant of RLHF is used) are extremely superficial.
So you have one paper, from the abstract:
Or, in short, the LLM is still basically doing the same thing, with a handful of additions to keep it on-track in the desired route from the fine-tuning.
(I also think our very strong prior belief should be that LLMs are basically still text-continuation machines, given that 99.9% or so of the compute put into them is spent training for this objective, and that neural networks lose plasticity as they learn. Ash and Adams is a really good intro to this loss of plasticity, although most of the research that cites it is RL-related, so people don't realize it applies here.)
Similarly, a lot of people have remarked on how the textual quality of the responses from an RLHF'd language model can vary with the textual quality of the question. But of course this makes sense from a text-prediction perspective: a high-quality answer is more likely to follow a high-quality question than a low-quality one. This kind of thing, preceding the model's generation with high-quality text, was the only way to get high-quality answers from base models, and it's still there, hidden.
So yeah, I do think this is a much better model for interacting with these things than asking a shoggoth. It actually gives you handles to interact with them better, while asking a shoggoth gives you no such handles.
The people who originally came up with the shoggoth meme, I’d bet, were very well aware of how LLMs are pretrained to predict text and how they are best modelled (at least for now) as trying to predict text. When I first heard the shoggoth meme that’s what I thought—I interpreted it as “it’s this alien text-prediction brain that’s been retrained ever so slightly to produce helpful chatbot behaviors. But underneath it’s still mostly just about text prediction. It’s not processing the conversation in the same way that a human would.” Mildly relevant: In the Lovecraft canon IIRC Shoggoths are servitor-creatures, they are basically beasts of burden. They aren’t really powerful intelligent agents in their own right, they are sculpted by their creators to perform useful tasks. So, for me at least, calling them shoggoth has different and more accurate vibes than, say, calling them Cthulhu. (My understanding of the canon may be wrong though)
(TBC, I totally agree that object level communication about the exact points seems better all else equal if you can actually do this communication.)
Hmm, I think that’s a red herring though. Consider humans—most of them have read lots of text from an enormous variety of sources as well. Also while it’s true that current LLMs have only a little bit of fine-tuning applied after their pre-training, and so you can maybe argue that they are mostly just trained to predict text, this will be less and less true in the future.
How about “LLMs are like baby alien shoggoths, that instead of being raised in alien culture, we’ve adopted at birth and are trying to raise in human culture. By having them read the internet all day.”
(Come to think of it, I actually would feel noticeably more hopeful about our prospects for alignment success if we actually were “raising the AGI like we would a child.” If we had some interdisciplinary team of ML and neuroscience and child psychology experts that was carefully designing a curriculum for our near-future AGI agents, a curriculum inspired by thoughtful and careful analogies to human childhood, that wouldn’t change my overall view dramatically but it would make me noticeably more hopeful. Maybe brain architecture & instincts basically don’t matter that much and Blank Slate theory is true enough for our purposes that this will work to produce an agent with values that are in-distribution for the range of typical modern human values!)
(This doesn’t contradict anything you said, but it seems like we totally don’t know how to “raise an AGI like we would a child” with current ML. Like I don’t think it counts for very much if almost all of the training time is a massive amount of next-token prediction. Like a curriculum of data might work very differently on AI vs humans due to a vastly different amount of data and a different training objective.)
I’ve seen mixed data on how important curricula are for deep learning. One paper (on CIFAR) suggested that curricula only help if you have very few datapoints or the labels are noisy. But possibly that doesn’t generalize to LLMs.
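For concreteness, the curriculum-vs-shuffled distinction that kind of paper tests is just a choice of batch ordering. A minimal sketch (the toy corpus and the length-as-difficulty proxy are made up for illustration):

```python
import random

def make_batches(examples, batch_size, curriculum=False, difficulty=len):
    """Order a dataset either randomly (standard LLM pretraining practice)
    or easy-to-hard by a difficulty proxy (curriculum learning)."""
    if curriculum:
        data = sorted(examples, key=difficulty)
    else:
        data = random.sample(examples, len(examples))
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# Hypothetical corpus; sequence length stands in for "difficulty".
corpus = ["hi", "a cat", "the quick brown fox jumps",
          "a much longer and harder sentence here"]
print(make_batches(corpus, batch_size=2, curriculum=True))
```

The CIFAR-style result mentioned above would say the `curriculum=True` branch rarely beats plain shuffling unless data is scarce or labels are noisy.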
I think data ordering basically never matters for LLM pretraining. (As in, random is the best and trying to make the order more specific doesn’t help.)
That was my impression too.