I’m a machine learning engineer in Silicon Valley with an interest in AI alignment and safety.
Existing large tech companies are using approaches like this, training or fine-tuning small models on data generated by large ones.
For example, it’s helpful for the cold start problem, where you don’t yet have user input to train/fine-tune your small model on because the product the model is intended for hasn’t been launched yet: have a large model create some simulated user input, train the small model on that, launch a beta test, and then retrain your small model with real user input as soon as you have some.
I’ve been thinking for a while that one could do syllabus learning for LLMs. It’s fairly easy to classify text by reading age. So start training the LLM on only text with a low reading age, and then increase the ceiling on reading age until it’s training on the full distribution of text. (https://arxiv.org/pdf/2108.02170.pdf experimented with curriculum learning in early LLMs, with little effect, but oddly didn’t test reading age.)
To avoid distorting the final training distribution by much, you would need to be able to raise the reading-age limit fairly fast, so that by the time it has reached maximum you’ve only used up, say, ten percent of the text with low reading ages, and in the final training distribution those texts are only about ten percent underrepresented. So the LLM is still capable of generating children’s stories if needed (just slightly less likely to do so randomly).
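To make the mechanism concrete, here’s a minimal sketch of the data-ordering side of this, using the Flesch-Kincaid grade level as a stand-in for reading age; the stage ceilings and the ten-percent fraction are just illustrative values, not tuned choices:

```python
import random
import re

def flesch_kincaid_grade(text: str) -> float:
    """Approximate US reading grade level (Flesch-Kincaid grade formula)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Crude syllable estimate: count runs of vowels in each word.
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w))) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

def reading_age_curriculum(docs, ceilings=(5.0, 8.0, 12.0), early_frac=0.10):
    """Order docs so training starts at low reading ages and the ceiling rises fast.

    Only `early_frac` of the text under each early ceiling is spent in that stage;
    everything else is shuffled into the final, unrestricted stage, so the overall
    training distribution stays close to the natural one.
    """
    remaining = [(flesch_kincaid_grade(d), d) for d in docs]
    schedule = []
    for ceiling in ceilings:
        eligible = [x for x in remaining if x[0] <= ceiling]
        random.shuffle(eligible)
        take = eligible[:int(early_frac * len(eligible))]
        schedule.extend(doc for _, doc in take)
        taken = set(map(id, take))
        remaining = [x for x in remaining if id(x) not in taken]
    random.shuffle(remaining)
    schedule.extend(doc for _, doc in remaining)
    return schedule
```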
The hope is that this would improve quality faster early in the training run, so that the LLM sooner reaches a level where it can extract more benefit even from the more difficult texts, and hopefully ends up at a slightly higher final quality from the same amount of training data and compute. Otherwise, the really difficult texts that happen to be used early in the training run presumably give the LLM less value than they would have if they’d come later in the training. I’d expect any resulting improvement to be fairly small, but then this isn’t very hard to do.
A more challenging approach would be to do the early training on low-reading-age material in a smaller LLM, potentially saving compute, and then do something like add more layers near the middle, or distill the behavior of the small LLM into a larger one, before continuing the training. Here the aim would be to also save some compute during the early parts of the training run. Potential issues would be if the distillation process or loss of quality from adding new randomly-initialized layers ended up costing more compute/quality than we’d saved/gained.
[In general, the Bitter Lesson suggests that sadly the time and engineering effort spent on these sorts of small tweaks might be better spent on just scaling up more.]
I’d really like to have a better solution to alignment than one that relied entirely on something comparable to sensor hardening.
What are your thoughts on how value learning interacts with E.L.K.? Obviously the issue with value learning is that it only helps with outer alignment, not inner alignment: you’re transforming the problem from “How do we know the machine isn’t lying to us?” to “How do we know that the machine is actually trying to learn what we want (which includes not being lied to)?” It also explicitly requires the machine to build a model of “what humans want”, and then the complexity level and latent knowledge content required is fairly similar between “figure out what the humans want and then do that” and “figure out what the humans want and then show them a video of what doing that would look like”.
Maybe we should just figure out some way to do surprise inspections on the vault? :-)
If we can solve enough of the alignment problem, the rest gets solved for us.
If we can get a half-assed approximate solution to the alignment problem, sufficient to semi-align a STEM-capable AGI value learner of about smart-human level well enough to not kill everyone, then it will be strongly motivated to solve the rest of the alignment problem for us, just as the ‘sharp left turn’ is happening, especially if it’s also going Foom. So with value learning, there is a region of convergence around alignment.
Or, to reuse one of Eliezer’s metaphors: if we can point the rocket on approximately the right trajectory, it will automatically lock on and course-correct from there.
Subproblem 1.2/2.1: Traps
Allowing traps in the environment creates two different problems:
(Subproblem 1.2) Bayes-optimality becomes intractable in a very strong sense (even for a small number of deterministic MDP hypotheses with small number of states).
(Subproblem 2.1) It’s not clear how to talk about learnability and learning rates.
It makes some sense to consider these problems together, but different directions emphasize different sides.
Evolved organisms (such as humans) are good at dealing with traps: getting eaten is always a possibility. At the simplest level they do this by having multiple members of the species die, and using an evolutionary learning mechanism to evolve detectors for potential trap situations and some trap-avoiding behavior for this to trigger. An example of this might be the human instinct of vertigo near cliff edges — it’s hard not to step back. The cost of this is that some number of individuals die from the traps before the species evolves a way of avoiding the trap.
As a sapient species using the scientific method, we have more sophisticated ways to detect traps. Often we may have a well-supported model of the world that lets us predict and avoid a trap (“nuclear war could well wipe out the human race, let’s not do that”). Or we may have an unproven theory that predicts a possible trap, but that also predicts some less dangerous phenomenon. So rather than treating the universe like a multi-armed bandit and jumping into the potential trap to find out what happens and test our theory, we perform the lowest risk/cost experiment that will get us a good Bayesian update on the support for our unproven theory, hopefully at no cost to life or limb. If that raises the theory’s support, then we become more cautious about the predicted trap, or if it lowers it, we become less. Repeat until your Bayesian updates converge on either 100% or 0%.
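As a toy illustration of that update loop (the numbers here are invented, just to show the shape of the process):

```python
def bayes_update(prior, p_obs_if_true, p_obs_if_false):
    """Posterior probability of the trap-predicting theory after one observation."""
    evidence = prior * p_obs_if_true + (1 - prior) * p_obs_if_false
    return prior * p_obs_if_true / evidence

p = 0.30   # start 30% confident the theory (and hence the trap) is real
for _ in range(3):
    # Each cheap, low-risk experiment yields an observation that is 4x more
    # likely if the theory is true than if it is false.
    p = bayes_update(p, p_obs_if_true=0.8, p_obs_if_false=0.2)
    print(round(p, 3))   # 0.632, 0.873, 0.965 -- approaching confidence in the
                         # trap without ever stepping into it
```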
An evolved primate heuristic for this is “if nervous of an unidentified object, poke it with a stick and see what happens”. This of course works better on, say, live/dead snakes than on some other perils that modern technology has exposed us to.
The basic trick here is to have a world model sophisticated enough that it can predict traps in advance, and we can find hopefully non-fatal ways of testing them that don’t require us to jump into the trap. This requires that the universe has some regularities strong enough to admit models like this, as ours does. Likely most universes that didn’t would be uninhabitable and life wouldn’t evolve in them.
My exposure to the AI and safety ethics community’s thinking has primarily been via LW/EA and papers, so it’s entirely possible that I have a biased sample.
I had another thought on this. Existing deontological rules are intended for humans. Humans are optimizing agents, and they’re all of about the same capacity (members of a species that seems, judging by the history of stone tool development, to have been sapient for maybe a quarter million years, so possibly only just over the threshold for sapience). So there is another way in which deontological rules reduce cognitive load: generally we’re thinking about our own benefit and that of close family and friends. It’s ‘not our responsibility’ to benefit everyone in the society — all of them are already doing that, looking out for themselves. So that might well explain why standard deontological rules concentrate on avoiding harm to others, rather than doing good to others.
AGI, on the other hand, firstly may well be smarter than all the humans, possibly far smarter, so may have the capacity to do for humans things they can’t do for themselves, possibly even for a great many humans. Secondly, its ethical role is not to help itself and its friends, but to help humans: all humans. It ought to be acting selflessly. So its duty to humans isn’t just to avoid harming them and let them go about their business, but to actively help them. So I think deontological rules for an AI, if you tried to construct them, should be quite different in this respect than deontological rules for a human, and should probably focus just as much on helping as on not harming.
As long as all the agentic AGIs people are building are value learners (i.e. their utility function is hard-coded to something like “figure out what utility function humans in aggregate would want you to use if they understood the problem better, and use that”), then improving their understanding of human values becomes a convergent instrumental strategy for them: obviously, the better they understand the human-desired utility function, the better job they can do of optimizing it. In particular, if the AGI’s capabilities are large, so that many of the things it can do lie outside the region of validity of its initial model of human values, and it also understands the concept of the region of validity of a model (a rather basic, obviously required capability for an AGI that can do research, so this seems like a reasonable assumption), then it can’t use most of its capabilities safely, and solving that problem obviously becomes top priority. This is painfully obvious to us, so it should also be painfully obvious to an AGI capable of doing research.
In that situation, a fast takeoff should just cause you to get an awful lot of AGI intelligence focused on the problem of solving alignment. So, as the author mentions, perhaps we should be thinking about how we would maintain human supervision in that eventuality? That strikes me as a particular problem that I’d feel more comfortable having solved by a human alignment researcher than an AGI one.
I’m not an ethical philosopher, but my intuition, based primarily on personal experience, is that deontological ethics are a collection of heuristic rules of thumb extracted from the average answers of utilitarian ethics applied to a common range of situations that often crop up between humans. (I also view this as a slightly-idealized description of the legal system.) As such, they’re useful primarily in the same ways that heuristics often are useful compared to actually calculating a complex function, by reducing computational load. For people, they also provide useful markers to avoid ‘slippery slope’ situations where personal benefit might encourage you to err on one side in a complex estimation/calculation of overall utility. They also provide a way of trying to settle arguments: “I didn’t break any deontological ethical rules” is often a defense in the court of public opinion, and is often less contentious than “utilitarian ethics support my actions”.
As such, my feeling is that a powerful AGI should be better able than a human to handle the computational load; it is more likely to encounter situations that are ‘out of distribution’ (atypical or even weird) compared to a human, which might take these heuristics outside their range of validity; it ought to be more capable of computing a utility function without personal bias; and it is likely to be smart enough to find ways to ‘rules lawyer’ corner cases that the deontological heuristics don’t handle well. So for a sufficiently smart AGI, I would strongly suspect that even well-implemented deontological ethics would be more dangerous than well-implemented utilitarian ethics. But I’m mostly working from the software-engineer intuition that I don’t really trust a spaghetti-code ball of heuristics — so this isn’t a philosophical argument.
However, for less capable AI systems, ones not powerful enough to run a good utilitarian value function, a set of deontological ethical heuristics (and also possibly-simplified summaries of relevant laws) might well be useful to reduce computational load, if these were carefully crafted to cover the entire range of situations that they are likely to encounter (and especially with guides for identifying when a situation was outside that range and it should consult something more capable). However, the resulting collection of heuristics might look rather different from the deontological ethical rules I’d give a human child.
More broadly, most people in the AI alignment space that I’ve seen approaching the problem of either describing human values to an AI, or having it learn them, have appeared to view ethics from a utilitarian/consequentialist rather than a deontological perspective, and tend to regard this prospect as very challenging and complex — far more so than if you just had to teach the machine a list of deontological ethical rules. So my impression is that most people in AI safety and alignment are not using a deontological viewpoint — I’d love to hear whether that has been your experience too. Indeed, my suspicion is that many of them would view that as either oversimplified, or unlikely to continue to work well as rapid technological change enabled by AGI caused a large number of new ethical conundrums to appear that we don’t yet have a social consensus on deontological rules for.
For example, my personal impression is that many human societies are still arguing about changes in deontological ethics in response to the easy availability of birth control, something that we’ve had for O(60) years. In the presence of AGI, rates of technological change could well increase massively, and we could face ethical conundrums far more complex than those posed by birth control.
I thought this was a very interesting paper — I particularly liked the relationship to phase transitions.
However, I think there’s likely to be another ‘phase’ that they don’t discuss (possibly it didn’t crop up in their small models, since it’s only viable in a sufficiently large model): one where you pack a very large number of features (thousands or millions, say) into a fairly large number of dimensions (hundreds, say). In spaces with dimensionality >= O(100), the statistics of norm and dot product are such that even randomly chosen unit-norm vectors are almost invariably nearly orthogonal. So gradient descent would have very little work to do to find a very large number of vectors (much larger than the number of dimensions) that are all mutually almost-orthogonal, so that they show very little interference between them. This is basically the limiting case of the pattern observed in the paper of packing n features in superposition into d dimensions where n > d >= 1, taking this towards the limit where n >> d >> 1.
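A quick numerical check of the near-orthogonality claim (nothing here is specific to any particular model; it’s just the geometry of random directions):

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n_vectors: int, dim: int) -> float:
    """Largest |cosine similarity| among n random unit vectors in R^dim."""
    v = rng.standard_normal((n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = v @ v.T
    np.fill_diagonal(cos, 0.0)      # ignore each vector's similarity with itself
    return float(np.abs(cos).max())

for dim in (10, 100, 1000):
    print(dim, round(max_abs_cosine(1000, dim), 3))
# Typical pairwise cosines scale like 1/sqrt(dim), so by dim ~ 100 a thousand
# random directions already interfere with one another only weakly.
```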
Intuitively this phase seems particularly likely in a context like the residual stream of an LLM, where (in theory, if not in practice) the embedding space is invariant under arbitrary rotations, so there’s no obvious reason to expect vectors to align with the coordinate axes. On the other hand, in a system where there was a preferred basis (such as a system using L1 regularization), you might get such vectors that were themselves sparse, with most components zero but a significant number of non-zero components, enough for the randomness to still give low interference.
More speculatively, in a neural net that was using at least some of its neurons in this high-dimensionality dense superposition phase, the model will presumably learn ways to manipulate these vectors to do computation in superposition. One possibility for this might be methods comparable to some of the possible Vector Symbolic Architectures (also known as hyperdimensional computing) outlined in e.g. https://arxiv.org/pdf/2106.05268.pdf. Of the primitives used in that, a fully connected layer can clearly be trained to implement both addition of vectors and permutations of their elements; I suspect something functionally comparable to the vector elementwise-multiplication (Hadamard product) operation could be produced by using the nonlinearity of a smooth activation function such as GELU or Swish, and I suspect their clean-up memory operation could be implemented using attention. If it turned out to be the case that SGD actually often finds solutions of this form, then an understanding of vector symbolic architectures might be helpful for interpretability of models where portions of them used this phase. This seems most likely in models that need to pack vast numbers of features into large numbers of dimensions, such as modern large LLMs.
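For concreteness, here is a minimal sketch of those VSA primitives themselves (superpose by addition, bind by elementwise multiplication, clean-up by nearest codebook match). This is just the textbook scheme from the hyperdimensional-computing literature, not a claim about what any trained network actually implements, and the dimension and codebook are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 1024   # high enough that random codes are nearly orthogonal

def random_hv():
    """Random +/-1 hypervector."""
    return rng.choice([-1.0, 1.0], size=DIM)

codebook = {name: random_hv() for name in ("red", "blue", "circle", "square")}
roles = {"color": random_hv(), "shape": random_hv()}

def bind(a, b):        # elementwise (Hadamard) product: role-filler binding
    return a * b

def superpose(*vs):    # addition: store several bound pairs in one vector
    return np.sum(vs, axis=0)

def cleanup(v, book):  # "clean-up memory": nearest codebook entry by dot product
    return max(book, key=lambda k: float(v @ book[k]))

# Encode "a red circle" as a single DIM-dimensional vector...
scene = superpose(bind(roles["color"], codebook["red"]),
                  bind(roles["shape"], codebook["circle"]))
# ...then query it: unbinding with a role vector recovers a noisy filler,
# and clean-up snaps it back to the nearest known symbol.
print(cleanup(bind(scene, roles["color"]), codebook))   # -> "red"
print(cleanup(bind(scene, roles["shape"]), codebook))   # -> "circle"
```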
My view is that we’ve already made some significant progress on alignment, compared to say where we were O(15) years ago, and have also had some unexpectedly lucky breaks. Personally I’d list:
Value learning, as a potential solution to issues like corrigibility and the shut-down problem.
Once your value learner is a STEM-capable AGI, then doing or assisting with alignment research becomes a convergent instrumental strategy for it.
The closest thing we currently have to an AGI, LLMs, are fortunately not particularly agentic; they’re more of a tool AI (until you wrap them in a script to run them in a loop with suitable prompts).
To be more specific, for the duration of generating a specific document (at least before RLHF), an LLM emulates the output of a human or humans generating text, so to the extent that they pick up/emulate agentic behavior from us, it’s myopic past the end of the document, and emulates some human(s) who have contributed text to their training set. Semi-randomly-chosen humans are a type of agent that humans are unusually good at understanding and predicting. The orthogonality thesis doesn’t apply to them: they will have an emulation of some version of human values. Like actual random humans, they’re not inherently fully aligned, but on average they’re distinctly better than paperclip maximizers. (Also, both RLHF and prompts can alter the random distribution.)
While human values are large and fragile, LLMs are capable of capturing fairly good representations of large fragile things, including human values. So things like constitutional RL work. That still leaves concerns about what happens when we apply optimization pressure or distribution shifts to these representations of human values, but it’s at least a lot better than expecting us to hand-craft a utility function for the entirety of human values in symbolic form. If we could solve the problem of knowing when an LLM representation of human values was out-of-distribution and not reliable, then we might actually have a basis for an AGI-alignment solution that I wouldn’t expect to immediately kill everyone. (For example, it might make an acceptable initial setting to preload into an AGI value learner that could then refine it and extend its region of validity.) Even better, knowing when an LLM isn’t able to give a reliable answer is a capabilities problem, not just an alignment problem, since it’s the same issue as getting an LLM to reply “I don’t know” when asked a question to which it would otherwise have hallucinated a false answer. So all of the companies buying and selling access to LLMs are strongly motivated to solve this. (Indeed, leading LLM companies appear to have made significant progress on reducing hallucination rates in the last year.)
This is a personal list and I’m sure will be missing some items.
That we’ve made some progress and had some lucky breaks doesn’t guarantee that this will continue, but it’s unsurprising to me that
alignment research in the context of a specific technology that we can actually experiment with is easier than trying to do alignment research in abstract for arbitrary future systems, and that
with more people interested in alignment research we’re making progress faster.
I agree that, at least for the more serious risks, there doesn’t seem to be consensus on what the mitigations should be.
For example, I’d be interested to know what proportion of alignment researchers would consider an AGI that’s a value learner (and of course has some initial model of human values created by humans to start that value learning process from) to have better outer-alignment safety properties than an AGI with a fixed utility function created by humans.
To me it seems very clear that the former is better, as it incentivizes the AGI to converge from its initial model of human values towards true human values, allowing it to fix problems when the initial model, say, goes out-of-distribution or doesn’t have sufficient detail. But I have no idea how much consensus there is on this, and I see a lot of alignment researchers working on approaches that don’t appear to assume that the AI system is a value learner.
The best solution I can think of to outer-aligning an AGI capable of doing STEM research is to build one that’s a value learner and an alignment researcher. Obviously for a value learner, doing alignment research is a convergent instrumental strategy: it wants to do whatever humans want, so it needs to better figure out what that is so it can do a better job. Then human values become an attractor.
However, to implement this strategy, you first need to build a value-learning AGI capable of doing STEM research (which obviously we don’t yet know how to do) that is initially sufficiently aligned to human values that it starts off inside the basin of attraction. I.e. it needs a passable first guess at human values for it to improve upon: one that’s sufficiently close that a) it doesn’t kill us all in the meantime while its understanding of our values is converging, b) it understands that we want things from it like honesty, corrigibility, willingness to shut down, fairness and so forth, and c) it understands that we can’t give it a complete description of human values because we don’t fully understand them ourselves.
Your suggestion of using something like an LLM to encode a representation of human values is exactly along the lines that I think we should be thinking for that ‘initial starting value’ of human values for a value-learning AGI. Indeed, there are already researchers building ethical question testing sets for LLMs.
An interesting paper on successfully distinguishing different mechanisms inside image classification models: https://arxiv.org/pdf/2211.08422.pdf — for this small model they correspond to different, disconnected local minima of the loss function (I assume basically because it only has enough capacity to implement one strategy really well, so it has to pick one). They even outline approaches to move models from one mechanism that doesn’t generalize well to another that does.
I don’t immediately see how to extend this to the sort of different mechanisms that Paul was discussing, but it feels like it might be relevant; albeit the mechanisms might be a lot less clearly separable on something as complex and multi-task-capable as an AGI, which might well need to learn multiple capabilities (possibly including deceit) and then have a way of deciding which one to apply in a particular case.
One thing that is pretty clear is that an honest mechanism and a deceitful mechanism are going to have very different latent knowledge inside them: “how do I keep the diamond safe?” and “how do I tamper with the sensors so the diamond looks safe?” are very different problems. They’re also potentially of different difficulty levels, which might have a big effect on which one gradient descent, or indeed smart AGI optimization, is going to find a solution to first. If our sensors were hardened enough to make fooling them really difficult, that might make finding a passable (and improvable) approach to vault safety much easier than fooling the humans, at least for gradient descent. Of course, while gradient descent generally stays in whatever local minimum it found first, an AGI doing optimization probably doesn’t have that limitation, and could decide to switch strategies. On the other hand, the strategy “don’t do any work other than fooling the humans” generalizes really well to many different problems.
However, I still feel that this approach to AGI safety is like trying to build barriers between yourself and something malicious and very smart, and you’re a lot better off if the system doesn’t have anything malicious in it to start off with. So, I’m a lot more optimistic about an AGI that’s a value learner, can figure out that we don’t want to be shown deceitful images (not a very hard problem in human values), and then not do that because it knows that’s not what we want.
Having discussed this proposal with an expert on LLMs, they tell me that, if the boundaries between prompt and input text and between input text and output text are each marked with special reserved tokens as I described (and if “can a longformer attend to that location from here?” issues are dealt with somehow), then for each boundary there is a 2-neuron circuit that will produce a signal for each token as to whether it is before or after that special token (and I assume a 2-or-3-neuron circuit for being after one but before the other). It seems extremely likely that with appropriate “obey-the-prompt-only” training such neural circuits would be learned, so features of “I’m in the prompt”, “I’m in the input text”, and “I’m in the output text” would become available downstream of them. Nevertheless, this means that these signals are not available until after layer 2 (or, for the combination, possibly layer 3), and their accuracy will depend on these neural circuits being learnt exactly and not getting perturbed by anything during training.
From a security viewpoint, this doesn’t feel secure enough to me. However, switching architecture to an encoder-decoder or dual-encoder-single-decoder model may be too drastic a change just to fix a security issue. An intermediate position would be to use feature engineering. For example, suppose you have an LLM with a residual embedding dimension d. You could reduce the token embedding (and perhaps also position embedding) dimension to d-1 and use the remaining dimension to encode the distinctions between prompt, input, and output (say using one fixed value for in-prompt, a second for in-input, and a third for in-output). That of course doesn’t prevent intermediate layers from outputting to this dimension and potentially messing this signal up (though giving them only d-1 output dimensions and preventing that would also be an option). Or you could simply pass this feature along as an extra read-only dimension/feature appended to the residual channel dimensions, so every set of weights that reads from or attends to the residuals needs to have d+1 weights, making them slightly larger. All of these variant proposals involve making some modifications to the LLM’s architecture, but they’re all a lot simpler and less expensive than my first proposal.
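As a rough sketch of that last read-only-dimension variant (in PyTorch-style code; the reserved token ids and the +1/0/-1 segment labels are arbitrary placeholders I’ve chosen for illustration, not anything from a real model):

```python
import torch

PROMPT_END, INPUT_END = 50257, 50258   # hypothetical reserved boundary-token ids

def segment_feature(token_ids: torch.Tensor) -> torch.Tensor:
    """Per-token segment label: +1 in the prompt, 0 in the input text, -1 in the output.

    Computed directly from the boundary tokens, rather than hoping the model
    learns an equivalent circuit in its first couple of layers."""
    after_prompt = (token_ids == PROMPT_END).long().cumsum(dim=-1) > 0
    after_input = (token_ids == INPUT_END).long().cumsum(dim=-1) > 0
    seg = torch.ones_like(token_ids, dtype=torch.float)
    seg[after_prompt] = 0.0
    seg[after_input] = -1.0
    return seg

class ReadOnlySegmentBlock(torch.nn.Module):
    """Wraps a transformer block so it reads [residual ; segment] (d + 1 features)
    but writes back only to the first d, so the segment channel can never be
    overwritten by intermediate layers."""
    def __init__(self, block: torch.nn.Module, d_model: int):
        super().__init__()
        self.block = block   # assumed to map (..., d_model + 1) -> (..., d_model)
        self.d_model = d_model

    def forward(self, resid: torch.Tensor, segment: torch.Tensor) -> torch.Tensor:
        x = torch.cat([resid, segment.unsqueeze(-1)], dim=-1)
        return resid + self.block(x)
```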
All of these proposals (including the original) are, of course, going against the advice of the Bitter Lesson. My response would be that I’m quite aware that (given unfakable boundary tokens) the neural net can learn to distinguish between the prompt, input, and output text without us doing anything further: I just don’t trust it to do so as reliably, efficiently, or perfectly as if we use feature engineering to explicitly supply this signal as input to the first layer. In the case of security, there is a huge difference between being secure under, say, 99.99% of inputs vs. 100%, because you have an attacker actively searching for the insecure 0.01% of the space. Training a classifier to achieve more than 99.99% accuracy tends to require huge amounts of training data, or data adversarially enriched in potential problem cases, because you only get gradient from the failed cases, and I don’t see how you can ever get to 100% by training. So I’m not convinced that the Bitter Lesson applies to security issues.
On the other hand, the feature engineering approach can only ensure that the signal is available to the neural net: even that can’t ensure that the LLM will 100% never obey instructions in the input text, only that the “this is input text” label was 100% available to every layer of the LLM.
Quite a number of emotion neurons have also been found in the CLIP text/image network, see https://distill.pub/2021/multimodal-neurons/#emotion-neuron for more details. In this case it’s apparently not representing the emotions of the writer of the text/photographer of the image, but those of the subject of the picture. Nevertheless, it demonstrates that neural nets can learn non-trivial representations of human emotions (interestingly, even down to distinguishing between ‘healthy’ and ‘unhealthy/mentally troubled’ variants of the same emotion). It would be interesting to see if LLMs distinguish between writing about a specific emotion, and writing while feeling that emotion. My expectation would be that these two ideas are correlated but distinct: one can write dispassionately about anger, or write angrily about some other emotion, so a sufficiently large LLM would need to use different representations for them, but they might well overlap.
Some very interesting and inspiring material.
I was fascinated to see that https://distill.pub/2021/multimodal-neurons/#emotion-neurons provides some clear evidence for emotion neurons in CLIP rather similar to the ones for modeling the author’s current emotional state that I hypothesized might exist in LLMs in https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/?commentId=ggKug9izazELkRLun. As I noted there, if true this would have significant potential for LLM safety and alignment.
Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)
A. Harden sensors so that tampering is harder than the intended task
We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.
Suppose the diamond in the room we’re monitoring contained a security tag, powered by incoming light, with a cryptographic secret in it, and an interface that allows us to continuously do a zero-knowledge identity proof on it via a laser optical interface, using a laser pointed into the room as input and the (high-speed) camera as output. So to successfully tamper with the camera, the AI would need access to the secret inside the security tag inside the diamond, or at least to reroute the channel to-and-from it. So it would need to be able to steal the diamond, extract the security tag, and reverse engineer its cryptographic secret out of it in a time shorter than the period between camera frames, or similarly replace the diamond with an optically-similar replacement now containing the security tag in the room, or integrate the security tag and its input-output channel into its camera-spoofing system, in each case in the time between frames. None of these are actually physically impossible, but they could be made extremely hard to do in a small fraction of a second (say we’re using a really high-speed camera with basically no temporal gaps between frames).
The basic idea here is that cryptographic techniques may be useful when hardening things. Whether this approach could usefully be extended to anything significantly more AGI-alignment-like than a diamond-in-a-vault scenario is unclear.
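As a toy illustration of the kind of protocol this points at: the sketch below uses a simple HMAC challenge-response per camera frame as a stand-in for the zero-knowledge identity proof (unlike a real ZK scheme, this toy version requires the verifier to hold the same secret as the tag), just to show why spoofing would require getting at the tag’s secret within a single frame interval:

```python
import hmac, hashlib, secrets

TAG_SECRET = secrets.token_bytes(32)   # burned into the security tag inside the diamond

def tag_respond(challenge: bytes) -> bytes:
    """Runs inside the tag: answer a fresh laser-encoded challenge."""
    return hmac.new(TAG_SECRET, challenge, hashlib.sha256).digest()

def frame_is_authentic(channel) -> bool:
    """Runs outside the vault: issue a fresh random challenge every camera frame.

    A spoofed camera feed must still contain the correct response to a challenge
    it could not have predicted, which requires access to TAG_SECRET (or to the
    live tag) within one frame interval."""
    challenge = secrets.token_bytes(16)
    expected = hmac.new(TAG_SECRET, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, channel(challenge))

print(frame_is_authentic(tag_respond))   # True for the honest optical channel
```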
Now we’re going to build a new model that is constructed based on the description of this model. Each component in the new model is going to be a small model trained to imitate a human computing the function that the description of the component specifies.
Some of the recent advances in symbolic regression and equation learning might be useful during this step to help generate functions describing component behavior, if what the component in the model is doing is moderately complicated. (E.g. A Mechanistic Interpretability Analysis of Grokking found that a model trained to do modular arithmetic ended up implementing it using discrete Fourier transforms and trig identities, which sounds like the sort of thing that might be a lot easier to figure out from a learned equation describing the component’s behavior.) Being able to reduce a neural circuit to an equation or a Bayes net or whatever would help a lot with interpretability, and at that point you might not even need to train an implementation model — we could maybe just use the symbolic form directly, as a more compact and more efficiently computable representation.
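As a toy version of that “fit an equation to a component” step, the sketch below does least squares over a small dictionary of candidate basis functions rather than using a full symbolic-regression package; the “component” being probed is a made-up stand-in for a reverse-engineered neural circuit:

```python
import numpy as np

rng = np.random.default_rng(2)

def component(x):                     # stand-in for a 1-D circuit found in the model
    return 0.7 * np.sin(3 * x) + 0.1 * x

x = rng.uniform(-3, 3, size=500)                      # probe it with sample inputs
y = component(x) + 0.01 * rng.standard_normal(500)    # plus a little noise

# Candidate terms we are willing to report in a symbolic description.
basis = {"1": np.ones_like(x), "x": x, "x^2": x**2,
         "sin(3x)": np.sin(3 * x), "cos(3x)": np.cos(3 * x)}
A = np.stack(list(basis.values()), axis=1)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

for name, c in zip(basis, coef):
    if abs(c) > 0.05:                 # keep only the dominant terms
        print(f"{c:+.2f} * {name}")   # recovers roughly +0.10*x and +0.70*sin(3x)
```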
At the end of this process, you might even end up with something symbolic that looked a lot like a “Good Old Fashioned AI” (GOFAI) model — but a “Bitter Lesson compatible” one first learnt by a neural net and then reverse engineered using interpretability. Obviously doing this would put high demands on our interpretation tools.
If I had such a function describing a neural net component, one of my first questions would be: what portions of the domain of this function are well covered by the training set that the initial neural net model was trained on, or at least are sufficiently near items in that training set that interpolating the function to them seems likely to be safe (given its local first, second, third… partial derivatives), vs. what portions are untested extrapolations? Did the symbolic regression/function learning process give us multiple candidate functions, and if so how much do they differ outside that well-tested region of the function domain?
This seems like it would give us some useful intuition for when the model might be unsafely extrapolating outside the training distribution and we need to be particularly cautious.
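One crude way to operationalize “is this input inside the well-tested region of the function’s domain”, sketched here as a nearest-neighbor distance check against the training inputs (the choice of k and the quantile threshold are arbitrary):

```python
import numpy as np

def mean_knn_distance(point, data, k, skip_self=False):
    """Mean distance from `point` to its k nearest neighbors in `data`."""
    d = np.sort(np.linalg.norm(data - point, axis=1))
    return d[1:k + 1].mean() if skip_self else d[:k].mean()

def is_extrapolation(query, train_inputs, k=10, quantile=0.95):
    """Flag `query` if it sits farther from the training inputs than 95% of the
    training inputs sit from each other: a crude 'we are extrapolating, be
    cautious' signal for a fitted component function."""
    within = [mean_knn_distance(p, train_inputs, k, skip_self=True) for p in train_inputs]
    threshold = np.quantile(within, quantile)
    return mean_knn_distance(query, train_inputs, k) > threshold
```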
Some portions of the neural net may turn out to be irreducibly complex — I suspect it would be good to be able to identify when something genuinely is complex, and when we’re just looking at a big tangled-up blob of memorized instances from the training set (e.g. by somehow localizing sources of loss on the test set).
This is very interesting: thanks for plotting it.
However, there is something that’s likely to happen that might perturb this extrapolation. Companies building large foundation models are likely soon going to start building multimodal models (indeed, GPT-4 is already multimodal, since it understands images as well as text). This will happen for at least three inter-related reasons:
Multimodal models are inherently more useful, since they also understand some combination of images, video, music… as well as text, and the relationships between them.
It’s going to be challenging to find orders of magnitude more high-quality text data than exists on the Internet, but there are huge amounts of video and image data (YouTube, TV and cinema, Google Street View, satellite images, everything any Tesla’s cameras have ever uploaded, …), and it seems that the models of reality needed to understand/predict text, images, and video overlap and interact significantly and usefully.
It seems likely that video will give the models better understanding of commonsense aspects of physical reality important to humans (and humanoid robots): humans are heavily visual, and so are a lot of things in the society we’ve built.
The question then is, does a thousand tokens-worth of text, video, and image data teach the model the same net amount? It seems plausible that video or image data might require more input to learn the same amount (depending on details of compression and tokenization), in which case training compute requirements might increase, which could throw the trend lines off. Even if not, the set of skills the model is learning will be larger, and while some things it’s learning overlap between these, others don’t, which could also alter the trend lines.