[AN #156]: The scaling hypothesis: a plan for building AGI

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

The Scaling Hypothesis (Gwern Branwen) (summarized by Rohin): This post centers around the scaling hypothesis:

Once we find a scalable architecture which can be applied fairly uniformly, we can simply train ever larger networks and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks and data. More powerful NNs are “just” scaled-up weak NNs, in much the same way that human brains look much like scaled-up primate brains.

Importantly, we can get this sophisticated behavior just by training on simple objectives, such as “predict the next word”, as long as the data is sufficiently diverse. So, a priori, why might we expect the scaling hypothesis to be true?

The core reason is that optimal (or human-level) prediction of text really does require knowledge, reasoning, causality, etc. If you don’t know how to perform addition, you are probably not going to be able to predict the next word in the sentence “Though he started with six eggs, he found another fourteen, bringing his total to ____”. However, since any specific fact is only useful in a tiny, tiny number of cases, it only reduces the expected loss by a tiny amount. So, you’ll only see models learn this sort of behavior once they have exhausted all the other “easy wins” for predicting text; this will only happen when the models and dataset are huge.

Consider a model tasked with predicting characters in text with a set of 64 characters (52 uppercase and lowercase letters, along with some punctuation). Initially it outputs random characters, assigning a probability of 1/64 to the correct character, resulting in a loss of 6 bits. Once you start training, the easiest win is to simply notice how frequent each character is; just noticing that uppercase letters are rare, spaces are common, vowels are common, etc. could get your loss down to 4-5 bits. After this, it might start to learn what words actually exist; this might take 10^5 to 10^6 samples since each word is relatively rare and there are thousands of words to learn, but this is a drop in the bucket given our huge dataset. After this step, it may have also learned punctuation along the way, and might now be down to 3-4 bits. At this point, if you sample from the model, you might get correctly spelled English words, but they won’t make any sense.
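As a quick sanity check on these numbers (my own illustration, not from the original post), here is a minimal Python sketch of the two loss levels just described: the uniform-guess loss of log2(64) = 6 bits per character, and the lower loss of a model that only knows character frequencies.

```python
import math
from collections import Counter

# Stand-in corpus; any reasonably large English text gives similar numbers.
text = "The quick brown fox jumps over the lazy dog. " * 1000

ALPHABET_SIZE = 64  # 52 letters plus some punctuation, as in the example above

# Uniform guessing: every character gets probability 1/64.
uniform_loss = math.log2(ALPHABET_SIZE)  # = 6 bits per character

# Unigram model: predict each character with its empirical frequency.
# Its loss on this text is the entropy of the character distribution.
counts = Counter(text)
total = len(text)
unigram_loss = -sum(
    (c / total) * math.log2(c / total) for c in counts.values()
)

print(f"uniform loss: {uniform_loss:.2f} bits/char")
print(f"unigram loss: {unigram_loss:.2f} bits/char")  # roughly 4 bits, in the 4-5 range quoted above
```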

With further training the model now has to pick up on associations between adjacent words to make progress. Now it needs to look at things 10 characters ago to predict the next character, a far cry from our initial letter frequencies where it didn’t even need to look at other characters! For example, it might learn that “George W” tends to be followed by “ashington”. It starts to learn grammar, being able to correctly put verbs in relation to subjects and objects (that are themselves nouns). It starts to notice patterns in how words like “before” and “after” are used; these can then be used to better predict words in the future; at this point it’s clear that the model is starting to learn semantics. Now the loss is around 2 bits per character. A little more training and your model starts to produce sentences that sound human-like in isolation, but don’t fit together: a model might start a story about a very much alive protagonist and then talk about how she is dead in the next sentence. Training is now about fixing errors like these, and each such fix gains a tiny amount of accuracy, on the order of ten-thousandths of a bit. Every further 0.1 bits you gain represents the model learning a huge amount of relevant knowledge (and correspondingly each subsequent 0.1 bits takes a much larger amount of training and data). The final few fractions of a bit are the most important and comprise most of what we call “intelligence”.

(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)
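One way to get intuition for these loss values (again my own illustration, not from the post) is to convert bits per character into an effective “branching factor”, i.e. how many equally likely next characters the model is still effectively choosing among at each step:

```python
# Effective branching factor 2**loss for the loss levels quoted above.
for loss_bits in [6.0, 4.5, 2.0, 1.0, 0.7]:
    print(f"{loss_bits:.1f} bits/char -> ~{2 ** loss_bits:.1f} effective choices")
# 6.0 bits -> 64 choices (random guessing over the 64-character alphabet)
# 0.7 bits -> ~1.6 choices (roughly the human baseline quoted above)
```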

So far this is a clever argument, but doesn’t really establish that this will work in practice—for example, maybe your model has to have 10^100 parameters to learn all of this, or maybe existing models and algorithms are not sophisticated enough to find the right parameters (and instead just plateau at, say, 2 bits of loss). But recent evidence provides strong support for the scaling hypothesis:

1. The scaling laws (AN #87) line of work demonstrated that models could be expected to reach the interesting realm of loss at amounts of compute, data, and model capacity that seemed feasible in the near future (a rough sketch of the functional form of these laws appears after this list).

2. Various projects have trained large models and demonstrated that this allows them to solve tasks that they weren’t explicitly trained for, often in a more human-like way and with better performance than a more supervised approach. Examples include GPT-3 (AN #102), Image GPT, BigGAN, AlphaStar (AN #73), etc. (The full post has something like 25 examples.)
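The scaling laws are power laws in model size (with analogous laws for data and compute). The sketch below uses the parameter-count law with constants approximately as I recall them from Kaplan et al. 2020; treat the specific numbers as illustrative rather than authoritative, and note that their loss is measured in nats per token, not the bits per character used in the earlier example.

```python
# Rough form of the neural language model scaling laws (Kaplan et al. 2020, AN #87):
# test loss falls as a power law in non-embedding parameter count N, assuming
# training to convergence with sufficient data. Constants below are approximate
# values from memory of the paper; treat them as illustrative only.

ALPHA_N = 0.076  # approximate exponent of the parameter-count law
N_C = 8.8e13     # approximate critical parameter count

def loss_from_params(n_params: float) -> float:
    """Approximate predicted test loss (nats/token) for N non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"N = {n:.0e} params -> predicted loss ~ {loss_from_params(n):.2f}")
```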

The author then argues that most researchers seem to be completely ignoring this phenomenon. OpenAI is the only actor that really has the conviction needed to put a large amount of resources behind a project based on the scaling hypothesis (such as GPT-3); DeepMind seems to believe in a weaker version, in which we need to build a bunch of “modules” similar to those in the human brain, but those modules can then be scaled up indefinitely. Other actors seem to not take either scaling hypothesis very seriously.

Rohin’s opinion: In my view, the scaling hypothesis is easily the most important hypothesis relevant to AI forecasting and AI development models, and this is the best public writeup of it that I know of. (For example, it seems to be an implicit assumption in the bio anchors framework (AN #121).) I broadly agree with the author that it’s a bit shocking how few people seem to be taking it seriously after OpenAI Five, AlphaStar, GPT-3, Copilot, etc.

I think this includes the AI safety space, where as far as I can tell the primary effect has been to make shorter timelines even more fashionable, while AI safety research itself hasn’t changed very much. However, I do know of around 3-4 researchers who changed what they were working on after changing their minds about the scaling hypothesis, so it’s possible there are several others I don’t know about.

As a simple example of how the scaling hypothesis affects AI safety research, it suggests that the training objective (“predict the next word”) is relatively unimportant in determining properties of the trained agent; in contrast, the dataset is much more important. This suggests that analyses based on the “reward function used to train the agent” are probably not going to be very predictive of the systems we actually build.

TECHNICAL AI ALIGNMENT


AGENT FOUNDATIONS

The Accumulation of Knowledge (Alex Flint) (summarized by Rohin): Probability theory can tell us how we ought to build agents that have knowledge (start with a prior and perform Bayesian updates as evidence comes in). However, this is not the only way to create knowledge: for example, humans are not ideal Bayesian reasoners. As part of our quest to describe existing agents (AN #66), could we have a theory of knowledge that specifies when a particular physical region within a closed system is “creating knowledge”? We want a theory that works in the Game of Life (AN #151) as well as the real world.
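For reference, here is a minimal sketch of the ideal-agent recipe mentioned above, a discrete prior updated with Bayes’ rule as evidence comes in; the hypotheses and likelihoods are invented purely for illustration.

```python
# Minimal Bayesian updating over a discrete set of hypotheses.
# Hypotheses and likelihoods are made up for illustration only.

def bayes_update(prior: dict, likelihood: dict) -> dict:
    """Return the posterior P(h | evidence) given P(h) and P(evidence | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnormalized.values())
    return {h: p / z for h, p in unnormalized.items()}

prior = {"fair coin": 0.5, "biased coin": 0.5}
# Evidence: we observe heads; suppose the biased coin lands heads 90% of the time.
posterior = bayes_update(prior, {"fair coin": 0.5, "biased coin": 0.9})
print(posterior)  # {'fair coin': ~0.36, 'biased coin': ~0.64}
```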

This sequence investigates this question from the perspective of defining the accumulation of knowledge as increasing correspondence between a map and the territory, and concludes that such definitions are not tenable. In particular, it considers four possibilities and demonstrates counterexamples to all of them:

1. Direct map-territory resemblance: Here, we say that knowledge accumulates in a physical region of space (the “map”) if that region of space looks more like the full system (the “territory”) over time.

Problem: This definition fails to account for cases of knowledge where the map is represented in a very different way that doesn’t resemble the territory, such as when a map is represented by a sequence of zeros and ones in a computer.

2. Map-territory mutual information: Instead of looking at direct resemblance, we can ask whether there is increasing mutual information between the supposed map and the territory it is meant to represent.

Problem: In the real world, nearly every region of space will have high mutual information with the rest of the world. For example, by this definition a rock accumulates lots of knowledge, since photons incident on its face affect the properties of specific electrons in the rock, giving it lots of information about its surroundings. (A toy computation of this mutual-information criterion is sketched after this list.)

3. Mutual information of an abstraction layer: An abstraction layer is a grouping of low-level configurations into high-level configurations such that transitions between high-level configurations are predictable without knowing the low-level configurations. For example, the zeros and ones in a computer are the high-level configurations of a digital abstraction layer over low-level physics. Knowledge accumulates in a region of space if that space has a digital abstraction layer, and the high-level configurations of the map have increasing mutual information with the low-level configurations of the territory.

Problem: A video camera that constantly records would accumulate much more knowledge by this definition than a human, even though the human is much more able to construct models and act on them.

4. Precipitation of action: The problem with our previous definitions is that they don’t require the knowledge to be useful. So perhaps we can instead say that knowledge is accumulating when it is being used to take action. To make this mechanistic, we say that knowledge accumulates when an entity’s actions become more fine-tuned to a specific environment configuration over time. (Intuitively, they learned more about the environment and so could condition their actions on that knowledge, which they previously could not do.)

Problem: This definition requires the knowledge to actually be used in order to count as knowledge. However, if someone makes a map of a coastline but that map is never used (perhaps it is quickly destroyed), it seems wrong to say that no knowledge was accumulating during the map-making process.
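To make the mutual-information criterion from definition 2 concrete, here is a toy sketch (my own illustration, not from the sequence) that estimates the mutual information between a “map” variable and a “territory” variable from samples. The rock counterexample corresponds to the degenerate case below: the rock’s internal state perfectly mirrors the incident photons, so the mutual information is maximal even though nothing map-like is going on.

```python
import math
from collections import Counter

def mutual_information(samples):
    """Estimate I(map; territory) in bits from (map_state, territory_state) samples."""
    n = len(samples)
    joint = Counter(samples)
    p_map = Counter(m for m, _ in samples)
    p_ter = Counter(t for _, t in samples)
    mi = 0.0
    for (m, t), c in joint.items():
        p_mt = c / n
        mi += p_mt * math.log2(p_mt * n * n / (p_map[m] * p_ter[t]))
    return mi

# A "rock" whose electron states simply mirror incident photons: perfect
# correlation, hence maximal mutual information, yet intuitively no knowledge.
rock_samples = [(photon, photon) for photon in [0, 1, 0, 1, 1, 0, 1, 0]]
print(mutual_information(rock_samples))  # 1.0 bit
```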

AI GOVERNANCE

AI Certification: Advancing Ethical Practice by Reducing Information Asymmetries (Peter Cihon et al) (summarized by Rohin): Certification is a method of reducing information asymmetries: it presents an audience with credible information about a product that they could not easily have gotten otherwise. With AI systems, certification could be used to credibly share information between AI actors, which could promote trust amongst competitors, or to share safety measures to prevent a race to the bottom on safety caused by worrying that “the other guys would be even more unsafe”. Certification is at its best when there is demand from an audience to see such certificates; public education about the need for credible information can help generate such demand.

However, certification often runs into problems. Symbol-substance decoupling happens when certificates are issued to systems that don’t meet the standards for certification. For example, in “ethics washing”, companies advertise a self-certificate in which their products are approved by ethics boards, but those ethics boards have no real power. Means-ends decoupling happens when the standards for certification don’t advance the goals for which the certificate was designed. For example, a certificate might focus on whether a system was tested, rather than on what test was conducted, leading applicants to use easy-to-pass tests that don’t actually provide a check on whether the method is safe.

Effective certification for future AI systems needs to be responsive to changes in AI technology. This can be achieved in a few ways: first, we can certify against underlying goals, which are more likely to remain stable; for example, we could certify ethical principles that will likely remain the same in the future. Second, we can target certification at types of people and institutions rather than technical artifacts; that is, certifications can cover executives, citizens, or corporations (rather than, say, specific algorithms that may be replaced in the future). Third, the certification system can build in mechanisms for periodically updating the certification criteria.

The paper then analyzes seven existing certification systems for AI systems; you’ll have to read the paper for details.

Case studies of self-governance to reduce technology risk (Jia) (summarized by Rohin): Should we expect AI companies to reduce risk through self-governance? This post investigates six historical cases, of which the two most successful were the Asilomar conference on recombinant DNA and the actions of Leo Szilard and other physicists in 1939 (around the development of the atomic bomb). It is hard to make any confident conclusions, but the author identifies the following five factors that make self-governance more likely:

1. The risks are salient.

2. If self-governance doesn’t happen, then the government will step in with regulation (which is expected to be poorly designed).

3. The field is small, so that coordination is easier.

4. There is support from gatekeepers (e.g. academic journals).

5. There is support from credentialed scientists.

Corporate Governance of Artificial Intelligence in the Public Interest (Peter Cihon et al) (summarized by Rohin): This paper is a broad overview of corporate governance of AI, where by corporate governance we mean “anything that affects how AI is governed within corporations” (a much broader category than the governance that is done by corporations about AI). The authors identify nine primary groups of actors that can influence corporate governance and give many examples of how such actors have affected AI governance in the past. The nine groups are managers, workers, investors, corporate partners and competitors, industry consortia, nonprofit organizations, the public, the media, and governments.

Since the paper is primarily a large set of examples along with pointers to other literature on the topic, I’m not going to summarize it in more detail here, though I did find many of the examples interesting (and would dive into them further if time were not so scarce).

FEEDBACK

I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.