# Testing The Natural Abstraction Hypothesis: Project Intro

The natural abstraction hypothesis says that:

- Our physical world abstracts well: for most systems, the information relevant “far away” from the system (in various senses) is much lower-dimensional than the system itself. These low-dimensional summaries are exactly the high-level abstract objects/concepts typically used by humans.
- These abstractions are “natural”: a wide variety of cognitive architectures will learn to use approximately the same high-level abstract objects/concepts to reason about the world.

If true, the natural abstraction hypothesis would dramatically simplify AI and AI alignment in particular. It would mean that a wide variety of cognitive architectures will reliably learn approximately-the-same concepts as humans use, and that these concepts can be precisely and unambiguously specified.

Ultimately, the natural abstraction hypothesis is an empirical claim, and will need to be tested empirically. At this point, however, we lack even the tools required to test it. This post is an intro to a project to build those tools and, ultimately, test the natural abstraction hypothesis in the real world.

## Background & Motivation

One of the major conceptual challenges of designing human-aligned AI is the fact that human values are a function of humans’ latent variables: humans care about abstract objects/concepts like trees, cars, or other humans, not about low-level quantum world-states directly. This leads to conceptual problems of defining “what we want” in physical, reductive terms. More generally, it leads to conceptual problems in translating between human concepts and concepts learned by other systems—e.g. ML systems or biological systems.

If true, the natural abstraction hypothesis provides a framework for translating between high-level human concepts, low-level physical systems, and high-level concepts used by non-human systems.

The foundations of the framework have been sketched out in previous posts.

What is Abstraction? introduces the mathematical formulation of the framework and provides several examples. Briefly: the high-dimensional internal details of far-apart subsystems are independent given their low-dimensional “abstract” summaries. For instance, the Lumped Circuit Abstraction abstracts away all the details of molecule positions or wire shapes in an electronic circuit, and represents the circuit as components each summarized by some low-dimensional behavior—like V = IR for a resistor. This works because the low-level molecular motions in a resistor are independent of the low-level molecular motions in some far-off part of the circuit, *given* the high-level summary. All the rest of the low-level information is “wiped out” by noise in low-level variables “in between” the far-apart components.
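A rough numerical illustration of "independent given the summary" follows. It is a toy sketch under stated assumptions, not the formulation from the linked post: a handful of Gaussian low-level variables, a summary equal to their sum, and a far-away variable that sees only the summary plus noise. All names (`simulate`, `residuals`, and so on) are hypothetical. Partial correlation is used as the check, which matches conditional independence only in the jointly Gaussian case assumed here.

```python
import random


def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """Residuals of ys after least-squares regression on xs (with intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


def simulate(num_samples=20000, num_low_level=5, seed=0):
    """Return (raw, partial) correlation between one low-level detail
    and a far-away variable that depends only on the summary."""
    rng = random.Random(seed)
    detail, summary, far = [], [], []
    for _ in range(num_samples):
        low = [rng.gauss(0, 1) for _ in range(num_low_level)]  # low-level state
        s = sum(low)                                           # assumed summary
        detail.append(low[0])
        summary.append(s)
        far.append(s + rng.gauss(0, 1))  # far variable: summary plus noise
    raw = correlation(detail, far)
    # Partial correlation: correlate what is left after regressing out the summary.
    partial = correlation(residuals(detail, summary), residuals(far, summary))
    return raw, partial


raw_corr, partial_corr = simulate()
print(f"raw correlation:     {raw_corr:.3f}")
print(f"given the summary:   {partial_corr:.3f}")
```

In this toy setup the low-level detail is visibly correlated with the far-away variable, but that correlation should shrink to roughly zero once the summary is conditioned on.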

Chaos Induces Abstractions explains one major reason *why* we expect low-level details to be independent (given high-level summaries) for typical physical systems. If I have a bunch of balls bouncing around perfectly elastically in a box, then the total energy, number of balls, and volume of the box are all conserved, but chaos wipes out all other information about the exact positions and velocities of the balls. My “high-level summary” is then the energy, number of balls, and volume of the box; all other low-level information is wiped out by chaos. This is exactly the abstraction behind the ideal gas law. More generally, given any uncertainty in initial conditions—even very small uncertainty—mathematical chaos “amplifies” that uncertainty until we are maximally uncertain about the system state… *except* for information which is perfectly conserved. In most dynamical systems, some information is conserved, and the rest is wiped out by chaos.
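The "conserved quantities survive, everything else is wiped out" picture can be sketched with a deliberately tiny stand-in. This is not a simulation of bouncing balls: it assumes two quantities `x` and `y` whose total is conserved, while the split between them evolves under the logistic map with parameter 4 (a standard chaotic map). The function names and the choice of map are illustrative assumptions.

```python
def step(x, y):
    """One update: the total x + y is conserved, the split evolves chaotically."""
    total = x + y
    frac = x / total
    frac = 4.0 * frac * (1.0 - frac)  # logistic map at r = 4
    return total * frac, total * (1.0 - frac)


def run(x, y, num_steps):
    """Return the list of (x, y) states, including the initial one."""
    trajectory = [(x, y)]
    for _ in range(num_steps):
        x, y = step(x, y)
        trajectory.append((x, y))
    return trajectory


# Two initial conditions with the same total, differing by 1e-10 in the split.
traj_a = run(0.3, 0.7, 60)
traj_b = run(0.3 + 1e-10, 0.7 - 1e-10, 60)

# Largest gap in x between the two runs during the last 20 steps.
max_late_gap = max(abs(a[0] - b[0]) for a, b in zip(traj_a[40:], traj_b[40:]))
# Largest deviation of the total from its initial value of 1.0, over both runs.
max_total_drift = max(abs((x + y) - 1.0) for x, y in traj_a + traj_b)

print(f"late-time gap in x:   {max_late_gap:.3f}")
print(f"drift in the total:   {max_total_drift:.2e}")
```

The tiny initial difference in the split grows until the two runs are unrelated, while the total stays fixed up to floating-point rounding. Knowing the total is then all that an observer with slightly uncertain initial conditions can still predict.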

Anatomy of a Gear: What makes a good “gear” in a gears-level model? A physical gear is a very high-dimensional object, consisting of huge numbers of atoms rattling around. But for purposes of predicting the behavior of the gearbox, we need only a one-dimensional summary of all that motion: the rotation angle of the gear. More generally, a good “gear” is a subsystem which abstracts well—i.e. a subsystem for which a low-dimensional summary can contain all the information relevant to predicting far-away parts of the system.

Science in a High Dimensional World: Imagine that we are early scientists, investigating the mechanics of a sled sliding down a slope. The number of variables which could conceivably influence the sled’s speed is vast: angle of the hill, weight and shape and material of the sled, blessings or curses laid upon the sled or the hill, the weather, wetness, phase of the moon, latitude and/or longitude and/or altitude, astrological motions of stars and planets, etc. Yet in practice, just a relatively-low-dimensional handful of variables suffices—maybe a dozen. A consistent sled-speed can be achieved while controlling *only a dozen variables*, out of *literally billions*. And this generalizes: across every domain of science, we find that controlling just a relatively-small handful of variables is sufficient to reliably predict the system’s behavior. Figuring out *which* variables is, in some sense, the central project of science. This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.

## The Problem and The Plan

The natural abstraction hypothesis can be split into three sub-claims, two empirical, one mathematical:

- Abstractability: for most physical systems, the information relevant “far away” can be represented by a summary much lower-dimensional than the system itself.
- Human-Compatibility: these summaries are the abstractions used by humans in day-to-day thought/language.
- Convergence: a wide variety of cognitive architectures learn and use approximately-the-same summaries.

Abstractability and human-compatibility are empirical claims, which ultimately need to be tested in the real world. Convergence is a more mathematical claim, i.e. it will ideally involve proving theorems, though empirical investigation will likely still be needed to figure out exactly *which* theorems.

These three claims suggest three different kinds of experiment to start off:

- Abstractability: does reality abstract well? Corresponding experiment type: run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance is low-dimensional.
- Human-Compatibility: do these match human abstractions? Corresponding experiment type: run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance recovers human-recognizable abstractions.
- Convergence: are these abstractions learned/used by a wide variety of cognitive architectures? Corresponding experiment type: train a predictor/agent against a simulated environment with known abstractions; look for a learned abstract model.

The first two experiments both require computing information-relevant-at-a-distance in a reasonably-complex simulated environment. The “naive”, brute-force method for this would not be tractable; it would require evaluating high-dimensional integrals over “noise” variables. So the first step will be to find practical algorithms for directly computing abstractions from low-level simulated environments. These don’t need to be fully-general or arbitrarily-precise (at least not initially), but they need to be general enough to apply to a reasonable variety of realistic systems.
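To make "see if info-at-a-distance is low-dimensional" concrete in the simplest possible setting, here is a hedged sketch. It assumes a toy linear-Gaussian model, which is my assumption and not something this post specifies: four independent unit-variance "near" variables, five "far" variables that depend on them only through a 2-dimensional bottleneck `z = W x_near`, plus independent noise. Under those assumptions the near/far cross-covariance is `W^T A^T`, and its rank counts how many linear combinations of the near variables carry any information about the far ones. The matrices `W` and `A` and all helper names are hypothetical.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def numerical_rank(matrix, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    rows, cols = len(m), len(m[0])
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot: this column adds no new dimension
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Hypothetical bottleneck: 4 near variables -> 2 summary variables.
W = [[1.0, 1.0, 1.0, 1.0],
     [1.0, -1.0, 1.0, -1.0]]
# Hypothetical readout: 2 summary variables -> 5 far variables (noise omitted,
# since independent noise does not change the cross-covariance).
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [2.0, -1.0],
     [0.5, 3.0]]

# Cov(x_near, x_far) = Cov(x_near) W^T A^T, with Cov(x_near) = I by assumption.
cross_cov = matmul(transpose(W), transpose(A))
print("cross-covariance shape:", len(cross_cov), "x", len(cross_cov[0]))
print("summary dimension (rank):", numerical_rank(cross_cov))
```

This only sidesteps the high-dimensional integrals because everything is linear and Gaussian; the point is the shape of the question (how many dimensions of the near system matter far away), not a general algorithm.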

Once we have algorithms capable of directly computing the abstractions in a system, training a few cognitive models against that system is an obvious next step. This raises another algorithmic problem: how do we efficiently check whether a cognitive system has learned particular abstractions? Again, this doesn’t need to be fully general or arbitrarily precise. It just needs to be general enough to use as a tool for the next step.

The next step is where things get interesting. Ideally, we want general theorems telling us which cognitive systems will learn which abstractions in which environments. As of right now, I’m not even sure exactly what those theorems should say. (There are some promising directions, like modular variation of goals, but the details are still pretty sparse and it’s not obvious whether these are the right directions.) This is the perfect use-case for a feedback loop between empirical and theoretical work:

- Try training various cognitive systems in various environments, see what abstractions they learn.
- Build a model which matches the empirical results, then come up with new tests for that model.
- Iterate.

Along the way, it should be possible to prove theorems on what abstractions will be learned in at least some cases. Experiments should then probe cases not handled by those theorems, enabling more general models and theorems, eventually leading to a unified theory.

(Of course, in practice this will probably also involve a larger feedback loop, in which lessons learned training models also inform new algorithms for computing abstractions in more-general environments, and for identifying abstractions learned by the models.)

The end result of this process, the holy grail of the project, would be a system which provably learns all learnable abstractions in a fairly general class of environments, and represents those abstractions in a legible way. In other words: it would be a standardized tool for measuring abstractions. Stick it in some environment, and it finds the abstractions in that environment and presents a standard representation of them. Like a thermometer for abstractions.

Then, the ultimate test of the natural abstraction hypothesis would just be a matter of pointing the abstraction-thermometer at the real world, and seeing if it spits out human-recognizable abstract objects/concepts.

## Summary

The natural abstraction hypothesis suggests that most high-level abstract concepts used by humans are “natural”: the physical world contains subsystems for which all the information relevant “far away” can be contained in a (relatively) low-dimensional summary. These subsystems are exactly the high-level “objects” or “categories” or “concepts” we recognize in the world. If true, this hypothesis would dramatically simplify the problem of human-aligned AI. It would imply that a wide range of architectures will reliably learn similar high-level concepts from the physical world, that those high-level concepts are exactly the objects/categories/concepts which humans care about (i.e. inputs to human values), and that we can precisely specify those concepts.

The natural abstraction hypothesis is mainly an empirical claim, which needs to be tested in the real world.

My main plan for testing this involves a feedback loop between:

- Calculating abstractions in (reasonably-realistic) simulated systems
- Training cognitive models on those systems
- Empirically identifying patterns in which abstractions are learned by which cognitive models in which environments
- Proving theorems about which abstractions are learned by which cognitive models in which environments

The holy grail of the project would be an “abstraction thermometer”: an algorithm capable of reliably identifying the abstractions in an environment and representing them in a standard format. In other words, a tool for measuring abstractions. This tool could then be used to measure abstractions in the real world, in order to test the natural abstraction hypothesis.

I plan to spend at least the next six months working on this project. Funding for the project has been supplied by the Long-Term Future Fund.


A decent introduction to the natural abstraction hypothesis, and how testing it might be attempted. A very worthy project, but it isn’t that easy to follow for beginners, nor does it provide a good understanding of how the testing might work in detail. What might constitute a success, and what might constitute a failure, of this testing? A decent introduction, but only an introduction, and it should have been part of a sequence or a longer post.

I’m curious if you’d looked at this followup (also nominated for review this year) http://lesswrong.com/posts/dNzhdiFE398KcGDc9/testing-the-natural-abstraction-hypothesis-project-update

I have looked at it, but ignored it when commenting on this post, which should stand on its own (or as part of a sequence).

Fair. Fwiw I’d be interested in your review of the followup as a standalone.

Here’s the review, though it’s not very detailed (the post explains why):

https://www.lesswrong.com/posts/dNzhdiFE398KcGDc9/testing-the-natural-abstraction-hypothesis-project-update?commentId=spMRg2NhPogHLgPa8

Yup, makes sense. Thank you!