AI self-improvement is possible

document purpose

This document mainly argues that human mental development implies that AI self-improvement from sub-human capabilities is possible.

document structure

For unambiguous referencing, sections are prefixed D: for description, L: for lemma, H: for hypothesis, or A: for argument. Descriptions are statements of facts, categorization, and definitions. Lemmas are linked arguments and conclusions that I have high confidence in. Hypotheses are views that at least some people have. Arguments are lines of reasoning that may or may not be correct.

D:abbreviations

  • ANN = artificial neural network

  • RL = reinforcement learning

  • SI = self-improvement (other than direct RL)

  • LLM = large language model (using a Transformer architecture)

  • SDF = signed distance field

D:data

Current LLMs are empirically data-limited, as shown by the “Chinchilla scaling laws” described in this paper.
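Roughly, that paper fits pretraining loss as a sum of a model-size term and a data term (numbers here are approximate and just for orientation):

```
L(N, D) ≈ E + A / N^α + B / D^β      (N = parameters, D = training tokens)
```

with fitted exponents around α ≈ 0.34 and β ≈ 0.28, which implies that compute-optimal training uses roughly 20 tokens per parameter. Models trained on less data than that are limited more by data than by parameter count.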

D:efficiency

Humans are much more data-efficient than current LLMs. LLMs are trained on far more text than a human can read in a lifetime, yet their capabilities are (at least in many ways) much worse than humans’. Considering D:data, this implies that humans extract far more capability per unit of data, ie that humans use superior algorithms.

Humans are, for some tasks, more energy-efficient than current AI systems. A human brain uses only about 20 watts, far less than a modern GPU (hundreds of watts), yet humans can do image segmentation faster than current ML models running on a single modern GPU.

H:plateau

Applying intelligence to improving intelligence has diminishing returns, so intelligence that becomes self-improving will plateau at a similar level of intelligence.

A:plateau

We don’t see humans or institutions become ultra-intelligent by recursive SI, so we know H:plateau is correct.

Here’s Robin Hanson making this argument; that’s still his view today.

D:genius

The genetic and morphological changes from early mammals to early apes were much greater than those from early apes to humans, but the absolute increase in intelligence from the latter step was much greater. A few more changes make the difference between an average human and the very smartest humans. Around human-level intelligence, returns do not seem to be diminishing.

H:human_SI

D:genius happens because small improvements in genetically-specified neural systems are amplified as those initial systems generate higher-level systems or perhaps recursively self-improve.

H:human_hyperparameters

D:genius happens because humans have better genetically-specified neural architectures, which when trained by RL produce better results.

D:prodigy

Exceptionally smart adult humans were, on average, more mentally capable as children, not less. Similarly, ANN architectures that plateau at better performance generally also do better early in training.

L:childhood

Humans have the longest time-to-adulthood of any animal. Chimpanzees also have unusually long childhoods. A long childhood is a large evolutionary disadvantage, so there must be a strong reason for it. Larger animals such as horses reach full size faster than humans do, so it’s not explained by body growth. The only other explanation is that a long childhood is necessary for certain kinds of brain development.

Human children first learn very basic actions like moving their limbs, then proceed to progressively more abstract and complex tasks that build on previously developed capabilities.

A:not_hyperparameters

If H:human_hyperparameters were correct, then by D:prodigy, we probably wouldn’t see L:childhood, but we do. So H:human_hyperparameters is probably wrong as an explanation for the differences between humans and other animals, which makes H:human_SI seem more likely.

That’s not an argument against H:human_hyperparameters being true for differences between humans, but if H:human_SI is true at all, it probably also accounts for part of the differences between humans.

L:reaction_time

Human reaction times are similar to other animals of similar size.

Human children take longer to learn basic actions like walking and running than other animals do. If that slower learning were caused by much greater network depth, we’d also see slower reaction times. We don’t, so the network depth for basic actions is similar, and depth is not why human childhood development is relatively slow.

L:uneven_evolution

Dolphins and elephants are both fairly intelligent, but human-level intelligence did not develop gradually, at similar rates, across several disparate species. This implies to me that human-level intelligence is not something that can be easily evolved in a stepwise manner along a smooth cost-benefit curve.

Thus, development of human (and pre-human) intelligence was not a matter of simply stacking more layers or scaling up brains, with no other changes.

A:childhood_implications

Why do human children initially develop capabilities more slowly than most animals? Why is that a tradeoff for greater capabilities later?

What’s the relevant difference in humans? Per L:reaction_time and L:uneven_evolution, it’s unlikely to be deeper networks. Per D:prodigy, it’s unlikely to be architectures that do straightforward RL more slowly but eventually become better. As far as I can see, the only remaining possibility is that humans use SI for some things that most animals use RL for.

H:no_drift

We can assume that AIs would design successors and self-modification in such a way that their goals would be preserved.

L:drift

The values of some agents will change over time. We know this because adult humans have different values from children.

Some agents will knowingly act in ways that change their values. We know this because some humans do that. If you convince someone that something—for example, researching a topic, going to college, or having a kid—would likely change their values, they’re still likely to do that anyway. Thus, H:no_drift is at least sometimes incorrect.

If agents would never produce an unaligned more-intelligent successor agent, then humans would never create unaligned superhuman AI, but some people are quite enthusiastic about doing that.

D:relative_drift

Humans have much greater value drift than animals. Instead of just having kids, smart people will do stuff like make model trains, build LIGO, or make a complex open-source game.

A:adaptation_executors

Agents execute adaptations. Interpreting agents as “goal-pursuers” is a simplifying abstraction that makes them easier to understand, but agents only pursue goals to the extent that they have been trained/designed to do so and the current situation is similar to the training/test environment. Thus, H:no_drift is not just incorrect, but fundamentally misguided.

A:domain_shift

Neural networks with different weights can represent different problems. When attempting to modify a neural network in a way that preserves goal-direction, changing its weights is therefore equivalent to changing the problem domain, so task-specific modification-managing systems will become maladapted as weights are changed.

L:monitoring

Humans are monitored by lower-level systems with higher authority. This is observable in ordinary human experience.

For example, consider a person Alice who is overweight and trying to lose weight. She won’t go out of her way to get food, but if someone puts a box of cakes in front of her, she will eat some despite consciously not wanting to. There is some system, sys_hunger, which has high authority and can override her conscious desire to lose weight, but sys_hunger is myopic and only overrides when food is immediately available. We know sys_hunger has access to Alice’s higher-level mental state because, for example, she won’t try to eat what she knows is plastic fake food.

For another example, consider a student Ben who can’t focus on his homework because he wants to call a girl (Jen) he likes. Ben says to himself, “If I get good grades, Jen will be more likely to date me.” This triggers an inspection, where something looks at his mental state and concludes: “You don’t really believe that doing your homework is going to get you a date with Jen. Rejected.”

L:deception

Some agents will actively deceive their internal monitoring systems. We know this because humans do that.

An example considered good is someone addicted to cigarettes trying to trick themselves into smoking less. An example considered bad is anorexics sharing tips for tricking themselves into losing more weight.

A:stability

If the purpose of L:monitoring systems is to maintain alignment with evolutionarily-hardcoded goals, their presence is only beneficial for that if they have less value drift than higher-level systems. That would only be true in 2 cases:

  • fine-tuning only: L:monitoring systems have little enough training from a hardcoded baseline to mostly retain their initial state.

  • generative levels: Humans have systems that generate other systems or self-modify, and L:monitoring systems are earlier in a chain of system-generation or have more-limited self-modification.

D:addiction

Repeated usage of certain drugs (eg heroin) causes humans to become misaligned from their previous values and the values of society. This happens via feedback to low-level systems through channels that aren’t normally available, which causes those low-level systems to change in a persistent way.

A:stability_implications

To me, L:monitoring systems seem to be too adaptive to be fine-tuned hard-coded systems, which by A:stability implies that humans use generative levels. D:addiction is another reason to believe L:monitoring systems are not just fine-tuned to a limited extent.

That only makes evolutionary sense if the generated systems have greater capability or adaptability, and L:monitoring systems do seem to be significantly less capable in humans than the systems they monitor. This implies that SI to human levels can be done from sub-human capabilities.

H:drift_bound

The extent of SI in humans is limited by increased value drift outweighing increased capabilities.

H:mutational_load

Human capabilities are limited by high intelligence requiring high genetic precision that can be achieved only rarely with normal rates of mutation generation and elimination.

D:management_methods

Consider an ANN N1 that is part of a larger system. There are 5 basic ways in which a management system could manage N1:

  • Control of hyperparameters such as learning rate. People have tried using ANNs to control hyperparameters during ANN training, but so far, results haven’t been better than Adam optimizers using simple formulas. Basic network architecture can also be considered a hyperparameter.

  • Control of connection patterns between subsystems. This includes control of where inputs come from and where outputs go. Mixture-of-experts designs can be considered a type of this.

  • Control of gradients and RL targets. An obvious way to modify gradients in a complex way is to use outputs from another ANN as the output targets; this technique is called distillation. Training a small ANN to copy the output probabilities of a larger ANN improves the small ANN’s performance: the wrong answers with higher probabilities are “good wrong answers” that help the small ANN organize its latent representation space better (a minimal sketch follows this list). Another type of distillation is training a network on outputs from more-specialized networks; see the Distral paper and this application of a variant of that method to soccer-playing robots.

  • Direct modification of N1 weights. A network that generates weights for another network is a hypernetwork. Generation of weights for a network based on mixing weights from other networks is neuroevolution. Getting hypernetworks to work well has been difficult, but HyperDiffusion involves training diffusion ANNs on weights from ANNs overfit on SDFs of 3d models, and seems to work well. Per-neuron modification of activation functions can also go in this category.

  • Augmentation of N1 with tools like database lookup or programming languages. Some ANN systems use lookup in vector databases. Some LLM systems have been integrated with feedback from running code they generate.
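To make the distillation bullet concrete, here’s a minimal sketch of the basic soft-target setup, assuming PyTorch; `teacher` and `student` are hypothetical placeholder models, and this is an illustration of the generic technique rather than code from any of the cited papers:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, optimizer, T=2.0):
    """One step of training a small 'student' ANN to copy the output
    probabilities of a larger, frozen 'teacher' ANN."""
    with torch.no_grad():
        teacher_logits = teacher(x)   # the "good wrong answers" live in these probabilities
    student_logits = student(x)
    # KL divergence between temperature-softened teacher and student distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same loss shape applies when the targets come from several more-specialized networks instead of one larger one; only the source of `teacher_logits` changes.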

A:synapses

Per this post, over a short time period, neuron behavior in brains is analogous to a sparse ANN with 1 weight per synapse.
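For scale: the human brain is commonly estimated to have on the order of 10^14–10^15 synapses, so this corresponds to a sparse ANN with very roughly a hundred trillion or more weights.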

Thus, if an ANN with that many parameters has its weights adjusted dynamically at some rate significantly slower than neuron firing speeds, some pattern of that adjustment is sufficient for human-level intelligence.

A:human_management_methods

Humans use at least some of D:management_methods, and this is done consciously in normal life.

People have some ability to control how much they learn from particular experiences—to say, “that was a rare unlucky event” or “that’s actually more common than people think”. The fact that LSD increases update rate indicates that there are controls for this, and if there are controls they are presumably used.

Humans can consciously control generation of neural systems from other systems, which can be considered directed neuroevolution. For example, consider a person John playing a game G1 for the first time. Someone tells John to start by pretending he’s playing a game G2 that he’s familiar with, but combining that with activity A, and after that John rapidly improves. What happens in that situation is:

  1. John activates a neural system S which has been trained on G2; this involves configuration switching.

  2. The neurons of S are switched to a new mode that copies the weights for playing G2.

  3. The new mode of S is trained on G1.

When John adjusts S according to skills trained on A, that involves distillation, with S being trained to produce outputs closer to the outputs of some system for A.
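In ANN terms, steps 1–3 plus that distillation step might look something like the following minimal sketch (PyTorch; S_g2, sys_a, and all shapes are hypothetical placeholders for systems trained on G2 and on A, not a claim about how brains implement this):

```python
import copy
import torch
import torch.nn.functional as F

# Hypothetical pretrained systems: S_g2 was trained on game G2,
# sys_a produces outputs for activity A.
S_g2 = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 8))
sys_a = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 8))

# Steps 1-2: activate the G2 system and copy its weights into a new mode
# that will be adapted to G1 (directed neuroevolution by copy-then-modify).
S_new = copy.deepcopy(S_g2)
optimizer = torch.optim.Adam(S_new.parameters(), lr=1e-3)

def g1_training_step(x, g1_target, alpha=0.5):
    """Step 3 plus distillation: train the copied system on G1 while also
    pulling its outputs toward the outputs of the system for A."""
    out = S_new(x)
    loss_g1 = F.mse_loss(out, g1_target)      # ordinary training signal from G1
    with torch.no_grad():
        a_target = sys_a(x)                   # outputs of the system trained on A
    loss_distill = F.mse_loss(out, a_target)  # distillation toward A's outputs
    loss = loss_g1 + alpha * loss_distill
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```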


As best I can tell, I use a multiscale extension of SDF hyperdiffusion for 3d visualization, so I think humans are able to use hypernetworks. That’s a less-convincing argument for other people, so let’s consider someone visualizing a 3d object based on a verbal description.

ANNs can be trained by gradient descent to provide an implicit representation of a specific 3d object (eg with SDFs or NeRFs), but considering the speed of neurons, people are able to visualize 3d objects too quickly for that to be the method used.

A verbal description converted to a 3d visualization must first be encoded to a latent representation and then decoded into the 3d representation. For 2d images, decoding to every pixel is reasonable, but decoding to every location in 3d is inefficient and leads to blocky voxel representations. Such voxel representations are incompatible with the experienced characteristics of 3d visualizations, so human 3d visualizations must involve some implicit representation step during generation, although that could later be converted into (eg) a textured surface form for greater efficiency. Generation of that implicit representation must involve input to the neurons of a representation-generating network across its structure, so a hypernetwork is involved.
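To make “a hypernetwork is involved” concrete, here’s a minimal sketch, assuming PyTorch, of a network that maps a description embedding to the weights of a small implicit SDF network; the sizes, the single-embedding input, and the two-layer generated network are illustrative assumptions, not a reconstruction of HyperDiffusion or of human visualization:

```python
import torch
import torch.nn as nn

class SDFHypernetwork(nn.Module):
    """Maps a latent description embedding to the weights of a small MLP
    that implicitly represents a 3d shape as a signed distance field."""
    def __init__(self, latent_dim=128, hidden=64):
        super().__init__()
        self.hidden = hidden
        # Weights/biases for a 2-layer SDF MLP: (3 -> hidden -> 1)
        n_params = (3 * hidden + hidden) + (hidden * 1 + 1)
        self.generator = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_params)
        )

    def forward(self, z, xyz):
        """z: (latent_dim,) description embedding; xyz: (N, 3) query points.
        Returns signed distances (N, 1) from the generated implicit network."""
        h = self.hidden
        p = self.generator(z)
        w1 = p[: 3 * h].view(h, 3)
        b1 = p[3 * h : 3 * h + h]
        w2 = p[3 * h + h : 3 * h + h + h].view(1, h)
        b2 = p[-1:]
        # Run the generated SDF network at the query points
        x = torch.relu(xyz @ w1.t() + b1)
        return x @ w2.t() + b2

# Usage sketch: query the generated implicit shape only where needed,
# instead of decoding a dense voxel grid.
hyper = SDFHypernetwork()
z = torch.randn(128)                  # hypothetical "verbal description" embedding
points = torch.rand(1000, 3) * 2 - 1  # query locations in [-1, 1]^3
distances = hyper(z, points)
```

The structural point is that the shape lives in generated weights, and the representation is queried only at locations of interest rather than decoded into a dense voxel grid.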


Regarding augmentation, humans can obviously decide to use tools like notes or calculators for specific parts of tasks.

D:speed

Action potentials in most neuron axons travel at <100 m/s. Electrical signals in wires travel ~10^6 times as fast.

Transistors are much faster than neurons. Synapses have a delay of ~0.5 ms; individual transistors in CPUs can switch >10^7 times that fast.
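As a rough check of those ratios: signals in wires propagate at a substantial fraction of the speed of light, roughly 2×10^8 m/s, which is about 2×10^6 times 100 m/s; and 0.5 ms divided by 10^7 is 50 picoseconds, while switching delays of individual transistors in modern CPUs are on the order of 10 picoseconds or less.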

conclusion

Per D:efficiency, large improvements in data-efficiency and energy-efficiency of AI systems from algorithmic and hardware improvements are possible.

The methods available for SI can be categorized by D:management_methods. A:human_management_methods shows that humans use SI methods in directed ways. That usage implies that such SI is useful.

By the combined weight of A:childhood_implications and A:stability_implications, it’s likely that humans use SI to bootstrap from sub-human capabilities to human-level capabilities. Per D:speed, an AI system could do such bootstrapping much more quickly. Based on human development speed, I’d expect that to take from 5 minutes to 1 day for a fully parallelized system, with that time multiplied by serialization of processing.

Per L:drift, A:adaptation_executors, and A:domain_shift, the goals of systems using SI are likely to change. Per D:relative_drift and A:domain_shift, change in goals should increase with the degree of SI, and large changes are probably inevitable for large amounts of SI.