“Deep Learning” Is Function Approximation

A Surprising Development in the Study of Multi-layer Parameterized Graphical Function Approximators

As a programmer and epistemology enthusiast, I’ve been studying some statistical modeling techniques lately! It’s been boodles of fun, and might even prove useful in a future dayjob if I decide to pivot my career away from the backend web development roles I’ve taken in the past.

More specifically, I’ve mostly been focused on multi-layer parameterized graphical function approximators, which map inputs to outputs via a sequence of affine transformations composed with nonlinear “activation” functions.
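To make "a sequence of affine transformations composed with nonlinear activation functions" concrete, here is a minimal sketch in plain Python. The function names (`affine`, `relu`, `approximator`) and the hand-picked parameters are illustrative assumptions, not any particular library's API. The activation used here is the maximum of the input and zero, which is one common choice.

```python
def affine(weights, biases, inputs):
    """Affine transformation: multiply by a weight matrix, then add a bias vector."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    """Elementwise "activation": the maximum of each input and zero."""
    return [max(0.0, v) for v in values]


def approximator(inputs, layers):
    """Apply each (weights, biases) layer in turn, with an activation between layers."""
    hidden = inputs
    for index, (weights, biases) in enumerate(layers):
        hidden = affine(weights, biases, hidden)
        if index < len(layers) - 1:  # no activation after the final layer
            hidden = relu(hidden)
    return hidden


# Hypothetical hand-set parameters that compute |x| as relu(x) + relu(-x).
abs_layers = [
    ([[1.0], [-1.0]], [0.0, 0.0]),  # 1 input -> 2 hidden units
    ([[1.0, 1.0]], [0.0]),          # 2 hidden units -> 1 output
]

print(approximator([-3.0], abs_layers))  # [3.0]
print(approximator([2.0], abs_layers))   # [2.0]
```

In practice the parameters are found by fitting rather than set by hand; the hand-set values are only there to make the arithmetic easy to follow.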

(Some authors call these “deep neural networks” for some reason, but I like my name better.)

It’s a curve-fitting technique: by setting the multiplicative factors and additive terms appropriately, multi-layer parameterized graphical function approximators can approximate any function. For a popular choice of “activation” rule which takes the maximum of the input and zero, the curve is specifically a piecewise-linear function. We iteratively improve the approximation by adjusting the parameters in the direction opposite the derivative of some error metric on the current approximation’s fit to some example input–output pairs (x, y), which some authors call “gradient descent” for some reason. (The mean squared error is a popular choice for the error metric, as is the negative log likelihood. Some authors call these “loss functions” for some reason.)
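The iterative improvement step can be sketched on the simplest possible approximator, a single affine map f(x) = w·x + b with no activation, fit by mean squared error. The example pairs, the learning rate, and the step count are assumptions picked for illustration. Each step moves the parameters a little way against the derivative of the error, so that the error goes down.

```python
def mean_squared_error(w, b, pairs):
    """Average squared difference between w*x + b and the example outputs."""
    return sum((w * x + b - y) ** 2 for x, y in pairs) / len(pairs)


def gradient_step(w, b, pairs, learning_rate):
    """Move (w, b) a small step against the derivative of the mean squared error."""
    n = len(pairs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b


# Hypothetical example pairs sampled from y = 2x + 1.
pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0
initial_error = mean_squared_error(w, b, pairs)
for _ in range(2000):
    w, b = gradient_step(w, b, pairs, learning_rate=0.05)
final_error = mean_squared_error(w, b, pairs)

print(round(w, 3), round(b, 3))  # 2.0 1.0
```

A multi-layer approximator is fit the same way, except that the derivatives are taken with respect to every layer's parameters.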

Basically, the big empirical surprise of the previous decade is that given a lot of desired input–output pairs (x, y) and the proper engineering know-how, you can use large amounts of computing power to find parameters to fit a function approximator f that “generalizes” well, meaning that if you compute f(x) for some x that wasn’t in any of your original example input–output pairs (which some authors call “training” data for some reason), it turns out that f(x) is usually pretty similar to the y you would have used in an example pair.

It wasn’t obvious beforehand that this would work! You’d expect that if your function approximator has more parameters than you have example input–output pairs, it would overfit, implementing a complicated function that reproduced the example input–output pairs but outputted crazy nonsense for other choices of x, with the more expressive function approximator proving useless for the lack of evidence to pin down the correct approximation.

And that is what we see for function approximators with only slightly more parameters than example input–output pairs, but for sufficiently large function approximators, the trend reverses and “generalization” improves—the more expressive function approximator proving useful after all, as it admits algorithmically simpler functions that fit the example pairs.

The other week I was talking about this to an acquaintance who seemed puzzled by my explanation. “What are the preconditions for this intuition about neural networks as function approximators?” they asked. (I paraphrase only slightly.) “I would assume this is true under specific conditions,” they continued, “but I don’t think we should expect such niceness to hold under capability increases. Why should we expect this to carry forward?”

I don’t know where this person was getting their information, but this made zero sense to me. I mean, okay, when you increase the number of parameters in your function approximator, it gets better at representing more complicated functions, which I guess you could describe as “capability increases”?

But multi-layer parameterized graphical function approximators created by iteratively using the derivative of some error metric to improve the quality of the approximation are still, actually, function approximators. Piecewise-linear functions are still piecewise-linear functions even when there are a lot of pieces. What did you think it was doing?

Multi-layer Parameterized Graphical Function Approximators Have Many Exciting Applications

To be clear, you can do a lot with function approximation!

For example, if you assemble a collection of desired input–output pairs (x, y) where the x is an array of pixels depicting a handwritten digit and y is a character representing which digit, then you can fit a “convolutional” multi-layer parameterized graphical function approximator to approximate the function from pixel-arrays to digits, effectively allowing computers to read handwriting.

Such techniques have proven useful in all sorts of domains where a task can be conceptualized as a function from one data distribution to another: image synthesis, voice recognition, recommender systems—you name it. Famously, by approximating the next-token function in tokenized internet text, large language models can answer questions, write code, and perform other natural-language understanding tasks.

I could see how someone reading about computer systems performing cognitive tasks previously thought to require intelligence might be alarmed—and become further alarmed when reading that these systems are “trained” rather than coded in the manner of traditional computer programs. The summary evokes imagery of training a wild animal that might turn on us the moment it can seize power and reward itself rather than being dependent on its masters.

But “training” is just a suggestive name. It’s true that we don’t have a mechanistic understanding of how function approximators perform tasks, in contrast to traditional computer programs whose source code was written by a human. It’s plausible that this opacity represents grave risks, if we create powerful systems that we don’t know how to debug.

But whatever the real risks are, any hope of mitigating them is going to depend on acquiring the most accurate possible understanding of the problem. If the problem is itself largely one of our own lack of understanding, it helps to be specific about exactly which parts we do and don’t understand, rather than surrendering the entire field to a blurry aura of mystery and despair.

An Example of Applying Multi-layer Parameterized Graphical Function Approximators in Success-Antecedent Computation Boosting

One of the exciting things about multi-layer parameterized graphical function approximators is that they can be combined with other methods for the automation of cognitive tasks (which is usually called “computing”, but some authors say “artificial intelligence” for some reason).

In the spirit of being specific about exactly which parts we do and don’t understand, I want to talk about Mnih et al. 2013’s work on getting computers to play classic Atari games (like Pong, Breakout, or Space Invaders). This work is notable as one of the first high-profile examples of using multi-layer parameterized graphical function approximators in conjunction with success-antecedent computation boosting (which some authors call “reinforcement learning” for some reason).

If you only read the news—if you’re not in tune with there being things to read besides news—I could see this result being quite alarming. Digital brains learning to play video games at superhuman levels from the raw pixels, rather than because a programmer sat down to write an automation policy for that particular game? Are we not already in the shadow of the coming race?

But people who read textbooks and not just news, being no less impressed by the result, are often inclined to take a subtler lesson from any particular headline-grabbing advance.

Mnih et al.’s Atari result built off the technique of Q-learning introduced two decades prior. Given a discrete-time present-state-based outcome-valued stochastic control problem (which some authors call a “Markov decision process” for some reason), Q-learning concerns itself with defining a function Q(s, a) that describes the value of taking action a while in state s, for some discrete sets of states and actions. For example, to describe the problem faced by a policy for a grid-based video game, the states might be the squares of the grid, and the available actions might be moving left, right, up, or down. The Q-value for being on a particular square and taking the move-right action might be the expected change in the game’s score from doing that (including a scaled-down expectation of score changes from future actions after that).

Upon finding itself in a particular state s, a Q-learning policy will usually perform the action a with the highest Q(s, a), “exploiting” its current beliefs about the environment, but with some probability it will “explore” by taking a random action. The predicted outcomes of its decisions are compared to the actual outcomes to update the function Q, which can simply be represented as a table with as many rows as there are possible states and as many columns as there are possible actions. We have theorems to the effect that as the policy thoroughly explores the environment, it will eventually converge on the correct Q.
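Here is a minimal sketch of the table-based version, under assumptions chosen to keep it small: a hypothetical five-square corridor instead of a grid, a score of 1 for reaching the rightmost square, and made-up values for the learning rate, discount factor, and exploration probability. The table has one row per state and one column per action, as described above.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a corridor; action 0 moves left, action 1 moves right."""
    rng = random.Random(seed)
    moves = (-1, +1)
    goal = n_states - 1
    table = [[0.0, 0.0] for _ in range(n_states)]  # rows: states, columns: actions
    for _episode in range(episodes):
        state = 0
        for _step in range(100):  # cap episode length
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                best = max(table[state])  # exploit, breaking ties at random
                action = rng.choice([a for a in range(2) if table[state][a] == best])
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            if next_state == goal:
                target = reward  # nothing further to expect after the goal
            else:
                target = reward + gamma * max(table[next_state])
            # Nudge the prediction toward the observed outcome.
            table[state][action] += alpha * (target - table[state][action])
            state = next_state
            if state == goal:
                break
    return table


q_table = q_learning()
for state, row in enumerate(q_table[:-1]):
    print(state, [round(value, 3) for value in row])
```

With these settings the learned table should prefer moving right in every non-goal square, with values shrinking by the discount factor for each extra step away from the goal.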

But Q-learning as originally conceived doesn’t work for the Atari games studied by Mnih et al., because it assumes a discrete set of possible states that could be represented with the rows in a table. This is intractable for problems where the state of the environment varies continuously. If a “state” in Pong is a 6-tuple of floating-point numbers representing the player’s paddle position, the opponent’s paddle position, and the x- and y-coordinates of the ball’s position and velocity, then there’s no way for the traditional Q-learning algorithm to base its behavior on its past experiences without having already seen that exact conjunction of paddle positions, ball position, and ball velocity, which almost never happens. So Mnih et al.’s great innovation was …

(Wait for it …)

… to replace the table representing Q with a multi-layer parameterized graphical function approximator! By approximating the mapping from state–action pairs to discounted-sums-of-“rewards”, the “neural network” allows the policy to “generalize” from its experience, taking similar actions in relevantly similar states, without having visited those exact states before. There are a few other minor technical details needed to make it work well, but that’s the big idea.
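The table-for-approximator swap can be sketched as follows. To stay short, this uses a linear approximator (one weight vector per action) rather than a multi-layer one, and a made-up two-number state. It is not Mnih et al.'s architecture, and it omits the additional details their system needs. The names `q_value` and `td_update` are hypothetical.

```python
def q_value(weights, state, action):
    """Approximate Q(state, action) as a bias plus a weighted sum of state features."""
    w = weights[action]
    return w[0] + sum(wi * si for wi, si in zip(w[1:], state))


def td_update(weights, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Adjust the chosen action's weights to shrink the prediction error on one sample."""
    if done:
        best_next = 0.0
    else:
        best_next = max(q_value(weights, next_state, a) for a in range(len(weights)))
    error = reward + gamma * best_next - q_value(weights, state, action)
    w = weights[action]
    w[0] += alpha * error
    for i, feature in enumerate(state):
        w[i + 1] += alpha * error * feature
    return error


# Two actions, two continuous state features, all parameters starting at zero.
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
error = td_update(weights, (0.5, -0.25), 1, reward=1.0, next_state=(0.6, -0.2), done=True)

# A nearby state that was never visited now gets a nonzero estimate for action 1.
nearby_estimate = q_value(weights, (0.51, -0.25), 1)
print(error, round(nearby_estimate, 5))  # 1.0 0.13175
```

A table would have no entry at all for the state (0.51, -0.25). The approximator produces an estimate for it because its parameters are shared across states.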

And understanding the big idea probably changes your perspective on the headline-grabbing advance. (It certainly did for me.) “Deep learning is like evolving brains; it solves problems and we don’t know how” is an importantly different story from “We swapped out a table for a multi-layer parameterized graphical function approximator in this specific success-antecedent computation boosting algorithm, and now it can handle continuous state spaces.”

Risks From Learned Approximation

When I solicited reading recommendations from people who ought to know about risks of harm from statistical modeling techniques, I was directed to a list of reputedly fatal-to-humanity problems, or “lethalities”.

Unfortunately, I don’t think I’m qualified to evaluate the list as a whole; I would seem to lack some necessary context. (The author keeps using the term “AGI” without defining it, and adjusted gross income doesn’t make sense in context.)

What I can say is that when the list discusses the kinds of statistical modeling techniques I’ve been studying lately, it starts to talk funny. I don’t think someone who’s been reading the same textbooks as I have (like Prince 2023 or Bishop and Bishop 2024) would write like this:

Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. [...] This is sufficient on its own [...] to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

To be clear, I agree that if you fit a function approximator by iteratively adjusting its parameters in the direction opposite the derivative of some loss function on example input–output pairs, that doesn’t create an explicit internal representation of the loss function inside the function approximator.

It’s just—why would you want that? And really, what would that even mean? If I use the mean squared error loss function to approximate a set of data points in the plane with a line (which some authors call a “linear regression model” for some reason), obviously the line itself does not somehow contain a representation of general squared-error-minimization. The line is just a line. The loss function defines how my choice of line responds to the data I’m trying to approximate with the line. (The mean squared error has some elegant mathematical properties, but is more sensitive to outliers than the mean absolute error.)
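The parenthetical about outliers can be checked with a small sketch. As a simplifying assumption, it fits a single constant to made-up data rather than a line, and it finds the best constant by brute-force search over integer candidates, so that nothing is hidden inside a library call.

```python
def mean_squared_error(constant, ys):
    return sum((y - constant) ** 2 for y in ys) / len(ys)


def mean_absolute_error(constant, ys):
    return sum(abs(y - constant) for y in ys) / len(ys)


def best_constant(loss, ys, candidates):
    """Return the candidate constant with the smallest loss on the data."""
    return min(candidates, key=lambda c: loss(c, ys))


candidates = range(0, 101)
clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]  # hypothetical data with one outlier

print(best_constant(mean_squared_error, clean, candidates))          # 3
print(best_constant(mean_absolute_error, clean, candidates))         # 3
print(best_constant(mean_squared_error, with_outlier, candidates))   # 22
print(best_constant(mean_absolute_error, with_outlier, candidates))  # 3
```

The fitted constant in each case is just a number. Which number you get depends on both the loss function and the data it is applied to.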

It’s the same thing for piecewise-linear functions defined by multi-layer parameterized graphical function approximators: the model is the dataset. It’s just not meaningful to talk about what a loss function implies, independently of the training data. (Mean squared error of what? Negative log likelihood of what? Finish the sentence!)

This confusion about loss functions seems to be linked to a particular theory of how statistical modeling techniques might be dangerous, in which “outer” training results in the emergence of an “inner” intelligent agent. If you expect that, and you expect intelligent agents to have a “utility function”, you might be inclined to think of “gradient descent” “training” as trying to transfer an outer “loss function” into an inner “utility function”, and perhaps to think that the attempted transfer primarily doesn’t work because “gradient descent” is an insufficiently powerful optimization method.

I guess the emergence of inner agents might be possible? I can’t rule it out. (“Functions” are very general, so I can’t claim that a function approximator could never implement an agent.) Maybe it would happen at some scale?

But taking the technology in front of us at face value, that’s not my default guess at how the machine intelligence transition would go down. If I had to guess, I’d imagine someone deliberately building an agent using function approximators as a critical component, rather than your function approximator secretly having an agent inside of it.

That’s a different threat model! If you’re trying to build a good agent, or trying to prohibit people from building bad agents using coordinated violence (which some authors call “regulation” for some reason), it matters what your threat model is!

(Statistical modeling engineer Jack Gallagher has described his experience of this debate as “like trying to discuss crash test methodology with people who insist that the wheels must be made of little cars, because how else would they move forward like a car does?”)

I don’t know how to build a general agent, but contemporary computing research offers clues as to how function approximators can be composed with other components to build systems that perform cognitive tasks.

Consider AlphaGo and its successor AlphaZero. In AlphaGo, one function approximator is used to approximate a function from board states to move probabilities. Another is used to approximate the function from board states to game outcomes, where the outcome is +1 when one player has certainly won, −1 when the other player has certainly won, and a proportionately intermediate value indicating who has the advantage when the outcome is still uncertain. The system plays both sides of a game, using the board-state-to-move-probability function and board-state-to-game-outcome function as heuristics to guide a search algorithm which some authors call “Monte Carlo tree search”. The board-state-to-move-probability function approximation is improved by adjusting its parameters in the direction opposite the derivative of its cross-entropy with the move distribution found by the search algorithm. The board-state-to-game-outcome function approximation is improved by adjusting its parameters in the direction opposite the derivative of its squared difference with the self-play game’s ultimate outcome.
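The two error metrics named in that description can be written down directly. This sketch covers only the metrics. The function approximators, the search algorithm, and the self-play loop are all omitted, and the function names and example numbers are illustrative assumptions.

```python
import math


def policy_loss(predicted_probs, search_probs):
    """Cross-entropy of the predicted move probabilities with the search's distribution."""
    return -sum(
        target * math.log(predicted)
        for target, predicted in zip(search_probs, predicted_probs)
        if target > 0
    )


def value_loss(predicted_outcome, actual_outcome):
    """Squared difference between the predicted and the ultimate game outcome."""
    return (predicted_outcome - actual_outcome) ** 2


# Hypothetical position with two legal moves; the search favors the first one.
search_probs = [0.7, 0.3]
print(policy_loss([0.5, 0.5], search_probs) > policy_loss([0.7, 0.3], search_probs))  # True
print(value_loss(0.5, 1.0))  # 0.25
```

The cross-entropy is smallest when the predicted probabilities match the search's distribution, so fitting against it pulls the move-probability approximator toward whatever the search found.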

This kind of design is not trivially safe. A similarly superhuman system that operated in the real world (instead of the restricted world of board games) that iteratively improved an action-to-money-in-this-bank-account function seems like it would have undesirable consequences, because if the search discovered that theft or fraud increased the amount of money in the bank account, then the action-to-money function approximator would generalizably steer the system into doing more theft and fraud.

Statistical modeling engineers have a saying: if you’re surprised by what your neural net is doing, you haven’t looked at your training data closely enough. The problem in this hypothetical scenario is not that multi-layer parameterized graphical function approximators are inherently unpredictable, or must necessarily contain a power-seeking consequentialist agent in order to do any useful cognitive work. The problem is that you’re approximating the wrong function and get what you measure. The failure would still occur if the function approximator “generalizes” from its “training” data the way you’d expect. (If you can recognize fraud and theft, it’s easy enough to just not use that data as examples to approximate, but by hypothesis, this system is only looking at the account balance.) This doesn’t itself rule out more careful designs that use function approximators to approximate known-trustworthy processes and don’t search harder than their representation of value can support.

This may be cold comfort to people who anticipate a competitive future in which cognitive automation designs that more carefully respect human values will foreseeably fail to keep up with the frontier of more powerful systems that do search harder. It may not matter to the long-run future of the universe that you can build helpful and harmless language agents today, if your civilization gets eaten by more powerful and unfriendlier cognitive automation designs some number of years down the line. As a humble programmer and epistemology enthusiast, I have no assurances to offer, no principle or theory to guarantee everything will turn out all right in the end. Just a conviction that, whatever challenges confront us in the future, we’ll be in a better position to face them by understanding the problem in as much detail as possible.

Bibliography

Bishop, Christopher M., and Hugh Bishop. 2024. Deep Learning: Foundations and Concepts. Cham: Springer. https://www.bishopbook.com/

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. “Playing Atari with Deep Reinforcement Learning.” https://arxiv.org/abs/1312.5602

Prince, Simon J.D. 2023. Understanding Deep Learning. Cambridge, MA: MIT Press. http://udlbook.com/

Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press.

• This does a great job of importing and translating a set of intuitions from a much more established and rigorous field. However, as with all works framing deep learning as a particular instance of some well-studied problem, it’s vital to keep the context in mind:

Despite literally thousands of papers claiming to “understand deep learning” from experts in fields as various as computational complexity, compressed sensing, causal inference, and—yes—statistical learning, NO rigorous, first-principles analysis has ever computed any aspect of any deep learning model beyond toy settings. ALL published bounds are vacuous in practice.

It’s worth exploring why, despite strong results in their own settings, and despite strong “intuitive” parallels to deep learning, this remains true. The issue is that all these intuitive arguments have holes big enough to accommodate, well, the end of the world. There are several such challenges in establishing a tight correspondence between “classical machine learning” and deep learning, but I’ll focus on one that’s been the focus of considerable effort: defining simplicity.

This notion is essential. If we consider a truly arbitrary function, there is no need for a relationship between the behavior on one input and the behavior on another—the No Free Lunch Theorem. If we want our theory to have content (that is, to constrain the behavior of a Deep Learning system whatsoever) we’ll need to narrow the range of possibilities. Tools from statistical learning like the VC dimension are useless as is due to overparameterization, as you mention. We’ll need a notion of simplicity that captures what sorts of computational structures SGD finds in practice. Maybe circuit size, or minima sharpness, or noise sensitivity… - how hard could it be?

Well no one’s managed it. To help understand why, here are two (of many) barrier cases:

• Sparse parity with noise. For an input bitstring x, y is defined as the xor of a fixed, small subset of indices. E.g., if the indices are 1, 3, 9 and x is 101000001, then y is 1 xor 1 xor 1 = 1. Some small (tending to zero) measurement error is assumed. Though this problem is approximated almost perfectly by extremely small and simple boolean circuits (a log-depth tree of xor gates with inputs on the chosen subset), it is believed to require an exponential amount of computation to predict even marginally better than random! Neural networks require exponential size to learn it in practice.
Deep Learning Fails

• The protein folding problem. Predict the shape of a protein from its amino acid sequence. Hundreds of scientists have spent decades scouring for regularities, and failed. Generations of supercomputers have been built to attempt to simulate the subtle evolution of molecular structure, and failed. Billions of pharmaceutical dollars were invested—hundreds of billions were on the table for success. The data is noisy and multi-modal. Protein language models learn it all anyway.
Deep Learning Succeeds!
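For concreteness, the labeling rule in the sparse parity case can be sketched in a few lines. The 1-based indexing follows the worked example in the comment. The function names and the label-flipping noise model are assumptions.

```python
import random


def sparse_parity(bits, indices):
    """XOR of the bits at the given 1-based positions of a bitstring."""
    result = 0
    for position in indices:
        result ^= int(bits[position - 1])
    return result


def noisy_label(bits, indices, flip_probability, rng):
    """The parity label, flipped with a small probability (assumed noise model)."""
    label = sparse_parity(bits, indices)
    return label ^ 1 if rng.random() < flip_probability else label


print(sparse_parity("101000001", (1, 3, 9)))  # 1
print(sparse_parity("001000001", (1, 3, 9)))  # 0
```

Computing the label is trivial once the index subset is known. The hard part is recovering that subset from noisy examples alone.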

For what notion is the first problem complicated, and the second simple?

Again, without such a notion, statistical learning theory makes no prediction whatsoever about the behavior of DL systems on new examples. If a model someday outputted a sequence of actions which caused the extinction of the human race, we couldn’t object on principle, only say “so power-seeking was simpler after all”. And even with such a notion, we’d still have to prove that Gradient Descent tends to find it in practice and a dozen other difficulties...

Without a precise mathematical framework to which we can defer, we’re left with Empirics to help us choose between a bunch of sloppy, spineless sets of intuitions. Much less pleasant. Still, here’s a few which push me towards Deep Learning as a “computationally general, pattern-finding process” rather than function approximation:

• Neural networks optimized only for performance show surprising alignment with representations in the human brain, even exhibiting 1-1 matches between particular neurons in ANNs and living humans. This is an absolutely unprecedented level of predictivity, despite the models not being designed for such and taking no brain data as input.

• LLMs have been found to contain rich internal structure such as grammatical parse-trees, inference-time linear models, and world models. This sort of mechanistic picture is missing from any theory that considers only I/O.

• Small changes in loss (i.e. function approximation accuracy) have been associated with large qualitative changes in ability and behavior—such as learning to control robotic manipulators using code, productively recurse to subagents, use tools, or solve theory of mind tasks.

I know I’ve written a lot, so I appreciate your reading it. To sum up:

• Despite intuitive links, efforts to apply statistical learning theory to deep learning have failed, and seem to face substantial difficulties

• So we have to resort to experiment, where I feel this intuitive story doesn’t fit the data; I provide some challenge cases

• NO rigorous, first-principles analysis has ever computed any aspect of any deep learning model beyond toy settings

This is false. From the abstract of Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer:

Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call muTransfer: parametrize the target model in muP, tune the HP indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all. We verify muTransfer on Transformer and ResNet. For example, 1) by transferring pretraining HPs from a model of 13M parameters, we outperform published numbers of BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; 2) by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only 7% of total pretraining cost. A Pytorch implementation of our technique can be found at this http URL and installable via pip install mup.

muP comes from a principled mathematical analysis of how different ways of scaling various architectural hyperparameters alongside model width influences activation statistics.

• I was trying to make a more specific point. Let me know if you think the distinction is meaningful -

So there are lots of “semi-rigorous” successes in Deep Learning. One I understand better than muP is good old Xavier initialization. Assuming that the activations at a given layer are normally distributed, we should scale our weights like 1/sqrt(n) so the activations don’t diverge from layer to layer (since the sum of independent normals scales like sqrt(n)). This is exactly true for the first gradient step, but can become false at any later step once the weights are no longer independent and can “conspire” to blow up the activations anyway. So not a “proof” but successful in practice.
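The scaling argument can be checked numerically with a rough simulation. The assumptions: inputs are independent standard normals, weights are freshly drawn independent normals for every sample, and n = 100 inputs feed a single unit. This illustrates only the initialization-time claim, not what happens after training steps.

```python
import random
import statistics


def preactivation_variance(n_inputs, weight_std, n_samples=1000, seed=0):
    """Estimate the variance of sum(w_i * x_i) with fresh random weights and inputs."""
    rng = random.Random(seed)
    samples = []
    for _sample in range(n_samples):
        total = 0.0
        for _input in range(n_inputs):
            total += rng.gauss(0.0, weight_std) * rng.gauss(0.0, 1.0)
        samples.append(total)
    return statistics.pvariance(samples)


n = 100
unscaled = preactivation_variance(n, weight_std=1.0)          # grows roughly like n
scaled = preactivation_variance(n, weight_std=1.0 / n ** 0.5)  # stays roughly 1

print(round(unscaled, 1), round(scaled, 3))
```

With unit-variance weights the variance of the sum comes out near n, so its typical size grows like sqrt(n). Dividing the weights by sqrt(n) brings the variance back to about 1.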

My understanding of muP is similar in that it is “precise” only if certain correlations are well controlled (I’m fuzzy here). But still v. successful in practice. But the proof isn’t airtight and we still need that last step: checking “in practice”.

This is very different from the situation within statistical learning itself, which has many beautiful and unconditional results which we would very much like to port over to deep learning. My central point is that in the absence of a formal correspondence, we have to bridge the gap with evidence. That’s why the last part of my comment was some evidence I think speaks against the intuitions of statistical learning theory. I consider Xavier initialization, muP, and scaling laws, etc. as examples where this bridge was successfully crossed—but still necessary! And so we’re reduced to “arguing over evidence between paradigms” when we’d prefer to “prove results within a paradigm”

• For what notion is the first problem complicated, and the second simple?

I might be out of my depth here, but—could it be that sparse parity with noise is just objectively “harder than it sounds” (because every bit of noise inverts the answer), whereas protein folding is “easier than it sounds” (because if it weren’t, evolution wouldn’t have solved it)?

Just because the log-depth xor tree is small, doesn’t mean it needs to be easy to find, if it can hide amongst vastly many others that might have generated the same evidence … which I suppose is your point. (The “function approximation” frame encourages us to look at the boolean circuit and say, “What a simple function, shouldn’t be hard to noisily approximate”, which is not exactly the right question to be asking.)

• I think it’s also a question of the learning techniques used. It seems like generalizable solutions to xor involve at least one of the two following:

• noticing that many of the variables in the input can be permuted without changing the output and therefore it’s only the counts of the variables that matter

• noticing how the output changes or doesn’t change when you start with one input and flip the variable one at a time

But current neural networks can’t use either of these techniques, partly because it doesn’t align well with the training paradigm. They both kind of require the network to be able to pick the inputs to see the ground truth for, whereas training based on a distribution has the (input, output) pairs randomized.

To “humanize” these problems, we can:

• Break intuitive permutation invariance by using different symbols in each slot. So instead of an input looking like 010110 or 101001 or 111000, it might look like A😈4z👞i or B🔥😱Z😅0 or B😈😱Z😅i.

• Break the ability to notice the effects of single bitflips by just seeing random strings rather than neighboring strings.

This makes it intuitively much harder to me.

• Downvoted because I waded through all those rhetorical shenanigans and I still don’t understand why you didn’t just say what you mean.

• This comment had been apparently deleted by the commenter (the comment display box having a “deleted because it was a little rude, sorry” deletion note in lieu of the comment itself), but the ⋮-menu in the upper-right gave me the option to undelete it, which I did because I don’t think my critics are obligated to be polite to me. (I’m surprised that post authors have that power!) I’m sorry you didn’t like the post.

• I am surprised that you have that affordance. I want to know I can delete my comments and be sure they won’t get read by anyone after I delete them.

• Oh, hmm, this is an edge-case we’ve never run into. The point of giving authors the ability to undelete comments they can delete is so that they can reverse deletions they made (or admins made on their post trying to help them enforce their norms) not the deletions other people made.

I’ll look into fixing the permissions here. Definitely not intended, just a side effect of some other things we tried to do.

• As a deep-learning novice, I found the post charming and informative.

• To me, the lengthy phrases do in fact get closer to “zack saying what zack meant” than the common terms like ‘deep learning’—but, like you, I didn’t really get anything new out of the longer phrases. I believe that people who don’t already think of deep learning as function approximation may get something out of it tho. So in consequence I didn’t downvote or upvote.

• It took me a good while reading this to figure out whether it was a deconstruction of tabooing words. I would have felt less so if the post didn’t keep replacing terms with ones that are both no less charged and also no more descriptive of the underlying system, and then start drawing conclusions from the resulting terms’ aesthetics.

With regards to Yudkowsky’s takes, the key thing to keep in mind is that Yudkowsky started down his path by reasoning backwards from properties ASI would have, not from reasoning forward from a particular implementation strategy. The key reason to be concerned that outer optimization doesn’t define inner optimization isn’t a specific hypothesis about whether some specific strategy with neural networks will have inner optimizers, it’s because ASI will by necessity involve active optimization on things, and we want our alignment techniques to have at least any reason to work in that regime at all.

• I’m not sure what point this post is trying to make exactly. Yes, it’s function approximation; I think we all know that.

When we talk about inner and outer alignment, outer alignment is “picking the correct function to learn.” (When we say “loss,” we mean the loss on a particular task, not the abstract loss function like RMSE.)

Inner alignment is about training a model that generalizes to situations outside the training data.

• I liked how this post tabooed terms and looked at things at lower levels of abstraction than what is usual in these discussions.

I’d compare tabooing to a frame by Tao about how in mathematics you have the pre-rigorous, rigorous and post-rigorous stages. In the post-rigorous stage one “would be able to quickly and accurately perform computations in vector calculus by using analogies with scalar calculus, or informal and semi-rigorous use of infinitesimals, big-O notation, and so forth, and be able to convert all such calculations into a rigorous argument whenever required” (emphasis mine).

Tabooing terms and being able to convert one’s high-level abstractions into mechanistic arguments whenever required seems to be the counterpart in (among others) AI alignment. So, here’s positive reinforcement for taking the effort to try and do that!

Separately, I found the part

(Statistical modeling engineer Jack Gallagher has described his experience of this debate as “like trying to discuss crash test methodology with people who insist that the wheels must be made of little cars, because how else would they move forward like a car does?”)

quite thought-provoking. Indeed, how is talk about “inner optimizers” driving behavior any different from “inner cars” driving the car?

When you train a ML model with SGD—wait, sorry, no. When you try to construct an accurate multi-layer parametrized graphical function approximator, a common strategy is to do small, gradual updates to the current setting of parameters. (Some could call this a random walk or a stochastic process over the set of possible parameter-settings.) Over the construction-process you therefore have multiple intermediate function approximators. What are they like?

The terminology of “function approximators” actually glosses over something important: how is the function computed? We know that it is “harder” to construct some function approximators than others, and depending on the amount of “resources” you simply cannot[1] do a good job. Perhaps a better term would be “approximative function calculators”? Or just anything that stresses that there is some internal process used to convert inputs to outputs, instead of this “just happening”.

This raises the question: what is that internal process like? Unfortunately the texts I’ve read on multi-layer parametrized graphical function approximation have been incomplete in these respects (I hope the new editions will cover this!), so take this merely as a guess. In many domains, most clearly games, it seems like “looking ahead” would be useful for good performance[2]: if I do X, the opponent could do Y, and I could then do Z. Perhaps these approximative function calculators implement even more general forms of search algorithms.

So while searching for accurate approximative function calculators we might stumble upon calculators that are themselves searching for something. How neat is that!

I’m pretty sure that under the hood cars don’t consist of smaller cars or tiny car mechanics—if they did, I’m pretty sure my car building manual would have said something about that.

1. ^

(As usual, assuming standard computational complexity conjectures like P != NP and that one has reasonable lower bounds in finite regimes, too, rather than only asymptotically.)

2. ^

Or, if you don’t like the word “performance”, you may taboo it and say something like “when trying to construct approximative function calculators that are good at playing chess—in the sense of winning against a pro human or a given version of Stockfish—it seems likely that they are, in some sense, ‘looking ahead’ for what happens in the game next; this is such an immensely useful thing for chess performance that it would be surprising if the models did not do anything like that”.

• It’s the same thing for piecewise-linear functions defined by multi-layer parameterized graphical function approximators: the model is the dataset. It’s just not meaningful to talk about what a loss function implies, independently of the training data. (Mean squared error of what? Negative log likelihood of what? Finish the sentence!)

I don’t think this is a confusion, but rather a mere difference in terminology. Eliezer’s notion of “loss function” is equivalent to Zack’s notion of “loss function” curried with the training data. Thus, when Eliezer writes about the network modelling or not modelling the loss function, this would include modelling the process that generated the training data.
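
To make the “curried” reading concrete, here is a minimal sketch in Python. The function names and the four data points are hypothetical, chosen only for illustration: the generic squared-error loss takes parameters and data, while the curried version has the training data baked in and is a function of the parameters alone.

```python
from functools import partial


def squared_error_loss(params, xs, ys):
    """Generic mean squared error of the line y = a*x + b on the data (xs, ys)."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data lying exactly on the line y = 2x + 1.
train_xs = [0.0, 1.0, 2.0, 3.0]
train_ys = [1.0, 3.0, 5.0, 7.0]

# Currying with the data yields a function of the parameters alone.
curried_loss = partial(squared_error_loss, xs=train_xs, ys=train_ys)

print(curried_loss((2.0, 1.0)))  # loss of the line that generated the data
print(curried_loss((0.0, 0.0)))  # loss of a line that ignores the data
```

On this reading, `squared_error_loss` by itself says nothing about any particular task, whereas `curried_loss` carries everything the training data has to say about which parameters are good.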

• The issue seems more complex and subtle to me.

It is fair to say that the loss function (when combined with the data) is a stochastic environment (stochastic due to sampling the data), and the effect of gradient descent is to select a policy (a function out of the function space) which performs very well in this stochastic environment (achieves low average loss).

If we assume the function-approximation achieves the minimum possible loss, then it must be the case that the function chosen is an optimal control policy where the loss function (understood as including the data) is the utility function which the policy is optimal with respect to.

In this framing, both Zack and Eliezer would be wrong:

• Zack would be wrong because there is nothing nonsensical about asking whether the function-approximation “internalizes” the loss. Utility functions are usually understood behaviorally; a linear regression might not “represent” (ie denote) squared-error anywhere, but might still be utility-theoretically optimal with respect to mean-squared error, which is enough for “representation theorems” (the decision-theory thingy) to apply.

• Eliezer would be wrong because his statement that there is no guarantee about representing the loss function would be factually incorrect. At best Eliezer’s point could be interpreted as saying that the representation theorems break down when loss is merely very low rather than perfectly minimal.

But Eliezer (at least in the quote Zack selects) is clearly saying “explicit internal representation” rather than the decision-theoretic “representation theorem” thingy. I think this is because Eliezer is thinking about inner optimization, as Zack also says. When we are trying to apply function-approximation (“deep learning”) to solve difficult problems for us—in particular, difficult problems never seen in the data-set used for training—it makes some sense to suppose that the internal representation will involve nontrivial computations, even “search algorithms” (and importantly, we know of no way to rule this out without crippling the generalization ability of the function-approximation).

So based on this, we could refine the interpretation of Eliezer’s point to be: even if we achieve the minimum loss on the data-set given (and therefore obey decision-theoretic representation-theorems in the stochastic environment created by the loss function combined with the data), there is no particular guarantee that the search procedure learned by the function-approximation is explicitly searching to minimize said loss.

This is significant because of generalization. We actually want to run the approximated-function on new data, with hopes that it does “something appropriate”. (This is what Eliezer means when he says “distribution-shifted environments” in the quote.) This important point is not captured in your proposed reconciliation of Zack and Eliezer’s views.

But then why emphasize (as Eliezer does) that the function approximation does not necessarily internalize the loss function it is trained on? Internalizing said loss function would probably prevent it from doing anything truly catastrophic (because it is not planning for a world any different than the actual training data it has seen). But it does not especially guarantee that it does what we would want it to do. (Because the-loss-function-on-the-given-data is not what we really want; really we want some appropriate generalization to happen!)

I think this is a rhetorical simplification, which is fair game for Zack to try and correct to something more accurate. Whether Eliezer truly had the misunderstanding when writing, I am not sure. But I agree that the statement is, at least, uncareful.

Has Zack succeeded in correcting the issue by providing a more accurate picture? Arguably TurnTrout made the same objection in more detail. He summarizes the whole thing into two points:

1. Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.

2. Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent’s network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.

(Granted, TurnTrout is talking about reward signals rather than loss functions, and this is an important distinction; however, my understanding is that he would say something very similar about loss functions.)

Point #1 appears to strongly agree with at least a major part of Eliezer’s point. To re-quote the List of Lethalities portion Zack quotes in the OP:

Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. [...] This is sufficient on its own [...] to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

However, I think point #2 is similar in spirit to Zack’s objection in the OP. (TurnTrout does not respond to the same exact passage, but has his own post taking issues with List of Lethalities.)

I will call the objection I see in common between Zack and TurnTrout the type error objection. Zack says that of course a line does not “represent” the loss function of a linear regression; why would you even want it to? TurnTrout says that “reward is not the optimization target”—we should think of a reward function as a “chisel” which shapes a policy, rather than thinking of it as the goal we are trying to instill in the policy. In both cases, I understand them as saying that the loss function used for training is an entirely different sort of thing from the goals an intelligent system pursues after training. (The “wheels made of little cars” thing also resembles a type-error objection.)

While I strongly agree that we should not naively assume a reinforcement-learning agent internalizes the reward as its utility function, I think the type-error objection is over-stated, as may be clear from my point about decision-theoretic representation theorems at the beginning.

Reward functions do have the wrong type signature, but neural networks are not actually trained on reward gradients; rather, a loss is defined from the reward in some way. The type signature of the loss function is not wrong; indeed, if training were perfect, then we could conclude that the resulting neural networks would be decision-theoretically perfect at minimizing loss on the training distribution.

What we would not be able to make confident predictions about is what such systems would do outside of the training distribution, where the training procedure has not exercised selection pressure on the behavior of the system. Here, we must instead rely on the generalization power of function-approximation, which (seen through a somewhat bayesian lens) means trusting the system to have the inductive biases which we would want.

• To be clear, I’m definitely pretty sympathetic to TurnTrout’s type error objection. (Namely: “If the agent gets a high reward for ingesting superdrug X, but did not ingest it during training, then we shouldn’t particularly expect the agent to want to ingest superdrug X during deployment, even if it realizes this would produce high reward.”) But just rereading what Zack has written, it seems quite different from what TurnTrout is saying and I still stand by my interpretation of it.

• eg. Zack writes: “obviously the line itself does not somehow contain a representation of general squared-error-minimization”. So in this line fitting example, the loss function, i.e. “general squared-error-minimization” refers to the function , and not .

• And when he asks why one would even want the neural network to represent the loss function, there’s a pretty obvious answer of “well, the loss function contains many examples of outcomes humans rated as good and bad and we figure it’s probably better if the model understands the difference between good and bad outcomes for this application.” But this answer only applies to the curried loss.

I wasn’t trying to sign up to defend everything Eliezer said in that paragraph, especially not the exact phrasing, so can’t reply to the rest of your comment which is pretty insightful.

• In both cases, I understand them as saying that the loss function used for training is an entirely different sort of thing from the goals an intelligent system pursues after training.

I think Turntrout would object to that characterization as it is privileging the hypothesis that you get systems which pursue goals after training. I’m assuming you mean the agent does some sort of EV maximization by “goals an intelligent system pursues”. Though I have a faint suspicion Turntrout would disagree even with a more general interpretation of “pursues goals”.

• It wasn’t obvious beforehand that this would work! You’d expect that if your function approximator has more parameters than you have example input–output pairs, it would overfit, implementing a complicated function that reproduced the example input–output pairs but outputted crazy nonsense for other choices of x—the more expressive function approximator proving useless for the lack of evidence to pin down the correct approximation.

And that is what we see for function approximators with only slightly more parameters than example input–output pairs, but for sufficiently large function approximators, the trend reverses and “generalization” improves—the more expressive function approximator proving useful after all, as it admits algorithmically simpler functions that fit the example pairs.

One way I think of this that makes it more intuitive:

You can understand mean squared error as implying that your regression equation has a Gaussian-distributed error term $$\varepsilon$$:

$Y = \beta X + \varepsilon$
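
As a sketch of why that holds, assume the errors are independent with a fixed variance $$\sigma^2$$ (the symbols $$\sigma$$, $$n$$, $$x_i$$ and $$y_i$$ are introduced here for illustration). The negative log likelihood of $$n$$ example pairs is then

$-\log p(y_1, \dots, y_n \mid x_1, \dots, x_n, \beta) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta x_i)^2 + \frac{n}{2} \log(2\pi\sigma^2)$

The second term does not depend on $$\beta$$, so minimizing the negative log likelihood over $$\beta$$ is the same as minimizing the sum of squared errors, and therefore the mean squared error.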

In simple statistical models, this $$\varepsilon$$ latent is not explicitly estimated, but instead eliminated by pumping up the sample size so much that it can be averaged away. This only works when you have far fewer parameters than data points.

Overparameterized models, on the other hand, use the “texture” of the datapoint (as in, rare, “high-frequency” “noise” characteristics that allow you to relatively uniquely pick out the datapoint from the dataset) to identify the datapoint, and shove the $$\varepsilon$$ term into that “texture”. (You can see this explicitly in 1D regression problems, where the overparameterized models have a sharp deviation from the trend line at each data point.)
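
A toy version of that 1D picture can be sketched with NumPy. This is an illustrative construction, not a claim about how real networks are parameterized: the data, the bump width, and the `features` helper are all hypothetical. The model has one trend coefficient plus one very narrow bump per training point, so it has more parameters than data points, and `np.linalg.lstsq` returns the minimum-norm exact fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D data: a linear trend plus Gaussian noise.
n = 8
x = np.linspace(0.0, 1.0, n)
y = 2.0 * x + rng.normal(scale=0.3, size=n)


def features(t, centers, width=0.01):
    """One trend column plus one narrow Gaussian bump per training point."""
    t = np.asarray(t, dtype=float)
    bumps = np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width**2))
    return np.hstack([t[:, None], bumps])


# n + 1 parameters for n data points; lstsq picks the minimum-norm solution.
theta, *_ = np.linalg.lstsq(features(x, x), y, rcond=None)

# The fit passes through every training point...
train_residual = features(x, x) @ theta - y

# ...but halfway between training points the bumps have died away,
# so the prediction sits on the fitted trend line theta[0] * t.
midpoints = (x[:-1] + x[1:]) / 2
gap_from_trend = features(midpoints, x) @ theta - theta[0] * midpoints

print(np.max(np.abs(train_residual)))
print(np.max(np.abs(gap_from_trend)))
```

In this sketch the bump coefficients absorb whatever the trend term leaves unexplained at each training point, which is the “sharp deviation from the trend line at each data point” described above.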

I think this works under quite general conditions. I vaguely remember there was a paper that showed that if you train a NN on random labels, then it “learns to memorize” in some sense (can’t remember what sense exactly but I think it was something like, it could more quickly be trained on a different random labelling). I suspect this is because it learns to emphasize the “textures” of the datapoints in the training set, finding features that more uniquely distinguish datapoints in the future.

Real NNs obviously also learn real structure, not just $$\varepsilon$$, but I assume they do some mixture where at least part of what they learn is just arbitrary features that distinguish unique datapoints.

• I think this is a strong starting point but I think the nice crisp “neural net = function approximator” mostly falls apart as a useful notion when you do fancy stuff with your neural net like active learning or RLAIF. Maybe it’s not technically the neural net doing that...

I guess we don’t have great terms to delineate all these levels of the system:

• code that does a forward pass (usually implicitly also describing backward pass & update given loss)

• code that does that plus training (ie data fetch and loss function)

• that plus RL environment or training set

• that plus “training scaffolding” code that eg will do active learning or restart the game if it freezes

• just code & weights for a forward pass during inference (presuming that the system has separable training and inference stages)

• that plus all the “inference scaffolding” code which will eg do censorship or internet search or calculator integration

• that plus the “inference UI”. (Consider how differently people use gpt4 api vs chatgpt website.) (This could also eg be the difference between clicking a checkbox for who to kill and clicking the cancel button on the notification!)

• the actual final system turned on in the wild with adversarial users and distribution shift and so on

I wonder if some folks are talking past each other by implicitly referring to different items above...

• Other people were commending your tabooing of words, but I feel using terms like “multi-layer parameterized graphical function approximator” fails to do that, and makes matters worse because it leads to non-central fallacy-ing. It would have been more appropriate to use a term like “magic” or “blipblop”. Calling something a function approximator leads to readers carrying a lot of associations into their interpretation that probably don’t apply to deep learning, as deep learning is a very specific example of function approximation that deviates from the prototypical examples in many respects. (I think when you say “function approximator” the image that pops into most people’s heads is fitting a polynomial to a set of datapoints in R^2.)

Calling something a function approximator is only meaningful if you make a strong argument for why a function approximator can’t (or at least is systematically unlikely to) give rise to specific dangerous behaviors or capabilities. But I don’t see you giving such arguments in this post. Maybe I did not understand it. In either case, you can read posts like Gwern’s “Tools want to be agents” or Yudkowsky’s writings, explaining why goal directed behavior is a reasonable thing to expect to arise from current ML, and you can replace every instance of “neural network” / “AI” with “multi-layer parameterized graphical function approximator”, and I think you’ll find that all the arguments make just as much sense as they did before. (Modulo some associations seeming strange, but like I said, I think that’s because there is some non-central fallacying going on.)

• That deviates from the prototypical examples in many respects.

It basically proves too much because it’s equivocation. I am struggling to find anything in Zack’s post which is not just the old wine of the “just” fallacy in new ‘function approximation’ skins. When someone tells you that a LLM is “just” next token prediction, or a neural network is “just some affine layers with nonlinearities” or it’s “just a Markov chain with a lot of statistics”, then you’ve learned more about the power and generality of ‘next token prediction’ etc than you have what they were trying to debunk.

If I use the mean squared error loss function to approximate a set of data points in the plane with a line (which some authors call a “linear regression model” for some reason), obviously the line itself does not somehow contain a representation of general squared-error-minimization. The line is just a line.

I don’t think that is obvious at all, and is roughly on the level of saying ‘a tiger is just a set of atoms in a 3D volume’ or ‘programs are just a list of bits’. What are these data points on this hyperplane, exactly...? They could be anything—they could be, say, embeddings of optimization algorithms*. If you had an appropriate embedding of the latent space of algorithms, why can’t there be a point (or any other kind of object or structure, such as a line) which corresponds to general squared-error minimization or others? And this doesn’t seem ‘possible’, this seems likely: already an existing LLM like GPT-4 or Claude-3 Opus is clearly mapping its few-shot examples to some sort of latent space, appears to be doing internal gradient descent on higher level embeddings or manifolds, and is quite effective at writing things like ‘here are 10 variants of error minimization optimization algorithms as examples; write a Python program using Numpy to write a new one which [...]’ (and if it is not, it sure seems like future ones will); something inside that LLM must correspond to a small number of bytes and some sort of mathematical object and represent the combination of the points. Programs and algorithms are ‘just’ data points, like everything else. (“With an appropriate encoding, an AGI is just an index into the decimal expansion of pi; an index, a simple ordinary natural number, is obviously completely harmless; QED, AGI is completely harmless.”) Which means that if you think something is ‘just’ data, then your assertions are meaninglessly vacuous because if it applies to everything, then it means nothing.

* Or more relevantly, RL algorithms such as specific agents… of course a line or point on an appropriate hyperplane can ‘represent’ the loss function, and it would be a useful thing to do so. Why would you ever want a system to not be able to do that? We can see in Decision Transformers and research into the meta-reinforcement learning behavior of LLMs that large LLMs prompted with reward functions & scenarios, trained by imitation learning at scale across countless agents, do what must be exactly that, and represent environments and rewards, and can eg. explicitly write the source code to implement reward functions for RL training of sub-agents. I think it’s telling that these results generally do not come up in these posts about how ‘reward is not the optimization target’.

• I am struggling to find anything in Zack’s post which is not just the old wine of the “just” fallacy [...] learned more about the power and generality of ‘next token prediction’ etc than you have what they were trying to debunk.

I wouldn’t have expected you to get anything out of this post!

Okay, if you project this post into a one-dimensional “AI is scary and mysterious” vs. “AI is not scary and not mysterious” culture war subspace, then I’m certainly writing in a style that mood-affiliates with the latter. The reason I’m doing that is because the picture of what deep learning is that I got from being a Less Wrong-er felt markedly different from the picture I’m getting from reading the standard textbooks, and I’m trying to supply that diff to people who (like me-as-of-eight-months-ago, and unlike Gwern) haven’t read the standard textbooks yet.

I think this is a situation where different readers need to hear different things. I’m sure there are grad students somewhere who already know the math and could stand to think more about what its power and generality imply about the future of humanity or lack thereof. I’m not particularly well-positioned to help them. But I also think there are a lot of people on this website who have a lot of practice pontificating about the future of humanity or lack thereof, who don’t know that Simon Prince and Christopher Bishop don’t think of themselves as writing about agents. I think that’s a problem! (One which I am well-positioned to help with.) If my attempt to remediate that particular problem ends up mood-affiliating with the wrong side of a one-dimensional culture war, maybe that’s because the one-dimensional culture war is crazy and we should stop doing it.

• but for sufficiently large function approximators, the trend reverses

• Transformers/deep learning work because of built-in regularization methods (like dropout layers) and not because “the trend reverses”. If you did naive “best fit polynomial” with a 7 billion parameter polynomial you would not get a good result.

• It may also be worth adding that transformers aren’t piecewise linear. A self-attention layer dynamically constructs pathways for information to flow through, which is very nonlinear.
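
A small NumPy sketch can illustrate the contrast, under some simplifying assumptions: the weights are random, both models are bias-free, the attention is single-head, and all names are hypothetical. A bias-free ReLU network is piecewise linear with every piece passing through the origin, so doubling the input doubles the output. In self-attention the mixing weights are themselves computed from the input, so the same scaling test fails.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))


def relu_mlp(X):
    """Bias-free two-layer ReLU network: piecewise linear in X."""
    return np.maximum(X @ W1, 0.0) @ W2


def attention_weights(X):
    """Row-wise softmax of scaled query-key scores; computed from X itself."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)


def self_attention(X):
    """Single-head self-attention: input-dependent mixture of value vectors."""
    return attention_weights(X) @ (X @ Wv)


X = 0.5 * rng.normal(size=(3, d))  # three hypothetical token vectors

# Doubling the input doubles the ReLU network's output...
print(np.allclose(relu_mlp(2 * X), 2 * relu_mlp(X)))
# ...but not the attention output, because the mixing weights change too.
print(np.allclose(self_attention(2 * X), 2 * self_attention(X)))
```

The scaling test is only one symptom, not a proof of piecewise linearity, but it shows where the extra nonlinearity enters: the softmax weights that route information between tokens are a function of the tokens themselves.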