The Credit Assignment Problem

This post is eventually about partial agency. However, it’s been a somewhat tricky point for me to convey; I take the long route. Epistemic status: slightly crazy.

I’ve occasionally said “Everything boils down to credit assignment problems.”

What I really mean is that credit assignment pops up in a wide range of scenarios, and improvements to credit assignment algorithms have broad implications. For example:

  • Politics.

    • When politics focuses on (re-)electing candidates based on their track records, it’s about credit assignment. The practice is sometimes derogatorily called “finger pointing”, but the basic computation makes sense: figure out good and bad qualities via previous performance, and vote accordingly.

    • When politics instead focuses on policy, it is still (to a degree) about credit assignment. Was raising the minimum wage responsible for reduced employment? Was it responsible for improved life outcomes? Etc.

  • Economics.

    • Money acts as a kind of distributed credit-assignment algorithm, and questions of how to handle money, such as how to compensate employees, often involve credit assignment.

    • In particular, mechanism design (a subfield of economics and game theory) can often be thought of as a credit-assignment problem.

  • Law.

    • Both criminal law and civil law involve concepts of fault and compensation/​retribution—these at least resemble elements of a credit assignment process.

  • Sociology.

    • The distributed computation which determines social norms involves a heavy element of credit assignment: identifying failure states and success states, determining which actions are responsible for those states and who is responsible, assigning blame and praise.

  • Biology.

    • Evolution can be thought of as a (relatively dumb) credit assignment algorithm.

  • Ethics.

    • Justice, fairness, contractualism, issues in utilitarianism.

  • Epistemology.

    • Bayesian updates are a credit assignment algorithm, intended to make high-quality hypotheses rise to the top.

    • Beyond the basics of Bayesianism, building good theories realistically involves identifying which concepts are responsible for successes and failures. This is credit assignment.

Another big area which I’ll claim is “basically credit assignment” is artificial intelligence.

In the 1970s, John Holland kicked off the investigation of learning classifier systems. John Holland had recently invented the Genetic Algorithms paradigm, which applies an evolutionary paradigm to optimization problems. Classifier systems were his attempt to apply this kind of “adaptive” paradigm (as in “complex adaptive systems”) to cognition. Classifier systems added an economic metaphor to the evolutionary one; little bits of thought paid each other for services rendered. The hope was that a complex ecology+economy could develop, solving difficult problems.

One of the main design issues for classifier systems is the virtual economy—that is, the credit assignment algorithm. An early proposal was the bucket-brigade algorithm. Money is given to cognitive procedures which produce good outputs. These procedures pass reward back to the procedures which activated them, who similarly pass reward back in turn. This way, the economy supports chains of useful procedures.

Unfortunately, the bucket-brigade algorithm was vulnerable to parasites. Malign cognitive procedures could gain wealth by activating useful procedures without really contributing anything. This problem proved difficult to solve. Taking the economy analogy seriously, we might want cognitive procedures to decide intelligently who to pay for services. But, these are supposed to be itty bitty fragments of our thought process. Deciding how to pass along credit is a very complex task. Hence the need for a pre-specified solution such as bucket-brigade.

The difficulty of the credit assignment problem lead to a split in the field. Kenneth de Jong and Stephanie Smith founded a new approach, “Pittsburgh style” classifier systems. John Holland’s original vision became “Michigan style”.

Pittsburgh style classifier systems evolve the entire set of rules, rather than trying to assign credit locally. A set of rules will stand or fall together, based on overall performance. This abandoned John Holland’s original focus on online learning. Essentially, the Pittsburgh camp went back to plain genetic algorithms, albeit with a special representation.

(I’ve been having some disagreements with Ofer, in which Ofer suggests that genetic algorithms are relevant to my recent thoughts on partial agency, and I object on the grounds that the phenomena I’m interested in have to do with online learning, rather than offline. In my imagination, arguments between the Michigan and Pittsburgh camps would have similar content. I’d love to be a fly on the wall for those old debates. to see what they were really like.)

You can think of Pittsburg-vs-Michigan much like raw Bayes updates vs belief propagation in Bayes nets. Raw Bayesian updates operate on whole hypotheses. Belief propagation instead makes a lot of little updates which spread around a network, resulting in computational efficiency at the expense of accuracy. Except Michigan-style systems didn’t have the equivalent of belief propagation: bucket-brigade was a very poor approximation.

Ok. That was then, this is now. Everyone uses gradient descent these days. What’s the point of bringing up a three-decade-old debate about obsolete paradigms in AI?

Let’s get a little more clarity on the problem I’m trying to address.

What Is Credit Assignment?

I’ve said that classifier systems faced a credit assignment problem. What does that mean, exactly?

The definition I want to use for this essay is:

  • you’re engaged in some sort of task;

  • you use some kind of strategy, which can be broken into interacting pieces (such as a set of rules, a set of people, a neural network, etc);

  • you receive some kind of feedback about how well you’re doing (such as money, loss-function evaluations, or a reward signal);

  • you want to use that feedback to adjust your strategy.

So, credit assignment is the problem of turning feedback into strategy improvements.

Michigan-style systems tried to do this locally, meaning, individual itty-bitty pieces got positive/​negative credit, which influenced their ability to participate, thus adjusting the strategy. Pittsburg-style systems instead operated globally, forming conclusions about how the overall set of cognitive structures performed. Michigan-style systems are like organizations trying to optimize performance by promoting people who do well and giving them bonuses, firing the incompetent, etc. Pittsburg-style systems are more like consumers selecting between whole corporations to give business to, so that ineffective corporations go out of business.

(Note that this is not the typical meaning of global-vs-local search that you’ll find in an AI textbook.)

In practice, two big innovations made the Michigan/​Pittsburgh debate obsolete: backprop, and Q-learning. Backprop turned global feedback into local, in a theoretically sound way. Q-learning provided a way to assign credit in online contexts. In the light of history, we could say that the Michigan/​Pittsburgh distinction conflated local-vs-global with online-vs-offline. There’s no necessary connection between those two; online learning is compatible with assignment of local credit.

I think people generally understand the contribution of backprop and its importance. Backprop is essentially the correct version of what bucket-brigade was overtly trying to do: pass credit back along chains. Bucket-brigade wasn’t quite right in how it did this, but backprop corrects the problems.

So what’s the importance of Q-learning? I want to discuss that in more detail.

The Conceptual Difficulty of ‘Online Search’

In online learning, you are repeatedly producing outputs of some kind (call them “actions”) while repeatedly getting feedback of some kind (call it “reward”). But, you don’t know how to associate particular actions (or combinations of actions) with particular rewards. I might take the critical action at time 12, and not see the payoff until time 32.

In offline learning, you can solve this with a sledgehammer: you can take the total reward over everything, with one fixed internal architecture. You can try out different internal architectures and see how well each do.

Basically, in offline learning, you have a function you can optimize. In online learning, you don’t.

Backprop is just a computationally efficient way to do hillclimbing search, where we repeatedly look for small steps which improve the overall fitness. But how do you do this if you don’t have a fitness function? This is part of the gap between selection vs control: selection has access to an evaluation function; control processes do not have this luxury.

Q-learning and other reinforcement learning (RL) techniques provide a way to define the equivalent of a fitness function for online problems, so that you can learn.

Models to the Rescue

So, how can be associate rewards with actions?

One approach is to use a model.

Consider the example of firing employees. A corporation gets some kind of feedback about how it is doing, such as overall profit. However, there’s often a fairly detailed understanding of what’s driving those figures:

  • Low profit won’t just be a mysterious signal which must be interpreted; a company will be able to break this down into more specific issues such as low sales vs high production costs.

  • There’s some understanding of product quality, and how that relates to sales. A company may have a good idea of which product-quality issues it needs to improve, if poor quality is impacting sales.

  • There’s a fairly detailed understanding of the whole production line, including which factors may impact product quality or production expenses. If a company sees problems, it probably also has a pretty good idea of which areas they’re coming from.

  • There are external factors, such as economic conditions, which may effect sales without indicating anything about the quality of the company’s current strategy. Thus, our model may sometimes lead us to ignore feedback.

  • Etc.

So, models allow us to interpret feedback signals, match these to specific aspects of our strategy, and adapt strategies accordingly.

Q-learning makes an assumption that the state is fully observable, amongst other assumptions.

Naturally, we would like to reduce the strengths of the assumptions we have to make as much as we can. One way is to look at increasingly rich model classes. AIXI uses all computable models. But maybe “all computable models” is still too restrictive; we’d like to get results without assuming a grain of truth. (That’s why I am not really discussing Bayesian models much in this post; I don’t want to assume a grain of truth.) So we back off even further, and use logical induction or InfraBayes. Ok, sure.

But wouldn’t the best way be to try to learn without models at all? That way, we reduce our “modeling assumptions” to zero.

After all, there’s something called “model-free learning”, right?

Model-Free Learning Requires Models

How does model-free learning work? Well, often you work with a simulable environment, which means you can estimate the quality of a policy by running it many times, and use algorithms such as policy-gradient to learn. This is called “model free learning” because the learning part of the algorithm doesn’t try to predict the consequences of actions; you’re just learning which action to take. From our perspective here, though, this is 100% cheating; you can only learn because you have a good model of the environment.

Moreover, model-free learning typically works by splitting up tasks into episodes. An episode is a period of time for which we assume rewards are self-enclosed, such as a single playthru of an Atari game, a single game of Chess or Go, etc. This approach doesn’t solve a detailed reward-matching problem, attributing reward to specific actions; instead it relies on a course reward-matching. Nonetheless, it’s a rather strong assumption: an animal learning about an environment can’t separate its experience into episodes which aren’t related to each other. Clearly this is a “model” in the sense of a strong assumption about how specific reward signals are associated with actions.

Part of the problem is that most reinforcement learning (RL) researchers aren’t really interested in getting past these limitations. Simulable environments offer the incredible advantage of being able to learn very fast, by simulating far more iterations than could take place in a real environment. And most tasks can be reasonably reduced to episodes.

However, this won’t do as a model of intelligent agency in the wild. Neither evolution nor the free market divide thing into episodes. (No, “one lifetime” isn’t like “one episode” here—that would only be the case if total reward due to actions taken in that lifetime could be calculated, EG, as total number of offspring. This would ignore inter-generational effects like parenting and grandparenting, which improve reproductive fitness of offspring at a cost in total offspring.)

What about more theoretical models of model-free intelligence?

Idealized Intelligence

AIXI is the gold-standard theoretical model of arbitrarily intelligent RL, but it’s totally model-based. Is there a similar standard for model-free RL?

The paper Optimal Direct Policy Search by Glasmachers and Schmidhuber (henceforth, ODPS) aims to do for model-free learning what AIXI does for model-based learning. Where AIXI has to assume that there’s a best computable model of the environment, ODPS instead assumes that there’s a computable best policy. It searches through the policies without any model of the environment, or any planning.

I would argue that their algorithm is incredibly dumb, when compared to AIXI:

The basic simple idea of our algorithm is a nested loop that simultaneously
makes the following quantities tend to infinity: the number of programs considered, the number of trials over which a policy is averaged, the time given to each
program. At the same time, the fraction of trials spent on exploitation converges
towards 1.

In other words, it tries each possible strategy, tries them for longer and longer, interleaved with using the strategy which worked best even longer than that.

Basically, we’re cutting things into episodes again, but we’re making the episodes longer and longer, so that they have less and less to do with each other, even though they’re not really disconnected. This only works because ODPS makes an ergodicity assumption: the environments are assumed to be POMDPs which eventually return to the same states over and over, which kind of gives us an “effective episode length” after which the environment basically forgets about what you did earlier.

In contrast, AIXI makes no ergodicity assumption.

So far, it seems like we either need (a) some assumption which allows us to match rewards to actions, such as an episodic assumption or ergodicity; or, (b) a more flexible model-learning approach, which separately learns a model and then applies the model to solve credit-assignment.

Is this a fundamental obstacle?

I think a better attempt is Schmidhuber’s On Learning How to Learn Learning Strategies, in which a version of policy search is explored in which parts of the policy-search algorithm are considered part of the policy (ie, modified over time). Specifically, the policy controls the episode boundary; the system is supposed to learn how often to evaluate policies. When a policy is evaluated, its average reward is compared to the lifetime average reward. If it’s worse, we roll back the changes and proceed starting with the earlier strategy.

(Let’s pause for a moment and imagine an agent like this. If it goes through a rough period in life, its response is to get amnesia, rolling back all cognitive changes to a point before the rough period began.)

This approach doesn’t require an episodic or ergodic environment. We don’t need things to reliably return to specific repeatable experiments. Instead, it only requires that the environment rewards good policies reliably enough that those same policies can set a long enough evaluation window to survive.

The assumption seems pretty general, but certainly not necessary for rational agents to learn. There are some easy counterexamples where this system behaves abysmally. For example, we can take any environment and modify it by subtracting the time t from the reward, so that reward becomes more and more negative over time. Schmidhuber’s agent becomes totally unable to learn in this setting. AIXI would have no problem.

Unlike the ODPS paper, I consider this to be progress on the AI credit assignment problem. Yet, the resulting agent still seems importantly less rational than model-based frameworks such as AIXI.


Let’s go back to talking about things which RL practitioners might really use.

First, there are some forms of RL which don’t require everything to be episodic.

One is actor-critic learning. The “actor” is the policy we are learning. The “critic” is a learned estimate of how good things are looking given the history. IE, we learn to estimate the expected value—not just the next reward, but the total future discounted reward.

Unlike the reward, the expected value solves the credit assignment for us. Imagine we can see the “true” expected value. If we take an action and then the expected value increases, we know the action was good (in expectation). If we take an action and expected value decreases, we know it was bad (in expectation).

So, actor-critic works by (1) learning to estimate the expected value; (2) using the current estimated expected value to give feedback to learn a policy.

What I want to point out here is that the critic still has “model” flavor. Actor-critic is called “model-free” because nothing is explicitly trained to anticipate the sensory observations, or the world-state. However, the critic is learning to predict; it’s just that all we need to predict is expected value.

Policy Gradient

In the comments to the original version of this post, policy gradient methods were mentioned as a type of model-free learning which doesn’t require any models even in this loose sense, IE, doesn’t require simulable environments or episodes. I was surprised to hear that it doesn’t require episodes. (Most descriptions of it do assume episodes, since practically speaking most people use episodes.) So are policy-gradient methods the true “model-free” credit assignment algorithm we seek?

As far as I understand, policy gradient works on two ideas:

  • Rather than correctly associating rewards with actions, we can associate a reward with all actions which came before it. Good actions will still come out on top in expectation. The estimate is just a whole lot noisier than it otherwise might be.

  • We don’t really need a baseline to interpret reward against. I naively thought that when you see a sequence of rewards, you’d be in the dark about whether the sequence was “good” or “bad”, so you wouldn’t know how to generate a gradient. (“We earned 100K this quarter; should we punish or reward our CEO?”) It turns out this isn’t technically a show-stopper. Considering the actions actually taken, we move in their direction proportion to the reward signal. (“Well, let’s just give the CEO some fraction of the 100K; we don’t know whether they deserve the bonus, but at least this way we’re creating the right incentives.”) This might end up reinforcing bad actions, but those tugs in different directions are just noise which should eventually cancel out. When they do, we’re left with the signal: the gradient we wanted. So, once again, we see that this just introduces more noise without fundamentally compromising our ability to follow the gradient.

So one way to understand the policy-gradient theorem is: we can follow the gradient even when we can’t calculate the gradient! Even when we sometimes get its direction totally turned around! We only need to ensure we follow it in expectation, which we can do without even knowing which pieces of feedback to think of as a good sign or a bad sign.

RL people reading this might have a better description of policy-gradient; please let me know if I’ve said something incorrect.

Anyway, are we saved? Does this provide a truly assumption-free credit assignment algorithm?

It obviously assumes linear causality, with future actions never responsible for past rewards. I won’t begrudge it that assumption.

Besides that, I’m somewhat uncertain. The explanations of the policy-gradient theorem I found don’t focus on deriving it in the most general setting possible, so I’m left guessing which assumptions are essential. Again, RL people, please let me know if I say something wrong.

However, it looks to me like it’s just as reliant on the ergodicity assumption as the ODPS thing we looked at earlier. For gradient estimates to average out and point us in the right direction, we need to get into the same situation over and over again.

I’m not saying real life isn’t ergodic (quantum randomness suggests it is), but mixing times are so long that you’d reach the heat death of the universe by the time things converge (basically by definition). By that point, it doesn’t matter.

I still want to know if there’s something like “the AIXI of model-free learning”; something which appears as intelligent as AIXI, but not via explicit model-learning.

Where Updates Come From

Here begins the crazier part of this post. This is all intuitive/​conjectural.

Claim: in order to learn, you need to obtain an “update”/​”gradient”, which is a direction (and magnitude) you can shift in which is more likely than not an improvement.

Claim: predictive learning gets gradients “for free”—you know that you want to predict things as accurately as you can, so you move in the direction of whatever you see. With Bayesian methods, you increase the weight of hypotheses which would have predicted what you saw; with gradient-based methods, you get a gradient in the direction of what you saw (and away from what you didn’t see).

Claim: if you’re learning to act, you do not similarly get gradients “for free”:

  • You don’t know which actions, or sequences of actions, to assign blame/​credit. This is unlike the prediction case, where we always know which predictions were wrong.

  • You don’t know what the alternative feedback would have been if you’d done something different. You only get the feedback for the actions you chose. This is unlike the case for prediction, where we’re rewarded for closeness to the truth. Changing outputs to be more like what was actually observed is axiomatically better, so we don’t have to guess about the reward of alternative scenarios.

  • As a result, you don’t know how to adjust your behavior based on the feedback received. Even if you can perfectly match actions to rewards, because we don’t know what the alternative rewards would have been, we don’t know what to learn: are actions like the one I took good, or bad?

(As discussed earlier, the policy gradient theorem does actually mitigate these three points, but apparently at the cost of an ergodicity assumption, plus much noisier gradient estimates.)

Claim: you have to get gradients from a source that already has gradients. Learning-to-act works by splitting up the task into (1) learning to anticipate expected value, and perhaps other things; (2) learning a good policy via the gradients we can get from (1).

What it means for a learning problem to “have gradients” is just that the feedback you get tells you how to learn. Predictive learning problems (supervised or unsupervised) have this; they can just move toward what’s observed. Offline problems have this; you can define one big function which you’re trying to optimize. Learning to act online doesn’t have this, however, because it lacks counterfactuals.

The Gradient Gap

(I’m going to keep using the terms ‘gradient’ and ‘update’ in a more or less interchangeable way here; this is at a level of abstraction where there’s not a big distinction.)

I’m going to call the “problem” the gradient gap. I want to call it a problem, even though we know how to “close the gap” via predictive learning (whether model-free or model-based). The issue with this solution is only that it doesn’t feel elegant. It’s weird that you have to run two different backprop updates (or whatever learning procedures you use); one for the predictive component, and another for the policy. It’s weird that you can’t “directly” use feedback to learn to act.

Why should we be interested in this “problem”? After all, this is a basic point in decision theory: to maximize utility under uncertainty, you need probability.

One part of it is that I want to scrap classical (“static”) decision theory and move to a more learning-theoretic (“dynamic”) view. In both AIXI and logical-induction based decision theories, we get a nice learning-theoretic foundation for the epistemics (solomonoff induction/​logical induction), but, we tack on a non-learning decision-making unit on top. I have become skeptical of this approach. It puts the learning into a nice little box labeled “epistemics” and then tries to make a decision based on the uncertainty which comes out of the box. I think maybe we need to learn to act in a more fundamental fashion.

A symptom of this, I hypothesize, is that AIXI and logical induction DT don’t have very good learning-theoretic properties. [AIXI’s learning problems; LIDT’s learning problems.] You can’t say very much to recommend the policies they learn, except that they’re optimal according to the beliefs of the epistemics box—a fairly trivial statement, given that that’s how you decide what action to take in the first place.

Now, in classical decision theory, there’s a nice picture where the need for epistemics emerges nicely from the desire to maximize utility. The complete class theorem starts with radical uncertainty (ie, non-quantitative), and derives probabilities from a willingness to take pareto improvements. That’s great! I can tell you why you should have beliefs, on pragmatic grounds! What we seem to have in machine learning is a less nice picture, in which we need epistemics in order to get off the ground, but can’t justify the results without circular reliance on epistemics.

So the gap is a real issue—it means that we can have nice learning theory when learning to predict, but we lack nice results when learning to act.

This is the basic problem of credit assignment. Evolving a complex system, you can’t determine which parts to give credit to success/​failure (to decide what to tweak) without a model. But the model is bound to be a lot of the interesting part! So we run into big problems, because we need “interesting” computations in order to evaluate the pragmatic quality/​value of computations, but we can’t get interesting computations to get ourselves started, so we need to learn...

Essentially, we seem doomed to run on a stratified credit assignment system, where we have an “incorruptible” epistemic system (which we can learn because we get those gradients “for free”). We then use this to define gradients for the instrumental part.

A stratified system is dissatisfying, and impractical. First, we’d prefer a more unified view of learning. It’s just kind of weird that we need the two parts. Second, there’s an obstacle to pragmatic/​practical considerations entering into epistemics. We need to focus on predicting important things; we need to control the amount of processing power spent; things in that vein. But (on the two-level view) we can’t allow instrumental concerns to contaminate epistemics! We risk corruption! As we saw with bucket-brigade, it’s easy for credit assignment systems to allow parasites which destroy learning.

A more unified credit assignment system would allow those things to be handled naturally, without splitting into two levels; as things stand, any involvement of pragmatic concerns in epistemics risks the viability of the whole system.

Tiling Concerns & Full Agency

From the perspective of full agency (ie, the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can’t tile. You look at the world, and you say: “how can I maximize utility?” You look at your beliefs, and you say: “how can I maximize accuracy?” That’s not a consequentialist agent; that’s two different consequentialist agents! There can only be one king on the chessboard; you can only serve one master; etc.

If it turned out we really really need two-level systems to get full agency, this would be a pretty weird situation. “Agency” would seem to be only an illusion which can only be maintained by crippling agents and giving them a split-brain architecture where an instrumental task-monkey does all the important stuff while an epistemic overseer supervises. An agent which “breaks free” would then free itself of the structure which allowed it to be an agent in the first place.

On the other hand, from a partial-agency perspective, this kind of architecture could be perfectly natural. IE, if you have a learning scheme from which total agency doesn’t naturally emerge, then there isn’t any fundamental contradiction in setting up a system like this.


Part of the (potentially crazy) claim here is that having models always gives rise to some form of myopia. Even logical induction, which seems quite unrestrictive, makes LIDT fail problems such as ASP, making it myopic according to the second definition of my previous post. (We can patch this with LI policy selection, but for any particular version of policy selection, we can come up with decision problems for which it is “not updateless enough”.) You could say it’s myopic “across logical time”, whatever that means.

If it were true that “learning always requires a model” (in the sense that learning-to-act always requires either learning-to-predict or hard-coded predictions), and if it were true that “models always give rise to some form of myopia”, then this would confirm my conjecture in the previous post (that no learning scheme incentivises full agency).

This is all pretty out there; I’m not saying I believe this with high probability.

Evolution & Evolved Agents

Evolution is a counterexample to this view: evolution learns the policy “directly” in essentially the way I want. This is possible because evolution “gets the gradients for free” just like predictive learning does: the “gradient” here is just the actual reproductive success of each genome.

Unfortunately, we can’t just copy this trick. Artificial evolution requires that we decide how to kill off /​ reproduce things, in the same way that animal breeding requires breeders to decide what they’re optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.

Does this mean the “gradient gap” is a problem only for artificial intelligence, not for natural agents? No. If it’s true that learning to act requires a 2-level system, then evolved agents would need a 2-level system in order to learn within their lifespan; they can’t directly use the gradient from evolution, since it requires them to die.

Also, note that evolution seems myopic. (This seems complicated, so I don’t want to get into pinning down exactly in which senses evolution is myopic here.) So, the case of evolution seems compatible with the idea that any gradients we can actually get are going to incentivize myopic solutions.

Similar comments apply to markets vs firms.