Basics of Human Reinforcement

Today: some more concepts from reinforcement learning and some discussion on their applicability to human behavior.

For example: most humans do things even when they seem unlikely to result in delicious sugar water. Is this a violation of behaviorist principles?

No. For one thing, yesterday’s post included a description of secondary reinforcers, those reinforcers which are not hard-coded evolutionary goods like food and sex, but which nevertheless have a conditioned association with good things. Money is the classic case of a secondary reinforcer among humans. Little colored rectangles are not naturally reinforcing, but from a very young age most humans learn that they can be used to buy pleasant things, like candy or toys or friends. Behaviorist-inspired experiments on humans often use money as a reward, and have yet to run into many experimental subjects whom it fails to motivate[1].

Speaking of friends, status may be a primary reinforcer specific to social animals. I don’t know if being able to literally feel reinforcement going on is a real thing, but I maintain I can feel the rush of reward when someone gives me a compliment. If that’s too unscientific for you, consider studies in which monkeys will “exchange” sugary juice for the opportunity to look at pictures of high-status monkeys, but demand extra juice in exchange for looking at pictures of low-status monkeys.

Although certain cynics might consider money and status an exhaustive list, we may also add moral, aesthetic, and value-based considerations. Evolutionary psychology explains why these might exist, and Bandura called some of them “internal reinforcement”.

But more complicated reinforcers alone are not sufficient to bridge the gap between lever-pushing pigeons and human behavior. Humans have an ability to select for or against behaviors without trying them. For example: most of us would avoid going up to Mr. T and giving him the finger. But most of us have not personally tried this behavior and observed the consequences.

Is this the result of pure reason? No; the rational part of our mind is the part telling us that Mr. T is probably sixty years old by now and far too deep in the media spotlight to want to risk a scandal and jail time by beating up a random stranger. So where exactly is the reluctance coming from?


Roko wrote in his post Ugh Fields that “your brain propagates psychological pain back to the earliest reliable stimulus for the punishment”. This deserves more investigation.

Suppose you did go into a bar one night, see Mr. T, give him the finger, and get beaten up. What behavior would you avoid in the future based on this experience? The event itself does not immediately provide enough information to distinguish among “don’t go into bars”, “don’t go out at night”, “don’t interact with people who have facial hair”, and the correct answer “don’t offend scary-looking people”. This information has to come from your pre-existing model of reality, your brain’s evolved background assumptions, and some clever guesswork.

Let’s get back to the hilariously unethical experiments. Little Albert was an eight-month-old child who briefly starred in an experiment by behaviorist John Watson. Watson showed him a fuzzy white rat. Albert seemed to like the rat well enough. Having confirmed that Albert liked the rat, Watson showed him the rat again, but this time also played a very loud and scary noise; he repeated this intervention until, as expected, Albert was terrified of the white rat.

But it wasn’t just fuzzy white rats Albert didn’t like. Further investigation determined that Albert was also afraid of brown rabbits (fuzzy animal) and Santa Claus (fuzzy white beard). With his incipient powers of categorization, he had learned to associate punishment with a broad category of things vaguely clustered around fuzzy white rats.
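One way to picture this kind of category-level feedback is to let fear spread from the conditioned stimulus to other stimuli in proportion to the features they share. The sketch below is only an illustration: the feature sets, the `similarity` measure (Jaccard overlap), and the `generalized_fear` helper are all assumptions made up for this example, not anything Watson measured.

```python
# Hypothetical feature sets; the choice of features is an assumption for illustration.
STIMULI = {
    "white rat": {"fuzzy", "white", "animal", "small"},
    "brown rabbit": {"fuzzy", "brown", "animal", "small"},
    "Santa Claus beard": {"fuzzy", "white"},
    "wooden block": {"hard", "brown", "small"},
}


def similarity(a, b):
    """Jaccard overlap between two feature sets (0 = nothing shared, 1 = identical)."""
    return len(a & b) / len(a | b)


def generalized_fear(conditioned, fear_level=1.0):
    """Spread the fear attached to one stimulus to the others by shared features."""
    source = STIMULI[conditioned]
    return {
        name: fear_level * similarity(source, features)
        for name, features in STIMULI.items()
    }


fear = generalized_fear("white rat")
# Print stimuli from most to least feared.
for name, level in sorted(fear.items(), key=lambda item: -item[1]):
    print(f"{name:20s} {level:.2f}")
```

Under these made-up features, the rabbit and the beard inherit a good share of the rat’s fear, while the wooden block, which shares only “small”, inherits very little.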

B.F. Skinner had an even more interesting experiment that showed what happened when feedback of consequences went wrong. He put pigeons in a box that gave them rewards randomly. The pigeons ended up developing what he called “superstitions”; if a reward arrived by coincidence when a pigeon was tilting its head in a certain direction, the pigeon would continue tilting its head in that direction in the hope of gaining more rewards; when the reward randomly arrived, the pigeon took this as “justification” of its head-tilting and head-tilted even more[2].
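The superstition loop can be sketched as a toy simulation. Everything here is an assumption chosen for illustration (the list of behaviors, the reward probability, the size of the reinforcement boost); it is not a model of Skinner’s actual apparatus. The key property is that reward arrives independently of what the simulated pigeon is doing, yet whatever it happened to be doing gets strengthened.

```python
import random


def superstition(n_steps=2000, reward_prob=0.1, boost=2.0, seed=0):
    """Reinforce whichever behavior happens to coincide with a random reward.

    Returns the final probability of each behavior. All parameters are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    behaviors = ["tilt head", "turn left", "peck floor", "flap wings"]
    weights = {b: 1.0 for b in behaviors}  # start with no preference

    for _ in range(n_steps):
        # The pigeon picks a behavior in proportion to its current weight.
        action = rng.choices(behaviors, weights=[weights[b] for b in behaviors])[0]
        # Reward is delivered at random and ignores the action entirely.
        if rng.random() < reward_prob:
            weights[action] += boost

    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


probs = superstition()
for behavior, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{behavior:12s} {p:.2f}")
```

Because early coincidences make a behavior more frequent, and more frequent behaviors coincide with more rewards, runs of this sketch tend to drift toward uneven preferences even though no behavior actually causes reward. Different seeds will generally produce different favorites.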

This provides one piece of the puzzle in the Mr. T question. None of us have ever given Mr. T the finger before. But we may have offended scary-looking people and had bad things happen to us, which our brains correctly generalize to “don’t offend scary-looking people”.


Or maybe not. Maybe you’ve never offended a scary-looking person before. What then?

Social learning theory is held up as opposed to behaviorism a lot, but it seems more like a natural extension of it. Humans and animals learn behaviors not just by being rewarded or punished themselves, but by observing whether a behavior is rewarded or punished in others.

Even if we ourselves have never offended scary-looking people, we have seen other people do so, or heard stories about people doing so, or watched people do so on TV.

At this point I have to mention my favorite social learning story ever, which also illustrates the pitfalls of trying to feed consequences back to their proximal cause. There has been some hand-wringing lately about children’s TV shows and whether they lead to developmental problems in children. A study by Ostrov and Gentile cited in NurtureShock found the expected correlation between violent TV shows and physical aggression, but also found an even stronger correlation between educational TV shows and so-called “relational aggression”: things like bullying, name-calling, and deliberate ostracism. The shows most strongly correlated with bad behavior were heart-warming educational programs intended to teach morality. Why?

The researchers theorize that the structure of these shows often involves a child committing an immoral action and looking cool and strong, and then, at the end of the show, eventually getting a comeuppance (think Harry Potter, where evil character Draco Malfoy is the coolest and most popular kid in Hogwarts and usually gets away with it, whereas supposedly sympathetic character Ron Weasley is at best a lovable loser who spends most of his time as the butt of Draco’s jokes). The theory is that children are just not good enough at the whole feedback of consequences thing to realize that the bully’s comeuppance in the end is supposed to be the inevitable result of their evil ways. All they see is someone being a bully and then being treated as obviously popular and high-status.

Behavior is selection by consequences, and status is a strong reinforcer. If children see other children behaving as bullies and having high status, then all else being equal, they will be more likely to behave as bullies.

These two phenomena—feedback to categories and social learning—go part of the way to explaining the original question of how people have strong preferences for or against behaviors they’ve never tried before.


The phrase “internal reinforcement” would make good behaviorists cringe, seeing as it takes a perfectly good predictive model of behavior and tries to pin it on invisible mental phenomena.

But all reinforcement has to be at least a little internal; an animal wouldn’t know that eating food was good and eating rocks was bad unless some internal structure knew to reinforce food-eating behavior but not rock-eating behavior. Some reinforcement seems even more internal than that; people may continue an activity solely because it makes them feel good about themselves.

This is not any more mysterious than eating behavior—the drive for food and the drive for status as measured in self-esteem are both perfectly legitimate biological drives, and it’s not surprising that we have structures for noticing when we satisfy them and reinforcing the behavior involved—but it sure does sound less scientific.


Much to the chagrin of behaviorists, all these mechanisms are still not sufficient to completely explain human behavior. Some cases—for example a patient who quits an enjoyable smoking habit because the doctor says it will cause cancer—may not fit any of these patterns. The patient may not previously have encountered any problems, personally or vicariously, with smoking or anything sufficiently similar to smoking to justify generalization, and positing internal reinforcement just moves the problem to another level.

Daniel Dennett speaks of

a sort of inner environment, in which tryouts can be safely executed—an inner something-or-other structured in such a way that the surrogate actions it favors are more often than not the very actions the real world would also bless, if they were actually performed. In short, the inner environment, whatever it is, must contain lots of information about the outer environment and its regularities. Nothing else (except magic) could provide preselection worth having.

There is some evidence for this sort of thing in certain cases: in experiments on fictive reinforcement, people who stayed out of a simulated rising stock market, thus breaking even when they could have won a lot of money, were found on MRI to have a reinforcement signal almost as if they were simulating the case in which they had entered the stock market and been reinforced for doing so.

But overall this idea involves too much magic and doesn’t correspond to the way we really make decisions, either as perceived intuitively or as detected by most experiments. It also doesn’t explain why we’re so bad at being motivated by this sort of reinforcement: for example, since I know that heroin is really really enjoyable, why can’t I become addicted to heroin just by thinking about it? And how come the overwhelming majority of patients don’t quit smoking when their doctor tells them to do so, but people often do quit smoking after they’ve personally experienced the negative consequences (e.g., had their first heart attack)?

I am more favorable to the idea of a neural net model in which medical advice can forge a weak connection between the “smoking” pattern and the “cancer” pattern through cognition alone, separate from reinforcement processes but allowing such processes to propagate down it. Not a whole lot of motivational force can travel down such a weak link, blocking it from being effective against a strong desire to keep smoking. But I’ve got to admit that’s a wild guess.

The important point, though, is that just as utility theory posits not just utility but expected utility, reinforcement learning posits not just reward but expected reward. Many processes by which we compute expected reward remain vague. Others have been explored in some detail. The next two posts will make up for the vagueness of this one by discussing some properties of the expected reward function.
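To make “expected reward” slightly less vague, here is a minimal sketch of one standard way to track it: a simple delta rule in the style of Rescorla-Wagner, where the estimate moves a fraction of the way toward each reward actually received. This is a textbook model swapped in for illustration, not something this post specifies; the function name `update_expected_reward` and the learning rate of 0.5 are assumptions.

```python
def update_expected_reward(expected, reward, learning_rate=0.5):
    """Move the expectation part of the way toward the reward just observed.

    (reward - expected) is the prediction error: positive when things went
    better than expected, negative when they went worse.
    """
    return expected + learning_rate * (reward - expected)


expected = 0.0
history = []
# Four rewarded trials in a row: the expectation climbs toward 1.0.
for reward in [1.0, 1.0, 1.0, 1.0]:
    expected = update_expected_reward(expected, reward)
    history.append(expected)

print(history)  # [0.5, 0.75, 0.875, 0.9375]

# One unrewarded trial: the expectation drops, but is not wiped out.
after_omission = update_expected_reward(expected, 0.0)
print(after_omission)  # 0.46875
```

The point of the sketch is only that behavior can be driven by a running estimate rather than by the most recent reward alone; how real brains compute that estimate is exactly the part that remains vague.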


1. Humans are not the only species that can become attracted to secondary reinforcers; monkeys have been successfully trained to use currency.

2. You can see the same effect at work in human athletes. If a certain behavior correlates with a winning streak, they will continue that behavior no matter how unlikely a causal link. But these athletes are curiosities precisely because people are so good at feeding back consequences to the correct stimulus.