Some Technical Problems With Measuring Happiness (and some solutions)

[Epistemic status: Probably true, but very incomplete. I’m just some guy who spent a few days reading studies and literature reviews on this subject. This is just a short look at some of the technical (i.e. non-fundamental; could be solved with better techniques) issues that are encountered in the process of measuring happiness.]

Terms and Definitions Apply

The tradition in scientific literature about happiness seems to be to start by mentioning how much we humans care about happiness. Far be it from me to break tradition, so: obviously, we care a lot about happiness. It’s the thing most people spend most of their lives trying to attain (either for themselves or for others).

Um … how, exactly, do we know we have it, though? It’s not like you can just grab a Happiness Thermometer, stick it up your nose into your brain, and measure how happy you are.

Worse yet, people don’t even seem to agree on what happiness is. Aristotle calls it

“activity of the soul in conformity with excellence or virtue”,

while the psychologist Kahneman defines it as

“what I experience here and now”.

Modern happiness research by social scientists seeks to incorporate both definitions as aspects of a broader concept of welfare. Much effort has gone into differentiating the two; Huta & Waterman (2014) is the prime example, a valiant attempt at classifying both terms. I will be drawing heavily from this study throughout this post, starting with these definitions:

Aristotle’s concept is called eudaimonia, the Greek term he himself used; “good spirit” is a strict translation of the word, but more broadly it means “flourishing”. The core concepts are virtue, identity, meaning and self-actualization. “Psychological well-being” is a synonym often used in the literature.

Kahneman’s concept, to keep the Greek vibe going, is called hedonia. Hedonia is about good feelings: pleasure and happiness are core concepts. Many authors add comfort and/or absence of distress. “Subjective well-being” is the corresponding synonym.

There’s another important distinction here, between state levels and trait levels.

Someone’s state level of hedonia or eudaimonia measures the degree of the relevant concept in a specific moment (how happy you are right now; how virtuous you have been today) while their trait level is their degree of hedonia or eudaimonia throughout their life, like an inherent trait of that person.

Having completed the ritual of defining the things we’re speaking about (maybe it would’ve been better to just Taboo the words), we can finally get on to the meat. How do we measure this stuff?

Eudaimonia and hedonia are both typically measured by self-report surveys. People are given a survey with statements like “I feel that life is very rewarding” or “I am very happy” and asked to rate how much they agree with each statement on a scale from 1 to X (X usually being 5, 6 or 7). The researchers then take the average of those numbers: there’s your happiness score!
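
To make the scoring step concrete, here’s a minimal sketch in Python. The function name `happiness_score` and the example answers are made up for illustration, and I’m assuming a 1-7 agreement scale; real questionnaires each come with their own scoring rules.

```python
def happiness_score(ratings, scale_min=1, scale_max=7):
    """Average a respondent's item ratings after checking they fit the scale."""
    for r in ratings:
        if not scale_min <= r <= scale_max:
            raise ValueError(f"rating {r} is outside {scale_min}-{scale_max}")
    return sum(ratings) / len(ratings)


# Hypothetical answers to four agreement items on a 1-7 scale.
answers = [6, 5, 7, 6]
print(happiness_score(answers))  # 6.0
```

The range check is there because a stray 0 or 8 on a 1-7 scale would quietly distort the average.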

Problem #1: Poor Memory

A Famous Number

This self-report method has been used to produce findings like Kahneman & Deaton (2010)’s $75000 figure (if you haven’t heard of it, don’t worry: countless pop science articles will be happy to explain how “The Price of Happiness [is] $75000”).

Of course, all those articles’ headlines are a bit misleading. Why? Because they leave out the other finding that was also in the title of the study: “High income improves evaluation of life but not emotional well-being”.

According to the study, emotional well-being (state level hedonia, as measured by questions like “did you feel X emotion yesterday?”) stayed constant above $75000. That wasn’t the only variable being measured, though. They measured life satisfaction (a trait-level correlate of both hedonia and eudaimonia) with one simple question:

Please imagine a ladder with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you, and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?

Notice how the “Ladder” line, which represents life satisfaction, keeps rising with log income after $80000. The “Positive affect”, “Not blue”, and “Stress free” lines, all different measures of state hedonia, do not.

So over $75000/year, people continue to get more life satisfaction, but not more experienced well being / state hedonia. That makes sense, right? I can easily think of an explanation: maybe after $75000, we run out of things we really care about spending money on, and start throwing it at frivolous luxuries that don’t actually make us happier, but our life satisfaction keeps rising because we’re comparing our lives to others’. The authors mention a similar explanation:

Perhaps $75,000 is a threshold beyond which further increases in income no longer improve individuals’ ability to do what matters most to their emotional well-being, such as spending time with people they like, avoiding pain and disease, and enjoying leisure.

We’ve got our empirical findings and a mechanical explanation for them. Case closed, right? Wrong.

Hi there! How are you feeling right now?

So what’s the problem here? Well, the study’s authors were a bit out of date. See, Deaton and Kahneman neglected to use the hottest new tech in happiness measurement: experience sampling.

Experience sampling is when instead of asking people to report on how they were feeling yesterday or last week, you prompt them with surveys throughout the day to ask them how they’re feeling at that exact moment. This was made possible by the rise of smartphones, which people are carrying with them at every point of the day and always checking. For the researchers, it’s as easy as sending an automated text message a few times per day!^[1]
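
A rough sketch of what the scheduling side could look like, assuming prompts are drawn uniformly at random within waking hours. The function name, the 9:00 to 21:00 window and the three-prompts-a-day setting are all my own placeholders, not details from any particular study, and actually sending the message is left out.

```python
import random
from datetime import datetime, timedelta


def sample_prompt_times(day_start, day_end, n_prompts, seed=None):
    """Pick n_prompts distinct random moments in [day_start, day_end), sorted."""
    rng = random.Random(seed)
    window_seconds = int((day_end - day_start).total_seconds())
    offsets = sorted(rng.sample(range(window_seconds), n_prompts))
    return [day_start + timedelta(seconds=s) for s in offsets]


# Hypothetical study day: three prompts between 9:00 and 21:00.
start = datetime(2022, 2, 6, 9, 0)
end = datetime(2022, 2, 6, 21, 0)
for t in sample_prompt_times(start, end, n_prompts=3, seed=0):
    print(t.strftime("%H:%M"), "How are you feeling right now?")
```

Random timing matters here: if the prompt always arrived at noon, you would mostly be measuring how people feel about lunch.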

Aside from more thoroughly annoying your test subjects, though, what’s the point of this new method? Well, human memory is notoriously unreliable. We’ve known that ever since Elizabeth Loftus’ countless memory experiments. So there’s no reason to expect our memory of how we were feeling yesterday to be any better.

Indeed, there are plenty of studies showing that our memory of emotions is quite fallible. There’s the peak-end rule, first discovered in Kahneman et al. (1993), which states that when remembering an experience, we only really judge its valence on two data points: the peak (the most intense point) and the end.^[2]

For example, in the Kahneman et al. (1993) study I just mentioned, people judged sticking their hand in painfully cold water for 1 minute and then keeping it in slightly less cold water for another thirty seconds to be less unpleasant than doing only the 1 minute in ice-cold water. It’s the same experience! They just tacked on another painful bit at the end! If anything, people should be judging the longer ice bath as worse. The peak-end rule is a form of duration neglect: we don’t really care about how long the experience lasted, just how good or bad it was at certain points.
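
The mismatch is easy to see with toy numbers. In the sketch below, the discomfort ratings (8 and 6 on a 0-10 scale) are invented for illustration and are not data from the study, and I’m modelling the peak-end rule as the plain average of the peak and the final moment, which is one common simplification.

```python
def total_discomfort(episode):
    """Sum of intensity x duration over (intensity, seconds) segments."""
    return sum(intensity * seconds for intensity, seconds in episode)


def peak_end_score(episode):
    """Average of the most intense moment and the final moment."""
    peak = max(intensity for intensity, _ in episode)
    end = episode[-1][0]
    return (peak + end) / 2


# Hypothetical discomfort ratings on a 0-10 scale.
short_trial = [(8, 60)]           # 60 s in very cold water
long_trial = [(8, 60), (6, 30)]   # same, plus 30 s in slightly less cold water

print(total_discomfort(short_trial), total_discomfort(long_trial))  # 480 660
print(peak_end_score(short_trial), peak_end_score(long_trial))      # 8.0 7.0
```

Counted by total discomfort, the longer trial is clearly worse (660 vs. 480), yet its peak-end score is lower (7.0 vs. 8.0), which is the direction people’s remembered ratings went.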

Another one: Breckler (1994) asked blood donors 2, 10, or 49 days after their donation to remember how they felt donating blood. They then compared the memories to surveys taken at the time of donation by some of the subjects. Not only did they find that people remembered their emotions differently depending on their mood while taking the survey, but also (very important!) that donors remembered more anxiety the more time had passed since the donation.

That is why we do experience sampling. If you ask people how they feel right that second, there’s no time for the memory goblins to have gotten into their brains and messed up their memories yet. The emotions are still there.

So predictably, when someone [Killingsworth (2021)] tried to replicate the $75000 study using experience sampling (and a few other design improvements like including more rich people), the results for experienced well-being, aka state hedonia, were quite different.

Line go up means world more happier!

This graph clearly shows that even after $75000, experienced well-being continues to rise with log income—that is, doubling your income will increase your happiness by roughly the same amount each time. That makes sense if you consider diminishing marginal utility—your 75001st dollar will increase your happiness a lot less than your 101st, but it isn’t completely useless either.
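
Here’s the arithmetic behind that, as a small sketch. It assumes a log-linear model, well-being = intercept + slope × ln(income), with placeholder coefficients (intercept 0, slope 1) that are not estimates from the paper.

```python
import math


def predicted_well_being(income, intercept=0.0, slope=1.0):
    """Log-linear model: well-being = intercept + slope * ln(income)."""
    return intercept + slope * math.log(income)


# Every doubling adds the same amount: slope * ln(2).
for income in (25_000, 50_000, 100_000, 200_000):
    gain = predicted_well_being(2 * income) - predicted_well_being(income)
    print(income, round(gain, 4))

# One extra dollar matters far more at $100 than at $75000, but never zero.
gain_at_100 = predicted_well_being(101) - predicted_well_being(100)
gain_at_75000 = predicted_well_being(75_001) - predicted_well_being(75_000)
print(gain_at_100 > gain_at_75000 > 0)  # True
```

Under this model the gain from a doubling depends only on the slope, not on where you start, which is what a straight line against log income means.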

There we are. We’ve found an elegant solution for our memory problems: cut out the memory middleman and just ask people how they’re feeling right now.

(An alternative is the Day Reconstruction Method: asking people to write all their experiences of the day in a diary, clearly define the duration of each experience, then evaluate how they felt during each experience. Kahneman, Krueger, Schkade, Schwarz and Stone (2004) show that this method produces basically the same results as experience sampling, implying that it is at least nearly as good; maybe even better. The authors say that this method works because writing down the experiences makes the test subjects relive their feelings in the process of remembering.)
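
For the curious, a day’s worth of diary entries might be boiled down to a single number like this. Weighting each episode by its duration is an assumption on my part about how to aggregate, and the diary entries and the 0-6 affect scale are hypothetical.

```python
def duration_weighted_affect(episodes):
    """Mean affect across the day, weighting each episode by its minutes."""
    total_minutes = sum(minutes for _, minutes, _ in episodes)
    weighted = sum(minutes * affect for _, minutes, affect in episodes)
    return weighted / total_minutes


# Hypothetical diary: (activity, minutes, affect rating on a 0-6 scale).
diary = [
    ("commute", 60, 2),
    ("work", 180, 3),
    ("dinner with friends", 60, 5),
]
print(duration_weighted_affect(diary))  # 3.2
```

Unlike the peak-end rule, this gives a three-hour slog three times the weight of a one-hour dinner, so duration is not neglected.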

Solution #1: Experience Sampling (or Day Reconstruction)

Problem #2a: Asymmetry

Four ~~Humours~~ ~~Elements~~ Categories

Huta & Waterman say that, in happiness research, there are four categories of analysis:

(a) orientations: orientations, values, motives, and goals
(b) behaviors: behavioral content and activity characteristics
(c) experiences: subjective experiences, emotions, and cognitive appraisals
(d) functioning: indices of positive psychological functioning, mental health, and flourishing

A hedonic orientation might be something like “I really value having fun with my friends”, while a eudaimonic orientation could be “I want to focus on self-realization”.

An example of a hedonic behavior is eating ice cream with ALL THE TOPPINGS; reading classic literature is a quintessential eudaimonic behavior. Behavior is unique among these four because it can be reliably measured by something besides self-report: other-report. Just ask someone’s parents, friends or spouse what they enjoy doing and bam, you’ve got data!

Feeling the tastes of all the strawberry ice cream and chocolate sprinkles and caramel sauce mix on your tongue is a hedonic experience; untangling the metaphor in La Peste or the satire of Max Havelaar is a eudaimonic experience.

Finally, there’s functioning. Self-acceptance is an important kind of eudaimonic functioning; as for hedonic functioning, Huta & Waterman say:

[N]o measures or theory have been proposed regarding hedonia in the functioning category of analysis. It would be worth contemplating whether there is such a thing as positive hedonic functioning, and what such functioning would look like, or whether the entire notion of positive functioning falls under the concept of eudaimonia.

An imbalance in the Force

Huta & Waterman identify an asymmetry here: often within the same study, hedonia will be assessed in terms of cognitive-affective experiences, while eudaimonia is assessed in terms of positive mental functioning or orientations.

[This asymmetry originates from the philosophical distinction between eudaimonia and hedonia; note that Aristotle’s definition of eudaimonia refers to “activity of the soul”, aka functioning rather than experience, while Pigou defines hedonia in terms of “states of consciousness”—experience, not functioning.]

This makes these concepts harder to study because studies that incorporate this asymmetry might just be measuring differences between positive cognitive-affective experiences and positive mental functioning, rather than differences between hedonia and eudaimonia.

And here’s the authors’ explanation of another problem with the asymmetry:

Asymmetrical treatment of eudaimonia and hedonia also makes it difficult to directly compare the empirical relationships they have with predictors or outcomes. For example, when studying the links between various predictors and well-being outcomes, and when the aim is to include both eudaimonic and hedonic conceptions of well-being outcomes, researchers often operationalize eudaimonia using Ryff’s Scales of Psychological Well-Being (Ryff 1989), which assess eudaimonia in terms of qualities associated with positive psychological functioning; on the other hand, studies often operationalize hedonia using one or more component(s) of subjective well-being (Diener 2000)—positive affect, negative affect, and/or life satisfaction.

The example they use is an optimism intervention: say you give your test subjects some self-help books and then make them do team-building exercises while shouting inspirational quotes at them. Then you measure their well-being, and it seems like hedonia improved more than eudaimonia.

Now, if you assessed both terms by e.g. experience, there’s no problem here—go publish your results in a journal, watch all the business magazines run articles on it, and soon enough every corporation in America will be running your optimism interventions. Good job!

But if you assess hedonia in terms of experience and eudaimonia in terms of functioning, then—well, realistically, the business magazines and corporations will run with it anyway—but the ghost of Thomas Bayes will be very disappointed in you, because you can’t actually be sure that this is a causal effect. There are quite a few possible confounders here; for example, maybe your cognitive state just changes more easily than your psychological functioning!

Solution #2a: Use the same analysis categories when comparing things

Problem #2b: Unclear Definitions

Who’s that entering the arena? Oh my god, it’s Semantics with a steel chair!

You believed me earlier when I said we were done with definitions? HA! We are never done with definitions.

In fact, the definitions I posted at the beginning of this post were synthesized by Huta & Waterman in this very same paper. They took a bunch of studies on well-being and looked at the elements included in the definitions in each study. Here’s the table.

It’s clear that while there are some concepts everyone considers to be core to eudaimonia and hedonia, there’s also a fair amount of disagreement in the literature. (And some researchers just want to see the world burn; what was Vittersø smoking when he wrote his definition of hedonia?)

This is a severe impediment to happiness research, especially if researchers are unaware that others are using different definitions. Quote:

The most striking example involves the correlation between measures of eudaimonia and hedonia, as shown in Table 1. When focusing on the trait level—i.e., a person’s typical or average degrees of eudaimonia and hedonia—the correlation has ranged from as low as .0 to as high as .6. When focusing on the state level—i.e., a particular point in time, a given span of time, or a particular type of activity—the correlation between eudaimonia and hedonia has ranged even more widely, from −.3 to .8.

Those are huge differences in correlation, even for social science! There are clearly big differences in our results depending on the definitions we use.

Looking at the table, we can see that the strongest links between eudaimonia and hedonia are when they are defined as experiences: “feeling engaged (measured as a common factor)” and “feeling pleasure (measured as a common factor)” are correlated at .6; “eudaimonic well-being” and life satisfaction at .5. It seems that one contributes to the other—or perhaps eudaimonic and hedonic experience partially overlap, with people placing them into the same bucket of “good feelings”.

The weakest correlations, meanwhile, are at the behavior level of definition. Eudaimonic behavior and hedonic behavior only correlate at .1. This passes the sniff test, too—the common image of a hedonist is not necessarily that of someone who behaves virtuously or puts effort into self-realization.

Orientation is in-between. Eudaimonic constitutive goals & hedonic instrumental goals correlate at .4; same for eudaimonic & hedonic motives. However, eudaimonic and hedonic orientations to happiness only correlate at .2, and there’s no connection at all between “intrinsic aspirations” and hedonic aspirations.

Clearly, we need to get all the researchers on the same page. Huta & Waterman have taken a great first step in this regard by making the distinctions clear and classifying all of the core components of hedonia and eudaimonia. Given the 358 citations on this paper, it seems as if most of the field has accepted these norms.

Solution #2b: Get everyone to use the same definitions

Problem #3: Response Style

You got a ⁵⁄₁₀ on your Happiness exam, try harder next time.

Have you ever tried comparing school grades with someone from a different country? It’s not simple. Tell your American friend that you got a ¹⁷⁄₂₀, they’ll just look at you confused and wonder what you’re so excited about. Meanwhile, when a Belgian student hears of someone who got straight A’s, their jaw will drop—that’s almost unheard of here. It’s not that the Belgian school system is harder, or that American kids are smarter; they’re just using different scales.

Two economists, Jorge Alvarez and Fernanda Marquez-Padilla, wondered if people in countries with different grading systems have different response styles. For example, if you’re in the Philippines where 75% is the passing grade, you might report your life satisfaction at ⁸⁄₁₀ despite feeling meh about your life—after all, that’s barely above passing, right? A Finnish person on the other hand might rate their life a ⁵⁄₁₀ even if they’re quite happy, because for them, a ¹⁄₅ is still a pass.^[3]

Marquez-Padilla & Alvarez (2018) is the study where they test this. The results are pretty much exactly what the hypothesis predicted: pass-fail threshold (PFT) was significantly correlated with answers to questions that required a numerical assessment, but not at all with answers to categorical questions (e.g. “have you ever felt on top of the world”, “have you ever felt depressed or very unhappy”). When they run a regression to correct for this bias, the “imputed happiness” (i.e. happiness after correction) is more strongly correlated with log income than reported happiness is.
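
To give a feel for what such a correction does, here’s a deliberately bare-bones sketch: fit a straight line of reported happiness on PFT, then shift every score to the value the line implies at a common reference threshold. This is not the paper’s actual specification (no controls, one predictor), and the numbers are made up.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def impute_scores(reported, thresholds, reference_threshold):
    """Remove the fitted PFT effect, re-expressing scores at one threshold."""
    slope = ols_slope(thresholds, reported)
    return [
        y - slope * (t - reference_threshold)
        for y, t in zip(reported, thresholds)
    ]


# Invented country-level numbers, for illustration only:
# PFT as a fraction of the top grade, mean reported happiness on 0-10.
thresholds = [0.40, 0.50, 0.60, 0.75]
reported = [6.0, 7.0, 6.5, 8.0]

imputed = impute_scores(reported, thresholds, reference_threshold=0.50)
print([round(score, 2) for score in imputed])
```

After the shift, the imputed scores no longer trend with PFT, and a country already at the reference threshold keeps its reported score.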

This suggests a relatively straightforward solution: correct for grading bias with a simple regression. And indeed, that would solve this specific instance of response style bias. But there’s more!

Krueger, Kahneman, Fischler et al. (2009) says that French people use the far ends of the scale less frequently than Americans do, thus biasing the results so it looks like Americans are happier. And according to Meisenberg & Williams (2008), the less intelligent someone is, the more they exhibit the opposite bias—choosing the 1 and 7 more often on a 1-7 scale, for example.

This bias is called extreme response style (ERS) and is found basically everywhere. The Batchelor & Miao (2016) meta-analysis indicates further differences between races (Blacks and Hispanics exhibit more ERS, Whites and Asians less) and genders (women exhibit slightly more ERS than men), as well as a positive correlation with acquiescence, which is the tendency to say ‘yes’ to questions asked in surveys. (Acquiescent response set (ARS) is a confounder in its own right, and should be controlled for.)^[4]
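
One simple way to put a number on ERS is the share of a respondent’s answers that land on either endpoint of the scale. The sketch below uses that definition; the function name and both response patterns are hypothetical, and the studies cited here may operationalize ERS differently.

```python
def extreme_response_share(responses, scale_min=1, scale_max=7):
    """Fraction of answers that sit on either endpoint of the scale."""
    extremes = sum(1 for r in responses if r in (scale_min, scale_max))
    return extremes / len(responses)


# Hypothetical respondents answering the same eight 1-7 items.
moderate = [3, 4, 5, 4, 3, 5, 4, 4]
extreme = [1, 7, 7, 1, 7, 4, 1, 7]

print(extreme_response_share(moderate))  # 0.0
print(extreme_response_share(extreme))   # 0.875
```

A per-respondent index like this could then be added as a control, or compared across groups before comparing their happiness scores.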

There’s a whole array of other response styles/sets out there:

People respond in socially desirable ways; if they believe that happiness is normatively appropriate, they may report that they are happier than other types of assessments may indicate. Social desirability of happiness may also vary from country to country.

Evasiveness; maybe people of a certain social category or with a certain mindset are more likely to answer “Not Sure”, thus removing them from the data and giving a skewed picture.

Speed vs. accuracy; some people will take their time and think deeply about their answers, while others will fill in the first thing that comes to mind.

If you want to control for all of these and more, that’s a lot of effort. You can never be sure if there are more confounders out there, either, which is why some scientists don’t believe controlling for confounders can ever prove causality; there’s always one you’ve missed.

Luckily, we have another measure to help mitigate the damage: item direction balance. Instead of having all the “good” answers on one side of the scale and all the “bad” answers on the other, switch it up once in a while! This is most effective against acquiescence, because someone who tends to answer ‘yes’, or to answer on the high side of the scale, now inflates some items and deflates others instead of pushing every item the same way.
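
Here’s a small sketch of how balanced items get scored, assuming a 1-7 scale where negatively worded items are flipped (a 7 becomes a 1 and so on) before averaging. The item wording, the function name and the two respondents are invented for illustration.

```python
def score_balanced_scale(responses, reversed_items, scale_min=1, scale_max=7):
    """Average item scores after flipping the negatively worded items."""
    scored = [
        (scale_min + scale_max - r) if i in reversed_items else r
        for i, r in enumerate(responses)
    ]
    return sum(scored) / len(scored)


# Hypothetical four-item scale: items 1 and 3 are negatively worded
# (e.g. "I rarely feel cheerful"), so agreeing with them means LESS happiness.
reversed_items = {1, 3}

yea_sayer = [7, 7, 7, 7]   # agrees strongly with everything
print(score_balanced_scale(yea_sayer, reversed_items))  # 4.0

happy = [7, 1, 6, 2]       # agrees with positive items, disagrees with negative
print(score_balanced_scale(happy, reversed_items))      # 6.5
```

The indiscriminate yes-sayer ends up at the scale midpoint instead of at the top, while the genuinely happy respondent still scores high.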

Solution #3: Controls and Direction Balance

Conclusion

There you have it: three problems and three solutions! To summarize:

Imperfect memory effects can be solved by using experience sampling or day reconstruction.
Asymmetry and unclear definitions can be solved by getting all the researchers on the same page and making them use the same definitions.
Response style differences can be solved by directional balancing of questions and ungodly amounts of controls.

This is a very long post, even after I cut out the aspiration treadmill part (Kahneman’s thoughts about the aspiration treadmill, for those interested). And I haven’t even touched on any of the deeper critiques of the field yet! There are philosophical critiques, like Angner (2013), and sociological ones, like Frawley (2015). All worth reading, all beyond the scope of this post!

Happiness research is something certainly worth continuing, for the sake of everyone who would like to enjoy their life. Considering that happiness studies only really started in the 1980s with positive psychology, and that economics and sociology joined in the 2000s and 2010s respectively, there’s certainly a lot left to be discovered in this latest subject to move from the domain of philosophy to that of science.

Let’s put in the effort.

Sources

The ones I actually used

Batchelor, J., & Miao, C. (2016). Extreme Response Style: A Meta-Analysis. Journal of Organizational Psychology, 16. https://www.researchgate.net/publication/316820164_Extreme_Response_Style_A_Meta-Analysis

Breckler, S. J. (1994). Memory for the Experience of Donating Blood: Just How Bad Was It? Basic and Applied Social Psychology, 15(4), 467–488. https://doi.org/10.1207/s15324834basp1504_5

Huta, V., & Waterman, A. S. (2014). Eudaimonia and Its Distinction from Hedonia: Developing a Classification and Terminology for Understanding Conceptual and Operational Definitions. Journal of Happiness Studies, 15(6), 1425–1456. https://doi.org/10.1007/s10902-013-9485-0

Kahneman, D., Fredrickson, B. L., Schreiber, C. A., & Redelmeier, D. A. (1993). When More Pain Is Preferred to Less: Adding a Better End. Psychological Science, 4(6), 401–405. https://doi.org/10.1111/j.1467-9280.1993.tb00589.x

Kahneman, D., Krueger, A. B., Schkade, D. A., Schwarz, N., & Stone, A. A. (2004). A Survey Method for Characterizing Daily Life Experience: The Day Reconstruction Method. Science, 306(5702), 1776–1780. https://doi.org/10.1126/science.1103572

Kahneman, D., & Krueger, A. B. (2006). Developments in the Measurement of Subjective Well-Being. Journal of Economic Perspectives, 20(1), 3–24. https://doi.org/10.1257/089533006776526030

Kahneman, D., & Deaton, A. (2010). High income improves evaluation of life but not emotional well-being. Proceedings of the National Academy of Sciences, 107(38). https://doi.org/10.1073/pnas.1011492107

Killingsworth, M. A. (2021). Experienced well-being rises with income, even above $75,000 per year. Proceedings of the National Academy of Sciences, 118(4). https://doi.org/10.1073/pnas.2016976118

Krueger, A. B., Kahneman, D., Fischler, C., Schkade, D., Schwarz, N., & Stone, A. A. (2009). Time Use and Subjective Well-Being in France and the U.S. Social Indicators Research, 93(1), 7–18. https://doi.org/10.1007/s11205-008-9415-4

Larson, R., & Csikszentmihalyi, M. (2014). The Experience Sampling Method. Flow and the Foundations of Positive Psychology, 21–34. https://doi.org/10.1007/978-94-017-9088-8_2

Marquez-Padilla, F., & Alvarez, J. (2018). Grading happiness: what grading systems tell us about cross-country wellbeing comparisons. Economics Bulletin, 38. https://ideas.repec.org/a/ebl/ecbull/eb-18-00325.html

Meisenberg, G., & Williams, A. (2008). Are acquiescent and extreme response styles related to low intelligence and education? Personality and Individual Differences, 44(7), 1539–1550. https://doi.org/10.1016/j.paid.2008.01.010

Powdthavee, N. (2007). Economics of Happiness: A Review of Literature and Applications. Chulalongkorn Journal of Economics, 19. https://www.researchgate.net/publication/228373537_Economics_of_Happiness_A_Review_of_Literature_and_Applications

Ryff, C. D. (1989). Happiness is everything, or is it? Explorations on the meaning of psychological well-being. Journal of Personality and Social Psychology, 57(6). https://doi.org/10.1037/0022-3514.57.6.1069

Schneider, S. (2016). Extracting Response Style Bias From Measures of Positive and Negative Affect in Aging Research. The Journals of Gerontology Series B: Psychological Sciences and Social Sciences, gbw103. https://doi.org/10.1093/geronb/gbw103

Strijbosch, W., Mitas, O., van Gisbergen, M., Doicaru, M., Gelissen, J., & Bastiaansen, M. (2019). From Experience to Memory: On the Robustness of the Peak-and-End-Rule for Complex, Heterogeneous Experiences. Frontiers in Psychology, 10. https://doi.org/10.3389/fpsyg.2019.01705

Extra reading

Angner, E. (2013). Is it possible to measure happiness? European Journal for Philosophy of Science, 3(2), 221–240. https://doi.org/10.1007/s13194-013-0065-2

Binswanger, M. (2006). Why does income growth fail to make us happier? The Journal of Socio-Economics, 35(2), 366–381. https://doi.org/10.1016/j.socec.2005.11.040

Diener, E. (2000). Subjective well-being: The science of happiness and a proposal for a national index. American Psychologist, 55(1), 34–43. https://doi.org/10.1037/0003-066x.55.1.34

Frawley, A. (2015). Happiness Research: A Review of Critiques. Sociology Compass, 9(1), 62–77. https://doi.org/10.1111/soc4.12236

Jain, M., Sharma, G. D., & Mahendru, M. (2019). Can I Sustain My Happiness? A Review, Critique and Research Agenda for Economics of Happiness. Sustainability, 11(22). https://doi.org/10.3390/su11226375

OECD. (2013). OECD Guidelines on Measuring Subjective Well-being. Oecd-Ilibrary.Org. Retrieved February 6, 2022, from https://read.oecd-ilibrary.org/economics/oecd-guidelines-on-measuring-subjective-well-being_9789264191655-en#page10

^[1] The invention of the experience sampling method came earlier, though: Reed Larson & Mihaly Csikszentmihalyi were already talking about it in 1983. They mention using pagers to alert their subjects. Not really sure why this method only took off in the 2010s, then; maybe the smartphones were just that much more convenient?
^[2] It has been debated whether or not the peak-end rule can be generalized from simple experiments to real-world situations; Strijbosch et al. (2019) finds that average valence & arousal is a better predictor than peak-end in the case of more realistic, heterogeneous experiences. They also find that averages, compared to peak-end measures, are better predictors the more time has passed.
^[3] The Finnish grading system is actually analogous to the US letter grading system, but instead of A/B/C/D/F, the grades are 5/4/3/2/1. So a 1 will be between <40% and <60%, depending on the curve.
^[4] Age is an interesting one: researchers all agree it does something to ERS, but really seem to disagree on what it does. The meta-analysis I just mentioned says that ERS drops sharply with age, consistent with either a linear effect or a curvilinear effect where ERS rises until one’s early 20s, then drops. Schneider (2016), which also analyses the existing literature and adds its own study, says the likelihood of ERS increases significantly with age. I have no clue what to think about this, except that the decrease-over-age conclusion seems intuitively more correct; aren’t people supposed to moderate with time?