Using machine learning to predict romantic compatibility: empirical results

Overview

For many people, having a satisfying romantic relationship is one of the most important aspects of life. Over the past 10 years, online dating websites have gained traction, and they have access to large amounts of data that could be used to build predictive models to help people find such relationships. Such data is seldom public, but Columbia Business School professors Ray Fisman and Sheena Iyengar compiled a rich and relevant dataset for their paper “Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment.” Their main results were:

Women put greater weight on the intelligence and the race of partner, while men respond more to physical attractiveness. Moreover, men do not value women’s intelligence or ambition when it exceeds their own. Also, we find that women exhibit a preference for men who grew up in affluent neighborhoods. Finally, male selectivity is invariant to group size, while female selectivity is strongly increasing in group size.

I found the study through Andrew Gelman’s blog, where he wrote:

What I really want to do with these data is what I suggested to Ray and Sheena several years ago when they first told me about the study: a multilevel model that allows preferences to vary by person, not just by sex. Multilevel modeling would definitely be useful here, since you have something like 10 binary observations and 6 parameters to estimate for each person.

Several months ago I decided to pursue a career in data science, and with a view toward developing my skills, I built a model to predict when an individual participant will express interest in seeing a given partner again. Beyond the goal of learning, I hoped to contribute knowledge with the potential, however slight, to help people find satisfying romantic relationships.

It’s unlikely that what I did will have practical applications (as basic research seldom does), but I did learn a great deal, mostly about data science methodology in general, though also some about human behavior.

This is the first of a series of posts where I report my findings. A linear narrative would degenerate into a sprawling blog post of little interest to anybody but me. In this post, I’ll restrict focus to the question: how much predictive power can we get by estimating the generic selectivity and desirability of the people involved, without using information about the interactions between their traits?

I’ll ultimately go into the details of the methodology that I used, including discussion of statistical significance, the rationale for the decisions that I made, and links to relevant code, but here I’ll suppress technical detail, relegating it to separate blog posts that might be of interest to a more specialized audience. In several places I speculate as to the meaning of the results. I’ve made efforts to subject my reasoning to cross-checks, but have gotten almost no external feedback yet, and I’d welcome counter-considerations, alternative hypotheses, etc. I’m aware that there are places where claims I make don’t logically follow from what precedes them, and I’m not so much looking for examples of this in general as for instances where there’s a sizable probability that I’ve missed something that alters the bottom-line conclusions.

Context on the dataset

The dataset was derived from 21 round-robin speed dating events held for Columbia graduate students, each attended by between 6 and 22 participants of each gender. To avoid small sample size issues, I restricted consideration to the 9 events with 14 or more people. These consisted of ~2500 speed dates, involving a total of ~160 participants of each gender. The subset that I looked at contains events that the authors of the original study excluded from their own paper because they involved the experimental intervention of asking subjects to bring a book to the event. The effect size of the intervention was small enough that it doesn’t alter the bottom-line conclusions.

The dataset has a large number of features. I found that their predictive power was almost all contained in participants’ ratings of one another, which went beyond a “yes/no” decision to include ratings on dimensions such as attractiveness, sincerity, intelligence, fun, and ambition. For brevity I’ll refer to the person who made the decision as the “rater” and his or her partner as the “ratee.” The core features that I used were essentially:

  • The frequency with which the rater’s decision on other ratees was ‘yes.’

  • The frequency with which other raters’ decision on the ratee was ‘yes.’

  • Averages of ratings that others gave the rater and ratee.

The actual features that I used were slightly modified versions of these. I’ll give details in a future post.
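To make this concrete, here’s a minimal pandas sketch of how features of this kind can be computed. The column names (rater_id, ratee_id, decision, and the rating columns) are stand-ins of my own, not necessarily those of the released dataset, and the leave-one-out form is just one plausible version of the modifications alluded to above, not necessarily the one I actually used.

```python
import pandas as pd

# df: one row per speed date. Assumed columns (my naming, not
# necessarily the dataset's): rater_id, ratee_id, decision (0/1),
# and ratings attr, sinc, intel, fun, amb, like on a 1-10 scale.
def add_generic_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Frequency with which the rater said yes to *other* ratees
    # (leave-one-out, so a row isn't predicted from its own decision).
    g = out.groupby("rater_id")["decision"]
    out["rater_yes_freq"] = (g.transform("sum") - out["decision"]) / (
        g.transform("count") - 1
    )

    # Frequency with which *other* raters said yes to the ratee.
    g = out.groupby("ratee_id")["decision"]
    out["ratee_yes_freq"] = (g.transform("sum") - out["decision"]) / (
        g.transform("count") - 1
    )

    # Leave-one-out averages of the ratings others gave the ratee.
    for col in ["attr", "sinc", "intel", "fun", "amb", "like"]:
        g = out.groupby("ratee_id")[col]
        out[f"ratee_{col}_avg"] = (g.transform("sum") - out[col]) / (
            g.transform("count") - 1
        )

    return out
```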

Why ratings add incremental predictive power

In this blog post I’m restricting consideration to signals of the partners’ general selectivity and general desirability, without considering how their traits interact. First approximations to a participant’s desirability and selectivity come from the frequency with which members of the opposite sex expressed interest in seeing them again, and the frequency with which the participant expressed interest in seeing members of the opposite sex again.

If the dataset contained a sufficiently large number of dates for each participant, we could not improve on these. But the number of dates that each participant went on was small enough that the decision frequencies are noisy, and we can do better by supplementing them with other features. There’s a gender asymmetry to the situation: on average, men said yes 48.5% of the time and women said yes 33.5% of the time, which means that, relative to the base rate, the decision frequency metrics are noisier when the rater is a woman and the ratee is a man, so there’s more room for improvement when predicting women’s decisions than when predicting men’s.
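One way to make the asymmetry concrete (my gloss, not a calculation from the original analysis): compare the binomial standard error of an observed yes-frequency in absolute terms and relative to the base rate.

```python
import math

def yes_freq_noise(p: float, n: int) -> tuple[float, float]:
    """Standard error of an observed yes-frequency after n dates,
    in absolute terms and relative to the base rate p."""
    se = math.sqrt(p * (1 - p) / n)
    return se, se / p

# Roughly 15 dates per participant, consistent with the counts above.
for label, p in [("men deciding", 0.485), ("women deciding", 0.335)]:
    se, rel = yes_freq_noise(p, 15)
    print(f"{label}: SE = {se:.3f}, SE relative to base rate = {rel:.3f}")
# The absolute SEs are similar, but relative to how often yeses occur,
# observed frequencies of women's decisions are noticeably noisier.
```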

It’s in principle possible for the average of a single type of rating to carry more information than decision frequencies. This is because the ratings were given on a scale from 1 to 10, whereas decisions have only two possible values (yes and no). This means that ratings can (in principle) reflect desirability or lack thereof at a greater level of granularity than decisions. As an example of this, if a rater has impossibly high standards and rejects everyone, the rater’s decisions carry no information about the ratees, but the rater’s ratings of ratees still contain relevant information that we can use to better estimate ratee desirability.

The ratings that are most useful are those that correlate best with decisions. To this end, we examine correlations between average ratings and individual decisions. We also look at correlations between average ratings of each type and individual ratings of each type, to get a sense of the degree to which the ratings are independent, and the degree to which there tends to be a consensus as to whether somebody possesses a given trait. In the figures below, we abbreviate the different rating types owing to space considerations. The abbreviations are given by the following dictionary:

dec ---> the rater’s decision

like ---> how much a rater liked a ratee overall

attr ---> the ratee’s attractiveness

fun ---> how fun the ratee is

amb ---> the ratee’s ambition

intel ---> the ratee’s intelligence

sinc ---> the ratee’s sincerity
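For concreteness, here is a sketch of how such a matrix can be computed and drawn, reusing the assumed feature names from the earlier snippet (ratee_sex is likewise an assumed column, and df is the output of add_generic_features above):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Rows: the ratee's leave-one-out averages; columns: the individual
# rater's decision and ratings. Restricted to female ratees here.
ind_cols = ["decision", "like", "attr", "fun", "amb", "intel", "sinc"]
avg_cols = ["ratee_yes_freq"] + [f"ratee_{c}_avg" for c in ind_cols[1:]]

women_rated = df[df["ratee_sex"] == "F"]
corr = women_rated[avg_cols + ind_cols].corr().loc[avg_cols, ind_cols]

sns.heatmap(corr, cmap="Reds", vmin=0, vmax=1, annot=True, fmt=".2f")
plt.show()
```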

The matrix image below shows the correlations when the ratees are women and the raters are men. The rows of the figure correspond to the ratees’ average ratings, and the columns correspond to individual raters’ ratings:

As the scale on the right indicates, darker shades of red correspond to stronger correlations. Four things to highlight are:

  1. The correlations are positive: higher ratings on one dimension are always associated with higher ratings on all the others. Even the squares that might initially appear to be white are in fact very faintly red. So ratings of a given type contain information about ratings of other types.

  2. Consider the 5x5 square in the lower right, which corresponds to interrelations between attractiveness, fun, ambition, intelligence and sincerity. For each column, the square with the darkest shade of red is the square on the diagonal. This corresponds to the fact that given a rating type R, the average rating that’s most predictive of R is the average of R itself.

  3. Consider the leftmost column, which corresponds to individual men’s decisions. The darkest shades of red correspond to the average of other men’s decisions on a woman, the average of their ratings of how much they like her overall, and the average of their ratings of her attractiveness. Moreover, these three squares are essentially the same shade of red as one another, corresponding to the three averages having roughly equal predictive power.

  4. Consider the intersection between the top 3 rows and the rightmost 4 columns. The darkest shades of red appear in the “liking” row and the lightest shades of red appear in the “attractiveness” row. This corresponds to how much a man likes a woman generally reflecting a broader range of her characteristics than just physical attractiveness, and the same being true of his receptiveness to dating her, but to a lesser degree.

Each of these observations deserves comment:

  1. It seems implausible to me that each of the 10 distinct correlations between the five traits of attractiveness, fun, ambition, intelligence and sincerity is positive. The first thing that jumped to mind when I saw the correlation matrix was the Halo Effect: people’s positive perceptions of a person on one dimension tend to spill over and affect their perceptions of that person on all dimensions. In that literature, the role of attractiveness specifically has been highlighted. Later on we’ll see evidence that the halo effect is in fact a contributing factor.

    But one also can’t dismiss the possibility that the correlations between the ratings are partially driven by correlations between the underlying traits. As a concrete hypothetical, quality of early childhood nutrition could potentially impact all five dimensions.

    The weakest correlations between ratings are also small enough that one can imagine them reflecting genuine correlations between the underlying traits that are nonetheless too small for us to notice in our day-to-day experience.

    One can also imagine the ratings understating correlations between the different traits owing to anchoring biases: if two people are the same on one dimension D, and one of them is very high on another dimension D’, the one with very high D’ could be rated lower on D because the degree to which he or she possesses D looks small relative to the degree to which he or she possesses D’.

  2. In view of the Halo Effect, one could imagine that ratings of a given type are essentially noise, picking up only on the degree to which someone possesses other traits. One could also imagine a rating type being ill-defined on account of there being no consensus on what the word means.

    The fact that the average rating of how much someone possesses a given trait is the best predictor of individual ratings of that trait strongly suggests that the 5 different rating types are in fact picking up on 5 distinct traits. They might not be the traits that come to mind when we think of the words: for example, it could be that the distinct underlying trait that intelligence ratings are picking up on is “wears glasses.” But if men give a woman high ratings on intelligence and not sincerity (for example), it means something.

  3. At least two of the “decision,” “liking,” and “attractiveness” averages reflect different things, as one can see from the differences in how they correlate with other traits. But they correlate well with one another, and when it comes to using them to predict decisions, one gets a close approximation to the truth if one adopts the view that they’re essentially measures of the same thing.

  4. The “liking” average captures some of the predictive power of the fun, ambition, intelligence and sincerity ratings, but it reflects sincerity too much from the point of view of predicting decisions.

With (3) in mind, we obtain our first conclusion, one totally uncontroversial in some circles, though not all:

On average, of the five dimensions on which men rated women, the one that most drove men’s decisions on a woman is her attractiveness. The gap between the predictive power of attractiveness and the predictive power of the other traits is large. “Fun” is the closest to attractiveness in predictive power, but its predictive power may derive in part from attractive women being perceived as more fun.

Gender Differences

So far I’ve only discussed men’s preferences. The analysis above applies to women’s preferences nearly word for word: to a first approximation, the correlation matrix for female raters and the matrix for male raters are identical to one another.




How surprising this is to the reader will depend on his or her assumptions about the world. As a thought experiment, you might ask yourself: suppose that you had generated the two images without giving them headings, and had only examined one of them. If you later came across the other on your computer without remembering which it was, how confident would you be that it was the one that you had already seen?

The correlation matrices give the impression of contradicting a claim in the original study:

Women put greater weight on the intelligence [...] while men respond more to physical attractiveness.

The apparent contradiction is explained by the fact that the subset of events that I used was different from the subset that the authors reported on in their paper. On one hand, I omitted the events with fewer than 14 people. On the other hand, the authors omitted others:

Seven have been omitted...four because they involved an experimental intervention where participants were asked to bring their favorite book. These four sessions were run specifically to study how decision weights and selectivity would be affected by an intervention designed to shift subjects’ attention away from superficial physical attributes.

The intervention of asking participants to bring their favorite book seems to have had the intended effect. One could argue that the sample that I used is unrepresentative on account of the intervention. But to my mind, the intervention falls within the range of heterogeneity that one might expect across real world events, and it’s unclear to me that the events without the intervention give a better sense for gender differences in mate selection across contexts than the events with the intervention do.

A priori one might still be concerned that my choice of sample would lead me to develop a model that gives too much weight to intelligence when the rater is a man. But I chose the features that I did specifically with the intent of creating a model that would work well across heterogeneous speed dating events, and made no use of intelligence ratings to predict men’s decisions.

There are some gender differences in the correlations even in the sample that I use – in particular, the correlations tend to be stronger when the rater is male. This could be because of actual differences in preferences, or because of differences with respect to susceptibility to the halo effect, or a number of other things.

Whatever the case may be, the first three points that I made about the correlation matrix for male raters are also true of the correlation matrix for female raters.

The fourth point needs to be supplemented by the observation that from the point of view of predicting decisions, when the raters are women, not only do the “liking” ratings reflect the sincerity ratings too much, they also reflect the ambition and intelligence ratings too much.

A composite index to closely approximate a ratee’s desirability

When the rater is a man, the liking average captures the predictive power present in the fun, ambition, and intelligence averages, and we lose nothing by dropping them. We would like to use all three of the most predictive averages (decision, liking, and attractiveness) to predict men’s decisions. But the three vary in predictive power from event to event, and if we use all three separately we end up overfitting and getting suboptimal results. Since their predictive power is roughly equal when the raters are men, we simply average the three to form a composite index of a woman’s desirability. Using it in a model gives better results than using any combination of the three individually.

When the raters are women, the liking average does a poor job of picking up on the predictive power of the attractiveness, fun, ambition, intelligence, and sincerity averages to the right degrees. So in this case, we form the same composite index as for men, but with the liking average omitted.
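Here is a minimal sketch of the two composite indices. Decisions are 0/1 while ratings run from 1 to 10, so I standardize each average before combining them; the post doesn’t spell out an exact normalization, so treat this as one reasonable reading rather than the method actually used. Column names follow the earlier snippets.

```python
def zscore(s):
    return (s - s.mean()) / s.std()

# df: output of add_generic_features above.
# Desirability of a female ratee (for predicting men's decisions):
# equal-weight combination of the three roughly equally predictive averages.
df["ratee_desirability_f"] = (
    zscore(df["ratee_yes_freq"])
    + zscore(df["ratee_like_avg"])
    + zscore(df["ratee_attr_avg"])
) / 3

# Desirability of a male ratee (for predicting women's decisions):
# same construction, with the liking average omitted.
df["ratee_desirability_m"] = (
    zscore(df["ratee_yes_freq"]) + zscore(df["ratee_attr_avg"])
) / 2
```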

Fun and intelligence as positive predictors when the rater is a woman

When the rater is a woman, the fun average gives substantial incremental predictive power (more so than when the rater is a man, even if we drop the “liking” average from the composite index used to predict his decisions), and we include it. The ambition average offers no additional predictive power. We get a little more predictive power by including the intelligence average, though the gain is nearly negligible until we make the modification described below.

Ratees’ sincerity as an incremental negative predictor

While the ratees’ average sincerity ratings are correlated with the raters’ decisions being yes, the effect vanishes after we control for the decision and attractiveness averages, suggesting that the sincerity/​decision correlation is driven by the halo effect rather than by people’s decisions being influenced by genuine sincerity.

As mentioned above, the liking average is more strongly correlated with the sincerity average than the decision average is, and so when the rater is a man, our composite index of a woman’s desirability is weakened by the fact that it indirectly picks up on sincerity.

Similarly, the predictive power of the intelligence average that we included to predict women’s decisions is degraded by the fact that it indirectly picks up on sincerity ratings.

We can correct for this problem by including the sincerity averages in our model as controls: this allows our model to give more weight to the factors that actually drive decisions than it otherwise could.
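Concretely, “controlling for” sincerity here just means adding the sincerity average as an extra regressor, so the model can assign it a negative weight instead of letting it contaminate the composite index and the intelligence average. A hedged sketch for women’s decisions, using the assumed feature names from the earlier snippets (rater_sex is likewise an assumed column):

```python
from sklearn.linear_model import LogisticRegression

features = ["ratee_desirability_m", "rater_yes_freq",
            "ratee_fun_avg", "ratee_intel_avg", "ratee_sinc_avg"]
women_raters = df[df["rater_sex"] == "F"].dropna(subset=features + ["decision"])

model = LogisticRegression()
model.fit(women_raters[features], women_raters["decision"])

# If the analysis above is right, the coefficient on ratee_sinc_avg
# should come out negative: it serves as a control, not a positive signal.
print(dict(zip(features, model.coef_[0].round(2))))
```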

Desirability as a signal of selectivity

More desirable people are more selective. This is nearly a tautology in general, and doesn’t necessarily reflect snobbishness: barring polyamory, somebody with many suitors has to reject a larger percentage of them than somebody with few. The pattern is present in the dataset, even though raters were generally allowed to decide yes on as many people as they wanted to.

I found that a ratee’s selectivity didn’t consistently yield incremental predictive power as a signal of the ratee’s desirability. But so far we’ve used only one metric of the rater’s selectivity (the rater’s yes frequency), and we can improve on it by adding a measure of the rater’s desirability. For this I simply used the man’s composite desirability index when the rater is a man. When the rater is a woman, her composite index includes the average of “liking” ratings given to her, which in turn reflects traits such as ambition, intelligence, and sincerity that may affect selectivity for reasons unrelated to her desirability. So in that case we use the version of the composite index that excludes the “liking” rating averages.

The effect is largely restricted to raters of below-average desirability: the rise in selectivity generally flattens out once one reaches people of above-average desirability. It’s unclear to me why this should be so. My best guess is that there’s an extroversion/desirability connection that emerges at the upper end of the desirability spectrum, which dampened the connection between desirability and selectivity in this study.
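One quick way to eyeball this pattern (a diagnostic of my own, not from the original analysis): collapse to one row per rater and compare mean yes-frequency across desirability quintiles.

```python
import pandas as pd

# One row per rater; rater_desirability is the rater's own composite
# index merged onto each row (an assumed column name).
raters = df.groupby("rater_id").agg(
    desirability=("rater_desirability", "first"),
    yes_freq=("decision", "mean"),
)

# If selectivity flattens out above average desirability, the top
# quintiles should show similar yes-frequencies.
raters["quintile"] = pd.qcut(raters["desirability"], 5, labels=False)
print(raters.groupby("quintile")["yes_freq"].mean())
```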

The predictive power that we obtain

Recall that men’s decisions were yes for 48.5% of the dates in the sample, and women’s decisions were yes for 33.5% of the dates. Matches occurred 15.5% of the time. These establish baselines to judge our model against: if we predicted that everyone rejected everyone, the error rates would simply be the percentages listed. Using the features alluded to above, I obtained predictions with the error rates indicated in the table below:

                Total Error   False Positives   False Negatives   % of Yeses Found
Women by Men    25.1%         25.3%             24.8%             72.8%
Men by Women    23.9%         31.6%             20.9%             54.2%
Matches         14.3%         37.3%             13.3%             17.2%
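The column labels are terse, so for concreteness here is the reading of them that is consistent with the numbers above (e.g., 72.8% of men’s actual yeses found): “False Positives” as the share of predicted yeses that were wrong, and “False Negatives” as the share of predicted nos that were wrong. These definitions are my interpretation, not stated explicitly above.

```python
import numpy as np

def error_report(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Table metrics as I read them; definitions are my interpretation."""
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return {
        "total_error": (fp + fn) / len(y_true),
        "false_positives": fp / (tp + fp),  # wrong among predicted yeses
        "false_negatives": fn / (tn + fn),  # wrong among predicted nos
        "pct_yes_found": tp / (tp + fn),    # share of actual yeses found
    }
```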

In my next post I’ll expand the model to include interactions between individual men’s traits and individual women’s traits.