Behind Blue Eyes - Part 3 : Comparing the Distributions


May 3, 2026 · Ben

Now that we have our data, we can finally compare the eye color distribution of actors to the baseline established in Part 1. As a reminder, the baseline from 29 US states is: Blue/Grey (27.3%), Brown/Hazel (62.8%), and Green/Other (9.9%).

Full Cast

We start by aggregating appearance counts by year and eye color. For each group, we compute an appearance rate. Before running any test, it is worth looking at the data visually.
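As a rough sketch of this aggregation (the column names and toy rows here are hypothetical, not the notebook's actual schema):

```python
import pandas as pd

# Hypothetical schema: one row per (actor, title) appearance,
# with a "year" and an "eye_color" category.
df = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016],
    "eye_color": ["Blue/Grey", "Brown/Hazel", "Brown/Hazel",
                  "Blue/Grey", "Green/Other"],
})

# Count appearances per (year, eye color) ...
counts = df.groupby(["year", "eye_color"]).size().rename("n").reset_index()

# ... then convert counts to a within-year appearance rate.
counts["rate"] = counts["n"] / counts.groupby("year")["n"].transform("sum")
print(counts)
```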

The following results can be computed using the 10_analyze_eye_colors_full_cast.ipynb notebook.

Overview of the data

A few things stand out. The distributions differ between movies and series. There is year-to-year variation, but no visible trend in either direction. It’s also hard to distinguish any seasonality. This looks like statistical noise. It therefore seems safe to aggregate across all years for more robust results. Looking at individual years would not be very informative anyway: the question is whether a structural bias exists, not whether a single year deviates.

About statistical tests

Several statistical tests are used to assess the data. The main one is the χ² goodness-of-fit test. The idea is straightforward: given the baseline proportions, we know how many actors we would expect to see in each eye color category if the cast were drawn at random from the general population. The test compares these expected counts to the observed ones. For each category $i$, we compute:

$$\frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed count and $E_i = N \cdot p_i$ is the expected count (with $N$ the total number of actors and $p_i$ the baseline proportion). The sum of these terms across all categories gives the test statistic, which follows a χ² distribution with $df = k - 1$ degrees of freedom (here $k = 3$, so $df = 2$). A small p-value means the overall distribution deviates significantly from the baseline. Informally, the p-value is the probability of observing a result at least as extreme as ours if the null hypothesis (here, “the cast follows the baseline distribution”) were true: the smaller it is, the harder it becomes to attribute the deviation to chance alone.
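As an illustration, the test is a one-liner with `scipy.stats.chisquare` (the observed counts below are made up, not our actual data):

```python
import numpy as np
from scipy.stats import chisquare

# Baseline proportions from Part 1 (Blue/Grey, Brown/Hazel, Green/Other).
baseline = np.array([0.273, 0.628, 0.099])

# Hypothetical observed counts for illustration.
observed = np.array([270, 660, 70])
expected = observed.sum() * baseline

# Goodness-of-fit test with df = k - 1 = 2 (scipy's default).
stat, pvalue = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {pvalue:.4f}")
```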

However, the χ² test only tells us that something is different, not what. To answer the original question directly, we also perform a one-proportion z-test on Blue/Grey alone: is the observed proportion significantly different from 27.3%? A z-test measures how far an observed value is from a reference value in units of standard error. Here it compares the observed Blue/Grey proportion to the baseline of 27.3%, producing a z-score and an associated p-value. With the sample sizes involved (thousands of actor entries), the normal approximation is well justified.
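A minimal implementation of this z-test, using the null standard error; the counts are hypothetical, and the notebook may use a library routine such as statsmodels instead:

```python
import numpy as np
from scipy.stats import norm

def one_prop_ztest(count, n, p0):
    """Two-sided one-proportion z-test against a reference proportion p0,
    using the standard error under the null: sqrt(p0 * (1 - p0) / n)."""
    p_hat = count / n
    se = np.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided tail probability
    return z, p_value

# Hypothetical: 320 Blue/Grey actors out of 1000, tested against 27.3%.
z, p = one_prop_ztest(320, 1000, 0.273)
print(f"z = {z:.3f}, p = {p:.4f}")
```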

We report Cohen’s h as an effect size measure. With large sample sizes, even tiny deviations from the baseline can produce a significant p-value, so significance alone does not tell us whether the difference is meaningful in practice. Cohen’s h quantifies the magnitude of the difference between two proportions, regardless of sample size. Conventional thresholds are $|h| = 0.2$ for a small effect, $0.5$ for medium, and $0.8$ for large. To check for temporal trends, a linear regression of the Blue/Grey proportion against year is fitted separately. Finally, we compute standardized residuals from the χ² test (the signed square roots of each term) to identify which categories drive the overall deviation.
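Both quantities are cheap to compute by hand. A sketch, with illustrative inputs rather than our actual counts:

```python
import numpy as np

def cohens_h(p1, p2):
    """Cohen's h: difference of two proportions on the arcsine scale."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

def std_residuals(observed, baseline):
    """Standardized residuals: signed square roots of the χ² terms."""
    observed = np.asarray(observed, dtype=float)
    expected = observed.sum() * np.asarray(baseline, dtype=float)
    return (observed - expected) / np.sqrt(expected)

# Illustrative values: 30.8% observed Blue/Grey vs the 27.3% baseline.
print(f"h = {cohens_h(0.308, 0.273):.3f}")
```

The squares of the standardized residuals sum back to the χ² statistic, which is why they pinpoint the categories driving the overall deviation.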

Because we run the z-test on Blue/Grey four times (see later), the probability of at least one false positive grows beyond the nominal $\alpha = 0.05$. To control this family-wise error rate, we apply the Bonferroni correction: each test is evaluated at $\alpha / m = 0.05 / 4 = 0.0125$. Only p-values below this adjusted threshold are considered significant. This is conservative, as it reduces statistical power, but it ensures any surviving result is not an artefact of running multiple tests.
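The correction itself amounts to nothing more than a threshold change (the p-values below are illustrative, not the four actual results):

```python
# Bonferroni: evaluate each of the m tests at alpha / m instead of alpha.
alpha = 0.05
m = 4  # number of z-tests run on Blue/Grey
threshold = alpha / m  # 0.0125

# Hypothetical p-values, one per test.
p_values = [0.003, 0.674, 0.046, 0.52]
significant = [p < threshold for p in p_values]
print(threshold, significant)
```

Note how a nominally significant 0.046 no longer clears the corrected threshold.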

For movies, the results are as follows.

| Metric | Value |
|---|---|
| χ² statistic | 92.36 |
| χ² p-value | 8.81 × 10⁻²¹ |
| Degrees of freedom | 2 |
| Blue/Grey observed | 27.03% |
| Blue/Grey baseline | 27.3% |
| Z-test statistic | -0.421 |
| Z-test p-value | 0.6741 |
| Cohen’s h | -0.006 |
| 95% CI for Blue/Grey | [25.77%, 28.29%] |
| Temporal trend: slope | -0.0030/yr |
| Temporal trend: R² | 0.092 |
| Temporal trend: p-value | 0.3639 |
| Std. residual: Blue/Grey | -0.359 |
| Std. residual: Brown/Hazel | +3.747 |
| Std. residual: Green/Other | -8.842 |

The χ² test is highly significant (p < 10⁻²⁰), meaning the overall distribution of eye colors among movie actors does not match the baseline. However, the z-test on Blue/Grey specifically is not significant (p = 0.674), with an observed proportion (27.03%) essentially identical to the baseline (27.3%). The effect size is negligible (Cohen’s h = -0.006). The temporal slope is not significant either (p = 0.364): no trend over 2015-2025.

The standardized residuals tell the real story. The χ² significance comes almost entirely from Green/Other (residual = -8.842), which is heavily underrepresented. Brown/Hazel is slightly overrepresented (residual = +3.747). Blue/Grey barely deviates at all.


In both cases, the baseline for Blue/Grey falls within (or at the edge of) the confidence interval. We cannot conclude that blue-eyed actors are overrepresented in the full cast. The striking result is actually that Green/Other is consistently underrepresented, with large negative residuals in both movies and series. Diving into the reasons would be interesting, but it is outside the scope of this article. Part of the gap could stem from classifier noise (green eyes occasionally labeled as brown/hazel), but the residuals (-8.8 and -7.3) are too large for misclassification alone to explain a deviation this extreme. This is likely a real effect, with many possible explanations.

It would be tempting to stop here, but we are not quite answering the original question. The gut feeling that started this whole thing was about lead actors, the faces you keep seeing on screen, not about the entire cast down to the last extra. Perhaps we cast our net too wide, which would explain why we recover a distribution close to the general population.

Restricting to Lead Actors

Choosing a cutoff

We go back to our initial data and keep only actors for whom we have photos. As shown in Part 2, photo availability correlates with billing order : top-billed actors almost always have a photo, while minor roles often do not. If anything, this filtering skews toward the actors we actually care about.

Before deciding on a cutoff, it is worth examining the distribution of billing positions.

The same pattern appears for both movies and series. Up to $N = 10$, the billing order distribution roughly follows a geometric decay (which we could fit, but it does not matter much). Beyond $N = 10$, there is a sharp drop-off. Zooming in reveals another geometric-like tail, but this is not particularly informative.

This motivates being much more restrictive. We repeat the analysis using only the top 3 billed actors per title. A top 1 would be too restrictive: not enough data for reliable results. A top 5 would be slightly too broad, as the true principal cast is rarely that large. A top 3 seems like a reasonable compromise. It should capture the actors who are genuinely prominent on screen, while keeping enough data to produce tight confidence intervals.
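As a sketch, assuming a hypothetical `ordering` column carrying IMDB’s billing position (not necessarily the notebook’s actual schema), the restriction is a short pandas filter:

```python
import pandas as pd

# Hypothetical cast table: one row per (title, actor), with billing position.
cast = pd.DataFrame({
    "title_id": ["t1", "t1", "t1", "t1", "t2", "t2"],
    "actor_id": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "ordering": [1, 2, 3, 4, 1, 2],
})

# Keep the 3 best-billed actors per title (works even if
# billing positions are not contiguous).
top3 = cast.sort_values("ordering").groupby("title_id").head(3)
print(top3)
```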

An alternative would be to analyze at the title level : compute the proportion of Blue/Grey eyes per title, then test the mean proportion across titles against the baseline. This would equalize each title’s contribution and better respect the clustering structure (actors within the same title share a casting director and are not truly independent). However, with at most 3 actors per title, no single title can dominate the pooled result, and the clustering concern is largely moot. We stick with the simpler pooled approach.

Results

We recompute the data accordingly and plot them per year to visually check for any trend. We use raw actor counts : each actor appears at most once per title in the top 3, so weighting by appearances adds nothing. You can find a notebook with all of the details in 11_analyze_eye_colors_top_3.ipynb.

As before, no notable temporal trend emerges. We then re-run the hypothesis tests on this restricted dataset.

| Metric | Value |
|---|---|
| χ² statistic | 15.71 |
| χ² p-value | 3.89 × 10⁻⁴ |
| Degrees of freedom | 2 |
| Blue/Grey observed | 30.82% |
| Blue/Grey baseline | 27.3% |
| Z-test statistic | 2.987 |
| Z-test p-value | 0.003 |
| Cohen’s h | 0.078 |
| 95% CI for Blue/Grey | [28.43%, 33.21%] |
| Temporal trend: slope | -0.0053/yr |
| Temporal trend: R² | 0.134 |
| Temporal trend: p-value | 0.268 |
| Std. residual: Blue/Grey | +2.547 |
| Std. residual: Brown/Hazel | -0.489 |
| Std. residual: Green/Other | -2.997 |

This time the result is different. The z-test on Blue/Grey is significant (p = 0.003). The observed proportion of blue/grey-eyed lead actors in movies is 30.82%, compared to 27.3% in the baseline. The 95% confidence interval [28.43%, 33.21%] does not include the baseline. The effect is small (Cohen’s h = 0.078), but it is there. The standardized residual for Blue/Grey is now positive (+2.55), meaning lead actors in movies do skew slightly toward blue/grey eyes compared to the general population. Green/Other remains underrepresented (residual = -3.00). The temporal slope is still not significant (p = 0.268): no evidence that this effect is growing or shrinking over 2015-2025.


For movies, restricting to lead actors reveals a small but statistically significant overrepresentation of Blue/Grey eyes. The effect is modest: about 3.5 percentage points above the baseline. Importantly, this result survives Bonferroni correction (p = 0.003 < 0.0125). For series, there is nothing to report. The confidence intervals are too wide to draw any conclusion, most likely because the series dataset is substantially smaller.

Note that the borderline result from the full-cast series analysis (p = 0.046) does not survive this correction, which further supports treating it as inconclusive, especially since the effect ran in the opposite direction (fewer blue eyes than expected).

Going Extreme

In Part 2, we committed to re-running the primary tests under two extreme scenarios for the 282 missing actors (all assigned blue/grey, and all assigned brown/hazel) to confirm conclusions hold regardless of missingness mechanism. Before doing so, it is worth checking how many of those 282 actors actually appear in the top-3 restricted dataset. The answer is zero: for both movies and series, every actor in the top 3 billing positions has a photo and a classified eye color. This is consistent with the MAR finding from Part 2 (missingness is driven by billing order, and top-billed actors almost always have photos).

The sensitivity analysis is therefore trivial for the top-3 results: the missing actors cannot affect them. For the full-cast analysis, the 282 missing actors represent about 6.3% of the pool. Even in the worst case (all assigned blue/grey), the full-cast proportion would shift by a few percentage points at most, not enough to reverse the null finding. The conclusions are robust to any plausible missingness mechanism.
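The worst-case bounds themselves are simple arithmetic. A sketch, with made-up counts rather than the actual ones:

```python
def blue_grey_bounds(n_blue, n_classified, n_missing):
    """Worst-case Blue/Grey proportions when every missing actor is
    assigned Brown/Hazel (lower bound) or Blue/Grey (upper bound)."""
    total = n_classified + n_missing
    return n_blue / total, (n_blue + n_missing) / total

# Made-up counts for illustration.
lo, hi = blue_grey_bounds(n_blue=270, n_classified=1000, n_missing=60)
print(f"Blue/Grey proportion lies in [{lo:.1%}, {hi:.1%}]")
```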

Conclusion

So, are blue-eyed actors overrepresented on screen? Yes, among movie leads. The top 3 billed actors in movies show 30.82% Blue/Grey eyes versus 27.3% in the general US population. The result survives Bonferroni correction (p = 0.003), the confidence interval excludes the baseline, and the effect is consistent across 2015–2025 with no temporal trend. The overrepresentation is modest (~3.5 percentage points), but it is statistically clear.

The effect does not extend everywhere. When considering the full cast (all billing positions), the Blue/Grey proportion is indistinguishable from the baseline in both movies and series. For TV series, even restricted to leads, the dataset is too small to draw conclusions : confidence intervals are wide and the z-test is far from significant. The most consistent finding across all analyses is actually an underrepresentation of Green/Other eyes, which shows up in every subset.

A personal note: whether in France or the US, I probably underestimated how common light eyes (blue/grey) actually are in the population, about one in four people. That is more than intuition might suggest, and it makes a 31% figure among lead actors less dramatic than it first sounds. Still, 31% is significantly above 27%, and the data says so clearly.

Of course, all of this assumes we trust the data provided by IMDB. Billing order, which drives the “top 3” analysis, might not always reflect actual screen time or prominence. The initial filtering for US titles is imperfect and may let through some non-US productions. Restricting to US-born actors might yield different results, but IMDB does not provide nationality in its dumps. Cross-referencing with Wikidata could help, but coverage is uneven and the effort would be disproportionate for a curiosity project.

Bonus : Comparing Against France

The French baseline

We compared against the US, but since I live in France, my perception bias mostly originates there. It makes sense to also compare against a French baseline. We rely on the Walsh et al.¹ study already mentioned in Part 1. The study is part of the EUREYE project, which collected high-resolution iris photographs from elderly Europeans (aged 65+) across seven sites. Eye color was classified by a single grader from these images into three categories: blue, brown, and “undefined” (a catch-all for green eyes, heterochromatic irises, and other ambiguous phenotypes). For the Paris-Creteil site (N = 616), the distribution is: brown (63.6%), blue (28.8%), and undefined (7.6%), conservatively rounded so the proportions sum to 100%.

The mapping to our categories is imperfect. Their “blue” maps reasonably to our Blue/Grey, and “brown” to our Brown/Hazel. However, “undefined” is not equivalent to Green/Other : it includes heterochromatic irises (e.g. blue with a brown peripupillary ring), golden/amber tones, and only a handful of truly green-eyed individuals (3 out of 47 “undefined” in France, per their Table 1). Still, the proportions are close enough to serve as a rough comparison point.

Using this as the reference, the observed proportions fall within the confidence intervals (in the lead actors case), so no significant difference can be established. We can rerun our 11_analyze_eye_colors_top_3.ipynb notebook with this baseline to see the results. I’ll leave that as an exercise.

Interpretation

The data choice is debatable. The cohort is small (616 individuals), elderly (65+), drawn from a single metropolitan area (Paris-Creteil), and 17% of participants were born in North Africa without further information on their ethnic origin. These limitations make it a poor proxy for the general French population.

There is also a conceptual subtlety worth noting. Comparing US actors against a French baseline answers a different question than comparing them against a US baseline. The US baseline tests for casting bias: does Hollywood over-select blue-eyed actors relative to the population they draw from? A French baseline tests for perception mismatch: do I see more blue eyes on screen than I see around me? These are two distinct questions. If US actors had more blue eyes than the French population, it could simply reflect demographic differences between the two countries rather than any casting preference. In practice, the US and French baselines happen to be close (~27.3% vs ~28.8%), so the distinction barely matters here, but it means the French comparison can only speak to personal perception, not to the structure of the acting pool. Still, the result is consistent with what we observed against the US baseline: no strong evidence of overrepresentation, at least not in the full cast.


  1. Walsh, S., Wollstein, A., Liu, F., Chakravarthy, U., Rahu, M., Seland, J. H., Soubrane, G., Tomazzoli, L., Topouzis, F., Vingerling, J. R., Vioque, J., Fletcher, A. E., Ballantyne, K. N., & Kayser, M. (2012). DNA-based eye colour prediction across Europe with the IrisPlex system. Forensic Science International: Genetics, 6(3), 330–340. https://doi.org/10.1016/j.fsigen.2011.07.009
