Behind Blue Eyes - Part 2 : Collecting Data
To answer the question raised in Part 1, we need data on the eye color of lead actors. To my knowledge, no database exists with this information. Wikidata sometimes contains it, but the coverage is very patchy. The simplest and most reliable approach is probably to build an actor dataset, collect their photos, and annotate them. Luckily, that’s exactly what we’re going to do today. The analysis will then follow in a third post.
Source Data
The most well-known data source for movies and TV shows is probably IMDB. The data is very rich, and fortunately IMDB publishes it daily, so no scraping is needed. It is available for non-commercial use at https://datasets.imdbws.com. The dumps used here are dated 2026/04/11. The following files are used : title.basics.tsv.gz (title types, years, and genres), title.ratings.tsv.gz (ratings and vote counts), title.akas.tsv.gz (regions and languages), and title.principals.tsv.gz (cast credits).
We extract them with gunzip. To limit memory usage and loading time, a rough first pass filters the relevant lines. It is a bit hacky, but gets the job done.
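To give an idea of what that first pass can look like, here is a minimal sketch in Python. The actual filter may differ ; the file name and the kept title types are assumptions based on the filters described later in this post.

```python
import gzip

# Rough first pass : stream the gzipped dump and keep only lines that can possibly be
# relevant, so the real parsing later loads a much smaller file.
KEEP_TYPES = (b"\tmovie\t", b"\ttvMovie\t", b"\ttvSeries\t", b"\ttvMiniSeries\t")

with gzip.open("title.basics.tsv.gz", "rb") as src, open("title.basics.filtered.tsv", "wb") as dst:
    dst.write(next(src))  # keep the header row
    for line in src:
        if any(t in line for t in KEEP_TYPES):
            dst.write(line)
```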
The files are too large to include here, but filtered versions will be provided later.
Filtering Titles
We are interested in movies and TV shows. Documentaries, animated productions, and similar content will be excluded later. Movies and TV shows are treated separately, as we can expect different actor distributions between the two (and at worst, keeping them separate changes nothing).
We scope the data to 2015-2025, to stay roughly consistent with the baseline study (2016 data, published in 2025). We can expect (hope) there was no major distribution shift over the last 10 years. Perhaps that is something we’ll be able to see in the data though.
We filter for English-language titles from the US region. IMDB does not provide much more detail than that, but this should be a reasonable heuristic. Popularity will help correct for any noise. If we restrict the scope enough, then it will be possible to manually fix remaining issues.
Movies are grouped under titleType "movie" or "tvMovie", and TV shows under "tvSeries" or "tvMiniSeries".
The filters applied are detailed in the notebook 03_extract_titles.ipynb.
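To make the filtering concrete, here is a minimal pandas sketch of the kind of selection done in that notebook. Column names follow the IMDb dumps ; the genre-based exclusion and the US check via title.akas are assumptions about how the notebook implements the filters described above.

```python
import pandas as pd

basics = pd.read_csv("title.basics.filtered.tsv", sep="\t", na_values="\\N", low_memory=False)
akas = pd.read_csv("title.akas.filtered.tsv", sep="\t", na_values="\\N", low_memory=False)

MOVIE_TYPES = {"movie", "tvMovie"}
SERIES_TYPES = {"tvSeries", "tvMiniSeries"}

# 2015-2025 titles of the right type, excluding documentaries and animation
titles = basics[
    basics["titleType"].isin(MOVIE_TYPES | SERIES_TYPES)
    & basics["startYear"].between(2015, 2025)
    & ~basics["genres"].fillna("").str.contains("Documentary|Animation")
]

# Keep titles with a US release entry in the akas dump
us_ids = set(akas.loc[akas["region"] == "US", "titleId"])
titles = titles[titles["tconst"].isin(us_ids)]

movies = titles[titles["titleType"].isin(MOVIE_TYPES)]
series = titles[titles["titleType"].isin(SERIES_TYPES)]
```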
Ranking by Popularity
We use vote count as a proxy for viewership. The rating alone is not ideal : the highest-rated film is not necessarily the most watched. Vote count is a better proxy, as more votes generally means more viewers. However, this introduces a bias towards older titles, which have had more time to accumulate votes, so a correction is needed. We compute a popularity score by normalizing vote counts per year (with a lower bound of 0), then retain the top N titles per year.
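As a sketch, the per-year scoring and selection could look like this. The exact normalization lives in the notebook ; the z-score-style scaling clipped at 0 shown here is an assumption, the post only specifies per-year normalization with a lower bound of 0.

```python
import pandas as pd

ratings = pd.read_csv("title.ratings.filtered.tsv", sep="\t", na_values="\\N")

def yearly_top(titles: pd.DataFrame, n: int) -> pd.DataFrame:
    # Join vote counts, normalize them within each year, clip at 0, keep the top n per year
    df = titles.merge(ratings, on="tconst", how="inner")
    votes_by_year = df.groupby("startYear")["numVotes"]
    df["popularity"] = (
        (df["numVotes"] - votes_by_year.transform("mean")) / votes_by_year.transform("std")
    ).clip(lower=0)
    return df.sort_values("popularity", ascending=False).groupby("startYear").head(n)

top_movies = yearly_top(movies, 50)   # `movies` / `series` : filtered frames from the previous step
top_series = yearly_top(series, 25)
```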
A quick look at the vote distributions already shows a difference between movies and TV series. For movies, the distribution seems fairly stable over the years, though that would be worth investigating more rigorously.
Plotting the densities directly makes the gap between movies and series obvious. To keep the plot readable, the number of votes is shown on a logarithmic scale.
The vote distributions do differ, confirming the need to keep movies and TV shows separate. We go with 50 movies per year and 25 series per year. There are fewer series, but they span multiple episodes and seasons over several years, which justifies the lower selection count.
Manual Cleaning
A quick manual review of the selected titles confirms the results look reasonable. With 550 movies and 275 series, the full list is manageable to check by hand, and lends itself well to automation with a deep research agent. The goal is to filter out titles with no real connection to the US that could introduce bias. The removed movies are listed below (series were filtered the same way in the corresponding notebook). The result might not be perfect, but it should define a pretty good baseline for our study.
| 12th Fail | Godzilla Minus One | Roma | The Kashmir Files |
| Adipurush | I’m Still Here | RRR | The Killing of a Sacred Deer |
| Aftersun | Jai Bhim | Sadak 2 | The Lobster |
| All Quiet on the Western Front | K.G.F: Chapter 2 | Shershaah | The Platform |
| Anatomy of a Fall | Kantara | Sisu | The Power of the Dog |
| Another Round | Laal Singh Chaddha | Society of the Snow | The Substance |
| Belfast | Last Night in Soho | Soorarai Pottru | The Worst Person in the World |
| Dangal | Late Night with the Devil | Talk to Me | The Zone of Interest |
| Dara of Jasenovac | Napoleon | The Banshees of Inisherin | Train to Busan |
| Dhurandhar | Parasite | The Father | Triangle of Sadness |
| Dil Bechara | Pathaan | The Handmaiden | Yesterday |
| Emilia Pérez | Radhe | The Invisible Guest |
This reduces the dataset from 550 to 503 movies, and from 275 to 211 series. The notebooks 04_filter_series.ipynb and 05_filter_movies.ipynb detail the filtering. The filtered data is available as CSV files: movies_50_us.csv and series_25_us.csv.
Extracting Cast Members
We then extract the actors. For each actor, we count the number of appearances per year, separately for movies and TV shows. Counting each actor once would tell us about the casting pool, that is, whether blue-eyed people get hired more often. But the intuition behind this study is more visceral : I feel like I keep seeing blue eyes on screen. That's an exposure question. An actor cast across 15 titles shapes your viewing experience far more than someone who appeared in one film. Weighting by appearances captures this compounding effect, and if blue-eyed actors are also booked more often, that's a stronger finding than over-representation in the pool alone.
We maintain a separate list for movies and for series, with per-year detail, which will later be used in two ways : aggregated across years as weights in the primary analysis, and broken down per year to test for temporal trends. The full selection process is available in the notebook 06_extract_actors.ipynb. The filtered data is available as CSV files: actors_movies_50_us.csv and actors_series_25_us.csv. The extraction is straightforward : we join our selected titles against the principals data to retrieve the corresponding cast.
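A minimal sketch of that join, using the principals dump. The column names of the selected-titles CSV are assumptions.

```python
import pandas as pd

principals = pd.read_csv("title.principals.filtered.tsv", sep="\t", na_values="\\N")
top_movies = pd.read_csv("movies_50_us.csv")   # assumed to contain at least tconst and startYear

# Keep acting credits only, then join against the selected titles
cast = principals[principals["category"].isin(["actor", "actress"])]
cast = cast.merge(top_movies[["tconst", "startYear"]], on="tconst", how="inner")

# One row per actor and year, weighted by the number of selected titles they appear in
appearances = (
    cast.groupby(["nconst", "startYear"])
        .size()
        .reset_index(name="n_titles")
)
```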
Inferring Eye Color
With our actor dataset assembled, the next step is to determine the eye color of each person. Since no structured source provides this information reliably, we resort to visual inference : find a photo for each actor, run it through a vision model, and assign a color category.
Finding Actor Photos
With the actor list in hand, we can look up a photo for each of them and attempt to infer eye color. We use the TMDB API for this, as getting images from IMDB is less trivial. Such images are not in the data dumps they provide, and scraping the IMDB website is a bit more complex than simply using a web API. I'd rather not spend my life on this. There are 4,509 unique actors in total (across movies and series). For 282 of them, no photo could be found. A fallback to Wikidata would be possible but unreliable, and rate limiting makes it impractical. The process is straightforward and detailed in the notebook 07_extract_pictures_urls.ipynb.
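For reference, a lookup through TMDB's find-by-external-ID endpoint can be sketched as follows. This mirrors what the notebook does in spirit ; the exact endpoint and image size used there may differ.

```python
import requests

TMDB_KEY = "..."  # personal TMDB API key
IMG_BASE = "https://image.tmdb.org/t/p/w500"

def photo_url(imdb_id: str) -> str | None:
    # Resolve an IMDB person id (nm...) to a TMDB profile picture URL, if one exists
    resp = requests.get(
        f"https://api.themoviedb.org/3/find/{imdb_id}",
        params={"api_key": TMDB_KEY, "external_source": "imdb_id"},
        timeout=10,
    )
    resp.raise_for_status()
    people = resp.json().get("person_results", [])
    if people and people[0].get("profile_path"):
        return IMG_BASE + people[0]["profile_path"]
    return None
```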
The real question is what to do with actors for whom no photo is available. Before dropping them, it is worth examining the data to understand the situation better. Missing data in studies like this falls into three mechanisms (which are pretty well covered on Wikipedia) :
- MCAR (Missing Completely At Random) : missingness is unrelated to any variable, observed or not. Safe to just drop and adjust denominators.
- MAR (Missing At Random) : missingness depends on observed variables (billing order, genre, year, ethnicity, etc.) but not on eye color itself. Can correct with weighting or imputation.
- MNAR (Missing Not At Random) : missingness depends on eye color itself. That’s the annoying case.
The analysis below is detailed in the notebook 08_missing_pictures_analysis.ipynb. First, we check whether missingness correlates with billing order and year (the only available covariates), but not with eye color itself. The goal is to determine whether we are in a MCAR, MAR, or MNAR case. A simple logistic regression serves this purpose. If no strong correlation emerges, the data can be considered MCAR and dropping the missing actors is safe. The analysis is run separately for movies and series.
We fit a logistic regression `missing ~ ordering + startYear` (a minimal sketch is shown after the list below). We then look at a few metrics :
- Pseudo R² measures how much of the variation in missingness the model explains. If its value is close to 0, then the model explains almost nothing. (As we will see, billing order has a real effect, but missingness remains mostly unpredictable noise, which is good news.)
- z and P>|z| are from the logistic regression coefficients. The z-score measures how many standard errors the coefficient is from zero. P>|z| is the probability of seeing that z-score by chance if the true coefficient were zero. If p is very small (let’s say smaller than 0.01), then we can assume a real effect of the predictor. If p is big enough (let’s say greater than 0.05), then the effect of the predictor is indistinguishable from noise.
If no predictor is significant and pseudo R² is near 0, missingness is unrelated to observables, meaning MCAR is plausible and it is safe to drop. If ordering or year predicts missingness, we are in the MAR case, which is still safe to drop since missingness depends on observed variables and not on eye color itself.
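With statsmodels, the regression described above can be fit in a few lines. The column names are assumptions ; `missing` is a 0/1 flag for "no photo found".

```python
import statsmodels.formula.api as smf

# `actors` : one row per actor entry, with billing order (`ordering`), title year
# (`startYear`), and a 0/1 `missing` flag for entries with no photo
model = smf.logit("missing ~ ordering + startYear", data=actors).fit()

print(model.summary())   # coefficients, standard errors, z, P>|z|
print(model.prsquared)   # McFadden pseudo R²
```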
For movies, there are 3,193 actor entries (spread across multiple years, so duplicates exist) and 253 missing photos (7.9%). The logistic regression results are shown below, with a McFadden pseudo R² of 0.0320 (computed with the Python statsmodels package).
| | Coefficient | Standard Error | z | P>|z| |
|---|---|---|---|---|
| Intercept | -54.372326 | 40.922522 | -1.328665 | 1.839585e-01 |
| ordering | 0.172923 | 0.023797 | 7.266465 | 3.690157e-13 |
| startYear | 0.025098 | 0.020254 | 1.239177 | 2.152800e-01 |
The conclusion is straightforward :
- `ordering` is highly significant (p < 10⁻¹²). Lower-billed actors are more likely to have no picture. This makes complete sense : leads are well-photographed, minor roles aren't.
- `startYear` is not significant (p ≈ 0.22) : no systematic year effect.
- Pseudo R² = 0.032. Observables explain very little of the variance in missingness, meaning it's not strongly structured.
This is textbook MAR : missingness depends on billing order (an observed variable), not on eye color. Dropping actors with no picture is valid. Actors in minor roles are less likely to have photos. This is unsurprising, and not particularly problematic. Lead actors are more prominent and more visible on screen, which is precisely where the original question comes from. For movies, we can therefore safely drop actors with missing photos.
The series data, analysed in the same notebook, shows the same pattern : in both cases, billing order is the main driver of missingness. Therefore, actors without photos are dropped.
As an additional safeguard, we will later re-run the primary tests under two extreme scenarios :
- all missing actors assigned `blue/grey`,
- and all assigned `brown/hazel`.
This will help us to confirm the conclusions hold regardless of the missingness mechanism.
We then download the photos for each actor. There is not much to detail : it is simply a matter of parallelizing GET requests over a list of URLs. The URLs are available here : actors_urls.csv.
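A minimal sketch of that download step (the column names in actors_urls.csv are assumptions) :

```python
import concurrent.futures as cf
from pathlib import Path

import pandas as pd
import requests

urls = pd.read_csv("actors_urls.csv")   # assumed columns : nconst, url
out_dir = Path("photos")
out_dir.mkdir(exist_ok=True)

def fetch(nconst: str, url: str) -> None:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    (out_dir / f"{nconst}.jpg").write_bytes(resp.content)

# Parallelize the GET requests with a simple thread pool
with cf.ThreadPoolExecutor(max_workers=16) as pool:
    futures = [pool.submit(fetch, row.nconst, row.url) for row in urls.itertuples()]
    for future in cf.as_completed(futures):
        future.result()   # re-raise any download error
```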
Classifying Eye Color
To annotate the images, I use Gemma 4 E2B (under Apache 2.0 licence). It runs well on a consumer GPU and processes the batches quickly. The model is served with Ollama, which exposes an OpenAI-compatible API. The procedure is detailed in this notebook : 09_classify_eye_color.ipynb.
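As an illustration, a classification call through Ollama's OpenAI-compatible endpoint can look like this. The model tag, prompt, and local URL are assumptions ; the actual prompt lives in the notebook.

```python
import base64
from openai import OpenAI

# Ollama serves an OpenAI-compatible API locally ; the model tag depends on what was pulled
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
MODEL = "gemma3n:e2b"  # adjust to the tag of the Gemma model served by Ollama

def classify(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What is this person's eye color? Answer with exactly one of : "
                         "brown/hazel, blue/grey, green/other."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()
```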
No model is perfect, and eye color classification is a genuinely ambiguous task. The boundaries between categories like hazel, green, and brown are not always clear-cut, even for a human observer, and image quality or lighting conditions can make things worse. It is therefore important to get an estimate of the model’s error rate, and more specifically to understand the direction of any systematic bias, since not all misclassifications are equally harmful to the analysis.
With so many actors, manually verifying everything is not practical. Instead, we sample a subset for estimation. I asked Claude Sonnet to quickly put together a small annotation app for this purpose. The code is dirty and definitely not maintainable, but it does the job. It’s perfect for a one-shot trashable app. You can see the result in 01_annotation_validation.py.

Annotation interface generated by Claude Sonnet
We manually review 90 images (30 per class). The class imbalance is heavy : brown/hazel at 69.1%, blue/grey at 25.7% and green/other at 5.2%.
For validation, stratified sampling is needed : otherwise you would mostly be reviewing brown eyes. It ensures each class gets a fair evaluation, at the cost of an overall accuracy figure that no longer reflects the real-world distribution. Since the images come from the same source and share a similar framing, sampling at random within each class should still yield a representative subset. This is an acceptable tradeoff : per-class accuracy and the confusion matrix are what actually matter for understanding and correcting classifier bias.
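Drawing the stratified sample is a one-liner with pandas (file and column names are assumptions) :

```python
import pandas as pd

preds = pd.read_csv("actors_colors.csv")   # assumed columns : nconst, eye_color

# 30 images per predicted class, drawn at random for manual review
sample = preds.groupby("eye_color", group_keys=False).sample(n=30, random_state=42)
sample.to_csv("validation_sample.csv", index=False)
```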
Manual annotation took about 5 minutes. Overall accuracy is 97.8% (88/90). Of the two errors, one is clear-cut : actress nm1409365 is classified as brown/hazel despite obvious blue/grey eyes. The second is debatable : actress nm3908228 is classified as green/other, but depending on the screen, brown/hazel would be equally defensible. It's a typical boundary case between hazel, brown, and green. Excluding it would push accuracy to 98.9%, but the conservative figure is kept. Some images are low quality or dark, making it genuinely hard to distinguish brown from blue in borderline cases.
| Class | Predicted (n) | True (n) | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| Brown / Hazel | 30 | 30 | 96.7% | 96.7% | 96.7% | 96.7% |
| Blue / Grey | 30 | 31 | 100% | 96.8% | 98.4% | 100.0% |
| Green / Other | 30 | 29 | 96.7% | 100.0% | 98.3% | 96.7% |
A few patterns stand out from the results :
- `blue/grey` → `brown/hazel` : 1 misclassification. The only directionally meaningful error. The classifier occasionally downgrades borderline `blue/grey` eyes to `brown/hazel`.
- `green/other` errors : 1 misclassification (`brown/hazel` → `green/other`), no directional impact on `blue/grey`.
The single blue/grey → brown/hazel misclassification out of 31 true blue/grey cases (3.2%) means the corpus likely slightly under-counts blue/grey. With blue/grey at 25.7% of the 4,509 actors (~1,159), that translates to roughly 37 expected false negatives (assuming our validation set is representative). The bias is small and conservative : the model will underestimate the proportion of blue/grey actors, so any conclusion reached with these deflated figures would hold equally (or more strongly) against the true proportions. And with only 30 samples per class, observing zero false positives for blue/grey is reassuring, but not proof of a perfect classifier.
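The back-of-the-envelope estimate above, as a quick computation :

```python
total_actors = 4509
blue_share = 0.257     # predicted blue/grey share of the corpus
fn_rate = 1 / 31       # 1 missed blue/grey out of 31 true cases in the validation set

blue_actors = total_actors * blue_share      # ≈ 1,159 blue/grey actors
expected_missed = blue_actors * fn_rate      # ≈ 37 likely mislabelled as brown/hazel
```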
Overall, the annotations appear trustworthy, though we will account for the uncertainty in the analysis. The annotation results are available here: actors_colors.csv.
Conclusion
We now have a dataset of 713 US English-language titles (503 movies, 211 series) spanning 2015–2025, with eye color annotations for 4,227 out of 4,509 actors. The classifier is reliable, with a small conservative bias on blue/grey : it slightly underestimates the true proportion, which works in favour of a null result. Any finding that survives these deflated figures is the more robust for it. The data is ready. In the next post, we will finally test whether blue-eyed actors are over-represented on screen.