Data science shows us how to create the perfect horror movie this October 31st

With Halloween just around the corner, we decided to challenge our data scientists to come up with an idea of what the perfect horror movie might look like. What does it take to scare the daylights out of an audience?

Would the menacing silence of early genre examples such as Dr Jekyll and Mr Hyde win the day? Perhaps the shock of Psycho, the creepiness of Poltergeist or the abject and profound terror of Hellraiser?

The data science team at Yard studied the available data from previous successful films in the genre and, with some clever number crunching and analysis, came up with a view on what works best.

Read on and you’ll see.

The Data

We gathered data from the official IMDb website.

Each dataset came as a gzipped, tab-separated values (TSV) file. We merged the files on their unique identifiers and transformed the result.

The final, merged dataset consisted of several metrics such as:

  • Movie title
  • Genre
  • Number of votes
  • Average rating
  • Run time
  • Release year
  • Lead actor
  • Lead actor’s DOB
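
A minimal sketch of how a merge like this might look, using only the Python standard library. The column names (tconst, primaryTitle, genres, runtimeMinutes, averageRating, numVotes) follow the conventions of IMDb's public datasets, but which files and columns the team actually used is an assumption here, and the rows are tiny in-memory stand-ins rather than real data.

```python
import csv
import gzip
import io


def read_gzipped_tsv(raw_bytes):
    """Decompress a gzipped TSV payload and return its rows as dicts."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    # QUOTE_NONE treats quote characters in titles as literal text (an assumption).
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def merge_on_id(left_rows, right_rows, key):
    """Inner-join two lists of row dicts on a shared identifier column."""
    right_by_id = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_id.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


# Illustrative stand-ins for two downloaded files (not real IMDb rows).
titles_gz = gzip.compress(
    b"tconst\tprimaryTitle\tgenres\truntimeMinutes\n"
    b"tt001\tExample Film\tHorror\t95\n"
    b"tt002\tOther Film\tComedy\t101\n"
)
ratings_gz = gzip.compress(
    b"tconst\taverageRating\tnumVotes\n"
    b"tt001\t6.4\t1200\n"
)

titles = read_gzipped_tsv(titles_gz)
ratings = read_gzipped_tsv(ratings_gz)
movies = merge_on_id(titles, ratings, key="tconst")

# Keep only titles whose comma-separated genre list includes Horror.
horror = [m for m in movies if "Horror" in m["genres"].split(",")]
```

An inner join drops titles that have no rating, which suits an analysis built on ratings. Every value stays a string at this stage, so numeric columns would still need converting.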

Example of Horror Movies subset:

As you can see from the list, there are plenty of massive blockbusters in the genre and a great deal of breadth in the sub-genres, from Sigourney Weaver fighting off stomach upsets in Alien to Simon Pegg trying to have a quiet pint in Shaun of the Dead.

The Methodology

#1. MALE OR FEMALE LEAD

Should we have a male, or female lead?

The data seems to suggest that a male lead in a horror movie will give us a slightly better chance of achieving a higher rating. The density plot shows a clear concentration of average ratings just over five for horror films with a male lead compared to just below five for those with a female lead.

But before we make a final decision on the gender of our lead character, we’ll look deeper at the lead actor’s gender and age.

#2. AGE OF LEAD

How old should our lead be?

Age discrimination in movie casting is a recurring issue in Hollywood. In fact, a California law that came into effect in 2017 required IMDb to remove an actor’s age upon request. The problem is more pronounced for women, as lead actresses tend to be younger than lead actors.

But the data shows that from 2000 to 2010 the median age of lead actors, male and female, started to increase. Both the upper and lower bounds increased too.

There’s about a 10-year gap between the ages of male and female leads, and the gap doesn’t change over time. But both start to rise at the same time.

Before determining which age bracket our lead character should be in, we decided to analyse the number of films which belonged to each age segment, by gender.

The largest share of female leads falls in the 20-30 age segment, whereas the largest share of male leads falls in the 30-40 segment. We have very few lead actors in the under-20 (<20) segment, so we would be hesitant to choose this age segment for our lead character.
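
As a rough illustration of this segmentation step, the sketch below derives a lead's age at release from the release year and birth year, then counts films per gender and age segment. The field names, the exact segment boundaries (for example, whether 30 falls in 20-30 or 30-40) and the "60+" catch-all are assumptions, and the rows are made up.

```python
from collections import Counter


def age_segment(age):
    """Map an age in years to a ten-year segment label such as '20-30'."""
    if age < 20:
        return "<20"
    if age >= 60:
        return "60+"  # hypothetical catch-all for older leads
    lower = (age // 10) * 10
    return f"{lower}-{lower + 10}"


def segment_counts(films):
    """Count films per (gender, age segment); age is release year minus birth year."""
    counts = Counter()
    for film in films:
        age = film["release_year"] - film["lead_birth_year"]
        counts[(film["lead_gender"], age_segment(age))] += 1
    return counts


# Made-up rows, for illustration only.
films = [
    {"lead_gender": "F", "release_year": 2005, "lead_birth_year": 1980},
    {"lead_gender": "F", "release_year": 2010, "lead_birth_year": 1993},
    {"lead_gender": "M", "release_year": 2008, "lead_birth_year": 1972},
    {"lead_gender": "M", "release_year": 2012, "lead_birth_year": 1977},
]
counts = segment_counts(films)
```

Using only the birth year gives an age that can be off by one, which is unlikely to matter for ten-year segments.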

What age bracket should our lead character be in for the best chance of a high rating for our horror film? Given the above results, we ran our analysis on horror movies from the year 2000 onwards.

The plot shows us the distribution of a lead actor’s age and gender against movie ratings.

For both male (LHS) and female (RHS) actors, the optimal age segment is under twenty (<20), where more of the distribution sits towards the right (higher ratings). However, the sample size for this segment is much smaller.

The next optimal age segment is harder to decipher, so we plotted lead age against average ratings with a relational view. The plot below shows that, although films with younger male and female leads have better ratings, the risk/uncertainty is much larger.

For male actors, the optimal age segment is thirty to forty (30-40). For female actors, although the sample size is much smaller, there is clear evidence that the optimal age is below twenty (<20).

Given these results, our lead character will be a young woman aged around twenty or under, with a supporting male actor aged 30-40.
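
The trade-off between higher ratings and greater uncertainty in the small under-20 segment can be made concrete by reporting a sample size and standard error alongside each segment's mean rating. The sketch below is one way to do that, not necessarily the method behind the plots, and the ratings in it are invented.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def rating_summary(rows):
    """Return {segment: (n, mean rating, standard error)} from (segment, rating) pairs."""
    by_segment = defaultdict(list)
    for segment, rating in rows:
        by_segment[segment].append(rating)
    summary = {}
    for segment, ratings in by_segment.items():
        n = len(ratings)
        # A standard error needs at least two observations; None marks "unknown".
        se = stdev(ratings) / sqrt(n) if n > 1 else None
        summary[segment] = (n, mean(ratings), se)
    return summary


# Invented ratings: a small, spread-out young segment and a larger, tighter one.
rows = [
    ("<20", 7.5), ("<20", 4.5),
    ("30-40", 5.5), ("30-40", 5.0), ("30-40", 6.0), ("30-40", 5.5),
]
summary = rating_summary(rows)
```

In this toy data the under-20 segment has the higher mean but a much larger standard error, which is the pattern described above: a better average that rests on far fewer films.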

#3. RUN TIME

How long should a horror movie run for?

There is a definite concentration of movies running between 80 and 100 minutes.

It’s difficult to see a trend here, but there seems to be a relationship whereby films with a longer run time have a higher rating.

We plotted a generalised smoothed curve to understand the relationship further.

The graph shows a strong positive correlation between run time and average rating as run time increases from 75 to 110 minutes: longer films in this range tend to rate higher.

However, we need to watch for an increase in risk/uncertainty as run time passes 110 minutes.

Therefore, for our Horror movie, we’ll choose an optimal run time of around 110 minutes.
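
The run-time analysis above can be approximated in a few lines. The sketch below uses a simple binned mean as a stand-in for the smoothed curve (the actual smoothing method is not stated, so this is a substitute, not a reproduction) and computes a Pearson correlation over the 75 to 110 minute range. The data points are invented.

```python
from collections import defaultdict
from math import sqrt


def binned_mean(points, width=10):
    """Average rating per run-time bin; returns sorted (bin start, n, mean rating)."""
    bins = defaultdict(list)
    for runtime, rating in points:
        bins[(runtime // width) * width].append(rating)
    return [(start, len(r), sum(r) / len(r)) for start, r in sorted(bins.items())]


def pearson(points):
    """Pearson correlation coefficient between run time and rating."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    return cov / sqrt(var_x * var_y)


# Invented (run time in minutes, average rating) pairs.
points = [(78, 4.6), (84, 4.9), (88, 5.1), (93, 5.2), (97, 5.4), (104, 5.7), (109, 5.9)]
curve = binned_mean(points)
r = pearson([p for p in points if 75 <= p[0] <= 110])
```

Keeping the count of films in each bin alongside the mean makes it easy to spot where the curve rests on few films, which is where the uncertainty past 110 minutes would show up.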

#4. WHAT’S IN A TITLE?

The goal of text mining is to discover relevant information that is possibly unknown or hidden within the narrative of a piece of text.

Natural Language Processing (NLP) is one methodology used. It tries to decipher the ambiguities in written language by tokenization, clustering, extracting entity and word relationships, and using algorithms to identify themes and quantify subjective information.

To predict the perfect Horror movie title, we needed to break out the titles into individual words and begin mining for insights. This process is called tokenization. We grouped the data into subsets by rating, to better understand the relationship between rating and the language used in titles.
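
A minimal sketch of tokenization and word counting for titles. The stop-word list is a small hypothetical one, and the regular expression keeps only letters and straight apostrophes, so numerals in titles are dropped. Both are simplifying assumptions.

```python
import re
from collections import Counter

# A small, hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "on", "to"}


def tokenize(title):
    """Lower-case a title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", title.lower())


def word_frequencies(titles):
    """Count how often each non-stop-word token appears across all titles."""
    counts = Counter()
    for title in titles:
        counts.update(t for t in tokenize(title) if t not in STOP_WORDS)
    return counts


titles = ["Night of the Living Dead", "Dawn of the Dead", "Shaun of the Dead", "Alien"]
freq = word_frequencies(titles)
top_word, top_count = freq.most_common(1)[0]
```

Filtering stop words first keeps "the" and "of" from crowding out the words that carry the genre's flavour.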

The most frequently used words across the Horror genre are ones which all of us are familiar with.

By splitting the data into two groups, high (LHS) and low (RHS) ratings, we can start to recognise trends within the language of titles, and see which words have a positive correlation with movie ratings. Whilst words like dead and night are used frequently across all movie ratings, there are words to avoid, which generally appear only in lower-rated movies: zombie, monster, hell and killer. Words which have more of a positive impact on ratings are black, ghost, witch and curse.
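
One way to surface words that appear in only one rating group is sketched below. The threshold of 5.0 for splitting high from low is an assumption (the cut-off used for the plots is not stated), and the titles and ratings are invented.

```python
from collections import Counter


def words_by_group(films, threshold):
    """Count title words separately for films rated at/above and below a threshold."""
    high, low = Counter(), Counter()
    for title, rating in films:
        target = high if rating >= threshold else low
        target.update(title.lower().split())
    return high, low


def exclusive_words(counts, other):
    """Words that occur in one group but never in the other, sorted alphabetically."""
    return sorted(word for word in counts if word not in other)


# Invented titles and ratings, purely for illustration.
films = [
    ("Ghost Night", 6.8),
    ("Witch Curse", 6.5),
    ("Zombie Night", 3.9),
    ("Monster Killer", 3.2),
]
high, low = words_by_group(films, threshold=5.0)
only_high = exclusive_words(high, low)
only_low = exclusive_words(low, high)
```

Here "night" appears in both groups and so is excluded from both lists, mirroring words like dead and night that turn up across all ratings.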

Modelling the relationship between the number of words in a horror title and the average movie rating.

Modelling the relationship between the number of letters in a horror title and the average movie rating. The graph below analyses only one-word movie titles, to identify a sweet spot in the number of letters to use.
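
As a rough stand-in for these two models, the sketch below groups films by the number of words in the title, then restricts to one-word titles and groups by number of letters, taking the mean rating in each group. A grouped mean is a simplification of whatever model produced the graphs, and the titles and ratings are made up.

```python
from collections import defaultdict


def mean_rating_by(films, key):
    """Group (title, rating) pairs by key(title) and return {group: mean rating}."""
    groups = defaultdict(list)
    for title, rating in films:
        groups[key(title)].append(rating)
    return {group: sum(r) / len(r) for group, r in sorted(groups.items())}


def word_count(title):
    """Number of whitespace-separated words in a title."""
    return len(title.split())


# Made-up titles and ratings, for illustration only.
films = [
    ("Gloom", 7.0),
    ("Shade", 6.0),
    ("Darkness", 5.0),
    ("Dead Night", 5.5),
    ("Curse of the Night", 5.0),
]
by_words = mean_rating_by(films, word_count)

# Restrict to one-word titles before looking at the number of letters.
one_word = [(t, r) for t, r in films if word_count(t) == 1]
by_letters = mean_rating_by(one_word, len)
```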

Our Conclusion

What does all of this mean for our highly rated horror blockbuster? Well, the safe bet is clearly a single-word title, which should rate well. In fact, the data further shows a sweet spot of between 5 and 7 characters.

We have created a few posters that give you an idea of what might bring us some box office success this Halloween, if the cinemas were open, that is. Perhaps it’s straight to Netflix for these efforts.

There’s nothing spooky to the art of data science – lean on experts, unlock your value.  The results will send a chill down your spine.

How will our audience react when the data scientists go bad?

How will our audience feel when they learn Halloween is not the only scary time of year?

How will our audience react learning where the Halloween story began?

Emily Davies