Throughout the case study/analysis we'll be using the following libraries:

Library             Purpose
sklearn             Modelling
matplotlib, bokeh   Visualization
numpy, pandas       Data Manipulation

In this case study I want to do two things. First, I want to predict the revenue of a movie from its characteristics. Second, I want to predict whether a movie's revenue will exceed its budget.

Drudgery: import and take a look at the data

import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score

import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.style.use("ggplot")
%matplotlib inline

df = pd.read_csv("data/processed_data.csv", index_col=0)

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4775 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4391 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4452 non-null   object 
 10  production_countries  4629 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4716 non-null   object 
 15  status                4803 non-null   object 
 16  tagline               3959 non-null   object 
 17  title                 4803 non-null   object 
 18  vote_average          4803 non-null   float64
 19  vote_count            4803 non-null   int64  
 20  movie_id              4803 non-null   int64  
 21  cast                  4760 non-null   object 
dtypes: float64(3), int64(5), object(14)
memory usage: 863.0+ KB
df.head()
budget genres homepage id keywords original_language original_title overview popularity production_companies ... revenue runtime spoken_languages status tagline title vote_average vote_count movie_id cast
0 237000000 Action, Adventure, Fantasy, Science Fiction http://www.avatarmovie.com/ 19995 culture clash, future, space war, space colony... en Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 Ingenious Film Partners, Twentieth Century Fox... ... 2787965087 162.0 English, Español Released Enter the World of Pandora. Avatar 7.2 11800 19995 Sam Worthington, Zoe Saldana, Sigourney Weaver...
1 300000000 Adventure, Fantasy, Action http://disney.go.com/disneypictures/pirates/ 285 ocean, drug abuse, exotic island, east india t... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 Walt Disney Pictures, Jerry Bruckheimer Films,... ... 961000000 169.0 English Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500 285 Johnny Depp, Orlando Bloom, Keira Knightley, S...
2 245000000 Action, Adventure, Crime http://www.sonypictures.com/movies/spectre/ 206647 spy, based on novel, secret agent, sequel, mi6... en Spectre A cryptic message from Bond’s past sends him o... 107.376788 Columbia Pictures, Danjaq, B24 ... 880674609 148.0 Français, English, Español, Italiano, Deutsch Released A Plan No One Escapes Spectre 6.3 4466 206647 Daniel Craig, Christoph Waltz, Léa Seydoux, Ra...
3 250000000 Action, Crime, Drama, Thriller http://www.thedarkknightrises.com/ 49026 dc comics, crime fighter, terrorist, secret id... en The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 Legendary Pictures, Warner Bros., DC Entertain... ... 1084939099 165.0 English Released The Legend Ends The Dark Knight Rises 7.6 9106 49026 Christian Bale, Michael Caine, Gary Oldman, An...
4 260000000 Action, Adventure, Science Fiction http://movies.disney.com/john-carter 49529 based on novel, mars, medallion, space travel,... en John Carter John Carter is a war-weary, former military ca... 43.926995 Walt Disney Pictures ... 284139100 132.0 English Released Lost in our world, found in another. John Carter 6.1 2124 49529 Taylor Kitsch, Lynn Collins, Samantha Morton, ...

5 rows × 22 columns

Data Preprocessing

The second step is to clean and transform the data so that we can apply regression and classification algorithms to it.

Defining Regression and Classification Outcomes

For regression we'll use revenue as the target, and for classification we'll construct an indicator of profitability for each movie. Let's define a new column profitable such that: $$ \text{profitable} = \begin{cases} 1 & \text{if } \text{revenue} > \text{budget} \\ 0 & \text{otherwise} \end{cases} $$

df['profitable'] = df.revenue > df.budget
df['profitable'] = df['profitable'].astype(int)

regression_target = 'revenue'
classification_target = 'profitable'

df['profitable'].value_counts()
1    2585
0    2218
Name: profitable, dtype: int64

2585 of the 4803 movies in the dataset (roughly 54%) were profitable.

Handling missing and infinite values

Many of the columns are non-numeric, so imputing their missing values would add overhead for little benefit. I'm going to stick with the plain and simple technique of dropping the sparsest column and then every row that still contains a missing or infinite value.

  1. Replace any np.inf or -np.inf occurring in the dataset with np.nan
df = df.replace([np.inf, -np.inf], np.nan)
print(df.shape)
df.info()
(4803, 23)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4775 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4391 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4452 non-null   object 
 10  production_countries  4629 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4716 non-null   object 
 15  status                4803 non-null   object 
 16  tagline               3959 non-null   object 
 17  title                 4803 non-null   object 
 18  vote_average          4803 non-null   float64
 19  vote_count            4803 non-null   int64  
 20  movie_id              4803 non-null   int64  
 21  cast                  4760 non-null   object 
 22  profitable            4803 non-null   int32  
dtypes: float64(3), int32(1), int64(5), object(14)
memory usage: 881.8+ KB

Notice that the homepage column has the most null values in the dataset (only 1712 of 4803 entries are non-null). Discarding it as a feature lets us keep more rows when we drop missing values.

  2. Drop the homepage column, then drop any row with a missing value
df.drop('homepage', axis=1, inplace=True)
df = df.dropna(how="any")
df.shape
(3665, 22)
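
To see why the sparse column is removed before calling dropna, it can help to count missing values per column and compare how many rows survive in each case. The following is a minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions for illustration, not the movie dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movie data; all values are invented.
toy = pd.DataFrame({
    "budget": [100, 200, 300, 400],
    "homepage": ["http://example.com", np.nan, np.nan, np.nan],
    "tagline": ["a", "b", np.nan, "d"],
})

# Number of missing values per column, largest first.
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)

# dropna removes ROWS by default, so one sparse column can wipe out most rows.
rows_kept_with_homepage = len(toy.dropna())
rows_kept_without_homepage = len(toy.drop(columns="homepage").dropna())
print(rows_kept_with_homepage, rows_kept_without_homepage)
```

On the toy frame, keeping the sparse column leaves a single complete row, while dropping it first leaves three.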

Transform genre column using OneHotEncoding

Since the genres column holds comma-separated strings, e.g. "Action, Adventure, Fantasy" for a particular movie, I'll convert each string to a list, extract all unique genres, and finally add a column for each unique genre. The value of a genre column will be 1 if that genre is present in genres, otherwise 0.

list_genres = df.genres.apply(lambda x: x.split(","))
genres = []
for row in list_genres:
    row = [genre.strip() for genre in row]
    for genre in row:
        if genre not in genres:
            genres.append(genre)

for genre in genres:
    df[genre] = df['genres'].str.contains(genre).astype(int)

df[genres].head()
Action Adventure Fantasy Science Fiction Crime Drama Thriller Animation Family Western Comedy Romance Horror Mystery History War Music Documentary Foreign TV Movie
0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
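
As an alternative to the explicit loops, pandas can build the same kind of 0/1 indicator columns in one call. This is only a sketch on a toy frame, and it assumes the genres are always separated by exactly a comma followed by one space, which may not hold for every row of the real data:

```python
import pandas as pd

# Toy frame with the same "comma + space" layout as the genres column.
toy = pd.DataFrame({
    "genres": ["Action, Adventure", "Drama", "Adventure, Drama, Action"]
})

# Split each string on ", " and create one 0/1 column per distinct genre.
# Columns come back sorted alphabetically rather than in order of appearance.
genre_dummies = toy["genres"].str.get_dummies(sep=", ")
print(genre_dummies)

# The indicator columns can then be attached to the original frame.
toy = toy.join(genre_dummies)
```

One difference from the loop version: this matches whole genre names between separators, whereas str.contains matches substrings.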

Extract numerical variables

Many of the variables in the dataset are already numerical, which is useful for regression. We'll extract these variables, plot a few of them, and calculate the skew of each.

continuous_covariates = ['budget', 'popularity',
                         'runtime', 'vote_count', 'vote_average']
outcomes_and_continuous_covariates = continuous_covariates + \
    [regression_target, classification_target]
plotting_variables = ['budget', 'popularity', regression_target]

axes = pd.plotting.scatter_matrix(df[plotting_variables], alpha=0.15,
                                  color=(0, 0, 0), hist_kwds={"color": (0, 0, 0)}, facecolor=(1, 0, 0))
plt.show()
df[outcomes_and_continuous_covariates].skew()
budget          2.173814
popularity      9.491844
runtime         1.738205
vote_count      3.430295
vote_average   -0.852070
revenue         4.021570
profitable     -0.606385
dtype: float64

"Linear algorithms love normally distributed data", and several of the variables (budget, popularity, runtime, vote_count, revenue) are right-skewed. So we'll now reduce the right skew of these variables using np.log10. But first we need to add 1 to every value, as some of the values are 0 and log10(0) = -inf.

Strictly speaking, we are not removing skewness from the data; we are only applying a non-linear transformation to the variables to pull in their long right tails. If you transform skewed data to make it symmetric, and then fit it to a symmetric distribution (e.g., the normal distribution), that is implicitly the same as fitting the raw data to a skewed distribution in the first place.

for covariate in ['budget', 'popularity', 'runtime', 'vote_count', 'revenue']:
    # Vectorized log10(1 + x); the +1 keeps zeros finite (log10(1) = 0)
    df[covariate] = np.log10(1 + df[covariate])

df[outcomes_and_continuous_covariates].skew()
budget         -2.094710
popularity     -0.282719
runtime         0.553294
vote_count     -0.431371
vote_average   -0.852070
revenue        -1.467186
profitable     -0.606385
dtype: float64
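
Note that budget and revenue end up left-skewed after the transform. One possible explanation (an assumption, not something verified against this dataset here) is the cluster of zero values, which log10(1 + x) maps to exactly 0, far below the rest of the transformed values. The sketch below illustrates the effect on synthetic log-normal data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed sample (log-normal); NOT drawn from the movie data.
raw = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=5000))
transformed = np.log10(1 + raw)

skew_raw = raw.skew()
skew_transformed = transformed.skew()
print(skew_raw, skew_transformed)

# Same sample with about 5% zeros mixed in, mimicking missing budget figures.
with_zeros = pd.concat([raw, pd.Series(np.zeros(250))], ignore_index=True)
skew_with_zeros = np.log10(1 + with_zeros).skew()
print(skew_with_zeros)
```

Without zeros the transformed sample is close to symmetric; with the zeros added, the transformed sample becomes strongly left-skewed. If the same thing is happening in the movie data, treating zero budgets or revenues as missing before the transform may be worth considering.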

Save this dataframe separately for modelling

df.to_csv("data/movies_clean.csv")