Throughout this case study we'll be using the following libraries:

| Library | Purpose |
| --- | --- |
| `sklearn` | Modelling |
| `matplotlib`, `bokeh` | Visualization |
| `numpy`, `pandas` | Data Manipulation |

In this case study I want to do two things. First, I want to predict the revenue of a movie from its characteristics. Second, I want to predict whether a movie's revenue will exceed its budget.

Drudgery: import and take a look at the data

import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score

import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.style.use("ggplot")
%matplotlib inline

df = pd.read_csv("data/processed_data.csv", index_col=0)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4775 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4391 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4452 non-null   object 
 10  production_countries  4629 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4716 non-null   object 
 15  status                4803 non-null   object 
 16  tagline               3959 non-null   object 
 17  title                 4803 non-null   object 
 18  vote_average          4803 non-null   float64
 19  vote_count            4803 non-null   int64  
 20  movie_id              4803 non-null   int64  
 21  cast                  4760 non-null   object 
dtypes: float64(3), int64(5), object(14)
memory usage: 863.0+ KB

df.head()

Data Preprocessing

The second step is to clean and transform the data so that we can apply regression and classification algorithms to it.

Defining Regression and Classification Outcomes

For regression we'll use revenue as the target, and for classification we'll construct an indicator of profitability for each movie. Let's define a new column profitable such that: $$ \text{profitable} = \begin{cases} 1 & \text{if } \text{revenue} > \text{budget} \\ 0 & \text{otherwise} \end{cases} $$

df['profitable'] = df.revenue > df.budget
df['profitable'] = df['profitable'].astype(int)

regression_target = 'revenue'
classification_target = 'profitable'

df['profitable'].value_counts()

1    2585
0    2218
Name: profitable, dtype: int64

2585 of the 4803 movies in the dataset (about 54%) were profitable.
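To make the definition concrete, here is a minimal sketch on made-up numbers (the `toy` frame is hypothetical, not part of the dataset). It shows that the strict inequality puts break-even movies, and movies with zero budget and zero revenue, in class 0, and that `value_counts(normalize=True)` reports class shares instead of raw counts.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "budget":  [100, 200, 300, 0],
    "revenue": [150, 200, 100, 0],
})

# Same construction as above: only strictly greater revenue counts as profitable.
toy["profitable"] = (toy["revenue"] > toy["budget"]).astype(int)

# Share of each class rather than raw counts.
class_share = toy["profitable"].value_counts(normalize=True)

print(toy)
print(class_share)
```

Whether break-even or zero-valued records should count as unprofitable is a modelling choice; the sketch only makes the current behaviour visible.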

Handling missing and infinite values

Many of the columns are non-numeric, so handling missing values with anything more elaborate than deletion would add overhead. I'm going to stick with the plain and simple technique of omitting rows with missing or infinite values.

Replace any np.inf or -np.inf occurring in the dataset with np.nan

df = df.replace([np.inf, -np.inf], np.nan)
print(df.shape)
df.info()

(4803, 23)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4775 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4391 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4452 non-null   object 
 10  production_countries  4629 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4716 non-null   object 
 15  status                4803 non-null   object 
 16  tagline               3959 non-null   object 
 17  title                 4803 non-null   object 
 18  vote_average          4803 non-null   float64
 19  vote_count            4803 non-null   int64  
 20  movie_id              4803 non-null   int64  
 21  cast                  4760 non-null   object 
 22  profitable            4803 non-null   int32  
dtypes: float64(3), int32(1), int64(5), object(14)
memory usage: 881.8+ KB

Notice that the homepage column has the most missing values of any column (only 1712 of its 4803 entries are non-null), so we discard it as a feature; this keeps more rows when we drop missing values.
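One way to see which column costs the most rows is to count missing values per column. The sketch below uses a small hypothetical frame (the `toy` data and its values are assumptions for illustration) and compares how many complete rows remain with and without the sparsest column.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [None, None, "http://example.com", None],
    "runtime": [90.0, np.nan, 120.0, 100.0],
})

# Number of missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)

# Complete rows left if every column is kept vs. if the sparsest one is dropped.
rows_kept_all = len(toy.dropna())
rows_kept_without_homepage = len(toy.drop(columns="homepage").dropna())

print(missing_per_column)
print(rows_kept_all, rows_kept_without_homepage)
```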

Drop the homepage column, then drop any row with a missing value

df.drop('homepage', axis=1, inplace=True)
df = df.dropna(how="any")
df.shape

(3665, 22)
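The shape above shows that rows were removed, since `dropna(how="any")` works along rows by default. The following sketch on a hypothetical `toy` frame illustrates the difference between dropping rows and dropping columns, and why infinite values are replaced with `np.nan` first: `dropna` does not treat them as missing by default.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

# Without the replacement, the row containing inf survives.
without_replace = toy.dropna(how="any")

# After the replacement, inf is handled like any other missing value.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
rows_dropped = cleaned.dropna(how="any")          # default axis=0: drops rows
cols_dropped = cleaned.dropna(how="any", axis=1)  # axis=1: drops columns

print(without_replace.shape, rows_dropped.shape, list(cols_dropped.columns))
```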

Transform `genres` column using `OneHotEncoding`

Since the genres column consists of strings of comma-separated genres, e.g. "Action, Adventure, Fantasy" for a particular movie, I'll convert each string to a list, extract all unique genres, and finally add a column for each unique genre. The value of a specific genre column will be 1 if that genre is present in genres, otherwise 0.

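# Split each comma-separated string into a list of genre names
list_genres = df.genres.apply(lambda x: x.split(","))

# Collect the unique genre names in order of first appearance
genres = []
for row in list_genres:
    row = [genre.strip() for genre in row]
    for genre in row:
        if genre not in genres:
            genres.append(genre)

# Add one indicator column per genre (1 if present, 0 otherwise);
# regex=False treats the genre name as a literal string
for genre in genres:
    df[genre] = df['genres'].str.contains(genre, regex=False).astype(int)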

df[genres].head()
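As a possible alternative, pandas can build the same kind of indicator columns in one call with `Series.str.get_dummies`. The sketch below is not the method used in this notebook. It uses made-up rows and assumes the separator is exactly `", "` (comma plus one space), as in the example string above. Unlike `str.contains`, it matches whole genre names rather than substrings.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "genres": ["Action, Adventure, Fantasy", "Drama", "Action, Crime"],
})

# One indicator column per distinct genre name (1 if present, 0 otherwise).
genre_dummies = toy["genres"].str.get_dummies(sep=", ")

# Attach the indicator columns to the original frame.
toy = toy.join(genre_dummies)

print(toy)
```

If the real data mixes separators (for example `","` without a trailing space), the strings would need to be normalised first.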

Extract numerical variables

Many of the variables in the dataset are already numerical, which is useful for regression. We'll extract these variables, plot some of them, and compute the skewness of the continuous ones.

continuous_covariates = ['budget', 'popularity',
                         'runtime', 'vote_count', 'vote_average']
outcomes_and_continuous_covariates = continuous_covariates + \
    [regression_target, classification_target]
plotting_variables = ['budget', 'popularity', regression_target]

axes = pd.plotting.scatter_matrix(df[plotting_variables], alpha=0.15,
                                  color=(0, 0, 0), hist_kwds={"color": (0, 0, 0)}, facecolor=(1, 0, 0))
plt.show()

df[outcomes_and_continuous_covariates].skew()

budget          2.173814
popularity      9.491844
runtime         1.738205
vote_count      3.430295
vote_average   -0.852070
revenue         4.021570
profitable     -0.606385
dtype: float64

"Linear algorithms love normally distributed data", and several of the variables (budget, popularity, runtime, vote_count, revenue) are right-skewed. We'll reduce this skewness by applying np.log10 to make them more symmetric. Because some of the values are 0 and log10(0) = -inf, we first add 1 to every value, i.e. we compute log10(1 + x).

Strictly speaking, we are not removing skewness from the data; we are only applying a non-linear transformation to the variables to make their distributions more symmetric. Transforming skewed data to make it symmetric and then fitting a symmetric distribution (e.g., the normal distribution) to it is implicitly the same as fitting a skewed distribution to the raw data in the first place.

for covariate in ['budget', 'popularity', 'runtime', 'vote_count', 'revenue']:
    # log10(1 + x) keeps zero values finite (log10(1) = 0)
    df[covariate] = np.log10(1 + df[covariate])
    
df[outcomes_and_continuous_covariates].skew()

budget         -2.094710
popularity     -0.282719
runtime         0.553294
vote_count     -0.431371
vote_average   -0.852070
revenue        -1.467186
profitable     -0.606385
dtype: float64
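The effect of the transformation can be reproduced on synthetic data. The sketch below is an illustration under assumptions: the log-normal sample, its parameters, and the handful of zeros are invented and only mimic a right-skewed column that contains some zero entries. It compares skewness before and after `log10(1 + x)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values plus a few zeros (assumed, not from the dataset).
values = pd.Series(
    np.concatenate([rng.lognormal(mean=3.0, sigma=1.0, size=1000), np.zeros(10)])
)

skew_before = values.skew()

# Adding 1 keeps the zeros finite: log10(1 + 0) = 0.
transformed = np.log10(1 + values)
skew_after = transformed.skew()

print(skew_before, skew_after)
```

Zeros end up far below the bulk of the transformed values, which can pull the skewness negative; this may be part of why budget and revenue become left-skewed after the transformation above.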

Save this dataframe separately for modelling

df.to_csv("data/movies_clean.csv")
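Because rows were dropped earlier, the index now has gaps, and `to_csv` writes it as the first, unnamed column. A minimal round-trip sketch, using an in-memory buffer and a hypothetical two-row frame instead of the real file, shows why the file should later be read with `index_col=0`.

```python
import io

import pandas as pd

# Toy frame with a gapped index, invented for illustration only.
toy = pd.DataFrame({"budget": [1.5, 2.5]}, index=[0, 4])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading with index_col=0 restores the original index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# Reading without it leaves the old index behind as an extra column.
buffer.seek(0)
naive = pd.read_csv(buffer)

print(restored)
print(list(naive.columns))
```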

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	...	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count	movie_id	cast
0	237000000	Action, Adventure, Fantasy, Science Fiction	http://www.avatarmovie.com/	19995	culture clash, future, space war, space colony...	en	Avatar	In the 22nd century, a paraplegic Marine is di...	150.437577	Ingenious Film Partners, Twentieth Century Fox...	...	2787965087	162.0	English, Español	Released	Enter the World of Pandora.	Avatar	7.2	11800	19995	Sam Worthington, Zoe Saldana, Sigourney Weaver...
1	300000000	Adventure, Fantasy, Action	http://disney.go.com/disneypictures/pirates/	285	ocean, drug abuse, exotic island, east india t...	en	Pirates of the Caribbean: At World's End	Captain Barbossa, long believed to be dead, ha...	139.082615	Walt Disney Pictures, Jerry Bruckheimer Films,...	...	961000000	169.0	English	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500	285	Johnny Depp, Orlando Bloom, Keira Knightley, S...
2	245000000	Action, Adventure, Crime	http://www.sonypictures.com/movies/spectre/	206647	spy, based on novel, secret agent, sequel, mi6...	en	Spectre	A cryptic message from Bond’s past sends him o...	107.376788	Columbia Pictures, Danjaq, B24	...	880674609	148.0	Français, English, Español, Italiano, Deutsch	Released	A Plan No One Escapes	Spectre	6.3	4466	206647	Daniel Craig, Christoph Waltz, Léa Seydoux, Ra...
3	250000000	Action, Crime, Drama, Thriller	http://www.thedarkknightrises.com/	49026	dc comics, crime fighter, terrorist, secret id...	en	The Dark Knight Rises	Following the death of District Attorney Harve...	112.312950	Legendary Pictures, Warner Bros., DC Entertain...	...	1084939099	165.0	English	Released	The Legend Ends	The Dark Knight Rises	7.6	9106	49026	Christian Bale, Michael Caine, Gary Oldman, An...
4	260000000	Action, Adventure, Science Fiction	http://movies.disney.com/john-carter	49529	based on novel, mars, medallion, space travel,...	en	John Carter	John Carter is a war-weary, former military ca...	43.926995	Walt Disney Pictures	...	284139100	132.0	English	Released	Lost in our world, found in another.	John Carter	6.1	2124	49529	Taylor Kitsch, Lynn Collins, Samantha Morton, ...

	Action	Adventure	Fantasy	Science Fiction	Crime	Drama	Thriller
0	1	1	1	1	0	0	0
1	1	1	1	0	0	0	0
2	1	1	0	0	1	0	0
3	1	0	0	0	1	1	1
4	1	1	0	1	0	0	0
