Using Regression for Revenue Prediction of Movies
Using the TMDb dataset, we'll try to predict the revenue of a movie based on its characteristics, and to predict whether a movie's revenue will exceed its budget.
Throughout the case study/analysis we'll be using the following libraries:
Library | Purpose |
---|---|
sklearn | Modelling |
matplotlib, bokeh | Visualization |
numpy, pandas | Data Manipulation |
In this case study I am going to do two things. First, I want to predict the revenue of a movie based on its characteristics. Second, I want to predict whether a movie's revenue will exceed its budget.
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.style.use("ggplot")
%matplotlib inline
df = pd.read_csv("data/processed_data.csv", index_col=0)
df.info()
df.head()
Defining Regression and Classification Outcomes
For regression we'll be using revenue as the target for outcomes, and for classification we'll construct an indicator of profitability for each movie. Let's define a new column profitable such that:
$$
\text{profitable} =
\begin{cases}
1 & \text{if revenue} > \text{budget} \\
0 & \text{otherwise}
\end{cases}
$$
df['profitable'] = df.revenue > df.budget
df['profitable'] = df['profitable'].astype(int)
regression_target = 'revenue'
classification_target = 'profitable'
df['profitable'].value_counts()
2585 of the movies in the dataset were profitable.
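To make the indicator concrete, here is a minimal sketch on made-up numbers (the `toy` frame and its values are illustrative assumptions, not rows from the TMDb data). It also shows `value_counts(normalize=True)`, which reports class shares rather than raw counts and can be handy for judging how balanced the classification target is.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the TMDb file.
toy = pd.DataFrame({
    "budget":  [100, 50, 200, 10],
    "revenue": [300, 20, 200, 40],
})

# Strict inequality: a movie that exactly breaks even is labelled 0.
toy["profitable"] = (toy["revenue"] > toy["budget"]).astype(int)

# Share of each class instead of raw counts.
class_share = toy["profitable"].value_counts(normalize=True)
print(toy)
print(class_share)
```

Note that the third toy movie, whose revenue equals its budget, is labelled as not profitable because the comparison is strict.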
Handling missing and infinite values
Looking at the data, we can see that many of the columns are non-numeric, so filling in their missing values would be more trouble than it is worth. I'm going to stick with the plain and simple technique of dropping the rows that contain missing or infinite values.
- Replace any np.inf or -np.inf occurring in the dataset with np.nan
df = df.replace([np.inf, -np.inf], np.nan)
print(df.shape)
df.info()
Notice that the homepage column has the most null values (that is, the fewest non-null values) in the dataset. We can discard it as a feature, so that fewer rows are lost when rows with missing values are dropped.
- Drop the homepage column, then drop any row that still contains na
df.drop('homepage', axis=1, inplace=True)
df = df.dropna(how="any")
df.shape
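The effect of dropping homepage before calling dropna can be sketched on a small made-up frame (the `toy` data below is an assumption for illustration, not taken from the dataset). The same replace and dropna calls are used as above; only the order of operations is compared.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only: homepage is mostly missing.
toy = pd.DataFrame({
    "budget":   [1.0, np.inf, 3.0, 4.0],
    "homepage": [None, None, "https://example.com", None],
    "runtime":  [90.0, 120.0, np.nan, 100.0],
})

# Infinite values become NaN so that dropna treats them as missing.
toy = toy.replace([np.inf, -np.inf], np.nan)

# dropna(how="any") removes every ROW that has at least one missing value.
rows_kept_with_homepage = len(toy.dropna(how="any"))
rows_kept_without_homepage = len(toy.drop(columns="homepage").dropna(how="any"))

print(rows_kept_with_homepage, rows_kept_without_homepage)
```

In this toy case, keeping the sparse column would discard every row, while dropping it first keeps the two rows that are otherwise complete.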
Since the genres column consists of strings of comma-separated genres, e.g. "Action, Adventure, Fantasy" as the value for a particular movie, I'll convert each string to a list, then extract all unique genres, and finally add a column for each unique genre. The value of a specific genre column will be 1 if that genre is present in genres, otherwise 0.
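# Split each comma-separated genre string into a list of genre names
list_genres = df.genres.apply(lambda x: x.split(","))
# Collect the unique genre names, preserving first-seen order
genres = []
for row in list_genres:
    row = [genre.strip() for genre in row]
    for genre in row:
        if genre not in genres:
            genres.append(genre)
# Add one 0/1 indicator column per genre
for genre in genres:
    df[genre] = df['genres'].str.contains(genre).astype(int)
df[genres].head()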
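As an alternative to the loop above, pandas has a built-in `Series.str.get_dummies` that builds the same kind of 0/1 indicator columns in one call. The sketch below is not the method used in this notebook; it runs on a made-up `toy` frame and assumes genres are separated by commas with optional surrounding spaces. Because it matches whole genre names rather than substrings, it would also avoid false positives if one genre name happened to be contained in another.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "genres": ["Action, Adventure", "Drama", "Action, Drama"],
})

# Normalise "a, b" / "a ,b" to "a,b" so a single-character separator works,
# then let pandas build one indicator column per distinct genre.
normalised = toy["genres"].str.replace(r"\s*,\s*", ",", regex=True)
genre_indicators = normalised.str.get_dummies(sep=",")

toy = toy.join(genre_indicators)
print(toy)
```

The indicator columns come back in alphabetical order, which differs from the first-seen order produced by the loop.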
continuous_covariates = ['budget', 'popularity',
                         'runtime', 'vote_count', 'vote_average']
outcomes_and_continuous_covariates = continuous_covariates + \
    [regression_target, classification_target]
plotting_variables = ['budget', 'popularity', regression_target]
axes = pd.plotting.scatter_matrix(df[plotting_variables], alpha=0.15,
                                  color=(0, 0, 0), hist_kwds={"color": (0, 0, 0)}, facecolor=(1, 0, 0))
plt.show()
df[outcomes_and_continuous_covariates].skew()
"Linear algorithms love normally distributed data", and several of the variables (budget, popularity, runtime, vote_count, revenue) are right-skewed. So now we'll reduce the skewness of these variables using np.log10 to make their distributions more symmetric. But first we need to add 1 to every value, because some of the values are 0 and log10(0) = -inf.
for covariate in ['budget', 'popularity', 'runtime', 'vote_count', 'revenue']:
    df[covariate] = df[covariate].apply(lambda x: np.log10(1+x))
df[outcomes_and_continuous_covariates].skew()
df.to_csv("data/movies_clean.csv")
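To see why the log10(1 + x) transform helps, here is a small sketch on synthetic data (the lognormal sample, the seed, and the handful of zeros are assumptions made for illustration; they are not the movie data). It compares the skewness before and after the transform and checks that zeros no longer produce -inf.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample, for illustration only.
rng = np.random.default_rng(0)
toy = pd.Series(rng.lognormal(mean=10, sigma=2, size=5000))
toy.iloc[:5] = 0  # a few zero entries, as can happen with budget or revenue

skew_before = toy.skew()

# Adding 1 maps 0 to log10(1) = 0 instead of -inf.
log_toy = np.log10(1 + toy)
skew_after = log_toy.skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Skewness does not change when a variable is multiplied by a positive constant, so using `np.log1p` (natural log of 1 + x) in place of `np.log10(1 + x)` would give the same skewness; only the scale of the transformed values differs.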