The EdTech industry has surged immensely over the past decade; according to one forecast, the online education market was expected to be worth $286.62 billion by 2023, growing at a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has contributed greatly to this growth and expansion. With dominant features such as ease of information sharing, personalized learning experiences, and transparency of assessment, it is now often preferred to traditional education.
In the present scenario, due to Covid-19, the online education sector has witnessed rapid growth and is attracting many new customers. This rapid growth has drawn many new companies into the industry. With the availability and ease of use of digital marketing resources, companies can reach a wider audience with their offerings. The customers who show interest in these offerings are termed leads. There are various sources of obtaining leads for EdTech companies, such as:
The company then nurtures these leads and tries to convert them to paying customers. To do this, a representative from the organization connects with the lead by phone or email to share further details.
ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated regularly, one issue ExtraaLearn faces is identifying which leads are more likely to convert so that resources can be allocated accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:
The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.
Data Dictionary
last_activity: Last interaction between the lead and ExtraaLearn.
print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper.
import warnings
warnings.filterwarnings("ignore")
# Libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# For training and testing the data
from sklearn.model_selection import train_test_split
# Algorithms to use
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
# Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, f1_score, recall_score
from sklearn import metrics
# For hyperparameter tuning
from sklearn.model_selection import GridSearchCV
# Load the data - original data set
customers = pd.read_csv('ExtraaLearn.csv')
# create a copy of the data set to work with
df = customers.copy()
# check that the data loaded correctly by looking at the first five rows of the dataframe
df.head()
| | ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EXT001 | 57 | Unemployed | Website | High | 7 | 1639 | 1.861 | Website Activity | Yes | No | Yes | No | No | 1 |
| 1 | EXT002 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.320 | Website Activity | No | No | No | Yes | No | 0 |
| 2 | EXT003 | 52 | Professional | Website | Medium | 3 | 330 | 0.074 | Website Activity | No | No | Yes | No | No | 0 |
| 3 | EXT004 | 53 | Unemployed | Website | High | 4 | 464 | 2.057 | Website Activity | No | No | No | No | No | 1 |
| 4 | EXT005 | 23 | Student | Website | High | 4 | 600 | 16.914 | Email Activity | No | No | No | No | No | 0 |
# check the last five rows of the data
df.tail()
| | ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4607 | EXT4608 | 35 | Unemployed | Mobile App | Medium | 15 | 360 | 2.170 | Phone Activity | No | No | No | Yes | No | 0 |
| 4608 | EXT4609 | 55 | Professional | Mobile App | Medium | 8 | 2327 | 5.393 | Email Activity | No | No | No | No | No | 0 |
| 4609 | EXT4610 | 58 | Professional | Website | High | 2 | 212 | 2.692 | Email Activity | No | No | No | No | No | 1 |
| 4610 | EXT4611 | 57 | Professional | Mobile App | Medium | 1 | 154 | 3.879 | Website Activity | Yes | No | No | No | No | 0 |
| 4611 | EXT4612 | 55 | Professional | Website | Medium | 4 | 2290 | 2.075 | Phone Activity | No | No | No | No | No | 0 |
df.shape
(4612, 15)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID                     4612 non-null   object
 1   age                    4612 non-null   int64
 2   current_occupation     4612 non-null   object
 3   first_interaction      4612 non-null   object
 4   profile_completed      4612 non-null   object
 5   website_visits         4612 non-null   int64
 6   time_spent_on_website  4612 non-null   int64
 7   page_views_per_visit   4612 non-null   float64
 8   last_activity          4612 non-null   object
 9   print_media_type1      4612 non-null   object
 10  print_media_type2      4612 non-null   object
 11  digital_media          4612 non-null   object
 12  educational_channels   4612 non-null   object
 13  referral               4612 non-null   object
 14  status                 4612 non-null   int64
dtypes: float64(1), int64(4), object(10)
memory usage: 540.6+ KB
# find number of unique IDs
df.ID.nunique()
4612
# drop the ID column
df = df.drop(['ID'], axis = 1)
# look at the first five rows again
df.head()
| | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57 | Unemployed | Website | High | 7 | 1639 | 1.861 | Website Activity | Yes | No | Yes | No | No | 1 |
| 1 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.320 | Website Activity | No | No | No | Yes | No | 0 |
| 2 | 52 | Professional | Website | Medium | 3 | 330 | 0.074 | Website Activity | No | No | Yes | No | No | 0 |
| 3 | 53 | Unemployed | Website | High | 4 | 464 | 2.057 | Website Activity | No | No | No | No | No | 1 |
| 4 | 23 | Student | Website | High | 4 | 600 | 16.914 | Email Activity | No | No | No | No | No | 0 |
df.shape
(4612, 14)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   age                    4612 non-null   int64
 1   current_occupation     4612 non-null   object
 2   first_interaction      4612 non-null   object
 3   profile_completed      4612 non-null   object
 4   website_visits         4612 non-null   int64
 5   time_spent_on_website  4612 non-null   int64
 6   page_views_per_visit   4612 non-null   float64
 7   last_activity          4612 non-null   object
 8   print_media_type1      4612 non-null   object
 9   print_media_type2      4612 non-null   object
 10  digital_media          4612 non-null   object
 11  educational_channels   4612 non-null   object
 12  referral               4612 non-null   object
 13  status                 4612 non-null   int64
dtypes: float64(1), int64(4), object(9)
memory usage: 504.6+ KB
# get a summary of the data
df.describe(include = 'all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 4612.0 | NaN | NaN | NaN | 46.201214 | 13.161454 | 18.0 | 36.0 | 51.0 | 57.0 | 63.0 |
| current_occupation | 4612 | 3 | Professional | 2616 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| first_interaction | 4612 | 2 | Website | 2542 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| profile_completed | 4612 | 3 | High | 2264 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| website_visits | 4612.0 | NaN | NaN | NaN | 3.566782 | 2.829134 | 0.0 | 2.0 | 3.0 | 5.0 | 30.0 |
| time_spent_on_website | 4612.0 | NaN | NaN | NaN | 724.011275 | 743.828683 | 0.0 | 148.75 | 376.0 | 1336.75 | 2537.0 |
| page_views_per_visit | 4612.0 | NaN | NaN | NaN | 3.026126 | 1.968125 | 0.0 | 2.07775 | 2.792 | 3.75625 | 18.434 |
| last_activity | 4612 | 3 | Email Activity | 2278 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| print_media_type1 | 4612 | 2 | No | 4115 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| print_media_type2 | 4612 | 2 | No | 4379 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| digital_media | 4612 | 2 | No | 4085 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| educational_channels | 4612 | 2 | No | 3907 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| referral | 4612 | 2 | No | 4519 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| status | 4612.0 | NaN | NaN | NaN | 0.298569 | 0.45768 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
# Making a list of all categorical variables
cat_col = list(df.select_dtypes("object").columns)
# Printing the percentage of each unique value in each categorical column
for column in cat_col:
print(df[column].value_counts(normalize = True))
print("-" * 50)
Professional    0.567216
Unemployed      0.312446
Student         0.120338
Name: current_occupation, dtype: float64
--------------------------------------------------
Website       0.551171
Mobile App    0.448829
Name: first_interaction, dtype: float64
--------------------------------------------------
High      0.490893
Medium    0.485906
Low       0.023200
Name: profile_completed, dtype: float64
--------------------------------------------------
Email Activity      0.493929
Phone Activity      0.267563
Website Activity    0.238508
Name: last_activity, dtype: float64
--------------------------------------------------
No     0.892238
Yes    0.107762
Name: print_media_type1, dtype: float64
--------------------------------------------------
No     0.94948
Yes    0.05052
Name: print_media_type2, dtype: float64
--------------------------------------------------
No     0.885733
Yes    0.114267
Name: digital_media, dtype: float64
--------------------------------------------------
No     0.847138
Yes    0.152862
Name: educational_channels, dtype: float64
--------------------------------------------------
No     0.979835
Yes    0.020165
Name: referral, dtype: float64
--------------------------------------------------
# Making a list of all categorical variables
cat_col = list(df.select_dtypes("object").columns)
# Printing count of each unique value in each categorical column
for column in cat_col:
print(df[column].value_counts(normalize = False))
print("-" * 50)
Professional    2616
Unemployed      1441
Student          555
Name: current_occupation, dtype: int64
--------------------------------------------------
Website       2542
Mobile App    2070
Name: first_interaction, dtype: int64
--------------------------------------------------
High      2264
Medium    2241
Low        107
Name: profile_completed, dtype: int64
--------------------------------------------------
Email Activity      2278
Phone Activity      1234
Website Activity    1100
Name: last_activity, dtype: int64
--------------------------------------------------
No     4115
Yes     497
Name: print_media_type1, dtype: int64
--------------------------------------------------
No     4379
Yes     233
Name: print_media_type2, dtype: int64
--------------------------------------------------
No     4085
Yes     527
Name: digital_media, dtype: int64
--------------------------------------------------
No     3907
Yes     705
Name: educational_channels, dtype: int64
--------------------------------------------------
No     4519
Yes      93
Name: referral, dtype: int64
--------------------------------------------------
current_occupation - 3 possible responses
-- Professional is the top with 2616 (57%)
first_interaction - 2 possible responses
-- Website is the top with 2542 (55%)
profile_completed - 3 possible responses
-- High is the top with 2264 (49%)
last_activity - 3 possible responses
-- Email Activity is the top with 2278 (49%)
print_media_type1 - 2 possible responses
-- No is the top with 4115 (89%) (has not seen the newspaper ad)
print_media_type2 - 2 possible responses
-- No is the top with 4379 (95%) (has not seen the magazine ad)
digital_media - 2 possible responses
-- No is the top with 4085 (89%) (has not seen an ad on digital platforms)
educational_channels - 2 possible responses
-- No is the top with 3907 (85%) (has not heard of ExtraaLearn via online forums, education websites, or discussion threads)
referral - 2 possible responses
-- No is the top with 4519 (98%) (not referred)
#### Numeric Data

age - The mean age is 46.2 years, with a range from 18 to 63 years.

website_visits - The mean number of website visits is 3.57, with a range from 0 to 30 visits. 75% of the visitors visit up to 5 times. Outliers may be present.

time_spent_on_website - The mean time spent on the website is 724.01, with a range from 0 to 2537. 75% of the visitors spent up to 1336.75.

page_views_per_visit - The mean number of page views per visit is 3.03, with a range from 0 to 18.434. 75% of the viewers viewed up to 3.76 pages. Outliers may be present.

# create a bar chart to determine the number of visitors who converted to customers (status = 1)
plt.figure(figsize = (10, 6))
ax = sns.countplot(x = 'status', data = df)
# Place the exact count on the top of the bar for each category using annotate
for p in ax.patches:
ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()))
The majority of the visitors are professionals (56.7%). The website (55%) is the primary first interaction with ExtraaLearn. Only 2% of the visitors have a low profile completion. Email Activity (49%) is the most common last activity. Very few visitors have interacted with ExtraaLearn through advertisements or referrals: newspaper ads (10%), magazine ads (5%), digital media (11%), educational channels (15%), or referrals (2%).
Only 1377 (approximately 30%) of the 4612 visitors converted to paying customers.
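As a quick check, the class balance can also be read directly from the target column; a minimal sketch:
# confirm the conversion split directly from the target column
print(df['status'].value_counts())                    # absolute counts per class
print(df['status'].value_counts(normalize = True))    # proportions per class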
#create histograms and box plots to visualize the distributions and identify outliers
for col in ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']:
print(col)
print('The skew is :',round(df[col].skew(), 2))
plt.figure(figsize = (20, 4))
# histogram
plt.subplot(1, 2, 1)
df[col].hist(bins = 10, grid = False)
plt.ylabel('count')
#box plot
plt.subplot(1, 2, 2)
sns.boxplot(df[col])
plt.show()
age The skew is : -0.72
website_visits The skew is : 2.16
time_spent_on_website The skew is : 0.95
page_views_per_visit The skew is : 1.27
age
There is a negative skew (-0.72), with most visitors around 55 years of age.
website_visits
There is a positive skew (2.16), with most visitors making 0 to 5 visits and frequency decreasing from there. The box plot shows outliers.
time_spent_on_website
There is a positive skew (0.95), with most visitors spending between 0 and 250 on the site.
page_views_per_visit
There is a positive skew (1.27), with most visits having between 2.5 and 5 page views. The box plot shows outliers.
find_outliers_IQR takes a DataFrame (or Series) as input and returns the values that lie more than 1.5 * IQR below the first quartile or above the third quartile; non-outlier values are dropped (or left as NaN when a full DataFrame is passed).
# defining the function to identify outliers
def find_outliers_IQR(df):
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
outliers = df[((df<(Q1 - 1.5*IQR)) | (df>(Q3 + 1.5*IQR)))]
return outliers
#identifying the outliers for the website_visits
outliers = find_outliers_IQR(df['website_visits'])
print('website_visits number of outliers: ' + str(outliers.count()))
print('website_visits min outlier value: ' + str(outliers.min()))
print('website_visits max outlier value: ' + str(outliers.max()))
outliers
website_visits number of outliers: 154
website_visits min outlier value: 10
website_visits max outlier value: 30
6 13
31 13
32 12
66 25
201 14
..
4566 13
4571 12
4583 24
4589 16
4607 15
Name: website_visits, Length: 154, dtype: int64
#identifying the outliers for the page_views_per_visit
outliers = find_outliers_IQR(df['page_views_per_visit'])
print('page_views_per_visit number of outliers: ' + str(outliers.count()))
print('page_views_per_visit min outlier value: ' + str(outliers.min()))
print('page_views_per_visit max outlier value: ' + str(outliers.max()))
outliers
page_views_per_visit number of outliers: 257
page_views_per_visit min outlier value: 6.313
page_views_per_visit max outlier value: 18.434
4 16.914
32 18.434
47 7.050
110 7.364
121 6.887
...
4470 6.810
4507 6.822
4514 7.997
4572 7.397
4597 8.246
Name: page_views_per_visit, Length: 257, dtype: float64
# create a pair plot to see if there are any relationships
# distinguish converted and non-converted visitors by adding status as the hue parameter
sns.pairplot(df, hue ='status')
<seaborn.axisgrid.PairGrid at 0x7fe0cfa5de20>
A pair plot visualizes the pairwise relationships between the variables, which can be continuous or categorical. The color represents whether the visitor converted to a customer; orange shows those who are now customers.
age

#visualize what the age means for creating a paying customer
plt.figure(figsize = (15, 10))
sns.boxplot(df["status"], df["age"])
plt.show()
current_occupation

Analyze the data based on the current_occupation.

# visualize the current occupations of the visitors based on conversion
plt.figure(figsize = (15, 10))
sns.countplot(x = 'current_occupation', hue = 'status', data = df)
plt.show()
#look at the age of the visitor by current occupation
df.groupby(["current_occupation"])["age"].describe()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| current_occupation | | | | | | | | |
| Professional | 2616.0 | 49.347477 | 9.890744 | 25.0 | 42.0 | 54.0 | 57.0 | 60.0 |
| Student | 555.0 | 21.144144 | 2.001114 | 18.0 | 19.0 | 21.0 | 23.0 | 25.0 |
| Unemployed | 1441.0 | 50.140180 | 9.999503 | 32.0 | 42.0 | 54.0 | 58.0 | 63.0 |
# box plots of the age of the visitor grouped by the current occupation
plt.figure(figsize = (10, 5))
sns.boxplot(df["current_occupation"], df["age"])
plt.show()
first_interaction

# visualize the first_interaction of the visitors based on conversion
plt.figure(figsize = (15, 10))
sns.countplot(x = 'first_interaction', hue = 'status', data = df)
plt.show()
time_spent_on_website

# visualize the time_spent_on_website of the visitors based on conversion
plt.figure(figsize = (10, 5))
sns.boxplot(df["status"], df["time_spent_on_website"])
plt.show()
profile_completed

#visualize what the completion of the profile means for creating a paying customer
plt.figure(figsize = (15, 10))
sns.countplot(x = 'profile_completed', hue = 'status', data = df)
plt.show()
referral

#visualize what a referral means for creating a paying customer
plt.figure(figsize = (15, 10))
sns.countplot(x = 'referral', hue = 'status', data = df)
plt.show()
The plots above cover age, current_occupation, first_interaction, time_spent_on_website, profile_completed, and referral; in these plots, blue represents those who did not convert to paying customers.

# Correlation matrix (no grouping)
plt.figure(figsize=(15,10))
sns.heatmap(df.corr().round(2),annot=True)
plt.title('Correlation matrix of data',fontsize = 30)
plt.show()
The correlation heatmap highlights the relationships between time_spent_on_website and status and between website_visits and status. To determine which variables lead a visitor to convert to a paying customer, we next build classification models.
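The same information can be read numerically by ranking the numeric features by their correlation with status; a minimal sketch (relying, as in the heatmap cell above, on df.corr() ignoring non-numeric columns):
# rank numeric features by correlation with the target
print(df.corr()['status'].drop('status').sort_values(ascending = False))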
# Separating the target variable and other variables
# make a copy called X which is a dataframe with "status" column removed
X = df.drop(columns = 'status')
# Y is a series containing the "status" (column)
Y = df['status']
# Creating dummy variables, drop_first=True is used to avoid redundant variables
#pd.get_dummies => working on X dataframe converts all categorical variables into binary 1(yes) / 0(no).
X = pd.get_dummies(X, drop_first = True)
# Splitting the data into train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)
print("Shape of the training set: ", X_train.shape)
print("Shape of the test set: ", X_test.shape)
print("Percentage of classes in the training set:")
print(y_train.value_counts(normalize = True))
print("Percentage of classes in the test set:")
print(y_test.value_counts(normalize = True))
Shape of the training set:  (3228, 16)
Shape of the test set:  (1384, 16)
Percentage of classes in the training set:
0    0.704151
1    0.295849
Name: status, dtype: float64
Percentage of classes in the test set:
0    0.695087
1    0.304913
Name: status, dtype: float64
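The jump from 13 predictor columns to 16 features comes from the dummy encoding; a quick way to verify which columns were created:
# list the feature columns produced by pd.get_dummies
print(list(X.columns))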
Before training the model, let's choose the appropriate model evaluation criterion as per the problem at hand.
The model can make wrong predictions in two ways:
Which case is more important?
False Negatives: predicting that a visitor will not convert to a paying customer when, in reality, they would have converted.
If we predict that a visitor will not convert and they would have, the company loses potential revenue (False Negative).
If we predict that a visitor will convert and they do not, the company wastes time and effort that could have been spent on other potential customers (False Positive).
The number of False Negatives should be minimized.
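For reference, recall measures exactly this; a minimal sketch with illustrative numbers (not taken from this dataset):
# recall = TP / (TP + FN): the share of actual converters the model catches
tp, fn = 80, 20                # illustrative counts only
recall = tp / (tp + fn)        # 0.8, i.e. 20% of converting visitors would be missed
print(recall)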
How to reduce the losses?
Maximize Recall, which will reduce the number of false negatives and thereby avoid missing the visitors who would become paying customers.

Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
# Function to print the classification report and get confusion matrix in a proper format
def metrics_score(actual, predicted):
print(classification_report(actual, predicted))
cm = confusion_matrix(actual, predicted)
plt.figure(figsize = (8, 5))
sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['Not Converted', 'Converted'], yticklabels = ['Not Converted', 'Converted'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
# Fitting the decision tree classifier on the training data
d_tree = DecisionTreeClassifier(random_state = 7)
d_tree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=7)
# Checking performance on the training data
y_pred_train1 = d_tree.predict(X_train)
metrics_score(y_train, y_pred_train1)
precision recall f1-score support
0 1.00 1.00 1.00 2273
1 1.00 1.00 1.00 955
accuracy 1.00 3228
macro avg 1.00 1.00 1.00 3228
weighted avg 1.00 1.00 1.00 3228
Reading the confusion matrix:
There is no error on the training set, i.e., every sample has been classified correctly.
Because the model fits the training data perfectly, it is likely overfitted.
# Checking performance on the testing data
y_pred_test1 = d_tree.predict(X_test)
metrics_score(y_test, y_pred_test1)
precision recall f1-score support
0 0.87 0.86 0.87 962
1 0.69 0.70 0.70 422
accuracy 0.81 1384
macro avg 0.78 0.78 0.78 1384
weighted avg 0.81 0.81 0.81 1384
The model did not fit as well on the test data. Therefore the model is overfitting the training data.
To reduce overfitting, let's tune the model's hyperparameters using GridSearchCV to find the optimal max_depth; we can tune some other hyperparameters as well.
We will set the class_weight hyperparameter to {0: 0.3, 1: 0.7}, which is approximately the inverse of the class imbalance in the original data (about 70% not converted, 30% converted).
This would tell the model that 1 is the important class here.
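As a rough check, these weights are close to what class_weight = 'balanced' (used later for the random forest) would compute, i.e. n_samples / (n_classes * class_count); a minimal sketch assuming y_train from the split above:
# weights that class_weight = 'balanced' would assign on the training labels
counts = y_train.value_counts()
balanced_weights = {cls: len(y_train) / (2 * cnt) for cls, cnt in counts.items()}
print(balanced_weights)   # roughly {0: 0.71, 1: 1.69}, i.e. class 1 weighted ~2.4x class 0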
# Choose the type of classifier
d_tree_tuned = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.3, 1: 0.7})
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 10), #depth [2, 3, 4, 5, 6, 7, 8, 9]
'criterion': ['gini', 'entropy'], #use both gini and entropy to measure split quality
'min_samples_leaf': [5, 10, 20, 25] #minimum number of samples to be a leaf node
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5) #=> chooses the best hyperparameters to use
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data
d_tree_tuned.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, criterion='entropy',
max_depth=3, min_samples_leaf=5, random_state=7)
We have tuned the model and fit the tuned model on the training data. Now, let's check the model performance on the training and testing data.
# Checking performance on the training data
y_pred_train2 = d_tree_tuned.predict(X_train)
metrics_score(y_train, y_pred_train2)
precision recall f1-score support
0 0.94 0.77 0.85 2273
1 0.62 0.88 0.73 955
accuracy 0.80 3228
macro avg 0.78 0.83 0.79 3228
weighted avg 0.84 0.80 0.81 3228
Let's check the model performance on the testing data
# Checking performance on the testing data
y_pred_test2 = d_tree_tuned.predict(X_test)
metrics_score(y_test, y_pred_test2)
precision recall f1-score support
0 0.93 0.77 0.84 962
1 0.62 0.86 0.72 422
accuracy 0.80 1384
macro avg 0.77 0.82 0.78 1384
weighted avg 0.83 0.80 0.80 1384
Let's visualize the tuned decision tree and observe the decision rules:
#Visualize the tree
features = list(X.columns)
plt.figure(figsize = (20, 20))
tree.plot_tree(d_tree_tuned, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = True)
plt.show()
Blue represents those who converted to paying customers (class = y[1]).
Orange represents those who did not convert to paying customers (class = y[0]).
The first split is on first_interaction, which implies it is one of the most important features (as observed in the EDA). Visitors whose first interaction was through the website, rather than the mobile app, had a much higher conversion rate.
The second split is on time_spent_on_website (highlighted in the correlation heatmap).
Visitors who spend more time on the website have a higher chance of converting to paying customers.
The third split involves age, which appears under both prior branches, so it also seems important. Visitors aged 25 or older had a higher chance of converting to paying customers.
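The same decision rules can also be printed as plain text with scikit-learn's export_text; a minimal sketch, assuming the fitted d_tree_tuned and the features list from the plotting cell above (requires scikit-learn >= 0.21):
# print the tuned tree's decision rules as text
from sklearn.tree import export_text
print(export_text(d_tree_tuned, feature_names = features))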
Let's look at the feature importance of the tuned decision tree model
# Importance of features in the tree building
print (pd.DataFrame(d_tree_tuned.feature_importances_, columns = ['Importance'], index = X_train.columns).sort_values(by = 'Importance', ascending = False))
                                Importance
time_spent_on_website             0.348142
first_interaction_Website         0.327181
profile_completed_Medium          0.239274
age                               0.063893
last_activity_Website Activity    0.021511
website_visits                    0.000000
page_views_per_visit              0.000000
current_occupation_Student        0.000000
current_occupation_Unemployed     0.000000
profile_completed_Low             0.000000
last_activity_Phone Activity      0.000000
print_media_type1_Yes             0.000000
print_media_type2_Yes             0.000000
digital_media_Yes                 0.000000
educational_channels_Yes          0.000000
referral_Yes                      0.000000
# Plotting the feature importance
importances = d_tree_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize = (10, 10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'blue', align = 'center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
The most important features identified by the tuned decision tree are:
time_spent_on_website
first_interaction (Website)
profile_completed (Medium)
age
last_activity (Website Activity)
# Fitting the random forest tree classifier on the training data
rf_estimator = RandomForestClassifier(random_state=7,criterion="entropy")
rf_estimator.fit(X_train,y_train)
RandomForestClassifier(criterion='entropy', random_state=7)
Let's check the performance of the model on the training data
# Checking performance on the training data
y_pred_train3 = rf_estimator.predict(X_train)
metrics_score(y_train, y_pred_train3)
precision recall f1-score support
0 1.00 1.00 1.00 2273
1 1.00 1.00 1.00 955
accuracy 1.00 3228
macro avg 1.00 1.00 1.00 3228
weighted avg 1.00 1.00 1.00 3228
Let's check the performance on the testing data
# Checking performance on the testing data
y_pred_test3 = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test3)
precision recall f1-score support
0 0.88 0.93 0.90 962
1 0.81 0.70 0.75 422
accuracy 0.86 1384
macro avg 0.84 0.81 0.83 1384
weighted avg 0.86 0.86 0.86 1384
Let's see if we can get a better model by tuning the random forest classifier
Let's try tuning some of the important hyperparameters of the Random Forest Classifier.
We will not tune the criterion hyperparameter as we know from hyperparameter tuning for decision trees that entropy is a better splitting criterion for this data.
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [100, 110, 120],
"max_depth": [5, 6, 7],
"max_features": [0.8, 0.9, 1]
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned_base = grid_obj.best_estimator_
# Note: grid_obj.best_estimator_ is already refit on the full training data by GridSearchCV (refit=True by default),
# so rf_estimator_tuned_base is used for evaluation below; the line below only refits the un-tuned base estimator
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(criterion='entropy', random_state=7)
# Checking performance on the training data
y_pred_train4 = rf_estimator_tuned_base.predict(X_train)
metrics_score(y_train, y_pred_train4)
precision recall f1-score support
0 0.91 0.92 0.91 2273
1 0.80 0.78 0.79 955
accuracy 0.88 3228
macro avg 0.86 0.85 0.85 3228
weighted avg 0.88 0.88 0.88 3228
# Checking performance on the test data
y_pred_test4 = rf_estimator_tuned_base.predict(X_test)
metrics_score(y_test, y_pred_test4)
precision recall f1-score support
0 0.88 0.92 0.90 962
1 0.79 0.73 0.76 422
accuracy 0.86 1384
macro avg 0.84 0.82 0.83 1384
weighted avg 0.86 0.86 0.86 1384
While less so than before, the model is still slightly overfitted.
Precision continues to be higher than recall for the converted class, something we will need to improve upon to surpass our tuned decision tree.
The overall accuracy is slightly higher than that of our tuned decision tree, suggesting that a random forest is a good choice.
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [110, 120],
"max_depth": [6, 7],
"min_samples_leaf": [20, 25],
"max_features": [0.8, 0.9],
"max_samples": [0.9, 1],
"class_weight": ["balanced",{0: 0.3, 1: 0.7}]
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search on the training data using scorer=scorer and cv=5
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Save the best estimator to variable rf_estimator_tuned
rf_estimator_tuned = grid_obj.best_estimator_
#Fit the best estimator to the training data
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', criterion='entropy',
max_depth=6, max_features=0.8, max_samples=0.9,
min_samples_leaf=25, n_estimators=120, random_state=7)
Let's check the performance of the tuned model
# Checking performance on the training data
y_pred_train5 = rf_estimator_tuned.predict(X_train)
metrics_score(y_train, y_pred_train5)
precision recall f1-score support
0 0.94 0.83 0.88 2273
1 0.68 0.87 0.76 955
accuracy 0.84 3228
macro avg 0.81 0.85 0.82 3228
weighted avg 0.86 0.84 0.84 3228
Let's check the model performance on the test data
# Checking performance on the test data
y_pred_test5 = rf_estimator_tuned.predict(X_test)
metrics_score(y_test, y_pred_test5)
precision recall f1-score support
0 0.93 0.83 0.87 962
1 0.68 0.85 0.76 422
accuracy 0.83 1384
macro avg 0.81 0.84 0.82 1384
weighted avg 0.85 0.83 0.84 1384
importances = rf_estimator_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize = (12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'blue', align = 'center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Goal
The goal was to maximize recall: the higher the recall score, the greater the chance of minimizing false negatives.
False Negatives: predicting a visitor will not convert to a paying customer when, in reality, they would have converted, so the business loses revenue.
Process
Two models performed very well:
Tuned Decision Tree Model - recall 86%, f1-score 72%, and macro-average recall 82% on the test data
Hyper Tuned Random Forest Model - recall 85%, f1-score 76%, and macro-average recall 84% on the test data
Model Recommendation
The Hyper Tuned Random Forest Model is recommended for future use. Its recall on the test data (85%) is comparable to the tuned decision tree, while its f1-score for the converted class is about 4 points higher (76% vs. 72%) and its macro-average recall (84%) and overall accuracy are also higher.
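If the model is to be reused to score future leads, it can be persisted to disk; a minimal sketch using joblib (the file name is illustrative):
# save the recommended model for later scoring
import joblib
joblib.dump(rf_estimator_tuned, 'extraalearn_rf_model.joblib')
# later: rf_model = joblib.load('extraalearn_rf_model.joblib')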
4 Most Important Features identified by both models
time_spent_on_website
first_interaction (Website)
profile_completed (Medium)
age
Business Recommendations
ExtraaLearn's Representatives should prioritize visitors based on:
time_spent_on_website
first_interaction (on the website)
profile_completed
age - if a visitor is within the identified age range of 45 - 55 years, they are more likely to convert to a paying customer.

By prioritizing visitors who meet these criteria, in this order, ExtraaLearn's representatives will be able to decide where to focus their time when engaging new visitors.
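In practice, representatives could be given a ranked call list by scoring leads with the tuned model's predicted conversion probability; a minimal sketch that uses X_test as a stand-in for new, identically preprocessed leads:
# rank leads by predicted probability of converting (status = 1)
lead_scores = rf_estimator_tuned.predict_proba(X_test)[:, 1]
priority = pd.Series(lead_scores, index = X_test.index).sort_values(ascending = False)
print(priority.head(10))   # the ten leads most likely to convert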