Carla A. Rudder, PMP


Data-driven professional with a strong mathematics, statistics, data analysis, and education background. I use data to find patterns and insights, drive innovation, and create meaningful change.

Technical Skills: Excel, SQL, Python, R, PowerBI, Tableau, and Certified Scrum Master

Applied Data Science & ML - MIT

BSc Applied Mathematics

MSc Mathematics Ed (Financial Math)

Ph.D. Mathematics Ed (Problem-Solving)

View My LinkedIn Profile

View My GitHub Profile

ExtraaLearn Project

Context

The EdTech industry has surged over the past decade. According to one forecast, the online education market would be worth $286.62bn by 2023, with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. Features such as ease of information sharing, personalized learning experiences, and transparent assessment have driven this growth and now make online education preferable to traditional education.

Due to Covid-19, the online education sector has witnessed rapid growth and is attracting many new customers. As a result, many new companies have emerged in the industry. With the availability and ease of use of digital marketing resources, companies can reach a wider audience with their offerings. Customers who show interest in these offerings are termed leads. There are various sources of obtaining leads for EdTech companies, like

The company then nurtures these leads and tries to convert them to paid customers. To do this, a representative from the organization contacts the lead by phone or email to share further details.

Objective

ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill or reskill. With a large number of leads generated on a regular basis, one of the challenges ExtraaLearn faces is identifying which leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:

Data Description

The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

import warnings
warnings.filterwarnings("ignore")

# Libraries for data manipulation and visualization
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# For training and testing the data
from sklearn.model_selection import train_test_split

# Algorithms to use
from sklearn.tree import DecisionTreeClassifier

from sklearn import tree

from sklearn.ensemble import RandomForestClassifier

# Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, f1_score, recall_score

from sklearn import metrics

# For hyperparameter tuning
from sklearn.model_selection import GridSearchCV

Load the data

# Load the data - original data set
customers = pd.read_csv('ExtraaLearn.csv')
# create a copy of the data set to work with
df = customers.copy()

Data Overview

#check that the data is loaded and look at the dataframe
df.head()
ID age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity print_media_type1 print_media_type2 digital_media educational_channels referral status
0 EXT001 57 Unemployed Website High 7 1639 1.861 Website Activity Yes No Yes No No 1
1 EXT002 56 Professional Mobile App Medium 2 83 0.320 Website Activity No No No Yes No 0
2 EXT003 52 Professional Website Medium 3 330 0.074 Website Activity No No Yes No No 0
3 EXT004 53 Unemployed Website High 4 464 2.057 Website Activity No No No No No 1
4 EXT005 23 Student Website High 4 600 16.914 Email Activity No No No No No 0
# check the data last five rows
df.tail()
ID age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity print_media_type1 print_media_type2 digital_media educational_channels referral status
4607 EXT4608 35 Unemployed Mobile App Medium 15 360 2.170 Phone Activity No No No Yes No 0
4608 EXT4609 55 Professional Mobile App Medium 8 2327 5.393 Email Activity No No No No No 0
4609 EXT4610 58 Professional Website High 2 212 2.692 Email Activity No No No No No 1
4610 EXT4611 57 Professional Mobile App Medium 1 154 3.879 Website Activity Yes No No No No 0
4611 EXT4612 55 Professional Website Medium 4 2290 2.075 Phone Activity No No No No No 0
df.shape
(4612, 15)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     4612 non-null   object 
 1   age                    4612 non-null   int64  
 2   current_occupation     4612 non-null   object 
 3   first_interaction      4612 non-null   object 
 4   profile_completed      4612 non-null   object 
 5   website_visits         4612 non-null   int64  
 6   time_spent_on_website  4612 non-null   int64  
 7   page_views_per_visit   4612 non-null   float64
 8   last_activity          4612 non-null   object 
 9   print_media_type1      4612 non-null   object 
 10  print_media_type2      4612 non-null   object 
 11  digital_media          4612 non-null   object 
 12  educational_channels   4612 non-null   object 
 13  referral               4612 non-null   object 
 14  status                 4612 non-null   int64  
dtypes: float64(1), int64(4), object(10)
memory usage: 540.6+ KB
# find number of unique IDs
df.ID.nunique()
4612

There are 4612 unique IDs, one per row. Because the ID only identifies the row, the column adds no predictive value and can be dropped.

# drop the ID column
df = df.drop(['ID'], axis = 1)
# look at the first five rows again
df.head()
age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity print_media_type1 print_media_type2 digital_media educational_channels referral status
0 57 Unemployed Website High 7 1639 1.861 Website Activity Yes No Yes No No 1
1 56 Professional Mobile App Medium 2 83 0.320 Website Activity No No No Yes No 0
2 52 Professional Website Medium 3 330 0.074 Website Activity No No Yes No No 0
3 53 Unemployed Website High 4 464 2.057 Website Activity No No No No No 1
4 23 Student Website High 4 600 16.914 Email Activity No No No No No 0
df.shape
(4612, 14)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   age                    4612 non-null   int64  
 1   current_occupation     4612 non-null   object 
 2   first_interaction      4612 non-null   object 
 3   profile_completed      4612 non-null   object 
 4   website_visits         4612 non-null   int64  
 5   time_spent_on_website  4612 non-null   int64  
 6   page_views_per_visit   4612 non-null   float64
 7   last_activity          4612 non-null   object 
 8   print_media_type1      4612 non-null   object 
 9   print_media_type2      4612 non-null   object 
 10  digital_media          4612 non-null   object 
 11  educational_channels   4612 non-null   object 
 12  referral               4612 non-null   object 
 13  status                 4612 non-null   int64  
dtypes: float64(1), int64(4), object(9)
memory usage: 504.6+ KB

Observations

Exploratory Data Analysis (EDA)

Questions

  1. Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.
  2. The company’s first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status?
  3. The company uses multiple modes to interact with prospects. Which way of interaction works best?
  4. The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels has the highest lead conversion rate?
  5. People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?

Univariate Data Analysis

Summary of all data columns

# get a summary of the data
df.describe(include = 'all').T
count unique top freq mean std min 25% 50% 75% max
age 4612.0 NaN NaN NaN 46.201214 13.161454 18.0 36.0 51.0 57.0 63.0
current_occupation 4612 3 Professional 2616 NaN NaN NaN NaN NaN NaN NaN
first_interaction 4612 2 Website 2542 NaN NaN NaN NaN NaN NaN NaN
profile_completed 4612 3 High 2264 NaN NaN NaN NaN NaN NaN NaN
website_visits 4612.0 NaN NaN NaN 3.566782 2.829134 0.0 2.0 3.0 5.0 30.0
time_spent_on_website 4612.0 NaN NaN NaN 724.011275 743.828683 0.0 148.75 376.0 1336.75 2537.0
page_views_per_visit 4612.0 NaN NaN NaN 3.026126 1.968125 0.0 2.07775 2.792 3.75625 18.434
last_activity 4612 3 Email Activity 2278 NaN NaN NaN NaN NaN NaN NaN
print_media_type1 4612 2 No 4115 NaN NaN NaN NaN NaN NaN NaN
print_media_type2 4612 2 No 4379 NaN NaN NaN NaN NaN NaN NaN
digital_media 4612 2 No 4085 NaN NaN NaN NaN NaN NaN NaN
educational_channels 4612 2 No 3907 NaN NaN NaN NaN NaN NaN NaN
referral 4612 2 No 4519 NaN NaN NaN NaN NaN NaN NaN
status 4612.0 NaN NaN NaN 0.298569 0.45768 0.0 0.0 0.0 1.0 1.0

Identify the percentage and count of each group within the categorical variables

Percentage

# Making a list of all categorical variables
cat_col = list(df.select_dtypes("object").columns)

# Printing percentage of each unique value in each categorical column
for column in cat_col:
    print(df[column].value_counts(normalize = True))
    print("-" * 50)
Professional    0.567216
Unemployed      0.312446
Student         0.120338
Name: current_occupation, dtype: float64
--------------------------------------------------
Website       0.551171
Mobile App    0.448829
Name: first_interaction, dtype: float64
--------------------------------------------------
High      0.490893
Medium    0.485906
Low       0.023200
Name: profile_completed, dtype: float64
--------------------------------------------------
Email Activity      0.493929
Phone Activity      0.267563
Website Activity    0.238508
Name: last_activity, dtype: float64
--------------------------------------------------
No     0.892238
Yes    0.107762
Name: print_media_type1, dtype: float64
--------------------------------------------------
No     0.94948
Yes    0.05052
Name: print_media_type2, dtype: float64
--------------------------------------------------
No     0.885733
Yes    0.114267
Name: digital_media, dtype: float64
--------------------------------------------------
No     0.847138
Yes    0.152862
Name: educational_channels, dtype: float64
--------------------------------------------------
No     0.979835
Yes    0.020165
Name: referral, dtype: float64
--------------------------------------------------

Count of each group within the categorical variables

# Making a list of all categorical variables
cat_col = list(df.select_dtypes("object").columns)

# Printing count of each unique value in each categorical column
for column in cat_col:
    print(df[column].value_counts(normalize = False))
    print("-" * 50)
Professional    2616
Unemployed      1441
Student          555
Name: current_occupation, dtype: int64
--------------------------------------------------
Website       2542
Mobile App    2070
Name: first_interaction, dtype: int64
--------------------------------------------------
High      2264
Medium    2241
Low        107
Name: profile_completed, dtype: int64
--------------------------------------------------
Email Activity      2278
Phone Activity      1234
Website Activity    1100
Name: last_activity, dtype: int64
--------------------------------------------------
No     4115
Yes     497
Name: print_media_type1, dtype: int64
--------------------------------------------------
No     4379
Yes     233
Name: print_media_type2, dtype: int64
--------------------------------------------------
No     4085
Yes     527
Name: digital_media, dtype: int64
--------------------------------------------------
No     3907
Yes     705
Name: educational_channels, dtype: int64
--------------------------------------------------
No     4519
Yes      93
Name: referral, dtype: int64
--------------------------------------------------

Observations

Categorical Data

Status

How many of the visitors are converted to customers

#create a bar chart to determine the number of visitors which are converted to customers (1)
plt.figure(figsize = (10, 6))

ax = sns.countplot(x = 'status', data = df)

# Place the exact count on the top of the bar for each category using annotate
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()))

Observation Summary

The majority of the visitors are professionals (56.7%). The website (55%) is the primary channel of first interaction with ExtraaLearn. Only 2% of the visitors have a low level of profile completion. Email (49%) is the most common last activity. Few visitors have seen the newspaper ads (11%), magazine ads (5%), digital media ads (11%), or educational channels (15%), or have been referred (2%).

Only 1377 (approx. 30%) of the 4612 visitors converted to customers.
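The count of converted visitors can be cross-checked against the summary table, since `status` is coded 0/1 and its mean is therefore the conversion rate. A small sketch using only the figures reported above (the variable names are illustrative):

```python
# Figures taken from the describe() output above
n_rows = 4612
status_mean = 0.298569   # mean of the 0/1 status column

# The mean of a 0/1 column times the row count gives the number of ones
converted = round(status_mean * n_rows)
not_converted = n_rows - converted

print(converted, not_converted, round(converted / n_rows * 100, 1))
```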

Data Preprocessing

Distributions and Outliers

Create histograms to identify the distribution of the data

Create box plots to determine if there are any outliers in the data.

# Create histograms and box plots to visualize the distribution and outliers of each numeric column

for col in ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']:
    print(col)

    print('The skew is :', round(df[col].skew(), 2))

    plt.figure(figsize = (20, 4))

    # histogram
    plt.subplot(1, 2, 1)
    df[col].hist(bins = 10, grid = False)
    plt.ylabel('count')

    # box plot (the column is passed by keyword so it is always read as x)
    plt.subplot(1, 2, 2)
    sns.boxplot(x = df[col])

    plt.show()

age
The skew is : -0.72

website_visits
The skew is : 2.16

time_spent_on_website
The skew is : 0.95

page_views_per_visit
The skew is : 1.27

Observations

age: There is a negative skew (-0.72), with most visitors approximately 55 years of age.
website_visits: There is a positive skew (2.16), with the highest frequency at 0 to 5 visits and decreasing from there. The box plot shows outliers.
time_spent_on_website: There is a positive skew (0.95), with the highest frequency of visitors spending between 0 and 250 on the site.
page_views_per_visit: There is a positive skew (1.27), with the highest frequency of page views between 2.5 and 5. The box plot shows outliers.

Identifying outliers

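# helper to identify outliers in a single column using the 1.5 * IQR rule
def find_outliers_IQR(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    # keep only the values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
    outliers = series[(series < (Q1 - 1.5 * IQR)) | (series > (Q3 + 1.5 * IQR))]
    return outliers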
#identifying the outliers for the website_visits

outliers = find_outliers_IQR(df['website_visits'])

print('website_visits number of outliers: ' + str(outliers.count()))

print('website_visits min outlier value: ' + str(outliers.min()))

print('website_visits max outlier value: ' + str(outliers.max()))

outliers
website_visits number of outliers: 154
website_visits min outlier value: 10
website_visits max outlier value: 30





6       13
31      13
32      12
66      25
201     14
        ..
4566    13
4571    12
4583    24
4589    16
4607    15
Name: website_visits, Length: 154, dtype: int64
#identifying the outliers for the page_views_per_visit

outliers = find_outliers_IQR(df['page_views_per_visit'])

print('page_views_per_visit number of outliers: ' + str(outliers.count()))

print('page_views_per_visit min outlier value: ' + str(outliers.min()))

print('page_views_per_visit max outlier value: ' + str(outliers.max()))

outliers
page_views_per_visit number of outliers: 257
page_views_per_visit min outlier value: 6.313
page_views_per_visit max outlier value: 18.434





4       16.914
32      18.434
47       7.050
110      7.364
121      6.887
         ...  
4470     6.810
4507     6.822
4514     7.997
4572     7.397
4597     8.246
Name: page_views_per_visit, Length: 257, dtype: float64
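The reported minimum outlier values can be checked by hand from the quartiles in the summary table. The sketch below recomputes the 1.5 * IQR fences from those quartiles only; the helper name `iqr_fences` is hypothetical and not part of the original notebook.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# 25% and 75% quartiles copied from the describe() output above
quartiles = {
    "website_visits": (2.0, 5.0),
    "page_views_per_visit": (2.07775, 3.75625),
}

fences = {name: iqr_fences(q1, q3) for name, (q1, q3) in quartiles.items()}
for name, (low, high) in fences.items():
    print(f"{name}: lower fence = {low:.3f}, upper fence = {high:.3f}")
```

The upper fences come out at 9.5 visits and about 6.274 page views, which agrees with the smallest flagged values of 10 and 6.313. Both lower fences are negative, so every outlier lies on the high side.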

Bivariate Data Analysis

# create a pair plot to see if there are any relationships
# color the points by status to distinguish converted from non-converted visitors
sns.pairplot(df, hue ='status')


Pairplot Summary

The pair plot shows the pairwise relationships between the numeric variables. The color indicates whether the visitor converted to a customer; orange marks visitors who are now customers.

age

# visualize how age relates to becoming a paying customer
plt.figure(figsize = (15, 10))
sns.boxplot(x = df["status"], y = df["age"])
plt.show()

current_occupation

Analyze the data based on the current_occupation.

# visualize the current occupations of the visitors based on conversion
plt.figure(figsize = (15, 10))
sns.countplot(x = 'current_occupation', hue = 'status', data = df)
plt.show()

#look at the age of the visitor by current occupation
df.groupby(["current_occupation"])["age"].describe()
count mean std min 25% 50% 75% max
current_occupation
Professional 2616.0 49.347477 9.890744 25.0 42.0 54.0 57.0 60.0
Student 555.0 21.144144 2.001114 18.0 19.0 21.0 23.0 25.0
Unemployed 1441.0 50.140180 9.999503 32.0 42.0 54.0 58.0 63.0
# box plots of the age of the visitor grouped by the current occupation
plt.figure(figsize = (10, 5))
sns.boxplot(x = df["current_occupation"], y = df["age"])
plt.show()

first_interactions

# visualize the first_interactions of the visitors based on conversion
plt.figure(figsize = (15, 10))
sns.countplot(x = 'first_interaction', hue = 'status', data = df)
plt.show()

time_spent_on_website

# visualize the time_spent_on_website of the visitors based on conversion
plt.figure(figsize = (10, 5))
sns.boxplot(x = df["status"], y = df["time_spent_on_website"])
plt.show()

profile_completed

#visualize what the completion of the profile means for creating a paying customer
plt.figure(figsize = (15, 10))
sns.countplot(x = 'profile_completed', hue = 'status', data = df)
plt.show()

referral

#visualize what a referral means for creating a paying customer
plt.figure(figsize = (15, 10))
sns.countplot(x = 'referral', hue = 'status', data = df)
plt.show()

Observations

Correlation Heat Map

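# Correlation matrix of the numeric columns (no grouping)
# the object columns are excluded because correlation is only defined for numeric data
plt.figure(figsize=(15,10))
sns.heatmap(df.select_dtypes('number').corr().round(2),annot=True)
plt.title('Correlation matrix of data',fontsize = 30)
plt.show()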

Correlations Summary

Model Preparation

The goal is to determine which variables lead to a visitor converting to a paying customer.

Encoding the data

# Separating the target variable and other variables

# make a copy called X which is a dataframe with "status" column removed
X = df.drop(columns = 'status') 

# Y is a series containing the "status" (column)
Y = df['status'] 
# Create dummy variables; drop_first = True drops one level per categorical column to avoid redundant variables
# pd.get_dummies converts every categorical column of X into binary indicator columns, 1 (yes) / 0 (no)

X = pd.get_dummies(X, drop_first = True)
# Splitting the data into train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)
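To make the effect of `drop_first = True` concrete, here is a minimal sketch on a toy frame (the four rows are made up; only the column names mirror the leads data). For each categorical column, the first level in sorted order is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (illustrative values)
toy = pd.DataFrame({
    "age": [57, 56, 23, 35],
    "profile_completed": ["High", "Medium", "Low", "High"],
    "referral": ["No", "Yes", "No", "No"],
})

encoded = pd.get_dummies(toy, drop_first=True)

# "High" and "No" are the first levels alphabetically, so they are dropped
print(list(encoded.columns))
```

This is why the encoded feature names include `profile_completed_Medium` and `profile_completed_Low` but no `profile_completed_High`: a row with zeros in both columns is a High-completion profile.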

Shape

print("Shape of the training set: ", X_train.shape)   

print("Shape of the test set: ", X_test.shape)

print("Percentage of classes in the training set:")

print(y_train.value_counts(normalize = True))

print("Percentage of classes in the test set:")

print(y_test.value_counts(normalize = True))
Shape of the training set:  (3228, 16)
Shape of the test set:  (1384, 16)
Percentage of classes in the training set:
0    0.704151
1    0.295849
Name: status, dtype: float64
Percentage of classes in the test set:
0    0.695087
1    0.304913
Name: status, dtype: float64
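The class shares differ slightly between the two sets (about 29.6% converted in training versus 30.5% in test) because the split is purely random. One option, not used in the analysis here, is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both sets. A sketch on synthetic labels with an assumed 70/30 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/30 class split (an assumption for illustration)
y_demo = np.array([0] * 700 + [1] * 300)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify = y_demo preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```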

Building Classification Models

Before training the model, let’s choose the appropriate model evaluation criterion as per the problem at hand.

Model evaluation criterion

The model can make two kinds of wrong predictions:

  1. Predicting a visitor will not convert to a paying customer but in reality, the visitor would convert to a paying customer.
  2. Predicting a visitor will convert to a paying customer but in reality, the visitor does not convert to a paying customer.

Which case is more important?

The number of False Negatives should be minimized.
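Minimizing False Negatives is the same as maximizing recall for the converted class, since recall = TP / (TP + FN). The following sketch shows the relationship on ten hand-made labels (the labels are illustrative, not taken from the leads data):

```python
from sklearn.metrics import confusion_matrix, recall_score

# 1 = converted, 0 = not converted (hand-made example labels)
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

recall_by_hand = tp / (tp + fn)
recall_sklearn = recall_score(actual, predicted, pos_label=1)

print(tn, fp, fn, tp, recall_by_hand, recall_sklearn)
```

One converted visitor out of four is missed, so recall is 0.75. The two false positives lower precision but leave recall unchanged.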

How to reduce the losses?

Also, let’s create a function to calculate and print the classification report and confusion matrix so that we don’t have to rewrite the same code repeatedly for each model.

# Function to print the classification report and get confusion matrix in a proper format

def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    
    cm = confusion_matrix(actual, predicted)
    
    plt.figure(figsize = (8, 5))
    
    sns.heatmap(cm, annot = True,  fmt = '.2f', xticklabels = ['Not Converted', 'Converted'], yticklabels = ['Not Converted', 'Converted'])
    
    plt.ylabel('Actual')
    
    plt.xlabel('Predicted')
    
    plt.show()

Building a Decision Tree model

# Fitting the decision tree classifier on the training data
d_tree =  DecisionTreeClassifier(random_state = 7)

d_tree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=7)

Model Performance evaluation and improvement

# Checking performance on the training data
y_pred_train1 = d_tree.predict(X_train)

metrics_score(y_train, y_pred_train1)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2273
           1       1.00      1.00      1.00       955

    accuracy                           1.00      3228
   macro avg       1.00      1.00      1.00      3228
weighted avg       1.00      1.00      1.00      3228

Observations:

Reading confusion matrix (clockwise):

There is no error on the training set, i.e., each sample has been classified correctly.

Because the model fits the training data perfectly, it is likely overfitted.
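A perfect training score is expected from an unrestricted decision tree, which keeps splitting until every leaf is pure. The sketch below illustrates this on synthetic data; `make_classification` stands in for the leads data, and the 70/30 class split is an assumption chosen to resemble it.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the leads data (assumed 70/30 class split)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.7, 0.3], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1
)

# No depth or leaf-size limit: the tree can memorize the training set
full_tree = DecisionTreeClassifier(random_state=7).fit(X_tr, y_tr)

train_recall = recall_score(y_tr, full_tree.predict(X_tr))
test_recall = recall_score(y_te, full_tree.predict(X_te))
print(f"train recall = {train_recall:.2f}, test recall = {test_recall:.2f}")
```

The gap between the training and test scores is the signal of overfitting; limiting `max_depth` or raising `min_samples_leaf` trades training accuracy for better generalization.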

# Checking performance on the testing data
y_pred_test1 = d_tree.predict(X_test)

metrics_score(y_test, y_pred_test1)
              precision    recall  f1-score   support

           0       0.87      0.86      0.87       962
           1       0.69      0.70      0.70       422

    accuracy                           0.81      1384
   macro avg       0.78      0.78      0.78      1384
weighted avg       0.81      0.81      0.81      1384

Observations:

Decision Tree - Hyperparameter Tuning

We will use the class_weight hyperparameter with the value equal to {0: 0.3, 1: 0.7} which is approximately the opposite of the imbalance in the original data.

This would tell the model that 1 is an important class here.
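For comparison, scikit-learn's `"balanced"` option derives the weights from the class counts as n_samples / (n_classes * count). A short sketch, using the training-set class counts reported above (2273 and 955), shows that the manual {0: 0.3, 1: 0.7} weights give nearly the same relative emphasis:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the training-set class counts reported above
y_counts = np.array([0] * 2273 + [1] * 955)

balanced = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_counts)

# Only the ratio between the two weights matters for the relative emphasis
balanced_ratio = balanced[1] / balanced[0]
manual_ratio = 0.7 / 0.3

print(balanced.round(3), round(balanced_ratio, 2), round(manual_ratio, 2))
```

The balanced ratio is about 2.38 and the manual ratio about 2.33, so both settings weight a converted lead a little more than twice as heavily as a non-converted one.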

# Choose the type of classifier 
d_tree_tuned = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.3, 1: 0.7})

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 10), #depth [2, 3, 4, 5, 6, 7, 8, 9]
              'criterion': ['gini', 'entropy'], #use both gini and entropy to measure split quality
              'min_samples_leaf': [5, 10, 20, 25] #minimum number of samples to be a leaf node
             }

# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search 
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5) #=> chooses the best hyperparameters to use

grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
d_tree_tuned.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, criterion='entropy',
                       max_depth=3, min_samples_leaf=5, random_state=7)
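With the default `refit = True`, `GridSearchCV` already refits the best parameter combination on the whole training set, so `best_estimator_` is ready to use and a further `fit` call is redundant but harmless. A minimal sketch on synthetic data (the dataset and the small grid are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_is_fitted

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.7, 0.3], random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=7, class_weight={0: 0.3, 1: 0.7}),
    {"max_depth": [2, 3, 4]},
    scoring="recall",   # recall of the positive class (label 1)
    cv=5,
)
search.fit(X_demo, y_demo)

# Raises NotFittedError if the estimator had not been fitted
check_is_fitted(search.best_estimator_)
print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated recall of the winning combination, which is a useful figure to report alongside the test-set recall.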

We have tuned the model and fit the tuned model on the training data. Now, let’s check the model performance on the training and testing data.

# Checking performance on the training data
y_pred_train2 = d_tree_tuned.predict(X_train)

metrics_score(y_train, y_pred_train2)
              precision    recall  f1-score   support

           0       0.94      0.77      0.85      2273
           1       0.62      0.88      0.73       955

    accuracy                           0.80      3228
   macro avg       0.78      0.83      0.79      3228
weighted avg       0.84      0.80      0.81      3228

Observations

Let’s check the model performance on the testing data

# Checking performance on the testing data
y_pred_test2 = d_tree_tuned.predict(X_test)

metrics_score(y_test, y_pred_test2)
              precision    recall  f1-score   support

           0       0.93      0.77      0.84       962
           1       0.62      0.86      0.72       422

    accuracy                           0.80      1384
   macro avg       0.77      0.82      0.78      1384
weighted avg       0.83      0.80      0.80      1384

Observations

Let’s visualize the tuned decision tree and observe the decision rules:

#Visualize the tree
features = list(X.columns)
plt.figure(figsize = (20, 20))
tree.plot_tree(d_tree_tuned, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = True)
plt.show()

Blue represents visitors who converted to paying customers (class = y[1]).

Orange represents visitors who did not convert to paying customers (class = y[0]).

Observations

Let’s look at the feature importance of the tuned decision tree model

# Importance of features in the tree building

print (pd.DataFrame(d_tree_tuned.feature_importances_, columns = ['Importance'], index = X_train.columns).sort_values(by = 'Importance', ascending = False))
                                Importance
time_spent_on_website             0.348142
first_interaction_Website         0.327181
profile_completed_Medium          0.239274
age                               0.063893
last_activity_Website Activity    0.021511
website_visits                    0.000000
page_views_per_visit              0.000000
current_occupation_Student        0.000000
current_occupation_Unemployed     0.000000
profile_completed_Low             0.000000
last_activity_Phone Activity      0.000000
print_media_type1_Yes             0.000000
print_media_type2_Yes             0.000000
digital_media_Yes                 0.000000
educational_channels_Yes          0.000000
referral_Yes                      0.000000
# Plotting the feature importance
importances = d_tree_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize = (10, 10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'blue', align = 'center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Observations

time_spent_on_website

first_interaction (Website)

profile_completed (Medium)

age

last_activity (Website Activity)
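The five features above carry all of the importance in the tuned tree. A small sketch, using only the importance values printed above, shows how much of the total the top three account for:

```python
import pandas as pd

# Non-zero importances copied from the printed table above
importance = pd.Series({
    "time_spent_on_website": 0.348142,
    "first_interaction_Website": 0.327181,
    "profile_completed_Medium": 0.239274,
    "age": 0.063893,
    "last_activity_Website Activity": 0.021511,
}).sort_values(ascending=False)

cumulative = importance.cumsum()
top3_share = cumulative.iloc[2]

print(cumulative.round(3))
```

The top three features account for roughly 91% of the total importance, and the five together sum to 1 up to rounding. A tree limited to `max_depth = 3` has at most seven internal splits, so it can use only a few features.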

Building a Random Forest model

Random Forest Classifier

# Fitting the random forest tree classifier on the training data
rf_estimator = RandomForestClassifier(random_state=7,criterion="entropy")

rf_estimator.fit(X_train,y_train)
RandomForestClassifier(criterion='entropy', random_state=7)

Let’s check the performance of the model on the training data

# Checking performance on the training data
y_pred_train3 = rf_estimator.predict(X_train)

metrics_score(y_train, y_pred_train3)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2273
           1       1.00      1.00      1.00       955

    accuracy                           1.00      3228
   macro avg       1.00      1.00      1.00      3228
weighted avg       1.00      1.00      1.00      3228

Observations:

Let’s check the performance on the testing data

# Checking performance on the testing data
y_pred_test3 = rf_estimator.predict(X_test)

metrics_score(y_test, y_pred_test3)
              precision    recall  f1-score   support

           0       0.88      0.93      0.90       962
           1       0.81      0.70      0.75       422

    accuracy                           0.86      1384
   macro avg       0.84      0.81      0.83      1384
weighted avg       0.86      0.86      0.86      1384

Observations:

Let’s see if we can get a better model by tuning the random forest classifier

Random Forest Classifier - Hyperparameter Tuning Model

# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)

# Grid of parameters to choose from
parameters = {"n_estimators": [100, 110, 120],
    "max_depth": [5, 6, 7],
    "max_features": [0.8, 0.9, 1]  # floats are fractions of the features; the integer 1 means one feature per split
             }

# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
rf_estimator_tuned_base = grid_obj.best_estimator_
# GridSearchCV refits the best estimator on the full training data by default (refit = True),
# so rf_estimator_tuned_base is already fitted and needs no further fit call
# Checking performance on the training data
y_pred_train4 = rf_estimator_tuned_base.predict(X_train)

metrics_score(y_train, y_pred_train4)
              precision    recall  f1-score   support

           0       0.91      0.92      0.91      2273
           1       0.80      0.78      0.79       955

    accuracy                           0.88      3228
   macro avg       0.86      0.85      0.85      3228
weighted avg       0.88      0.88      0.88      3228

Observations:

# Checking performance on the testing data
y_pred_test4 = rf_estimator_tuned_base.predict(X_test)

metrics_score(y_test, y_pred_test4)
              precision    recall  f1-score   support

           0       0.88      0.92      0.90       962
           1       0.79      0.73      0.76       422

    accuracy                           0.86      1384
   macro avg       0.84      0.82      0.83      1384
weighted avg       0.86      0.86      0.86      1384

Observations

Random Forest Classifier - Hyperparameter Tuning Model 2

# Choose the type of classifier 
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)

# Grid of parameters to choose from
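parameters = {"n_estimators": [110, 120],
    "max_depth": [6, 7],
    "min_samples_leaf": [20, 25],
    "max_features": [0.8, 0.9],  # fractions of the features considered at each split
    "max_samples": [0.9, 1],  # 0.9 is a fraction of the training rows; the integer 1 means one row per tree
    "class_weight": ["balanced",{0: 0.3, 1: 0.7}]
             }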

# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search on the training data using scorer=scorer and cv=5

grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(X_train, y_train)

# Save the best estimator to variable rf_estimator_tuned
rf_estimator_tuned = grid_obj.best_estimator_

#Fit the best estimator to the training data
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=6, max_features=0.8, max_samples=0.9,
                       min_samples_leaf=25, n_estimators=120, random_state=7)

Let’s check the performance of the tuned model

# Checking performance on the training data
y_pred_train5 = rf_estimator_tuned.predict(X_train)

metrics_score(y_train, y_pred_train5)
              precision    recall  f1-score   support

           0       0.94      0.83      0.88      2273
           1       0.68      0.87      0.76       955

    accuracy                           0.84      3228
   macro avg       0.81      0.85      0.82      3228
weighted avg       0.86      0.84      0.84      3228

Let’s check the model performance on the test data

# Checking performance on the test data
y_pred_test5 = rf_estimator_tuned.predict(X_test)

metrics_score(y_test, y_pred_test5)
              precision    recall  f1-score   support

           0       0.93      0.83      0.87       962
           1       0.68      0.85      0.76       422

    accuracy                           0.83      1384
   macro avg       0.81      0.84      0.82      1384
weighted avg       0.85      0.83      0.84      1384

Observations:

Feature Importance - Random Forest

importances = rf_estimator_tuned.feature_importances_

indices = np.argsort(importances)

feature_names = list(X.columns)

plt.figure(figsize = (12, 12))

plt.title('Feature Importances')

plt.barh(range(len(indices)), importances[indices], color = 'blue', align = 'center')

plt.yticks(range(len(indices)), [feature_names[i] for i in indices])

plt.xlabel('Relative Importance')

plt.show()

Observations:

Actionable Insights and Recommendations

Goal

The goal was to maximize the Recall value. The higher the Recall score, the fewer the False Negatives.
False Negatives: Predicting that a visitor will not convert to a paying customer when in reality the visitor would have converted, so the business loses revenue.

Process

Two models performed very well:

  1. Tuned Decision Tree Model - recall 86%, f1-score 72%, and macro average 82%

  2. Hyper Tuned Random Forest Model - recall 85%, f1-score 76%, and macro average 84%
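For a side-by-side view, the test-set metrics for the converted class (class 1) can be collected into one table. The sketch below copies the numbers from the classification reports above; the row labels are shorthand introduced here, not names used in the notebook.

```python
import pandas as pd

# Class 1 (converted) metrics on the test data, copied from the reports above
summary = pd.DataFrame(
    {
        "precision": [0.69, 0.62, 0.81, 0.79, 0.68],
        "recall":    [0.70, 0.86, 0.70, 0.73, 0.85],
        "f1":        [0.70, 0.72, 0.75, 0.76, 0.76],
    },
    index=[
        "Decision Tree (default)",
        "Decision Tree (tuned)",
        "Random Forest (default)",
        "Random Forest (tuned 1)",
        "Random Forest (tuned 2)",
    ],
).sort_values("recall", ascending=False)

print(summary)
```

Sorted by recall, the tuned decision tree and the second tuned random forest lead at 0.86 and 0.85. The random forest has the higher f1-score of the two, which means it gives up less precision for that recall.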

Conclusions:

Model Recommendation

The Hyper Tuned Random Forest Model is recommended for future use. On the test data, its Recall score of 85% is within one point of the Tuned Decision Tree's 86%, while its f1-score is 4 points higher (76% vs 72%) and its macro average is 2 points higher (84% vs 82%).

4 Most Important Features identified by both models

time_spent_on_website

first_interaction (Website)

profile_completed (Medium)

age

Business Recommendations

ExtraaLearn’s Representatives should prioritize visitors based on:

  1. The visitor’s time_spent_on_website.
    • This was the top feature for both models in determining whether a visitor would convert to a paying customer.
  2. The visitor’s first_interaction on the website.
    • If a visitor’s first interaction was on the website instead of the app, they were more likely to become a paying customer. Representatives should therefore take this into account when deciding how much time to dedicate to a visitor.
  3. The visitor’s profile_completed.
    • If the visitor’s profile completion level is Medium, they are more likely to convert to a paying customer.
  4. The visitor’s age.
    • If a visitor is between 45 and 55 years of age, they are more likely to convert to a paying customer.