Data-driven professional with a strong background in mathematics, statistics, data analysis, and education. I use data to find patterns and insights, drive innovation, and create meaningful change.
Technical Skills: Excel, SQL, Python, R, Power BI, Tableau; Certified Scrum Master
Applied Data Science & ML - MIT
BSc Applied Mathematics
MSc Mathematics Ed (Financial Math)
Ph.D. Mathematics Ed (Problem-Solving)
View My LinkedIn Profile
The EdTech industry has surged immensely over the past decade: one forecast projected that the online education market would be worth $286.62bn by 2023, growing at a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has pushed this growth and expansion well beyond earlier limits. With features such as easy information sharing, personalized learning experiences, and transparent assessment, online education is now often preferred over traditional education.
In the present scenario, due to Covid-19, the online education sector has witnessed rapid growth and is attracting many new customers, and many new companies have emerged in the industry as a result. With the availability and ease of use of digital marketing resources, companies can reach a wider audience with their offerings. Customers who show interest in these offerings are termed leads. EdTech companies obtain leads from a variety of sources, such as:
The company then nurtures these leads and tries to convert them into paying customers. To do so, a representative from the organization connects with the lead by phone or email to share further details.
ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill or reskill. With a large number of leads being generated on a regular basis, one of the issues ExtraaLearn faces is identifying which leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided with the leads data to:
The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.
Data Dictionary
import warnings
warnings.filterwarnings("ignore")
# Libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# For training and testing the data
from sklearn.model_selection import train_test_split
# Algorithms to use
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
# Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, f1_score, recall_score
from sklearn import metrics
# For hyperparameter tuning
from sklearn.model_selection import GridSearchCV
# Load the data - original data set
customers = pd.read_csv('ExtraaLearn.csv')
# create a copy of the data set to work with
df = customers.copy()
Sanity checks
#check that the data is loaded and look at the dataframe
df.head()
 | ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | EXT001 | 57 | Unemployed | Website | High | 7 | 1639 | 1.861 | Website Activity | Yes | No | Yes | No | No | 1 |
1 | EXT002 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.320 | Website Activity | No | No | No | Yes | No | 0 |
2 | EXT003 | 52 | Professional | Website | Medium | 3 | 330 | 0.074 | Website Activity | No | No | Yes | No | No | 0 |
3 | EXT004 | 53 | Unemployed | Website | High | 4 | 464 | 2.057 | Website Activity | No | No | No | No | No | 1 |
4 | EXT005 | 23 | Student | Website | High | 4 | 600 | 16.914 | Email Activity | No | No | No | No | No | 0 |
# check the last five rows of the data
df.tail()
 | ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
4607 | EXT4608 | 35 | Unemployed | Mobile App | Medium | 15 | 360 | 2.170 | Phone Activity | No | No | No | Yes | No | 0 |
4608 | EXT4609 | 55 | Professional | Mobile App | Medium | 8 | 2327 | 5.393 | Email Activity | No | No | No | No | No | 0 |
4609 | EXT4610 | 58 | Professional | Website | High | 2 | 212 | 2.692 | Email Activity | No | No | No | No | No | 1 |
4610 | EXT4611 | 57 | Professional | Mobile App | Medium | 1 | 154 | 3.879 | Website Activity | Yes | No | No | No | No | 0 |
4611 | EXT4612 | 55 | Professional | Website | Medium | 4 | 2290 | 2.075 | Phone Activity | No | No | No | No | No | 0 |
df.shape
(4612, 15)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 4612 non-null object
1 age 4612 non-null int64
2 current_occupation 4612 non-null object
3 first_interaction 4612 non-null object
4 profile_completed 4612 non-null object
5 website_visits 4612 non-null int64
6 time_spent_on_website 4612 non-null int64
7 page_views_per_visit 4612 non-null float64
8 last_activity 4612 non-null object
9 print_media_type1 4612 non-null object
10 print_media_type2 4612 non-null object
11 digital_media 4612 non-null object
12 educational_channels 4612 non-null object
13 referral 4612 non-null object
14 status 4612 non-null int64
dtypes: float64(1), int64(4), object(10)
memory usage: 540.6+ KB
# find number of unique IDs
df.ID.nunique()
4612
# drop the ID column
df = df.drop(['ID'], axis = 1)
# look at the first five rows again
df.head()
 | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 57 | Unemployed | Website | High | 7 | 1639 | 1.861 | Website Activity | Yes | No | Yes | No | No | 1 |
1 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.320 | Website Activity | No | No | No | Yes | No | 0 |
2 | 52 | Professional | Website | Medium | 3 | 330 | 0.074 | Website Activity | No | No | Yes | No | No | 0 |
3 | 53 | Unemployed | Website | High | 4 | 464 | 2.057 | Website Activity | No | No | No | No | No | 1 |
4 | 23 | Student | Website | High | 4 | 600 | 16.914 | Email Activity | No | No | No | No | No | 0 |
df.shape
(4612, 14)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 4612 non-null int64
1 current_occupation 4612 non-null object
2 first_interaction 4612 non-null object
3 profile_completed 4612 non-null object
4 website_visits 4612 non-null int64
5 time_spent_on_website 4612 non-null int64
6 page_views_per_visit 4612 non-null float64
7 last_activity 4612 non-null object
8 print_media_type1 4612 non-null object
9 print_media_type2 4612 non-null object
10 digital_media 4612 non-null object
11 educational_channels 4612 non-null object
12 referral 4612 non-null object
13 status 4612 non-null int64
dtypes: float64(1), int64(4), object(9)
memory usage: 504.6+ KB
Questions
# get a summary of the data
df.describe(include = 'all').T
 | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---|---|---|---
age | 4612.0 | NaN | NaN | NaN | 46.201214 | 13.161454 | 18.0 | 36.0 | 51.0 | 57.0 | 63.0 |
current_occupation | 4612 | 3 | Professional | 2616 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
first_interaction | 4612 | 2 | Website | 2542 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
profile_completed | 4612 | 3 | High | 2264 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
website_visits | 4612.0 | NaN | NaN | NaN | 3.566782 | 2.829134 | 0.0 | 2.0 | 3.0 | 5.0 | 30.0 |
time_spent_on_website | 4612.0 | NaN | NaN | NaN | 724.011275 | 743.828683 | 0.0 | 148.75 | 376.0 | 1336.75 | 2537.0 |
page_views_per_visit | 4612.0 | NaN | NaN | NaN | 3.026126 | 1.968125 | 0.0 | 2.07775 | 2.792 | 3.75625 | 18.434 |
last_activity | 4612 | 3 | Email Activity | 2278 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
print_media_type1 | 4612 | 2 | No | 4115 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
print_media_type2 | 4612 | 2 | No | 4379 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
digital_media | 4612 | 2 | No | 4085 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
educational_channels | 4612 | 2 | No | 3907 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
referral | 4612 | 2 | No | 4519 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
status | 4612.0 | NaN | NaN | NaN | 0.298569 | 0.45768 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
# Making a list of all categorical variables
cat_col = list(df.select_dtypes("object").columns)
# Printing percentage of each unique value in each categorical column
for column in cat_col:
print(df[column].value_counts(normalize = True))
print("-" * 50)
Professional 0.567216
Unemployed 0.312446
Student 0.120338
Name: current_occupation, dtype: float64
--------------------------------------------------
Website 0.551171
Mobile App 0.448829
Name: first_interaction, dtype: float64
--------------------------------------------------
High 0.490893
Medium 0.485906
Low 0.023200
Name: profile_completed, dtype: float64
--------------------------------------------------
Email Activity 0.493929
Phone Activity 0.267563
Website Activity 0.238508
Name: last_activity, dtype: float64
--------------------------------------------------
No 0.892238
Yes 0.107762
Name: print_media_type1, dtype: float64
--------------------------------------------------
No 0.94948
Yes 0.05052
Name: print_media_type2, dtype: float64
--------------------------------------------------
No 0.885733
Yes 0.114267
Name: digital_media, dtype: float64
--------------------------------------------------
No 0.847138
Yes 0.152862
Name: educational_channels, dtype: float64
--------------------------------------------------
No 0.979835
Yes 0.020165
Name: referral, dtype: float64
--------------------------------------------------
# Making a list of all categorical variables
cat_col = list(df.select_dtypes("object").columns)
# Printing count of each unique value in each categorical column
for column in cat_col:
print(df[column].value_counts(normalize = False))
print("-" * 50)
Professional 2616
Unemployed 1441
Student 555
Name: current_occupation, dtype: int64
--------------------------------------------------
Website 2542
Mobile App 2070
Name: first_interaction, dtype: int64
--------------------------------------------------
High 2264
Medium 2241
Low 107
Name: profile_completed, dtype: int64
--------------------------------------------------
Email Activity 2278
Phone Activity 1234
Website Activity 1100
Name: last_activity, dtype: int64
--------------------------------------------------
No 4115
Yes 497
Name: print_media_type1, dtype: int64
--------------------------------------------------
No 4379
Yes 233
Name: print_media_type2, dtype: int64
--------------------------------------------------
No 4085
Yes 527
Name: digital_media, dtype: int64
--------------------------------------------------
No 3907
Yes 705
Name: educational_channels, dtype: int64
--------------------------------------------------
No 4519
Yes 93
Name: referral, dtype: int64
--------------------------------------------------
current_occupation
- there are 3 possible responses
- Professional is the top with 2616 (57%)

first_interaction
- there are 2 possible responses
- Website is the top with 2542 (55%)

profile_completed
- there are 3 possible responses
- High is the top with 2264 (49%)

last_activity
- there are 3 possible responses
- Email Activity is the top with 2278 (49%)

print_media_type1
- there are 2 possible responses
- No is the top with 4115 (89%) (had not seen the newspaper ad)

print_media_type2
- there are 2 possible responses
- No is the top with 4379 (95%) (had not seen the magazine ad)

digital_media
- there are 2 possible responses
- No is the top with 4085 (89%) (had not seen an ad on digital platforms)

educational_channels
- there are 2 possible responses
- No is the top with 3907 (85%) (had not heard of ExtraaLearn via online forums, educational websites, or discussion threads…)

referral
- there are 2 possible responses
- No is the top with 4519 (98%) (was not referred)
age
The mean age is 46.2, with a range from 18 to 63 years.

website_visits
The mean number of website visits is 3.57, with a range from 0 to 30 visits. 75% of visitors visit up to 5 times, so outliers may be present.

time_spent_on_website
The mean time spent on the website is 724.01, with a range from 0 to 2537. 75% of visitors spent up to 1336.75.

page_views_per_visit
The mean number of page views per visit is 3.03, with a range from 0 to 18.434. 75% of visitors viewed up to 3.76 pages per visit, so outliers may be present.

# create a bar chart showing how many visitors were converted to paying customers (status = 1)
plt.figure(figsize = (10, 6))
ax = sns.countplot(x = 'status', data = df)
# Place the exact count on the top of the bar for each category using annotate
for p in ax.patches:
ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()))
The majority of the visitors are professionals (56.7%). The website (55%) is the most common first interaction with ExtraaLearn. Only about 2% of the visitors have a low level of profile completion. Email Activity (49%) is the most common last activity. Very few visitors have interacted with ExtraaLearn through advertisements or referrals: newspaper ads (11%), magazine ads (5%), digital media (11%), educational channels (15%), and referrals (2%).
Only 1377 (approximately 30%) of the 4612 visitors were converted to paying customers.
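The counts behind this figure can be checked directly from the status column; a minimal sketch using the working dataframe df:
# count converted (1) vs. not converted (0) visitors, as raw counts and proportions
print(df['status'].value_counts())
print(df['status'].value_counts(normalize = True).round(3))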
#create countplots and box plots to visualize data to identify the distribution and outliers
for col in ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']:
print(col)
print('The skew is :',round(df[col].skew(), 2))
plt.figure(figsize = (20, 4))
# histogram
plt.subplot(1, 2, 1)
df[col].hist(bins = 10, grid = False)
plt.ylabel('count')
#box plot
plt.subplot(1, 2, 2)
sns.boxplot(df[col])
plt.show()
age
The skew is : -0.72
website_visits
The skew is : 2.16
time_spent_on_website
The skew is : 0.95
page_views_per_visit
The skew is : 1.27
age
There is a negative skew (-0.72), with most visitors around 55 years of age.
website_visits
There is a positive skew (2.16); most visitors made between 0 and 5 visits, with frequency decreasing from there. The box plot shows outliers.
time_spent_on_website
There is a positive skew (0.95), with the highest frequency of visitors spending between 0 and 250 on the site.
page_views_per_visit
There is a positive skew (1.27), with the highest frequency of page views between 2.5 and 5. The box plot shows outliers.
find_outliers_IQR
This helper takes a dataframe or a single column as input and returns the values lying below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. Applied to a whole dataframe it keeps non-outliers as NaN, while applied to a single column it returns only the outlier values.
# define a function to identify outliers using the IQR rule
def find_outliers_IQR(df):
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
outliers = df[((df<(Q1 - 1.5*IQR)) | (df>(Q3 + 1.5*IQR)))]
return outliers
#identifying the outliers for the website_visits
outliers = find_outliers_IQR(df['website_visits'])
print('website_visits number of outliers: ' + str(outliers.count()))
print('website_visits min outlier value: ' + str(outliers.min()))
print('website_visits max outlier value: ' + str(outliers.max()))
outliers
website_visits number of outliers: 154
website_visits min outlier value: 10
website_visits max outlier value: 30
6 13
31 13
32 12
66 25
201 14
..
4566 13
4571 12
4583 24
4589 16
4607 15
Name: website_visits, Length: 154, dtype: int64
#identifying the outliers for the page_views_per_visit
outliers = find_outliers_IQR(df['page_views_per_visit'])
print('page_views_per_visit number of outliers: ' + str(outliers.count()))
print('page_views_per_visit min outlier value: ' + str(outliers.min()))
print('page_views_per_visit max outlier value: ' + str(outliers.max()))
outliers
page_views_per_visit number of outliers: 257
page_views_per_visit min outlier value: 6.313
page_views_per_visit max outlier value: 18.434
4 16.914
32 18.434
47 7.050
110 7.364
121 6.887
...
4470 6.810
4507 6.822
4514 7.997
4572 7.397
4597 8.246
Name: page_views_per_visit, Length: 257, dtype: float64
# create a pair plot to see if there are any relationships
# distinguish converted and non-converted visitors by passing status as the hue parameter
sns.pairplot(df, hue ='status')
<seaborn.axisgrid.PairGrid at 0x7fe0cfa5de20>
A pair plot visualizes the pairwise relationships between variables, which can be continuous or categorical. The color indicates whether the visitor was converted to a paying customer; orange shows those who are now customers.
age
# visualize how age relates to conversion to a paying customer
plt.figure(figsize = (15, 10))
sns.boxplot(x = df["status"], y = df["age"])
plt.show()
current_occupation
Analyze the data based on the current_occupation of the visitor.
# visualize the current occupations of the visitors based on conversion
plt.figure(figsize = (15, 10))
sns.countplot(x = 'current_occupation', hue = 'status', data = df)
plt.show()
#look at the age of the visitor by current occupation
df.groupby(["current_occupation"])["age"].describe()
current_occupation | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Professional | 2616.0 | 49.347477 | 9.890744 | 25.0 | 42.0 | 54.0 | 57.0 | 60.0 |
Student | 555.0 | 21.144144 | 2.001114 | 18.0 | 19.0 | 21.0 | 23.0 | 25.0 |
Unemployed | 1441.0 | 50.140180 | 9.999503 | 32.0 | 42.0 | 54.0 | 58.0 | 63.0 |
# box plots of the age of the visitor grouped by the current occupation
plt.figure(figsize = (10, 5))
sns.boxplot(df["current_occupation"], df["age"])
plt.show()
first_interaction
# visualize the first_interactions of the visitors based on conversion
plt.figure(figsize = (15, 10))
sns.countplot(x = 'first_interaction', hue = 'status', data = df)
plt.show()
time_spent_on_website
# visualize the time_spent_on_website of the visitors based on conversion
plt.figure(figsize = (10, 5))
sns.boxplot(df["status"], df["time_spent_on_website"])
plt.show()
profile_completed
#visualize what the completion of the profile means for creating a paying customer
plt.figure(figsize = (15, 10))
sns.countplot(x = 'profile_completed', hue = 'status', data = df)
plt.show()
referral
#visualize what a referral means for creating a paying customer
plt.figure(figsize = (15, 10))
sns.countplot(x = 'referral', hue = 'status', data = df)
plt.show()
The plots above explore conversion by age, current_occupation, first_interaction, time_spent_on_website, profile_completed, and referral. In each count plot, blue represents visitors who did not convert to paying customers.
# Correlation matrix (no grouping)
plt.figure(figsize=(15,10))
sns.heatmap(df.corr().round(2),annot=True)
plt.title('Correlation matrix of data',fontsize = 30)
plt.show()
The correlations of particular interest here are between time_spent_on_website and status, and between website_visits and status.
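To read the heatmap more precisely, the correlation of each numeric column with status can also be printed directly; a small supplementary check (depending on the pandas version, df.corr(numeric_only = True) may be required):
# correlation of each numeric feature with the target, sorted from strongest to weakest
print(df.corr()['status'].drop('status').sort_values(ascending = False))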
To determine which variables lead to a visitor converting to a paying customer, we now build and evaluate classification models.
# Separating the target variable and other variables
# make a copy called X which is a dataframe with "status" column removed
X = df.drop(columns = 'status')
# Y is a series containing the "status" (column)
Y = df['status']
# Creating dummy variables, drop_first=True is used to avoid redundant variables
# pd.get_dummies converts each categorical variable in X into binary (0/1) indicator columns
X = pd.get_dummies(X, drop_first = True)
# Splitting the data into train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)
print("Shape of the training set: ", X_train.shape)
print("Shape of the test set: ", X_test.shape)
print("Percentage of classes in the training set:")
print(y_train.value_counts(normalize = True))
print("Percentage of classes in the test set:")
print(y_test.value_counts(normalize = True))
Shape of the training set: (3228, 16)
Shape of the test set: (1384, 16)
Percentage of classes in the training set:
0 0.704151
1 0.295849
Name: status, dtype: float64
Percentage of classes in the test set:
0 0.695087
1 0.304913
Name: status, dtype: float64
Before training the model, let’s choose the appropriate model evaluation criterion as per the problem at hand.
The model can make wrong predictions in two ways:
False Negative: predicting that a visitor will not convert to a paying customer when, in reality, the visitor would have converted. The company loses out on potential revenue.
False Positive: predicting that a visitor will convert to a paying customer when, in reality, they do not. The company wastes representatives' time and effort that could have been spent on other potential customers.
Which case is more important?
The number of False Negatives should be minimized.
How do we reduce these losses? By maximizing Recall, which reduces the number of false negatives and thereby avoids missing visitors who would become paying customers.
Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
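Before defining that helper, here is a minimal sketch of how recall relates to false negatives, using a small set of hypothetical labels (illustration only, not the project data):
# hypothetical ground truth and predictions: two converting visitors are missed
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 0, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
# recall = TP / (TP + FN): it drops whenever converting visitors are missed
print(tp / (tp + fn))                          # 0.5
print(recall_score(y_true_demo, y_pred_demo))  # same value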
# Function to print the classification report and get confusion matrix in a proper format
def metrics_score(actual, predicted):
print(classification_report(actual, predicted))
cm = confusion_matrix(actual, predicted)
plt.figure(figsize = (8, 5))
sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['Not Converted', 'Converted'], yticklabels = ['Not Converted', 'Converted'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
# Fitting the decision tree classifier on the training data
d_tree = DecisionTreeClassifier(random_state = 7)
d_tree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=7)
# Checking performance on the training data
y_pred_train1 = d_tree.predict(X_train)
metrics_score(y_train, y_pred_train1)
precision recall f1-score support
0 1.00 1.00 1.00 2273
1 1.00 1.00 1.00 955
accuracy 1.00 3228
macro avg 1.00 1.00 1.00 3228
weighted avg 1.00 1.00 1.00 3228
Reading the confusion matrix: there are no errors on the training set, i.e., every sample has been classified correctly.
Because the model fits the training data perfectly, it is likely overfitted.
# Checking performance on the testing data
y_pred_test1 = d_tree.predict(X_test)
metrics_score(y_test, y_pred_test1)
precision recall f1-score support
0 0.87 0.86 0.87 962
1 0.69 0.70 0.70 422
accuracy 0.81 1384
macro avg 0.78 0.78 0.78 1384
weighted avg 0.81 0.81 0.81 1384
The model performs noticeably worse on the test data, so it is overfitting the training data.
To reduce overfitting, let's use GridSearchCV for hyperparameter tuning to find the optimal max_depth; we can tune some other hyperparameters as well.
We will use the class_weight hyperparameter with the value equal to {0: 0.3, 1: 0.7} which is approximately the opposite of the imbalance in the original data.
This would tell the model that 1 is an important class here.
# Choose the type of classifier
d_tree_tuned = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.3, 1: 0.7})
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 10), #depth [2, 3, 4, 5, 6, 7, 8, 9]
'criterion': ['gini', 'entropy'], #use both gini and entropy to measure split quality
'min_samples_leaf': [5, 10, 20, 25] #minimum number of samples to be a leaf node
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5) #=> chooses the best hyperparameters to use
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data
d_tree_tuned.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, criterion='entropy',
max_depth=3, min_samples_leaf=5, random_state=7)
We have tuned the model and fit the tuned model on the training data. Now, let’s check the model performance on the training and testing data.
# Checking performance on the training data
y_pred_train2 = d_tree_tuned.predict(X_train)
metrics_score(y_train, y_pred_train2)
precision recall f1-score support
0 0.94 0.77 0.85 2273
1 0.62 0.88 0.73 955
accuracy 0.80 3228
macro avg 0.78 0.83 0.79 3228
weighted avg 0.84 0.80 0.81 3228
Let’s check the model performance on the testing data
# Checking performance on the testing data
y_pred_test2 = d_tree_tuned.predict(X_test)
metrics_score(y_test, y_pred_test2)
precision recall f1-score support
0 0.93 0.77 0.84 962
1 0.62 0.86 0.72 422
accuracy 0.80 1384
macro avg 0.77 0.82 0.78 1384
weighted avg 0.83 0.80 0.80 1384
Let’s visualize the tuned decision tree and observe the decision rules:
#Visualize the tree
features = list(X.columns)
plt.figure(figsize = (20, 20))
tree.plot_tree(d_tree_tuned, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = True)
plt.show()
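The same decision rules can also be printed as plain text with scikit-learn's export_text, which is sometimes easier to scan than the plot; a small supplementary sketch using the tuned tree and feature list defined above:
# print the tuned tree's decision rules as indented text
from sklearn.tree import export_text
print(export_text(d_tree_tuned, feature_names = features))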
Blue nodes represent those who converted to paying customers (class = y[1]).
Orange nodes represent those who did not convert to paying customers (class = y[0]).
first_interaction
This feature appears at the top of the tree, which implies it is one of the most important features (as observed in EDA). Visitors whose first interaction was through the website, rather than the mobile app, had a much higher conversion rate.

time_spent_on_website
This feature also appears in the tree (it was highlighted in the correlation heatmap). Visitors who spend more time on the website have a higher chance of converting to paying customers.

age
Age is represented twice, once under each prior branch, so it also seems important. Visitors aged 25 and over had a higher chance of converting to paying customers.

Let's look at the feature importance of the tuned decision tree model.
# Importance of features in the tree building
print (pd.DataFrame(d_tree_tuned.feature_importances_, columns = ['Importance'], index = X_train.columns).sort_values(by = 'Importance', ascending = False))
Importance
time_spent_on_website 0.348142
first_interaction_Website 0.327181
profile_completed_Medium 0.239274
age 0.063893
last_activity_Website Activity 0.021511
website_visits 0.000000
page_views_per_visit 0.000000
current_occupation_Student 0.000000
current_occupation_Unemployed 0.000000
profile_completed_Low 0.000000
last_activity_Phone Activity 0.000000
print_media_type1_Yes 0.000000
print_media_type2_Yes 0.000000
digital_media_Yes 0.000000
educational_channels_Yes 0.000000
referral_Yes 0.000000
# Plotting the feature importance
importances = d_tree_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize = (10, 10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'blue', align = 'center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
The features with non-zero importance in the tuned decision tree, in decreasing order, are: time_spent_on_website, first_interaction (Website), profile_completed (Medium), age, and last_activity (Website Activity).
# Fitting the random forest classifier on the training data
rf_estimator = RandomForestClassifier(random_state=7,criterion="entropy")
rf_estimator.fit(X_train,y_train)
RandomForestClassifier(criterion='entropy', random_state=7)
Let’s check the performance of the model on the training data
# Checking performance on the training data
y_pred_train3 = rf_estimator.predict(X_train)
metrics_score(y_train, y_pred_train3)
precision recall f1-score support
0 1.00 1.00 1.00 2273
1 1.00 1.00 1.00 955
accuracy 1.00 3228
macro avg 1.00 1.00 1.00 3228
weighted avg 1.00 1.00 1.00 3228
Let’s check the performance on the testing data
# Checking performance on the testing data
y_pred_test3 = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test3)
precision recall f1-score support
0 0.88 0.93 0.90 962
1 0.81 0.70 0.75 422
accuracy 0.86 1384
macro avg 0.84 0.81 0.83 1384
weighted avg 0.86 0.86 0.86 1384
Let’s see if we can get a better model by tuning the random forest classifier
Let’s try tuning some of the important hyperparameters of the Random Forest Classifier.
We will not tune the criterion hyperparameter as we know from hyperparameter tuning for decision trees that entropy is a better splitting criterion for this data.
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [100, 110, 120],
"max_depth": [5, 6, 7],
"max_features": [0.8, 0.9, 1]
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned_base = grid_obj.best_estimator_
# Note: GridSearchCV refits the best estimator on the full training data by default (refit = True), so no separate fit is needed
# Checking performance on the training data
y_pred_train4 = rf_estimator_tuned_base.predict(X_train)
metrics_score(y_train, y_pred_train4)
precision recall f1-score support
0 0.91 0.92 0.91 2273
1 0.80 0.78 0.79 955
accuracy 0.88 3228
macro avg 0.86 0.85 0.85 3228
weighted avg 0.88 0.88 0.88 3228
# Checking performance on the testing data
y_pred_test4 = rf_estimator_tuned_base.predict(X_test)
metrics_score(y_test, y_pred_test4)
precision recall f1-score support
0 0.88 0.92 0.90 962
1 0.79 0.73 0.76 422
accuracy 0.86 1384
macro avg 0.84 0.82 0.83 1384
weighted avg 0.86 0.86 0.86 1384
While less overfitted than before, the model still performs slightly better on the training data than on the test data.
Precision for the converted class is still higher than recall, which we will need to improve on to surpass our tuned decision tree.
The overall accuracy is slightly higher than that of our tuned decision tree, suggesting that a random forest is a good choice.
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [110, 120],
"max_depth": [6, 7],
"min_samples_leaf": [20, 25],
"max_features": [0.8, 0.9],
"max_samples": [0.9, 1],
"class_weight": ["balanced",{0: 0.3, 1: 0.7}]
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search on the training data using scorer=scorer and cv=5
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Save the best estimator to variable rf_estimator_tuned
rf_estimator_tuned = grid_obj.best_estimator_
#Fit the best estimator to the training data
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', criterion='entropy',
max_depth=6, max_features=0.8, max_samples=0.9,
min_samples_leaf=25, n_estimators=120, random_state=7)
Let’s check the performance of the tuned model
# Checking performance on the training data
y_pred_train5 = rf_estimator_tuned.predict(X_train)
metrics_score(y_train, y_pred_train5)
precision recall f1-score support
0 0.94 0.83 0.88 2273
1 0.68 0.87 0.76 955
accuracy 0.84 3228
macro avg 0.81 0.85 0.82 3228
weighted avg 0.86 0.84 0.84 3228
Let’s check the model performance on the test data
# Checking performance on the test data
y_pred_test5 = rf_estimator_tuned.predict(X_test)
metrics_score(y_test, y_pred_test5)
precision recall f1-score support
0 0.93 0.83 0.87 962
1 0.68 0.85 0.76 422
accuracy 0.83 1384
macro avg 0.81 0.84 0.82 1384
weighted avg 0.85 0.83 0.84 1384
importances = rf_estimator_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize = (12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'blue', align = 'center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
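For easier comparison with the tuned decision tree, the random forest's importances can also be printed as a sorted table, mirroring the earlier step:
# tabulate the tuned random forest's feature importances, largest first
print(pd.DataFrame(rf_estimator_tuned.feature_importances_, columns = ['Importance'],
                   index = X_train.columns).sort_values(by = 'Importance', ascending = False))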
Goal
The goal was to maximize the Recall value: the higher the Recall score, the fewer False Negatives the model makes.
False Negatives: predicting that a visitor will not convert to a paying customer when, in reality, the visitor would have converted, which means lost revenue for the business.
Process
Two models performed very well on the test data:
Tuned Decision Tree Model - recall 86%, class-1 F1-score 72%, and macro-average recall 82%
Hyper-Tuned Random Forest Model - recall 85%, class-1 F1-score 76%, and macro-average recall 84%
Model Recommendation
The Hyper-Tuned Random Forest Model is recommended for future use: its class-1 F1-score on the test data is 4 percentage points higher than the tuned decision tree's, and it delivers a recall of 85% and a macro-average recall of 84% while making fewer false-positive errors (higher precision).
4 Most Important Features identified by both models
- time_spent_on_website
- first_interaction (Website)
- profile_completed (Medium)
- age
Business Recommendations
ExtraaLearn's representatives should prioritize visitors based on:
- time_spent_on_website
- first_interaction via the website
- profile_completed
- age
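As a practical follow-up, the recommended model can be used to rank incoming leads by predicted conversion probability so that representatives contact the most promising leads first. A minimal sketch, assuming new leads arrive in a dataframe named new_leads with the same raw columns as the training data (new_leads is hypothetical here):
# score hypothetical new leads with the tuned random forest and rank them for follow-up
new_X = pd.get_dummies(new_leads, drop_first = True)
# align the dummy columns with the training features, filling any missing ones with 0
new_X = new_X.reindex(columns = X_train.columns, fill_value = 0)
new_leads['conversion_probability'] = rf_estimator_tuned.predict_proba(new_X)[:, 1]
# highest-probability leads first
priority_list = new_leads.sort_values('conversion_probability', ascending = False)
print(priority_list.head())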