ExtraaLearn Project

Context

The EdTech industry has grown strongly over the past decade. According to one forecast, the online education market would be worth $286.62bn by 2023, with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has contributed a great deal to this growth and expansion. Features such as ease of information sharing, personalized learning experiences, and transparent assessment have made it preferable to traditional education.

In the present scenario, due to Covid-19, the online education sector has witnessed rapid growth and is attracting many new customers. This growth has brought many new companies into the industry. With digital marketing resources readily available and easy to use, companies can reach a wider audience with their offerings. Customers who show interest in these offerings are termed leads. There are various sources of obtaining leads for EdTech companies, like

The company then nurtures these leads and tries to convert them to paid customers. To do this, a representative from the organization contacts the lead by phone or email to share further details.

Objective

ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues ExtraaLearn faces is identifying which leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:

Data Description

The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Load the data

Data Overview

There are 4612 unique IDs. Each row has its own unique ID, so this column adds no value and can be dropped.
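A minimal sketch of that check, assuming the data sits in a pandas DataFrame and the identifier column is named `ID` (the column name and the toy rows are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the leads data; the real data has 4612 rows.
data = pd.DataFrame(
    {
        "ID": ["EXT001", "EXT002", "EXT003", "EXT004"],  # assumed column name
        "age": [57, 56, 52, 23],
        "status": [1, 0, 0, 1],
    }
)

# If every row has its own ID, the column carries no predictive information.
id_is_unique = data["ID"].nunique() == len(data)

if id_is_unique:
    data = data.drop(columns="ID")

print(data.columns.tolist())
```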

Observations

Exploratory Data Analysis (EDA)

Questions

  1. Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.
  2. The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status?
  3. The company uses multiple modes to interact with prospects. Which way of interaction works best?
  4. The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels has the highest lead conversion rate?
  5. People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?

Univariate Data Analysis

Summary of all data columns

Identify the percentage and count of each group within the categorical variables

Percentage

Count of each group within the categorical variables
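One way to get both views is `value_counts`, with and without `normalize=True`. The sketch below uses a few made-up rows; the column names follow the data set, while the category values and the `summaries` dictionary are assumptions for illustration.

```python
import pandas as pd

# Toy rows; the values are made up.
data = pd.DataFrame(
    {
        "current_occupation": ["Professional", "Professional", "Student", "Unemployed"],
        "first_interaction": ["Website", "Mobile App", "Website", "Website"],
    }
)

cat_cols = ["current_occupation", "first_interaction"]
summaries = {}

for col in cat_cols:
    counts = data[col].value_counts()                            # rows per group
    percentages = data[col].value_counts(normalize=True) * 100   # share per group, in %
    summaries[col] = pd.DataFrame({"count": counts, "percent": percentages.round(1)})
    print(summaries[col], end="\n\n")
```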

Observations

Categorical Data

Status

How many of the visitors are converted to customers?

Observation Summary

The majority of the visitors are professionals (56.7%). The website (55%) is the primary first interaction with ExtraaLearn. Only 2% of the visitors have a low level of profile completion. Email (49%) is the most common last activity of the visitor. Very few visitors have interacted with ExtraaLearn through advertisements or referrals: newspaper ads (10%), magazine ads (5%), digital media (11%), educational channels (15%), and referrals (2%).

Only 1377 (approx. 30%) of the 4612 visitors converted to customers.
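The share quoted above can be checked with a line of arithmetic. It also makes the class imbalance explicit: roughly 30% converted against 70% not converted.

```python
converted = 1377  # visitors who became customers
total = 4612      # all visitors in the data

conversion_rate = converted / total * 100
print(f"{conversion_rate:.1f}% converted, {100 - conversion_rate:.1f}% did not")
# prints: 29.9% converted, 70.1% did not
```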

Data Preprocessing

Distributions and Outliers

Create count plots to identify the distribution of the data

Create box plots to determine if there are any outliers in the data.

Observations

age: There is a negative skew (-0.72), with most visitors approximately 55 years of age.
website_visits: There is a positive skew (2.16). Most visitors visit 0 to 5 times, and the frequency decreases from there. The box plot shows outliers.
time_spent_on_website: There is a positive skew (0.95), with most visitors spending between 0 and 250 on the site.
page_views_per_visit: There is a positive skew (1.27), with most visitors viewing between 2.5 and 5 pages per visit. The box plot shows outliers.

Identifying outliers
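A small sketch of how skew and outliers could be quantified for one numeric column. The values are made up, and the 1.5 * IQR rule is an assumption here: it is the usual box-plot whisker convention, not necessarily the rule used in the original notebook.

```python
import pandas as pd

# Made-up values for one numeric column; 30 plays the role of an extreme visitor.
website_visits = pd.Series([0, 1, 2, 2, 3, 3, 4, 5, 5, 30], name="website_visits")

# A positive skew means a long right tail.
print("skew:", round(website_visits.skew(), 2))

# Flag points more than 1.5 * IQR below Q1 or above Q3.
q1, q3 = website_visits.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = website_visits[(website_visits < lower) | (website_visits > upper)]
print(outliers.tolist())  # [30]
```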

Bivariate Data Analysis

Pairplot Summary

A pairplot visualizes the pairwise relationships between variables, which can be continuous or categorical. The color shows whether the visitor was converted to a customer or not. Orange marks those who are now customers.

age

current_occupation

Analyze the data based on the current_occupation.
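A row-normalised crosstab is one simple way to compare conversion rates across occupations. The rows below are made up, and the coding of `status` (1 = converted, 0 = not converted) is an assumption. The same pattern would apply to `first_interaction` or `last_activity`.

```python
import pandas as pd

# Toy rows; status 1 = converted, 0 = not converted (values are made up).
data = pd.DataFrame(
    {
        "current_occupation": [
            "Professional", "Professional", "Professional",
            "Student", "Student",
            "Unemployed", "Unemployed", "Unemployed",
        ],
        "status": [1, 1, 0, 0, 0, 1, 0, 0],
    }
)

# Each row sums to 1, so column 1 is the conversion rate for that occupation.
rates = pd.crosstab(data["current_occupation"], data["status"], normalize="index")
print(rates.round(2))
```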

first_interaction

time_spent_on_website

profile_completed

referral

Observations

Correlation Heat Map

Correlations Summary

Model Preparation

The goal is to determine which variables lead to a visitor converting to a paying customer.

Encoding the data

Shape
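A sketch of the encoding and split, using made-up rows. The use of `pd.get_dummies` with `drop_first=True`, the 70/30 split, and the stratification are assumptions rather than settings taken from the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows; column names follow the data set, values are made up.
data = pd.DataFrame(
    {
        "age": [57, 56, 52, 23, 45, 30, 61, 38, 27, 49],
        "current_occupation": ["Professional", "Student", "Unemployed", "Student", "Professional"] * 2,
        "first_interaction": ["Website", "Mobile App"] * 5,
        "status": [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    }
)

X = data.drop(columns="status")
y = data["status"]

# One-hot encode the categorical columns; drop_first removes one redundant dummy per feature.
X = pd.get_dummies(X, drop_first=True)

# Stratify so the train and test sets keep a similar share of converted leads.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)  # (7, 4) (3, 4)
```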

Building Classification Models

Before training the model, let's choose the appropriate model evaluation criterion as per the problem at hand.

Model evaluation criterion

The model can make wrong predictions in two ways:

  1. Predicting a visitor will not convert to a paying customer when in reality the visitor would convert (a False Negative).
  2. Predicting a visitor will convert to a paying customer when in reality the visitor does not convert (a False Positive).

Which case is more important?

The number of False Negatives should be minimized.

How to reduce the losses?

Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
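A possible shape for such a helper is sketched below. The name `metrics_score` and the choice to return recall are assumptions for illustration; the labels are made up so that the model misses one of four real converters.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score


def metrics_score(actual, predicted):
    """Print the classification report and confusion matrix; return recall for class 1."""
    print(classification_report(actual, predicted))
    # Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
    cm = confusion_matrix(actual, predicted, labels=[0, 1])
    print(cm)
    return recall_score(actual, predicted, pos_label=1)


# Made-up labels: 4 real converters, of which the model misses 1 (a False Negative).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

recall = metrics_score(actual, predicted)
print(f"recall = {recall:.2f}")  # TP / (TP + FN) = 3 / (3 + 1) = 0.75
```

Because recall is TP / (TP + FN), every False Negative removed raises it directly, which is why it suits the criterion chosen above.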

Building a Decision Tree model

Model Performance evaluation and improvement

Observations:

Reading confusion matrix (clockwise):

There is no error on the training set, i.e., each sample has been classified correctly.

The model fits the training data perfectly, so it is likely overfitted.
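The effect is easy to reproduce on synthetic data. In the sketch below, `make_classification` stands in for the leads data (an assumption), with roughly the same 70/30 class balance and some label noise. An unconstrained tree keeps splitting until every training leaf is pure, so training recall reaches 1.0 regardless of how well the tree generalises.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the leads data, with about 30% positives and 10% label noise.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.7, 0.3], flip_y=0.1, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# No depth limit: the tree memorises the training set.
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_recall = recall_score(y_train, tree.predict(X_train))
test_recall = recall_score(y_test, tree.predict(X_test))
print(f"train recall = {train_recall:.2f}, test recall = {test_recall:.2f}")
```

A gap between the two numbers is the usual sign that the tree needs to be constrained.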

Observations:

Decision Tree - Hyperparameter Tuning

We will use the class_weight hyperparameter with the value {0: 0.3, 1: 0.7}, which is approximately the inverse of the class imbalance in the original data (about 70% not converted and 30% converted).

This would tell the model that 1 is the important class here.
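A sketch of how the weighted tree could be tuned for recall with `GridSearchCV`. The parameter grid, the 5-fold cross-validation, and the synthetic data are assumptions; only the `class_weight` value comes from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=1)

# Mistakes on class 1 (converted) are penalised more than mistakes on class 0.
base_tree = DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, random_state=1)

# Hypothetical grid; the values searched in the original notebook are not shown.
param_grid = {
    "max_depth": [2, 4, 6],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [5, 10, 20],
}

# Select the combination with the best cross-validated recall on class 1.
scorer = make_scorer(recall_score, pos_label=1)
search = GridSearchCV(base_tree, param_grid, scoring=scorer, cv=5)
search.fit(X, y)

tuned_tree = search.best_estimator_
print(search.best_params_)
```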

We have tuned the model and fit the tuned model on the training data. Now, let's check the model performance on the training and testing data.

Observations

Let's check the model performance on the testing data

Observations

Let's visualize the tuned decision tree and observe the decision rules:

Blue represents those who converted to paying customers (class = y[1]).

Orange represents those who did not convert to paying customers (class = y[0]).
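When a plotted tree is hard to read, the same decision rules can be printed as text with `sklearn.tree.export_text`. This is an alternative to the colored plot described above, not a reproduction of it. The data is synthetic, and the feature names are placeholders attached to random columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; the names below are placeholders, not the real feature values.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
feature_names = ["time_spent_on_website", "age", "website_visits", "page_views_per_visit"]

# A shallow tree keeps the printed rules short.
tree = DecisionTreeClassifier(
    max_depth=2, class_weight={0: 0.3, 1: 0.7}, random_state=1
).fit(X, y)

# Each indented line is one split; leaves end with the predicted class.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```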

Observations

Let's look at the feature importance of the tuned decision tree model
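Feature importances are available on any fitted tree through `feature_importances_`. The sketch below ranks them on synthetic data; the generic column names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder column names.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
X = pd.DataFrame(X, columns=["feature_a", "feature_b", "feature_c", "feature_d"])

tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# Importances are normalised to sum to 1; larger means the feature reduced impurity more.
importances = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.round(3))
```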

Observations

time_spent_on_website

first_interaction (Website)

profile_completed (Medium)

age

last_activity (Website Activity)

Building a Random Forest model

Random Forest Classifier
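A baseline random forest can be fitted in a few lines. The synthetic data, the number of trees, and the split are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded leads data.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.7, 0.3], flip_y=0.1, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Default settings apart from the seed; each tree sees a bootstrap sample of the rows.
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)

for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    print(name, "recall:", round(recall_score(y_part, rf.predict(X_part)), 2))
```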

Let's check the performance of the model on the training data

Observations:

Let's check the performance on the testing data

Observations:

Let's see if we can get a better model by tuning the random forest classifier

Random Forest Classifier - Hyperparameter Tuning Model
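Tuning the forest could follow the same recall-driven grid search used for the decision tree. The grid, the 3-fold cross-validation, and the reuse of `class_weight={0: 0.3, 1: 0.7}` for the forest are all assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=1)

# Assumed: the same class weights as the tuned decision tree.
rf = RandomForestClassifier(class_weight={0: 0.3, 1: 0.7}, random_state=1)

# Hypothetical grid; shallow trees and larger leaves act as regularisation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 6],
    "min_samples_leaf": [10, 20],
}

scorer = make_scorer(recall_score, pos_label=1)
search = GridSearchCV(rf, param_grid, scoring=scorer, cv=3)
search.fit(X, y)

tuned_rf = search.best_estimator_
print(search.best_params_, round(search.best_score_, 2))
```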

Observations:

Observations

Random Forest Classifier - Hyperparameter Tuning Model 2

Let's check the performance of the tuned model

Let's check the model performance on the test data

Observations:

Feature Importance - Random Forest

Observations:

Actionable Insights and Recommendations

Goal

The goal was to maximize the Recall value. The higher the Recall score, the fewer False Negatives the model produces.
False Negatives: Predicting that a visitor will not convert to a paying customer when in reality the visitor would convert, which costs the business revenue.

Process

Two models performed very well:

  1. Tuned Model - recall 86%, f1-score 72% and macro average 82%

  2. Hyper Tuned Random Forest Model - recall 85%, f1-score 76% and macro average 84%

Conclusions:

Model Recommendation

The Hyper Tuned Random Forest Model is recommended for future use as it had a 4% overall improvement. Its Recall score of 85% on the test data is within one point of the Tuned Model's 86%, and its f1-score (76%) and macro average (84%) are the higher of the two.

4 Most Important Features identified by both models

time_spent_on_website

first_interaction (Website)

profile_completed (Medium)

age

Business Recommendations

ExtraaLearn's representatives should prioritize visitors based on:

  1. The visitor's time_spent_on_website.
  2. The visitor's first_interaction on the website.
  3. The visitor's profile_completed.
  4. The visitor's age.