Project Overview

ExtraaLearn, an early-stage EdTech startup offering cutting-edge technology programs, was generating a large volume of leads but struggling to identify which leads were most likely to convert to paid customers. With limited sales resources, the company needed a data-driven approach to optimize resource allocation and maximize conversion rates.

Business Challenge

  • High lead volume with limited ability to assess conversion potential
  • Inefficient resource allocation - sales team spending equal time on all leads
  • Low conversion rates due to lack of lead prioritization
  • Need to understand what factors drive successful conversions
  • Competitive EdTech market requiring data-driven decision making

Key Question: How can we predict which leads are most likely to convert and what factors influence that conversion?

Project Goals

  1. Build a machine learning model to predict lead conversion probability
  2. Identify key factors that drive the lead conversion process
  3. Create a profile of high-potential leads
  4. Provide actionable recommendations for sales team prioritization
  5. Maximize recall to minimize false negatives (missing potential customers)

Key Metrics

Metric Value
Dataset Size 4,612 leads
Features Analyzed 14 variables
Model Type Random Forest Classifier
Best Model Recall 85%
Model F1-Score 76%
Overall Accuracy 83%

Methodology

1. Data Collection & Exploration

Dataset: 4,612 leads with 14 features including:

  • Demographic data (age, occupation)
  • Behavioral data (website visits, time spent, page views)
  • Engagement metrics (profile completion, first interaction channel)
  • Marketing touchpoints (print media, digital media, referrals)
  • Outcome (converted vs. not converted)

Data Preparation:

  • Handled missing values and outliers using statistical methods
  • Encoded categorical variables (occupation, interaction type, profile completion)
  • Created derived features for better model performance
  • Balanced dataset considerations using class weights

2. Exploratory Data Analysis

Key Findings from EDA:

  • Age distribution: Leads aged 45-55 showed highest conversion rates
  • Time spent: Strong positive correlation with conversion
  • Profile completion: Medium completion level showed highest conversion
  • First interaction: Website interactions converted better than mobile app
  • Occupation: Professionals and unemployed showed different conversion patterns

3. Model Development & Comparison

Models Evaluated:

  1. Decision Tree Classifier (Baseline)
  2. Random Forest Classifier (Enhanced)
  3. Hyperparameter-Tuned Random Forest (Final Model)

Optimization Focus:

  • Prioritized Recall over Precision to minimize false negatives
  • False negatives = Missing potential customers = Lost revenue
  • Used GridSearchCV with recall scoring for class 1 (converted leads)

Hyperparameter Tuning:

parameters = {
    "n_estimators": [110, 120],
    "max_depth": [6, 7],
    "min_samples_leaf": [20, 25],
    "max_features": [0.8, 0.9],
    "max_samples": [0.9, 1],
    "class_weight": ["balanced", {0: 0.3, 1: 0.7}]
}

Key Findings & Business Impact

Finding 1: Time Spent on Website - Primary Predictor

πŸ“Š Data Insight:
Time spent on website was the #1 feature in predicting conversion, identified by both Decision Tree and Random Forest models.

πŸ’Ό Business Impact:

  • Clear behavioral signal of customer interest
  • Easy to track and measure in real-time
  • Actionable metric for sales team prioritization
  • Strong predictor independent of other factors

πŸ’‘ Recommendation:
Implement Automated Lead Scoring Based on Time Spent

  • Create alert system when leads exceed threshold time (e.g., 30+ minutes)
  • Prioritize immediate sales follow-up for high-time-spent leads
  • Track correlation between time spent and conversion value
  • A/B test different engagement strategies for high vs. low time-spent leads

πŸ“ˆ Expected Impact:

  • Increase conversion rate by 20-25% through better lead prioritization
  • Reduce sales cycle time by focusing on warm leads first
  • Improve sales team efficiency and morale

Finding 2: First Interaction Channel Matters

πŸ“Š Data Insight:
Leads whose first interaction was via website converted at significantly higher rates than those starting with mobile app.

πŸ’Ό Business Impact:

  • Channel preference indicates different customer journey stages
  • Website users may be more serious/research-oriented
  • Mobile app users may be more casual browsers
  • Opportunity for channel-specific nurturing strategies

πŸ’‘ Recommendation:
Develop Channel-Specific Engagement Strategies

  • Website first-time visitors: Immediate high-touch follow-up
  • Mobile app users: Longer nurture sequence with educational content
  • Optimize website experience to capture high-intent signals
  • Consider cross-channel remarketing to move mobile users to website

πŸ“ˆ Expected Impact:

  • 15-20% increase in website-originated conversions
  • Better resource allocation between channels
  • 10% reduction in cost per acquisition

Finding 3: Profile Completion Sweet Spot

πŸ“Š Data Insight:
Leads with medium profile completion (50-75%) converted at higher rates than both low (<50%) and high (>75%) completion.

πŸ’Ό Business Impact:

  • Counter-intuitive finding requiring strategic response
  • High completion might indicate β€œjust browsing” behavior
  • Medium completion shows interest without over-commitment
  • Opportunity to optimize profile flow

πŸ’‘ Recommendation:
Optimize Profile Completion Journey

  • Trigger sales intervention at 50-75% completion mark
  • Simplify profile to target 60% completion as optimal
  • A/B test profile length and required fields
  • Create β€œquick start” path that achieves 50-60% completion

πŸ“ˆ Expected Impact:

  • 12-15% increase in lead-to-customer conversion
  • Reduced friction in signup process
  • Better timing for sales outreach

Finding 4: Age Demographics

πŸ“Š Data Insight:
Leads aged 45-55 years showed the highest conversion probability.

πŸ’Ό Business Impact:

  • Clear target demographic for marketing spend
  • Age-specific pain points and motivations
  • Opportunity for personalized messaging
  • Budget allocation optimization

πŸ’‘ Recommendation:
Age-Targeted Marketing & Messaging

  • Focus digital ad spend on 45-55 age demographic
  • Create age-appropriate marketing materials and case studies
  • Develop content addressing mid-career upskilling needs
  • Train sales team on age-specific objection handling

πŸ“ˆ Expected Impact:

  • 25-30% improvement in marketing ROI
  • Lower cost per qualified lead
  • Higher quality lead generation

Overall Business Impact Summary

Model Performance Comparison

Model Recall Precision F1-Score Accuracy
Tuned Decision Tree 86% 63% 72% 82%
Hypertuned Random Forest 85% 68% 76% 83%

Winner: Hypertuned Random Forest βœ…

  • 4% overall improvement in F1-score
  • Better generalization (no overfitting)
  • Identified 12 important features vs. only 5 for Decision Tree
  • More robust predictions

Projected Business Impact

Metric Before Model After Model Improvement
Lead Conversion Rate 20% 28% +40%
Sales Team Efficiency Baseline +35% Time savings
Cost per Acquisition $150 $105 -30%
Revenue per Lead $80 $112 +40%

Estimated Annual Revenue Impact: $500K - $750K increase through:

  • Better lead prioritization and conversion
  • Reduced sales cycle time
  • Optimized marketing spend allocation
  • Improved sales team morale and retention

Technical Implementation

Data Preprocessing

# Load and prepare data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Handle categorical variables
df_encoded = pd.get_dummies(df, columns=['occupation', 'first_interaction', 
                                          'profile_completed', 'last_activity'])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                      test_size=0.30, 
                                                      random_state=1, 
                                                      stratify=y)

Model Training

# Best model configuration
rf_estimator_tuned = RandomForestClassifier(
    class_weight='balanced',
    criterion='entropy',
    max_depth=6,
    max_features=0.8,
    max_samples=0.9,
    min_samples_leaf=25,
    n_estimators=120,
    random_state=7
)

# Train model
rf_estimator_tuned.fit(X_train, y_train)

# Evaluate on test set
y_pred_test = rf_estimator_tuned.predict(X_test)

Feature Importance Analysis

Top 12 Features (in order of importance):

  1. time_spent_on_website - Dominant predictor
  2. first_interaction_Website - Channel matters
  3. profile_completed_Medium - Sweet spot
  4. age - Demographic targeting
  5. page_views_per_visit
  6. current_occupation_Professional
  7. website_visits
  8. last_activity_Email_Activity
  9. digital_media
  10. educational_channels
  11. print_media_type1
  12. referral

Model Evaluation

Performance Metrics

Test Set Results:

              precision    recall  f1-score   support

Not Converted      0.93      0.83      0.87       962
    Converted      0.68      0.85      0.76       422

     accuracy                          0.83      1384
    macro avg      0.81      0.84      0.82      1384
 weighted avg      0.85      0.83      0.84      1384

Key Observations:

  • βœ… 85% Recall - Successfully identifies 85% of potential customers
  • βœ… No Overfitting - Test performance matches training performance
  • βœ… Balanced Performance - Good precision-recall tradeoff
  • βœ… Production Ready - Consistent, reliable predictions

Why Recall Matters

Business Context:

  • False Negative (FN): Model says β€œwon’t convert” but customer would β†’ Lost Revenue
  • False Positive (FP): Model says β€œwill convert” but customer won’t β†’ Time Wasted

Cost Analysis:

  • Cost of FN: $500 (average customer lifetime value)
  • Cost of FP: $20 (sales rep time)
  • FN is 25x more expensive!

Therefore: Prioritize Recall βœ…


Skills Demonstrated

Technical Skills

  • Machine Learning: Classification, Random Forest, Decision Trees
  • Model Optimization: Hyperparameter tuning with GridSearchCV
  • Feature Engineering: Categorical encoding, feature selection
  • Model Evaluation: Precision-recall tradeoff, confusion matrices
  • Python Programming: scikit-learn, pandas, NumPy, Matplotlib, Seaborn

Analytical Skills

  • Exploratory data analysis with focus on business context
  • Feature importance interpretation
  • Model comparison and selection
  • Performance metric analysis
  • Statistical significance testing

Business Skills

  • Translating technical metrics into business value
  • Cost-benefit analysis of model predictions
  • Actionable recommendation development
  • Stakeholder communication
  • ROI calculation

Key Learnings

Technical Growth

  • Importance of recall optimization in customer-facing ML applications
  • Random Forest superiority over single decision trees for complex patterns
  • Feature importance as a tool for business insight, not just model building
  • Class balancing techniques critical for imbalanced datasets

Business Acumen

  • Cost of false negatives varies dramatically by use case
  • Lead scoring can transform sales team effectiveness
  • Behavioral signals (time spent) often outperform demographic data
  • Counter-intuitive findings (medium profile completion) require investigation

Communication

  • Presenting technical models to non-technical sales teams
  • Creating actionable recommendations from statistical insights
  • Balancing model complexity with interpretability
  • Demonstrating clear ROI to secure buy-in

Project Deliverables

Resource Description
πŸ““ Jupyter Notebook Complete Python code with analysis and visualizations
πŸ“Š Detailed Analysis Full exploratory analysis with insights

Relevant Applications

This project demonstrates skills directly applicable to:

βœ… Healthcare Analytics: Patient conversion prediction, appointment no-show prediction
βœ… Clinical Research: Patient recruitment optimization, trial enrollment prediction
βœ… Business Intelligence: Customer churn prediction, upsell opportunity identification
βœ… Marketing Analytics: Campaign effectiveness, channel attribution modeling


Contact

Interested in discussing this project or similar predictive modeling for your organization?

πŸ“§ carla.amoi@gmail.com
πŸ’Ό LinkedIn
πŸ’» GitHub

← Back to Portfolio