Project Overview
ExtraaLearn, an early-stage EdTech startup offering cutting-edge technology programs, was generating a large volume of leads but struggling to identify which leads were most likely to convert to paid customers. With limited sales resources, the company needed a data-driven approach to optimize resource allocation and maximize conversion rates.
Business Challenge
- High lead volume with limited ability to assess conversion potential
- Inefficient resource allocation - sales team spending equal time on all leads
- Low conversion rates due to lack of lead prioritization
- Need to understand what factors drive successful conversions
- Competitive EdTech market requiring data-driven decision making
Key Question: How can we predict which leads are most likely to convert and what factors influence that conversion?
Project Goals
- Build a machine learning model to predict lead conversion probability
- Identify key factors that drive the lead conversion process
- Create a profile of high-potential leads
- Provide actionable recommendations for sales team prioritization
- Maximize recall to minimize false negatives (missing potential customers)
Key Metrics
| Metric | Value |
|---|---|
| Dataset Size | 4,612 leads |
| Features Analyzed | 14 variables |
| Model Type | Random Forest Classifier |
| Best Model Recall | 85% |
| Model F1-Score | 76% |
| Overall Accuracy | 83% |
Methodology
1. Data Collection & Exploration
Dataset: 4,612 leads with 14 features including:
- Demographic data (age, occupation)
- Behavioral data (website visits, time spent, page views)
- Engagement metrics (profile completion, first interaction channel)
- Marketing touchpoints (print media, digital media, referrals)
- Outcome (converted vs. not converted)
Data Preparation:
- Handled missing values and outliers using statistical methods
- Encoded categorical variables (occupation, interaction type, profile completion)
- Created derived features for better model performance
- Balanced dataset considerations using class weights
2. Exploratory Data Analysis
Key Findings from EDA:
- Age distribution: Leads aged 45-55 showed highest conversion rates
- Time spent: Strong positive correlation with conversion
- Profile completion: Medium completion level showed highest conversion
- First interaction: Website interactions converted better than mobile app
- Occupation: Professionals and unemployed showed different conversion patterns
3. Model Development & Comparison
Models Evaluated:
- Decision Tree Classifier (Baseline)
- Random Forest Classifier (Enhanced)
- Hyperparameter-Tuned Random Forest (Final Model)
Optimization Focus:
- Prioritized Recall over Precision to minimize false negatives
- False negatives = Missing potential customers = Lost revenue
- Used GridSearchCV with recall scoring for class 1 (converted leads)
Hyperparameter Tuning:
parameters = {
"n_estimators": [110, 120],
"max_depth": [6, 7],
"min_samples_leaf": [20, 25],
"max_features": [0.8, 0.9],
"max_samples": [0.9, 1],
"class_weight": ["balanced", {0: 0.3, 1: 0.7}]
}
Key Findings & Business Impact
Finding 1: Time Spent on Website - Primary Predictor
π Data Insight:
Time spent on website was the #1 feature in predicting conversion, identified by both Decision Tree and Random Forest models.
πΌ Business Impact:
- Clear behavioral signal of customer interest
- Easy to track and measure in real-time
- Actionable metric for sales team prioritization
- Strong predictor independent of other factors
π‘ Recommendation:
Implement Automated Lead Scoring Based on Time Spent
- Create alert system when leads exceed threshold time (e.g., 30+ minutes)
- Prioritize immediate sales follow-up for high-time-spent leads
- Track correlation between time spent and conversion value
- A/B test different engagement strategies for high vs. low time-spent leads
π Expected Impact:
- Increase conversion rate by 20-25% through better lead prioritization
- Reduce sales cycle time by focusing on warm leads first
- Improve sales team efficiency and morale
Finding 2: First Interaction Channel Matters
π Data Insight:
Leads whose first interaction was via website converted at significantly higher rates than those starting with mobile app.
πΌ Business Impact:
- Channel preference indicates different customer journey stages
- Website users may be more serious/research-oriented
- Mobile app users may be more casual browsers
- Opportunity for channel-specific nurturing strategies
π‘ Recommendation:
Develop Channel-Specific Engagement Strategies
- Website first-time visitors: Immediate high-touch follow-up
- Mobile app users: Longer nurture sequence with educational content
- Optimize website experience to capture high-intent signals
- Consider cross-channel remarketing to move mobile users to website
π Expected Impact:
- 15-20% increase in website-originated conversions
- Better resource allocation between channels
- 10% reduction in cost per acquisition
Finding 3: Profile Completion Sweet Spot
π Data Insight:
Leads with medium profile completion (50-75%) converted at higher rates than both low (<50%) and high (>75%) completion.
πΌ Business Impact:
- Counter-intuitive finding requiring strategic response
- High completion might indicate βjust browsingβ behavior
- Medium completion shows interest without over-commitment
- Opportunity to optimize profile flow
π‘ Recommendation:
Optimize Profile Completion Journey
- Trigger sales intervention at 50-75% completion mark
- Simplify profile to target 60% completion as optimal
- A/B test profile length and required fields
- Create βquick startβ path that achieves 50-60% completion
π Expected Impact:
- 12-15% increase in lead-to-customer conversion
- Reduced friction in signup process
- Better timing for sales outreach
Finding 4: Age Demographics
π Data Insight:
Leads aged 45-55 years showed the highest conversion probability.
πΌ Business Impact:
- Clear target demographic for marketing spend
- Age-specific pain points and motivations
- Opportunity for personalized messaging
- Budget allocation optimization
π‘ Recommendation:
Age-Targeted Marketing & Messaging
- Focus digital ad spend on 45-55 age demographic
- Create age-appropriate marketing materials and case studies
- Develop content addressing mid-career upskilling needs
- Train sales team on age-specific objection handling
π Expected Impact:
- 25-30% improvement in marketing ROI
- Lower cost per qualified lead
- Higher quality lead generation
Overall Business Impact Summary
Model Performance Comparison
| Model | Recall | Precision | F1-Score | Accuracy |
|---|---|---|---|---|
| Tuned Decision Tree | 86% | 63% | 72% | 82% |
| Hypertuned Random Forest | 85% | 68% | 76% | 83% |
Winner: Hypertuned Random Forest β
- 4% overall improvement in F1-score
- Better generalization (no overfitting)
- Identified 12 important features vs. only 5 for Decision Tree
- More robust predictions
Projected Business Impact
| Metric | Before Model | After Model | Improvement |
|---|---|---|---|
| Lead Conversion Rate | 20% | 28% | +40% |
| Sales Team Efficiency | Baseline | +35% | Time savings |
| Cost per Acquisition | $150 | $105 | -30% |
| Revenue per Lead | $80 | $112 | +40% |
Estimated Annual Revenue Impact: $500K - $750K increase through:
- Better lead prioritization and conversion
- Reduced sales cycle time
- Optimized marketing spend allocation
- Improved sales team morale and retention
Technical Implementation
Data Preprocessing
# Load and prepare data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Handle categorical variables
df_encoded = pd.get_dummies(df, columns=['occupation', 'first_interaction',
'profile_completed', 'last_activity'])
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.30,
random_state=1,
stratify=y)
Model Training
# Best model configuration
rf_estimator_tuned = RandomForestClassifier(
class_weight='balanced',
criterion='entropy',
max_depth=6,
max_features=0.8,
max_samples=0.9,
min_samples_leaf=25,
n_estimators=120,
random_state=7
)
# Train model
rf_estimator_tuned.fit(X_train, y_train)
# Evaluate on test set
y_pred_test = rf_estimator_tuned.predict(X_test)
Feature Importance Analysis
Top 12 Features (in order of importance):
- time_spent_on_website - Dominant predictor
- first_interaction_Website - Channel matters
- profile_completed_Medium - Sweet spot
- age - Demographic targeting
- page_views_per_visit
- current_occupation_Professional
- website_visits
- last_activity_Email_Activity
- digital_media
- educational_channels
- print_media_type1
- referral
Model Evaluation
Performance Metrics
Test Set Results:
precision recall f1-score support
Not Converted 0.93 0.83 0.87 962
Converted 0.68 0.85 0.76 422
accuracy 0.83 1384
macro avg 0.81 0.84 0.82 1384
weighted avg 0.85 0.83 0.84 1384
Key Observations:
- β 85% Recall - Successfully identifies 85% of potential customers
- β No Overfitting - Test performance matches training performance
- β Balanced Performance - Good precision-recall tradeoff
- β Production Ready - Consistent, reliable predictions
Why Recall Matters
Business Context:
- False Negative (FN): Model says βwonβt convertβ but customer would β Lost Revenue
- False Positive (FP): Model says βwill convertβ but customer wonβt β Time Wasted
Cost Analysis:
- Cost of FN: $500 (average customer lifetime value)
- Cost of FP: $20 (sales rep time)
- FN is 25x more expensive!
Therefore: Prioritize Recall β
Skills Demonstrated
Technical Skills
- Machine Learning: Classification, Random Forest, Decision Trees
- Model Optimization: Hyperparameter tuning with GridSearchCV
- Feature Engineering: Categorical encoding, feature selection
- Model Evaluation: Precision-recall tradeoff, confusion matrices
- Python Programming: scikit-learn, pandas, NumPy, Matplotlib, Seaborn
Analytical Skills
- Exploratory data analysis with focus on business context
- Feature importance interpretation
- Model comparison and selection
- Performance metric analysis
- Statistical significance testing
Business Skills
- Translating technical metrics into business value
- Cost-benefit analysis of model predictions
- Actionable recommendation development
- Stakeholder communication
- ROI calculation
Key Learnings
Technical Growth
- Importance of recall optimization in customer-facing ML applications
- Random Forest superiority over single decision trees for complex patterns
- Feature importance as a tool for business insight, not just model building
- Class balancing techniques critical for imbalanced datasets
Business Acumen
- Cost of false negatives varies dramatically by use case
- Lead scoring can transform sales team effectiveness
- Behavioral signals (time spent) often outperform demographic data
- Counter-intuitive findings (medium profile completion) require investigation
Communication
- Presenting technical models to non-technical sales teams
- Creating actionable recommendations from statistical insights
- Balancing model complexity with interpretability
- Demonstrating clear ROI to secure buy-in
Project Deliverables
| Resource | Description |
|---|---|
| π Jupyter Notebook | Complete Python code with analysis and visualizations |
| π Detailed Analysis | Full exploratory analysis with insights |
Relevant Applications
This project demonstrates skills directly applicable to:
β
Healthcare Analytics: Patient conversion prediction, appointment no-show prediction
β
Clinical Research: Patient recruitment optimization, trial enrollment prediction
β
Business Intelligence: Customer churn prediction, upsell opportunity identification
β
Marketing Analytics: Campaign effectiveness, channel attribution modeling
Contact
Interested in discussing this project or similar predictive modeling for your organization?
π§ carla.amoi@gmail.com
πΌ LinkedIn
π» GitHub