How I built a lead scoring system that tells sales teams exactly who to call first

TL;DR

I built a lead scoring system that ingests three families of signals for every lead in the pipeline: behavioral data (website visits, pricing page views, demo requests), email engagement (clicks, opens, unsubscribes), and firmographic data (job title, company size, decision-maker status). It combines these signals into a conversion probability and assigns every lead to one of four action tiers: Hot, Warm, Nurture, and Suppress. The system is built with Python, FastAPI, and Streamlit, and runs in production at a fraction of what tools like HubSpot or Pipedrive charge — and delivers capabilities they don’t offer out of the box.

The Problem

Your sales team starts the week with 200 leads. Some came from the pricing page. Some downloaded a whitepaper six months ago. Some attended a webinar. A few are decision-makers at companies that fit your ICP perfectly. Most are not.

Without a system, the default is gut feel and whoever shouted loudest last. Reps spend Monday chasing leads that haven’t opened an email in weeks, while the prospect who visited the pricing page three times this month waits until Thursday — by which point they’ve already booked a demo with a competitor.

The cost isn’t just missed deals. It’s the invisible tax on every salesperson’s week: time spent on low-intent prospects who were never going to convert, while high-intent leads go cold.

The question is simple: what if you could know, before making a single call, which leads are worth your team’s time?

What the System Does

The lead scoring system ingests three families of signals for every lead in the pipeline: behavioral data (website visits, pricing page views, demo requests), email engagement (clicks, opens, unsubscribes), and firmographic data (job title, company size, decision-maker status). It combines these signals into a conversion probability and assigns every lead to one of four action tiers:

Hot — call within 2 hours
Warm — follow up within 24 hours
Nurture — weekly touch, not urgent
Suppress — remove from active pipeline
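
As a rough illustration, tier assignment can be thought of as thresholding the conversion probability. The sketch below assumes three cut-offs (0.50, 0.30, 0.12) that are hypothetical placeholders, not the production values; the names `TIER_THRESHOLDS`, `TierDecision`, and `assign_tier` are likewise made up for this example.

```python
from dataclasses import dataclass

# Hypothetical probability cut-offs; the production thresholds are not stated
# in this post and would normally be tuned against observed conversion rates.
TIER_THRESHOLDS = [
    (0.50, "Hot", "call within 2 hours"),
    (0.30, "Warm", "follow up within 24 hours"),
    (0.12, "Nurture", "weekly touch, not urgent"),
]
SUPPRESS = ("Suppress", "remove from active pipeline")


@dataclass(frozen=True)
class TierDecision:
    tier: str
    action: str


def assign_tier(probability: float) -> TierDecision:
    """Map a conversion probability in [0, 1] to an action tier."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Thresholds are listed from highest to lowest, so the first match wins.
    for cutoff, tier, action in TIER_THRESHOLDS:
        if probability >= cutoff:
            return TierDecision(tier, action)
    return TierDecision(*SUPPRESS)
```

Keeping the cut-offs in one table makes it easy to adjust them as the pipeline mix changes without touching the scoring logic.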

Critically, every scored lead comes with a plain-English explanation: the top three reasons it scored high, and any risk factors dragging it down. Sales reps don’t just get a number — they get context. A rep who understands why a lead is hot is far more likely to trust the system and act on it than one handed an opaque score with no rationale.

The Results

The numbers make the business case clearly:

Hot + Warm leads represent just 38% of the pipeline — but capture 64% of all conversions.

Breaking it down by tier:

Tier	Share of Pipeline	Conversion Rate	Conversions Captured
Hot	12%	58.5%	27%
Warm	26%	36.2%	37%
Nurture	39%	18.5%	28%
Suppress	23%	8.2%	7%

The Hot tier converts at more than double the overall baseline rate. The top 20% of scored leads capture over 50% of all conversions.
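
The "Conversions Captured" column and the baseline comparison follow directly from the other two columns. The short calculation below uses only the pipeline shares and conversion rates from the table; the function and variable names are illustrative. It gives an overall baseline rate of roughly 25.5%, a Hot-tier lift of about 2.3x, and a combined Hot + Warm capture of about 64%.

```python
# Tier -> (share of pipeline, conversion rate), taken from the table above.
TIERS = {
    "Hot": (0.12, 0.585),
    "Warm": (0.26, 0.362),
    "Nurture": (0.39, 0.185),
    "Suppress": (0.23, 0.082),
}


def baseline_rate(tiers):
    """Overall conversion rate: pipeline-share-weighted average of tier rates."""
    return sum(share * rate for share, rate in tiers.values())


def conversions_captured(tiers):
    """Fraction of all conversions that fall in each tier."""
    total = baseline_rate(tiers)
    return {name: share * rate / total for name, (share, rate) in tiers.items()}


baseline = baseline_rate(TIERS)
captured = conversions_captured(TIERS)
hot_lift = TIERS["Hot"][1] / baseline            # Hot rate relative to baseline
hot_warm_captured = captured["Hot"] + captured["Warm"]

print(f"baseline rate: {baseline:.1%}")
print(f"Hot lift over baseline: {hot_lift:.2f}x")
print(f"Hot + Warm conversions captured: {hot_warm_captured:.0%}")
```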

A sales team following this system could focus on fewer than half their leads and still reach nearly two-thirds of the deals. That is not a marginal efficiency improvement — it is a structural shift in how the team operates.

The Suppress tier is equally valuable. At an 8.2% conversion rate, less than a third of the roughly 25% pipeline-wide rate implied by the table, active sales time spent on those leads is hard to justify. The system doesn’t just tell you who to call; it tells you who to stop calling.

How It Works

Data & Feature Engineering

The model is trained on 33 signals across three families. Behavioral signals — pricing page visits, demo requests, trial engagement — carry the heaviest weight because they are the hardest to fake and the most predictive of intent. Email engagement is included, with clicks weighted more heavily than opens: clicks require deliberate action; opens can be accidental or auto-triggered by mail clients.

Firmographic signals (decision-maker flag, company size, job title) provide context. One of the more interesting findings from the feature engineering stage: a decision-maker who visited the pricing page earns a bonus multiplier — the interaction effect between those two signals is more predictive than either signal in isolation.
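
One common way to expose an interaction like this to a linear model is an explicit product feature. The sketch below is an assumption about how that could look, not the actual feature pipeline: the raw field names, the `dm_x_pricing` feature, and the 3:1 click-to-open ratio are all hypothetical.

```python
def engineer_features(lead: dict) -> dict:
    """Derive model inputs from raw lead fields (field names are hypothetical)."""
    pricing_visits = int(lead.get("pricing_page_visits", 0))
    is_dm = int(bool(lead.get("is_decision_maker", False)))
    clicks = int(lead.get("email_clicks", 0))
    opens = int(lead.get("email_opens", 0))
    return {
        "pricing_page_visits": pricing_visits,
        "is_decision_maker": is_dm,
        # Interaction term: non-zero only when a decision-maker viewed pricing,
        # so the model can learn a separate weight for the combination.
        "dm_x_pricing": is_dm * pricing_visits,
        # Clicks count more than opens; the 3:1 ratio is an illustrative guess.
        "email_engagement": 3 * clicks + opens,
    }


features = engineer_features({
    "pricing_page_visits": 3,
    "is_decision_maker": True,
    "email_clicks": 2,
    "email_opens": 4,
})
```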

Figure: Feature importance. Correlation matrix of the 33 features.

Rule-Based Baseline

Before training any ML model, I built a hand-crafted rule system from the same 33 signals — assigning point weights to each signal based on its correlation with conversion, with asymmetric penalties for negative signals like unsubscribes (–20 points) and careers page visits. The goal was to establish a credible baseline: something a domain expert might actually build and ship.
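
A point-based rule system of this kind can be sketched in a few lines. In the example below only the unsubscribe penalty (-20) comes from the text; every other weight, the signal names, and the clamping to a 0-100 range are hypothetical choices made for illustration.

```python
# Point weights per signal. Only the unsubscribe penalty (-20) is taken from
# the post; all other values are hypothetical placeholders.
RULE_WEIGHTS = {
    "demo_requested": 30,
    "pricing_page_visit": 15,
    "email_click": 5,
    "email_open": 1,
    "unsubscribed": -20,
    "careers_page_visit": -10,
}


def rule_score(signals: dict) -> int:
    """Sum weight * count over known signals, clamped to the range 0..100."""
    raw = sum(
        RULE_WEIGHTS[name] * int(count)
        for name, count in signals.items()
        if name in RULE_WEIGHTS  # silently ignore signals without a rule
    )
    return max(0, min(100, raw))


engaged = rule_score({"demo_requested": 1, "pricing_page_visit": 2, "email_click": 1})
churned = rule_score({"email_open": 3, "unsubscribed": 1})
```

Here `engaged` is 65 (30 + 2 × 15 + 5), while `churned` sums to -17 and is clamped to 0, which shows how a single negative signal can outweigh several weak positive ones.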

The result was AUC 0.739 — well above random (0.50) and already good enough to be useful. Hot + Warm leads under the rule system captured 64% of conversions, and the Hot tier alone converted at 58.5%. This mattered for two reasons: it validated that the underlying signal structure in the data was strong, and it set a meaningful bar for the ML layer to beat.
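
For readers less familiar with the metric: AUC is the probability that a randomly chosen converter receives a higher score than a randomly chosen non-converter, which is why 0.50 corresponds to random ordering. A minimal pure-Python version of that definition is below; the function name is illustrative, and in practice a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.

```python
def roc_auc(labels, scores):
    """AUC as the share of (converter, non-converter) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(n_pos * n_neg) time, which
    is fine for a dataset of a few hundred leads.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four (converter, non-converter) pairs are ordered correctly.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.3, 0.5, 0.1])
```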

Figure: Rule-based baseline performance.

Modeling

With the baseline established, two models were trained and compared head-to-head: Logistic Regression and Random Forest, evaluated using 5-fold cross-validation to produce honest out-of-sample metrics.

Logistic Regression won (AUC 0.789 vs. 0.777). The reason is dataset size: at 800 leads, a well-regularised linear model generalises better than a tree ensemble that needs more data to reliably exploit non-linear interactions. More model complexity is not always better — it has to be justified by the data. The production model is the simpler one, and it meaningfully outperforms the hand-crafted baseline.
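
A head-to-head comparison of this shape can be set up with scikit-learn as sketched below. The data here is a synthetic stand-in generated with `make_classification` (800 rows, 33 features), and the hyperparameters are assumptions, so the printed AUCs will not match the 0.789 and 0.777 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real lead data: 800 rows, 33 features.
X, y = make_classification(
    n_samples=800, n_features=33, n_informative=10, random_state=0
)

models = {
    # Scaling lives inside the pipeline so it is re-fit on each training fold,
    # which avoids leaking held-out statistics into training.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Same 5 stratified folds for both models, so the comparison is like-for-like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, auc in mean_auc.items():
    print(f"{name}: mean 5-fold AUC = {auc:.3f}")
```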

Figure: Model comparison. Model evaluation dashboard, baseline vs. ML models.

Explainability

SHAP (SHapley Additive exPlanations) values are computed for every lead at scoring time. Each output includes the top positive signals driving the score and any negative signals suppressing it — rendered as plain English, not coefficient values.
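
To show how per-feature attributions can become readable reasons, the sketch below relies on a known property of linear models: assuming independent features, the SHAP value of a feature in log-odds space is its coefficient times the lead's deviation from the feature mean. It does not call the `shap` library, and the weights, means, feature names, and phrasing are all hypothetical.

```python
def explain(weights, means, lead, labels, top_n=3):
    """Return (top positive reasons, risk factors) for one lead.

    contribution_i = weights[i] * (lead[i] - means[i]) is the linear-model
    SHAP value in log-odds space under a feature-independence assumption.
    labels[name] is a pair: (text if pushing the score up, text if down).
    """
    contributions = {
        name: weights[name] * (lead[name] - means[name]) for name in weights
    }
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    reasons = [labels[n][0] for n, c in ranked[:top_n] if c > 0]
    # Walk from the most negative contribution upwards for the risk factors.
    risks = [labels[n][1] for n, c in reversed(ranked) if c < 0][:top_n]
    return reasons, risks


# Hypothetical coefficients, training-set means, and one lead.
weights = {"pricing_page_visits": 0.8, "is_decision_maker": 0.6,
           "email_clicks": 0.3, "unsubscribed": -1.5}
means = {"pricing_page_visits": 1.0, "is_decision_maker": 0.3,
         "email_clicks": 2.0, "unsubscribed": 0.1}
labels = {
    "pricing_page_visits": ("Visited the pricing page more than average",
                            "Few or no pricing page visits"),
    "is_decision_maker": ("Is a decision-maker", "Is not a decision-maker"),
    "email_clicks": ("Clicks emails more than average",
                     "Fewer email clicks than average"),
    "unsubscribed": ("Has not unsubscribed", "Unsubscribed from emails"),
}
lead = {"pricing_page_visits": 3, "is_decision_maker": 1,
        "email_clicks": 1, "unsubscribed": 0}

reasons, risks = explain(weights, means, lead, labels)
```

For this lead the strongest reason is the pricing page activity, and the only risk factor is below-average email clicks.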

This matters for adoption. A system that produces unexplained scores creates friction: reps second-guess it, managers override it, and the business value erodes. Explainability is not a nice-to-have — it is a prerequisite for a scoring system that actually gets used.

Figure: Tier analysis. Distribution of scores by tier.

Figure: SHAP importance. Top 15 features and their impact on the score.

Tech Stack

Python / scikit-learn / SHAP — modeling, calibration, and per-lead explainability
FastAPI — production API with single-lead scoring and CSV batch endpoints
Streamlit — interactive dashboard for exploring scored leads and simulating new ones
Docker + GCP Cloud Run — containerised, deployed, and auto-scaling
Cloud Scheduler — automated weekly scoring runs, no manual intervention required

The system runs in production at a fraction of what tools like HubSpot or Pipedrive charge — and delivers capabilities they don’t offer out of the box.

Want This for Your Pipeline?

If your team is spending time on leads without knowing which ones are worth it, this is a solvable problem.

See the code — GitHub repository
Let’s talk — Contact
