
Building Your First Machine Learning Model

A step-by-step guide to building, training, and evaluating a machine learning model using scikit-learn with a real-world dataset.

1

Machine Learning, Minus the Buzzwords

Strip away the hype and machine learning is surprisingly straightforward: you give an algorithm a bunch of examples, and it figures out the pattern. That's it. No magic, no sentience — just pattern recognition at scale.

The three flavours you need to know:

  • Supervised learning — "Here are 10,000 emails labelled spam or not-spam. Learn to tell the difference." You provide both the input and the correct answer. This is 90% of real-world ML.
  • Unsupervised learning — "Here are 50,000 customer records. Find natural groupings." No labels — the algorithm discovers structure on its own.
  • Reinforcement learning — "Play this game a million times and figure out how to win." Trial and error with a reward signal. Cool, but niche.

This guide covers supervised learning — specifically, classification (predicting categories) and regression (predicting numbers). If you can build a solid supervised model, you can tackle the vast majority of business problems that land on a data scientist's desk.

2

The Workflow Every ML Project Follows

I've worked on ML projects across fintech, healthcare, and e-commerce, and the workflow is remarkably consistent. The algorithms change, the data changes, but the steps don't:

  1. Frame the problem — What exactly are you predicting? What does "success" look like? How will this model actually be used?
  2. Get the data — And accept that it will be messier than you expected
  3. Explore (EDA) — Distributions, correlations, missing values, outliers
  4. Clean and engineer features — This is where you'll spend 60–80% of your time
  5. Split into train/test sets — Non-negotiable. No peeking at the test set.
  6. Pick a model and train it — Start simple
  7. Evaluate — With the right metrics for your specific problem
  8. Iterate — Better features, different algorithms, hyperparameter tuning

Beginners tend to rush to step 6 — the "cool" part. Veterans know that steps 3 and 4 are where models are actually won or lost. A Random Forest on great features will crush a neural network on garbage features every time.

3

Exploratory Data Analysis: Know Your Data Before You Model It

Skipping EDA is the fastest way to build a model that looks great in your notebook and fails spectacularly in production. I've seen it happen.

Here's my standard EDA checklist — I run through this on every single dataset:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('housing_data.csv')

# Dimensions and types
print(df.shape)
print(df.dtypes)

# Missing data — how much and where?
print(df.isnull().sum().sort_values(ascending=False))

# Statistical summary
print(df.describe())

# Target variable distribution — is it balanced?
df['price_bracket'].value_counts().plot(kind='bar')
plt.title('Target Distribution')
plt.show()

# Correlations — which features relate to the target?
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='RdBu_r', center=0, fmt='.2f')
plt.title('Feature Correlations')
plt.tight_layout()
plt.show()

The questions I'm trying to answer:

  • Are there features with 40%+ missing values? (Probably drop them.)
  • Is my target variable heavily imbalanced? (That needs special handling.)
  • Are any features almost perfectly correlated with each other? (Redundant — pick one.)
  • Are there obvious outliers that could skew the model?

EDA isn't glamorous, but it's where you build intuition about your data. That intuition is what separates a good model from a great one.

4

Data Preprocessing: The Unglamorous 80% of ML

Raw data almost never comes model-ready. Column names have spaces. Dates are strings. Categorical variables need encoding. Missing values need handling. Welcome to the real job.

Missing values — there's no single right answer (a short sketch follows this list):

  • If less than 5% of a column is missing, fill numerical columns with the median (not the mean — medians are robust to outliers)
  • For categorical columns, fill with the mode or a literal "Unknown"
  • If more than 40% is missing, seriously consider dropping the column. It's adding noise, not signal.
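
Here's a minimal sketch of those three rules. The column names are illustrative, not from any particular dataset:

# Numerical column with a small share of missing values: fill with the median
df['income'] = df['income'].fillna(df['income'].median())

# Categorical column: fill with the mode or a literal "Unknown"
df['city'] = df['city'].fillna('Unknown')

# Column that's mostly missing: consider dropping it entirely
df = df.drop(columns=['free_text_notes'])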

Encoding categorical variables:

# For ordinal categories (small < medium < large)
size_map = {'small': 0, 'medium': 1, 'large': 2}
df['size_encoded'] = df['size'].map(size_map)

# For nominal categories (no natural order) — one-hot encoding
df = pd.get_dummies(df, columns=['city', 'department'], drop_first=True)

Feature scaling matters for distance-based and gradient-based algorithms (KNN, SVM, logistic regression) but not for tree-based ones (Random Forest, XGBoost):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
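# Shown on the full DataFrame for brevity; in practice, fit the scaler on the training set only (see the train-test split section)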
df[['age', 'income', 'years_experience']] = scaler.fit_transform(
    df[['age', 'income', 'years_experience']]
)

Feature engineering is where domain knowledge pays off. Creating a "years_since_last_purchase" feature from a date column, or a "price_per_sqft" feature from price and area — these kinds of transformations often improve model performance more than switching algorithms.
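
A couple of quick sketches of that idea. The column names here are hypothetical; adapt them to whatever your dataset actually contains:

# Ratio feature: price per square foot from two existing columns
df['price_per_sqft'] = df['price'] / df['sqft']

# Recency feature: years since the last purchase, derived from a date column
df['last_purchase_date'] = pd.to_datetime(df['last_purchase_date'])
df['years_since_last_purchase'] = (pd.Timestamp.today() - df['last_purchase_date']).dt.days / 365.25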

5

Train-Test Split: The Rule You Cannot Break

Here's the most important rule in machine learning: never evaluate your model on data it was trained on. That measures memorisation, not understanding.

from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

A few things people get wrong:

  • Data leakage — If you fit your scaler on the entire dataset before splitting, the test set's statistics bleed into the training process. Fit on X_train only, then transform both sets (see the sketch after this list).
  • Not stratifying — If your target has 90% class A and 10% class B, a random split could give you a test set with zero class B examples. stratify=y prevents that.
  • Not setting random_state — Always set it. Otherwise every run produces a different split, making your results impossible to reproduce.
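
On the data leakage point, here's a minimal sketch of the leakage-safe way to scale:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # apply those statistics; never fit on the test set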

For small datasets (under 5,000 rows), a single 80/20 split is risky because the test set is tiny. Use k-fold cross-validation instead — it splits the data into k parts, trains on k-1, tests on the remaining fold, and averages the results across all rotations.
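
Here's a minimal sketch of k-fold cross-validation using cross_val_score, with a placeholder logistic regression model:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# 5-fold CV: each fold takes one turn as the held-out set, and the scores are averaged
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"Scores per fold: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")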

6

Picking Your First Algorithm (Hint: Start Boring)

Beginners gravitate toward neural networks and complex ensembles. Resist that urge. Start with the simplest model that could work, measure it, and only add complexity if you need it.

My recommended starting points:

For classification:

  1. Logistic Regression — Fast, interpretable, surprisingly strong on many datasets. Start here.
  2. Random Forest — Handles messy data well, doesn't need much feature scaling, gives you feature importances for free
  3. XGBoost/LightGBM — When you need the best accuracy on tabular data. These win most Kaggle competitions for a reason.

For regression: swap Logistic Regression for Linear Regression, and the rest applies.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, max_depth=15, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

That's five lines from import to predictions. Scikit-learn's consistent API — .fit(), .predict(), .score() — works identically across all its algorithms. Learn the pattern once and you can swap models in and out in seconds.
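
To make that concrete, here's a sketch that swaps two models through the same interface, assuming the X_train/X_test split from the previous section:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Same .fit() / .score() pattern, different algorithm underneath
for name, clf in [('Logistic Regression', LogisticRegression(max_iter=1000)),
                  ('Random Forest', RandomForestClassifier(random_state=42))]:
    clf.fit(X_train, y_train)
    print(f"{name}: {clf.score(X_test, y_test):.3f}")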

I'll repeat this because it's genuinely important: the baseline model matters. If logistic regression gives you 88% accuracy and a complex gradient boosting ensemble gives you 90%, you need to ask whether that 2% is worth the added complexity, training time, and maintenance burden. Often it isn't.

7

Evaluating Classification Models (Accuracy Is Not Enough)

Imagine a model that predicts "no fraud" for every single transaction. If 99% of transactions are legitimate, that model is 99% accurate. And completely useless.

This is why accuracy alone is dangerous. Use these metrics instead:

  • Precision — Out of everything the model flagged as positive, how many actually were? High precision = few false alarms.
  • Recall — Out of all actual positives, how many did the model catch? High recall = few missed cases.
  • F1 Score — The harmonic mean of precision and recall. Use this as your single metric when both matter equally.
  • AUC-ROC — How well the model distinguishes between classes across all possible thresholds. 1.0 = perfect, 0.5 = random guessing.

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(f"AUC-ROC: {roc_auc_score(y_test, probabilities[:, 1]):.4f}")

Which metric you prioritise depends on the business context. Cancer screening? Maximise recall — you'd rather flag a healthy patient for further testing than miss an actual case. Email spam filter? Maximise precision — users will forgive the occasional spam email getting through, but they'll be furious if important emails end up in spam.

8

Evaluating Regression Models: How Wrong Are Your Predictions?

Regression metrics tell you how far off your predictions are from reality:

  • MAE (Mean Absolute Error) — The average gap between predicted and actual values. If MAE = $12,000 on a housing price model, your predictions are off by $12K on average. Easy to explain to stakeholders.
  • RMSE (Root Mean Squared Error) — Like MAE but penalises large errors more heavily. If occasional big misses are a problem (pricing a house at $500K when it's worth $200K), RMSE will flag that.
  • R² Score — How much of the variance your model explains. 1.0 means perfect predictions. 0 means your model is no better than guessing the average every time.

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R²:   {r2:.4f}")

Always plot your residuals (prediction errors) against your predicted values. If the residuals show a pattern — like errors getting larger for higher predictions — your model is systematically wrong in a way that a single metric won't reveal.
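
Here's a minimal sketch of that residual plot:

import matplotlib.pyplot as plt

residuals = y_test - predictions
plt.scatter(predictions, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residuals vs Predictions')
plt.show()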

9

Hyperparameter Tuning: Squeezing Out Better Performance

Once you've got a decent model, tuning its hyperparameters can push accuracy another 2–5%. Not transformative, but meaningful.

Hyperparameters are the knobs you set before training — number of trees, maximum tree depth, learning rate. The model can't learn these from data; you have to find good values through experimentation.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best F1: {grid_search.best_score_:.4f}")

Grid search is thorough but slow — the example above tests 81 combinations with 5-fold CV, so it trains 405 models. For faster iteration, use RandomizedSearchCV, which samples random combinations instead of testing all of them. In practice, randomised search gets you 95% of the benefit in a fraction of the time.
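
Here's what that looks like as a sketch, reusing the param_grid from above; n_iter=20 is an arbitrary budget, not a magic number:

from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,          # values are sampled from the same grid instead of enumerated
    n_iter=20,           # try 20 random combinations rather than all 81
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)

print(f"Best params: {random_search.best_params_}")
print(f"Best F1: {random_search.best_score_:.4f}")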

A reality check: if tuning moves your F1 from 0.85 to 0.87, great. If it moves it from 0.85 to 0.855... you're probably better off spending that time on feature engineering instead.

10

Where to Go from Here

You've built a complete ML pipeline — from raw data to a tuned, evaluated model. That's a real accomplishment. Here's how to keep building on it:

Projects that will teach you the most:

  • Pick a Kaggle competition and work through it end-to-end. The discussion forums are an incredible learning resource — experienced practitioners share their entire approach.
  • Build a model that solves a problem you personally care about. Predicting your commute time, forecasting your electricity bill, classifying recipes by cuisine. Personal projects stick because you're genuinely curious about the answer.
  • Reproduce a published result. Find an interesting paper or blog post, download the data, and try to match their numbers.

Skills to develop next:

  • Feature engineering — The single highest-leverage skill in applied ML
  • Handling imbalanced data — SMOTE, class weights, threshold tuning
  • Ensemble methods — Stacking, blending, and how gradient boosting actually works
  • Model deployment — Flask/FastAPI for serving predictions via REST APIs
  • Deep learning — Check out our TensorFlow & Keras tutorial when you're ready

And remember: the best data scientists aren't the ones who know the most algorithms. They're the ones who ask the best questions about the data.
