Machine Learning System Design — Part 5: Tackling Class Imbalance and Enhancing Data with Augmentation

Rare classes and limited data are two of the biggest blockers to real-world ML performance. This part explores how to rebalance skewed datasets and expand your training data using smart augmentation techniques that help models learn better, faster, and more fairly.

Introduction

In the earlier parts of this series, we emphasized the critical role of high-quality data in building successful machine learning systems. Once you’ve collected, cleaned, and structured your data, the next challenge is often not about feeding it into a sophisticated model. It is about making sure your data can actually teach the model what it needs to learn. Two of the biggest barriers here are class imbalance and limited data volume. Both are extremely common in real-world applications and, if left unaddressed, can severely degrade model performance.

In Part 5, we learn two powerful strategies that directly tackle these issues:

  1. Handling class imbalance — to ensure your model pays attention to the rare but important cases.
  2. Data augmentation — to artificially expand your dataset and expose your model to more diverse scenarios.

These strategies are especially crucial in domains like fraud detection, medical diagnostics, and anomaly detection, where the cost of ignoring minority classes or failing to generalize can be extremely high. The goal of this part is to provide practical techniques, trade-offs, and best practices for turning skewed, limited datasets into robust training material that enables models to perform well in production.

Tackling Class Imbalance

Class imbalance is one of the most common challenges in real-world machine learning. It occurs when some classes in your dataset appear much more frequently than others, and it can completely undermine the performance of your model if not handled properly.

1.1 The Challenge of Imbalanced Data

In most real-world problems, the classes you care about predicting are actually quite rare. This creates a fundamental challenge for machine learning algorithms, which typically assume that all classes are roughly equally represented.

How Class Imbalance Affects Model Training

When your dataset is heavily imbalanced, several problems emerge:

1. Insufficient Learning Signal:
If your vehicle detection dataset is dominated by cars and motorcycles, but contains very few examples of trucks, buses, or tricycles, your model gets very little exposure to these rarer vehicle types. With such limited positive examples, the model struggles to learn what makes a tricycle different from, say, a motorcycle or a small car — leading to poor performance on the underrepresented classes.

3. Trivial Solutions:
A model that always predicts “car” or “motorcycle” could still achieve very high accuracy on an imbalanced dataset, simply because those classes dominate. However, such a model would completely fail to detect trucks or tricycles. Standard accuracy metrics become misleading in these cases: high accuracy does not imply good performance on the rare classes that may be critical in certain applications (e.g., road safety, logistics, or traffic analysis).

3. Asymmetric Costs:
In many real-world use cases, different mistakes have different consequences. Misclassifying a tricycle as a motorcycle (false negative) might lead to safety risks if, for example, the tricycle has different road behavior. On the other hand, flagging a car as a truck (false positive) might just lead to a minor processing delay in a logistics system. Despite this, standard loss functions typically treat all errors as equal, which doesn’t align with the real-world costs of those mistakes.

Common Examples of Class Imbalance

Class imbalance shows up everywhere in real-world ML:

  • Fraud detection: 99.9% of credit card transactions are legitimate
  • Medical diagnosis: Most patients don’t have rare diseases
  • Quality control: Most manufactured products pass inspection
  • Email spam filtering: The ratio of spam to legitimate email varies, but one class usually dominates
  • Click-through prediction: Most ads don’t get clicked
  • Anomaly detection: By definition, anomalies are rare
  • Customer churn: Most customers don’t cancel their subscriptions in any given month

The severity of imbalance varies, but even moderate imbalance (like 80/20 splits) can cause problems for many algorithms.

1.2 Handling Techniques: Balancing the Scales

Fortunately, there are several proven techniques for handling class imbalance. The best approach depends on your data size, the severity of imbalance, and your specific business requirements.

Resampling: Changing the Data Distribution

Resampling techniques modify your training data to create more balanced class distributions:

Undersampling removes examples from the majority class to match the minority class size. This is simple and fast, but you risk losing valuable information. If you have millions of legitimate transactions and only thousands of fraudulent ones, throwing away 99% of your legitimate transaction data seems wasteful.

Oversampling duplicates or creates additional examples of the minority class. Simple duplication (repeating the same minority examples multiple times) is easy to implement but can lead to overfitting — your model might memorize specific minority examples rather than learning general patterns.
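Both ideas can be sketched in a few lines of numpy. This is a minimal illustration on toy data (the array names and the 95/5 split are made up for the example), not a production pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced dataset: 95 majority examples (class 0), 5 minority (class 1)
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(size=(100, 3))

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Undersampling: keep only as many majority examples as there are minority ones
keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
under_idx = np.concatenate([keep, min_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Oversampling: duplicate minority examples (with replacement) up to the majority count
extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
over_idx = np.concatenate([maj_idx, min_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))  # [5 5]  — balanced, but 90 majority rows discarded
print(np.bincount(y_over))   # [95 95] — balanced, but minority rows repeated
```

The printed counts make the trade-off concrete: undersampling throws away 90% of the majority data, while oversampling reuses the same 5 minority rows 19 times each on average.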

SMOTE (Synthetic Minority Oversampling Technique) generates synthetic minority class examples by interpolating between existing minority examples. Instead of just copying existing fraud examples, SMOTE creates new synthetic fraud examples that are similar to but not identical to the originals. This helps avoid overfitting while providing more training signal.

Here’s how SMOTE works conceptually:

  1. For each minority class example, find its k nearest neighbors (also from the minority class)
  2. Create a new synthetic example by interpolating between the original example and one of its neighbors
  3. Repeat until you have the desired number of minority class examples
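The three steps above translate almost directly into code. Below is a simplified numpy sketch of the interpolation idea (the function name `smote_sample` and its parameters are invented for this illustration; in practice you would use a library implementation such as imbalanced-learn's `SMOTE`):

```python
import numpy as np

def smote_sample(X_min, k=3, n_new=10, rng=None):
    """Generate n_new synthetic minority examples by interpolating between
    a randomly chosen minority point and one of its k nearest minority neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from the chosen point to every other minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip index 0: the point itself
        j = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Four minority points on the unit square; synthetic points stay inside their hull
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sample(X_min, k=2, n_new=5)
print(X_new.shape)  # (5, 2)
```

Because each synthetic point is a convex combination of two real minority points, it always lies between them, which is exactly why SMOTE produces plausible rather than arbitrary examples.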

Reweighting: Adjusting the Loss Function

Instead of changing your data, you can change how your algorithm learns from it by giving minority class examples higher importance during training.

Class weights multiply the loss for each class by a weight factor. If fraud examples are 100 times rarer than legitimate examples, you might weight fraud examples 100 times higher in the loss function. Most ML libraries support class weights directly.
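A common way to set these weights is the inverse-frequency ("balanced") heuristic, weight_c = n_samples / (n_classes × count_c), which several libraries use as their default. A small numpy sketch with a made-up 99:1 split:

```python
import numpy as np

y = np.array([0] * 990 + [1] * 10)  # 99:1 imbalance

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)
print(weights)  # majority ~0.505, minority 50.0

# Expand to one weight per training example, ready for a weighted loss
sample_weight = weights[y]
```

With these weights, each minority example contributes roughly 99 times more to the loss than a majority example, so the two classes pull on the gradients with equal total force.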

Focal loss dynamically adjusts weights based on prediction confidence. Examples that the model is already confident about get lower weight, while difficult examples (including minority class examples the model struggles with) get higher weight. This helps the model focus on learning the hardest cases.
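The standard binary focal loss formulation multiplies cross-entropy by a (1 − p_t)^γ factor, so confident predictions are down-weighted. A minimal numpy version (γ and α values here are the common defaults, not tuned for any particular task):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: scales cross-entropy by (1 - p_t)^gamma so that
    easy, confidently classified examples contribute little to the loss.
    p: predicted probability of class 1; y: true labels in {0, 1}."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # per-class balancing factor
    return -alpha_t * (1 - pt) ** gamma * np.log(pt)

# An easy positive (p=0.95) vs. a hard positive (p=0.30)
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.30]), np.array([1]))
print(hard > easy)  # [ True] — the hard example dominates the loss
```

The (1 − p_t)^γ term is what does the work: at p_t = 0.95 it shrinks the loss by a factor of 400 (0.05² = 0.0025), while at p_t = 0.30 the loss is barely attenuated.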

Ensemble Methods: Multiple Models, Better Balance

Ensemble approaches train multiple models on different balanced subsets of your data, then combine their predictions:

Balanced bagging trains each model in your ensemble on a balanced subset of the data. You might train 10 different models, each on a balanced sample containing all your fraud examples plus an equal number of randomly sampled legitimate examples.
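Constructing those balanced subsets is straightforward; the sketch below builds the index sets only (model training is omitted, and the 200/20 split is invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 200 + [1] * 20)  # 10:1 imbalance

min_idx = np.flatnonzero(y == 1)
maj_idx = np.flatnonzero(y == 0)

# Each ensemble member sees all 20 minority examples plus an equally sized
# random slice of the majority class — a different slice per member.
subsets = []
for _ in range(10):
    maj_sample = rng.choice(maj_idx, size=len(min_idx), replace=False)
    subsets.append(np.concatenate([min_idx, maj_sample]))

print(len(subsets), len(subsets[0]))  # 10 40
```

Across 10 members, most of the 200 majority examples get used by at least one model, so the ensemble as a whole discards far less information than a single undersampled model would.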

Boosting algorithms like AdaBoost naturally handle imbalance by focusing on misclassified examples in subsequent iterations. Since minority class examples are more likely to be misclassified, they get increasing attention as the ensemble grows.

Ensemble methods are a deep topic that can’t be fully covered in a few sentences. Since a detailed treatment is beyond the scope of this part, I recommend a valuable open-source resource: An Introduction to Statistical Learning with Applications in Python (also available in an R version). It’s an excellent starting point for learning traditional machine learning algorithms, including ensemble techniques.

Synthetic Data Generation: Creating New Training Examples

Modern techniques can generate entirely new training examples for the minority class:

Generative AI models can create realistic synthetic examples to help balance the dataset. For image data, these models can generate new images that closely resemble those from the underrepresented class. For tabular data, specialized generative models can produce synthetic records that preserve the statistical properties of the original dataset.
