
Practical Guide to Handling Imbalanced Datasets

TL;DR Learn how to handle imbalanced data using TensorFlow 2, Keras and scikit-learn

Datasets in the wild will throw a variety of problems at you. What are the most common ones?

The data might have too few examples, be too large to fit into RAM, contain missing values, lack enough predictive power to make correct predictions, or be imbalanced.

In this guide, we’ll try out different approaches to solving the imbalance issue for classification tasks. That isn’t the only issue on our hands. Our dataset is real, and we’ll have to deal with multiple problems - imputing missing data and handling categorical features.

Before getting any deeper, you might want to consider far simpler solutions to the imbalanced dataset problem:

  • Collect more data - This might seem like a no-brainer, but it is often overlooked. Can you write a few more queries and extract data from your database? Can you spend a few more hours collecting customer data? More data can balance your dataset, or it might make the imbalance even worse. Either way, you want a more complete picture of the data.
  • Use tree-based models - Tree-based models tend to perform better on imbalanced datasets. Essentially, they build hierarchies of split/decision points, which might separate the classes better (a small sketch follows this list).
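
For example, a tree-based baseline might look like the following. This is a minimal sketch, assuming scikit-learn is available and that a train split (X_train, y_train) has been prepared as shown later in this guide:

from sklearn.ensemble import RandomForestClassifier

# class_weight='balanced' reweights classes inversely to their frequency,
# which helps the trees pay attention to the rare class
tree_model = RandomForestClassifier(
  n_estimators=100,
  class_weight='balanced',
  random_state=42
)
tree_model.fit(X_train, y_train)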

Here’s what you’ll learn:

  • Impute missing data
  • Handle categorical features
  • Use the right metrics for classification tasks
  • Set per class weights in Keras when training a model
  • Use resampling techniques to balance the dataset

Run the complete code in your browser

Data

Naturally, our data should be imbalanced. Kaggle has the perfect dataset for us - Porto Seguro’s Safe Driver Prediction. The objective is to predict whether a driver will file an insurance claim. How many drivers actually do that?

Setup

Let’s start with installing TensorFlow and setting up the environment:

!pip install tensorflow-gpu
!pip install gdown
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd

RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

We’ll use gdown to get the data from Google Drive:

!gdown --id 18gwvNkMs6t0jL0APl9iWPrhr5GVg082S --output insurance_claim_prediction.csv

Exploration

Let’s load the data in Pandas and have a look:

df = pd.read_csv('insurance_claim_prediction.csv')
print(df.shape)
(595212, 59)

Loads of data. What features does it have?

print(df.columns)
Index(['id', 'target', 'ps_ind_01', 'ps_ind_02_cat', 'ps_ind_03',
       'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_ind_06_bin', 'ps_ind_07_bin',
       'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin',
       'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15',
       'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01',
       'ps_reg_02', 'ps_reg_03', 'ps_car_01_cat', 'ps_car_02_cat',
       'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat',
       'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat', 'ps_car_10_cat',
       'ps_car_11_cat', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14',
       'ps_car_15', 'ps_calc_01', 'ps_calc_02', 'ps_calc_03', 'ps_calc_04',
       'ps_calc_05', 'ps_calc_06', 'ps_calc_07', 'ps_calc_08', 'ps_calc_09',
       'ps_calc_10', 'ps_calc_11', 'ps_calc_12', 'ps_calc_13', 'ps_calc_14',
       'ps_calc_15_bin', 'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin',
       'ps_calc_19_bin', 'ps_calc_20_bin'],
      dtype='object')

Those seem somewhat cryptic. Here is the data description:

features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target columns signifies whether or not a claim was filed for that policy holder.

What is the proportion of each target class?

# simple helper to express a part as a percentage of the whole
def percentage(part, whole):
  return 100 * part / whole

no_claim, claim = df.target.value_counts()
print(f'No claim {no_claim}')
print(f'Claim {claim}')
print(f'Claim proportion {round(percentage(claim, claim + no_claim), 2)}%')
No claim 573518
Claim 21694
Claim proportion 3.64%

Good, we have an imbalanced dataset on our hands. Let’s look at a graphical representation of the imbalance:
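
One way to produce such a chart (a minimal sketch, assuming matplotlib and seaborn are installed):

import matplotlib.pyplot as plt
import seaborn as sns

# bar chart of how many rows fall into each target class
sns.countplot(x='target', data=df)
plt.xlabel('Target (0 = no claim, 1 = claim)')
plt.ylabel('Number of rows')
plt.show()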

You got the visual proof right there. But how good of a model can you build using this dataset?

Baseline model

You might’ve noticed something in the data description. Missing data points have a value of -1. What should we do before training our model?

Data preprocessing

Let’s check how many rows/columns contain missing data:

row_count = df.shape[0]

for c in df.columns:
  m_count = df[df[c] == -1][c].count()
  if m_count > 0:
    print(f'{c} - {m_count} ({round(percentage(m_count, row_count), 3)}%) rows missing')
ps_ind_02_cat - 216 (0.036%) rows missing
ps_ind_04_cat - 83 (0.014%) rows missing
ps_ind_05_cat - 5809 (0.976%) rows missing
ps_reg_03 - 107772 (18.106%) rows missing
ps_car_01_cat - 107 (0.018%) rows missing
ps_car_02_cat - 5 (0.001%) rows missing
ps_car_03_cat - 411231 (69.09%) rows missing
ps_car_05_cat - 266551 (44.783%) rows missing
ps_car_07_cat - 11489 (1.93%) rows missing
ps_car_09_cat - 569 (0.096%) rows missing
ps_car_11 - 5 (0.001%) rows missing
ps_car_12 - 1 (0.0%) rows missing
ps_car_14 - 42620 (7.16%) rows missing

Missing data imputation

ps_car_03_cat, ps_car_05_cat and ps_reg_03 have too many missing rows for our own comfort. We’ll get rid of them. Note that this is not the best strategy but will do in our case.

df.drop(
  ["ps_car_03_cat", "ps_car_05_cat", "ps_reg_03"],
  inplace=True,
  axis=1
)

What about the other features? We’ll use the SimpleImputer from scikit-learn to replace the missing values:

from sklearn.impute import SimpleImputer

cat_columns = [
  'ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat',
  'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_07_cat',
  'ps_car_09_cat'
]
num_columns = ['ps_car_11', 'ps_car_12', 'ps_car_14']

mean_imp = SimpleImputer(missing_values=-1, strategy='mean')
cat_imp = SimpleImputer(missing_values=-1, strategy='most_frequent')

for c in cat_columns:
  df[c] = cat_imp.fit_transform(df[[c]]).ravel()

for c in num_columns:
  df[c] = mean_imp.fit_transform(df[[c]]).ravel()

We use the most frequent value for categorical features. Numerical features are replaced with the mean value of the column.

Categorical features

Pandas get_dummies() uses one-hot encoding to represent categorical features. Perfect! Let’s use it:

df = pd.get_dummies(df, columns=cat_columns)
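
A quick sanity check that no -1 placeholders are left (a minimal sketch):

# should print 0 - the -1 missing-value marker is gone from every column
print((df == -1).sum().sum())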

Now that there are no more missing values (the check above confirms it) and the categorical features are encoded, we can try to predict insurance claims. What accuracy can we get?

Building the model

We’ll start by splitting the data into train and test datasets:

from sklearn.model_selection import train_test_split

labels = df.columns[2:]

X = df[labels]
y = df['target']

X_train, X_test, y_train, y_test = \
  train_test_split(X, y, test_size=0.05, random_state=RANDOM_SEED)

Our binary classification model is a Neural Network with batch normalization and dropout layers:

def build_model(train_data, metrics=["accuracy"]):
  model = keras.Sequential([
    keras.layers.Dense(
      units=36,
      activation='relu',
      input_shape=(train_data.shape[-1],)
    ),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(units=1, activation='sigmoid'),
  ])

  model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss=keras.losses.BinaryCrossentropy(),
    metrics=metrics
  )

  return model

You should be familiar with the training procedure:

BATCH_SIZE = 2048

model = build_model(X_train)
history = model.fit(
    X_train,
    y_train,
    batch_size=BATCH_SIZE,
    epochs=20,
    validation_split=0.05,
    shuffle=True,
    verbose=0
)

In general, you should strive for a small batch size (e.g. 32). Our case is a bit specific - we have highly imbalanced data, so we use a large batch size to give each batch a fair chance of containing at least a few insurance claim data points.

The validation accuracy seems quite good. Let’s evaluate the performance of our model:

model.evaluate(X_test, y_test, batch_size=BATCH_SIZE)
119043/119043 - loss: 0.1575 - accuracy: 0.9632

That’s pretty good. It seems like our model is pretty awesome. Or is it?

def awesome_model_predict(features):
  return np.full((features.shape[0], ), 0)

y_pred = awesome_model_predict(X_test)

This amazing model predicts that there will be no claim, no matter the features. What accuracy does it get?

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
0.9632

Sweet! Wait. What? This is as good as our complex model. Is there something wrong with our approach?

Evaluating the model

Not really. We’re just using the wrong metric to evaluate our model. This is a well-known problem. The Accuracy paradox suggests accuracy might not be the correct metric when the dataset is imbalanced. What can you do?

Using the correct metrics

One way to understand the performance of our model is to use a confusion matrix. It shows us how well our model predicts for each class:
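
The confusion matrix shown below can be reproduced with something along these lines (a minimal sketch, assuming matplotlib and seaborn):

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# threshold the sigmoid outputs at 0.5 to get hard class predictions
y_pred = (model.predict(X_test) > 0.5).astype('int32').ravel()

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()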

When the model is predicting everything perfectly, all values are on the main diagonal. That’s not the case here. So sad! Our complex model seems just as dumb as our awesome model.

Good, now we know that our model is very bad at predicting insurance claims. Can we somehow tune it to do better?

Useful metrics

We can use a wide range of other metrics to measure our performance better:

  • Precision - true positives divided by all positive predictions:
\frac{\text{true positives}}{\text{true positives} + \text{false positives}}

Low precision indicates a high number of false positives.

  • Recall - percentage of actual positives that were correctly classified
\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}

Low recall indicates a high number of false negatives.

  • F1 score - combines precision and recall in one metric:
\frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
  • ROC curve - A curve of True Positive Rate vs. False Positive Rate at different classification thresholds. It starts at (0,0) and ends at (1,1). A good model produces a curve that goes quickly from 0 to 1.

  • AUC (Area under the ROC curve) - Summarizes the ROC curve with a single number. The best value is 1.0; a value of 0.5 means the model is no better than random guessing.

Different combinations of precision and recall give you a better understanding of how well your model is performing for a given class:

  • high precision + high recall : your model can be trusted when predicting this class
  • high precision + low recall : you can trust the predictions for this class, but your model is not good at detecting it
  • low precision + high recall: your model can detect the class but tends to confuse it with other classes
  • low precision + low recall : you can’t trust the predictions for this class
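
scikit-learn can summarize precision, recall, and F1 per class in a single report (a minimal sketch, assuming a trained model and the test split from above):

from sklearn.metrics import classification_report

y_pred = (model.predict(X_test) > 0.5).astype('int32').ravel()
print(classification_report(y_test, y_pred, target_names=['no claim', 'claim']))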

Measuring your model

Luckily, Keras can calculate most of those metrics for you:

METRICS = [
      keras.metrics.TruePositives(name='tp'),
      keras.metrics.FalsePositives(name='fp'),
      keras.metrics.TrueNegatives(name='tn'),
      keras.metrics.FalseNegatives(name='fn'),
      keras.metrics.BinaryAccuracy(name='accuracy'),
      keras.metrics.Precision(name='precision'),
      keras.metrics.Recall(name='recall'),
      keras.metrics.AUC(name='auc'),
]
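
Note that the f1 score in the results below is not a built-in Keras metric; presumably it was computed separately, e.g. with scikit-learn, after retraining the model with these metrics. A sketch of that evaluation:

from sklearn.metrics import f1_score

model = build_model(X_train, metrics=METRICS)
model.fit(
  X_train, y_train,
  batch_size=BATCH_SIZE, epochs=20,
  validation_split=0.05, shuffle=True, verbose=0
)

# evaluate() returns the loss followed by the metrics defined above
results = model.evaluate(X_test, y_test, batch_size=BATCH_SIZE, verbose=0)
for name, value in zip(model.metrics_names, results):
  print(name, ': ', value)

# F1 is derived from hard predictions on the test set
y_pred = (model.predict(X_test) > 0.5).astype('int32').ravel()
print('f1 score:', f1_score(y_test, y_pred))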

And here are the results:

loss :  0.1557253243213323
tp :  0.0
fp :  1.0
tn :  57302.0
fn :  2219.0
accuracy :  0.9627029
precision :  0.0
recall :  0.0
auc :  0.62021655
f1 score: 0.0

Here is the ROC:
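
A sketch for plotting it, assuming scikit-learn and matplotlib:

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# use the raw sigmoid probabilities, not thresholded predictions
y_prob = model.predict(X_test).ravel()
fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.plot(fpr, tpr, label='model')
plt.plot([0, 1], [0, 1], linestyle='--', label='random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()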

Our model is complete garbage. And we can measure how much garbage it is. Can we do better?

Weighted model

We have many more examples of no insurance claims compared to claims. Let’s force our model to pay attention to the underrepresented class. We can do that by passing a weight for each class. First, we need to calculate those weights:

no_claim_count, claim_count = np.bincount(df.target)
total_count = len(df.target)

weight_no_claim = (1 / no_claim_count) * (total_count) / 2.0
weight_claim = (1 / claim_count) * (total_count) / 2.0

class_weights = {0: weight_no_claim, 1: weight_claim}

Now, let’s use the weights when training our model:

model = build_model(X_train, metrics=METRICS)

history = model.fit(
    X_train,
    y_train,
    batch_size=BATCH_SIZE,
    epochs=20,
    validation_split=0.05,
    shuffle=True,
    verbose=0,
    class_weight=class_weights
)

Evaluation

Let’s begin with the confusion matrix:

Things are a lot different now. We have a lot of correctly predicted insurance claims. The bad news is that we also predict many claims that were actually no claims. What can our metrics tell us?

loss :  0.6694403463347913
tp :  642.0
fp :  11170.0
tn :  17470.0
fn :  479.0
accuracy :  0.6085817
precision :  0.05435151
recall :  0.57270294
auc :  0.63104653
f1 score: 0.09928090930178612

The recall has jumped significantly while the precision bumped up only slightly. The F1 score is pretty low too! Overall, our model has improved somewhat, especially considering the minimal effort on our part. How can we do better?

Resampling techniques

These methods try to “correct” the balance in your data. They act as follows:

  • oversampling - replicate examples from the under-represented class (claims)
  • undersampling - sample from the most represented class (no claims) to keep only a few examples
  • generate synthetic data - create new synthetic examples from the under-represented class

Naturally, a classifier trained on the “rebalanced” data will not know the original class proportions. It is expected to have (much) lower accuracy, since the true proportions carry information that helps make a prediction.

You must think long and hard (that’s what she said) before using resampling methods. It can be a perfectly good approach or complete nonsense.

Let’s start by separating the classes:

X = pd.concat([X_train, y_train], axis=1)

no_claim = X[X.target == 0]
claim = X[X.target == 1]

Oversample minority class

We’ll start by adding more copies of the “insurance claim” class. This can be a good option when the data is limited. Either way, you might want to evaluate all approaches using your metrics.

We’ll use the resample() utility from scikit-learn:

from sklearn.utils import resample

claim_upsampled = resample(claim,
                          replace=True,
                          n_samples=len(no_claim),
                          random_state=RANDOM_SEED)
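
To train on the upsampled data, the two classes presumably get recombined and split back into features and target before fitting a fresh model. A minimal sketch (the upsampled and X_train_up names are illustrative):

upsampled = pd.concat([no_claim, claim_upsampled])

X_train_up = upsampled[labels]
y_train_up = upsampled['target']

model = build_model(X_train_up, metrics=METRICS)
history = model.fit(
  X_train_up, y_train_up,
  batch_size=BATCH_SIZE, epochs=20,
  validation_split=0.05, shuffle=True, verbose=0
)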

Here is the new distribution of no claim vs claim:

Our new model performs like this:

loss :  0.6123614118771424
tp :  530.0
fp :  8754.0
tn :  19886.0
fn :  591.0
accuracy :  0.68599844
precision :  0.057087462
recall :  0.47279215
auc :  0.6274258
f1 score: 0.10187409899086977

The performance of our model is similar to the weighted one. Can undersampling do better?

Undersample majority class

We’ll remove samples from the no claim class and balance the data this way. This can be a good option when your dataset is large, but removing data throws away information and can lead to underfitting.

no_claim_downsampled = resample(no_claim,
                                replace = False,
                                n_samples = len(claim),
                                random_state = RANDOM_SEED)
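
As before, the downsampled majority class is presumably recombined with the claims and a fresh model is trained on it (a sketch mirroring the oversampling step above, with illustrative names):

downsampled = pd.concat([no_claim_downsampled, claim])

X_train_down = downsampled[labels]
y_train_down = downsampled['target']

model = build_model(X_train_down, metrics=METRICS)
history = model.fit(
  X_train_down, y_train_down,
  batch_size=BATCH_SIZE, epochs=20,
  validation_split=0.05, shuffle=True, verbose=0
)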

loss :  0.6377013992475753
tp :  544.0
fp :  8969.0
tn :  19671.0
fn :  577.0
accuracy :  0.67924464
precision :  0.057184905
recall :  0.485281
auc :  0.6206339
f1 score: 0.1023133345871732

Again, the results aren’t that impressive, but we’re still doing better than the baseline model.

Generating synthetic samples

Let’s try to simulate the data generation process by creating synthetic samples. We’ll use the imbalanced-learn library to do that.

One over-sampling method for generating synthetic data is the Synthetic Minority Oversampling Technique (SMOTE). It uses the k-nearest neighbors algorithm to create new data samples.

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=RANDOM_SEED, sampling_strategy=1.0)
X_train, y_train = sm.fit_resample(X_train, y_train)

loss :  0.26040001417683606
tp :  84.0
fp :  1028.0
tn :  27612.0
fn :  1037.0
accuracy :  0.9306139
precision :  0.07553957
recall :  0.0749331
auc :  0.5611229
f1 score: 0.07523510971786834

We have high accuracy but very low precision and recall. Not a useful approach for our dataset.

Conclusion

There are a lot of ways to handle imbalanced datasets. You should always start with something simple (like collecting more data or using a Tree-based model) and evaluate your model with the appropriate metrics. If all else fails, come back to this guide and try the more advanced approaches.

You learned how to:

  • Impute missing data
  • Handle categorical features
  • Use the right metrics for classification tasks
  • Set per class weights in Keras when training a model
  • Use resampling techniques to balance the dataset

Run the complete code in your browser

Remember that the best approach is almost always specific to the problem at hand (context is king). And sometimes, you can restate the problem as outlier/anomaly detection ;)

References