
Hacker's Guide to Fixing Underfitting and Overfitting Models

TL;DR Learn how to handle underfitting and overfitting models using TensorFlow 2, Keras and scikit-learn. Understand how you can use the bias-variance tradeoff to make better predictions.

The problem of the goodness of fit can be illustrated using the following diagrams:

One way to describe the problem of underfitting is by using the concept of bias:

  • a model has a high bias if it makes a lot of mistakes on the training data. We also say that the model underfits.
  • a model has a low bias if it predicts well on the training data

Naturally, we can use another concept to describe the problem of overfitting - variance:

  • a model has a high variance if it predicts very well on the training data but performs poorly on the test data. Basically, overfitting means that the model has memorized the training data and can’t generalize to things it hasn’t seen.
  • a model has a low variance if it generalizes well to the test data
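As a rough sketch of the two definitions above, a small helper can label a model from its training and test accuracy. The name `diagnose_fit` and both thresholds are assumptions made up for illustration, not part of any library:

```python
def diagnose_fit(train_acc, test_acc, min_train_acc=0.8, max_gap=0.1):
    """Label a model from its train/test accuracy.

    The thresholds are illustrative assumptions; pick values that make
    sense for your own problem.
    """
    if train_acc < min_train_acc:
        # Many mistakes on the training data -> high bias
        return "high bias (underfitting)"
    if train_acc - test_acc > max_gap:
        # Good on training data, much worse on unseen data -> high variance
        return "high variance (overfitting)"
    return "low bias and low variance"


print(diagnose_fit(train_acc=0.60, test_acc=0.58))
print(diagnose_fit(train_acc=0.99, test_acc=0.70))
print(diagnose_fit(train_acc=0.90, test_acc=0.88))
```

The three calls cover the underfitting, overfitting and well-fitted cases in that order.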

Getting your model to low bias and low variance can be pretty elusive 🦄. Nonetheless, we’ll try to solve some of the common practical problems using a realistic dataset.

Here’s another way to look at the bias-variance tradeoff (heavily inspired by the original diagram of Andrew Ng):

You’ll learn how to diagnose and fix problems when:

  • Your data has no predictive power
  • Your model is too simple to make good predictions
  • Your data brings the Curse of dimensionality
  • Your model is too complex

Run the complete code in your browser

Data

We’ll use the Heart Disease dataset provided by UCI and hosted on Kaggle. Here is the description of the data:

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.

We have 13 features and 303 rows of data. We’re using those to predict whether or not a patient has heart disease.

Let’s start with downloading and loading the data into a Pandas dataframe:

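!pip install tensorflow-gpu
!pip install gdown

import pandas as pd
from tensorflow import keras

RANDOM_SEED = 42  # assumed value; any fixed seed keeps the splits reproducible
!gdown --id 1rsxu0CKFfI-xR1pH-5JQHcfZ7MIa08Q6 --output heart.csv
df = pd.read_csv('heart.csv')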

Exploration

We’ll have a look at how well balanced the patients with and without heart disease are:

That looks pretty good. Almost no dataset will be perfectly balanced anyways. Do we have missing data?

df.isnull().values.any()
False

Nope. Let’s have a look at the correlations between the features:

Features like cp (chest pain type), exang (exercise induced angina), and oldpeak (ST depression induced by exercise relative to rest) seem to have a decent correlation with our target variable.
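If you want the ranking as numbers rather than reading it off a plot, a minimal sketch could look like this. The toy values are made up for illustration; on the real data you would call the same methods on `df`:

```python
import pandas as pd

# Toy data (made up for illustration); on the real data you would use `df`
toy = pd.DataFrame({
    "oldpeak": [0.0, 0.2, 1.4, 2.3, 3.1, 0.5],
    "trestbps": [120, 140, 130, 125, 135, 128],
    "target": [1, 1, 0, 0, 0, 1],
})

# Pearson correlation of every feature with the target
target_corr = toy.corr()["target"].drop("target")

# Order the features by absolute strength, strongest first
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Sorting by absolute value matters here, because a strong negative correlation is just as useful to a model as a strong positive one.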

Let’s have a look at the distributions of our features, starting with the most correlated to the target variable:

Seems like only oldpeak is a non-categorical feature. It appears that the data contains several features with outliers. You might want to explore those on your own, if interested :)

Underfitting

We’ll start by building a couple of models that underfit and proceed by fixing the issue in some way.

Recall that your model underfits when it makes mistakes on the training data. Here are the most common reasons for that:

  • The data features are not informative
  • Your model is too simple to predict the data (e.g. linear model predicts non-linear data)

Data with no predictive power

We’ll build a model with the trestbps (resting blood pressure) feature. Its correlation with the target variable is low: -0.14. Let’s prepare the data:

from sklearn.model_selection import train_test_split

X = df[['trestbps']]
y = df.target

X_train, X_test, y_train, y_test = \
 train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)

We’ll build a binary classifier with 2 hidden layers:

def build_classifier(train_data):
  model = keras.Sequential([
    keras.layers.Dense(
      units=32,
      activation='relu',
      input_shape=[train_data.shape[1]]
    ),
    keras.layers.Dense(units=16, activation='relu'),
    keras.layers.Dense(units=1, activation='sigmoid'),
  ])

  model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=['accuracy']
  )

  return model

And train it for 100 epochs:

BATCH_SIZE = 32

clf = build_classifier(X_train)

clf_history = clf.fit(
  x=X_train,
  y=y_train,
  shuffle=True,
  epochs=100,
  validation_split=0.2,
  batch_size=BATCH_SIZE,
  verbose=0
)

Here’s how the train and validation accuracy change during training:

Our model is flatlining. This is expected: the feature we’re using has very little predictive power.
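A quick sanity check for a flat accuracy curve is to compare it with the accuracy of always predicting the most frequent class. Here is a minimal sketch; the helper name and the labels are made up for illustration, and on the real data you would pass `y_train`:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels: 6 positives and 4 negatives
labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

If the model's accuracy sits close to this baseline, it has learned little beyond the class frequencies.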

The fix

Knowing that we’re using an uninformative feature makes it easy to fix the issue. We can use other feature(s):

X = pd.get_dummies(df[['oldpeak', 'cp']], columns=["cp"])
y = df.target

X_train, X_test, y_train, y_test = \
 train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)

And here are the results (using the same model, created from scratch):

Underpowered model

In this case, we’re going to build a regression model and try to predict the patient’s maximum heart rate (thalach) from their age.

Before starting our analysis, we’ll use MinMaxScaler from scikit-learn to scale the feature values in the 0-1 range:

from sklearn.preprocessing import MinMaxScaler

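x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()

# Separate scalers, so each one keeps its own fitted range
X = x_scaler.fit_transform(df[['age']])
y = y_scaler.fit_transform(df[['thalach']])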

X_train, X_test, y_train, y_test = \
 train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
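To make the scaling step less of a black box, here is a sketch of the arithmetic behind rescaling to the 0-1 range, written in plain NumPy. The function name and the ages are made up for illustration, and the sketch assumes the values are not all identical (otherwise it would divide by zero):

```python
import numpy as np


def min_max_scale(values):
    """Rescale a 1-D array to the 0-1 range: (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    lowest, highest = values.min(), values.max()
    return (values - lowest) / (highest - lowest)


ages = np.array([29, 45, 61, 77])  # made-up ages
scaled = min_max_scale(ages)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.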

Our model is a simple linear regression:

lin_reg = keras.Sequential([
  keras.layers.Dense(
    units=1,
    activation='linear',
    input_shape=[X_train.shape[1]]
  ),
])

lin_reg.compile(
  loss="mse",
  optimizer="adam",
  metrics=['mse']
)

Here’s the train/validation loss:

Here are the predictions from our model:

You can kinda see that a linear model might not be the perfect fit here.
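The same effect is easy to reproduce in isolation. The sketch below uses synthetic, clearly non-linear data (not the heart disease dataset) and compares the training error of a straight line with that of a quadratic curve:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: a parabola plus a little noise
x = np.linspace(-1, 1, 50)
y = x ** 2 + rng.normal(scale=0.05, size=x.shape)


def training_mse(degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coefficients = np.polyfit(x, y, deg=degree)
    predictions = np.polyval(coefficients, x)
    return float(np.mean((y - predictions) ** 2))


linear_mse = training_mse(1)
quadratic_mse = training_mse(2)
print(f"linear: {linear_mse:.4f}, quadratic: {quadratic_mse:.4f}")
```

The straight line cannot bend to follow the data, so its error stays high even on the training set. That is what high bias looks like.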

The fix

We’ll use the same training process, except that our model is going to be a lot more complex:

lin_reg = keras.Sequential([
  keras.layers.Dense(
    units=64,
    activation='relu',
    input_shape=[X_train.shape[1]]
  ),
  keras.layers.Dropout(rate=0.2),
  keras.layers.Dense(units=32, activation='relu'),
  keras.layers.Dropout(rate=0.2),
  keras.layers.Dense(units=16, activation='relu'),
  keras.layers.Dense(units=1, activation='linear'),
])

lin_reg.compile(
  loss="mse",
  optimizer="adam",
  metrics=['mse']
)

Here’s the training/validation loss:

Our validation loss is similar. What about the predictions?

Interesting, right? Our model broke from the linear-only predictions. Note that this fix included adding more parameters and increasing the regularization (using Dropout).

Overfitting

A model overfits when it predicts the training data well but performs poorly on the validation set. Here are some of the reasons for that:

  • Your data has many features but a small number of examples (curse of dimensionality)
  • Your model is too complex for the data (it memorizes noise in the training set)

Curse of dimensionality

The Curse of dimensionality refers to the problem of having too many features (dimensions), compared to the data points (examples). The most common way to solve this problem is to add more information.

We’ll use a couple of features to create our dataset:

X = df[['oldpeak', 'age', 'exang', 'ca', 'thalach']]
X = pd.get_dummies(X, columns=['exang', 'ca', 'thalach'])
y = df.target

X_train, X_test, y_train, y_test = \
  train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
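Note that one-hot encoding a numeric column such as thalach creates one new column per distinct value, which inflates the number of features quickly. The sketch below shows the effect on a toy frame with made-up values:

```python
import pandas as pd

# Made-up values standing in for the real columns
toy = pd.DataFrame({
    "oldpeak": [0.0, 1.4, 2.3, 0.5, 3.1],
    "exang": [0, 1, 1, 0, 1],
    "thalach": [150, 187, 172, 178, 163],
})

encoded = pd.get_dummies(toy, columns=["exang", "thalach"])

# One column per distinct value: 1 (oldpeak) + 2 (exang) + 5 (thalach)
print(toy.shape, "->", encoded.shape)
```

On the real data you can inspect `X.shape` after the `get_dummies` call to see how many features the model actually receives.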

Our model contains one hidden layer:

def build_classifier():

  model = keras.Sequential([
    keras.layers.Dense(
      units=16,
      activation='relu',
      input_shape=[X_train.shape[1]]
    ),
    keras.layers.Dense(units=1, activation='sigmoid'),
  ])

  model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=['accuracy']
  )

  return model

Here’s the interesting part. We’re using just a tiny bit of the data for training:

clf = build_classifier()

clf_history = clf.fit(
  x=X_train,
  y=y_train,
  shuffle=True,
  epochs=500,
  validation_split=0.95,
  batch_size=BATCH_SIZE,
  verbose=0
)
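To get a feel for how little data that leaves, here is a back-of-the-envelope sketch. It assumes that scikit-learn rounds the test set size up and that Keras keeps the first floor(n * (1 - validation_split)) rows for training; the helper name is made up for illustration:

```python
import math


def samples_after_splits(n_rows, test_size, validation_split):
    """Estimate how many rows are left for fitting the weights.

    Assumes the test set is rounded up and the validation rows are
    sliced off the end of the remaining training data.
    """
    n_test = math.ceil(n_rows * test_size)
    n_train = n_rows - n_test
    return math.floor(n_train * (1.0 - validation_split))


tiny = samples_after_splits(303, test_size=0.2, validation_split=0.95)
usual = samples_after_splits(303, test_size=0.2, validation_split=0.2)
print(tiny, usual)
```

Under these assumptions, the 0.95 split leaves 12 rows for training, compared with 193 rows for a 0.2 split.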

Here’s the result of the training:

The fix

Our solution will be pretty simple - add more data. However, you can also provide additional information via other methods (e.g. a Bayesian prior) or reduce the number of features via feature selection.

Let’s try the simple approach:

clf = build_classifier()

clf_history = clf.fit(
  x=X_train,
  y=y_train,
  shuffle=True,
  epochs=500,
  validation_split=0.2,
  batch_size=BATCH_SIZE,
  verbose=0
)

The training/validation loss looks like this:

While this is an improvement, you can see that the validation loss starts to increase after some time, a sign that the model is beginning to overfit. How can you fix this?

Too complex model

We’ll reuse the dataset but build a new model:

def build_classifier():
  model = keras.Sequential([
    keras.layers.Dense(
      units=128,
      activation='relu',
      input_shape=[X_train.shape[1]]
    ),
    keras.layers.Dense(units=64, activation='relu'),
    keras.layers.Dense(units=32, activation='relu'),
    keras.layers.Dense(units=16, activation='relu'),
    keras.layers.Dense(units=8, activation='relu'),
    keras.layers.Dense(units=1, activation='sigmoid'),
  ])

  model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=['accuracy']
  )

  return model

Here is the result:

You can see that the validation accuracy starts to decrease after epoch 25 or so.
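Instead of eyeballing the chart, you can read the peak off the recorded history. This is a minimal sketch with a made-up curve; on a real run you would pass `clf_history.history['val_accuracy']`:

```python
def best_epoch(val_accuracy):
    """Return the 0-based epoch with the highest validation accuracy."""
    return max(range(len(val_accuracy)), key=lambda epoch: val_accuracy[epoch])


# Made-up curve that peaks at epoch 3 and degrades afterwards
val_accuracy = [0.62, 0.70, 0.78, 0.81, 0.79, 0.76, 0.74]
peak = best_epoch(val_accuracy)
print(peak)
```

Every epoch trained past the peak makes the model fit the training data better while it generalizes worse.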

The Fix #1

One way to fix this would be to simplify the model. But what if you spent so much time fine-tuning it? You can see that your model is performing better at a previous stage of the training.

You can use the EarlyStopping callback to stop the training at some point:

clf = build_classifier()

early_stop = keras.callbacks.EarlyStopping(
  monitor='val_accuracy',
  patience=25
)

clf_history = clf.fit(
  x=X_train,
  y=y_train,
  shuffle=True,
  epochs=200,
  validation_split=0.2,
  batch_size=BATCH_SIZE,
  verbose=0,
  callbacks=[early_stop]
)

Here’s the new training/validation loss:

Alright, looks like the training stopped much earlier than epoch 200. Faster training and a more accurate model. Nice!
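The patience setting can be hard to picture, so here is a simplified model of the idea in plain Python. It is a sketch of the general technique, not the Keras source code, and the validation scores are made up:

```python
def stopping_epoch(val_scores, patience):
    """Return the 0-based epoch at which training would stop.

    Simplified patience-based early stopping: stop once the monitored
    score has failed to improve for `patience` epochs in a row.
    """
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_scores) - 1  # never triggered: ran all epochs


# Made-up validation accuracies that peak at epoch 2
val_scores = [0.60, 0.70, 0.80, 0.79, 0.78, 0.77, 0.76]
stopped_at = stopping_epoch(val_scores, patience=2)
print(stopped_at)
```

Training stops a few epochs after the peak, so the final weights are slightly past their best. The Keras EarlyStopping callback accepts a `restore_best_weights` argument for rolling back to the best epoch.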

The Fix #2

Another approach to fixing this problem is regularization. Regularization is a set of methods that force the model to be less complex. Usually, you get higher bias (fewer correct predictions on the training data) but reduced variance (higher accuracy on the validation dataset).

One of the most common ways to regularize neural networks is Dropout.

Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term “dropout” refers to dropping out units (both hidden and visible) in a neural network.
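To make "dropping out units" concrete, here is a NumPy sketch of the commonly used inverted dropout variant. It illustrates the idea under that assumption and is not how any particular library implements it:

```python
import numpy as np


def dropout(activations, rate, rng):
    """Inverted dropout: zero out roughly a `rate` fraction of units at
    random and scale the survivors so the expected activation is unchanged."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(42)
activations = np.ones((4, 5))
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

Each unit is either zeroed or scaled up by 1 / (1 - rate). Dropout is only applied during training; at prediction time all units stay active.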

Using Dropout in Keras is really easy:

model = keras.Sequential([
  keras.layers.Dense(
    units=128,
    activation='relu',
    input_shape=[X_train.shape[1]]
  ),
  keras.layers.Dropout(rate=0.2),
  keras.layers.Dense(units=64, activation='relu'),
  keras.layers.Dropout(rate=0.2),
  keras.layers.Dense(units=32, activation='relu'),
  keras.layers.Dropout(rate=0.2),
  keras.layers.Dense(units=16, activation='relu'),
  keras.layers.Dropout(rate=0.2),
  keras.layers.Dense(units=8, activation='relu'),
  keras.layers.Dense(units=1, activation='sigmoid'),
])

model.compile(
  loss="binary_crossentropy",
  optimizer="adam",
  metrics=['accuracy']
)

Here’s how the training process has changed:

The validation accuracy seems very good. Note that the training accuracy is down (we have a higher bias). There you have it, two ways to solve one issue!

Conclusion

Well done! You now have the toolset for dealing with the most common problems related to high bias or high variance. Here’s a summary:

  • Your data has no predictive power - use different data
  • Your model is too simple to make good predictions - use a model with more parameters
  • Your data brings the Curse of dimensionality - use more data, reduce the number of features or use a Bayesian prior to provide more information
  • Your model is too complex - use Early Stopping or Regularization to force creating a simpler model

