Linear Regression with TensorFlow.js | Deep Learning for JavaScript Hackers (Part II)

TL;DR Build a Linear Regression model in TensorFlow.js to predict house prices. Learn how to handle categorical data and do feature scaling.

Raining again. It has been 3 weeks since the last time you saw the sun. You’re getting tired of the cold and the unpleasant feeling of loneliness and melancholy.

The voice in your head is getting louder and louder.

- “MOVE”.

Alright, you’re ready to do it. Where to? You remember that you’re nearly broke.

A friend of yours told you about this place, Ames, Iowa, and it stuck in your head. After a quick search, you found that the weather is pleasant throughout the year and there is some rain, but not much. Excitement!

Fortunately, you know of this dataset on Kaggle that might help you find out how much your dream house might cost. Let’s get to it!

Run the complete source code for this tutorial right in your browser:

House prices data

Our data comes from Kaggle’s House Prices: Advanced Regression Techniques challenge.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Here’s a subset of the data we’re going to use for our model:

  • OverallQual - Rates the overall material and finish of the house (1 - 10)
  • GrLivArea - Above grade (ground) living area square feet
  • GarageCars - Size of garage in car capacity
  • TotalBsmtSF - Total square feet of basement area
  • FullBath - Full bathrooms above grade
  • YearBuilt - Original construction date
  • SalePrice - The property’s sale price in dollars (we’re trying to predict this)

Let’s use Papa Parse to load the training data:

const prepareData = async () => {
  const csv = await Papa.parsePromise(
    "https://raw.githubusercontent.com/curiousily/Linear-Regression-with-TensorFlow-js/master/src/data/housing.csv"
  );

  return csv.data;
};
const data = await prepareData();
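
Papa Parse’s documented entry point is the callback-based Papa.parse(), so Papa.parsePromise above is presumably a small helper defined elsewhere in the tutorial’s source. A minimal sketch of what such a helper could look like follows. The name makeParsePromise and the chosen options are assumptions, not the author’s code:

```javascript
// Hypothetical helper: wraps the callback-style parse() in a Promise.
// `papa` is passed in so the sketch does not rely on a global;
// in the browser you would pass the global Papa object.
const makeParsePromise = papa => url =>
  new Promise((resolve, reject) => {
    papa.parse(url, {
      download: true, // treat the first argument as a URL to fetch
      header: true, // use the first row as the keys of each row object
      dynamicTyping: true, // convert numeric strings to numbers
      skipEmptyLines: true, // ignore blank lines in the file
      complete: resolve, // resolves with { data, errors, meta }
      error: reject
    });
  });

// Usage in the browser (assumes Papa Parse is already loaded):
// Papa.parsePromise = makeParsePromise(Papa);
```

The header and dynamicTyping options matter here, because the rest of the tutorial reads fields such as r.SalePrice as numbers from row objects.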

Exploration

Let’s build a better understanding of our data. First - the quality score of each house:

Most houses are of average quality, but there are more “good” than “bad” ones.

Let’s see how large they are (that’s what she said):

Most of the houses are within the 1,000 - 2,000 square feet range, and some are bigger.

Let’s have a look at the year they were built:

A lot of the houses were built recently, but the construction years are spread over a much wider range.

How related is the year with the price?

Seems like newer houses are pricier, no love for the old and well made then?

Oh ok, but higher quality should equal higher price, right?

Generally yes, but look at quality 10. Some of those are relatively cheap. Any ideas why that might be?

Does a larger house mean a higher price?

Seems like it. We might start our price prediction model using the living area!

Linear Regression

Linear Regression models assume that there is a linear relationship (one that can be modeled using a straight line) between a continuous dependent variable Y and one or more explanatory (independent) variables X.

In our case, we’re going to use features like living area (X) to predict the sale price (Y) of a house.

Simple Linear Regression

Simple Linear Regression is a model that has a single independent variable X. It is given by:

$$Y = bX + a$$

where a and b are parameters, learned during the training of our model. X is the data we’re going to use to train our model, b controls the slope and a is the intercept with the y axis.
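
To make the formula concrete, here is a minimal sketch in plain JavaScript. The values of a and b are made up for illustration. They are not parameters learned from the Ames data:

```javascript
// Simple Linear Regression prediction: Y = bX + a
const predictSimple = (x, a, b) => b * x + a;

// Hypothetical parameters, chosen only for illustration
const a = 20000; // intercept: predicted price at zero living area
const b = 100; // slope: price increase per extra square foot

console.log(predictSimple(1500, a, b)); // 170000
```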

Multiple Linear Regression

A natural extension of the Simple Linear Regression model is the multivariate one. It is given by:

$$Y(x_1, x_2, \ldots, x_n) = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + w_0$$

where $x_1, x_2, \ldots, x_n$ are features from our dataset and $w_1, w_2, \ldots, w_n$ are learned parameters.
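
The multivariate formula is a weighted sum of the features plus the bias term. A minimal sketch in plain JavaScript, with hypothetical weights that were not learned from the dataset:

```javascript
// Multiple Linear Regression prediction:
// Y = w_1 * x_1 + ... + w_n * x_n + w_0
const predictMultiple = (features, weights, bias) =>
  features.reduce((sum, x, i) => sum + weights[i] * x, bias);

// Hypothetical weights for [GrLivArea, GarageCars], for illustration only
const weights = [100, 5000];
const bias = 20000; // w_0

console.log(predictMultiple([1500, 2], weights, bias)); // 180000
```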

Loss function

We’re going to use Root Mean Squared Error to measure how far our predictions are from the real house prices. It is given by:

$$RMSE = J(W) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - h_w(x^{(i)})\right)^2}$$

where m is the number of examples and the hypothesis/prediction $h_w$ is given by:

$$h_w(x) = g(w^T x)$$

For Linear Regression, g is the identity function, so the prediction is simply $w^T x$.
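
As a minimal sketch, the RMSE formula can be written in plain JavaScript as follows. The rmse helper and the sample prices are illustrative assumptions, not part of the tutorial’s source:

```javascript
// Root Mean Squared Error between true values and predictions
const rmse = (yTrue, yPred) => {
  if (yTrue.length !== yPred.length || yTrue.length === 0) {
    throw new Error("Inputs must be non-empty and of equal length");
  }
  const sumSquaredErrors = yTrue.reduce(
    (sum, y, i) => sum + (y - yPred[i]) ** 2,
    0
  );
  return Math.sqrt(sumSquaredErrors / yTrue.length);
};

// Made-up prices, for illustration only
console.log(rmse([200000, 150000], [210000, 140000])); // 10000
```

Because the error is squared before averaging, a few large misses raise the RMSE more than many small ones.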

Data Preprocessing

Currently, our data sits in an array of JS objects. We need to turn it into Tensors and use it for training our model(s). Here is the code for that:

const createDataSets = (data, features, categoricalFeatures, testSize) => {
  // Build one flat numeric row per house; missing values become 0
  const X = data.map(r =>
    features.flatMap(f => {
      if (categoricalFeatures.has(f)) {
        // Expand a categorical value into its one-hot vector
        return oneHot(!r[f] ? 0 : r[f], VARIABLE_CATEGORY_COUNT[f]);
      }
      return !r[f] ? 0 : r[f];
    })
  );

  // Scale the feature values into the [0, 1] range
  const X_t = normalize(tf.tensor2d(X));

  // Labels: the sale price of each house
  const y = tf.tensor(data.map(r => (!r.SalePrice ? 0 : r.SalePrice)));

  // The first (1 - testSize) share of the rows is used for training
  const splitIdx = parseInt((1 - testSize) * data.length, 10);

  const [xTrain, xTest] = tf.split(X_t, [splitIdx, data.length - splitIdx]);
  const [yTrain, yTest] = tf.split(y, [splitIdx, data.length - splitIdx]);

  return [xTrain, xTest, yTrain, yTest];
};

We store our features in X and the labels in y. Then we convert the data into Tensors and split it into training and testing datasets.
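
The split index arithmetic can be sketched on plain arrays. splitRows is a hypothetical helper for illustration. Like the tf.split() calls above, it takes the first rows for training and the remaining ones for testing, without shuffling:

```javascript
// Sequential train/test split: the first (1 - testSize) share of the rows
// goes to training, the rest to testing
const splitRows = (rows, testSize) => {
  const splitIdx = Math.floor((1 - testSize) * rows.length);
  return [rows.slice(0, splitIdx), rows.slice(splitIdx)];
};

const [train, test] = splitRows([1, 2, 3, 4, 5, 6, 7, 8], 0.25);
console.log(train.length, test.length); // 6 2
```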

Categorical features

Some of the features in our dataset are categorical/enumerable. For example, GarageCars takes one of five values (0 - 4).

Leaving categories represented as integers in our dataset might introduce an implicit ordering, something that does not exist for categorical variables.

We’ll use one-hot encoding from TensorFlow to create an integer vector for each value to break the ordering. First, let’s specify how many different values each category has:

const VARIABLE_CATEGORY_COUNT = {
  OverallQual: 10,
  GarageCars: 5,
  FullBath: 4
};

We’ll use tf.oneHot() to convert an individual value to a one-hot representation:

const oneHot = (val, categoryCount) =>
  Array.from(tf.oneHot(val, categoryCount).dataSync());

Note that the createDataSets() function accepts a parameter called categoricalFeatures which should be a set. We’ll use this to check whether or not we should process this feature as categorical.
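
To see what the encoding produces, here is a plain-JavaScript sketch of the same idea. oneHotPlain is a hypothetical helper for illustration only and is not part of TensorFlow.js. A GarageCars value of 2 with 5 categories becomes a vector with a single 1 at index 2:

```javascript
// Plain-JavaScript one-hot encoding: a vector of zeros with a single 1
// at the position given by `index`
const oneHotPlain = (index, categoryCount) =>
  Array.from({ length: categoryCount }, (_, i) => (i === index ? 1 : 0));

console.log(oneHotPlain(2, 5)); // [ 0, 0, 1, 0, 0 ]
```

Each category gets its own position, so the model sees no numeric distance between, say, 1 and 4 garage spots.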

Feature scaling

Feature scaling is used to transform the feature values into a (similar) range. It will help our model(s) learn faster since we’re using Gradient Descent to train them.

Let’s use one of the simplest methods for feature scaling - min-max normalization:

const normalize = tensor =>
  tf.div(
    tf.sub(tensor, tf.min(tensor)),
    tf.sub(tf.max(tensor), tf.min(tensor))
  );

This method rescales the values into the [0, 1] range.
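
Note that normalize() calls tf.min() and tf.max() without an axis argument, so a single minimum and maximum are taken over the whole tensor. A common variant scales each feature (column) by its own minimum and maximum instead. Here is a minimal sketch of that variant on plain arrays. normalizeColumns is a hypothetical helper, not the tutorial’s code, and mapping constant columns to 0 is an assumption made to avoid dividing by zero:

```javascript
// Min-max normalization applied to each feature (column) separately.
// `rows` is an array of equal-length numeric arrays, one per house.
const normalizeColumns = rows => {
  const columnCount = rows[0].length;
  const mins = Array(columnCount).fill(Infinity);
  const maxs = Array(columnCount).fill(-Infinity);

  // First pass: find the minimum and maximum of every column
  for (const row of rows) {
    row.forEach((value, i) => {
      mins[i] = Math.min(mins[i], value);
      maxs[i] = Math.max(maxs[i], value);
    });
  }

  // Second pass: rescale every value using its own column's range
  return rows.map(row =>
    row.map((value, i) => {
      const range = maxs[i] - mins[i];
      // A constant column has zero range; map it to 0 instead of dividing by 0
      return range === 0 ? 0 : (value - mins[i]) / range;
    })
  );
};

// Made-up [GrLivArea, YearBuilt] rows, for illustration only
console.log(normalizeColumns([[1000, 1950], [1500, 2000], [2000, 1975]]));
// [ [ 0, 0 ], [ 0.5, 1 ], [ 1, 0.5 ] ]
```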

Predicting house prices

Now that we know about the Linear Regression model(s), we can try to predict house prices based on the data we have. Let’s start simple:

Building a Simple Linear Regression model

We’ll wrap the training process in a function that we can reuse for our future model(s):

const trainLinearModel = async (xTrain, yTrain) => {
  ...
}

trainLinearModel accepts the features and labels for our model and returns the trained model. Let’s define a Linear Regression model using TensorFlow.js:

const model = tf.sequential();

model.add(
  tf.layers.dense({
    inputShape: [xTrain.shape[1]],
    units: xTrain.shape[1]
  })
);

model.add(tf.layers.dense({ units: 1 }));

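Neither layer uses an activation function, so the two stacked layers still compute a linear function of the inputs.

Since TensorFlow.js doesn’t offer an RMSE loss function, we’ll use MSE and take the square root of that later. We’ll also track the Mean Absolute Error (MAE) between the predictions and the real prices: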

model.compile({
  optimizer: tf.train.sgd(0.001),
  loss: "meanSquaredError",
  metrics: [tf.metrics.meanAbsoluteError]
});

Here’s the training process:

const trainLogs = [];
const lossContainer = document.getElementById("loss-cont");
const accContainer = document.getElementById("acc-cont");

await model.fit(xTrain, yTrain, {
  batchSize: 32,
  epochs: 100,
  shuffle: true,
  validationSplit: 0.1,
  callbacks: {
    onEpochEnd: async (epoch, logs) => {
      trainLogs.push({
        rmse: Math.sqrt(logs.loss),
        val_rmse: Math.sqrt(logs.val_loss),
        mae: logs.meanAbsoluteError,
        val_mae: logs.val_meanAbsoluteError
      });
      tfvis.show.history(lossContainer, trainLogs, ["rmse", "val_rmse"]);
      tfvis.show.history(accContainer, trainLogs, ["mae", "val_mae"]);
    }
  }
});

We train for 100 epochs, shuffle the training data before each epoch, and use 10% of it for validation. The RMSE and MAE are visualized after each epoch.

Training

Our Simple Linear Regression model is using the GrLivArea feature:

const [xTrainSimple, xTestSimple, yTrainSimple, yTestIgnored] = createDataSets(
  data,
  ["GrLivArea"],
  new Set(),
  0.1
);

const simpleLinearModel = await trainLinearModel(xTrainSimple, yTrainSimple);

We don’t have categorical features, so we leave that set empty. Let’s have a look at the performance:

Building a Multiple Linear Regression model

We have a lot more data we haven’t used yet. Let’s see if that will help improve the predictions:

const features = [
  "OverallQual",
  "GrLivArea",
  "GarageCars",
  "TotalBsmtSF",
  "FullBath",
  "YearBuilt"
];

const categoricalFeatures = new Set(["OverallQual", "GarageCars", "FullBath"]);

const [xTrain, xTest, yTrain, yTest] = createDataSets(
  data,
  features,
  categoricalFeatures,
  0.1
);

We use all six features from our subset and pass a set of the categorical ones. Did we do better?

Overall, both models are performing at about the same level. This time, increasing the model complexity didn’t give us better accuracy.

Evaluation

Another way to evaluate our models is to check their predictions against the test data. Let’s start with the Simple Linear Regression:

Did adding more features improve the predictions?

Well, it didn’t. Again, a more complex model trained with more features didn’t provide better performance.
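
The comparison code is not shown here. A minimal sketch of how predictions could be paired with the real test prices follows. comparePredictions is a hypothetical helper, and it only assumes that model.predict() returns an object with dataSync(), as a trained TensorFlow.js model does:

```javascript
// Pair each test prediction with the real price.
// Assumes `model.predict()` returns a tensor-like object with dataSync()
// and that `yTest` is a tensor holding the real prices.
const comparePredictions = (model, xTest, yTest) => {
  const predicted = Array.from(model.predict(xTest).dataSync());
  const actual = Array.from(yTest.dataSync());
  return actual.map((price, i) => ({
    actual: price,
    predicted: predicted[i],
    error: predicted[i] - price // positive when the model overestimates
  }));
};

// Usage with the models trained above:
// comparePredictions(simpleLinearModel, xTestSimple, yTest);
```

The resulting array of objects can be plotted or fed into an error metric such as RMSE.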

Conclusion

You did it! You built two Linear Regression models that predict house prices based on a set of features. You also:

  • Scaled the features for faster model training
  • Converted categorical variables into one-hot representations
  • Implemented RMSE (based on MSE) for accuracy evaluation

Is it time to learn about Neural Networks?
