All Articles

Build a Decision Tree from Scratch in Python | Machine Learning from Scratch (Part III)

TL;DR Build a Decision Tree regression model using Python from scratch. Compare the performance of your model with that of a Scikit-learn model. The Decision Tree is used to predict house sale prices and send the results to Kaggle.

I am sorry, you might be losing sleep. Deep down you know your Linear Regression model ain’t gonna cut it. That housing market domination is still further down the road.

Can we improve that, can we have a model that makes better predictions?

Complete source code notebook on Google Colaboratory

The Data

Once again, we’re going to use the Kaggle data: “House Prices: Advanced Regression Techniques”. It contains 1460 training data points and 80 features that might help us predict the selling price of a house.

Decision Trees

Decision tree models build structures like this:

source: https://www.xoriant.com

The algorithms for building trees breaks down a data set into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. Leaf node represents a classification or decision (used for regression). The topmost decision node in a tree which corresponds to the best predictor (most important feature) is called a root node.

Decision trees can handle both categorical and numerical data. They are used for classification and regression problems. They can handle missing data pretty well, too!

Data Preprocessing

We’re going to use the same data we used with the Linear Regression model. However, we’re not going to do any scaling, just because we’re lazy (or it is not needed):

X = df_train[['OverallQual', 'GrLivArea', 'GarageCars']]
y = df_train['SalePrice']

Cost function

We’re going to use a new cost function — Root Mean Square Error (RMSE). It is the standard deviation of how far from the regression line data points are. In other words, it tells you how concentrated the data is around the line of best fit.

RMSE is given by the formula:

RMSE=1mi=1m(y(i)h(x(i)))2RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - h(x^{(i)}))^2}

We’ve already implemented MSE in previous parts, so we’re going to import an implementation here, in the name of readability (or holy laziness?):

from sklearn.metrics import mean_squared_error
from math import sqrt

def rmse(h, y):
  return sqrt(mean_squared_error(h, y))

Using a prebuild Decision Tree model

Let use Decision Tree regressor from the scikit-learn library to get a quick feel of the model:

from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(
  n_estimators=1,
  max_depth=2,
  bootstrap=False,
  random_state=RANDOM_SEED
)
reg.fit(X, y)

We are using RandomForestRegressor with 1 estimator, which basically means we’re using a Decision Tree model. Here is the tree structure of our model:

You should receive the exact same model (if you’re running the code) since we are setting the random state. How many features did the model use?

Now that this model is ready to be used let’s evaluate its score:

preds = reg.predict(X)
metrics.r2_score(y, preds)
0.6336246655552089

The R2R^2 statistic gives us information about the goodness of fit of the model. A score of 1 indicates perfect fit. Let’s have a look at the RMSE:

48069.23940764968

Building your own Decision Tree

class DecisionTreeRegressor:

  def fit(self, X, y, min_leaf = 5):
    self.dtree = Node(X, y, np.array(np.arange(len(y))), min_leaf)
    return self

  def predict(self, X):
    return self.dtree.predict(X.values)

Let’s start implementing our Node helper class:

class Node:

    def __init__(self, x, y, idxs, min_leaf=5):
        self.x = x
        self.y = y
        self.idxs = idxs
        self.min_leaf = min_leaf
        self.row_count = len(idxs)
        self.col_count = x.shape[1]
        self.val = np.mean(y[idxs])
        self.score = float('inf')
        self.find_varsplit()

Trees are recursive data structures and we’re going to take full advantage of that. Our Node class represents one decision point in our model. Each division within the model has 2 possible outcomes for finding a solution — go to the left or go to the right. That decision point also divides our data into two sets.

The property idxs stores indexes of the subset of the data that this Node is working with.

The decision (prediction) is based on the value that Node holds. To make that prediction we’re just going to take the average of the data of the dependent variable for this Node.

The method find_varsplit finds where should we split the data. Let’s have a look at it:

def find_varsplit(self):
    for c in range(self.col_count): self.find_better_split(c)
    if self.is_leaf: return
    x = self.split_col
    lhs = np.nonzero(x <= self.split)[0]
    rhs = np.nonzero(x > self.split)[0]
    self.lhs = Node(self.x, self.y, self.idxs[lhs], self.min_leaf)
    self.rhs = Node(self.x, self.y, self.idxs[rhs], self.min_leaf)

First, we try to find a better feature to split on. If no such feature is found (we’re at a leaf node) we do nothing. Then we use the split value found by findbettersplit, create the data for the left and right nodes and create each one using a subset of the data.

Here are the split_col and is_leaf implementations:

@property
def split_col(self): return self.x.values[self.idxs,self.var_idx]

@property
def is_leaf(self): return self.score == float('inf')

It is time to implement the workhorse of our algorithm find_better_split method:

def find_better_split(self, var_idx):
    x = self.x.values[self.idxs, var_idx]

    for r in range(self.row_count):
        lhs = x <= x[r]
        rhs = x > x[r]
        if rhs.sum() < self.min_leaf or lhs.sum() < self.min_leaf: continue

        curr_score = self.find_score(lhs, rhs)
        if curr_score < self.score:
            self.var_idx = var_idx
            self.score = curr_score
            self.split = x[r]

def find_score(self, lhs, rhs):
    y = self.y[self.idxs]
    lhs_std = y[lhs].std()
    rhs_std = y[rhs].std()
    return lhs_std * lhs.sum() + rhs_std * rhs.sum()

We are trying to split on each data point and let the best split wins.

We’re going to create our split such that it has as low standard deviation as possible. We find the split that minimizes the weighted averages of the standard deviations which is equivalent to minimizing RMSE.

If we find a better split we store the following information: index of the variable, split score and value of the split.

The score is a metric that tells us how effective the split was (note that leaf nodes do not have scores, so it will be infinity). The method find_score calculates a weighted average of the data. If the score is lower than the previous we have a better split. Note that the score is initially set to infinity -> only leaf nodes and really shallow trees (and Thanos) have a score of infinity.

Finally, let’s look at how we use all this to make predictions:

def predict(self, x):
    return np.array([self.predict_row(xi) for xi in x])

def predict_row(self, xi):
    if self.is_leaf: return self.val
    node = self.lhs if xi[self.var_idx] <= self.split else self.rhs
    return node.predict_row(xi)

Once again, we’re exploiting the recursive nature of life. Starting at the tree root, predict_row checks if we need to go left or right node based on the split value we found. The recursion ends once we hit a leaf node. At that point, the answer/prediction is stored in the val property.

Here is the complete source of our Node class:

class Node:

    def __init__(self, x, y, idxs, min_leaf=5):
        self.x = x
        self.y = y
        self.idxs = idxs
        self.min_leaf = min_leaf
        self.row_count = len(idxs)
        self.col_count = x.shape[1]
        self.val = np.mean(y[idxs])
        self.score = float('inf')
        self.find_varsplit()

    def find_varsplit(self):
        for c in range(self.col_count): self.find_better_split(c)
        if self.is_leaf: return
        x = self.split_col
        lhs = np.nonzero(x <= self.split)[0]
        rhs = np.nonzero(x > self.split)[0]
        self.lhs = Node(self.x, self.y, self.idxs[lhs], self.min_leaf)
        self.rhs = Node(self.x, self.y, self.idxs[rhs], self.min_leaf)

    def find_better_split(self, var_idx):

        x = self.x.values[self.idxs, var_idx]

        for r in range(self.row_count):
            lhs = x <= x[r]
            rhs = x > x[r]
            if rhs.sum() < self.min_leaf or lhs.sum() < self.min_leaf: continue

            curr_score = self.find_score(lhs, rhs)
            if curr_score < self.score:
                self.var_idx = var_idx
                self.score = curr_score
                self.split = x[r]

    def find_score(self, lhs, rhs):
        y = self.y[self.idxs]
        lhs_std = y[lhs].std()
        rhs_std = y[rhs].std()
        return lhs_std * lhs.sum() + rhs_std * rhs.sum()

    @property
    def split_col(self): return self.x.values[self.idxs,self.var_idx]

    @property
    def is_leaf(self): return self.score == float('inf')

    def predict(self, x):
        return np.array([self.predict_row(xi) for xi in x])

    def predict_row(self, xi):
        if self.is_leaf: return self.val
        node = self.lhs if xi[self.var_idx] <= self.split else self.rhs
        return node.predict_row(xi)

Evaluation

Let’s check how your Decision Tree regressor does on the training data:

regressor = DecisionTreeRegressor().fit(X, y)
preds = regressor.predict(X)

Here is the R2R^2 score:

0.8504381072711565

Our scikit-learn model gave us a score of 0.6336246655552089

The RMSE score is:

30712.460628635836

The scikit-learn regressor gave us 48069.23940764968

Looks like your model is doing pretty good, eh? Let’s make a prediction on the test data and send it to Kaggle.

Sending your predictions to Kaggle

Let make our predictions and format the data as requested on Kaggle:

X_test = df_test[['OverallQual', 'GrLivArea', 'GarageCars']]
pred_test = regressor.predict(X_test)

submission = pd.DataFrame({'Id': df_test.Id, 'SalePrice': pred_test})
submission.to_csv('submission.csv', index=False)

Feel free to submit your csv file to Kaggle. Also, how can you improve?

Complete source code notebook on Google Colaboratory

Conclusion

Give yourself a treat, you just implemented a Decision Tree regressor!

Would it be possible to implement Random Forest regressor on top of your model? How that affects your Kaggle scores?

In the next part, you’re gonna do some unsupervised learning with k means!

Acknowledgments

The source code presented in this part is heavily inspired by the excellent course of fast.aiIntroduction to Machine Learning for Coders