RuleFit2

## Summary¶

This post will be longer than the last one, so here are the major take-aways if don't want to read everything:

• RuleFit outperforms Linear Regression and Decision Trees
• RuleFit can keep up with other Tree-Ensembles like Random Forest or Gradient Boosting (at the price of interpretability)
• Nevertheless, the comparison took place on only one dataset and a plain train-test split so there might be many other situations where RuleFits will clearly underperform

## Introduction¶

In my last post I gave a short introduction to the RuleFit Algorithm. Now I want to construct an actual use case for RuleFit and how its benefits can be fully exploited. The dataset will be the same as last time (House Sales in King County).

## Preliminaries¶

The purpose of this post is to show a possible implementation of RuleFit in Python and compare its performance to other models. I am not aiming for the lowest test-error possible and thus there won't be any feature engineering except for dropping some unnecessary variables. The target variable is, again, 'price'.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

import matplotlib.pyplot as plt

from copy import deepcopy

In [2]:
#loading and transforming data

df.drop(["id", "date", "zipcode"], 1, inplace = True)

X = df.drop("price",1)
y = df["price"]


Out[2]:
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 0 0 3 7 1180 0 1955 0 47.5112 -122.257 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 0 0 3 7 2170 400 1951 1991 47.7210 -122.319 1690 7639
2 180000.0 2 1.00 770 10000 1.0 0 0 3 6 770 0 1933 0 47.7379 -122.233 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 0 0 5 7 1050 910 1965 0 47.5208 -122.393 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 0 0 3 8 1680 0 1987 0 47.6168 -122.045 1800 7503
In [3]:
print(len(df))

21613

In [4]:
#simple train-test split
np.random.seed(123)

df_trainX, df_testX, df_trainY, df_testY = train_test_split(X, y, test_size=0.5, random_state=42)


## The RuleFit class¶

Below is my implementation of RuleFit. Except for the scaling part (lines 25-31) that I did not mention in the last post and the GradientBoostingRegressor for actual rule generation, the concept stays the same. Thus, you don't have to go through the whole block to understand what is happening under the hood.

If you have any questions or find any errors in the code (which I hope you don't :-)), feel free to write me a message or leave a comment.

In [5]:
class RuleFit:

'''
I wanted to keep this class as flexible as possible and so both the rule generating algorithm
and the linear model are replaceable by other sklearn-like models. The standard models are
Gradient Boosting and Ridge Regression - after some testing I found Ridge Regression to be slightly
better than Lasso in terms of accuracy. However, the Lasso Regression will make it much easier to
obtain interpretable results since most irrelavant rules will have close-to-zero relevance.
'''

def __init__(self,
linear_model = Ridge(alpha = 100, max_iter = 1000),
scale_rules = True):

'''
This list will be filled with the single Decision Trees created by the rule generator
for easier access of the individual Trees
'''
self.trees = []

'''The Friedman paper on RuleFit suggests scaling the rules for better results (item 3.2) -
while I got better results applying the original algorithm with Lasso and scaling,
leaving out the scaling works better with Ridge Regression'''
self.scale_rules = scale_rules
if scale_rules:
self.rule_support = []
self.scales = []

self.rule_generator = rule_generator
self.linear_model = linear_model

'''This list will make it easier to track the amount of rules/nodes in each tree'''
self.n_nodes_tree = []

'''Another Linear model - this will be used when only the most predictive rules
are being used out of the thousands of other rules generated by the rule generator'''
self.linear_model_best_rules = None

def fit(self, X, y):

'''In case of re-fitting we want the lists to be refreshed'''
self.trees = []
self.n_nodes_tree = []

if self.scale_rules:
self.rule_support = []
self.scales = []

'''For tree/rule visualizations we need the feature names of our dataset'''
self.feature_names = X.columns.tolist()

'''each tree will make unique predictions, this list will keep all predicted
dummy-matrices'''
predictionFrame = []

'''now we can actually generate the rules'''
rules_generation = self.rule_generator.fit(X,y)

for tree in range(len(rules_generation.estimators_)):

'''for some reason sklearn GradientBoosting and RandomForest will save the trees
in a different structure (list of trees VS. list of list of trees) this part will
account for that and work for both structures'''
cur_tree = rules_generation.estimators_[tree]
try:
cur_tree = cur_tree[0]
except:
pass

'''now we save each tree, tree decision path(the rules we'll use) and the amount of
rules generated(for later rule evaluation)'''
self.trees.append(cur_tree)

predictionFrame.append(pd.DataFrame(self.trees[tree].decision_path(
X).toarray()).iloc[:,1:-1])

'''scaling the rules according to paper'''
if self.scale_rules:
self.rule_support.append(predictionFrame[tree].mean(0))
self.scales.append(np.sqrt(self.rule_support[tree] * (1-self.rule_support[tree])))
predictionFrame[tree]/=self.scales[tree]

self.n_nodes_tree.append(predictionFrame[tree].shape[1])

'''the last step of fitting is combining the 0/1 binary matrices with the original data
and regress the y against the combined features'''
X_comb = pd.concat([X.reset_index(drop=True),
pd.concat(predictionFrame,1).reset_index(drop=True)], 1)

self.n_cols_data = X.shape[1] - 1

self.linear_model.fit(X_comb,y)

def predict(self, X):

'''pretty similar to the fit-procedure: predict if observation matches each rule into a
binary matrix, combine binary matrix with original features and use linear model to predict'''
predictionFrame = []

for tree in range(len(self.trees)):

predictionFrame.append(pd.DataFrame(self.trees[tree].decision_path(
X).toarray()).iloc[:,1:-1])

if self.scale_rules:
predictionFrame[tree]/=self.scales[tree]

X_comb = pd.concat([X.reset_index(drop=True),pd.concat(predictionFrame,1).fillna(0)],1)

result = self.linear_model.predict(X_comb)

return result

'''some methods to easily access the coefficients from the linear model'''

def get_all_coefficients(self):
return self.linear_model.coef_

'''returns only coefficients of individual rules'''
def get_tree_coefficients(self):
return self.linear_model.coef_[(self.n_cols_data+1):]

'''returns only coefficients of the original features'''
def get_linear_coefficients(self):
return self.linear_model.coef_[:self.n_cols_data]

'''if scale_rules is set to True, the rule scaling will be applied - see Friedman paper
for details (item 6 in their paper). Otherwise the rules will be ranked via their absolute
coefficients Ranking from best to worst'''
def rank_rules(self):

coeffs_tree = self.get_tree_coefficients()

relative_importance = np.abs(coeffs_tree)

if self.scale_rules:
relative_importance *= pd.concat(self.scales).reset_index(drop=True)

'''the first output element are the actual importances of each rule (scaled/unscaled),
while the second element is the ranking from most relevant to least relevant rule'''
return relative_importance, np.argsort(relative_importance)[::-1]

'''After applying the above fit-predict methods, we will have some 1000s of rules that are
being used in our model. This will still be hard to interpret for a human user who wants
to make sense of the algorithm's output. Once we got an initial ensemble of rules to find
the important ones, we can fit another linear model using only a subset of the n best
rules (n_rules). That way the number of parameters in the model can be reduced to an amount that
is much easier to interpret and evaluate. The two methods below will make that possible'''

'''Fit another linear model using only the 'n_rules' most important rules, obtained via the
rank_rules method. This should be used with Lasso and scaled rules like in the original paper'''
def fit_best_n_rules(self, X,y,n_rules = 20):

if len(self.trees)>0:

self.best_rules = self.rank_rules()[1][:n_rules]

predictionFrame = []

for tree in range(len(self.trees)):

predictionFrame.append(pd.DataFrame(self.trees[tree].decision_path(
X).toarray()).iloc[:,1:-1])

if self.scale_rules:
predictionFrame[tree]/=self.scales[tree]

X_comb = pd.concat([X.reset_index(drop=True),
pd.concat(predictionFrame,1).fillna(0).iloc[:,self.best_rules]],1)

'''the amount of parameters in the final model should be much lower than
in the penalized one, so Linear Regression can be applied here'''
self.linear_model_best_rules = LinearRegression()

self.linear_model_best_rules.fit(X_comb, y)

else:
AssertionError("No model was fit yet!")

def predict_best_n_rules(self, X):

if self.linear_model_best_rules is not None:

predictionFrame = []

for tree in range(len(self.trees)):

predictionFrame.append(pd.DataFrame(self.trees[tree].decision_path(
X).toarray()).iloc[:,1:-1])

if self.scale_rules:
predictionFrame[tree]/=self.scales[tree]

X_comb = pd.concat([X.reset_index(drop=True),
pd.concat(predictionFrame,1).fillna(0).iloc[:,self.best_rules]],1)

result = self.linear_model_best_rules.predict(X_comb)

else:
AssertionError(
"Model has not been trained yet"
)

return result

'''Finally, we want to inspect the rules that our model used - the following two functions
are actually just helper functions that shouldn't be called on their own but only through
the get_rule method at the end. You don't have to fully understand what's happening in here
but feel free to figure things out or write message me if you want more details'''

'''this method will apply some logic to match the correct tree and node number to a user
given rule-number (which ranges from 0 to the total amount of rules - 1)'''
def calculate_node(self,node_num):

select = node_num
tree = np.where(np.cumsum(self.n_nodes_tree)>=select)[0][0]

node_sub = np.maximum(0,np.cumsum(self.n_nodes_tree)[tree]-np.cumsum(self.n_nodes_tree)[0])

if node_sub > 0:
node_result = select - node_sub -1
else:
node_result = select - node_sub

node_result += 1

return tree, node_result

'''once we got the correct tree and node-number, this method will print out the correct
decision path up to that node. It's recursively searching through the tree structure until
it finds the correct node number'''
def find_node(self,tree_, current_node, search_node, features):

child_left = tree_.children_left[current_node]
child_right = tree_.children_right[current_node]

split_feature = str(features[tree_.feature[current_node]])
split_value = str(tree_.threshold[current_node])

if child_left != -1:
if child_left != search_node:
left_one = self.find_node(tree_, child_left, search_node, features)
else:
return("--> if "+str(split_feature)+" <= "+str(split_value))
else:
return ""

if child_right != -1:
if child_right != search_node:
right_one = self.find_node(tree_, child_right, search_node, features)
else:
return("--> if "+str(split_feature)+" > "+str(split_value))
else:
return ""

if len(left_one)>0:
return("--> if "+str(split_feature)+" <= "+str(split_value)+"\n"+left_one)
elif len(right_one)>0:
return("--> if "+str(split_feature)+" > "+str(split_value)+"\n"+right_one)
else:
return ""

'''that's the actual function that should be called to print out a certain rule - see
examples below'''
def get_rule(self, rule_no):

tree, node = self.calculate_node(rule_no)

print(self.find_node(self.trees[tree].tree_, 0, node, self.feature_names))



## Comparing different models with RuleFit¶

Since the main purpose of RuleFit is to get a model that is both as accurate and interpretable as possible, let's see how other highly interpretable models are doing on the dataset - although not as popular in the data science community, I decided to compare on Mean-Absolute-Percentage Error $$MAPE=\frac{1}{n}\sum^n_{i=1}\frac{|\hat{y}_i-y_i|}{y_i}$$ since it makes it easier to interpret the relation between different errors and feels more natural to most non-stats/data who will base their decisions on model outputs.

### Linear Regression¶

In [6]:
OLS_model = LinearRegression()
OLS_model.fit(df_trainX, df_trainY)
predictions_linear = OLS_model.predict(df_testX)

print np.mean(np.abs(predictions_linear-df_testY)/df_testY)

0.248897782587121


### Decision Tree¶

In [7]:
dtree_model = DecisionTreeRegressor()

dtree_model.fit(df_trainX, df_trainY)
predictions_dtree = dtree_model.predict(df_testX)

print np.mean(np.abs(predictions_dtree-df_testY)/df_testY)

0.19460480708949326


### RuleFit¶

At first we'll run RuleFit to find the most important rules out of all rules generated by the GradientBoostingRegressor. The Lasso is highly penalized to set most coefficients to 0 through high penalization.
Also notice, that the initial max_depth of GradientBoostingRegressor-Trees is 3, so the resulting single rules will be highly interpretable for a human user.

In [8]:
model_rulef1 = RuleFit(linear_model = Lasso(alpha = 5000, max_iter = 1000))
model_rulef1.fit(df_trainX, df_trainY)


Next, another model will be created using only the 20 best rules of those created above:

In [9]:
model_rulef1.fit_best_n_rules(df_trainX, df_trainY, 20)
predictions_best_n = model_rulef1.predict_best_n_rules(df_testX)

print np.mean(np.abs(predictions_best_n-df_testY)/df_testY)

0.17198168819622428


So, this marks an increase of more than 7 percentage points compared to plain OLS and almost 2.5 percentage points compared to the Decision Tree. Also notice that the Decision Tree is not bounded in depth, so interpreting the Tree above could be challenging as well.

## Inspecting the rules¶

Finally, want to know what rules are actually being used here - this can be done by using the functions that have been defined earlier in the RuleFit class:

In [10]:
print("Most relevant rule:")
model_rulef1.get_rule(model_rulef1.rank_rules()[1][0])
print("\nSecond most relevant rule:")
model_rulef1.get_rule(model_rulef1.rank_rules()[1][1])

Most relevant rule:

Second most relevant rule:
--> if long <= -121.98750305175781
--> if sqft_living15 <= 3685.0
--> if sqft_above > 1753.5


Using the other methods in the RuleFit class, the user could actually rebuild the whole Linear Regression model, finalizing it to a pure White Box model that is easily interpretable. I will leave that task to the curious reader and continue with another interesting quality of RuleFit:

## RuleFit as a competitive alternative to other Tree Ensembles¶

We don't always need our model to be highly interpretable but rather prefer the most accurate model. In the past, ensembles of Decision Trees have been successfully applied to many different tasks in data science. Altough they have lost some attention due to the current hype about Deep Learning, the popularity of Decision Tree models might soon see a renaissance.

Nevertheless, I was curious if RuleFit could actually compete with the likes of plain Gradient Boosting and Random Forest when interpretability is not necessary, so here are the results:

In [11]:
model_gb = GradientBoostingRegressor()
model_gb.fit(df_trainX, df_trainY)

prediction_gb = model_gb.predict(df_testX)

print np.mean(np.abs(prediction_gb - df_testY)/df_testY)

0.15224260656492367


### Random Forest¶

In [12]:
model_rf = RandomForestRegressor()
model_rf.fit(df_trainX, df_trainY)

prediction_rf = model_rf.predict(df_testX)

print np.mean(np.abs(prediction_rf - df_testY)/df_testY)

0.1465201931109741


### RuleFit¶

In [13]:
model_rulef2 = RuleFit(scale_rules = False)
model_rulef2.fit(df_trainX, df_trainY)
predictions_rulef2 = model_rulef1.predict(df_testX)

print np.mean(np.abs(predictions_rulef2-df_testY)/df_testY)

0.1481472199354308


Not exactly on par with Random Forest but keep in mind that the initial GradientBoosting model will only generate rules of max_depth = 3, so we could try out some hyperparameter tuning:

In [14]:
model_rulef3 = RuleFit(rule_generator = GradientBoostingRegressor(max_depth = 7, n_estimators = 40),
scale_rules = False)
model_rulef3.fit(df_trainX, df_trainY)
predictions_rulef3 = model_rulef3.predict(df_testX)

print np.mean(np.abs(predictions_rulef3-df_testY)/df_testY)

0.12974866235857196


Of course we would have to actually perform the hyper-tuning on all models to have a fair comparison, yet the results at least indicate that RuleFit can keep up with more popular Tree-Ensembles. On the other hand, we should keep the no-free-lunch theorem in mind, too, and there are probably countless of datasets out there, where simple OLS will outperform even the most dedicated BlackBox algorithm. However, having a large toolbox for different problems at hand will always help on the various problem one will face in data analysis.
Another non-probabilistic WhiteBox algorithm that sounds really promising to me is Fast Function Extraction and I will probably work out a combination with RuleFit for a future blog post.

If you want to read more about interpretable Machine Learning, I highly recommend the work of Christoph Molnar. He also got a fully working Python package for RuleFit on GitHub if you want to try it out.