RuleFit on real-world data



This post will be longer than the last one, so here are the major take-aways if don't want to read everything:

  • RuleFit outperforms Linear Regression and Decision Trees
  • RuleFit can keep up with other Tree-Ensembles like Random Forest or Gradient Boosting (at the price of interpretability)
  • Nevertheless, the comparison took place on only one dataset and a plain train-test split so there might be many other situations where RuleFits will clearly underperform


In my last post I gave a short introduction to the RuleFit Algorithm. Now I want to construct an actual use case for RuleFit and how its benefits can be fully exploited. The dataset will be the same as last time (House Sales in King County).


The purpose of this post is to show a possible implementation of RuleFit in Python and compare its performance to other models. I am not aiming for the lowest test-error possible and thus there won't be any feature engineering except for dropping some unnecessary variables. The target variable is, again, 'price'.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

import matplotlib.pyplot as plt

from copy import deepcopy
In [2]:
#loading and transforming data
df = pd.read_csv("kc_house_data.csv")

df.drop(["id", "date", "zipcode"], 1, inplace = True)

X = df.drop("price",1)
y = df["price"]

price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 0 0 3 7 1180 0 1955 0 47.5112 -122.257 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 0 0 3 7 2170 400 1951 1991 47.7210 -122.319 1690 7639
2 180000.0 2 1.00 770 10000 1.0 0 0 3 6 770 0 1933 0 47.7379 -122.233 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 0 0 5 7 1050 910 1965 0 47.5208 -122.393 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 0 0 3 8 1680 0 1987 0 47.6168 -122.045 1800 7503
In [3]:
In [4]:
#simple train-test split

df_trainX, df_testX, df_trainY, df_testY = train_test_split(X, y, test_size=0.5, random_state=42)

The RuleFit class

Below is my implementation of RuleFit. Except for the scaling part (lines 25-31) that I did not mention in the last post and the GradientBoostingRegressor for actual rule generation, the concept stays the same. Thus, you don't have to go through the whole block to understand what is happening under the hood.

If you have any questions or find any errors in the code (which I hope you don't :-)), feel free to write me a message or leave a comment.

In [5]:
class RuleFit:
    I wanted to keep this class as flexible as possible and so both the rule generating algorithm
    and the linear model are replaceable by other sklearn-like models. The standard models are
    Gradient Boosting and Ridge Regression - after some testing I found Ridge Regression to be slightly
    better than Lasso in terms of accuracy. However, the Lasso Regression will make it much easier to
    obtain interpretable results since most irrelavant rules will have close-to-zero relevance.
    def __init__(self,
                 rule_generator = GradientBoostingRegressor(),
                 linear_model = Ridge(alpha = 100, max_iter = 1000), 
                 scale_rules = True):
        This list will be filled with the single Decision Trees created by the rule generator
        for easier access of the individual Trees
        self.trees = []
        '''The Friedman paper on RuleFit suggests scaling the rules for better results (item 3.2) - 
        while I got better results applying the original algorithm with Lasso and scaling,
        leaving out the scaling works better with Ridge Regression'''
        self.scale_rules = scale_rules
        if scale_rules:
            self.rule_support = []
            self.scales = []
        self.rule_generator = rule_generator
        self.linear_model = linear_model
        '''This list will make it easier to track the amount of rules/nodes in each tree''' 
        self.n_nodes_tree = []
        '''Another Linear model - this will be used when only the most predictive rules
        are being used out of the thousands of other rules generated by the rule generator'''
        self.linear_model_best_rules = None
    def fit(self, X, y):
        '''In case of re-fitting we want the lists to be refreshed'''
        self.trees = []
        self.n_nodes_tree = []
        if self.scale_rules:
            self.rule_support = []
            self.scales = []

        '''For tree/rule visualizations we need the feature names of our dataset'''
        self.feature_names = X.columns.tolist()
        '''each tree will make unique predictions, this list will keep all predicted
        predictionFrame = []
        '''now we can actually generate the rules'''
        rules_generation =,y)
        for tree in range(len(rules_generation.estimators_)):
            '''for some reason sklearn GradientBoosting and RandomForest will save the trees
            in a different structure (list of trees VS. list of list of trees) this part will
            account for that and work for both structures'''
            cur_tree = rules_generation.estimators_[tree]
                cur_tree = cur_tree[0]
            '''now we save each tree, tree decision path(the rules we'll use) and the amount of
            rules generated(for later rule evaluation)'''


            '''scaling the rules according to paper'''
            if self.scale_rules:
                self.scales.append(np.sqrt(self.rule_support[tree] * (1-self.rule_support[tree])))
        '''the last step of fitting is combining the 0/1 binary matrices with the original data
        and regress the y against the combined features''' 
        X_comb = pd.concat([X.reset_index(drop=True),
            pd.concat(predictionFrame,1).reset_index(drop=True)], 1)
        self.n_cols_data = X.shape[1] - 1

    def predict(self, X):
        '''pretty similar to the fit-procedure: predict if observation matches each rule into a 
        binary matrix, combine binary matrix with original features and use linear model to predict'''
        predictionFrame = []
        for tree in range(len(self.trees)):
            if self.scale_rules:
        X_comb = pd.concat([X.reset_index(drop=True),pd.concat(predictionFrame,1).fillna(0)],1)
        result = self.linear_model.predict(X_comb)
        return result
    '''some methods to easily access the coefficients from the linear model'''
    def get_all_coefficients(self):
        return self.linear_model.coef_
    '''returns only coefficients of individual rules'''
    def get_tree_coefficients(self):
        return self.linear_model.coef_[(self.n_cols_data+1):]
    '''returns only coefficients of the original features'''
    def get_linear_coefficients(self):
        return self.linear_model.coef_[:self.n_cols_data]
    '''if scale_rules is set to True, the rule scaling will be applied - see Friedman paper
    for details (item 6 in their paper). Otherwise the rules will be ranked via their absolute
    coefficients Ranking from best to worst'''
    def rank_rules(self):
        coeffs_tree = self.get_tree_coefficients()

        relative_importance = np.abs(coeffs_tree) 
        if self.scale_rules:
            relative_importance *= pd.concat(self.scales).reset_index(drop=True)
        '''the first output element are the actual importances of each rule (scaled/unscaled),
        while the second element is the ranking from most relevant to least relevant rule'''
        return relative_importance, np.argsort(relative_importance)[::-1]

    '''After applying the above fit-predict methods, we will have some 1000s of rules that are
    being used in our model. This will still be hard to interpret for a human user who wants
    to make sense of the algorithm's output. Once we got an initial ensemble of rules to find
    the important ones, we can fit another linear model using only a subset of the n best
    rules (n_rules). That way the number of parameters in the model can be reduced to an amount that
    is much easier to interpret and evaluate. The two methods below will make that possible'''
    '''Fit another linear model using only the 'n_rules' most important rules, obtained via the
    rank_rules method. This should be used with Lasso and scaled rules like in the original paper'''
    def fit_best_n_rules(self, X,y,n_rules = 20):
        if len(self.trees)>0:
            self.best_rules = self.rank_rules()[1][:n_rules]

            predictionFrame = []
            for tree in range(len(self.trees)):

                if self.scale_rules:

            X_comb = pd.concat([X.reset_index(drop=True),
            '''the amount of parameters in the final model should be much lower than
            in the penalized one, so Linear Regression can be applied here'''
            self.linear_model_best_rules = LinearRegression()
  , y)
            AssertionError("No model was fit yet!")
    def predict_best_n_rules(self, X):
        if self.linear_model_best_rules is not None:
            predictionFrame = []

            for tree in range(len(self.trees)):

                if self.scale_rules:

            X_comb = pd.concat([X.reset_index(drop=True),

            result = self.linear_model_best_rules.predict(X_comb)
                "Model has not been trained yet"
        return result
    '''Finally, we want to inspect the rules that our model used - the following two functions
    are actually just helper functions that shouldn't be called on their own but only through
    the get_rule method at the end. You don't have to fully understand what's happening in here
    but feel free to figure things out or write message me if you want more details'''
    '''this method will apply some logic to match the correct tree and node number to a user
    given rule-number (which ranges from 0 to the total amount of rules - 1)'''
    def calculate_node(self,node_num):
        select = node_num
        tree = np.where(np.cumsum(self.n_nodes_tree)>=select)[0][0]

        node_sub = np.maximum(0,np.cumsum(self.n_nodes_tree)[tree]-np.cumsum(self.n_nodes_tree)[0]) 

        if node_sub > 0:
            node_result = select - node_sub -1
            node_result = select - node_sub

        node_result += 1
        return tree, node_result
    '''once we got the correct tree and node-number, this method will print out the correct
    decision path up to that node. It's recursively searching through the tree structure until
    it finds the correct node number'''
    def find_node(self,tree_, current_node, search_node, features):
        child_left = tree_.children_left[current_node]
        child_right = tree_.children_right[current_node]

        split_feature = str(features[tree_.feature[current_node]])
        split_value = str(tree_.threshold[current_node])

        if child_left != -1:
            if child_left != search_node:
                left_one = self.find_node(tree_, child_left, search_node, features)
                return("--> if "+str(split_feature)+" <= "+str(split_value))
            return ""

        if child_right != -1:
            if child_right != search_node:
                right_one = self.find_node(tree_, child_right, search_node, features)
                return("--> if "+str(split_feature)+" > "+str(split_value))
            return ""

        if len(left_one)>0:
            return("--> if "+str(split_feature)+" <= "+str(split_value)+"\n"+left_one)
        elif len(right_one)>0:
            return("--> if "+str(split_feature)+" > "+str(split_value)+"\n"+right_one)
            return ""
    '''that's the actual function that should be called to print out a certain rule - see
    examples below'''
    def get_rule(self, rule_no):
        tree, node = self.calculate_node(rule_no)
        print(self.find_node(self.trees[tree].tree_, 0, node, self.feature_names))

Comparing different models with RuleFit

Since the main purpose of RuleFit is to get a model that is both as accurate and interpretable as possible, let's see how other highly interpretable models are doing on the dataset - although not as popular in the data science community, I decided to compare on Mean-Absolute-Percentage Error $$MAPE=\frac{1}{n}\sum^n_{i=1}\frac{|\hat{y}_i-y_i|}{y_i}$$ since it makes it easier to interpret the relation between different errors and feels more natural to most non-stats/data who will base their decisions on model outputs.

Linear Regression

In [6]:
OLS_model = LinearRegression(), df_trainY)
predictions_linear = OLS_model.predict(df_testX)

print np.mean(np.abs(predictions_linear-df_testY)/df_testY)

Decision Tree

In [7]:
dtree_model = DecisionTreeRegressor(), df_trainY)
predictions_dtree = dtree_model.predict(df_testX)

print np.mean(np.abs(predictions_dtree-df_testY)/df_testY)


At first we'll run RuleFit to find the most important rules out of all rules generated by the GradientBoostingRegressor. The Lasso is highly penalized to set most coefficients to 0 through high penalization.
Also notice, that the initial max_depth of GradientBoostingRegressor-Trees is 3, so the resulting single rules will be highly interpretable for a human user.

In [8]:
model_rulef1 = RuleFit(linear_model = Lasso(alpha = 5000, max_iter = 1000)), df_trainY)

Next, another model will be created using only the 20 best rules of those created above:

In [9]:
model_rulef1.fit_best_n_rules(df_trainX, df_trainY, 20)
predictions_best_n = model_rulef1.predict_best_n_rules(df_testX)

print np.mean(np.abs(predictions_best_n-df_testY)/df_testY)

So, this marks an increase of more than 7 percentage points compared to plain OLS and almost 2.5 percentage points compared to the Decision Tree. Also notice that the Decision Tree is not bounded in depth, so interpreting the Tree above could be challenging as well.

Inspecting the rules

Finally, want to know what rules are actually being used here - this can be done by using the functions that have been defined earlier in the RuleFit class:

In [10]:
print("Most relevant rule:")
print("\nSecond most relevant rule:")
Most relevant rule:
--> if grade <= 8.5

Second most relevant rule:
--> if long <= -121.98750305175781
--> if sqft_living15 <= 3685.0
--> if sqft_above > 1753.5

Using the other methods in the RuleFit class, the user could actually rebuild the whole Linear Regression model, finalizing it to a pure White Box model that is easily interpretable. I will leave that task to the curious reader and continue with another interesting quality of RuleFit:

RuleFit as a competitive alternative to other Tree Ensembles

We don't always need our model to be highly interpretable but rather prefer the most accurate model. In the past, ensembles of Decision Trees have been successfully applied to many different tasks in data science. Altough they have lost some attention due to the current hype about Deep Learning, the popularity of Decision Tree models might soon see a renaissance.

Nevertheless, I was curious if RuleFit could actually compete with the likes of plain Gradient Boosting and Random Forest when interpretability is not necessary, so here are the results:

Gradient Boosting

In [11]:
model_gb = GradientBoostingRegressor(), df_trainY)

prediction_gb = model_gb.predict(df_testX)

print np.mean(np.abs(prediction_gb - df_testY)/df_testY)

Random Forest

In [12]:
model_rf = RandomForestRegressor(), df_trainY)

prediction_rf = model_rf.predict(df_testX)

print np.mean(np.abs(prediction_rf - df_testY)/df_testY)


In [13]:
model_rulef2 = RuleFit(scale_rules = False), df_trainY)
predictions_rulef2 = model_rulef1.predict(df_testX)

print np.mean(np.abs(predictions_rulef2-df_testY)/df_testY)

Not exactly on par with Random Forest but keep in mind that the initial GradientBoosting model will only generate rules of max_depth = 3, so we could try out some hyperparameter tuning:

In [14]:
model_rulef3 = RuleFit(rule_generator = GradientBoostingRegressor(max_depth = 7, n_estimators = 40),
                      scale_rules = False), df_trainY)
predictions_rulef3 = model_rulef3.predict(df_testX)

print np.mean(np.abs(predictions_rulef3-df_testY)/df_testY)

Of course we would have to actually perform the hyper-tuning on all models to have a fair comparison, yet the results at least indicate that RuleFit can keep up with more popular Tree-Ensembles. On the other hand, we should keep the no-free-lunch theorem in mind, too, and there are probably countless of datasets out there, where simple OLS will outperform even the most dedicated BlackBox algorithm. However, having a large toolbox for different problems at hand will always help on the various problem one will face in data analysis.
Another non-probabilistic WhiteBox algorithm that sounds really promising to me is Fast Function Extraction and I will probably work out a combination with RuleFit for a future blog post.

If you want to read more about interpretable Machine Learning, I highly recommend the work of Christoph Molnar. He also got a fully working Python package for RuleFit on GitHub if you want to try it out.


  1. sai

    This is a nice post in an interesting line of content.Thanks for sharing this article, great way of bring this topic to discussion.

Leave Comment

Your email address will not be published. Required fields are marked *