Model Fitness
When you make a model with a decision tree regressor you're specifying the depth that you want to go in the tree.
The layer of depth will reduce the size of each leaf node in the tree.
You can get to the point that you're just fitting the data more closely to what you already have in each leaf.
If you make so many branches you'll eventually get to the point that each leaf has just 1 value and then it fits the data perfectly, but isnt going to work for data that's outside
This causes either underfitting or overfitting
We measure underfitting and overfitting with the mean absolute error.
The max_leaf_nodes is a good way to measure overfitting in a model.
from sklearn.metrics import mean_absolute_error from sklearn.tree import DecisionTreeRegressor def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y): model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes,random_state=0) model.fit(train_X, train_y) preds_val = model.predict(val_X) mae = mean_absolute_error(val_y, preds_val) return(mae)
for max_leaf_nodes in [5, 50, 500, 5000]: my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y) print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))
Example of how to get the minimum Mean absolute error
leaflist = [1,2,3,4] scores = {leaf_size: get_mae(leaf_size, train_X, val_X train_y, val_y) for leaf_size in leaflist} best_tree_size = min(scores, key=scores.get)