Ensemble Cross-Validation for Random Forest Regression

This page was generated from docs/tutorials/random_forests.ipynb. Interactive online version: Binder badge.

Ensemble Cross-Validation for Random Forest Regression#

[ ]:
import sys
sys.path.append('..')

import os
n_cores = int(8)
os.environ["OMP_NUM_THREADS"] = f"{n_cores}"
os.environ["OPENBLAS_NUM_THREADS"] = f"{n_cores}"
os.environ["MKL_NUM_THREADS"] = f"{n_cores}"
os.environ["VECLIB_MAXIMUM_THREADS"] = f"{n_cores}"
os.environ["NUMEXPR_NUM_THREADS"] = f"{n_cores}"
os.environ["NUMBA_CACHE_DIR"]='/tmp/numba_cache'

import numpy as np
from sklearn_ensemble_cv import reset_random_seeds

reset_random_seeds(0)

We make up some fake data for illustration. Below, the response is of dimension 2.

[2]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=200,
                       n_informative=5, n_targets=2,
                       random_state=0, shuffle=False)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

Utility fitting function#

To facilitate the fitting and model selection of random forests, we define a function that takes in the data and returns the prediction values on test features. One can customize the function to return quantities they need.

[3]:
from joblib import Parallel, delayed
from tqdm import tqdm
from sklearn_ensemble_cv import reset_random_seeds, Ensemble, ECV
from sklearn.tree import DecisionTreeRegressor

def fit_rf(X, y, X_test=None, M=25, M_max=50,
    # fixed parameters for bagging regressor
    kwargs_ensemble={'verbose':1},
    # fixed parameters for decision tree
    kwargs_regr={'min_samples_split': 2},
    # grid search parameters
    grid_regr = {'max_depth': [6,7]},
    grid_ensemble = {
        'max_features':np.array([0.9,1.]),
        'max_samples':np.array([0.6,0.7])},
    ):

    # Make sure y is 2D
    y = y.reshape(-1, 1) if y.ndim == 1 else y

    # Run ECV
    res_ecv, info_ecv = ECV(
        X, y, DecisionTreeRegressor, grid_regr, grid_ensemble,
        kwargs_regr, kwargs_ensemble,
        M=M, M0=M, M_max=M_max, return_df=True
    )

    # Replace the in-sample best parameter for 'n_estimators' with extrapolated best parameter
    info_ecv['best_params_ensemble']['n_estimators'] = info_ecv['best_n_estimators_extrapolate']

    # Fit the ensemble with the best CV parameters
    regr = Ensemble(
        estimator=DecisionTreeRegressor(**info_ecv['best_params_regr']),
        **info_ecv['best_params_ensemble']).fit(X, y)

    # Predict
    if X_test is None:
        X_test = X
    return regr.predict(X_test).reshape(-1, y.shape[1])

Model selection#

A simple way to deal with multiple responses is through multitask learning, where we use all the responses for node splitting of decision trees. This is implemented in the sklearn package.

[4]:
y_test_pred = fit_rf(X_train, y_train, X_test)

# Print the mean squared error
print(np.mean((y_test_pred - y_test)**2))
5722.970560206186

Alternatively, we can also fit a separate random forest for each response, as implemented below.

[5]:
def fit_rf_ind(X, Y, *args, **kwargs):
    Y_hat = Parallel(n_jobs=-1)(delayed(fit_rf)(X, Y[:,j], *args, **kwargs)
        for j in tqdm(range(Y.shape[1])))
    Y_pred = np.concatenate(Y_hat, axis=-1)
    return Y_pred
[6]:
y_test_pred = fit_rf_ind(X_train, y_train, X_test)

# Print the mean squared error
print(np.mean((y_test_pred - y_test)**2))
100%|██████████| 2/2 [00:00<00:00,  9.43it/s]
5489.074680557097

We see that the second approach gives better performance in this case. However, the first approach is more computationally efficient and may be preferred in practice, especially when the number of responses is large.