Ensemble Cross-Validation for Random Forest Regression#
[ ]:
import sys
sys.path.append('..')
import os
n_cores = int(8)
os.environ["OMP_NUM_THREADS"] = f"{n_cores}"
os.environ["OPENBLAS_NUM_THREADS"] = f"{n_cores}"
os.environ["MKL_NUM_THREADS"] = f"{n_cores}"
os.environ["VECLIB_MAXIMUM_THREADS"] = f"{n_cores}"
os.environ["NUMEXPR_NUM_THREADS"] = f"{n_cores}"
os.environ["NUMBA_CACHE_DIR"]='/tmp/numba_cache'
import numpy as np
from sklearn_ensemble_cv import reset_random_seeds
reset_random_seeds(0)
We make up some fake data for illustration. Below, the response is of dimension 2.
[2]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=300, n_features=200,
n_informative=5, n_targets=2,
random_state=0, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
Utility fitting function#
To facilitate the fitting and model selection of random forests, we define a function that takes in the data and returns the prediction values on test features. One can customize the function to return quantities they need.
[3]:
from joblib import Parallel, delayed
from tqdm import tqdm
from sklearn_ensemble_cv import reset_random_seeds, Ensemble, ECV
from sklearn.tree import DecisionTreeRegressor
def fit_rf(X, y, X_test=None, M=25, M_max=50,
# fixed parameters for bagging regressor
kwargs_ensemble={'verbose':1},
# fixed parameters for decision tree
kwargs_regr={'min_samples_split': 2},
# grid search parameters
grid_regr = {'max_depth': [6,7]},
grid_ensemble = {
'max_features':np.array([0.9,1.]),
'max_samples':np.array([0.6,0.7])},
):
# Make sure y is 2D
y = y.reshape(-1, 1) if y.ndim == 1 else y
# Run ECV
res_ecv, info_ecv = ECV(
X, y, DecisionTreeRegressor, grid_regr, grid_ensemble,
kwargs_regr, kwargs_ensemble,
M=M, M0=M, M_max=M_max, return_df=True
)
# Replace the in-sample best parameter for 'n_estimators' with extrapolated best parameter
info_ecv['best_params_ensemble']['n_estimators'] = info_ecv['best_n_estimators_extrapolate']
# Fit the ensemble with the best CV parameters
regr = Ensemble(
estimator=DecisionTreeRegressor(**info_ecv['best_params_regr']),
**info_ecv['best_params_ensemble']).fit(X, y)
# Predict
if X_test is None:
X_test = X
return regr.predict(X_test).reshape(-1, y.shape[1])
Model selection#
A simple way to deal with multiple responses is through multitask learning, where we use all the responses for node splitting of decision trees. This is implemented in the sklearn package.
[4]:
y_test_pred = fit_rf(X_train, y_train, X_test)
# Print the mean squared error
print(np.mean((y_test_pred - y_test)**2))
5722.970560206186
Alternatively, we can also fit a separate random forest for each response, as implemented below.
[5]:
def fit_rf_ind(X, Y, *args, **kwargs):
Y_hat = Parallel(n_jobs=-1)(delayed(fit_rf)(X, Y[:,j], *args, **kwargs)
for j in tqdm(range(Y.shape[1])))
Y_pred = np.concatenate(Y_hat, axis=-1)
return Y_pred
[6]:
y_test_pred = fit_rf_ind(X_train, y_train, X_test)
# Print the mean squared error
print(np.mean((y_test_pred - y_test)**2))
100%|██████████| 2/2 [00:00<00:00, 9.43it/s]
5489.074680557097
We see that the second approach gives better performance in this case. However, the first approach is more computationally efficient and may be preferred in practice, especially when the number of responses is large.