Question
Asked 4 July 2021
Why am I getting worse performance after GridSearchCV?
I first build a baseline model (default parameters) and obtain its MAE (see the "rfr base" image).
# BASELINE MODEL
rfr_pipe.fit(train_x, train_y)
base_rfr_pred = rfr_pipe.predict(test_x)
base_rfr_mae = mean_absolute_error(test_y, base_rfr_pred)
MAE = 2.188
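For reference, rfr_pipe is the preprocessing + RandomForestRegressor pipeline from the attached image; the step name 'rfr_model' can be inferred from the parameter grid below. A minimal sketch of what such a pipeline might look like, with the preprocessor and imports filled in as assumptions rather than the exact code from the image:
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
# Placeholder preprocessing: scale numeric columns, one-hot encode the rest.
preproc = ColumnTransformer(transformers = [
    ('num', StandardScaler(), selector(dtype_include = 'number')),
    ('cat', OneHotEncoder(handle_unknown = 'ignore'), selector(dtype_exclude = 'number'))])
# Step names must match the 'rfr_model__...' keys used in the grid below.
rfr_pipe = Pipeline(steps = [('rfr_preproc', preproc),
                             ('rfr_model', RandomForestRegressor(random_state = 69))])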
Then I run GridSearchCV to find the best parameters and the corresponding cross-validated MAE (see the "rfr grid" image).
# RFR GRIDSEARCHCV
rfr_param = {'rfr_model__n_estimators' : [10, 100, 500, 1000],
             'rfr_model__max_depth' : [None, 5, 10, 15, 20],
             'rfr_model__min_samples_leaf' : [10, 100, 500, 1000],
             'rfr_model__max_features' : ['auto', 'sqrt', 'log2']}
rfr_grid = GridSearchCV(estimator = rfr_pipe, param_grid = rfr_param, n_jobs = -1,
                        cv = 5, scoring = 'neg_mean_absolute_error')
rfr_grid.fit(train_x, train_y)
print('best parameters are:-', rfr_grid.best_params_)
print('best mae is:- ', -1 * rfr_grid.best_score_)
MAE = 2.697
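Worth noting: best_score_ is the mean MAE over the five cross-validation folds of the training data, not a test-set score. A short sketch (assuming the default refit=True) of how to score the refit best estimator on the same held-out test set as the baseline:
# rfr_grid.predict uses the best estimator refit on all of train_x,
# so this MAE is computed on the same test set as the baseline figure above.
grid_rfr_pred = rfr_grid.predict(test_x)
grid_rfr_mae = mean_absolute_error(test_y, grid_rfr_pred)
print('test MAE of refit best estimator:', grid_rfr_mae)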
Then I refit a model with the "best parameters" obtained, expecting an improved MAE, but the result is always worse than the baseline MAE (see the "opt rfr" image).
# OPTIMIZED RFR MODEL
opt_rfr = RandomForestRegressor(random_state = 69, criterion = 'mae', max_depth = None,
                                max_features = 'auto', min_samples_leaf = 10, n_estimators = 100)
opt_rfr_pipe = Pipeline(steps = [('rfr_preproc', preproc), ('opt_rfr_model', opt_rfr)])
opt_rfr_pipe.fit(train_x, train_y)
opt_rfr_pred = opt_rfr_pipe.predict(test_x)
opt_rfr_mae = mean_absolute_error(test_y, opt_rfr_pred)
MAE = 2.496
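For a like-for-like comparison with the cross-validated 2.697, the same 5-fold CV can also be run on the baseline pipeline; a sketch, assuming the objects defined above are still in scope:
from sklearn.model_selection import cross_val_score
# Cross-validated MAE of the default pipeline on the training data,
# comparable to rfr_grid.best_score_ rather than to the test-set numbers.
base_cv_mae = -cross_val_score(rfr_pipe, train_x, train_y, cv = 5,
                               scoring = 'neg_mean_absolute_error').mean()
print('baseline 5-fold CV MAE:', base_cv_mae)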
This happens not just once but every time, and with most of the models I have tried (linear regression, random forest regressor)! I suspect there is something fundamentally wrong with my code, otherwise this problem would not arise every time. Any idea what might be causing it?
Similar questions and discussions
How to Plot Random Forest Classifier results?
Rabia Bibi
I have applied a Random Forest classifier to differentiate three species from the same grain dataset. The classification accuracy is about 90%. I need to show this result on a graph. How can I do that?
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train.values.ravel())
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
# classification report
print(classification_report(y_test, y_pred))
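One common way to show this graphically is a confusion matrix plus a feature-importance bar chart; a minimal sketch, assuming scikit-learn >= 1.0 and matplotlib (these plotting choices are illustrative, not the only option):
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
# Per-species confusion matrix built from the predictions above.
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.title('Random Forest confusion matrix')
plt.show()
# Feature importances of the fitted forest as a bar chart.
importances = clf.feature_importances_
plt.bar(range(len(importances)), importances)
plt.xlabel('feature index')
plt.ylabel('importance')
plt.title('Random Forest feature importances')
plt.show()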
How to compare deep learning models?
Sarah Almaghrabi
I need to evaluate different deep learning models for a forecasting problem.
First: how do I select the best configuration for a specific model?
Second: how do I compare the different models for the task at hand?
For the first: using training and validation data, I trained the model with checkpoints (saving the weights only when val_loss improves). Then I used the latest checkpoint to build a new model for prediction (sketched below).
Is this good practice?
Is there anything else I should consider?
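A minimal sketch of that checkpoint workflow, assuming TensorFlow/Keras; the toy data, the tiny model and the file name 'best.weights.h5' are placeholders, not part of the original setup:
import numpy as np
import tensorflow as tf
# Toy data and model, only to keep the sketch self-contained.
x_train, y_train = np.random.rand(200, 10), np.random.rand(200, 1)
x_val, y_val = np.random.rand(50, 10), np.random.rand(50, 1)
model = tf.keras.Sequential([tf.keras.Input(shape=(10,)),
                             tf.keras.layers.Dense(32, activation='relu'),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')
# Save the weights only when val_loss improves.
ckpt = tf.keras.callbacks.ModelCheckpoint('best.weights.h5', monitor='val_loss',
                                          save_best_only=True, save_weights_only=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=20, callbacks=[ckpt], verbose=0)
# Restore the best (not merely the last) weights before predicting.
model.load_weights('best.weights.h5')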
For the second: I used the same grid search for all the models. However, I faced two main problems:
- the randomness inherent in each model
- the large number of hyperparameters I need to track
I read about the randomness problem and followed the seed-setting solution suggested at https://opendatascience.com/properly-setting-the-random-seed-in-ml-experiments-not-as-simple-as-you-might-imagine/. But I am afraid this affects the weight initialization and hence the model performance!
I am still not sure how to mitigate it. Is there any way, other than logging and plotting, to select the best candidate?
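On the seed question, one hedged option is to fix every relevant random generator at the start of each run, so the weight initialization becomes reproducible rather than better or worse; set_seed below is an illustrative helper, assuming TensorFlow/Keras:
import os
import random
import numpy as np
import tensorflow as tf
def set_seed(seed=42):
    # Fix the Python, NumPy and TensorFlow generators so repeated runs
    # start from the same weight initialization and data shuffling.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
set_seed(42)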
I am self-learning, so I am sorry if these questions are silly or too basic.
Thanks for any help.