Tuning an XGBoost Machine Learning Model to Predict Eel Presence

“Using a dataset with a variety of physical and atmospheric habitat variables, train a boosted tree classification model to predict the presence or absence of the short-finned eel.”

Author

Affiliation: Master of Environmental Data Science Program at UCSB

Published: February 1, 2023

Case Study: Eel Species Distribution Modeling

This blog post loosely follows the boosted regression tree exercise outlined in the academic paper “A working guide to boosted regression trees” by Elith et al. We will use the same dataset that they did on the distribution of the short-finned eel (Anguilla australis). We will be using the xgboost library along with tidymodels, caret, parsnip, vip, and more.

Citation:

Elith, J., Leathwick, J. R., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802–813. https://doi.org/10.1111/j.1365-2656.2008.01390.x

library(tidyverse) 
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(caret)
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.0     ✔ rsample      1.1.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.3     ✔ workflows    1.0.0
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.1     ✔ yardstick    1.1.0
✔ recipes      1.0.1     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard()        masks purrr::discard()
✖ dplyr::filter()          masks stats::filter()
✖ recipes::fixed()         masks stringr::fixed()
✖ dplyr::lag()             masks stats::lag()
✖ caret::lift()            masks purrr::lift()
✖ yardstick::precision()   masks caret::precision()
✖ yardstick::recall()      masks caret::recall()
✖ yardstick::sensitivity() masks caret::sensitivity()
✖ yardstick::spec()        masks readr::spec()
✖ yardstick::specificity() masks caret::specificity()
✖ recipes::step()          masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(gbm)
Loaded gbm 2.1.8.1
library(xgboost)

Attaching package: 'xgboost'

The following object is masked from 'package:dplyr':

    slice
library(ggpubr)
library(tictoc)
library(vip)

Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi

Read in the data

eel_dat <- read_csv("eel.model.data.csv") |> 
  mutate(Angaus = as.factor(Angaus))
Rows: 1000 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Method
dbl (13): Site, Angaus, SegSumT, SegTSeas, SegLowFlow, DSDist, DSMaxSlope, U...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Split and Resample

Split the data into a training and a test set, stratified by the outcome variable Angaus. Then use 10-fold cross-validation to resample the training set, again stratified by Angaus.

# create initial split
eel_split <- initial_split(data = eel_dat, 
                           prop = 0.7, 
                           strata = Angaus)

# create training and testing data from the split 
eel_train <- training(eel_split) 
eel_test <- testing(eel_split)

# resample the training set 
eel_folds <- vfold_cv(data = eel_train, 
                      v = 10,
                      strata = Angaus)
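
One thing to flag (not part of the original analysis): no random seed is set before the split or the resampling, so the folds, and therefore the tuned values reported throughout this post, will change on every re-render. A minimal sketch of making this reproducible:

# not part of the original post: fixing the RNG state before initial_split()
# and vfold_cv() above makes everything downstream reproducible across renders
set.seed(123)  # the value 123 is arbitrary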

Preprocess

Create a recipe to prepare the data for the XGBoost model. We are interested in predicting the binary outcome variable Angaus, which indicates presence or absence of the eel species Anguilla australis.

eel_recipe <- recipe(Angaus ~ ., data = eel_train) |> 
  step_integer(all_predictors(), zero_based = TRUE) 
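
As a quick sanity check (not in the original post), we can prep the recipe and look at the processed training data to confirm all predictors have been converted to zero-based integers:

# prep the recipe and bake it on the retained training data
eel_recipe |> 
  prep() |> 
  bake(new_data = NULL) |> 
  head()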

Tuning XGBoost

We are going to tune in three stages to get the ideal hyperparameters for our model. First we tune the learning rate and get an estimate of the best value. Then we fix that learning rate and tune the three tree parameters: tree_depth, loss_reduction, and min_n. Lastly, we fix those as well and tune the stochastic parameters (mtry and sample_size).

Tune Learning Rate

  1. Create a model specification using {xgboost}.
  • Only set the learning rate parameter to tune()
# create model specification: (tune learning rate first)
eel_boost_model <- boost_tree(
  mode = "classification", 
  trees = 3000, 
  engine = "xgboost", 
  tree_depth = NULL,
  loss_reduction = NULL, 
  learn_rate = tune(),
  min_n = NULL,
  mtry = NULL
  )
  2. Set up a grid to tune the model by using a range of learning rate parameter values.
learn_rate_grid <- expand.grid(learn_rate = seq(0.0001, 0.3, length.out = 30))

Computational efficiency becomes a factor as models get more complex and data gets larger. Because of this, we will record the time each tuning run takes with Sys.time().
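
One caveat about the timings below: paste() drops the units from the difftime object, so the printed numbers are unit-less (here they appear to be minutes). The {tictoc} package loaded at the top is one way to get explicit units; a minimal sketch:

# tic()/toc() report elapsed wall-clock time with the units attached
tic("learning rate tuning")
# ... the tune_grid() call would go here ...
toc()  # prints e.g. "learning rate tuning: 350.2 sec elapsed"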

Tune the learning rate parameter:

# tune the learning rate:

# define a workflow for tuning the learning rate: 
eel_learn_wf <- workflow() |> 
  add_model(eel_boost_model) |> 
  add_recipe(eel_recipe)

# fit to the tuning grid 
start_time1 <- Sys.time()
tune_learn <- eel_learn_wf |> 
  tune_grid(
    eel_folds, 
    grid = learn_rate_grid
  )
end_time1 <- Sys.time()
print(paste("time elapsed:", (end_time1 - start_time1)))
[1] "time elapsed: 5.83796453078588"
  3. Show the performance of the best models and the estimates for the learning rate parameter values associated with each.
show_best(tune_learn, metric = "roc_auc")  # using metric roc_auc 
# A tibble: 5 × 7
  learn_rate .metric .estimator  mean     n std_err .config              
       <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     0.0725 roc_auc binary     0.821    10  0.0214 Preprocessor1_Model08
2     0.0518 roc_auc binary     0.820    10  0.0205 Preprocessor1_Model06
3     0.104  roc_auc binary     0.820    10  0.0241 Preprocessor1_Model11
4     0.269  roc_auc binary     0.819    10  0.0223 Preprocessor1_Model27
5     0.207  roc_auc binary     0.819    10  0.0241 Preprocessor1_Model21
# --- carry learn_rate = 0.04146552 forward (the 5th value in the grid); a relatively
# --- small rate, in the spirit of the paper's preference for slow learning rates
show_best(tune_learn, metric = "accuracy") 
# A tibble: 5 × 7
  learn_rate .metric  .estimator  mean     n std_err .config              
       <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
1     0.259  accuracy binary     0.823    10 0.00873 Preprocessor1_Model26
2     0.124  accuracy binary     0.821    10 0.0112  Preprocessor1_Model13
3     0.207  accuracy binary     0.821    10 0.0110  Preprocessor1_Model21
4     0.176  accuracy binary     0.818    10 0.0118  Preprocessor1_Model18
5     0.0104 accuracy binary     0.818    10 0.0123  Preprocessor1_Model02
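
A quick visual check that is not in the original run: tune results have an autoplot() method that shows how each metric varies across the learning-rate grid.

# plot roc_auc and accuracy against the candidate learning rates
autoplot(tune_learn)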

Tune Tree Parameters

  1. Create a new specification where we set the learning rate (which we already optimized) and tune the tree parameters.
# create model specification for tuning tree depth with new (optimized) learning rate
eel_boost_model2 <- boost_tree(
  mode = "classification", 
  trees = 3000, 
  engine = "xgboost", 
  tree_depth = tune(),
  loss_reduction = tune(), 
  learn_rate = 0.04146552,
  min_n = tune(),
  mtry = NULL
  )

# define workflow for tuning tree parameters: 
tree_tune_wf <- workflow() |> 
  add_model(eel_boost_model2) |> 
  add_recipe(eel_recipe)
  2. Set up a tuning grid. This time use grid_max_entropy() to get a representative sampling of the parameter space.
# we are now tuning the tree parameters: tree_depth(), min_n(), and loss_reduction()

# use the tidymodels dials package to specify which parameters we are trying to tune... 
# --- grid_max_entropy() needs an object like this to work properly

tree_depth_param <- dials::parameters(
  tree_depth(), 
  min_n(), 
  loss_reduction()
)
tree_tune_grid <- grid_max_entropy(tree_depth_param, size = 60)
  3. Show the performance of the best models and the estimates for the tree parameter values associated with each.
# fit to the tuning grid 
start_time1 <- Sys.time()
tune_tree <- tree_tune_wf |> 
  tune_grid(
    eel_folds, 
    grid = tree_tune_grid
  )
end_time1 <- Sys.time()
print(paste("time elapsed:", (end_time1 - start_time1)))
[1] "time elapsed: 11.8651802341143"
show_best(tune_tree, metric = "roc_auc")
# A tibble: 5 × 9
  min_n tree_depth loss_reduction .metric .estimator  mean     n std_err .config
  <int>      <int>          <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
1     3         14       4.98e- 9 roc_auc binary     0.831    10  0.0209 Prepro…
2    15         11       2.75e+ 0 roc_auc binary     0.830    10  0.0187 Prepro…
3     2         15       2.35e- 6 roc_auc binary     0.829    10  0.0202 Prepro…
4     4          7       1.63e-10 roc_auc binary     0.828    10  0.0238 Prepro…
5     3          9       1.48e- 5 roc_auc binary     0.827    10  0.0213 Prepro…
# values carried into the next specification (no seed was set, so these appear
# to come from an earlier render and do not match the table shown above):
# best value for min_n: 17
# best value for tree_depth: 11
# best value for loss_reduction: 0.367
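
Rather than transcribing the winning values by hand, they can also be pulled out programmatically; a small sketch (not part of the original post):

# extract the best tree parameters directly from the tuning results
best_tree_params <- select_best(tune_tree, metric = "roc_auc")
best_tree_params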

Tune Stochastic Parameters

  1. We will create a new specification where we set the learning rate and tree parameters (which we already optimized) and tune the stochastic parameters (mtry and sample_size).
# model specification with fixed learning rate and tree parameters to tune stochastic params
eel_boost_model3 <- boost_tree(
  mode = "classification", 
  trees = 3000, 
  engine = "xgboost", 
  tree_depth = 11,
  loss_reduction = 0.367, 
  learn_rate = 0.04146552,
  min_n = 17,
  mtry = tune(), 
  sample_size = tune()
  )

# define workflow for tuning stochastic parameters: 
stoch_tune_wf <- workflow() |> 
  add_model(eel_boost_model3) |> 
  add_recipe(eel_recipe)
  2. Set up a tuning grid using grid_max_entropy() again.
stoch_param <- dials::parameters(
  finalize(mtry(), dplyr::select(eel_train, -Angaus)), 
  sample_size = sample_prop(c(0.4, 0.9))
)
stoch_grid <- grid_max_entropy(stoch_param, size = 60)
  3. Show the performance of the best models and the estimates for the stochastic parameter values associated with each.
# fit to the tuning grid 
start_time1 <- Sys.time()
tune_stoch <- stoch_tune_wf |> 
  tune_grid(
    eel_folds, 
    grid = stoch_grid
  )
end_time1 <- Sys.time()
print(paste("time elapsed:", (end_time1 - start_time1)))
[1] "time elapsed: 20.4042701681455"
show_best(tune_stoch, metric = "roc_auc")
# A tibble: 5 × 8
   mtry sample_size .metric .estimator  mean     n std_err .config              
  <int>       <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     2       0.886 roc_auc binary     0.808    10  0.0214 Preprocessor1_Model07
2     5       0.447 roc_auc binary     0.808    10  0.0257 Preprocessor1_Model32
3     6       0.775 roc_auc binary     0.806    10  0.0242 Preprocessor1_Model40
4     1       0.710 roc_auc binary     0.806    10  0.0241 Preprocessor1_Model60
5     3       0.443 roc_auc binary     0.806    10  0.0249 Preprocessor1_Model10
# values carried into the final model (again apparently from an earlier render;
# they differ from the table shown above):
# best value for mtry: 1
# best value for sample_size: 0.8932816

Finalize workflow and make final prediction

  1. Assemble our final workflow with all of our optimized parameters and do a final fit.
# model specification with optimal parameters
eel_boost_final <- boost_tree(
  mode = "classification", 
  trees = 3000, 
  engine = "xgboost", 
  tree_depth = 11,
  loss_reduction = 0.367, 
  learn_rate = 0.04146552,
  min_n = 17,
  mtry = 1, 
  sample_size = 0.8932816
  )

# define a workflow with the final model and recipe
final_wf <- workflow() |> 
  add_model(eel_boost_final) |> 
  add_recipe(eel_recipe)

# note: last_fit() is called below with the model specification and a formula,
# so the recipe preprocessing is not actually applied in this final fit
final_eel_fit <- last_fit(eel_boost_final, 
                          Angaus ~ ., 
                          eel_split)
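
As an aside, last_fit() also accepts the assembled workflow, in which case the recipe's step_integer() would be applied to the test set as well; a sketch of that alternative (not the run whose output is shown below):

# alternative final fit using the workflow, so the recipe preprocessing is applied;
# results would differ slightly from the formula-based fit shown below
final_eel_fit_wf <- last_fit(final_wf, eel_split)
collect_metrics(final_eel_fit_wf)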

final_eel_fit$.predictions
[[1]]
# A tibble: 301 × 6
   .pred_0 .pred_1  .row .pred_class Angaus .config             
     <dbl>   <dbl> <int> <fct>       <fct>  <chr>               
 1   0.194 0.806       2 1           1      Preprocessor1_Model1
 2   0.812 0.188       4 0           0      Preprocessor1_Model1
 3   0.986 0.0136      6 0           0      Preprocessor1_Model1
 4   0.551 0.449      10 0           1      Preprocessor1_Model1
 5   0.968 0.0321     11 0           0      Preprocessor1_Model1
 6   0.996 0.00419    15 0           0      Preprocessor1_Model1
 7   0.215 0.785      18 1           1      Preprocessor1_Model1
 8   0.774 0.226      20 0           0      Preprocessor1_Model1
 9   0.969 0.0314     22 0           0      Preprocessor1_Model1
10   0.985 0.0155     28 0           0      Preprocessor1_Model1
# … with 291 more rows
# ℹ Use `print(n = ...)` to see more rows
boost_tree_metrics <- final_eel_fit |> collect_metrics()
boost_tree_accuracy <- boost_tree_metrics$.estimate[1]
print(paste0("the boosted tree model accuracy came out to: ", boost_tree_accuracy))
[1] "the boosted tree model accuracy came out to: 0.850498338870432"
  2. How well did our model perform? What types of errors did it make?

Make a confusion matrix:

# to make a confusion matrix we need a table of the predictions vs the true values
boost_predictions <- final_eel_fit$.predictions[[1]]

# make a simple table of just the predictions and actual values
confusion_table <- boost_predictions |> 
  dplyr::select(c(.pred_class, Angaus))

# create a confusion matrix comparing the predictions with actual observations
confusionMatrix(data = confusion_table$.pred_class, 
                reference = confusion_table$Angaus)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 226  31
         1  14  30
                                          
               Accuracy : 0.8505          
                 95% CI : (0.8051, 0.8888)
    No Information Rate : 0.7973          
    P-Value [Acc > NIR] : 0.01109         
                                          
                  Kappa : 0.4837          
                                          
 Mcnemar's Test P-Value : 0.01707         
                                          
            Sensitivity : 0.9417          
            Specificity : 0.4918          
         Pos Pred Value : 0.8794          
         Neg Pred Value : 0.6818          
             Prevalence : 0.7973          
         Detection Rate : 0.7508          
   Detection Prevalence : 0.8538          
      Balanced Accuracy : 0.7167          
                                          
       'Positive' Class : 0               
                                          

Well, it looks like my model did OK. It got an accuracy of about 0.85, which is not too bad… from looking at the confusion matrix, most of the errors were false negatives (predicting absence when the eel was present), which were over twice as numerous as the false positives.
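
For anyone who prefers to stay within tidymodels, yardstick's conf_mat() produces the same 2x2 table (a quick alternative to caret::confusionMatrix(), not used above):

# tidymodels equivalent of the caret confusion matrix above
yardstick::conf_mat(confusion_table, truth = Angaus, estimate = .pred_class)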

Fit the model to the evaluation data and compare performance

  1. Now we will fit the final model to the separate evaluation dataset used in the paper.
# read in the eval data 
eval_data <- read_csv("eel.eval.data.csv") |> 
  mutate(Angaus_obs = as.factor(Angaus_obs))
Rows: 500 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Method
dbl (12): Angaus_obs, SegSumT, SegTSeas, SegLowFlow, DSDist, DSMaxSlope, USA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# finalize our model: 
best_params <- select_best(tune_stoch)
Warning: No value of `metric` was given; metric 'roc_auc' will be used.
final_model <- finalize_model(eel_boost_final, parameters = best_params)

# make predictions with our model
eval_fit <- final_model |> fit(Angaus_obs ~ ., data = eval_data)
  2. How does the model perform on this data?
# generate the predicted classes and class probabilities
eval_preds <- eval_fit |> predict(new_data = eval_data)
eval_pred_probs <- eval_fit |> predict(new_data = eval_data, type = "prob")
joined_predictions <- bind_cols(eval_data, eval_preds, eval_pred_probs)

# assess model performance with a confusion matrix 
confusionMatrix(data = joined_predictions$.pred_class, 
                reference = joined_predictions$Angaus_obs)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 374  57
         1  19  50
                                          
               Accuracy : 0.848           
                 95% CI : (0.8135, 0.8783)
    No Information Rate : 0.786           
    P-Value [Acc > NIR] : 0.0002819       
                                          
                  Kappa : 0.4811          
                                          
 Mcnemar's Test P-Value : 2.194e-05       
                                          
            Sensitivity : 0.9517          
            Specificity : 0.4673          
         Pos Pred Value : 0.8677          
         Neg Pred Value : 0.7246          
             Prevalence : 0.7860          
         Detection Rate : 0.7480          
   Detection Prevalence : 0.8620          
      Balanced Accuracy : 0.7095          
                                          
       'Positive' Class : 0               
                                          

Not bad! Accuracy on the evaluation data was 84.8%, essentially the same as on our test set (85.0%). Keep in mind that above we refit the model on the evaluation data itself before predicting on it, which likely flatters this number a bit. As before, false negative errors outnumbered false positives, by roughly three to one.

  3. How do our results compare to those of Elith et al.?

As for variable importance, I got just about the same results for the most important predictors: summer air temperature, distance to the coast, native vegetation, and downstream maximum slope… except that in the paper one of the most important variables was the fishing method, which didn't show up as important for me, which I thought was interesting. The ROC area under the curve that the authors of the paper achieved was 0.858, which was slightly higher than mine… dang! But at least it was really close.

Variable Importance

Using the package {vip} to compare variable importance. This is really cool because we can see which variables influenced the model the most once it is finalized.

# create a plot of variable importance
vip(eval_fit)
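
If a table is more convenient than a plot, {vip} also exposes vi(), which returns the same importance scores as a tibble (a quick addition, not in the original post):

# variable importance scores as a tibble (same information as the plot above)
vi(eval_fit)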

Discussion

What do our variable importance results tell us about the distribution of this eel species?

Summer air temperature was the most influential variable in the model, which suggests it matters a great deal to this species. Maybe some part of their breeding or another important stage of their life cycle occurs in summer. From the code below, it seems like they prefer sites with warmer average summer temperatures.

Their distance to the coast is also very important in determining the presence of this eel species. The average distance for sites where the eel is present is HALF that of sites where it is absent, so I'm thinking the eel likes to be closer to the coast rather than farther away. Maybe this is because they are catadromous, needing to reach the sea (or at least brackish water) for part of their life cycle.

The species also seems to be heavily influenced by the amount of native forest that the particular habitat contains… no surprise there!

The species also seems to prefer areas with more gently sloped downstream reaches, which to me suggests that they rely on being able to travel up- and downstream…

# make a little subset dataframe for just the places that the eel was present
eel_pres <- eel_dat |> 
  filter(Angaus == 1)
eel_abs <- eel_dat |> 
  filter(Angaus == 0)

# find out what summer temperature they like 
mean(eel_pres$SegSumT) # = 17.8005
[1] 17.8005
mean(eel_abs$SegSumT) # = 16.07343
[1] 16.07343
# find out if they like to be closer or farther from the coast
mean(eel_pres$DSDist) # = 42.47604
[1] 42.47604
mean(eel_abs$DSDist) # = 82.60588
[1] 82.60588
# find out whether they like steeper slope or shallower slope 
mean(eel_pres$DSMaxSlope) # = 1.544
[1] 1.544554
mean(eel_abs$DSMaxSlope) # = 3.395
[1] 3.395376
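
The same presence/absence comparison can be written more compactly with a grouped summary; a small sketch using dplyr's across():

# mean of each highlighted predictor, split by eel presence (1) vs absence (0)
eel_dat |> 
  group_by(Angaus) |> 
  summarise(across(c(SegSumT, DSDist, DSMaxSlope), mean))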