R/fit_xgb_occupancy_models.R
fit_xgb_occupancy_models.Rd
Estimate the probability of occupancy for a set of features in a set of
planning units. Models are fitted using gradient boosted trees (via
xgboost::xgb.train()).
fit_xgb_occupancy_models(
  site_data,
  feature_data,
  site_detection_columns,
  site_n_surveys_columns,
  site_env_vars_columns,
  feature_survey_sensitivity_column,
  feature_survey_specificity_column,
  xgb_tuning_parameters,
  xgb_early_stopping_rounds = rep(20, length(site_detection_columns)),
  xgb_n_rounds = rep(100, length(site_detection_columns)),
  n_folds = rep(5, length(site_detection_columns)),
  n_threads = 1,
  seed = 500,
  verbose = FALSE
)
site_data: sf::sf() object with site data.
feature_data: base::data.frame() object with feature data.
site_detection_columns: character names of numeric columns in the
argument to site_data that contain the proportion of surveys conducted
within each site that detected each feature. Each column should
correspond to a different feature, and contain a proportion value
(between zero and one). If a site has not previously been surveyed, a
value of zero should be used.
site_n_surveys_columns: character names of numeric columns in the
argument to site_data that contain the total number of surveys conducted
for each feature within each site. Each column should correspond to a
different feature, and contain a non-negative integer number (e.g. 0, 1,
2, 3). If a site has not previously been surveyed, a value of zero should
be used.
site_env_vars_columns: character names of columns in the argument to
site_data that contain environmental information for fitting updated
occupancy models based on possible survey outcomes. Each column should
correspond to a different environmental variable, and contain numeric,
factor, or character data. No missing (NA) values are permitted in these
columns.
feature_survey_sensitivity_column: character name of the column in the
argument to feature_data that contains the probability of future surveys
correctly detecting a presence of each feature in a given site (i.e. the
sensitivity of the survey methodology). This column should have numeric
values that are between zero and one. No missing (NA) values are
permitted in this column.
feature_survey_specificity_column: character name of the column in the
argument to feature_data that contains the probability of future surveys
correctly detecting an absence of each feature in a given site (i.e. the
specificity of the survey methodology). This column should have numeric
values that are between zero and one. No missing (NA) values are
permitted in this column.
xgb_tuning_parameters: list object containing the candidate parameter
values for fitting models. Valid parameters include: "max_depth", "eta",
"lambda", "min_child_weight", "subsample", "colsample_by_tree",
"objective". See documentation for the params argument in
xgboost::xgb.train() for more information.
xgb_early_stopping_rounds: numeric model rounds for parameter tuning.
See xgboost::xgboost() for more information. Defaults to 20 for each
feature.
xgb_n_rounds: numeric model rounds for model fitting. See
xgboost::xgboost() for more information. Defaults to 100 for each
feature.
n_folds: numeric number of folds to split the training data into when
fitting models for each feature. Defaults to 5 for each feature.
n_threads: integer number of threads to use for parameter tuning.
Defaults to 1.
seed: integer initial random number generator state for model fitting.
Defaults to 500.
verbose: logical indicating if information should be printed during
computations. Defaults to FALSE.
A list object containing:
parameters: list of list objects containing the best tuning parameters
for each feature.
predictions: tibble::tibble() object containing predictions for each
feature.
performance: tibble::tibble() object containing the performance of the
best models for each feature. It contains the following columns:
feature: name of the feature.
train_tss_mean: mean TSS statistic for models calculated using training data in cross-validation.
train_tss_std: standard deviation in TSS statistics for models calculated using training data in cross-validation.
train_sensitivity_mean: mean sensitivity statistic for models calculated using training data in cross-validation.
train_sensitivity_std: standard deviation in sensitivity statistics for models calculated using training data in cross-validation.
train_specificity_mean: mean specificity statistic for models calculated using training data in cross-validation.
train_specificity_std: standard deviation in specificity statistics for models calculated using training data in cross-validation.
test_tss_mean: mean TSS statistic for models calculated using test data in cross-validation.
test_tss_std: standard deviation in TSS statistics for models calculated using test data in cross-validation.
test_sensitivity_mean: mean sensitivity statistic for models calculated using test data in cross-validation.
test_sensitivity_std: standard deviation in sensitivity statistics for models calculated using test data in cross-validation.
test_specificity_mean: mean specificity statistic for models calculated using test data in cross-validation.
test_specificity_std: standard deviation in specificity statistics for models calculated using test data in cross-validation.
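The performance statistics are related by the standard definition TSS = sensitivity + specificity - 1. As a minimal sketch of that relationship, the following computes all three statistics from binary observations and predictions. The helper name calc_tss and the example vectors are hypothetical and are not part of this package; the package may additionally apply weights, which this sketch ignores.

```r
# Hypothetical helper (not part of the package): compute sensitivity,
# specificity, and the True Skill Statistic from 0/1 vectors.
calc_tss <- function(observed, predicted) {
  tp <- sum(observed == 1 & predicted == 1)  # true positives
  fn <- sum(observed == 1 & predicted == 0)  # false negatives
  tn <- sum(observed == 0 & predicted == 0)  # true negatives
  fp <- sum(observed == 0 & predicted == 1)  # false positives
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  c(sensitivity = sensitivity,
    specificity = specificity,
    tss = sensitivity + specificity - 1)
}

# made-up observations and predictions for eight sites
observed  <- c(1, 1, 1, 1, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 1, 1)
stats <- calc_tss(observed, predicted)
print(stats)
```

A TSS of 1 indicates perfect discrimination, and values near zero indicate performance no better than chance.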
This function (i) prepares the data for model fitting, (ii) calibrates
the tuning parameters for model fitting (see xgboost::xgb.train()
for details on tuning parameters), (iii) generates predictions using
the best found tuning parameters, and (iv) assesses the performance of the
best supported models. These analyses are performed separately for each
feature. For a given feature:
The data are prepared for model fitting by partitioning the data using
k-fold cross-validation (set via the argument to n_folds). The
training and evaluation folds are constructed
in such a manner as to ensure that each training and evaluation
fold contains at least one presence and one absence observation.
A grid search method is used to tune the model parameters. The
candidate values for each parameter (specified via the argument to
xgb_tuning_parameters) are used to generate a full set of parameter
combinations, and these parameter combinations are subsequently used for
tuning the models. To account for unbalanced datasets, the
scale_pos_weight xgboost::xgboost() parameter is calculated as the mean
value across each of the training folds (i.e. number of absences divided
by number of presences per feature). For a given parameter combination,
models are fit using k-fold cross-validation (via xgboost::xgb.cv()),
using the previously mentioned training and evaluation folds, and the
True Skill Statistic (TSS) calculated using the data held out from each
fold is used to quantify the performance (i.e. the "test_tss_mean" column
in the output). These models are also fitted using the
early_stopping_rounds parameter to reduce the time spent tuning models.
If relevant, they are also fitted using the supplied weights (as
specified by the argument to site_weights_data). After exploring the
full set of parameter combinations, the best parameter combination is
identified, and the associated parameter values and models are stored for
later use.
The cross-validation models associated with the best parameter combination are used to predict the average probability that the feature occupies each site. These predictions include sites that have been surveyed before, and also sites that have not been surveyed before.
The performance of the cross-validation models is evaluated.
Specifically, the TSS, sensitivity, and specificity statistics are
calculated (if relevant, weighted by the argument to
site_weights_data). These performance values are calculated using
the models' training and evaluation folds.
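The data preparation and tuning steps described above can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the presence/absence vector y is made up, the fold assignment simply stratifies by class, and the variable names (fold_id, grid, spw) are hypothetical.

```r
set.seed(500)

# Hypothetical presence/absence observations for 30 sites (1 = presence).
y <- rep(c(1, 0), times = c(10, 20))

# Assign folds separately within absences and presences, so that every
# fold (and hence every training set) contains both classes.
n_folds <- 5
fold_id <- integer(length(y))
for (cls in c(0, 1)) {
  idx <- which(y == cls)
  fold_id[idx] <- sample(rep(seq_len(n_folds), length.out = length(idx)))
}

# Expand the candidate values into the full grid of parameter combinations.
candidates <- list(
  eta = seq(0.1, 0.5, length.out = 3),
  lambda = 10 ^ seq(-1.0, 0.0, length.out = 3),
  objective = "binary:logistic"
)
grid <- expand.grid(candidates, stringsAsFactors = FALSE)

# Ratio of absences to presences in each training fold; its mean is one
# way to obtain a single scale_pos_weight value for an unbalanced dataset.
spw <- vapply(seq_len(n_folds), function(k) {
  train <- y[fold_id != k]
  sum(train == 0) / sum(train == 1)
}, numeric(1))

print(nrow(grid))
print(mean(spw))
```

Each row of grid would then be passed, together with the fold assignments, to a cross-validation routine such as xgboost::xgb.cv(), and the combination with the highest held-out TSS retained.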
# \dontrun{
# set seeds for reproducibility
set.seed(123)
# simulate data for 30 sites, 2 features, and 3 environmental variables
site_data <- simulate_site_data(
  n_sites = 30, n_features = 2, n_env_vars = 3, prop = 0.1)
feature_data <- simulate_feature_data(n_features = 2, prop = 1)
# create list of possible tuning parameters for modeling
parameters <- list(eta = seq(0.1, 0.5, length.out = 3),
                   lambda = 10 ^ seq(-1.0, 0.0, length.out = 3),
                   objective = "binary:logistic")
# fit models
# note that we use a small grid here (9 parameter combinations) so that
# the example finishes quickly; you would probably want a much larger grid
results <- fit_xgb_occupancy_models(
  site_data, feature_data,
  c("f1", "f2"), c("n1", "n2"), c("e1", "e2", "e3"),
  "survey_sensitivity", "survey_specificity",
  n_folds = rep(5, 2), xgb_early_stopping_rounds = rep(100, 2),
  xgb_tuning_parameters = parameters, n_threads = 1)
# print best found model parameters
print(results$parameters)
#> [[1]]
#> [[1]]$eta
#> [1] 0.1
#>
#> [[1]]$lambda
#> [1] 0.1
#>
#> [[1]]$objective
#> [1] "binary:logistic"
#>
#> [[1]]$scale_pos_weight
#> [[1]]$scale_pos_weight[[1]]
#> [1] 1 1 1 1 1
#>
#>
#>
#> [[2]]
#> [[2]]$eta
#> [1] 0.1
#>
#> [[2]]$lambda
#> [1] 0.1
#>
#> [[2]]$objective
#> [1] "binary:logistic"
#>
#> [[2]]$scale_pos_weight
#> [[2]]$scale_pos_weight[[1]]
#> [1] 1 1 1 1 1
#>
#>
#>
# print model predictions
print(results$predictions)
#> # A tibble: 30 × 2
#> f1 f2
#> <dbl> <dbl>
#> 1 0.450 0.635
#> 2 0.549 0.362
#> 3 0.450 0.629
#> 4 0.549 0.362
#> 5 0.539 0.363
#> 6 0.450 0.638
#> 7 0.450 0.373
#> 8 0.549 0.627
#> 9 0.450 0.630
#> 10 0.529 0.557
#> # ℹ 20 more rows
# print model performance
print(results$performance, width = Inf)
#> # A tibble: 2 × 13
#> feature train_tss_mean train_tss_std train_sensitivity_mean
#> <chr> <dbl> <dbl> <dbl>
#> 1 f1 1.00 0 1.00
#> 2 f2 0.976 0.0142 1.00
#> train_sensitivity_std train_specificity_mean train_specificity_std
#> <dbl> <dbl> <dbl>
#> 1 0 1.00 0
#> 2 0 0.976 0.0142
#> test_tss_mean test_tss_std test_sensitivity_mean test_sensitivity_std
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.669 0.433 0.900 0.224
#> 2 0.729 0.253 0.992 0.0174
#> test_specificity_mean test_specificity_std
#> <dbl> <dbl>
#> 1 0.769 0.265
#> 2 0.737 0.263
# }