Approximately near optimal survey scheme — approx_near_optimal_survey_scheme

Find a near optimal survey scheme that maximizes the value of information. This function uses an approximation method to calculate the expected value of the decision given a survey scheme, and a greedy heuristic algorithm to maximize this metric.

Usage

approx_near_optimal_survey_scheme(
  site_data,
  feature_data,
  site_detection_columns,
  site_n_surveys_columns,
  site_probability_columns,
  site_management_cost_column,
  site_survey_cost_column,
  feature_survey_column,
  feature_survey_sensitivity_column,
  feature_survey_specificity_column,
  feature_model_sensitivity_column,
  feature_model_specificity_column,
  feature_target_column,
  total_budget,
  survey_budget,
  site_management_locked_in_column = NULL,
  site_management_locked_out_column = NULL,
  site_survey_locked_out_column = NULL,
  prior_matrix = NULL,
  n_approx_replicates = 100,
  n_approx_outcomes_per_replicate = 10000,
  seed = 500,
  n_threads = 1,
  verbose = FALSE
)

Arguments

site_data: sf::sf() object with site data.
feature_data: base::data.frame() object with feature data.
site_detection_columns: character names of numeric columns in the argument to site_data that contain the proportion of surveys conducted within each site that detected each feature. Each column should correspond to a different feature, and contain a proportion value (between zero and one). If a site has not previously been surveyed, a value of zero should be used.
site_n_surveys_columns: character names of numeric columns in the argument to site_data that contain the total number of surveys conducted for each feature within each site. Each column should correspond to a different feature, and contain a non-negative integer number (e.g. 0, 1, 2, 3). If a site has not previously been surveyed, a value of zero should be used.
site_probability_columns: character names of numeric columns in the argument to site_data that contain modeled probabilities of occupancy for each feature in each site. Each column should correspond to a different feature, and contain probability data (values between zero and one). No missing (NA) values are permitted in these columns.
site_management_cost_column: character name of column in the argument to site_data that contains costs for managing each site for conservation. This column should have numeric values that are equal to or greater than zero. No missing (NA) values are permitted in this column.
site_survey_cost_column: character name of column in the argument to site_data that contains costs for surveying each site. This column should have numeric values that are equal to or greater than zero. No missing (NA) values are permitted in this column.
feature_survey_column: character name of the column in the argument to feature_data that contains logical (TRUE / FALSE) values indicating if the feature will be surveyed in the planned surveys or not. Note that considering additional features will rapidly increase computational burden, and so it is only recommended to consider features that are of specific conservation interest. No missing (NA) values are permitted in this column.
feature_survey_sensitivity_column: character name of the column in the argument to feature_data that contains the probability of future surveys correctly detecting a presence of each feature in a given site (i.e. the sensitivity of the survey methodology). This column should have numeric values that are between zero and one. No missing (NA) values are permitted in this column.
feature_survey_specificity_column: character name of the column in the argument to feature_data that contains the probability of future surveys correctly detecting an absence of each feature in a given site (i.e. the specificity of the survey methodology). This column should have numeric values that are between zero and one. No missing (NA) values are permitted in this column.
feature_model_sensitivity_column: character name of the column in the argument to feature_data that contains the probability of the initial models correctly predicting a presence of each feature in a given site (i.e. the sensitivity of the models). This column should have numeric values that are between zero and one. No missing (NA) values are permitted in this column. This should ideally be calculated using fit_xgb_occupancy_models() or fit_hglm_occupancy_models().
feature_model_specificity_column: character name of the column in the argument to feature_data that contains the probability of the initial models correctly predicting an absence of each feature in a given site (i.e. the specificity of the models). This column should have numeric values that are between zero and one. No missing (NA) values are permitted in this column. This should ideally be calculated using fit_xgb_occupancy_models() or fit_hglm_occupancy_models().
feature_target_column: character name of the column in the argument to feature_data that contains the target values used to parametrize the conservation benefit of managing each feature. This column should have numeric values that are equal to or greater than zero. No missing (NA) values are permitted in this column.
total_budget: numeric maximum expenditure permitted for conducting surveys and managing sites for conservation.
survey_budget: numeric maximum expenditure permitted for conducting surveys.
site_management_locked_in_column: character name of the column in the argument to site_data that contains logical (TRUE / FALSE) values indicating which sites should be locked in (TRUE) for being managed for conservation or (FALSE) not. No missing (NA) values are permitted in this column. This is useful if some sites have already been earmarked for conservation, or if some sites are already being managed for conservation. Defaults to NULL such that no sites are locked in.
site_management_locked_out_column: character name of the column in the argument to site_data that contains logical (TRUE / FALSE) values indicating which sites should be locked out (TRUE) from being managed for conservation or (FALSE) not. No missing (NA) values are permitted in this column. This is useful if some sites could potentially be surveyed to improve model predictions even if they cannot be managed for conservation. Defaults to NULL such that no sites are locked out.
site_survey_locked_out_column: character name of the column in the argument to site_data that contains logical (TRUE / FALSE) values indicating which sites should be locked out (TRUE) from being selected for future surveys or (FALSE) not. No missing (NA) values are permitted in this column. This is useful if some sites will never be considered for future surveys (e.g. because they are too costly to survey, or have a low chance of containing the target species). Defaults to NULL such that no sites are locked out.
prior_matrix: numeric matrix containing the prior probability of each feature occupying each site. Rows correspond to features, and columns correspond to sites. Defaults to NULL such that prior data is calculated automatically using prior_probability_matrix().
n_approx_replicates: integer number of replicates to use for approximating the expected value calculations. Defaults to 100.
n_approx_outcomes_per_replicate: integer number of outcomes to use per replicate for approximation calculations. Defaults to 10000.
seed: integer state of the random number generator for simulating outcomes when conducting the value of information analyses. Defaults to 500.
n_threads: integer number of threads to use for computation. Defaults to 1.
verbose: logical indicating if information should be printed during processing. Defaults to FALSE.
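
As a minimal sketch of how the detection and survey count columns relate, the snippet below derives a proportion column from raw counts for a single feature. The data frame raw and its columns detections and n_surveys are hypothetical names used only for illustration, and a plain data.frame stands in for the sf object so that the example is self-contained.

```r
# hypothetical raw survey records for one feature across four sites
raw <- data.frame(
  detections = c(0, 1, 1, 0),  # number of surveys that detected the feature
  n_surveys  = c(0, 4, 1, 2)   # total number of surveys conducted per site
)

# proportion of surveys with a detection; unsurveyed sites are given zero
raw$f1 <- ifelse(raw$n_surveys > 0, raw$detections / raw$n_surveys, 0)
raw$n1 <- raw$n_surveys

# basic checks mirroring the argument descriptions above
stopifnot(
  !anyNA(raw$f1),
  all(raw$f1 >= 0 & raw$f1 <= 1),
  all(raw$n1 >= 0),
  all(raw$n1 == round(raw$n1))
)

print(raw[, c("f1", "n1")])
```

Here the second site has one detection in four surveys, so its proportion is 0.25, and the first site has never been surveyed, so both its proportion and its survey count are zero.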

Value

A matrix of logical (TRUE / FALSE) values indicating if a site is selected in the scheme or not. Columns correspond to sites, and rows correspond to different schemes. If there are no ties for the best identified solution, then the matrix will only contain a single row.
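
The following sketch shows one way to read such a matrix. The object schemes is a hand-built stand-in for the return value (two tied schemes over six sites), not output produced by calling the function.

```r
# hypothetical return value: two tied schemes (rows) over six sites (columns)
schemes <- matrix(
  c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
    FALSE, FALSE, FALSE, FALSE, TRUE, TRUE),
  nrow = 2, byrow = TRUE
)

# indices of the sites selected for surveying in each scheme
selected <- lapply(seq_len(nrow(schemes)), function(i) which(schemes[i, ]))

# number of sites selected in each scheme
n_selected <- rowSums(schemes)

print(selected)
print(n_selected)
```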

Details

Ideally, the brute-force algorithm would be used to identify the optimal survey scheme. Unfortunately, it is not feasible to apply the brute-force algorithm to large problems because it can take an incredibly long time to complete. In such cases, it may be desirable to obtain a "relatively good" survey scheme, and the greedy heuristic algorithm is provided for this purpose. Unlike the brute-force algorithm, the greedy heuristic algorithm is not guaranteed to identify an optimal solution, or even a "relatively good" solution for that matter, though greedy heuristic algorithms tend to deliver solutions that are 15% from optimality. The greedy algorithm is implemented as follows:

1. Initialize an empty list of survey scheme solutions, and an empty list of approximate expected values.
2. Calculate the expected value of current information.
3. Add a survey scheme with no sites selected for surveying to the list of survey scheme solutions, and add the expected value of current information to the list of approximate expected values.
4. Set the current survey solution as the survey scheme with no sites selected for surveying.
5. For each remaining candidate site that has not been selected for a survey, generate a new candidate survey scheme by adding that candidate site to the current survey solution.
6. Calculate the approximate expected value of each new candidate survey scheme. If the cost of a given candidate survey scheme exceeds the survey budget, then store a missing (NA) value instead. Likewise, if the cost of a given candidate survey scheme plus the management costs of locked in planning units exceeds the total budget, then store a missing (NA) value.
7. If all of the new candidate survey schemes are associated with missing (NA) values, because they all exceed the survey budget or the total budget, then go to step 12.
8. Calculate the cost effectiveness of each new candidate survey scheme. This is calculated as the difference between the approximate expected value of a given new candidate survey scheme and that of the current survey solution, divided by the cost of the newly selected candidate site.
9. Find the new candidate survey scheme that is associated with the highest cost effectiveness value, ignoring any missing (NA) values. This new candidate survey scheme is now set as the current survey scheme.
10. Store the current survey scheme in the list of survey scheme solutions and store its approximate expected value in the list of approximate expected values.
11. Go to step 5.
12. Find the solution in the list of survey scheme solutions that has the highest expected value in the list of approximate expected values and return this solution.
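
The loop described above can be sketched in base R as follows. This is an illustrative simplification and not the package's implementation: toy_value() is a hypothetical stand-in for the approximate expected value calculation, the gain values are made up, only the survey budget is checked (the total budget check is omitted), and ties are not reported.

```r
# toy inputs (hypothetical): survey cost per site and a survey budget
survey_cost <- c(19, 22, 13, 19, 9, 12)
survey_budget <- 37.6

# hypothetical stand-in for the approximate expected value of a scheme,
# with diminishing returns on made-up per-site information gains
toy_value <- function(scheme) {
  gain <- c(0.30, 0.10, 0.25, 0.05, 0.40, 0.20)
  sqrt(sum(gain[scheme]))
}

greedy_scheme <- function(cost, budget, value_fn) {
  # start from the scheme with no sites selected
  current <- rep(FALSE, length(cost))
  solutions <- list(current)
  values <- value_fn(current)
  repeat {
    candidates <- which(!current)
    if (length(candidates) == 0) break
    # value of each candidate scheme, or NA if it exceeds the budget
    cand_values <- vapply(candidates, function(i) {
      s <- current
      s[i] <- TRUE
      if (sum(cost[s]) > budget) NA_real_ else value_fn(s)
    }, numeric(1))
    if (all(is.na(cand_values))) break
    # cost effectiveness: gain in value per unit cost of the added site
    ce <- (cand_values - value_fn(current)) / cost[candidates]
    best <- which.max(ce)  # which.max() skips NA entries
    current[candidates[best]] <- TRUE
    solutions[[length(solutions) + 1]] <- current
    values <- c(values, cand_values[best])
  }
  # return the stored scheme with the highest value
  list(scheme = solutions[[which.max(values)]], value = max(values))
}

result <- greedy_scheme(survey_cost, survey_budget, toy_value)
print(which(result$scheme))
```

Because every intermediate scheme is stored, the final step can return an earlier, smaller scheme if adding further sites did not increase the value.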

Examples

# set seed for reproducibility
set.seed(123)

# load example site data
data(sim_sites)
print(sim_sites)
#> Simple feature collection with 6 features and 13 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 0.02541313 ymin: 0.07851093 xmax: 0.9888107 ymax: 0.717068
#> CRS:           NA
#> # A tibble: 6 × 14
#>   survey_cost management_cost    f1    f2    f3    n1    n2    n3     e1     e2
#>         <dbl>           <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
#> 1          19              99     0     0  0        0     0     0  1.13   0.535
#> 2          22              87     0     1  0.25     4     4     4 -1.37  -1.45 
#> 3          13              94     1     1  0        1     1     1  0.155 -0.867
#> 4          19              61     0     0  0        0     0     0 -0.792  1.32 
#> 5           9             105     0     0  0        0     0     0 -0.194  0.238
#> 6          12             136     0     0  0        0     0     0  1.07   0.220
#> # ℹ 4 more variables: p1 <dbl>, p2 <dbl>, p3 <dbl>, geometry <POINT>

# load example feature data
data(sim_features)
print(sim_features)
#> # A tibble: 3 × 7
#>   name  survey survey_sensitivity survey_specificity model_sensitivity
#>   <chr> <lgl>               <dbl>              <dbl>             <dbl>
#> 1 f1    TRUE                0.951              0.854             0.711
#> 2 f2    TRUE                0.990              0.832             0.722
#> 3 f3    TRUE                0.986              0.808             0.772
#> # ℹ 2 more variables: model_specificity <dbl>, target <dbl>

# set total budget for surveying and managing sites for conservation
# (i.e. 50% of the cost of managing all sites)
total_budget <- sum(sim_sites$management_cost) * 0.5

# set budget for surveying sites
# (i.e. 40% of the cost of surveying all sites)
survey_budget <- sum(sim_sites$survey_cost) * 0.4

# find survey scheme using approximate method and greedy heuristic algorithm
# (using the default approximation settings; a smaller n_approx_replicates
# value would let this example complete more quickly)
approx_near_optimal_survey <- approx_near_optimal_survey_scheme(
  sim_sites, sim_features,
  c("f1", "f2", "f3"), c("n1", "n2", "n3"), c("p1", "p2", "p3"),
  "management_cost", "survey_cost",
  "survey", "survey_sensitivity", "survey_specificity",
  "model_sensitivity", "model_specificity",
  "target", total_budget, survey_budget)

# print result
print(approx_near_optimal_survey)
#>       [,1]  [,2]  [,3]  [,4] [,5]  [,6]
#> [1,] FALSE FALSE FALSE FALSE TRUE FALSE
#> [2,] FALSE FALSE FALSE FALSE TRUE  TRUE
#> attr(,"ev")
#> [1] 1.771926 1.771926
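
When several schemes tie, as in the output above, it can help to compare their survey costs. The sketch below re-enters the printed result and the survey_cost values shown for sim_sites by hand, so it runs without the package. Preferring the cheaper scheme is only one possible way to break the tie and is an assumption, not something the function does.

```r
# survey costs copied from the printed sim_sites data above
survey_cost <- c(19, 22, 13, 19, 9, 12)
survey_budget <- sum(survey_cost) * 0.4

# the two tied schemes, re-entered from the printed result
schemes <- matrix(
  c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
    FALSE, FALSE, FALSE, FALSE, TRUE, TRUE),
  nrow = 2, byrow = TRUE
)
attr(schemes, "ev") <- c(1.771926, 1.771926)

# total survey cost of each scheme
scheme_cost <- apply(schemes, 1, function(s) sum(survey_cost[s]))
print(scheme_cost)

# both schemes should fit within the survey budget
stopifnot(all(scheme_cost <= survey_budget))

# one possible tie-break (an assumption): pick the cheapest scheme
cheapest <- which.min(scheme_cost)
print(which(schemes[cheapest, ]))
```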