surveyvoi: Survey Value of Information

Introduction

The funding available for conservation is limited. To ensure that conservation funds are allocated cost-effectively, conservation plans (termed prioritizations) can be developed – using a combination of economic, biodiversity, and land-use data – to prioritize a set of sites for conservation management (e.g. protected area establishment). However, existing data on biodiversity patterns are incomplete. As a consequence, prioritizations can potentially be improved by collecting additional data. Specifically, ecological surveys can be conducted in sites to learn more about which species are present within them. However, conducting ecological surveys reduces the funds available for conservation management. Thus, decision makers need to strategically allocate funding for surveying sites and managing them for conservation—this is not a trivial task.

The surveyvoi R package is a decision support tool for prioritizing sites for ecological surveys based on their potential to improve plans for conserving biodiversity (e.g. plans for establishing protected areas). Given a set of sites that could potentially be acquired for conservation management – wherein some sites have previously been surveyed and other sites have not – it can be used to generate and evaluate plans for additional surveys. Specifically, plans for ecological surveys can be generated using various conventional approaches (e.g. maximizing expected species richness, geographic coverage, diversity of sampled environmental conditions) or by directly maximizing value of information using optimization algorithms. Once generated, survey plans can also be evaluated using value of information analysis. Please note that several functions depend on the ‘Gurobi’ optimization software (available from https://www.gurobi.com) and the gurobi R package (installation instructions are available online for Linux, Windows, and Mac OS).

This tutorial provides a brief overview of the surveyvoi R package. Here, we will simulate survey data, fit statistical models to characterize the spatial distribution of a simulated species, and generate and evaluate survey schemes based on different approaches. Although this tutorial deals with only a single simulated species – to keep the tutorial simple and reduce computational burden – the functions used in this tutorial are designed to work with multiple species. If you want to learn more about a specific function, please consult the documentation written specifically for the function (accessible using the R code ?function, where function is the name of the desired function).

Setup

Let’s start by setting up our R session. Here we will load some R packages and set the seed for the random number generator so that the results are reproducible.

# load packages
library(tidyr)
library(dplyr)
library(surveyvoi)
library(ggplot2)
library(gridExtra)
library(viridis)
library(tibble)

# set RNG seed for reproducibility
set.seed(40)

# set default table printing options
options(pillar.sigfig = 6, tibble.width = Inf)

Simulate data

Let’s simulate some data. To keep things simple, we will simulate data for 30 sites and one conservation feature (e.g. species). Of the 30 sites in total, we will simulate survey data for 15 sites—meaning that 15 of the sites will not have survey data. We will also simulate three spatially auto-correlated variables to characterize the environmental conditions within the sites. Although the simulation code (i.e. simulate_site_data) can output the probability that features are expected to inhabit the sites, we will disable this option to make our simulation study more realistic and instead predict these probabilities using statistical models.

# simulate site data
site_data <- simulate_site_data(
  n_sites = 30, n_features = 1, proportion_of_sites_missing_data = 15 / 30,
  n_env_vars = 3, survey_cost_intensity = 5, management_cost_intensity = 2500,
  max_number_surveys_per_site = 1, output_probabilities = FALSE)

# print site data
print(site_data)

## Simple feature collection with 30 features and 7 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 0.07758767 ymin: 0.03189323 xmax: 0.9762666 ymax: 0.9557619
## CRS:           NA
## # A tibble: 30 × 8
##    survey_cost management_cost    f1    n1        e1        e2         e3
##          <dbl>           <dbl> <dbl> <dbl>     <dbl>     <dbl>      <dbl>
##  1           9            2458     0     0  0.334113  0.502778 -1.00012  
##  2           1            2524     0     1 -1.32716  -1.19893   1.06494  
##  3           3            2483     0     0 -1.18385   0.676827  0.387451 
##  4          12            2513     0     0  0.691077 -0.926471 -0.244236 
##  5           4            2477     0     0  0.832864  2.70543  -0.351963 
##  6           8            2481     0     1 -0.870905 -0.635214 -0.779346 
##  7           1            2499     1     1  0.774926  1.31091  -0.0104341
##  8           6            2479     0     0  1.34695   0.220259  0.824172 
##  9           7            2484     0     1 -0.670210 -0.729128 -0.912692 
## 10          11            2519     0     0  0.125632  0.313454 -1.37448  
##                geometry
##                 <POINT>
##  1  (0.683582 0.841256)
##  2  (0.872904 0.240454)
##  3  (0.690117 0.679761)
##  4 (0.115936 0.0704436)
##  5  (0.195009 0.646561)
##  6  (0.461201 0.102122)
##  7  (0.203535 0.955762)
##  8  (0.590849 0.748232)
##  9  (0.373888 0.150227)
## 10  (0.141298 0.307626)
## # ℹ 20 more rows

# plot the spatial location of the sites
ggplot(site_data) +
  geom_sf() +
  ggtitle("Sites") +
  labs(x = "X coordinate", y = "Y coordinate")

The site_data object is a spatially explicit dataset (i.e. sf object) that contains information on the site locations and additional site attributes. Here, each row corresponds to a different site, and each column contains a different site attribute. The f1 column contains the results from previous surveys, expressed as the proportion of previous surveys at each site in which the species was detected. Since each site has had at most a single previous survey, these data contain only zeros (indicating that the species was not detected) and ones (indicating that the species was detected). The n1 column contains the number of previous surveys conducted within each site. Thus, sites with zeros in this column have not previously been surveyed. The e1, e2, and e3 columns contain environmental information for each site (e.g. normalized temperature and rainfall data). The survey_cost column contains the cost of surveying each site, and the management_cost column contains the cost of managing each site for conservation.

To help understand the simulated data, let’s create some visualizations.

# plot site occupancy data from previous surveys
# 1 = species was detected in 100% of the previous surveys
# 0 = species was detected in 0% of the previous surveys
site_data %>%
  select(starts_with("f")) %>%
  gather(name, value, -geometry) %>%
  mutate(value = as.character(value)) %>%
  ggplot() +
  geom_sf(aes(color = value)) +
    scale_color_manual(values = c("1" = "red", "0" = "black")) +
    facet_wrap(~ name) +
    labs(title = "Detection/non-detection data",
         x = "X coordinate", y = "Y coordinate")

# plot number of previous surveys within each site
site_data %>%
  select(starts_with("n")) %>%
  gather(name, value, -geometry) %>%
  mutate(value = as.character(value)) %>%
  ggplot() +
  geom_sf(aes(color = value)) +
    scale_color_manual(values = c("1" = "blue", "0" = "black")) +
    facet_wrap(~ name) +
    labs(title = "Number of previous surveys",
         x = "X coordinate", y = "Y coordinate")

# plot site cost data
# note that survey and management costs are on different scales
p1 <- ggplot(site_data) +
      geom_sf(aes(color = survey_cost)) +
      scale_color_viridis() +
      labs(title = "Survey cost", x = "X coordinate", y = "Y coordinate") +
      theme(legend.title = element_blank())
p2 <- ggplot(site_data) +
      geom_sf(aes(color = management_cost)) +
      scale_color_viridis() +
      labs(title = "Management cost", x = "X coordinate", y = "Y coordinate") +
      theme(legend.title = element_blank())
grid.arrange(p1, p2, nrow = 1)

# plot site environmental data
site_data %>%
  select(starts_with("e")) %>%
  gather(var, value, -geometry) %>%
  ggplot() +
  geom_sf(aes(color = value)) +
    facet_wrap(~ var) +
    scale_color_viridis() +
    labs(title = "Environmental conditions",
         x = "X coordinate", y = "Y coordinate")

After simulating data for the sites, we will simulate data for the conservation feature. We set proportion_of_survey_features = 1 to indicate that this feature will be examined in future surveys.

# simulate feature data
feature_data <- simulate_feature_data(
  n_features = 1, proportion_of_survey_features = 1)

# remove simulated model performance statistics since we will fit models below
feature_data$model_sensitivity <- NULL
feature_data$model_specificity <- NULL

# manually set target
feature_data$target <- 2

# print feature data
print(feature_data)

## # A tibble: 1 × 5
##   name  survey survey_sensitivity survey_specificity target
##   <chr> <lgl>               <dbl>              <dbl>  <dbl>
## 1 f1    TRUE             0.989102           0.834741      2

The feature_data object is a table (i.e. tibble object) that contains information on the conservation feature. Here, each row corresponds to a different feature – and so it only has one row because we only have one feature – and each column contains different information about the feature(s). The name column contains the name of the feature. The survey column indicates if the feature will be examined in future surveys. The survey_sensitivity and survey_specificity columns denote the sensitivity (probability of correctly recording a presence) and specificity (probability of correctly recording an absence) of the survey methodology. Finally, the target column specifies the number of occupied sites for each species that should ideally be represented in the prioritization.
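
To build some intuition for what survey sensitivity and specificity imply, the following minimal sketch applies Bayes' theorem to a single survey at a single site. It is only an illustration: posterior_occupancy is a hypothetical helper (not a surveyvoi function), the prior probability of 0.5 is an arbitrary assumption, and the sensitivity and specificity values are copied from the feature data printed above.

```r
# posterior probability that a site is occupied after a single survey,
# given the sensitivity and specificity of the survey (Bayes' theorem)
# note: `posterior_occupancy` is a hypothetical helper, not part of surveyvoi
posterior_occupancy <- function(prior, detected, sensitivity, specificity) {
  # likelihood of the survey result if the species is present or absent
  lik_present <- if (detected) sensitivity else 1 - sensitivity
  lik_absent <- if (detected) 1 - specificity else specificity
  (lik_present * prior) / (lik_present * prior + lik_absent * (1 - prior))
}

# values copied from the feature data above (the prior of 0.5 is assumed)
sens <- 0.989102
spec <- 0.834741
post_detected <- posterior_occupancy(0.5, TRUE, sens, spec)
post_not_detected <- posterior_occupancy(0.5, FALSE, sens, spec)

# print posterior probabilities for each possible survey result
print(c(detected = post_detected, not_detected = post_not_detected))
```

Under these assumptions, a non-detection is strong evidence of absence (because sensitivity is high), whereas a detection is somewhat less conclusive (because specificity is lower).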

Modeling probability of occupancy

After simulating the data, we need to estimate the probability of the feature occurring in the unsurveyed sites. This is important for calculating the potential benefits of surveying sites, because if we can reliably predict the probability of the feature(s) occurring in unsurveyed sites using models, then we may not need to conduct any additional surveys. Specifically, we will fit gradient boosted regression trees – via the xgboost R package. These models are well-suited for modeling species distributions because they can accommodate high order interactions among different predictor variables that are needed to effectively model species’ environmental niches, even in the case of limited data. Furthermore, they can incorporate knowledge of the sensitivity and specificity of previous surveys during model fitting (using weights).

# create list of candidate parameter values for calibration procedure
xgb_parameters <- list(eta = 0.1, lambda = 0.1, objective = "binary:logistic")

# identify suitable parameters for model fitting
# ideally we would try a larger range of values (i.e. not just a single value of 0.1),
# but we will keep it low to reduce processing time for this example
xgb_results <- fit_xgb_occupancy_models(
  site_data, feature_data,
  c("f1"), c("n1"), c("e1", "e2", "e3"),
  "survey_sensitivity", "survey_specificity",
  n_folds = c(2), xgb_tuning_parameters = xgb_parameters)

After fitting the models, we can examine the tuning parameters used to fit the models, extract the modeled probability of occupancy, and evaluate the performance of the models.

# print best parameters
print(xgb_results$parameters)

## [[1]]
## [[1]]$eta
## [1] 0.1
## 
## [[1]]$lambda
## [1] 0.1
## 
## [[1]]$objective
## [1] "binary:logistic"
## 
## [[1]]$scale_pos_weight
## [[1]]$scale_pos_weight[[1]]
## [1] 1 1

# print model performance (TSS value)
xgb_performance <- xgb_results$performance
print(data.frame(xgb_performance))

##   feature train_tss_mean train_tss_std train_sensitivity_mean
## 1      f1              1             0                      1
##   train_sensitivity_std train_specificity_mean train_specificity_std
## 1                     0                      1                     0
##   test_tss_mean test_tss_std test_sensitivity_mean test_sensitivity_std
## 1     0.7194687    0.2261671             0.9396965           0.08528208
##   test_specificity_mean test_specificity_std
## 1             0.7797722            0.3114492
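
The true skill statistic (TSS) reported above is defined as sensitivity plus specificity minus one. As a quick check, the short sketch below recomputes the mean test TSS from the mean test sensitivity and specificity in the table (the tss helper is defined here purely for illustration).

```r
# true skill statistic: sensitivity + specificity - 1
# (ranges from -1 to 1, values near zero are no better than random)
tss <- function(sensitivity, specificity) sensitivity + specificity - 1

# mean test-set values copied from the performance table above
test_tss <- tss(0.9396965, 0.7797722)

# print value, which should agree with the test_tss_mean column
print(test_tss)
```

The gap between the training TSS (equal to one) and the test TSS suggests that the models fit the training data more closely than they generalize, which is not surprising given that only 15 sites have survey data.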

# store the model sensitivities and specificities in the feature_data object
feature_data$model_sensitivity <- xgb_performance$test_sensitivity_mean
feature_data$model_specificity <- xgb_performance$test_specificity_mean

# extract and print the predicted probabilities
# (these are stored in the site_data object below)
xgb_predictions <- xgb_results$predictions
print(xgb_predictions)

## # A tibble: 30 × 1
##          f1
##       <dbl>
##  1 0.565847
##  2 0.410422
##  3 0.410541
##  4 0.452530
##  5 0.565847
##  6 0.410422
##  7 0.565847
##  8 0.565847
##  9 0.452463
## 10 0.565847
## # ℹ 20 more rows

site_data$p1 <- xgb_predictions$f1

# plot site-level estimated occupancy probabilities
site_data %>%
  select(starts_with("p")) %>%
  gather(name, value, -geometry) %>%
  ggplot() +
  geom_sf(aes(color = value)) +
  facet_wrap(~name) +
  scale_color_viridis() +
  labs(title = "Modeled probabilities", x = "X coordinate", y = "Y coordinate")

Expected value given current information

After simulating and modeling the data, we will now examine the expected value of the decision given current information. This value represents the conservation value of a near-optimal prioritization given current information, whilst accounting for uncertainty in the presence (and absence) of the conservation feature in each site. Specifically, “current information” refers to our existing survey data and our occupancy models. Next, we will set a total budget (i.e. total_budget). This total budget represents the total amount of resources available for surveying sites and managing them for conservation. It will be set at 10% of the total site management costs.

# calculate total budget for surveying and managing sites
total_budget <- sum(site_data$management_cost) * 0.1

# print total budget
print(total_budget)

## [1] 7498.9

Given the total budget, we can now calculate the expected value of the decision given current information.

# expected value of the decision given current information
evd_current <- evdci(
  site_data = site_data,
  feature_data = feature_data,
  site_detection_columns = c("f1"),
  site_n_surveys_columns = c("n1"),
  site_probability_columns = c("p1"),
  site_management_cost_column = "management_cost",
  feature_survey_sensitivity_column = "survey_sensitivity",
  feature_survey_specificity_column = "survey_specificity",
  feature_model_sensitivity_column = "model_sensitivity",
  feature_model_specificity_column = "model_specificity",
  feature_target_column = "target",
  total_budget = total_budget)

# print value
print(evd_current)

## [1] 0.9443835

We can potentially improve the expected value of the decision given current information by learning more about which sites are more likely (and less likely) to contain the conservation feature.
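
To give a rough sense of what such a value can represent, the sketch below computes the probability that a target is met for one feature. It rests on simplifying assumptions that are not taken from the package itself: the value is treated as the probability that at least a target number of the managed sites are occupied, sites are treated as independent, and the occupancy probabilities are made up. The prob_target_met function is a hypothetical helper; see Hanson et al. (2023) for the actual formulation.

```r
# probability that at least `target` of the managed sites are occupied,
# assuming independent sites (i.e. a Poisson-binomial tail probability)
# note: `prob_target_met` is a hypothetical helper, not part of surveyvoi
prob_target_met <- function(p, target) {
  # dist[k + 1] stores the probability that exactly k sites are occupied
  dist <- 1
  for (p_i in p) {
    dist <- c(dist * (1 - p_i), 0) + c(0, dist * p_i)
  }
  if (target <= 0) return(1)
  if (target > length(p)) return(0)
  sum(dist[(target + 1):length(dist)])
}

# made-up occupancy probabilities for three managed sites
p <- c(0.9, 0.8, 0.6)
value_three_sites <- prob_target_met(p, target = 2)
print(value_three_sites)

# managing an additional site can only increase this probability
value_four_sites <- prob_target_met(c(p, 0.5), target = 2)
print(value_four_sites)
```

In this simplified view, the value can be increased either by managing more sites or by becoming more certain about which sites are occupied, and surveys contribute to the latter.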

Survey schemes

Now we will generate some candidate survey schemes to see if we can improve the management decision. To achieve this, we will set a budget for surveying additional sites. Specifically, this survey budget (i.e. survey_budget) will be set at 25% of the survey costs for the unsurveyed sites. Note that our total budget must always be greater than or equal to the survey budget.

# calculate budget for surveying sites
#   add column to site_data indicating if the sites already have data or not
site_data$surveyed <- site_data$n1 > 0.5

#   add column to site_data containing the additional survey costs,
#   i.e. sites that already have data have zero cost, and
#   sites that are missing data retain their cost values
site_data <-
  site_data %>%
  mutate(new_survey_cost = if_else(surveyed, 0, survey_cost))

#   calculate total cost of surveying remaining unsurveyed sites
total_cost_of_surveying_remaining_sites <-
  sum(site_data$new_survey_cost)

#   calculate budget for surveying sites
survey_budget <- total_cost_of_surveying_remaining_sites * 0.25

# print budgets
print(survey_budget)

## [1] 30.75

print(total_budget)

## [1] 7498.9
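
Because surveys reduce the funds available for conservation management, it can be helpful to check how the two budgets relate to each other. The following sketch uses the budget values printed above and assumes, for illustration only, that the entire survey budget is spent.

```r
# budget values copied from the output above
survey_budget <- 30.75
total_budget <- 7498.9

# verify that the survey budget does not exceed the total budget
stopifnot(survey_budget <= total_budget)

# calculate funds left for management if the whole survey budget is spent
management_budget <- total_budget - survey_budget
print(management_budget)

# calculate percentage of the total budget allocated to surveys
survey_percentage <- 100 * survey_budget / total_budget
print(round(survey_percentage, 2))
```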

We will generate survey schemes by selecting unsurveyed sites that (i) increase geographic coverage among surveyed sites (Yates 1948), (ii) increase coverage of environmental conditions among surveyed sites (i.e. environmental diversity; Faith & Walker 1996), (iii) increase coverage of sites with highly uncertain information (Guisan et al. 2006), (iv) increase coverage of sites that have low management costs, and (v) increase coverage of sites where species are predicted to occur (Le Lay et al. 2010).
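
The environmental diversity approach depends on distances between sites in environmental space. As a rough illustration of such distances (and not of the algorithm used by the package), the sketch below computes pairwise Euclidean distances with base R, using the environmental values for the first four sites printed earlier. The site names are labels added here for readability.

```r
# environmental values for the first four sites (copied from the output above)
env <- matrix(
  c( 0.334113,  0.502778, -1.00012,
    -1.32716,  -1.19893,   1.06494,
    -1.18385,   0.676827,  0.387451,
     0.691077, -0.926471, -0.244236),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("site", 1:4), c("e1", "e2", "e3")))

# calculate pairwise Euclidean distances between sites
euc <- as.matrix(dist(env, method = "euclidean"))
print(round(euc, 3))

# identify the site that is, on average, farthest from the other sites
most_distinct <- names(which.max(rowMeans(euc)))
print(most_distinct)
```

Sites that are far from the others in environmental space represent conditions that are poorly covered, so surveying them would tend to increase the diversity of sampled conditions.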

# (i) generate survey scheme to increase geographic coverage
geo_scheme <-
  geo_cov_survey_scheme(
    site_data, "new_survey_cost", survey_budget, locked_out = "surveyed")

# (ii) generate survey scheme to increase environmental diversity
# (environmental distances are calculated using Euclidean distances here,
# though we might consider something like Mahalanobis distances for a
# real dataset to account for correlations among environmental variables)
env_scheme <-
  env_div_survey_scheme(
    site_data, "new_survey_cost", survey_budget, c("e1", "e2", "e3"),
    locked_out = "surveyed", method = "euclidean")

# (iii) generate survey scheme using site uncertainty scores
# calculate site uncertainty scores
site_data$uncertainty_score <- relative_site_uncertainty_scores(site_data, "p1")

# generate survey scheme
unc_scheme <-
  weighted_survey_scheme(
    site_data, "new_survey_cost", survey_budget, "uncertainty_score",
    locked_out = "surveyed")

# (iv) generate survey scheme using lowest cost of site management
# (i.e. inverse management cost)
site_data$inv_management_cost <- 1 / site_data$management_cost
cheap_scheme <-
  weighted_survey_scheme(
    site_data, "new_survey_cost", survey_budget, "inv_management_cost",
    locked_out = "surveyed")

# (v) generate survey scheme using site species richness scores
# calculate site species richness scores
site_data$richness_score <- relative_site_richness_scores(site_data, "p1")

# generate survey scheme
rich_scheme <-
  weighted_survey_scheme(
    site_data, "new_survey_cost", survey_budget, "richness_score",
    locked_out = "surveyed")
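
The weight-based schemes above select unsurveyed sites according to a column of weights, subject to the survey budget. To illustrate the general idea, here is a minimal sketch that uses a simple greedy heuristic as a stand-in. This is an assumption made for illustration: the package may well solve this problem differently (e.g. using an optimization solver), and greedy_weighted_scheme is a hypothetical helper with made-up inputs.

```r
# greedy stand-in for a budget-limited, weight-based survey scheme
# note: `greedy_weighted_scheme` is a hypothetical helper, not part of surveyvoi
greedy_weighted_scheme <- function(cost, weight, budget, locked_out) {
  selected <- rep(FALSE, length(cost))
  remaining <- budget
  # visit sites from highest to lowest weight
  for (i in order(weight, decreasing = TRUE)) {
    # select site if it is not locked out and is still affordable
    if (!locked_out[i] && cost[i] <= remaining) {
      selected[i] <- TRUE
      remaining <- remaining - cost[i]
    }
  }
  selected
}

# made-up example with five sites, two of which have already been surveyed
cost <- c(9, 1, 3, 12, 4)
weight <- c(0.9, 0.2, 0.7, 0.8, 0.5)
locked_out <- c(FALSE, TRUE, FALSE, FALSE, TRUE)
scheme <- greedy_weighted_scheme(cost, weight, budget = 13, locked_out)

# print scheme (TRUE = site selected for surveying)
print(scheme)
```

A greedy heuristic is easy to follow but is not guaranteed to find the best combination of sites for a given budget, which is one reason an exact solver is often preferable for real datasets.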

Let’s visualize the different survey schemes.

# add schemes to site_data
site_data$geo_scheme <- c(geo_scheme)
site_data$env_scheme <- c(env_scheme)
site_data$unc_scheme <- c(unc_scheme)
site_data$cheap_scheme <- c(cheap_scheme)
site_data$rich_scheme <- c(rich_scheme)

# plot the schemes
site_data %>%
  select(contains("scheme")) %>%
  gather(name, value, -geometry) %>%
  mutate_if(is.logical, as.character) %>%
  mutate(name = factor(name, levels = unique(name))) %>%
  ggplot() +
    geom_sf(aes(color = value)) +
    facet_wrap(~ name, nrow = 2) +
    scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
  labs(x = "X coordinate", y = "Y coordinate")

We can see that different approaches yield different survey schemes – but how well do they perform?

Expected value of the decision given sample information

Now that we’ve generated the survey schemes, let’s calculate the expected value of the decision given sample information for each survey scheme.

# create table to store results
evd_survey_schemes <-
  tibble(name = c("geo_scheme", "env_scheme", "unc_scheme", "cheap_scheme",
                  "rich_scheme"))

# expected value of the decision given each survey scheme
evd_survey_schemes$value <- sapply(
  evd_survey_schemes$name, function(x) {
    evdsi(
      site_data = site_data,
      feature_data = feature_data,
      site_detection_columns = c("f1"),
      site_n_surveys_columns = c("n1"),
      site_probability_columns = c("p1"),
      site_survey_scheme_column = as.character(x),
      site_management_cost_column = "management_cost",
      site_survey_cost_column = "survey_cost",
      feature_survey_column = "survey",
      feature_survey_sensitivity_column = "survey_sensitivity",
      feature_survey_specificity_column = "survey_specificity",
      feature_model_sensitivity_column = "model_sensitivity",
      feature_model_specificity_column = "model_specificity",
      feature_target_column = "target",
      total_budget = total_budget)
})

# print values
print(evd_survey_schemes)

## # A tibble: 5 × 2
##   name            value
##   <chr>           <dbl>
## 1 geo_scheme   0.981973
## 2 env_scheme   0.986738
## 3 unc_scheme   0.986738
## 4 cheap_scheme 0.993256
## 5 rich_scheme  0.988133

We can also calculate how much the information gained from each of the survey schemes is expected to improve the management decision. This quantity is called the expected value of sample information (EVSI) for each survey scheme.

# estimate expected value of sample information for each survey scheme
evd_survey_schemes$evsi <-
  evd_survey_schemes$value - evd_current

# print values
print(evd_survey_schemes)

## # A tibble: 5 × 3
##   name            value      evsi
##   <chr>           <dbl>     <dbl>
## 1 geo_scheme   0.981973 0.0375896
## 2 env_scheme   0.986738 0.0423546
## 3 unc_scheme   0.986738 0.0423546
## 4 cheap_scheme 0.993256 0.0488727
## 5 rich_scheme  0.988133 0.0437499

# visualize the expected value of sample information for each survey scheme
# color the best survey scheme in blue
evd_survey_schemes %>%
  mutate(name = factor(name, levels = name),
         is_best = evsi == max(evsi)) %>%
  ggplot(aes(x = name, y = evsi)) +
    geom_col(aes(fill = is_best, color = is_best)) +
    xlab("Survey scheme") +
    ylab("Expected value of sample information") +
    scale_color_manual(values = c("TRUE" = "#3366FF", "FALSE" = "black")) +
    scale_fill_manual(values = c("TRUE" = "#3366FF", "FALSE" = "black")) +
    theme(axis.text.x = element_text(angle = 30, vjust = 0.65),
          legend.position = "none")

In this particular simulation, we can see that all of the survey schemes have a low expected value of sample information (i.e. the values are all close to zero). This means that none of these survey schemes would be likely to lead to a substantially better conservation outcome once the funds spent on conducting them are taken into account. If a survey scheme had a negative value, it would be expected to yield a poorer conservation outcome than simply using existing information. We can see that surveying the sites with the lowest management costs is the best strategy – in this particular situation – because it has the highest expected value of sample information, but can we do even better with a different scheme?
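
To make the idea of expected value of sample information more concrete, the following toy example works through a calculation by hand. It is a deliberately simplified sketch and not the calculation performed by the package: a decision maker can manage only one of two sites, can optionally survey one of them first, and survey costs are ignored. All of the numbers are made up.

```r
# toy decision: manage one of two sites, optionally surveying site A first
# (all values are made up, and survey costs are ignored for simplicity)
prior_a <- 0.5  # prior probability that site A is occupied
prior_b <- 0.6  # prior probability that site B is occupied
sens <- 0.9     # survey sensitivity
spec <- 0.8     # survey specificity

# expected value given current information: manage the more promising site
ev_current <- max(prior_a, prior_b)

# probability of each possible survey outcome at site A
p_detect <- sens * prior_a + (1 - spec) * (1 - prior_a)
p_no_detect <- 1 - p_detect

# posterior probability that site A is occupied under each outcome
post_detect <- sens * prior_a / p_detect
post_no_detect <- (1 - sens) * prior_a / p_no_detect

# expected value given sample information: pick the best site per outcome
ev_sample <- p_detect * max(post_detect, prior_b) +
  p_no_detect * max(post_no_detect, prior_b)

# expected value of sample information
evsi <- ev_sample - ev_current
print(evsi)
```

In this toy example the survey is valuable because a detection at site A would change which site is managed. Since survey costs are ignored here, the value cannot be negative; once survey costs are deducted from the funds available for management, as in this tutorial, negative values become possible.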

Optimized survey scheme

Now let’s generate an optimized survey scheme by directly maximizing the expected value of the decision given a survey scheme.

# generate optimized survey scheme(s)
opt_scheme <- approx_near_optimal_survey_scheme(
    site_data = site_data,
    feature_data = feature_data,
    site_detection_columns = c("f1"),
    site_n_surveys_columns = c("n1"),
    site_probability_columns = c("p1"),
    site_management_cost_column = "management_cost",
    site_survey_cost_column = "survey_cost",
    feature_survey_column = "survey",
    feature_survey_sensitivity_column = "survey_sensitivity",
    feature_survey_specificity_column = "survey_specificity",
    feature_model_sensitivity_column = "model_sensitivity",
    feature_model_specificity_column = "model_specificity",
    feature_target_column = "target",
    total_budget = total_budget,
    survey_budget = total_budget,
    n_approx_replicates = 5,
    n_approx_outcomes_per_replicate = 10000,
    verbose = TRUE)

# print number of optimized survey schemes
# if there are multiple optimized survey schemes,
# this means that multiple different survey schemes are likely to deliver
# similar results (even if they select different sites for surveys)
print(nrow(opt_scheme))

## [1] 1

# add first optimized scheme to site data
site_data$opt_scheme <- c(opt_scheme[1, ])

# plot optimized scheme
site_data %>%
  mutate(name = "opt_scheme") %>%
  ggplot() +
    geom_sf(aes(color = opt_scheme)) +
    facet_wrap(~ name, nrow = 1) +
    scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
    labs(x = "X coordinate", y = "Y coordinate")

We can see that the optimized survey scheme (opt_scheme) is different to the previous survey schemes.

# calculate expected value of the decision given the optimized scheme
evd_opt <- evdsi(
    site_data = site_data,
    feature_data = feature_data,
    site_detection_columns = c("f1"),
    site_n_surveys_columns = c("n1"),
    site_probability_columns = c("p1"),
    site_survey_scheme_column = "opt_scheme",
    site_management_cost_column = "management_cost",
    site_survey_cost_column = "survey_cost",
    feature_survey_column = "survey",
    feature_survey_sensitivity_column = "survey_sensitivity",
    feature_survey_specificity_column = "survey_specificity",
    feature_model_sensitivity_column = "model_sensitivity",
    feature_model_specificity_column = "model_specificity",
    feature_target_column = "target",
    total_budget = total_budget)

# print value
print(evd_opt)

## [1] 0.9934998

# append optimized results to results table
evd_survey_schemes <- rbind(
  evd_survey_schemes,
  tibble(name = "opt_scheme", value = evd_opt, evsi = evd_opt - evd_current))

# print updated results table
print(evd_survey_schemes)

## # A tibble: 6 × 3
##   name            value      evsi
##   <chr>           <dbl>     <dbl>
## 1 geo_scheme   0.981973 0.0375896
## 2 env_scheme   0.986738 0.0423546
## 3 unc_scheme   0.986738 0.0423546
## 4 cheap_scheme 0.993256 0.0488727
## 5 rich_scheme  0.988133 0.0437499
## 6 opt_scheme   0.993500 0.0491163

# visualize expected value of sample information
# color the best survey scheme in blue
evd_survey_schemes %>%
  mutate(name = factor(name, levels = name),
         is_best = evsi == max(evsi)) %>%
  ggplot(aes(x = name, y = evsi)) +
    geom_col(aes(fill = is_best, color = is_best)) +
    xlab("Survey scheme") +
    ylab("Expected value of sample information") +
    scale_color_manual(values = c("TRUE" = "#3366FF", "FALSE" = "black")) +
    scale_fill_manual(values = c("TRUE" = "#3366FF", "FALSE" = "black")) +
    theme(axis.text.x = element_text(angle = 30, vjust = 0.65),
          legend.position = "none")

We can see that the optimized survey scheme has the highest expected value of sample information of all the candidate survey schemes. To better understand how sub-optimal the candidate survey schemes are, let’s compute their relative performance and visualize them.

# express values in terms of relative performance
evd_survey_schemes$relative_performance <-
  ((max(evd_survey_schemes$evsi) - evd_survey_schemes$evsi) /
   evd_survey_schemes$evsi) * 100

# visualize relative performance
# zero = same performance as optimized scheme,
# higher values indicate greater sub-optimality
evd_survey_schemes %>%
  mutate(name = factor(name, levels = name),
         relative_performance = abs(relative_performance),
         is_best = relative_performance == min(relative_performance)) %>%
  ggplot(aes(x = name, y = relative_performance)) +
    geom_point(aes(fill = is_best, color = is_best)) +
    xlab("Survey scheme") +
    ylab("Performance gap (%)") +
    scale_color_manual(values = c("TRUE" = "#3366FF", "FALSE" = "black")) +
    scale_fill_manual(values = c("TRUE" = "#3366FF", "FALSE" = "black")) +
    theme(axis.text.x = element_text(angle = 30, vjust = 0.65),
          legend.position = "none")
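
As a worked example of the performance gap calculation, the sketch below recomputes it from the expected value of sample information values printed in the results table. For comparison, it also shows an alternative normalization that divides by the best value instead. This alternative is an assumption added here for illustration and is not used elsewhere in this tutorial.

```r
# expected value of sample information (copied from the results table above)
evsi <- c(geo_scheme = 0.0375896, env_scheme = 0.0423546,
          unc_scheme = 0.0423546, cheap_scheme = 0.0488727,
          rich_scheme = 0.0437499, opt_scheme = 0.0491163)

# performance gap as calculated in this tutorial
# (shortfall expressed relative to the value of each scheme)
gap_own <- (max(evsi) - evsi) / evsi * 100

# alternative normalization (an assumption, not used in this tutorial)
# (shortfall expressed relative to the value of the best scheme)
gap_best <- (max(evsi) - evsi) / max(evsi) * 100

# print both versions, rounded to two decimal places
print(round(cbind(gap_own, gap_best), 2))
```

Both versions equal zero for the best scheme and rank the schemes in the same order. The second version cannot exceed 100% as long as the values are non-negative, which can make it easier to interpret when some schemes have values very close to zero.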

We can see that the optimized survey scheme performs better than the other survey schemes. Although the optimized survey scheme doesn’t provide a substantial improvement in this particular situation, we can see how value of information analysis can potentially improve management decisions by strategically allocating funds to surveys and conservation management. Indeed, since we only considered a single species and a handful of sites – to keep the tutorial simple and reduce computational burden – it was unlikely that an optimized survey scheme would perform substantially better than simply using current information. If you want to try something more complex, try adapting the code in this tutorial to simulate a larger number of sites and multiple species.

Conclusion

Hopefully, this tutorial has been useful. If you have any questions about using the surveyvoi R package or suggestions for improving it, please file an issue on the package’s online coding repository (https://github.com/prioritizr/surveyvoi/issues). For information on the mathematical details that underpin the package, please see Hanson et al. (2023).

References

Faith, D.P. & Walker, P. (1996). Environmental diversity: On the best-possible use of surrogate data for assessing the relative biodiversity of sets of areas. Biodiversity & Conservation, 5, 399–415.

Guisan, A., Broennimann, O., Engler, R., Vust, M., Yoccoz, N.G., Lehmann, A. & Zimmermann, N.E. (2006). Using niche-based models to improve the sampling of rare species. Conservation Biology, 20, 501–511.

Hanson, J.O., McCune, J.L., Chadès, I., Proctor, C.A., Hudgins, E.J. & Bennett, J.R. (2023). Optimizing ecological surveys for conservation. Journal of Applied Ecology, 60, 41–51.

Le Lay, G., Engler, R., Franc, E. & Guisan, A. (2010). Prospective sampling based on model ensembles improves the detection of rare species. Ecography, 33, 1015–1027.

Yates, F. (1948). Systematic sampling. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 241, 345–377.

Jeffrey O. Hanson

2025-04-14