Species distribution modelling is now widely used for conservation and biogeographical purposes. Three main sources of inaccuracy affect the performance of these models: the predictive capacity of the selected explanatory variables, the modelling technique used to estimate the parameters of the function, and the quality of the dependent variable.
As some authors state, explanatory variables work best when they have a causal relationship with the distribution of the species, since they avoid overdispersion and so allow reliable interpolations and even extrapolations. However, causally related variables are generally unknown, so we frequently need to work with variables that, in the best scenario, are correlated with the truly causal factors (nature is spatially correlated). Discriminating among these variables requires physiological and biological experimental studies that causally link variation in the distribution of the species with diverse environmental factors. Unfortunately, even if we could identify these influential variables, other contingent or unique factors that are difficult to include in the models (historical factors, biotic interactions or dispersal restrictions) can always bias our results.
There are different modelling techniques, which vary not in their capacity to identify causal variables but in the complexity of the relationships between dependent and independent variables that they are able to include. Studies comparing the performance of several modelling techniques consistently recommend those that establish more complex relationships between dependent and independent variables. However, the performance of the models obtained by these complex methods can be influenced both by the quality of the dependent variable and by the explanatory efficiency of the independent variables. A modelling method may be more likely to find significant relationships even though the explanatory variables used are not causally related to the species distributions or the data contain a remarkable level of false absences.
The quality of the dependent variable depends on whether the data used are well distributed across the spatial and environmental gradients, on the number of observations and, mainly, on the level of false absences in the data. Good models need good data, and mathematics cannot replace bad data. The research carried out by me and some collaborators seeks to identify the sources of model error derived from the use of biased biological data, inappropriate validation measures and bad practices in converting the derived “probabilities” into presence-absence data. Thus, we have shown:
- That good absence data can be partially derived from databases. Because data are generally compiled from heterogeneous sources (e.g., herbarium sheets, surveys using different methodologies), the number of database records, when compiled exhaustively, can be used as a surrogate of sampling effort to measure completeness.
- That derived probability or suitability values are highly dependent on the relative proportion of each event in the sample, so an appropriate cut-off must be applied to convert the continuous scores into presence-absence ones. The hypotheses and conclusions derived from earlier models in which inappropriate thresholds were applied should be revised.
- That the area under the ROC curve (AUC), currently considered to be the standard method to assess the accuracy of predictive distribution models, is a misleading measure for five reasons: i) it ignores the predicted probability values and the goodness of fit of the model, ii) it summarizes the test performance over regions of the ROC space in which one would rarely operate, iii) it does not give information about the spatial distribution of model errors, iv) it weighs omission and commission errors equally, and most importantly, v) the total extent over which models are carried out highly influences the rate of well-predicted absences and the AUC scores.
- That different distribution maps can be obtained for the same species if the localities where it is present are mapped at different times, because distributional information grows over time. The date of capture of specimens can be explained by the environmental and spatial variables associated with the collection sites. These biases could affect the weighting of the environmental factors that influence species distributions, as well as the accuracy of predictive distribution models.
- That it is essential to distinguish true absences from false absences in order to build reliable distribution models. Regardless of the potential accuracy of modelling techniques, the reliability of model output depends heavily on the biological data used. Models built with environmentally structured false absences can explain the variability in the training data acceptably, but their explanation of the true species distribution is inaccurate and spatially biased. Environmentally structured false absences (which are used most frequently) reduce the reliability of model predictions.
- That the inclusion of reliable absence data significantly improves model predictions, especially for smaller territories with a less variable environment. By contrast, absences randomly distributed throughout a larger territory lead to better predictions where environmental variability is greater, because the possibility of including false absences is reduced.
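Reason v) in the AUC point above can be illustrated with a toy calculation. The sketch below is not taken from our papers: the Gaussian `suitability` response, the single environmental variable and all the values are made-up assumptions chosen only to show the effect. The predictions for the presences and for the absences inside the occupied region stay exactly the same; the only change is that the study extent is enlarged to include absences from clearly unsuitable areas.

```python
import math


def suitability(x, optimum=0.0, width=1.0):
    """Hypothetical Gaussian response of a species to one environmental variable."""
    return math.exp(-((x - optimum) ** 2) / (2.0 * width ** 2))


def auc(presence_scores, absence_scores):
    """AUC as the share of presence-absence pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presence_scores) * len(absence_scores))


# Made-up environmental values at the sampled sites (not real data).
presences = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
# Absences inside the region the species occupies; some lie in suitable conditions.
absences_core = [-2.0, -1.2, -0.8, 0.3, 0.9, 1.4, 2.0]
# Extra absences gained by enlarging the extent into clearly unsuitable areas.
absences_far = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

presence_scores = [suitability(x) for x in presences]
core_scores = [suitability(x) for x in absences_core]
wide_scores = core_scores + [suitability(x) for x in absences_far]

auc_core = auc(presence_scores, core_scores)
auc_wide = auc(presence_scores, wide_scores)

print(f"AUC, core extent: {auc_core:.3f}")
print(f"AUC, wide extent: {auc_wide:.3f}")
```

With these invented values the AUC rises from about 0.63 to about 0.83, although nothing about the predictions within the occupied region has improved. The gain comes only from the additional, trivially well-predicted absences.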
We have also shown that the method applied to select pseudo-absences greatly influences accuracy measures, as well as the estimated area of distribution. When pseudo-absences are extracted from environmental regions farther from the optimum established by the presence data, the resulting models obtain better accuracy scores and overprediction increases. When variables other than environmental ones influence the distribution of the species (i.e., a non-equilibrium state) and precise information on absences does not exist, the random selection of pseudo-absences generates the most constrained predictive distribution map, and absences can even be located within environmentally suitable areas.

Some papers dealing with these topics