Tree Species Suitability

The potential distribution of forest tree species is modelled using the Random Forest (RF) algorithm. Maps produced with both the RF regression and the RF classification algorithms are available for download on this website. Presence/absence records of the most dominant European tree species are used to fit an RF classification, which predicts the presence/absence of each species, and an RF regression, which predicts continuous habitat suitability values ranging from absence to presence of that species (see Casalegno et al., 2010).

Vegetation distribution modelling helps us to understand forest ecosystem functioning and allows us to map vegetation suitability (or potential vegetation distribution) under current climate and to project future scenarios of vegetation change. Habitat suitability maps, or potential distribution maps, are based on empirical statistical models that are widely applied in assessing climate change impacts on forests, in conservation and in sustainable forest management. These empirical models predict the potential distribution of biota as a function of environmental factors. Suitability maps derived using regression models represent the degree of affinity between biota and environment as a continuous field, while classification models represent the category that best fits the corresponding environment as a binary choice: presence/absence.

1. Rationale
In this part we describe the modelling techniques and assumptions. The main working hypothesis is that vegetation distribution is strongly determined by environmental factors and that tree species response is constrained by environmental gradients. Further details on model assumptions and the theoretical background of suitability modelling can be found in Casalegno et al. (2010), Guisan and Thuiller (2005) and Austin (2007).

Preliminary and exploratory data analysis also served to define the modelling technique. The RF algorithm was selected because of its proven efficiency in tree species distribution modelling (Benito Garzón et al., 2006; Prasad et al., 2006). RF is based on recursive partitioning of the input data into binary splits, which creates a dendrogram. At each node, a selection rule is created based on the best predictor variable. RF creates multiple bootstrapped classification trees, each with a randomized subset of predictors. For each tree of the ensemble it automatically creates an in-bag/out-of-bag dataset for model calibration and external validation. This provides a reliable estimate of error using data that are randomly withheld from each iteration of tree development, making an independent external validation dataset unnecessary (Breiman, 2001; Lawrence et al., 2006). For classification, the ensemble model output is determined as the mode of the bootstrapped models; for regression, the outputs of the individual trees are averaged.

RF does not require prior pruning of explanatory variables because factors that are uncorrelated with the response variable are effectively excluded. RF is therefore considered a powerful tool that can deal with complex data distributions. In practice, for a large geographical area such as the European extent of our application, RF does not depend on the assumption, made in traditional regression techniques, of spatial stationarity in the relationship between variables. RF also handles missing values, interactions and skewed distributions of predictor variables, and avoids the need for pre-modelling multicollinearity tests on the input data. These features provide a highly flexible modelling framework that can be used to model habitat suitability based on automatic variable selection/importance-score iterations.
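The bootstrap and ensemble steps described above can be illustrated with a minimal sketch. It is not the implementation used for the maps: the function names, the seed and the toy votes are assumptions made for illustration, and tree fitting itself is left out. The sketch only shows how in-bag and out-of-bag records arise from sampling with replacement, and how per-tree outputs are combined (mode for classification, mean for regression).

```python
import random
from collections import Counter


def bootstrap_split(n_records, rng):
    """Draw n_records indices with replacement; indices never drawn are out-of-bag."""
    in_bag = [rng.randrange(n_records) for _ in range(n_records)]
    out_of_bag = sorted(set(range(n_records)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(votes):
    """Return the most frequent class label (the mode) among the tree votes."""
    return Counter(votes).most_common(1)[0][0]


def mean_prediction(values):
    """Average the per-tree outputs, as is usual for a regression ensemble."""
    return sum(values) / len(values)


rng = random.Random(42)  # fixed seed so the illustration is repeatable
in_bag, out_of_bag = bootstrap_split(1000, rng)

# With sampling with replacement, roughly 1/e (about 37%) of records stay out-of-bag.
oob_fraction = len(out_of_bag) / 1000

# Hypothetical outputs of a few trees for one grid cell.
ensemble_class = majority_vote([1, 0, 1, 1, 0])   # presence wins three votes to two
ensemble_value = mean_prediction([0.2, 0.4, 0.9])  # continuous suitability
```

Each tree is fitted on its own in-bag records and evaluated on its own out-of-bag records, which is what allows the error to be estimated without a separate dataset.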

2. Data Preparation
Habitat suitability modelling requires an input response/predictor table for fitting the model and a set of surface maps from which the model outputs are predicted and mapped.

• Bioclimatic and environmental data
We used 47 environmental predictor surface maps with a grid size of 1 km. The predictors included: two soil variables selected from the European Soil Database, namely soil parent material (eight classes) and FAO soil type classification (25 classes); six geomorphological factors derived from the SRTM digital elevation model, namely minimum, maximum and mean altitude, altitude standard deviation, slope and dominant orientation (north, south, east, west); and 39 bioclimatic factors and indices computed from minimum and maximum monthly averaged temperatures and monthly precipitation from the WORLDCLIM database.

The bioclimatic predictors are listed in Table 1, and include: Holdridge annual biotemperature (Holdridge, 1947), thermicity index, ombrothermic index, compensated summer ombrothermic index (Rivas-Martínez, 1995), de Martonne adjusted aridity index (de Martonne, 1941; Grieser et al., 2006), Emberger pluviometric quotient (Dufour-Dror and Ertas, 2004), Kira’s warmth and coldness indexes (Kira, 1991), Ellenberg warmth/precipitation quotient (Ellenberg, 1986), Mitrakos annual cold and drought stress indexes (Mitrakos, 1980), accumulated annual potential evapotranspiration (Thornthwaite, 1948), Box moisture index of precipitation/evapotranspiration (Box, 1981), continentality index (annual range of mean temperatures) and isothermality (mean monthly temperature range/annual temperature range). Potential evapotranspiration was computed using the GRASS r.sun model algorithm with the 1 km averaged SRTM data.
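To make the nature of these indices concrete, the sketch below computes a few of them from twelve monthly mean temperatures and an annual precipitation total. The formulas are the commonly cited textbook forms and are an assumption here: the exact variants used for the maps (for example the adjusted de Martonne index) may differ, and the input values are invented for illustration.

```python
def de_martonne_aridity(annual_precip_mm, mean_annual_temp_c):
    """Classic (unadjusted) de Martonne index: P / (T + 10)."""
    return annual_precip_mm / (mean_annual_temp_c + 10.0)


def kira_warmth_index(monthly_temp_c, threshold=5.0):
    """Sum of (T - 5) over the months warmer than 5 degrees C."""
    return sum(t - threshold for t in monthly_temp_c if t > threshold)


def kira_coldness_index(monthly_temp_c, threshold=5.0):
    """Negative sum of (5 - T) over the months colder than 5 degrees C."""
    return -sum(threshold - t for t in monthly_temp_c if t < threshold)


def ellenberg_quotient(monthly_temp_c, annual_precip_mm):
    """1000 * mean temperature of the warmest month / annual precipitation."""
    return 1000.0 * max(monthly_temp_c) / annual_precip_mm


def continentality(monthly_temp_c):
    """Annual range of monthly mean temperatures."""
    return max(monthly_temp_c) - min(monthly_temp_c)


# Invented values for a single 1 km grid cell (January to December).
monthly_temp = [-2.0, 0.0, 4.0, 8.0, 13.0, 16.0, 18.0, 17.0, 13.0, 8.0, 3.0, -1.0]
annual_precip = 800.0
mean_temp = sum(monthly_temp) / len(monthly_temp)

aridity = de_martonne_aridity(annual_precip, mean_temp)
warmth = kira_warmth_index(monthly_temp)                 # 58.0
coldness = kira_coldness_index(monthly_temp)             # -21.0
ellenberg = ellenberg_quotient(monthly_temp, annual_precip)  # 22.5
temp_range = continentality(monthly_temp)                # 20.0
```

Applied cell by cell to the monthly climate grids, functions of this kind yield one predictor surface per index.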

Table 1 - Bioclimatic predictor variables.

• Field data
Response variables are empirical data collected in the field and stored in the Forest Focus database. The Level I and Level II Forest Focus databases (released 2004) are merged. For each field plot location, the presence/absence of 32 species (Table 2) is extracted and the values of the environmental factors at that location are assigned. The selected species are the most dominant found in Europe according to the Forest Focus database. Each of these 32 species is found in more than 50 sampling sites in Europe.
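The assignment of environmental values to plot locations can be sketched as a simple grid lookup. Everything named below is hypothetical: the coordinate convention (kilometres, north-up grid with a known upper-left corner), the predictor names and the values are assumptions chosen to keep the example self-contained, not the layout of the Forest Focus data.

```python
def cell_index(x_km, y_km, x_min_km, y_max_km, cell_km=1.0):
    """Map a plot coordinate to (row, col) in a north-up grid of square cells."""
    col = int((x_km - x_min_km) // cell_km)
    row = int((y_max_km - y_km) // cell_km)
    return row, col


def build_table(plots, predictor_grids, x_min_km, y_max_km):
    """Return one record per plot: the response plus every predictor value at that cell."""
    table = []
    for plot in plots:
        row, col = cell_index(plot["x_km"], plot["y_km"], x_min_km, y_max_km)
        record = {"presence": plot["presence"]}
        for name, grid in predictor_grids.items():
            record[name] = grid[row][col]
        table.append(record)
    return table


# Two toy predictor surfaces on a 2 x 3 grid of 1 km cells.
predictor_grids = {
    "mean_temp": [[5.1, 5.4, 6.0], [6.2, 6.8, 7.1]],
    "slope": [[12, 8, 3], [9, 4, 1]],
}

# Two toy field plots with presence (1) or absence (0) of one species.
plots = [
    {"x_km": 0.5, "y_km": 1.5, "presence": 1},
    {"x_km": 2.2, "y_km": 0.3, "presence": 0},
]

table = build_table(plots, predictor_grids, x_min_km=0.0, y_max_km=2.0)
```

The resulting list of records plays the role of the response/predictor table: one row per plot, one column per predictor, plus the presence/absence response.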

Table 2 - List of European tree species modelled and number of observations where each species occurs within the set of more than 6000 field records. Model performance is given according to the kappa statistic of the RF classification model.

3. Model fitting and validation
RF must be tuned with respect to two parameters: the number of variables randomly sampled at each split, and the number of regression or classification trees within the ensemble. For each species, several parameterization tests were run to select these two parameters.
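A parameterization test of this kind amounts to a small grid search over the two parameters. The sketch below is generic and assumes a caller-supplied scoring function; `toy_score` is a made-up stand-in with no relation to real model results, and the names `mtry` and `ntree` are labels chosen for illustration. In practice the score would come from fitting an RF model with the given parameters and evaluating it.

```python
from itertools import product


def grid_search(score_fn, mtry_values, ntree_values):
    """Evaluate every (mtry, ntree) pair and return (best_score, mtry, ntree)."""
    best = None
    for mtry, ntree in product(mtry_values, ntree_values):
        score = score_fn(mtry, ntree)
        if best is None or score > best[0]:
            best = (score, mtry, ntree)
    return best


def toy_score(mtry, ntree):
    """Placeholder score, invented for this example only."""
    return 0.5 - abs(mtry - 19) / 100 + ntree / 100000


# Candidate values: variables sampled per split and trees in the ensemble.
best = grid_search(toy_score, mtry_values=[10, 19, 38], ntree_values=[500, 1500])
best_score, best_mtry, best_ntree = best
```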

Due to the large dataset, input response/predictor tables were randomly split into two subsets (80% in-bag for model calibration, 20% out-of-bag for external validation). This allows the kappa statistic (Cohen, 1968) to be computed and model reliability to be assessed. In Table 2, model performance is classified from poor to very good. This evaluation is based on the agreement between observed and predicted values according to the weighted kappa, following the interpretative rules for kappa values of Altman (1991):

• Poor: K < 0.20
• Fair: 0.20 < K < 0.40
• Moderate: 0.40 < K < 0.60
• Good: 0.60 < K < 0.80
• Very good: 0.80 < K < 1.00
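The validation step can be sketched as follows, under stated assumptions. The split fraction follows the text; the seed, the function names and the example vectors are invented. The kappa shown is the unweighted form, which for a two-category presence/absence response coincides with the weighted kappa. The list above leaves exact boundary values (for example K = 0.20) unassigned, so the sketch arbitrarily places them in the higher class.

```python
import random


def split_records(records, calibration_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into calibration and validation sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * calibration_fraction))
    return shuffled[:cut], shuffled[cut:]


def cohen_kappa(observed, predicted):
    """Cohen's kappa: agreement between two label vectors corrected for chance."""
    n = len(observed)
    labels = set(observed) | set(predicted)
    p_observed = sum(o == p for o, p in zip(observed, predicted)) / n
    p_expected = sum(
        (observed.count(label) / n) * (predicted.count(label) / n)
        for label in labels
    )
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one class occurs")
    return (p_observed - p_expected) / (1.0 - p_expected)


def altman_label(kappa):
    """Map kappa to a performance class; boundary values go to the higher class (assumption)."""
    if kappa < 0.20:
        return "Poor"
    if kappa < 0.40:
        return "Fair"
    if kappa < 0.60:
        return "Moderate"
    if kappa < 0.80:
        return "Good"
    return "Very good"


calibration, validation = split_records(list(range(100)))

# Invented observed and predicted presence/absence for ten validation plots.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(observed, predicted)  # (0.8 - 0.52) / 0.48
label = altman_label(kappa)               # "Moderate"
```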

4. Mapping
Using the rules from the model outputs (RF classification and RF regression) together with the environmental maps, we predicted the current distribution of habitat suitability for the dominant tree species. Processing is carried out on tiles of 10 x 4000 km, which are then merged into the final maps.
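Tile-based processing of this kind can be sketched as predicting strip by strip and concatenating the results. The sketch assumes that on a 1 km grid a 10 km tile dimension corresponds to 10 grid rows, which is a guess about the tile orientation; `predict_cell` is a hypothetical stand-in for applying a fitted model to the predictor values of one cell.

```python
def tile_row_ranges(n_rows, tile_rows=10):
    """Return (start, stop) row ranges that cover the grid in strips of tile_rows."""
    return [
        (start, min(start + tile_rows, n_rows))
        for start in range(0, n_rows, tile_rows)
    ]


def predict_in_tiles(predictor_rows, predict_cell, tile_rows=10):
    """Predict one strip at a time and merge the strips back into a full map."""
    merged = []
    for start, stop in tile_row_ranges(len(predictor_rows), tile_rows):
        tile = predictor_rows[start:stop]
        merged.extend([[predict_cell(cell) for cell in row] for row in tile])
    return merged


# Toy grid of 25 rows x 4 columns; each cell holds a single predictor value.
grid = [[row_number] * 4 for row_number in range(25)]

# Hypothetical per-cell model: doubling stands in for a real prediction.
suitability_map = predict_in_tiles(grid, predict_cell=lambda value: 2 * value)
ranges = tile_row_ranges(25)
```

Processing in strips keeps memory use bounded by the tile size rather than by the full European extent, while the merged result is identical to predicting the whole grid at once.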

The RF classification models give conservative results. They highlight the optimal area for a species to grow and give a simplified view of the field reality. RF regression models, on the other hand, offer more flexibility because they allow different levels of suitability to be selected. Areas that are not suitable for a particular species according to RF classification can be suitable according to RF regression tuned with a specific threshold. This allows, for instance, the study of boundary zones between optimal suitability and no suitability. Providing users with a continuous probability surface may therefore be the most versatile option, allowing the threshold choice to be matched with available maps (Freeman and Moisen, 2008) and/or the users’ area of interest.
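The effect of the threshold choice on a continuous suitability surface can be illustrated with a short sketch. The suitability values and both thresholds are invented, and the function names are assumptions; the point is only that lowering the threshold enlarges the area mapped as suitable.

```python
def to_presence_absence(suitability_rows, threshold):
    """Convert a continuous suitability grid to 1 (suitable) or 0 (not suitable)."""
    return [
        [1 if value >= threshold else 0 for value in row]
        for row in suitability_rows
    ]


def predicted_prevalence(binary_rows):
    """Fraction of grid cells mapped as suitable."""
    cells = [value for row in binary_rows for value in row]
    return sum(cells) / len(cells)


# Invented RF regression output for a 2 x 3 block of grid cells.
suitability = [
    [0.05, 0.30, 0.55],
    [0.45, 0.70, 0.95],
]

strict_map = to_presence_absence(suitability, threshold=0.5)
lenient_map = to_presence_absence(suitability, threshold=0.25)

strict_prevalence = predicted_prevalence(strict_map)    # 3 of 6 cells
lenient_prevalence = predicted_prevalence(lenient_map)  # 5 of 6 cells
```

The cells that switch between the two maps are the boundary zones mentioned above.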

As a result of the parametrisation process, the number of trees in the ensemble was increased from the default value of 500 to 1500 in all models. The variable tuning parameter (the number of variables randomly sampled) is 10, 19 or 38, depending on the species, in the RF classification. Considering the validation scores over the 30 species modelled, four species have a model with poor performance, three have fair performance and the other 23 have acceptable model performance: 11 species show moderate, nine good and three very good performance.

Maps available here.

• Altman, D.G. (1991): Practical Statistics for Medical Research. Chapman and Hall, London.
• Austin, M. (2007): Species distribution models and ecological theory: A critical assessment and some possible new approaches. Ecological Modelling, 200, 1–19.
• Benito Garzón, M., Blazek, R., Neteler, M., Sanchez de Rios, R., Sainz Ollero, H., Furlanello, C. (2006): Predicting habitat suitability with machine learning models: The potential area of Pinus sylvestris L. in the Iberian Peninsula. Ecological Modelling, 197, 383–393.
• Box, E.O. (1981): Macroclimate and plant form: An introduction to predictive modeling in phytogeography. Junk, The Hague.
• Breiman, L. (2001): Random Forests. Machine Learning, 45, 5–32.
• Casalegno, S., Amatulli, G., Camia, A., Nelson, A., Pekkarinen, A. (2010): Vulnerability of Pinus cembra L. in the Alps and the Carpathian mountains under present and future climates. Forest Ecology and Management, 259, 750–761.
• Cohen, J. (1968): Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
• De Martonne, E. (1941): Nouvelle carte mondiale de l’indice d’aridité. Météorol, 3–26.
• Dufour-Dror, J.M., Ertas, A. (2004): Bioclimatic perspectives in the distribution of Quercus ithaburensis Decne. subspecies in Turkey and in the Levant. Journal of Biogeography, 31, 461–474.
• Ellenberg, H. (1986): Vegetation Mitteleuropas mit den Alpen. 4th Edition, Fischer, Stuttgart.
• Freeman, E.A., Moisen, G. (2008): A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecological Modelling, 217, 48–58.
• Grieser, J., Gommes, R., Cofield, S., Bernardi, M. (2006): Data sources for FAO worldmaps of Koeppen climatologies and climatic net primary production. FAO, The Agromet Group, SDRN.
• Guisan, A., Thuiller, W. (2005): Predicting species distribution: offering more than simple habitat models. Ecology Letters, 8, 993–1009.
• Holdridge, L.R. (1947): Determination of world plant formations from simple climatic data. Science, 105, 367–368.
• Kira, T. (1991): Forest ecosystems of East Asia and Southeast Asia in a global perspective. Ecological Research, 6, 185–200.
• Lawrence, R.L., Wood, S.D., Sheley, R.L. (2006): Mapping invasive plants using hyperspectral imagery and Breiman Cutler classifications (RandomForest). Remote Sensing of Environment, 100, 356–362.
• Mitrakos, K. (1980): A theory for Mediterranean plant life. Acta Oecologica, 1, 245–252.
• Prasad, A.M., Iverson, L.R., Liaw, A. (2006): Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems, 9, 181–199.
• Rivas-Martínez, S. (1995): Clasificación bioclimática de la Tierra. Folia Botanica Matritensis, 16, 1–25.
• Thornthwaite, C.W. (1948): An approach toward a rational classification of climate. Geographical Review, 38, 55–94.


