RESULTS
We generated three gSAD distributions based on summing individual observations of species across all ecological plots, by summing all observations across all other botanical observation records, and by summing the non-plot observation records found within 100 km distance from each ecological plot. Our analyses reveal that a large fraction of the plant species on Earth are rare (Figs. 2 and 3). Analyzing the distribution of the number of observations per species reveals that the global-scale distribution is highly skewed and lacking a central tendency (i.e., the mode of the gSAD is at N = 1; Fig. 2). The total number of land plant species is ~435,000 (the number of species before geovalidation based on 66,334,188 observation records), a large fraction of these species, 36.5% or 158,535, are rare, with just five observations or fewer, while 28.3% or 123,149 have just three observations or fewer. The large number of rare species is consistent with past claims that when biodiversity observations are compiled at increasingly larger spatial (15) and temporal scales (24), rare species should comprise an increasing majority of species.
Global species abundance distribution
We tested several long-standing hypotheses concerning the processes creating and maintaining large-scale patterns of commonness and rarity. Specifically, we assessed whether the number of observed rare species follows predictions from biodiversity theory by comparing several proposed statistical distributions for the gSAD. First, we assessed two contrasting sets of predictions for the distribution of commonness and rarity of species (Fig. 2). Specifically, at increasingly larger geographic scales, both the unified neutral theory of biogeography (UNTB) (25) and the k-niche model (26) predict that the gSAD will converge on Fisher’s log-series distribution (27)(1)where is the expected number of species, n is the total number of observations per species, α is the diversity parameter, and x is a nuisance parameter that is defined by α and the total number of individuals sampled, N, x = N/(N − α). The UNTB further makes two predictions: (i) At increasingly large spatial scales (such as continental and global scales), the Fisher’s log-series distribution will also increasingly converge to approximate a “power law” (or a Pareto distribution) over most of the range of the distribution (see Fig. 2A) (28), where(2)where (ii) the value of β, the scaling exponent or slope on a log-log plot, will equal −1.0. For the continuous Pareto or power law distribution, n0 is the minimum scale of the distribution, and β is the scaling exponent (29). For the BIEN data, the minimum number of observations for a species is 1, so it was set at 1.
The UNTB predicts that the gSAD (called the regional pool in neutral theory) will follow a log-series distribution. Pueyo (28) notes that the log-series distribution consists of two parts multiplied together: a Pareto distribution with exponent β = −1 that is the result of neutral dynamics and an exponential “bend” that takes effect at very high abundances due to the finite size assumption. Pueyo (28) also suggests a generalization of the Pareto and log series that incorporates a Pareto where the exponent β is allowed to vary combined with an exponential finite size term, which we call here “Pareto with exponential finite adjustment.” Thus, testing whether the gSAD is best fit by a log series (where β = −1), a Pareto distribution (where β is allowed to vary), or a Pareto with exponential finite adjustment (where β is also allowed to vary) provides a test of neutral dynamics. In sum, both the UNTB and k-niche model predict that the log-series distribution will best fit gSADs, but at large geographic scales, this distribution will also converge to a Pareto distribution. Thus, fitting the Pareto or the Pareto with exponential finite adjustment provides a simultaneous test of whether neutral or niche dynamics are consistent with the data (28). A poor fit or a value of β ≠ 1.0 rejects neutral theory. A poor fit of the Pareto regardless of the value of β further rejects the k-niche model (28). In addition, the value of β is then a useful ecological and evolutionary indicator of whether Earth has more rare species (β < −1; the slope of the function is steeper) or fewer (β > −1; the slope of the function is flatter) rare species than expected under zero-sum neutral evolutionary dynamics (28, 30).
In contrast to the predictions from the UNTB and k-niche model, the central limit theorem (CLT) predicts that gSADs will be characterized by a lognormal distribution. If the abundance of a species is the result of several multiplicative processes acting together (31), then lognormal distributions are expected. Because of the CLT, a lognormal distribution is expected any time many variables interact multiplicatively to influence abundance, such as many differing biotic and abiotic factors [see references in (32)]. Common processes in ecology and evolution are known to interact multiplicatively to influence species abundance (see Supplementary Document) (32). One context in which random variables are multiplied (yielding a lognormal) is consecutive annual population growth rates, although the applicability of this across species (i.e., to generate SADs) is controversial (33), relying on subtle philosophical interpretations of exchangeability. Some authors such as May (34) and MacArthur (35) say it can, while others such as Pielou (36) (see page 48) say it cannot produce a lognormal. This debate, however, is a red herring because many other biological processes in ecology and evolution also interact multiplicatively and can influence variation in interspecific abundance. For example processes that lead to niche partitioning, stochastic density-dependent differential equation models (37), models of rates of fixation of favorable alleles (35), or hurdle models (15) can generate lognormal SADs. Note that in the case of discrete abundances sampled from a continuous lognormal, we have a Poisson lognormal (38).
Next, we fit several additional models and statistical distributions that have been proposed to describe the distribution of commonness and rarity [see (39, 40) and the Supplementary Materials]. Using maximum likelihood estimations (MLEs), we fit each distribution to three ways to assess empirical gSAD: (i) for all of the species observation records within the BIEN database, (ii) for all species recorded only from ecological plots, and (iii) for all specimens found within 100 km around each ecological plot. Comparing the goodness of fit of various models for each of these gSADs allows us to compare potential sampling biases in botanical data.
The best model varied with which measure of goodness of fit was used, as well as with the dataset used (Tables 1 and 2 and tables S1 and S2). However, in general, the truncated Pareto (i.e., a modified Pareto distribution that adds an additional parameter to allow the right tail to drop down because of finite sample size (28)] and the Poisson lognormal (41) both fit well. These models have strong skew on a log scale, indicative of many rare species. All three models (at the estimated parameter values) show the mode at species with one individual. The log series, while also showing a mode at one individual in a log plot, markedly underestimates the number of extremely rare species, and the remaining models fit the distribution even more poorly and have an interior mode, incorrectly predicting that the most common abundances will be intermediate.
The UNTB predicts that log-series distribution will be approximated by the fit of the Pareto power distribution, with an exponent, β = −1.0 (28). However, our fit of the log-series distribution shows that it was not the best fit and the fitted scaling exponent is steeper than −1.0 (β MLE = −1.41 for all of the BIEN observations and β MLE = −1.43 for the observations from ecological plot data; Fig. 2). Thus, a Pareto power distribution needs an exponent less than −1 to generate the number of rare species actually observed.
Together, these results underscore that, at continental to global scales, only a few species abundance distribution models are capable of producing sufficient numbers of rare species to match the observed data and that neutral dynamics under the UNTB is not one of them. The observed value of β for embryophytes is similar to what has been reported for an extensive dataset for other taxa including animals and marine phytoplankton (28), suggesting that the shape of the SAD at increasingly larger spatial scales may converge to a similar distribution across disparate taxa. In sum, the Poisson lognormal is best fit, the Pareto exponent is markedly steeper than −1.0, and the Pareto distribution is the second best fit on two of the three metrics.
Assessment of sampling or taxonomic bias
Next, an obvious question is whether the observed number of rare species is the result of sampling or taxonomic bias. Data from herbarium records are known to exhibit biases in collection and sampling (17, 18). However, do these biases influence our identification of whether a species is rare or not?
We conducted two tests: First, in Fig. 2, we compared the distributions of global abundance in (i) the total BIEN database (including plot surveys and herbarium records), (ii) only the plot datasets, and (iii) the subset of herbarium records that reflect the same geographic distribution as the survey data (e.g., all records within 100 km of any plot location) (Fig. 2). Ecological plots and surveys, in contrast to herbarium data, contain less sampling bias, as a robust effort is made to ensure all individuals within the sampling design are surveyed within a given area. In many cases, repeated visits ensure accurate identification to species. Thus, assessing whether the gSAD from plot data is different from the gSAD from all botanical observations based on sampling herbarium data at the globe or around plots enables us to assess potential bias and sampling effectiveness. As discussed below, both empirical gSADs are described by similar statistical distributions (e.g., the shape of gSADs in Fig. 2B are similar to each other and to Fig. 2A; Tables 1 and 2 and tables S1 and S2), indicating that sampling issues do not greatly influence conclusions regarding gSADs.
Next, to further assess whether rare species are truly rare or artifactual, we randomly sampled 300 rare species with three observations or fewer from the Americas. The Americas were chosen because our taxonomic expertise was focused on these two continents. For each species selected, we consulted taxonomic experts at the Missouri Botanical Garden and the New York Botanical Garden to sort each species into several classifications (Fig. 3 and see the Supplementary Materials). Taxonomic experts largely confirm that the majority of rare species identified by BIEN are rare, with only 7.3% that were clearly erroneous and recognized as abundant or large-ranged species. We conclude that our results are not driven by taxonomic and sampling biases.
Our results from Fig. 3 allow us to estimate the total number of native land plant species currently observed across the globe with estimates for taxonomic uncertainty. After correcting and standardizing data, we estimate that the total number of extant embryophyte (land plant) species on Earth is between ~358,000 and ~435,000. The lower limit stems from subtracting 17.8% from the total [10.3% from the remaining presence of naturalized non-native species +7.5% caused by the inflation of names due to “old names” (basionyms) not yet corrected for by taxonomic data cleaning; see Fig. 3]. Our estimates are consistent with previous estimates of the total number of embryophytes in the world of approximately 391,000 [see (42)] or 403,911 (43) (see the Supplementary Materials). However, now, we can quantify that ~36% of these species are highly rare with very little distributional information for each species. In sum, our results from Fig. 2 show that rarity is commonplace across the land plants. Little botanical information exists across the world’s herbaria and ecological collections for between 11.2 and 13.6% (species with one observation) or between 30.0 and 36.5% (species with fewer than or equal to five total observations) of all vascular plant species.
“Hotspots” of rare species
To identify the regions that harbor hotspots of rare species, in Fig. 4, we mapped the locations of rare species across the world. We controlled for variation in sampling effort by calculating both the Menhinick and Margalef indices (see Materials and Methods). Plotting the sampling-corrected number of rare species reveals several patterns. Rare species cluster in the Americas in (i) mountainous regions (particularly along the thin strip along the western flank of the Andean Mountains, Central America, and the southern Sierra Madre of Mexico), (ii) the Guiana shield in northern South America, and (iii) relatively small climatic regions that are strongly distinct from surrounding areas (the Atlantic Forest or Mata Atlântica in Brazil, the southern region of the California Floristic Province, and the Caribbean); in Africa in (iv) the Cape Floristic Region of South Africa, (v) mountainous regions of Madagascar, (vi) the coastal mountains of Cameroon, and (vii) the Ethiopian highlands and the Somali peninsula; and in Asia in (viii) southwestern China and the border regions of Myanmar, Laos, and Thailand, (ix) Malaysia, (x) New Guinea, and (xi) the mountainous strip from Iran through Turkey. In Europe, there are several regions of notably high diversity of rarity in and around (xii) the Mediterranean, including the Pyrenees and Caucasus.
There is a relative dearth of rare species throughout the Amazon basin, confirming past claims that the Amazon flora consists largely of widespread and relatively abundant species (44). The areas identified by our methods show some overlap with areas independently identified as biodiversity hotspots (45) (e.g., Mesoamerican highlands, the Andes, Southeast Asia, and New Guinea) but differ in other areas.
Drivers of the spatial distribution of rarity
To assess the drivers of the spatial distribution of rarity, we conducted ordinary least squares (OLS) linear regression and simultaneously autoregressive (SAR) models to analyze the relationship between rarity index and environmental variables, including present climate, glacial-interglacial climatic velocity or instability of climate, and topography. Our OLS models showed that all the variables (annual mean temperature, annual precipitation, temperature seasonality, precipitation seasonality, temperature velocity, precipitation velocity, elevation, and heterogeneity of elevation) have significant relationships with both the Menhinick rarity index (tables S3 to S5 and fig. S2) and the Margalef rarity index (tables S6 to S8 and fig. S3), with the largest effects from temperature velocity and heterogeneity of elevation. In comparing the group models [present climate (annual mean temperature, annual precipitation, temperature seasonality, and precipitation seasonality), stability of climate (temperature and precipitation velocity), and topography (elevation and heterogeneity of elevation)], the model with instability of climate tended to outperform models with current climate and topography, while the full model showed the lowest Akaike’s information criterion (AIC). The exhaustively selected model did not include elevation as a predictor, although it had minor differences in model performance compared with the full model.
A Moran’s I test showed high spatial autocorrelation in the residuals of the OLS models, while we found no significant spatial autocorrelation in the residuals of the SAR models (tables S3 to S8). The coefficients of the SAR models were generally similar to those from OLS models, with the exceptions that signs of annual mean temperature, precipitation seasonality, and precipitation velocity switched from positive to negative in the SAR model. Temperature velocity remains the largest negative effect, and heterogeneity of elevation remains the largest positive effect in the SAR models (see figs. S2 and S3 and tables S3 to S8). Models incorporating climate stability and topography outperformed the model with current climate, while the full model remains the best-performing SAR model. The modeling results based on Menhinick and Margalef rarity index showed comparable results (tables S3 to S8 and figs. S2 and S3).
To summarize, areas that contain a higher number of rare species have had a more stable climate. The best predictor of plant rarity is the historical temperature velocity. Climate velocity describes climate instability with ecologically relevant units (distance/time; see discussion in Supplementary Document). In addition, mountainous area, as measured by the SD of elevation, is also a predictor with positive effect (tables S3 to S8 and figs. S2 and S3). Adding short-term annual variation (annual seasonality) in temperature and precipitation and mountainous conditions in addition to climate velocity does improve the explanation of the current spatial distribution of rarity (e.g., the proportion of variation explained, R2, increased to 0.193 for the OLS model and to 0.518 for the SAR model of Menhinick rarity index but less so for Margalef rarity index, 0.176 for the OLS model and 0.263 for the SAR model; tables S3 to S8). Together, these results are consistent with previous results [see (46, 47) and references therein], indicating that increased rates of climate change velocity negatively affect the retention of rare species, presumably because of increased rates of extinction during times of rapid climate change.
The overlaps between future climate velocity and human footprint and rarity indices
Our environment is facing rapid human changes at the global scale, so we quantified the intensity of human impact on the area with rare species (48). Regions with rare species are currently characterized by higher human impact and will experience faster rates of future climate change under representative concentration pathway 8.5 (RCP8.5) (Fig. 5). Areas with rare species have human footprint values of 8.5 ± 5.8, which is ~1.6 times higher (P < 0.001, Wilcoxon test) than that of the globe on average (5.2 ± 5.8). Furthermore, on average, areas with rare species are predicted to experience ~200 (±58) times greater rates of temperature velocity than those same areas experienced historically in terms of the overall glacial-interglacial climate shift across the past 21,000 years [from the last glacial maximum (LGM) to the present]. The ratio between future temperature velocity and this long-term overall historical temperature velocity is ~1.2 times greater (P < 0.001, Wilcoxon test) for areas with rare species than the globe will experience on average (170 ± 77) (Fig. 5). This is because areas with concentrations of rare species have previously been characterized by relatively more stable climates, but under the predicted climate change under RCP8.5, they will now experience velocities as high as the rest of the globe (see fig. S5).
Predicted changes of rarity indices
With the previously calibrated OLS and SAR full models, we made predictions of rarity indices under future projected climate. These showed worldwide decreases in rarity indices (Fig. 6), with the southern Andes and Southeast Asia predicted to experience the largest decreases. These decreases were likely due to the accelerated future climate velocities under RCP8.5, which are two orders of magnitude higher than those experienced from LGM [~21 thousand years (ka) ago] to the present day (see fig. S5). Note, however, that future velocities are estimated over a shorter time frame, which will tend to produce higher estimates.