Appendix C. Control analyses for robustness against missing trait values.
Allowing missing trait values when calculating a plot mean might affect the selection of sites for analysis and the trait-trait and trait-environment relationships. In this appendix we show that (i) the selected sites are not a biased selection from all available sites; (ii) the average percentage of species used to calculate a plot mean is in fact much higher than the lower bounds set (50%, or 20% for RGR and LPC); (iii) the slopes of the regression lines are not significantly affected by introducing missing trait values, and the increase in the uncertainty of the slopes is smaller than the increase in the percentage of missing trait data; and finally, (iv) the structure and significance of the SEM are not affected by the RGR values used. As a result, we conclude that our main conclusions, that nutrient availability and disturbance partly affect the same suite of traits and that trait-trait constraints play an important role, still hold. Steps 1 and 2 are done for both the simplified model (Fig. 3) and the complex model of Appendix F. Step 3 is done for the model in Appendix F.
1. A biased subset of the total available number of sites?
The six available data sources represented 19 different vegetation types. The selected set of sites for which trait information was available was not biased compared to all available sites, because selected sites covered all 19 vegetation types. Additionally, the sites not included in the analysis were randomly spread over these 19 vegetation types: The correlation between the number of sites per vegetation type for the selected sites and the total number of available sites was 0.94 (on 10log transformed data to fulfill homogeneity of variance). Therefore the selection criteria have not led to an unbalanced data set.
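The balance check above can be illustrated with a minimal sketch. The site counts below are hypothetical placeholders, not the study's data, and the helper name `pearson` is introduced only for this illustration; the sketch correlates the log10-transformed number of selected sites with the log10-transformed number of available sites per vegetation type.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical site counts per vegetation type (NOT the study's data).
available = [120, 45, 300, 18, 60, 9]  # all available sites
selected = [14, 6, 31, 3, 8, 1]        # sites that passed the selection criteria

# log10 transformation, as in the text. Every vegetation type must have at
# least one selected site, otherwise log10(0) is undefined.
log_available = [math.log10(v) for v in available]
log_selected = [math.log10(v) for v in selected]

r = pearson(log_selected, log_available)
print(round(r, 3))
```

A correlation close to 1 indicates that the selected sites are spread over the vegetation types in roughly the same proportions as the available sites.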
2. A biased estimate of trait plot means?
Within our data set, plots were chosen that had sufficient information for at least 50% of the species (or 20% in the case of RGR and LPC). However, might this selection and the incomplete information have led to biased results? If missing data are non-randomly distributed, this can lead to a biased estimate of the plot mean and to biased environment-trait and trait-trait relationships, and thus to a different conclusion about the relative contribution of environmental drivers and trait-trait constraints. We performed a three-step analysis to test this crucial issue. First, we calculated the actual percentage of species with trait information that were used to calculate plot means. Next, we tested whether the slopes of the paths of our SEM are significantly affected when trait plot means are allowed to be based on incomplete data. Finally, we incorporated modelled RGR values in a SEM (of Appendix F) and tested whether the structure still holds.
Step 1: Was the percentage of species with trait information to calculate plot means really that low?
Our selection criterion set a minimum for the availability of trait information, in order to include the majority of the species per plot. This threshold was 50% of the species, assuming that these species give a good estimate of the true plot trait mean; the minimum was lowered to > 20% for LPC and RGR, as these traits were less well covered in the database but were core traits and essential to the analysis. In fact, the average percentage of species used to calculate a plot mean was much higher than this minimum percentage (Fig. C1). In reality, the chances of a potential bias (which would additionally require that the species with trait information were a selective rather than a random subset) are therefore much smaller than might have been concluded from the threshold stated in the manuscript alone. The only exception is RGR, for which on average indeed only 50% of the species had trait information. Given that we had already combined all available trait databases, we tested as a second step whether correlations (and thus paths in our SEM) could have been affected by the incomplete trait information.
FIG. C1. The percentage of species used to calculate the plot mean for different traits (abbreviations of traits: leaf nitrogen content (LNC), leaf phosphorus content (LPC), specific leaf area (SLA), 10log seed mass of the germinule (SM_g), 10log seed mass of the dispergule (SM_d), 10log maximum canopy height (maxCH), growth form (GF), seedling relative growth rate (RGR), 10log germination onset (GO), flowering onset (FO)).
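The coverage check of Step 1 can be sketched as follows. This is only an illustration under stated assumptions: species names and trait values are hypothetical, the plot mean is taken as an unweighted average over the species with data (the text does not specify the weighting), and both thresholds are applied as "greater than or equal to".

```python
# Minimum percentage of a plot's species that must have trait data
# (from the text: 50% in general, 20% for LPC and RGR).
# Applying both thresholds as ">=" is an assumption of this sketch.
MIN_COVERAGE = {"default": 50.0, "LPC": 20.0, "RGR": 20.0}

def trait_coverage(plot_species, trait_values):
    """Percentage of a plot's species that have a value for the trait."""
    known = [s for s in plot_species if s in trait_values]
    return 100.0 * len(known) / len(plot_species)

def plot_mean(plot_species, trait_values, trait):
    """Unweighted plot mean over species with data, or None if coverage is too low."""
    threshold = MIN_COVERAGE.get(trait, MIN_COVERAGE["default"])
    if trait_coverage(plot_species, trait_values) < threshold:
        return None
    vals = [trait_values[s] for s in plot_species if s in trait_values]
    return sum(vals) / len(vals)

# Hypothetical species lists and SLA values (NOT the study's data).
plots = {
    "plot_1": ["sp_a", "sp_b", "sp_c", "sp_d"],
    "plot_2": ["sp_a", "sp_e", "sp_f", "sp_g"],
}
sla = {"sp_a": 20.0, "sp_b": 25.0, "sp_c": 30.0}

for name, species in plots.items():
    print(name, trait_coverage(species, sla), plot_mean(species, sla, "SLA"))
```

Averaging `trait_coverage` over all retained plots gives the kind of per-trait percentage summarized in Fig. C1.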
Step 2: The comparison of the slopes of a complete set and the available data set and estimation of the uncertainty in the slopes
Our claim that ‘regenerative and establishment traits are linked’ was based both on the overall fit of the SEM and on the significance of those path coefficients linking the two groups of traits. We deal with the overall fit in the next section. Here we consider how missing data may affect the significance of the relevant path coefficients in our models. Since the path coefficients in a SEM are conceptually similar to the slopes of the equivalent regressions, the SEM model should be robust against missing trait values if the slopes of the relevant single regressions obtained from our data set (as used in this paper) and a smaller subset of the data set for which trait information is available for all species are not significantly different. Additionally, if the slopes of these regression lines are not significantly different for these two data sets, then the relative contribution of environmental drivers to trait selection and the significance of traits to trait selection will by definition remain unchanged.
To test for this, we used two data sets. The subset with which we compared our data set was defined by selecting those sites for which trait data were available for more than 90% of the species. Setting this criterion at 90% (and 70% for RGR analyses) ensured that at least 10 sites (average 42 sites) were available for the regression analysis. Note that this 90% means that on average only 1.4 species per plot were missing (a plot contained on average 18 species). The full set was defined as all 156 sites used in the SEM.
A significance test between the slopes of the two regression lines was performed as follows: a dummy variable (0 and 1 for the two data sets, respectively) was included in the regression: Y = a × X + b + c × group + d × group × X. If the slope of the subset is significantly different from that of the full set, then the parameter d will be significantly different from zero. Running these regressions for all environment-trait and trait-trait relationships of the SEM presented in Fig. 3 of the manuscript and of the model presented in Appendix F (Fig. F1) showed that none of the slopes of the full set differed significantly from those of the subset (P > 0.05 for parameter d); in other words, the slopes of the regressions were not significantly affected by allowing missing trait values up to a maximum of 80% for LPC and RGR and 50% for the other traits (see Step 1, Fig. C1). This implies that the SEM would have had the same slopes and the same cause-effect relationships if it had been based on the subset (but with much less power, given the fewer degrees of freedom). Our claim that ‘regenerative and establishment traits are linked’ thus holds. In Table C1 we present the P values and estimates of the slopes.
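The dummy-variable test can be sketched as follows. In the fully interacted model Y = a × X + b + c × group + d × group × X, the least-squares estimates coincide with two separate simple regressions, so d equals the difference between the two slopes and its standard error follows from the pooled residual variance. The function names and data are hypothetical, and the P value uses a normal approximation to the t distribution (an assumption made to keep the sketch within the Python standard library); it is not the authors' code.

```python
import math
from statistics import NormalDist

def simple_fit(x, y):
    """OLS fit of y = slope * x + intercept; returns slope, intercept, Sxx, SSE."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return slope, intercept, sxx, sse

def slope_difference(x_full, y_full, x_sub, y_sub):
    """Estimate c and d for group 0 = full set, group 1 = subset."""
    a, b, sxx0, sse0 = simple_fit(x_full, y_full)
    a1, b1, sxx1, sse1 = simple_fit(x_sub, y_sub)
    d = a1 - a                                   # difference in slope
    c = b1 - b                                   # difference in intercept
    df = len(x_full) + len(x_sub) - 4            # four fitted parameters
    s2 = (sse0 + sse1) / df                      # pooled residual variance
    se_d = math.sqrt(s2 * (1 / sxx0 + 1 / sxx1))
    t = d / se_d
    # Normal approximation to the two-sided P value (assumption of this sketch).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return {"a": a, "b": b, "c": c, "d": d, "se_d": se_d, "t": t, "p_approx": p}

# Hypothetical plot-level data (NOT the study's data).
x_full = [0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]
y_full = [24.1, 23.2, 23.4, 22.1, 22.6, 21.3, 21.0, 20.7]
subset = [0, 2, 3, 5, 7]  # indices of sites with nearly complete trait data
x_sub = [x_full[i] for i in subset]
y_sub = [y_full[i] for i in subset]

print(slope_difference(x_full, y_full, x_sub, y_sub))
```

With small samples, an exact t distribution with the stated degrees of freedom would give somewhat larger P values than this approximation.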
TABLE C1. Comparison of slopes between subset and full set for all relationships used in the SEM (including the number of sites used (N), estimates of the parameters and P values). Non-significant parameters are indicated in bold.
|Predictor||Response||N||Parameter||Estimate||P|
|log10 Soil C/N||LNC||13||Intercept||28.619||0.0000|
| || || ||log10 Soil C/N||-4.967||0.0000|
| || || ||group (0 = full set, 1 = subset)||5.531||0.5020|
| || || ||group × log10 Soil C/N||-3.950||0.5230|
|log10 Soil C/N||SLA||50||Intercept||24.695||0.0000|
| || || ||log10 Soil C/N||-2.549||0.0749|
| || || ||group (0 = full set, 1 = subset)||1.104||0.7698|
| || || ||group × log10 Soil C/N||-1.629||0.5815|
|log10 Soil C/N||LPC||13||Intercept||2.6463||0.0000|
| || || ||log10 Soil C/N||-0.6484||0.0007|
| || || ||group (0 = full set, 1 = subset)||-0.0324||0.9810|
| || || ||group × log10 Soil C/N||-0.0474||0.9627|
|log10 Soil C/N||SM_g||20||Intercept||-0.4638||0.0208|
| || || ||log10 Soil C/N||-0.5799||0.0000|
| || || ||group (0 = full set, 1 = subset)||0.0017||0.9979|
| || || ||group × log10 Soil C/N||-0.1915||0.7088|
|log10 Soil C/P||LPC||70||Intercept||3.493||0.0000|
| || || ||log10 Soil C/P||-0.751||0.0000|
| || || ||group (0 = full set, 1 = subset)||-1.352||0.0559|
| || || ||group × log10 Soil C/P||0.546||0.0803|
|log10 Soil C/P||LNC||13||Intercept||32.782||0.0000|
| || || ||log10 Soil C/P||-4.696||0.0000|
| || || ||group (0 = full set, 1 = subset)||-3.873||0.3790|
| || || ||group × log10 Soil C/P||1.706||0.3860|
|log10 Soil C/P||SLA||50||Intercept||29.025||0.0000|
| || || ||log10 Soil C/P||-3.408||0.0000|
| || || ||group (0 = full set, 1 = subset)||-1.170||0.7210|
| || || ||group × log10 Soil C/P||0.093||0.9490|
|log10 Soil C/P||GO||30||Intercept||0.263||0.0278|
| || || ||log10 Soil C/P||0.124||0.0203|
| || || ||group (0 = full set, 1 = subset)||-0.123||0.6444|
| || || ||group × log10 Soil C/P||0.010||0.9311|
|log10 Soil C/P||FO||139||Intercept||26.505||0.0000|
| || || ||log10 Soil C/P||0.542||0.0203|
| || || ||group (0 = full set, 1 = subset)||0.544||0.4795|
| || || ||group × log10 Soil C/P||-0.218||0.5280|
|log10 Soil C/P||SM_g||20||Intercept||0.431||0.0316|
| || || ||log10 Soil C/P||-0.313||0.0006|
| || || ||group (0 = full set, 1 = subset)||-1.216||0.0457|
| || || ||group × log10 Soil C/P||0.449||0.0943|
|log10 Soil C/P||RGR||11||Intercept||0.1599||0.0000|
| || || ||log10 Soil C/P||-0.0041||0.5790|
| || || ||group (0 = full set, 1 = subset)||0.0316||0.6530|
| || || ||group × log10 Soil C/P||-0.0352||0.3020|
|group (0 = full set, 1 = subset)||-0.419||0.5361|
|group × maxCH||-0.038||0.2293|
|group (0 = full set, 1 = subset)||-0.035||0.4850|
|group × TSD||0.002||0.3250|
|group (0 = full set, 1 = subset)||-0.045||0.6020|
|group × TSD||0.007||0.1290|
|group (0 = full set, 1 = subset)||0.052||0.6940|
|group × TSD||-0.001||0.8470|
|group (0 = full set, 1 = subset)||-0.0305||0.0680|
|group × TSD||0.0002||0.6210|
|group (0 = full set, 1 = subset)||-2.148||0.0054|
|group × maxCH||-1.950||0.3016|
|group (0 = full set, 1 = subset)||-1.139||0.2730|
|group × maxCH||-0.456||0.7680|
|group (0 = full set, 1 = subset)||-0.339||0.2561|
|group × maxCH||0.034||0.9276|
|group (0 = full set, 1 = subset)||0.275||0.2143|
|group × maxCH||1.277||0.1947|
|group (0 = full set, 1 = subset)||0.035||0.8360|
|group × maxCH||-0.022||0.9616|
|group (0 = full set, 1 = subset)||0.000||0.9782|
|group × maxCH||-0.004||0.8721|
|group (0 = full set, 1 = subset)||0.0083||0.5051|
|group × maxCH||-0.0328||0.3329|
|group (0 = full set, 1 = subset)||0.0416||0.7442|
|group × GF||0.0128||0.9763|
|group (0 = full set, 1 = subset)||0.0042||0.9176|
|group × GF||-0.0095||0.9249|
|group (0 = full set, 1 = subset)||-0.0261||0.6891|
|group × GF||0.3868||0.1680|
|group (0 = full set, 1 = subset)||0.0098||0.6116|
|group × GF||-0.0179||0.5623|
|group (0 = full set, 1 = subset)||-0.165||0.4473|
|group × SM_g||-0.242||0.5420|
|group (0 = full set, 1 = subset)||0.831||0.0589|
|group × SM_g||0.734||0.3793|
|group (0 = full set, 1 = subset)||-0.2021||0.0175|
|group × LNC||0.0065||0.0566|
|group (0 = full set, 1 = subset)||-0.1301||0.0195|
|group × SLA||0.0042||0.0954|
|RGR||-6.5316||0.0000|
|group (0 = full set, 1 = subset)||3.5674||0.2805|
|group × RGR||-2.8471||0.3369|
Although for a SEM it is much more important to test to what extent the slopes of the relationships are significantly affected by missing trait data, we additionally investigated the effect of missing trait data on the uncertainty of the slope estimates. To estimate the effect of missing trait data on the standard error of the slope, we ran a rarefaction procedure that makes the trait data increasingly sparse. Because rerunning the full SEM with newly calculated trait averages 500 or 1000 times would be a huge effort, we ran the rarefaction, analogous to the robustness test above, on the bivariate trait-trait relationships that occur in the SEM. The proportion of missing trait data in the dependent variable was increased in steps of 5% up to 35%, relative to the currently available data for that trait. At each step, new trait means were calculated for the plots and a regression was run on all plots to determine the slope and its standard error. Next, the standard error of the slope was expressed relative to the standard error of the slope of the same bivariate relationship with the currently available trait data, which allows the increase in standard error to be compared among the bivariate relationships. This procedure was repeated 500 times to obtain a robust estimate of the standard error. The results are shown in Table C2. In all cases the standard error of the slope increases with an increasing amount of missing trait data. On average, the standard error increases by 7% if 10% of the trait data are deleted. RGR is particularly sensitive to missing trait data, but this is probably due to the already relatively low availability of this trait. Germination onset (GO) is also sensitive to omissions of trait data. Although only 22% of the trait data for GO are missing, GO remains sensitive; we think that this is because of the ordinal three-point scale of this trait.
Based on these results, we think that the slope estimates are relatively robust against missing trait data, as for most traits the relative increase in the standard error of the slope is much smaller than the relative increase in missing trait data. Although the SE of GO increases by more than 10% when 10% more trait data are missing, we expect that this does not materially affect the SEM, because GO is only affected by other traits and is not a parent of any other trait, and because the number of trait data available for GO is among the highest of the traits (see Table 1 of the manuscript), so the actual bias is relatively small. The SE of the slope for RGR also increases by more than 10% when 10% more trait data are missing. The effect of missing trait data on RGR is analyzed in more detail in the next section.
TABLE C2. Effect of the percentage of missing trait data on the standard error of the slope for bivariate relationships. Slope indicates the increase of the standard error with increasing number of missing species. The last column indicates the % increase in standard error given a 10% loss of species trait data.
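The rarefaction procedure described above can be sketched as follows, under stated assumptions: the data are simulated, the function names are hypothetical, species trait values are deleted at random across all plots, and plot means are unweighted. It is meant only to make the sequence of steps concrete, not to reproduce the authors' implementation.

```python
import math
import random

def slope_se(x, y):
    """Standard error of the OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return math.sqrt(sse / (n - 2) / sxx)

def rarefied_plot_means(plots, fraction, rng):
    """Delete `fraction` of all species trait values at random, then average per plot."""
    records = [(p, i) for p, vals in enumerate(plots) for i in range(len(vals))]
    drop = set(rng.sample(records, int(round(fraction * len(records)))))
    means = []
    for p, vals in enumerate(plots):
        kept = [v for i, v in enumerate(vals) if (p, i) not in drop]
        means.append(sum(kept) / len(kept) if kept else None)
    return means

def relative_se(x, plots, fraction, n_rep=500, seed=1):
    """Mean ratio of the rarefied slope SE to the SE obtained with all available data."""
    rng = random.Random(seed)
    base = slope_se(x, [sum(v) / len(v) for v in plots])
    ratios = []
    for _ in range(n_rep):
        means = rarefied_plot_means(plots, fraction, rng)
        # Plots that lost all their trait values are left out of the regression.
        pairs = [(a, m) for a, m in zip(x, means) if m is not None]
        xs, ys = zip(*pairs)
        ratios.append(slope_se(xs, ys) / base)
    return sum(ratios) / len(ratios)

# Simulated example (NOT the study's data): 30 plots with 8 species each.
rng = random.Random(0)
x = [rng.uniform(0, 1) for _ in range(30)]  # predictor plot means
plots = [[2 * xi + rng.gauss(0, 0.5) for _ in range(8)] for xi in x]

for k in range(1, 8):  # 5% to 35% in steps of 5%
    f = 0.05 * k
    print(f"{f:.2f}", round(relative_se(x, plots, f, n_rep=200), 3))
```

Regressing the printed ratios on the deleted fraction gives a slope comparable in meaning to the one reported per relationship in Table C2.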
Step 3: Test of SEM with modelled RGR values
In contrast to the other comparisons between the full data set and the subset, the relationships of the leaf traits vs. RGR were close to being significantly different for the two data sets. The standard error of the slopes was also relatively large compared to the other traits. This probably means that the plot means of RGR deviated to some extent from the ‘real’ plot mean.
To test whether the structure and significance of the SEM were affected by the deviating estimates of RGR, we ran an additional SEM (tested on the extended model of Appendix F only) that included better estimates of the RGR plot means. We did not run a SEM for only those plots for which we had sufficient trait information for RGR, as this would have left too few degrees of freedom to run the SEM. Instead, we fitted a multiple regression model in which RGR was predicted from growth form, LNC and SLA for the subset with known unbiased estimates of plot means for RGR (at least 70% of the species cover available). The parameter estimates of the multiple regression were used to predict the RGR values for the remaining sites with insufficient trait information. To avoid over-fitting, a random number was added to the predicted values (drawn from a normal distribution with a mean of zero and a standard deviation equal to the standard deviation of the residuals of the multiple regression). This procedure ensured that the relations between RGR and growth form, LNC and SLA were not made stronger than in the default model. These predicted RGR values replaced the original RGR values and were used in the SEM (everything else kept equal; see Fig. F1). Because the numbers are randomly drawn from a normal distribution, a single run can over- or underestimate the fit, so the procedure was repeated multiple times. These runs showed that neither the validity of the full model (P values remained equal), nor the structure of the full model, nor the significance of any individual path differed from the original model. Additionally, for all traits, the dominant drivers and the dominant trait-trait constraints remained unchanged. Furthermore, the relative contribution of the traits and the environmental drivers remained equal.
There was only a slight increase in the role of the leaf traits in determining RGR and the explained variance of RGR (from 0.12 to 0.14 and from 0.49 to 0.56 respectively – compare Table C3 below and Table F8 in manuscript) and the explained variance of SLA and maxCH increased slightly. Therefore, the plot mean RGR values as calculated in the paper did not change the interpretation of the results and the conclusions about the contribution of the environmental drivers and the role of trait-trait constraints in trait assembly (See Table C3).
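The imputation step used for the modelled RGR values can be sketched as follows. Everything here is illustrative: the data are simulated, the function names are hypothetical, growth form is coded as a single numeric variable, and the residual standard deviation uses a degrees-of-freedom correction, none of which is specified in the text.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) and residual standard deviation."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
    sd = math.sqrt(sum(e * e for e in resid) / (len(y) - p))
    return beta, sd

def predict_with_noise(beta, sd, rows, rng):
    """Regression prediction plus a draw from N(0, sd) for each row."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) + rng.gauss(0, sd)
            for r in rows]

rng = random.Random(42)
# Simulated plot means as (GF code, LNC, SLA); NOT the study's data.
known = [(i % 2, rng.uniform(15, 35), rng.uniform(10, 30)) for i in range(20)]
rgr_known = [0.05 + 0.02 * g + 0.003 * n + 0.002 * s + rng.gauss(0, 0.01)
             for g, n, s in known]
remaining = [(i % 2, rng.uniform(15, 35), rng.uniform(10, 30)) for i in range(10)]

beta, sd = fit_ols(known, rgr_known)      # fit on sites with reliable RGR means
rgr_modelled = predict_with_noise(beta, sd, remaining, rng)
print([round(v, 3) for v in rgr_modelled])
```

Because each call draws new noise, rerunning `predict_with_noise` and refitting the SEM on each draw corresponds to the repeated runs described in the text.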
TABLE C3. The effect of environmental constraints (cause; columns) on the selection of individual traits (effect; rows) relative to the effect of trait-trait constraints, with the modelled RGR values. The rightmost column gives the explained variance of the SEM with the plot mean RGR values as used in the manuscript.
|Cause||Environmental constraint||Trait–trait constraints||
|TSD||DE > IE||