Ecological Archives E088-015-A4

Glenn De'ath. 2007. Boosted trees for ecological modeling and prediction. Ecology 88:243–251.

Appendix D. Fitting boosted trees to data with multi-level errors.

Observations from data with multi-level errors are not independent. Consider a simple nested design in which observational units (OUs) are measured repeatedly. Data from the same OU are usually correlated, and failing to account for this when using cross-validation to select models or to estimate predictive error leads to over-complex models and underestimates of prediction error.

To overcome this, we can construct the cross-validation groups so that each group contains all data from an OU. Thus the “in-bag” and “out-of-bag” groups have no data from common OUs. The following example illustrates the case of a simple nesting.

set.seed(1)

nrep <- 5 # the number of repeated measurements per group (OU)

ng <- 50 # the number of groups (OUs)

n <- nrep*ng # the total number of observations

sdg <- 2 # the between-group (OU) SD

sde <- 1 # the irreducible (pure) error SD

x <- rep(runif(ng), nrep) # a continuous covariate

g <- rep(1:ng, nrep) # the groups (OUs)

y <- 10*x + rep(rnorm(ng, sd = sdg), nrep) + rnorm(n, sd=sde) # the response

df <- data.frame(y, x, g)

# plot the data and add the least-squares line (an estimate of the true line y = 10x)

plot(x, y)

abline(lm(y~x, data=df))

# fit the gbm ignoring the sampling structure; note the small prediction error, whose expected value is 1 (sde^2), because x is constant within each OU and the trees can absorb the group effects.

fit1 <- gbm(y~x, data=df, shr=0.05, n.tree=5000);

summary(fit1, plot=F); plot(fit1, npt=500); pred.err(fit1)

# refit the gbm using the groups defined by "g"; this gives a better estimate of the prediction error, whose expected value is 5 (sdg^2 + sde^2).

fit2 <- gbm(y~x, data=df, shr=0.01, n.tree=500, grp="g");

summary(fit2, plot=F); plot(fit2, npt=500); pred.err(fit2)
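
The grouping step itself, in which every OU is assigned as a whole to a single cross-validation group, can be sketched in base R. This is only an illustration of the idea and not the code used inside the fitting function above; the helper name make_group_folds and its argument k are hypothetical.

```r
# Hypothetical helper (not part of the original appendix):
# assign whole OUs, rather than single observations, to k folds.
make_group_folds <- function(g, k = 10) {
  ids <- unique(g)
  # randomly allocate each OU id to one of k folds, as evenly as possible
  fold_of_id <- sample(rep(seq_len(k), length.out = length(ids)))
  # every observation inherits the fold of its OU
  fold_of_id[match(g, ids)]
}

set.seed(1)
g <- rep(1:50, 5)                      # same grouping layout as the example
fold <- make_group_folds(g, k = 10)

# each OU should fall in exactly one fold
folds_per_ou <- tapply(fold, g, function(f) length(unique(f)))
all(folds_per_ou == 1)

# for a held-out fold, the training and test sets share no OUs
test <- fold == 1
length(intersect(g[test], g[!test]))
```

Any learner can then be cross-validated by holding out one value of fold at a time, so that predictions are always made for OUs that the model has not seen.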


