Appendix D. Fitting boosted trees to data with multi-level errors.
Observations from data with multi-level errors are not independent. Consider a simple nested design in which observational units (OUs) are measured repeatedly. Data from the same OU are usually correlated, and failing to account for this when using cross-validation to select models or to estimate predictive error leads to over-complex models and under-estimates of prediction error.
To overcome this, we can construct the cross-validation groups so that all data from any one OU fall in the same group. The “in-bag” and “out-of-bag” groups then share no OUs. The following example illustrates the case of simple nesting.
set.seed(1)
nrep <- 5 # the number of repeated measurements per OU
ng <- 50 # the number of groups (OUs)
n <- nrep*ng # the total number of observations
sdg <- 2 # the between-group (OU) SD
sde <- 1 # the irreducible (pure) error SD
x <- rep(runif(ng), nrep) # a continuous covariate
g <- rep(1:ng, nrep) # the groups (OUs)
y <- 10*x + rep(rnorm(ng, sd = sdg), nrep) + rnorm(n, sd=sde) # the response
df <- data.frame(y, x, g)
# plot the data and add the least-squares line (the true line is y = 10x)
plot(x, y)
abline(lm(y~x, data=df))
# fit the gbm ignoring the sampling structure; note the small prediction error, whose expected value is sde^2 = 1
fit1 <- gbm(y~x, data=df, shr=0.05, n.tree=5000)
summary(fit1, plot=FALSE); plot(fit1, npt=500); pred.err(fit1)
# refit the gbm using the groups defined by "g"; this gives a better estimate of the prediction error, whose expected value is sdg^2 + sde^2 = 5
fit2 <- gbm(y~x, data=df, shr=0.01, n.tree=500, grp="g")
summary(fit2, plot=FALSE); plot(fit2, npt=500); pred.err(fit2)
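The grouping step itself does not depend on any particular boosting implementation. As a minimal sketch in base R, whole OUs can be dealt out to cross-validation folds as shown below. The helper `make_group_folds` and its argument `k` are hypothetical names introduced for illustration; they are not part of the code used above, and the `grp` argument may construct its groups differently.

```r
# Hypothetical helper (an assumption, not part of the appendix code):
# assign every observation to a fold so that each OU sits in exactly one fold.
make_group_folds <- function(g, k = 10) {
  ids <- unique(g)
  # shuffle fold labels across the OUs, keeping fold sizes as even as possible
  fold_of_id <- setNames(sample(rep_len(seq_len(k), length(ids))), ids)
  # look up each observation's fold through its OU label
  unname(fold_of_id[as.character(g)])
}

set.seed(1)
g <- rep(1:50, 5)                  # 50 OUs, each measured 5 times, as in the simulation
fold <- make_group_folds(g, k = 10)

# check: no OU is split across folds
folds_per_ou <- tapply(fold, g, function(f) length(unique(f)))
all(folds_per_ou == 1)

# number of observations per fold
table(fold)
```

Because the folds are built from the OU labels rather than from individual rows, every held-out fold consists only of OUs that the model has not seen during fitting, which is the property needed for an honest estimate of prediction error for new OUs.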