May 15, 2015
dorng)step() or the LARS algorithm for LASSO)apply() functions can help, but still doesn't use multiple processors.parallel package (thanks, Miranda!).foreach package! (Analytics and Weston 2014)Goal: predict arrival volume to inform management of bike stations
Make our K-fold test sets (and implicitly, our training sets)
load(url("http://www.stat.colostate.edu/~scharfh/CSP_parallel/data/arrivals_subset.RData"))
K <- 50
N <- dim(arrivals.sub)[1]
## for convenience kill off 8 observations (we have 5208) and make cv test sets
set.seed(1985)
discarded <- sample(1:N, size = 8)
cv.test.sets <- matrix(sample((1:N)[-discarded], size = N - 8), ncol = K)
library(splines)
lq.loss <- function(y, y.hat, q = 1) {(abs(y - y.hat))^q}
get.errs <- function(test.set = NULL, discarded = NULL, q = 1) {
sml.glm <- glm(arrivals ~
bs(hour, degree = 4)
+ weekend
+ as.factor(id),
data = arrivals.sub[-c(discarded, test.set), ],
family = "poisson")
med.glm <- glm(arrivals ~
bs(hour, degree = 4)*weekend
+ as.factor(id),
data = arrivals.sub[-c(discarded, test.set), ],
family = "poisson")
big.glm <- glm(arrivals ~
bs(hour, degree = 4)*weekend*as.factor(id),
data = arrivals.sub[-c(discarded, test.set), ],
family = "poisson")
sml.err <- mean(lq.loss(predict(object = sml.glm,
newdata = arrivals.sub[test.set, -7],
type = "response"),
arrivals.sub[test.set, 7],
q = q))
med.err <- mean(lq.loss(predict(object = med.glm,
newdata = arrivals.sub[test.set, -7],
type = "response"),
arrivals.sub[test.set, 7],
q = q))
big.err <- mean(lq.loss(predict(object = big.glm,
newdata = arrivals.sub[test.set, -7],
type = "response"),
arrivals.sub[test.set, 7],
q = q))
return(c(sml.err, med.err, big.err))
}
Using a naive for loop, we could implement this as:
err.for <- NULL
system.time(
for (i in 1:K) {
err.for <- cbind(err.for, get.errs(test.set = cv.test.sets[, i],
discarded = discarded,
q = 1))
}
)
## user system elapsed ## 25.133 1.113 26.319
If you're good with apply() functions you might upgrade to
system.time(
err.apply <- sapply(X = 1:K,
FUN = function(i) {
get.errs(test.set = cv.test.sets[, i],
discarded = discarded,
q = 1)
}
)
)
## user system elapsed ## 24.280 1.095 25.488
Both of these assume a single processor architecture. We want to chop the job into halves, fourths, etc. and use the whole computer!
library(foreach) library(doParallel) detectCores()
registerDoParallel(cl = 2)
system.time(
err.foreach <- foreach(i=1:K,
.inorder = FALSE,
.combine = "cbind",
.packages = "splines") %dopar% {
get.errs(test.set = cv.test.sets[, i],
discarded = discarded,
q = 1)
}
)
## user system elapsed ## 0.041 0.015 14.356
registerDoParallel(cl = 2)%___%
%do% performs the calculations in order and uses only one processor.%dopar% is what we generally wish to use.%dorng% will be discussed later, and is required for procedures that use randomly generated numbers.%:% can be used for nesting loops, which we'll see in an example at the end of the tutorial.foreach() is a function.combine can take on the intuitive values 'c', 'cbind', or '+' as well as more complex functions and tells foreach how to combine the outputs from each iteration. The default is to return a list..inorder is a TRUE/FALSE argument. In general, it is better to change this from the default of TRUE to FALSE whenever possible.apply() functions, the foreach takes an expression (in braces after %dopar%) instead of a function.Here is our same example with an iterator instead of a list.
library(iterators)
registerDoParallel(cl = 2)
system.time(
err.foreach.iter <- foreach(x = iter(cv.test.sets, by = "col"),
.inorder = FALSE,
.combine = "cbind",
.packages = "splines") %dopar% {
get.errs(test.set = x,
discarded = discarded,
q = 1)
}
)
## user system elapsed ## 40.037 2.090 14.322
There are many ways to implement doRNG, the two easiest of which are:
library(doRNG)
registerDoParallel(cl = 2)
foo <- foreach(x = 1:10,
.options.RNG = 1985,
.combine = 'c') %dorng% {
rnorm(1)
}
and
registerDoParallel(cl = 2)
registerDoRNG(seed = 1985)
bar <- foreach(x = 1:10,
.combine = 'c') %dopar% {
rnorm(1)
}
Note that this gives reproducible results!
sum(foo != bar)
## [1] 0
Some packages come with parallel functionality built in via foreach.
glmnetgamggmcmcplyrforeach's CRAN page.The following calculation wouldn't typically require parallelization because it isn't a huge task, however we use it for practice's sake.
Suppose we wish to figure out which day in May had the most combined arrivals across these seven stations. Here's a function to get started, you're welcome to scrap it for your own, or use it and fill in the gaps in what follows.
sum.arrivals <- function(date = NULL,
id = NULL){
sum.arr <- sum(arrivals.sub$arrivals[arrivals.sub$date == date &
arrivals.sub$id == id])
return(sum.arr)
}
The following code needs some help. When nesting foreach loops (see this vignette), we need to use %:%. Fill in the missing parts to make the code work.
library(doParallel)
registerDoParallel(cl = _)
system.time(
busiest <- foreach(date = ___,
._______ = FALSE,
.combine = ___,
.packages = ______,) %:%
foreach(id = unique(arrivals.sub$id),
.inorder = _____,
.combine = ___,
.packages = ______,) %_____% {
sum.arrivals(date = date,
id = id)
}
)
which(busiest==max(busiest))
registerDoParallel(cl = 4)
system.time(
busiest <- foreach(date = 1:31,
.inorder = FALSE,
.combine = 'c') %:%
foreach(id = unique(arrivals.sub$id),
.inorder = FALSE,
.combine = '+', .packages = 'splines') %dopar% {
sum.arrivals(date = date,
id = id)
})
## user system elapsed ## 0.157 0.023 0.206
which(busiest==max(busiest))
## [1] 20
busiest.for <- rep(0, 31)
system.time(
for(date in 1:31){
for(id in unique(arrivals.sub$id)){
busiest.for[date] <- sum(busiest.for[date],
sum.arrivals(date = date, id = id))
}
}
)
## user system elapsed ## 0.049 0.004 0.053
which(busiest.for==max(busiest.for))
## [1] 20
Getting Started with doParallel and foreach
Analytics, Revolution, and Steve Weston. 2014. Foreach: Foreach Looping Construct for R. http://CRAN.R-project.org/package=foreach.