May 15, 2015
dorng
)step()
or the LARS algorithm for LASSO)apply()
functions can help, but still doesn't use multiple processors.parallel
package (thanks, Miranda!).foreach
package! (Analytics and Weston 2014)Goal: predict arrival volume to inform management of bike stations
Make our K-fold test sets (and implicitly, our training sets)
load(url("http://www.stat.colostate.edu/~scharfh/CSP_parallel/data/arrivals_subset.RData")) K <- 50 N <- dim(arrivals.sub)[1] ## for convenience kill off 8 observations (we have 5208) and make cv test sets set.seed(1985) discarded <- sample(1:N, size = 8) cv.test.sets <- matrix(sample((1:N)[-discarded], size = N - 8), ncol = K)
library(splines) lq.loss <- function(y, y.hat, q = 1) {(abs(y - y.hat))^q} get.errs <- function(test.set = NULL, discarded = NULL, q = 1) { sml.glm <- glm(arrivals ~ bs(hour, degree = 4) + weekend + as.factor(id), data = arrivals.sub[-c(discarded, test.set), ], family = "poisson") med.glm <- glm(arrivals ~ bs(hour, degree = 4)*weekend + as.factor(id), data = arrivals.sub[-c(discarded, test.set), ], family = "poisson") big.glm <- glm(arrivals ~ bs(hour, degree = 4)*weekend*as.factor(id), data = arrivals.sub[-c(discarded, test.set), ], family = "poisson")
sml.err <- mean(lq.loss(predict(object = sml.glm, newdata = arrivals.sub[test.set, -7], type = "response"), arrivals.sub[test.set, 7], q = q)) med.err <- mean(lq.loss(predict(object = med.glm, newdata = arrivals.sub[test.set, -7], type = "response"), arrivals.sub[test.set, 7], q = q)) big.err <- mean(lq.loss(predict(object = big.glm, newdata = arrivals.sub[test.set, -7], type = "response"), arrivals.sub[test.set, 7], q = q)) return(c(sml.err, med.err, big.err)) }
Using a naive for loop, we could implement this as:
err.for <- NULL system.time( for (i in 1:K) { err.for <- cbind(err.for, get.errs(test.set = cv.test.sets[, i], discarded = discarded, q = 1)) } )
## user system elapsed ## 25.133 1.113 26.319
If you're good with apply()
functions you might upgrade to
system.time( err.apply <- sapply(X = 1:K, FUN = function(i) { get.errs(test.set = cv.test.sets[, i], discarded = discarded, q = 1) } ) )
## user system elapsed ## 24.280 1.095 25.488
Both of these assume a single processor architecture. We want to chop the job into halves, fourths, etc. and use the whole computer!
library(foreach) library(doParallel) detectCores()
registerDoParallel(cl = 2) system.time( err.foreach <- foreach(i=1:K, .inorder = FALSE, .combine = "cbind", .packages = "splines") %dopar% { get.errs(test.set = cv.test.sets[, i], discarded = discarded, q = 1) } )
## user system elapsed ## 0.041 0.015 14.356
registerDoParallel(cl = 2)
%___%
%do%
performs the calculations in order and uses only one processor.%dopar%
is what we generally wish to use.%dorng%
will be discussed later, and is required for procedures that use randomly generated numbers.%:%
can be used for nesting loops, which we'll see in an example at the end of the tutorial.foreach()
is a function.combine
can take on the intuitive values 'c'
, 'cbind'
, or '+'
as well as more complex functions and tells foreach
how to combine the outputs from each iteration. The default is to return a list..inorder
is a TRUE
/FALSE
argument. In general, it is better to change this from the default of TRUE
to FALSE
whenever possible.apply()
functions, the foreach
takes an expression (in braces after %dopar%
) instead of a function.Here is our same example with an iterator instead of a list.
library(iterators) registerDoParallel(cl = 2) system.time( err.foreach.iter <- foreach(x = iter(cv.test.sets, by = "col"), .inorder = FALSE, .combine = "cbind", .packages = "splines") %dopar% { get.errs(test.set = x, discarded = discarded, q = 1) } )
## user system elapsed ## 40.037 2.090 14.322
There are many ways to implement doRNG
, the two easiest of which are:
library(doRNG) registerDoParallel(cl = 2) foo <- foreach(x = 1:10, .options.RNG = 1985, .combine = 'c') %dorng% { rnorm(1) }
and
registerDoParallel(cl = 2) registerDoRNG(seed = 1985) bar <- foreach(x = 1:10, .combine = 'c') %dopar% { rnorm(1) }
Note that this gives reproducible results!
sum(foo != bar)
## [1] 0
Some packages come with parallel functionality built in via foreach
.
glmnet
gam
ggmcmc
plyr
foreach
's CRAN page.The following calculation wouldn't typically require parallelization because it isn't a huge task, however we use it for practice's sake.
Suppose we wish to figure out which day in May had the most combined arrivals across these seven stations. Here's a function to get started, you're welcome to scrap it for your own, or use it and fill in the gaps in what follows.
sum.arrivals <- function(date = NULL, id = NULL){ sum.arr <- sum(arrivals.sub$arrivals[arrivals.sub$date == date & arrivals.sub$id == id]) return(sum.arr) }
The following code needs some help. When nesting foreach
loops (see this vignette), we need to use %:%
. Fill in the missing parts to make the code work.
library(doParallel) registerDoParallel(cl = _) system.time( busiest <- foreach(date = ___, ._______ = FALSE, .combine = ___, .packages = ______,) %:% foreach(id = unique(arrivals.sub$id), .inorder = _____, .combine = ___, .packages = ______,) %_____% { sum.arrivals(date = date, id = id) } ) which(busiest==max(busiest))
registerDoParallel(cl = 4) system.time( busiest <- foreach(date = 1:31, .inorder = FALSE, .combine = 'c') %:% foreach(id = unique(arrivals.sub$id), .inorder = FALSE, .combine = '+', .packages = 'splines') %dopar% { sum.arrivals(date = date, id = id) })
## user system elapsed ## 0.157 0.023 0.206
which(busiest==max(busiest))
## [1] 20
busiest.for <- rep(0, 31) system.time( for(date in 1:31){ for(id in unique(arrivals.sub$id)){ busiest.for[date] <- sum(busiest.for[date], sum.arrivals(date = date, id = id)) } } )
## user system elapsed ## 0.049 0.004 0.053
which(busiest.for==max(busiest.for))
## [1] 20
Getting Started with doParallel and foreach
Analytics, Revolution, and Steve Weston. 2014. Foreach: Foreach Looping Construct for R. http://CRAN.R-project.org/package=foreach.