May 15, 2015

Why?

Embarassingly parallel tasks

parallel processes:

  • Bootstrapping
  • Cross-validation
  • Simulating independent random variables (dorng)

non-parallel processes:

  • MCMC algorithms
  • Several types of model selection (e.g.: step() or the LARS algorithm for LASSO)

What to do

Options

  • Changing from a for loop to one of the apply() functions can help, but still doesn't use multiple processors.
  • Use the parallel package (thanks, Miranda!).
  • Don't use R.
  • Use the foreach package! (Analytics and Weston 2014)

Why foreach?

  • Make use of our whole computer
  • Without having to invest large amounts of time in learning new programming languages
  • Our goal: transform a for loop into a foreach loop

Example: data and research question

citibike nyc

Goal: predict arrival volume to inform management of bike stations

  • 7 busiest locations from May 2014
  • response: # of arrivals each hour of every day in the month
  • covariates: hour of the day and whether the day is a weekend
locations

busiest 7 stations

Fitting GLMs and extracting prediction error

the Goldilocks approach

fit all

Use K-fold cross validation to compare prediction error.

Make our K-fold test sets (and implicitly, our training sets)

load(url("http://www.stat.colostate.edu/~scharfh/CSP_parallel/data/arrivals_subset.RData"))
K <- 50
N <- dim(arrivals.sub)[1]

## for convenience kill off 8 observations (we have 5208) and make cv test sets
set.seed(1985)
discarded <- sample(1:N, size = 8)
cv.test.sets <- matrix(sample((1:N)[-discarded], size = N - 8), ncol = K)

For each fold, fit the model…

library(splines)
lq.loss <- function(y, y.hat, q = 1) {(abs(y - y.hat))^q}
get.errs <- function(test.set = NULL, discarded = NULL, q = 1) {
    sml.glm <- glm(arrivals ~
                   bs(hour, degree = 4)
                   + weekend
                   + as.factor(id),
                   data = arrivals.sub[-c(discarded, test.set), ],
                   family = "poisson")
    med.glm <- glm(arrivals ~
                   bs(hour, degree = 4)*weekend
                   + as.factor(id),
                   data = arrivals.sub[-c(discarded, test.set), ],
                   family = "poisson")
    big.glm <- glm(arrivals ~
                   bs(hour, degree = 4)*weekend*as.factor(id),
                   data = arrivals.sub[-c(discarded, test.set), ],
                   family = "poisson")

…and get the training error

    sml.err <- mean(lq.loss(predict(object = sml.glm,
                                    newdata = arrivals.sub[test.set, -7],
                                    type = "response"),
                            arrivals.sub[test.set, 7],
                            q = q))
    med.err <- mean(lq.loss(predict(object = med.glm,
                                    newdata = arrivals.sub[test.set, -7],
                                    type = "response"),
                            arrivals.sub[test.set, 7],
                            q = q))
    big.err <- mean(lq.loss(predict(object = big.glm,
                                    newdata = arrivals.sub[test.set, -7],
                                    type = "response"),
                            arrivals.sub[test.set, 7],
                            q = q))
    return(c(sml.err, med.err, big.err))
}

K-fold CV with a for loop

Using a naive for loop, we could implement this as:

err.for <- NULL
system.time(
    for (i in 1:K) {
        err.for <- cbind(err.for, get.errs(test.set = cv.test.sets[, i],
                                           discarded = discarded,
                                           q = 1))
        }
    )
##    user  system elapsed 
##  25.133   1.113  26.319

K-fold CV with an apply function

If you're good with apply() functions you might upgrade to

system.time(
    err.apply <- sapply(X = 1:K, 
                        FUN = function(i) {
                            get.errs(test.set = cv.test.sets[, i],
                                     discarded = discarded,
                                     q = 1)
                            }
                        )
    )
##    user  system elapsed 
##  24.280   1.095  25.488

Both of these assume a single processor architecture. We want to chop the job into halves, fourths, etc. and use the whole computer!

K-fold CV with a foreach loop

library(foreach)
library(doParallel)
detectCores()
registerDoParallel(cl = 2)
system.time(
    err.foreach <- foreach(i=1:K,
                           .inorder = FALSE,
                           .combine = "cbind",
                           .packages = "splines") %dopar% {
                               get.errs(test.set = cv.test.sets[, i],
                                        discarded = discarded,
                                        q = 1)
                               }
    )
##    user  system elapsed 
##   0.041   0.015  14.356

The breakdown

Components of a foreach loop

  • registerDoParallel(cl = 2)
  • %___%
    • %do% performs the calculations in order and uses only one processor.
    • %dopar% is what we generally wish to use.
    • %dorng% will be discussed later, and is required for procedures that use randomly generated numbers.
    • %:% can be used for nesting loops, which we'll see in an example at the end of the tutorial.

foreach() is a function

Arguments

  • .combine can take on the intuitive values 'c', 'cbind', or '+' as well as more complex functions and tells foreach how to combine the outputs from each iteration. The default is to return a list.
  • .inorder is a TRUE/FALSE argument. In general, it is better to change this from the default of TRUE to FALSE whenever possible.

Notice that unlike apply() functions, the foreach takes an expression (in braces after %dopar%) instead of a function.

Results

error densities

Additional topics

Iterators: smart memory use

Here is our same example with an iterator instead of a list.

library(iterators)
registerDoParallel(cl = 2)
system.time(
    err.foreach.iter <- foreach(x = iter(cv.test.sets, by = "col"),
                                .inorder = FALSE,
                                .combine = "cbind",
                                .packages = "splines") %dopar% {
                                    get.errs(test.set = x,
                                             discarded = discarded,
                                             q = 1)
                                   }
    )
##    user  system elapsed 
##  40.037   2.090  14.322

Random numbers

There are many ways to implement doRNG, the two easiest of which are:

library(doRNG)
registerDoParallel(cl = 2)
foo <- foreach(x = 1:10, 
               .options.RNG = 1985,
               .combine = 'c') %dorng% {
                   rnorm(1)
                   }

and

registerDoParallel(cl = 2)
registerDoRNG(seed = 1985)
bar <- foreach(x = 1:10,
               .combine = 'c') %dopar% {
                   rnorm(1)
                   }

Random numbers

Note that this gives reproducible results!

sum(foo != bar) 
## [1] 0

Packages that support foreach

Some packages come with parallel functionality built in via foreach.

  • glmnet
  • gam
  • ggmcmc
  • plyr
  • Many others: see reverse suggests on foreach's CRAN page.

Exercise

A summary statistic

The following calculation wouldn't typically require parallelization because it isn't a huge task, however we use it for practice's sake.

Suppose we wish to figure out which day in May had the most combined arrivals across these seven stations. Here's a function to get started, you're welcome to scrap it for your own, or use it and fill in the gaps in what follows.

sum.arrivals <- function(date = NULL,
                         id = NULL){
    sum.arr <- sum(arrivals.sub$arrivals[arrivals.sub$date == date &
                                         arrivals.sub$id == id])
    return(sum.arr)
}

Fix up the following code

The following code needs some help. When nesting foreach loops (see this vignette), we need to use %:%. Fill in the missing parts to make the code work.

library(doParallel)
registerDoParallel(cl = _)
system.time(
    busiest <- foreach(date = ___,
                       ._______ = FALSE,
                       .combine = ___,
                       .packages = ______,) %:%
                           foreach(id = unique(arrivals.sub$id),
                                   .inorder = _____,
                                   .combine = ___,
                                   .packages = ______,) %_____% {
                                       sum.arrivals(date = date,
                                                    id = id)
                                       }
    )
which(busiest==max(busiest))    

One answer

registerDoParallel(cl = 4)
system.time(
    busiest <- foreach(date = 1:31,
                       .inorder = FALSE,
                       .combine = 'c') %:%
                           foreach(id = unique(arrivals.sub$id),
                                   .inorder = FALSE,
                                   .combine = '+', .packages = 'splines') %dopar% {
                                       sum.arrivals(date = date,
                                                    id = id)
                                   })
##    user  system elapsed 
##   0.157   0.023   0.206
which(busiest==max(busiest))    
## [1] 20

Too much overhead

busiest.for <- rep(0, 31)
system.time(
    for(date in 1:31){
        for(id in unique(arrivals.sub$id)){
            busiest.for[date] <- sum(busiest.for[date],
                                     sum.arrivals(date = date, id = id))
        }
    }
    )
##    user  system elapsed 
##   0.049   0.004   0.053
which(busiest.for==max(busiest.for))    
## [1] 20

Thanks!

parallel puppies

References