Moving from for to foreach

May 15, 2015

Why?

Embarassingly parallel tasks

parallel processes:

Bootstrapping
Cross-validation
Simulating independent random variables (dorng)

non-parallel processes:

MCMC algorithms
Several types of model selection (e.g.: step() or the LARS algorithm for LASSO)

What to do

Options

Changing from a for loop to one of the apply() functions can help, but still doesn't use multiple processors.
Use the parallel package (thanks, Miranda!).
Don't use R.
Use the foreach package! (Analytics and Weston 2014)

Why foreach?

Make use of our whole computer
Without having to invest large amounts of time in learning new programming languages
Our goal: transform a for loop into a foreach loop

Example: data and research question

citibike nyc

Goal: predict arrival volume to inform management of bike stations

7 busiest locations from May 2014
response: # of arrivals each hour of every day in the month
covariates: hour of the day and whether the day is a weekend

busiest 7 stations

Fitting GLMs and extracting prediction error

the Goldilocks approach

Use K-fold cross validation to compare prediction error.

Make our K-fold test sets (and implicitly, our training sets)

load(url("http://www.stat.colostate.edu/~scharfh/CSP_parallel/data/arrivals_subset.RData"))
K <- 50
N <- dim(arrivals.sub)[1]

## for convenience kill off 8 observations (we have 5208) and make cv test sets
set.seed(1985)
discarded <- sample(1:N, size = 8)
cv.test.sets <- matrix(sample((1:N)[-discarded], size = N - 8), ncol = K)

For each fold, fit the model…

library(splines)
lq.loss <- function(y, y.hat, q = 1) {(abs(y - y.hat))^q}
get.errs <- function(test.set = NULL, discarded = NULL, q = 1) {
    sml.glm <- glm(arrivals ~
                   bs(hour, degree = 4)
                   + weekend
                   + as.factor(id),
                   data = arrivals.sub[-c(discarded, test.set), ],
                   family = "poisson")
    med.glm <- glm(arrivals ~
                   bs(hour, degree = 4)*weekend
                   + as.factor(id),
                   data = arrivals.sub[-c(discarded, test.set), ],
                   family = "poisson")
    big.glm <- glm(arrivals ~
                   bs(hour, degree = 4)*weekend*as.factor(id),
                   data = arrivals.sub[-c(discarded, test.set), ],
                   family = "poisson")

…and get the training error

    sml.err <- mean(lq.loss(predict(object = sml.glm,
                                    newdata = arrivals.sub[test.set, -7],
                                    type = "response"),
                            arrivals.sub[test.set, 7],
                            q = q))
    med.err <- mean(lq.loss(predict(object = med.glm,
                                    newdata = arrivals.sub[test.set, -7],
                                    type = "response"),
                            arrivals.sub[test.set, 7],
                            q = q))
    big.err <- mean(lq.loss(predict(object = big.glm,
                                    newdata = arrivals.sub[test.set, -7],
                                    type = "response"),
                            arrivals.sub[test.set, 7],
                            q = q))
    return(c(sml.err, med.err, big.err))
}

K-fold CV with a for loop

Using a naive for loop, we could implement this as:

err.for <- NULL
system.time(
    for (i in 1:K) {
        err.for <- cbind(err.for, get.errs(test.set = cv.test.sets[, i],
                                           discarded = discarded,
                                           q = 1))
        }
    )

##    user  system elapsed 
##  25.133   1.113  26.319

K-fold CV with an apply function

If you're good with apply() functions you might upgrade to

system.time(
    err.apply <- sapply(X = 1:K, 
                        FUN = function(i) {
                            get.errs(test.set = cv.test.sets[, i],
                                     discarded = discarded,
                                     q = 1)
                            }
                        )
    )

##    user  system elapsed 
##  24.280   1.095  25.488

Both of these assume a single processor architecture. We want to chop the job into halves, fourths, etc. and use the whole computer!

K-fold CV with a foreach loop

library(foreach)
library(doParallel)
detectCores()

registerDoParallel(cl = 2)
system.time(
    err.foreach <- foreach(i=1:K,
                           .inorder = FALSE,
                           .combine = "cbind",
                           .packages = "splines") %dopar% {
                               get.errs(test.set = cv.test.sets[, i],
                                        discarded = discarded,
                                        q = 1)
                               }
    )

##    user  system elapsed 
##   0.041   0.015  14.356

The breakdown

Components of a foreach loop

registerDoParallel(cl = 2)
%___%
- %do% performs the calculations in order and uses only one processor.
- %dopar% is what we generally wish to use.
- %dorng% will be discussed later, and is required for procedures that use randomly generated numbers.
- %:% can be used for nesting loops, which we'll see in an example at the end of the tutorial.

`foreach()` is a function

Arguments

.combine can take on the intuitive values 'c', 'cbind', or '+' as well as more complex functions and tells foreach how to combine the outputs from each iteration. The default is to return a list.
.inorder is a TRUE/FALSE argument. In general, it is better to change this from the default of TRUE to FALSE whenever possible.

Notice that unlike `apply()` functions, the `foreach` takes an expression (in braces after `%dopar%`) instead of a function.

Results

error densities

Additional topics

Iterators: smart memory use

Here is our same example with an iterator instead of a list.

library(iterators)
registerDoParallel(cl = 2)
system.time(
    err.foreach.iter <- foreach(x = iter(cv.test.sets, by = "col"),
                                .inorder = FALSE,
                                .combine = "cbind",
                                .packages = "splines") %dopar% {
                                    get.errs(test.set = x,
                                             discarded = discarded,
                                             q = 1)
                                   }
    )

##    user  system elapsed 
##  40.037   2.090  14.322

Random numbers

There are many ways to implement doRNG, the two easiest of which are:

library(doRNG)
registerDoParallel(cl = 2)
foo <- foreach(x = 1:10, 
               .options.RNG = 1985,
               .combine = 'c') %dorng% {
                   rnorm(1)
                   }

and

registerDoParallel(cl = 2)
registerDoRNG(seed = 1985)
bar <- foreach(x = 1:10,
               .combine = 'c') %dopar% {
                   rnorm(1)
                   }

Random numbers

Note that this gives reproducible results!

sum(foo != bar)

## [1] 0

Packages that support foreach

Some packages come with parallel functionality built in via foreach.

glmnet
gam
ggmcmc
plyr
Many others: see reverse suggests on foreach's CRAN page.

Exercise

A summary statistic

The following calculation wouldn't typically require parallelization because it isn't a huge task, however we use it for practice's sake.

Suppose we wish to figure out which day in May had the most combined arrivals across these seven stations. Here's a function to get started, you're welcome to scrap it for your own, or use it and fill in the gaps in what follows.

sum.arrivals <- function(date = NULL,
                         id = NULL){
    sum.arr <- sum(arrivals.sub$arrivals[arrivals.sub$date == date &
                                         arrivals.sub$id == id])
    return(sum.arr)
}

Fix up the following code

The following code needs some help. When nesting foreach loops (see this vignette), we need to use %:%. Fill in the missing parts to make the code work.

library(doParallel)
registerDoParallel(cl = _)
system.time(
    busiest <- foreach(date = ___,
                       ._______ = FALSE,
                       .combine = ___,
                       .packages = ______,) %:%
                           foreach(id = unique(arrivals.sub$id),
                                   .inorder = _____,
                                   .combine = ___,
                                   .packages = ______,) %_____% {
                                       sum.arrivals(date = date,
                                                    id = id)
                                       }
    )
which(busiest==max(busiest))

One answer

registerDoParallel(cl = 4)
system.time(
    busiest <- foreach(date = 1:31,
                       .inorder = FALSE,
                       .combine = 'c') %:%
                           foreach(id = unique(arrivals.sub$id),
                                   .inorder = FALSE,
                                   .combine = '+', .packages = 'splines') %dopar% {
                                       sum.arrivals(date = date,
                                                    id = id)
                                   })

##    user  system elapsed 
##   0.157   0.023   0.206

which(busiest==max(busiest))

## [1] 20

Too much overhead

busiest.for <- rep(0, 31)
system.time(
    for(date in 1:31){
        for(id in unique(arrivals.sub$id)){
            busiest.for[date] <- sum(busiest.for[date],
                                     sum.arrivals(date = date, id = id))
        }
    }
    )

##    user  system elapsed 
##   0.049   0.004   0.053

which(busiest.for==max(busiest.for))

## [1] 20

Thanks!

parallel puppies

References

Data and a neat tool

citibike system data

menu meters

Analytics, Revolution, and Steve Weston. 2014. Foreach: Foreach Looping Construct for R. http://CRAN.R-project.org/package=foreach.

Why?

Embarassingly parallel tasks

parallel processes:

non-parallel processes:

What to do

Options

Why foreach?

Example: data and research question

citibike nyc

Fitting GLMs and extracting prediction error

the Goldilocks approach

Use K-fold cross validation to compare prediction error.

For each fold, fit the model…

…and get the training error

K-fold CV with a for loop

K-fold CV with an apply function

K-fold CV with a foreach loop

The breakdown

Components of a foreach loop

foreach() is a function

Arguments

Notice that unlike apply() functions, the foreach takes an expression (in braces after %dopar%) instead of a function.

Results

Additional topics

Iterators: smart memory use

Random numbers

Random numbers

Packages that support foreach

Exercise

A summary statistic

Fix up the following code

One answer

Too much overhead

Thanks!

References

Other tutorials

Data and a neat tool

`foreach()` is a function

Notice that unlike `apply()` functions, the `foreach` takes an expression (in braces after `%dopar%`) instead of a function.