1 why?

1.1 embarrassingly parallel tasks

These are computational tasks that consist of many separate calculations, each of which can be executed independently of the others. Some common statistical examples of embarrassingly parallel processes:

  • bootstrapping
  • cross-validation
  • simulating independent random variables (see the doRNG package)

In contrast, some sequential or non-parallel processes:

  • MCMC algorithms
  • several types of model selection (e.g., step() or the LARS algorithm for the LASSO)
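To make the contrast concrete, here is a minimal sketch of a sequential computation: a random-walk Metropolis sampler. The standard normal target, the proposal scale, and the variable names are illustrative assumptions, not part of the original material. Each iteration reads the previous state, so the iterations cannot simply be handed to separate processors.

```r
set.seed(1)
n_iter <- 1000
chain <- numeric(n_iter)   # chain[1] is the starting value, 0

for (i in 2:n_iter) {
  # propose a move from the *previous* state: this is the dependence
  proposal <- chain[i - 1] + rnorm(1, sd = 0.5)
  # log acceptance ratio for a standard normal target
  log_ratio <- dnorm(proposal, log = TRUE) - dnorm(chain[i - 1], log = TRUE)
  # accept the proposal, or stay where we were
  chain[i] <- if (log(runif(1)) < log_ratio) proposal else chain[i - 1]
}
```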

A for loop whose iterations do not depend on one another is wasteful if we have multiple processors available, because it still runs one iteration at a time. Perhaps even worse, the time cost of using such an approach can put some useful statistical tools beyond our reach!
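As a small illustration, here is a sketch of a bootstrap written as an ordinary for loop. The data are simulated and the names (x, n_boot, boot_means) are made up for this example. No iteration uses the result of any other, yet R still works through them one at a time on a single processor.

```r
set.seed(42)
x <- rnorm(100, mean = 5)      # simulated data, for illustration only
n_boot <- 500
boot_means <- numeric(n_boot)

for (b in seq_len(n_boot)) {
  # resample with replacement; nothing here uses another iteration's result
  resample <- sample(x, replace = TRUE)
  boot_means[b] <- mean(resample)
}

boot_se <- sd(boot_means)      # bootstrap estimate of the standard error of the mean
```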

1.2 options

  • Changing from a for loop to one of the apply() functions can help, but still doesn’t use multiple processors.
  • Use the parallel package (thanks, Miranda!).
  • Don’t use R.
  • Use the foreach package! (Analytics and Weston 2014)
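For a rough feel of the first two options, here is a sketch that computes bootstrap means with sapply() and then with the parallel package. The simulated data, the helper one_boot_mean(), and the choice of 2 workers are assumptions made for illustration only.

```r
library(parallel)   # ships with R

set.seed(42)
x <- rnorm(100, mean = 5)   # simulated data

# hypothetical helper: one bootstrap replicate of the mean
one_boot_mean <- function(b, data) mean(sample(data, replace = TRUE))

# an apply-style function: tidier than a for loop, but still one processor
means_apply <- sapply(seq_len(500), one_boot_mean, data = x)

# the parallel package, here with a 2-worker cluster
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 42)   # separate random number stream per worker
means_par <- unlist(parLapply(cl, seq_len(500), one_boot_mean, data = x))
stopCluster(cl)
```

Note the extra bookkeeping in the second version: we create a cluster, set up its random number streams, and shut it down when we are done.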

1.3 why foreach?

We would like to find a way to make use of our whole computer, and make valuable tasks like bootstrapping available, but without having to invest large amounts of time in learning new programming languages. Enter foreach, which keeps the structure of a for loop, but allows us to drop two key assumptions:

  • sequentiality
  • single processor architecture

Our goal: We will begin with a simple chunk of R code involving a for loop and transform it into a foreach loop. Along the way, we’ll take a look at the equivalent computation done with an apply() function, and see that using foreach and multiple processors outperforms this.
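As a preview of where we are headed, here is a minimal sketch of a bootstrap written as a foreach loop. It assumes the foreach and doParallel packages are installed. doParallel is only one possible backend, and the simulated data and the choice of 2 workers are illustrative.

```r
library(foreach)
library(doParallel)   # one of several possible backends; an assumption here

set.seed(42)
x <- rnorm(100, mean = 5)   # simulated data

cl <- parallel::makeCluster(2)   # 2 workers; adjust to your machine
registerDoParallel(cl)

# the for-loop version would be:
#   for (b in 1:500) boot_means[b] <- mean(sample(x, replace = TRUE))
boot_means <- foreach(b = 1:500, .combine = c) %dopar% {
  mean(sample(x, replace = TRUE))
}

parallel::stopCluster(cl)
registerDoSEQ()   # go back to the sequential backend
```

The body of the loop is unchanged from the for-loop version. What changes is that foreach returns its results (combined here with c) instead of filling in a vector, and %dopar% sends iterations to the registered workers. The random draws made on the workers are not controlled by set.seed() in this sketch.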

2 example: data and research question

We are going to look at data from the New York City bikeshare program Citibike.

One of the costliest parts of operating a bike share program comes from the finite capacity of the bicycle stations. A station can only hold so many bicycles, and a full (empty) station means customers cannot drop off (pick up) a bike. Thus, managers are forced to use trucks to manually redistribute bicycles.

We want to find a model that offers good prediction, with the hope that this will inform our plans for future station locations/sizes. For this example, we start with a few plausible models and use K-fold cross-validation to decide which one to use.
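To fix ideas, here is a minimal sketch of K-fold cross-validation for choosing between two candidate models. It uses simulated stand-in data, not the Citibike data, and the variable names (temp, hour, arrivals), the two formulas, and K = 5 are hypothetical choices made for illustration.

```r
set.seed(7)
n <- 200
# simulated stand-in data: NOT the Citibike data; variable names are hypothetical
dat <- data.frame(temp = runif(n, 0, 30),
                  hour = sample(0:23, n, replace = TRUE))
dat$arrivals <- 5 + 0.4 * dat$temp + rnorm(n, sd = 2)

# candidate models to compare
models <- list(
  temp_only = arrivals ~ temp,
  temp_hour = arrivals ~ temp + hour
)

K <- 5
fold <- sample(rep(seq_len(K), length.out = n))   # random fold label for each row

cv_mse <- sapply(models, function(f) {
  fold_mse <- sapply(seq_len(K), function(k) {
    fit  <- lm(f, data = dat[fold != k, ])            # fit on the other K - 1 folds
    pred <- predict(fit, newdata = dat[fold == k, ])  # predict the held-out fold
    mean((dat$arrivals[fold == k] - pred)^2)          # held-out mean squared error
  })
  mean(fold_mse)   # average over the K folds
})

best <- names(which.min(cv_mse))   # model with the smallest cross-validated error
```

Each fold is fit and scored without reference to the others, which is what makes cross-validation a natural candidate for parallel execution.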

2.1 locations of our 7 sites