19 December 2014

Learning goals

This section should teach participants:

  1. What Hadoop is
  2. Current R/Hadoop integrations
  3. When to use R with Hadoop (guidelines)
  4. How to use R with Hadoop (lab)

Brief introduction to Hadoop

Hadoop

Key ideas: enables distributed computing; open source; widely used

  • older, more mature "cloud computing" technology


The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage…

Source: Apache Software Foundation - What is Apache Hadoop?


Key features

Two of Hadoop's main features are particularly relevant to this talk:

Feature                        Problem it solves
Distributed storage            How do we easily store and access large datasets?
Distributed/batch computing    How do we quickly run analyses on large datasets?

Distributed vs. parallel computing

Distributed computing is analogous to parallel computing

Parallel - multiple processors run code

Distributed - multiple computers run code

Overview

1.  Hadoop links together servers to form a (storage + computing) cluster.

2.  It creates a distributed file system, HDFS, which splits large data files into smaller pieces that are stored on servers across the cluster.

3.  It uses the MapReduce programming model to implement
     distributed (i.e., parallel) computing.

Important interlude!

Size matters

  • Parallelization occurs when data is stored on multiple computers
  • Files are typically split into 64 MB or 128 MB chunks
  • Small files won't parallelize - a 10 MB file fits in a single chunk, so only one server processes it


MapReduce programs have 3 main stages

  1. Map: Apply a function to extract and group data
  2. Shuffle/sort: Sort the function's outputs
  3. Reduce: Compute summaries of the grouped outputs
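
These three stages can be mimicked in ordinary R before touching a cluster. The toy sketch below counts records by group; the data and names are invented purely for illustration:

  # Toy data: one record per customer, with a region label (invented)
  records = data.frame(region = sample(c('east', 'west'), 20, replace = TRUE),
                       stringsAsFactors = FALSE)

  # Map: emit a (key, value) pair for each record
  mapped = lapply(seq_len(nrow(records)),
                  function(i) list(key = records$region[i], value = 1))

  # Shuffle/sort: group the emitted values by key
  grouped = split(sapply(mapped, `[[`, 'value'),
                  sapply(mapped, `[[`, 'key'))

  # Reduce: summarize each group (here, count its records)
  reduced = lapply(grouped, sum)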

Overview (MapReduce)

3.a.  Users upload Map and Reduce analysis code to Hadoop.

3.b.  Hadoop distributes the Map code to the servers that hold the data. These servers run local analyses that extract and group data.

3.c.  Hadoop merges the extracted data on one or more separate servers. These servers run the Reduce code that computes grouped data summaries.

3.d.  Hadoop stores the analytic results in its distributed file system, HDFS, on the server(s) that ran the Reduce code.

3.e.  Analysts can retrieve these results for review or follow-on analysis.
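
The whole flow in steps 3.a-3.e can be previewed with the rmr2 package (introduced in the next section) using its "local" backend, which simulates MapReduce without a cluster. A minimal sketch; the grouping data here is invented for illustration:

  library(rmr2)
  # Run MapReduce logic locally, without a Hadoop cluster
  rmr.options(backend = 'local')

  # 3.a: "upload" toy data to the (simulated) distributed file system
  grp = sample(c('a', 'b'), 100, replace = TRUE)
  input = to.dfs(keyval(grp, rep(1, length(grp))))

  # 3.b-3.d: Map passes records through; Reduce counts each group
  result = mapreduce(input = input,
                     map = function(k, v) keyval(k, v),
                     reduce = function(k, vv) keyval(k, sum(vv)))

  # 3.e: retrieve the stored results
  from.dfs(result)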

Hadoop - Natural strengths

  • Extract data
  • Group data
  • Compute group summaries

Hadoop - Natural "weaknesses"

"Everything else"

  • Iterative algorithms
  • Multi-step workflows

R/Hadoop integrations

Key integration projects

Project   Sponsor/Maintainer
RHadoop   Revolution Analytics
RHIPE     tesseradata


Integration purposes

  • Let people use Hadoop to execute R code
  • Let people use R to access data stored in Hadoop
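
The second purpose is the easiest to see in action: the RHadoop stack's rhdfs package lets R browse HDFS directly. A minimal sketch, assuming rhdfs is installed and Hadoop is configured:

  library(rhdfs)
  hdfs.init()     # connect R to the cluster's HDFS
  hdfs.ls('/')    # list the top-level HDFS directories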

Consider integrating R and Hadoop when…

Your computing needs align with the natural strengths of R and Hadoop

Evaluate alignment with the following factors:

Factor: R's natural strength
Mantra: Use R for statistical computing
Guideline: Consider integrating when your project can be solved using code available in R, or when it is not easily solved in other languages.

Factor: Hadoop's natural strength
Mantra: Use Hadoop for distributed storage & batch computing
Guideline: Consider integrating when your problem requires lots of storage or could benefit from parallelization.

Factor: Coding effort
Mantra: Work smart, not hard
Guideline: R and Hadoop are tools, not cure-alls. Consider not integrating if your problem is easier to solve with other tools.

Factor: Processing time
Mantra: Work smart, not hard (2)
Guideline: Some problems can benefit from parallelization, but consider not integrating if the gains are negligible; skipping the integration reduces the complexity of your project.

Example applications

Scenario: Analyzing small data stored in Hadoop
Use R/Hadoop? Yes
Why? R can quickly download the data and analyze it locally.
Example: Analyzing summary datasets derived from MapReduce jobs run in Hadoop.

Scenario: Extracting complex features from large data stored in Hadoop
Use R/Hadoop? Yes
Why? R has more built-in and contributed functions for analyzing data than many standard programming languages.
Example: R is a natural language for writing an algorithm or classifier that extracts information about objects contained in images.

Scenario: Applying prediction and classification models to datasets
Use R/Hadoop? Yes
Why? R is better at modeling than many standard programming languages.
Example: Using a logistic regression model to generate predictions in a large dataset.

Scenario: Implementing an "iteration-based" machine learning algorithm
Use R/Hadoop? Maybe
Why? (1) Other languages may be faster than R for your analysis. (2) Hadoop reads and writes a lot of data to disk; other "big data" tools, like Spark (and SparkR), are designed for speed in these scenarios by working in memory.
Example: Training a k-means classifier or a logistic regression on a large dataset.

Scenario: Simple pre-processing of large data stored in Hadoop
Use R/Hadoop? No
Why? Standard programming languages are much faster than R at many basic text and image processing tasks.
Example: Pre-processing Twitter tweets for use in a natural language processing project.

Lab - Use R/Hadoop

Get ready

  1. Download lab materials from web (links on handout)
    • R script
    • Lab instructions
    • Windows only: PuTTY or other ssh client
    • Optional: presentation slides
  2. Connect to TheCantina wireless network

Lab goals

  1. Present an example problem suited to R/Hadoop integration
  2. Connect to Hadoop via R
  3. Work through a basic integration
  4. Modify the analysis on your own

Lab problem (simplified)

  • A car insurance company launched a small pilot study to evaluate a new program they are considering offering to all of their customers. At the end of the study the participants were asked whether or not they would like to stay enrolled in the offering.

  • The company would like to use the participants' demographic information and their feedback to help predict whether the program can be profitable if offered to all customers.

  • For marketing purposes, they are additionally interested in knowing if the program is very popular with specific subsets of their customers.

Business questions:

  • Keep the program?
  • Focus marketing to specific groups?

Analytic approach (simplified)

  1. In R, build a logistic regression model from the pilot study data
  2. PARALLEL: Use RHadoop to apply the regression model, predicting which customers would want to stay in the program (map), and to combine the results (reduce)
  3. Use RHadoop to retrieve the summary data and conduct follow-on analyses in R or build charts and tables to help present the data

This will demonstrate…

  • Applying prediction and classification models to Hadoop data (customers)
  • Analyzing small data stored in Hadoop (summary data)

Connect to Hadoop (via command line)

  • Log in to a server that can submit MapReduce jobs to Hadoop

    ssh cloudera@192.168.1.105    (password: cloudera)

  • Start R

    R

  • Within R…
    • Load key RHadoop library

      library(rmr2)

    • Tell RHadoop where to find Hadoop commands

      Sys.setenv(HADOOP_CMD='/usr/bin/hadoop')
      Sys.setenv(HADOOP_STREAMING='/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar')
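
    • Optional sanity check (not in the original lab steps): a small round trip through HDFS confirms that rmr2 can reach Hadoop

      # Write a small object to HDFS, then read it back
      check = to.dfs(1:10)
      from.dfs(check)    # should return key-value pairs with values 1:10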

Monitor Hadoop (via web browser)

Load and fit training data

  • Load data from pilot program

    # Read the pilot results from HDFS; from.dfs() returns a key-value object
    pilot.path = '/CSP/data/insurance/pilotProgram_results.csv'
    pilot = from.dfs(pilot.path, format='csv')
    pilot.data = pilot$val

    # Column names are stored separately; attach them to the data
    colnames.path = '/CSP/data/insurance/columnNames.csv'
    data.colnames = from.dfs(colnames.path, format='csv')
    colnames(pilot.data) = t(data.colnames$val)

  • Fit logistic regression model to the data

    # Model the stay/leave response as a function of all other columns
    pilot.fitted = glm(stay ~ ., binomial, pilot.data)

  • Reduce model and view results

    # stepAIC() drops uninformative predictors via stepwise model selection
    library(MASS)
    pilot.fitted.reduced = stepAIC(pilot.fitted)
    summary(pilot.fitted.reduced)

MapReduce: Apply model to data

  • Write mapper

    # Record the predictor names and factor levels seen in the pilot data
    predictor.names = t(data.colnames$val)[-1]
    predictor.levels = lapply(pilot.data, levels)[-1]
    predictors.count = length(predictor.levels)

    # Identify the factor columns (non-factor columns have NULL levels)
    not.null = function(x) { !is.null(x) }
    columns.forFactors = (1:predictors.count)[sapply(predictor.levels, not.null)]

    prediction.mapper = function(key, customer) {
          colnames(customer) = predictor.names
          # Set factor levels unseen in the pilot data to NA so prediction won't fail
          for(factorCol in columns.forFactors) {
            unseenLevels = which(!(customer[,factorCol] %in% predictor.levels[[factorCol]]))
            customer[unseenLevels, factorCol] = NA
          }
          # Convert the model's log-odds predictions to probabilities
          customer.pred = plogis(predict.glm(pilot.fitted.reduced, customer))
          # Key by region and gender; value is 1 if the customer is predicted to stay
          keyval(paste(as.character(customer$region), as.character(customer$gender)),
                 ifelse(customer.pred > .5, 1, 0))
    }

MapReduce: Summarize predictions

  • Write reducer

    # For each (region, gender) key, compute the share of customers predicted to stay
    counting.reducer = function(k, vv) { keyval(k, sum(vv, na.rm=TRUE)/length(vv)) }

  • Set up and execute MapReduce job

    mapred.result = mapreduce(
          input = '/CSP/data/insurance/customer_profiles.csv',
          input.format = 'csv',
          output.format = 'csv',
          map = prediction.mapper,
          reduce = counting.reducer
    )

  • Retrieve analytic results

    # mapred.result() returns the HDFS location of the job output
    mapred.result.data = from.dfs(mapred.result(), format='csv')
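
  • The retrieved object is a key-value structure; a quick optional look (not in the lab instructions):

    # keys() and values() unpack a key-value object
    head(keys(mapred.result.data))
    head(values(mapred.result.data))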

Variations

  • Use a different regression/classification model

  • Compute more detailed summaries of customer predictions
    • E.g., Group by gender, age, and region (sketched after this list)
    • E.g., Work with raw estimates and compute averages or variances instead of percentages
  • Identify model weaknesses
    • What factor levels are present in customer records but not in pilot records?
    • Which or how many customers have these types of levels?
  • Explore functions and objects in rmr2
    • Look at mapred.result
    • Write data to HDFS in different formats
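
As a starting point for the grouping variation flagged above, only the mapper's key needs to change. A sketch, assuming the customer data has an age column ('age' is a hypothetical column name):

  # Variation of prediction.mapper that keys on region, gender, and age
  detailed.mapper = function(key, customer) {
        colnames(customer) = predictor.names
        for(factorCol in columns.forFactors) {
          unseenLevels = which(!(customer[,factorCol] %in% predictor.levels[[factorCol]]))
          customer[unseenLevels, factorCol] = NA
        }
        customer.pred = plogis(predict.glm(pilot.fitted.reduced, customer))
        # 'age' is a hypothetical column name, invented for illustration
        keyval(paste(as.character(customer$region), as.character(customer$gender),
                     as.character(customer$age)),
               ifelse(customer.pred > .5, 1, 0))
  }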

Summary

Stay efficient, stay practical

  • Practical computing requires balancing computing speed, programming efficiency, and personal comfort. R/Hadoop integrations offer more ways to achieve balance.

  • R/Hadoop integrations give practitioners opportunities to use strengths of both technologies.

  • R/Hadoop integrations give R programmers access to "non-R" technologies.

Topics for further reading

Acknowledgements

Logos/Graphics:

  • Apache Software Foundation
  • Amazon Web Services
  • HortonWorks

Parallel computing in R

Learn once, use often

  • Parallelization fully uses computational resources and saves time

  • Parts of many statistical analyses are parallelizable

  • Many options for parallelizing R code - use what works for you!
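
For example, the parallel package that ships with R parallelizes apply-style loops with very little new code. A minimal sketch (mclapply() forks processes, so it uses multiple cores on Linux/Mac; on Windows, parLapply() plays the same role):

  library(parallel)

  # A toy task: fit a regression to each of 100 bootstrap samples of mtcars
  fit.one = function(i) {
    boot = mtcars[sample(nrow(mtcars), replace = TRUE), ]
    coef(lm(mpg ~ wt, data = boot))
  }

  # Spread the 100 fits across 4 cores instead of running them one at a time
  fits = mclapply(1:100, fit.one, mc.cores = 4)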