19 December 2014

Learning goals

This section should teach participants:

  1. What Hadoop is
  2. Current R/Hadoop integrations
  3. When to use R with Hadoop (guidelines)
  4. How to use R with Hadoop (lab)

Brief introduction to Hadoop

Hadoop

Key ideas: enables distributed computing; open source; widely used

  • older, more mature "cloud computing" technology


The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage…

Source: Apache Software Foundation - What is Apache Hadoop?


Key features

Two of Hadoop's main features are particularly relevant to this talk:

Feature                        Problem it solves
Distributed storage            How do we easily store and access large datasets?
Distributed/batch computing    How do we quickly run analyses on large datasets?

Distributed vs. parallel computing

Distributed computing is analogous to parallel computing

Parallel - multiple processors run code

Distributed - multiple computers run code

Overview

1.  Hadoop links together servers to form a (storage + computing) cluster.

2.  It creates a distributed file system, HDFS, which splits large data files into smaller pieces that are stored on servers across the cluster.

3.  It uses the MapReduce programming model to implement
     distributed (i.e., parallel) computing.

Important interlude!

Size matters

  • Parallelization occurs when data is stored on multiple computers
  • Files are typically split into 64 MB or 128 MB chunks
  • Small files won't parallelize - a 10 MB file fits in a single chunk, so only one server processes it


MapReduce programs have 3 main stages

  1. Map: Apply a function to extract and group data
  2. Shuffle/sort: Sort the function's outputs
  3. Reduce: Compute summaries of the grouped outputs
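
These three stages can be mimicked in ordinary R before touching a cluster. The toy sketch below counts records by group; the data and names are invented purely for illustration:

  # Toy data: one record per customer, with a region label (invented)
  records = data.frame(region = sample(c('east', 'west'), 20, replace = TRUE),
                       stringsAsFactors = FALSE)

  # Map: emit a (key, value) pair for each record
  mapped = lapply(seq_len(nrow(records)),
                  function(i) list(key = records$region[i], value = 1))

  # Shuffle/sort: group the emitted values by key
  grouped = split(sapply(mapped, `[[`, 'value'),
                  sapply(mapped, `[[`, 'key'))

  # Reduce: summarize each group (here, count its records)
  reduced = lapply(grouped, sum)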

Overview (MapReduce)

3.a.  Users upload Map and Reduce analysis code to Hadoop.

3.b.  Hadoop distributes the Map code to the servers that hold the data. These servers run local analyses that extract and group data.

3.c.  Hadoop merges the extracted data on one or more separate servers. These servers run the Reduce code that computes grouped data summaries.

3.d.  Hadoop stores the analytic results in its distributed file system, HDFS, on the server(s) that ran the Reduce code.

3.e.  Analysts can retrieve these results for review or follow-on analysis.
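
The whole flow in steps 3.a-3.e can be previewed with the rmr2 package (introduced in the next section) using its "local" backend, which simulates MapReduce without a cluster. A minimal sketch; the grouping data here is invented for illustration:

  library(rmr2)
  # Run MapReduce logic locally, without a Hadoop cluster
  rmr.options(backend = 'local')

  # 3.a: "upload" toy data to the (simulated) distributed file system
  grp = sample(c('a', 'b'), 100, replace = TRUE)
  input = to.dfs(keyval(grp, rep(1, length(grp))))

  # 3.b-3.d: Map passes records through; Reduce counts each group
  result = mapreduce(input = input,
                     map = function(k, v) keyval(k, v),
                     reduce = function(k, vv) keyval(k, sum(vv)))

  # 3.e: retrieve the stored results
  from.dfs(result)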

Hadoop - Natural strengths

  • Extract data
  • Group data
  • Compute group summaries

Hadoop - Natural "weaknesses"

"Everything else"

  • Iterative algorithms
  • Multi-step workflows

R/Hadoop integrations

Key integration projects

Project   Sponsor/Maintainer
RHadoop   Revolution Analytics
RHIPE     tesseradata


Integration purposes

  • Let people use Hadoop to execute R code
  • Let people use R to access data stored in Hadoop
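
The second purpose is the easiest to see in action: the RHadoop stack's rhdfs package lets R browse HDFS directly. A minimal sketch, assuming rhdfs is installed and Hadoop is configured:

  library(rhdfs)
  hdfs.init()     # connect R to the cluster's HDFS
  hdfs.ls('/')    # list the top-level HDFS directories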

Consider integrating R and Hadoop when…

Your computing needs align with the natural strengths of R and Hadoop

Evaluate alignment with the following factors:

Factor: R's natural strength
Mantra: Use R for statistical computing
Guideline: Consider integrating when your project can be solved using code available in R, or when it is not easily solved in other languages.

Factor: Hadoop's natural strength
Mantra: Use Hadoop for distributed storage & batch computing
Guideline: Consider integrating when your problem requires lots of storage or could benefit from parallelization.

Factor: Coding effort
Mantra: Work smart, not hard
Guideline: R and Hadoop are tools, not cure-alls. Consider not integrating if your problem is easier to solve with other tools.

Factor: Processing time
Mantra: Work smart, not hard (2)
Guideline: Some problems can benefit from parallelization, but consider not integrating if the gains are negligible; skipping the integration reduces the complexity of your project.

Example applications

Scenario: Analyzing small data stored in Hadoop
Use R/Hadoop? Yes
Why? R can quickly download the data and analyze it locally.
Example: Analyzing summary datasets derived from MapReduce jobs run in Hadoop.

Scenario: Extracting complex features from large data stored in Hadoop
Use R/Hadoop? Yes
Why? R has more built-in and contributed functions for analyzing data than many standard programming languages.
Example: R is a natural language for writing an algorithm or classifier that extracts information about objects contained in images.

Scenario: Applying prediction and classification models to datasets
Use R/Hadoop? Yes
Why? R is better at modeling than many standard programming languages.
Example: Using a logistic regression model to generate predictions in a large dataset.

Scenario: Implementing an "iteration-based" machine learning algorithm
Use R/Hadoop? Maybe
Why? (1) Other languages may be faster than R for your analysis. (2) Hadoop reads and writes a lot of data to disk; other "big data" tools, like Spark (and SparkR), are designed for speed in these scenarios by working in memory.
Example: Training a k-means classifier or a logistic regression on a large dataset.

Scenario: Simple pre-processing of large data stored in Hadoop
Use R/Hadoop? No
Why? Standard programming languages are much faster than R at many basic text and image processing tasks.
Example: Pre-processing Twitter tweets for use in a natural language processing project.

Lab - Use R/Hadoop

Get ready

  1. Download lab materials from web (links on handout)
    • R script
    • Lab instructions
    • Windows only: PuTTY or other ssh client
    • Optional: presentation slides
  2. Connect to TheCantina wireless network

Lab goals

  1. Present an example problem suited to R/Hadoop integration
  2. Connect to Hadoop via R
  3. Work through a basic integration
  4. Modify the analysis on your own

Lab problem (simplified)

  • A car insurance company launched a small pilot study to evaluate a new program they are considering offering to all of their customers. At the end of the study the participants were asked whether or not they would like to stay enrolled in the offering.

  • The company would like to use the participants' demographic information and their feedback to help predict whether the program can be profitable if offered to all customers.

  • For marketing purposes, they are additionally interested in knowing if the program is very popular with specific subsets of their customers.

Business questions:

  • Keep the program?
  • Focus marketing to specific groups?

Analytic approach (simplified)

  1. In R, build a logistic regression model from the pilot study data
  2. PARALLEL: Use RHadoop to apply the regression model, predicting which customers would want to stay in the program (map), and to combine the results (reduce)
  3. Use RHadoop to retrieve the summary data and conduct follow-on analyses in R or build charts and tables to help present the data

This will demonstrate…

  • Applying prediction and classification models to Hadoop data (customers)
  • Analyzing small data stored in Hadoop (summary data)

Connect to Hadoop (via command line)

  • Log in to a server that can submit MapReduce jobs to Hadoop

    ssh cloudera@192.168.1.105    (password: cloudera)

  • Start R

    R

  • Within R…
    • Load key RHadoop library

      library(rmr2)

    • Tell RHadoop where to find Hadoop commands

      Sys.setenv(HADOOP_CMD='/usr/bin/hadoop')
      Sys.setenv(HADOOP_STREAMING='/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar')
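
    • Optional sanity check (not in the original lab steps): a small round trip through HDFS confirms that rmr2 can reach Hadoop

      # Write a small object to HDFS, then read it back
      check = to.dfs(1:10)
      from.dfs(check)    # should return key-value pairs with values 1:10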

Monitor Hadoop (via web browser)

Load and fit training data

  • Load data from pilot program

    # Read the pilot results from HDFS; from.dfs() returns a key-value object
    pilot.path = '/CSP/data/insurance/pilotProgram_results.csv'
    pilot = from.dfs(pilot.path, format='csv')
    pilot.data = pilot$val

    # Column names are stored separately; attach them to the data
    colnames.path = '/CSP/data/insurance/columnNames.csv'
    data.colnames = from.dfs(colnames.path, format='csv')
    colnames(pilot.data) = t(data.colnames$val)

  • Fit logistic regression model to the data

    # Model the stay/leave response as a function of all other columns
    pilot.fitted = glm(stay ~ ., binomial, pilot.data)

  • Reduce model and view results

    # stepAIC() drops uninformative predictors via stepwise model selection
    library(MASS)
    pilot.fitted.reduced = stepAIC(pilot.fitted)
    summary(pilot.fitted.reduced)

MapReduce: Apply model to data

  • Write mapper

    # Record the predictor names and factor levels seen in the pilot data
    predictor.names = t(data.colnames$val)[-1]
    predictor.levels = lapply(pilot.data, levels)[-1]
    predictors.count = length(predictor.levels)

    # Identify the factor columns (non-factor columns have NULL levels)
    not.null = function(x) { !is.null(x) }
    columns.forFactors = (1:predictors.count)[sapply(predictor.levels, not.null)]

    prediction.mapper = function(key, customer) {
          colnames(customer) = predictor.names
          # Set factor levels unseen in the pilot data to NA so prediction won't fail
          for(factorCol in columns.forFactors) {
            unseenLevels = which(!(customer[,factorCol] %in% predictor.levels[[factorCol]]))
            customer[unseenLevels, factorCol] = NA
          }
          # Convert the model's log-odds predictions to probabilities
          customer.pred = plogis(predict.glm(pilot.fitted.reduced, customer))
          # Key by region and gender; value is 1 if the customer is predicted to stay
          keyval(paste(as.character(customer$region), as.character(customer$gender)),
                 ifelse(customer.pred > .5, 1, 0))
    }

MapReduce: Summarize predictions

  • Write reducer

    # For each (region, gender) key, compute the share of customers predicted to stay
    counting.reducer = function(k, vv) { keyval(k, sum(vv, na.rm=TRUE)/length(vv)) }

  • Set up and execute MapReduce job

    mapred.result = mapreduce(
          input = '/CSP/data/insurance/customer_profiles.csv',
          input.format = 'csv',
          output.format = 'csv',
          map = prediction.mapper,
          reduce = counting.reducer
    )

  • Retrieve analytic results

    # mapred.result() returns the HDFS location of the job output
    mapred.result.data = from.dfs(mapred.result(), format='csv')
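
  • The retrieved object is a key-value structure; a quick optional look (not in the lab instructions):

    # keys() and values() unpack a key-value object
    head(keys(mapred.result.data))
    head(values(mapred.result.data))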

Variations

  • Use a different regression/classification model

  • Compute more detailed summaries of customer predictions
    • E.g., Group by gender, age, and region (sketched after this list)
    • E.g., Work with raw estimates and compute averages or variances instead of percentages
  • Identify model weaknesses
    • What factor levels are present in customer records but not in pilot records?
    • Which or how many customers have these types of levels?
  • Explore functions and objects in rmr2
    • Look at mapred.result
    • Write data to HDFS in different formats
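
As a starting point for the grouping variation flagged above, only the mapper's key needs to change. A sketch, assuming the customer data has an age column ('age' is a hypothetical column name):

  # Variation of prediction.mapper that keys on region, gender, and age
  detailed.mapper = function(key, customer) {
        colnames(customer) = predictor.names
        for(factorCol in columns.forFactors) {
          unseenLevels = which(!(customer[,factorCol] %in% predictor.levels[[factorCol]]))
          customer[unseenLevels, factorCol] = NA
        }
        customer.pred = plogis(predict.glm(pilot.fitted.reduced, customer))
        # 'age' is a hypothetical column name, invented for illustration
        keyval(paste(as.character(customer$region), as.character(customer$gender),
                     as.character(customer$age)),
               ifelse(customer.pred > .5, 1, 0))
  }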

Summary

Stay efficient, stay practical

  • Practical computing requires balancing computing speed, programming efficiency, and personal comfort. R/Hadoop integrations offer more ways to achieve balance.

  • R/Hadoop integrations give practitioners opportunities to use strengths of both technologies.

  • R/Hadoop integrations give R programmers access to "non-R" technologies.

Topics for further reading

Acknowledgements

Logos/Graphics:

  • Apache Software Foundation
  • Amazon Web Services
  • HortonWorks

Parallel computing in R

Learn once, use often

  • Parallelization fully uses computational resources and saves time

  • Parts of many statistical analyses are parallelizable

  • Many options for parallelizing R code - use what works for you!
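
For example, the parallel package that ships with R parallelizes apply-style loops with very little new code. A minimal sketch (mclapply() forks processes, so it uses multiple cores on Linux/Mac; on Windows, parLapply() plays the same role):

  library(parallel)

  # A toy task: fit a regression to each of 100 bootstrap samples of mtcars
  fit.one = function(i) {
    boot = mtcars[sample(nrow(mtcars), replace = TRUE), ]
    coef(lm(mpg ~ wt, data = boot))
  }

  # Spread the 100 fits across 4 cores instead of running them one at a time
  fits = mclapply(1:100, fit.one, mc.cores = 4)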