This section should teach participants:
- What Hadoop is
- Current R/Hadoop integrations
- When to use R with Hadoop (guidelines)
- How to use R with Hadoop (lab)
Key ideas: enables distributed computing; open source; widely used
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage…
Source: Apache Software Foundation - What is Apache Hadoop?
Two of Hadoop's main features are particularly relevant to this talk:
Feature | Problem it solves |
---|---|
Distributed storage | How do we easily store and access large datasets? |
Distributed/batch computing | How do we quickly run analyses on large datasets? |
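As a taste of the storage side, here is a minimal sketch that browses HDFS from R using the RHadoop rhdfs package (an assumption of this sketch; it is not used in the lab below), with an illustrative path:

# Browse the distributed file system from an R session
Sys.setenv(HADOOP_CMD='/usr/bin/hadoop')  # adjust to your cluster's hadoop binary
library(rhdfs)
hdfs.init()   # connect to HDFS
hdfs.ls('/')  # list the files Hadoop has distributed across the cluster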
Distributed computing is analogous to parallel computing:
Parallel - multiple processors on one machine run the code
Distributed - multiple computers, each with its own processors, run the code
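The parallel case is built into base R; a minimal sketch using the parallel package (the toy computation and core count are illustrative, and mclapply assumes a Unix-like machine):

# Parallel: several processor cores on ONE machine run the code
library(parallel)
squares = mclapply(1:1000, function(x) x^2, mc.cores = 4)
# Distributed: the same idea spread across MANY machines,
# which is what Hadoop's MapReduce provides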
1. Hadoop links together servers to form a (storage + computing) cluster.
2. Creates a distributed file system, HDFS, which splits large data files into smaller blocks that are stored on servers across the cluster.
3. Uses the MapReduce programming model to implement distributed (i.e., parallel) computing.
MapReduce programs run in three main stages (Map, Shuffle, Reduce); the steps below trace a job end to end, and a code sketch follows the list.
3.a. Users upload Map and Reduce analysis code to Hadoop.
3.b. Hadoop distributes the Map code to the servers that hold the data. These servers run local analyses that extract and group data.
3.c. Hadoop merges extracted data on one or more separate servers. These servers run the Reduce code that computes grouped data summaries.
3.d. Hadoop stores analytic results in its distributed file system, HDFS, on the server(s) that ran the Reduce code.
3.e. Analysts can retrieve these results for review or follow-on analysis.
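To make steps 3.a-3.e concrete, here is a minimal sketch of the classic word-count job written with the rmr2 package (used in the lab below); the input path is illustrative:

library(rmr2)
# Map (3.b): runs on the servers holding the data; emits (word, 1) pairs
wc.map = function(k, lines) keyval(unlist(strsplit(lines, ' ')), 1)
# Reduce (3.c): runs on separate servers; sums the counts for each word
wc.reduce = function(word, counts) keyval(word, sum(counts))
# 3.a/3.d: upload the code and run the job; results land in HDFS
wc.result = mapreduce(input = '/some/input.txt', input.format = 'text',
                      map = wc.map, reduce = wc.reduce)
# 3.e: retrieve the word counts for review
from.dfs(wc.result)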
"Everything else"
Project | Sponsors/Maintainers |
---|---|
RHadoop | RevolutionAnalytics |
RHIPE | tesseradata |
Use R with Hadoop when your computing needs align with the natural strengths of both technologies.
Evaluate alignment with the following factors:
Factor | Mantra | Guideline |
---|---|---|
R's natural strength | Use R for statistical computing | Consider integrating when your project can be solved using code available in R, or when it is not easily solved in other languages |
Hadoop's natural strength | Use Hadoop for distributed storage & batch computing | Consider integrating when your problem requires lots of storage or when it could benefit from parallelization |
Coding effort | Work smart, not hard | R and Hadoop are tools, not panaceas. Consider not integrating if your problem is easier to solve with other tools |
Processing time | Work smart, not hard (2) | Although some problems benefit from parallelization, consider not integrating if the gains are negligible; skipping the integration keeps your project simpler |
Scenario | Use R/Hadoop? | Why? | Example |
---|---|---|---|
Analyzing small data stored in Hadoop | Y | R can quickly download the data and analyze it locally | Want to analyze summary datasets derived from MapReduce jobs run in Hadoop |
Extracting complex features from large data stored in Hadoop | Y | R has more built-in and contributed functions that analyze data than many standard programming languages | R is a natural language to use to write an algorithm or classifier that extracts information about objects contained in images |
Applying prediction and classification models to datasets | Y | R is better at modeling than many standard programming languages | Using a logistic regression model to generate predictions in a large dataset |
Implementing an "iteration-based" machine learning algorithm | Maybe | Other languages may be faster than R for the per-iteration work, and Hadoop reads and writes a lot of data to disk between iterations; other "big data" tools, like Spark (and SparkR), are designed for speed in these scenarios because they work in memory | Training a k-means classifier or a logistic regression on a large dataset |
Simple pre-processing of large data stored in Hadoop | N | Standard programming languages are much faster than R at executing many basic text and image processing tasks | Pre-processing Twitter tweets for use in a natural language processing project |
A car insurance company launched a small pilot study to evaluate a new program they are considering offering to all of their customers. At the end of the study the participants were asked whether they would like to stay enrolled in the offering.
The company would like to use the participants' demographic information and their feedback to help predict whether the program can be profitable if offered to all customers.
For marketing purposes, they are additionally interested in knowing whether the program is especially popular with specific subsets of their customers.
Business questions:
- Will the program be profitable if offered to all customers?
- Is the program especially popular with specific subsets of customers?
ssh cloudera@192.168.1.105   (password: cloudera)
R
library(rmr2)  # RHadoop's MapReduce interface for R
# Tell rmr2 where to find the Hadoop binaries and the streaming jar
Sys.setenv(HADOOP_CMD='/usr/bin/hadoop')
Sys.setenv(HADOOP_STREAMING='/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar')
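Before loading the study data, a quick round trip (the usual rmr2 smoke test; the values are arbitrary) confirms the setup works:

# Write a small vector to HDFS, square it with a trivial MapReduce job,
# and read the result back; an error here points to a setup problem
small.ints = to.dfs(1:10)
from.dfs(mapreduce(input = small.ints,
                   map = function(k, v) keyval(v, v^2)))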
# Read the pilot study results and their column names from HDFS
pilot.path = '/CSP/data/insurance/pilotProgram_results.csv'
pilot = from.dfs(pilot.path, format='csv')
pilot.data = pilot$val
colnames.path = '/CSP/data/insurance/columnNames.csv'
data.colnames = from.dfs(colnames.path, format='csv')
colnames(pilot.data) = t(data.colnames$val)
# Fit a logistic regression for the probability a participant stays enrolled
pilot.fitted = glm(stay ~ ., binomial, pilot.data)
# Simplify the model with stepwise AIC-based variable selection
library(MASS)
pilot.fitted.reduced = stepAIC(pilot.fitted)
summary(pilot.fitted.reduced)
# Record the predictor names and the factor levels seen in the pilot data,
# so the mapper can flag levels the model has not seen
predictor.names = t(data.colnames$val)[-1]
predictor.levels = lapply(pilot.data, levels)[-1]
predictors.count = length(predictor.levels)
not.null = function(x) { !is.null(x) }
columns.forFactors = (1:predictors.count)[sapply(predictor.levels, not.null)]
# Map: score each customer with the reduced model and emit
# (region + gender, predicted to stay?) pairs
prediction.mapper = function(key, customer) {
  colnames(customer) = predictor.names
  # Factor levels unseen in the pilot data become NA
  for (factorCol in columns.forFactors) {
    unseenLevels = which(!(customer[, factorCol] %in% predictor.levels[[factorCol]]))
    customer[unseenLevels, factorCol] = NA
  }
  customer.pred = plogis(predict.glm(pilot.fitted.reduced, customer))
  keyval(paste(as.character(customer$region), as.character(customer$gender)),
         ifelse(customer.pred > .5, 1, 0))
}
# Reduce: for each region/gender group, compute the fraction of
# customers predicted to stay enrolled
counting.reducer = function(k, vv) {
  keyval(k, sum(vv, na.rm=T) / length(vv))
}
# Run the job: Hadoop applies the mapper to the full customer file
# and the reducer to the grouped predictions
mapred.result = mapreduce(
  input = '/CSP/data/insurance/customer_profiles.csv',
  input.format = 'csv',
  output.format = 'csv',
  map = prediction.mapper,
  reduce = counting.reducer
)
# Retrieve the per-group results from HDFS into the R session
mapred.result.data = from.dfs(mapred.result(), format='csv')
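The returned object separates keys from values; a quick peek at the values (the exact structure can vary with the csv format) shows the per-group rates:

# Each row pairs a region/gender group with the fraction of its
# customers predicted to stay enrolled
head(mapred.result.data$val)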
- Use a different regression/classification model
- Explore the other functions in the rmr2 package
- Examine the structure of the mapred.result object
Practical computing requires balancing computing speed, programming efficiency, and personal comfort. R/Hadoop integrations offer more ways to achieve that balance.
R/Hadoop integrations give practitioners opportunities to use strengths of both technologies.
R/Hadoop integrations give R programmers access to "non-R" technologies.
- Parallelization fully uses computational resources and saves time
- Parts of many statistical analyses are parallelizable
- Many options for parallelizing R code - use what works for you!