This section should teach participants:
- What Hadoop is
- Current R/Hadoop integrations
- When to use R with Hadoop (guidelines)
- How to use R with Hadoop (lab)
Key ideas: enables distributed computing; open source; widely used
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage…
Source: Apache Software Foundation - What is Apache Hadoop?
Two of Hadoop's main features are particularly relevant to this talk:
| Feature | Problem it solves |
| --- | --- |
| Distributed storage | How do we easily store and access large datasets? |
| Distributed/batch computing | How do we quickly run analyses on large datasets? |
Distributed computing is analogous to parallel computing:
- Parallel: multiple processors on one machine run the code
- Distributed: multiple computers run the code
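To make the parallel half of this analogy concrete, here is a minimal sketch using base R's parallel package: the work is split into pieces, each local worker processes its piece, and the partial results are combined. Hadoop applies the same split/apply/combine idea across machines rather than cores. (The two-worker cluster and the toy sum are purely illustrative.)

```r
# Minimal sketch of parallel (single-machine) computing in base R.
# Hadoop's distributed computing extends the same idea across many machines.
library(parallel)

cl <- makeCluster(2)                          # two local worker processes
pieces <- split(1:1e6, rep(1:2, each = 5e5))  # split the work into two chunks
partial <- parLapply(cl, pieces, sum)         # each worker sums its own chunk
total <- Reduce(`+`, partial)                 # combine the partial results
stopCluster(cl)

total  # 500000500000, the same answer as a single-threaded sum(1:1e6)
```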
1. Hadoop links together servers to form a (storage + computing) cluster.
2. Creates a distributed file system, HDFS, that splits large data files into smaller blocks stored on servers across the cluster (see the sketch after this list).
3. Uses the MapReduce programming model to implement distributed (i.e., parallel) computing.
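As a hedged illustration of step 2, the rmr2 package from the RHadoop project lets an R session move objects in and out of HDFS. This is only a sketch: it assumes rmr2 is installed and that the HADOOP_CMD and HADOOP_STREAMING environment variables point at a working Hadoop installation.

```r
# Sketch: storing and retrieving an R object in HDFS via rmr2 (RHadoop project).
# Assumes rmr2 is installed and the Hadoop environment is already configured.
library(rmr2)

hdfs_handle <- to.dfs(mtcars)    # write an R object into HDFS; Hadoop splits
                                 # and replicates it across the cluster
result <- from.dfs(hdfs_handle)  # read it back as a key/value structure
head(values(result))             # values() extracts the data portion
```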
Size matters
MapReduce programs have three main stages: Map, Shuffle (merge), and Reduce. A typical job proceeds as follows (an R sketch follows the list):
3.a. Users upload Map and Reduce analysis code to Hadoop.
3.b. Hadoop distributes the Map code to the servers that hold the data. These servers run local analyses that extract and group data.
3.c. Hadoop merges extracted data on one or more separate servers. These servers run the Reduce code that computes grouped data summaries.
3.d. Hadoop stores analytic results in its distributed file system, HDFS, on the server(s) that ran the Reduce code.
3.e. Analysts can retrieve these results for review or follow-on analysis.
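Putting steps 3.a through 3.e together, the workflow can be sketched from R with the RHadoop rmr2 package. Treat this as an assumption-laden sketch rather than a definitive recipe: it presumes rmr2 is installed and Hadoop is configured, and it uses a toy counting task for illustration.

```r
# Sketch of the Map/Reduce workflow above, written with rmr2 (RHadoop project).
library(rmr2)

# 3.a: the "analysis code" is just two R functions, a map and a reduce.
# This toy job counts how often each value occurs in a vector stored in HDFS.
input <- to.dfs(rbinom(1000, size = 10, prob = 0.5))

out <- mapreduce(
  input  = input,
  # 3.b: the map runs on the servers holding the data; it emits (value, 1) pairs
  map    = function(k, v) keyval(v, 1),
  # 3.c: the reduce receives all the 1s for a given key and sums them
  reduce = function(k, counts) keyval(k, sum(counts)))

# 3.d / 3.e: the output stays in HDFS; from.dfs() pulls it back for review.
from.dfs(out)
```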
"Everything else"