<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Acweber2</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Acweber2"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Acweber2"/>
	<updated>2026-05-17T04:14:57Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/OSS_M1706_Tracking_intermittent_test_failures_over_time&amp;diff=107983</id>
		<title>CSC/ECE 517 Spring 2017/OSS M1706 Tracking intermittent test failures over time</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/OSS_M1706_Tracking_intermittent_test_failures_over_time&amp;diff=107983"/>
		<updated>2017-04-07T01:09:26Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Subsequent Steps (Round 2) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
This wiki provides details on new functionality programmed for the Servo OSS project.&lt;br /&gt;
&lt;br /&gt;
===Background===&lt;br /&gt;
&amp;quot;[https://github.com/servo/servo/wiki/Design Servo] is a project to develop a new Web browser engine. Our goal is to create an architecture that takes advantage of parallelism at many levels while eliminating common sources of bugs and security vulnerabilities associated with incorrect memory management and data races.&amp;quot; Servo can be used through Browser.html, embedded in a website, or natively in Mozilla Firefox. It is designed to load web pages more efficiently and more securely. &lt;br /&gt;
&lt;br /&gt;
===Motivation===&lt;br /&gt;
This project is a request from the Servo OSS project to reduce the impact intermittent test failures have on the software. The [https://github.com/servo/servo/wiki/Tracking-intermittent-failures-over-time-project request] made is for a [http://flask.pocoo.org/docs/0.12/ Flask] service using [https://en.wikipedia.org/wiki/Python_(programming_language) Python 2.7]. The intermittent test failure tracker stores information regarding a test that fails intermittently and also provides means to quickly query for tests that have failed.&lt;br /&gt;
&lt;br /&gt;
===Tasks===&lt;br /&gt;
The initial steps for the intermittent test failure tracker (for the OSS project) include:&lt;br /&gt;
* Build a Flask service &lt;br /&gt;
* Use a JSON file to store information&lt;br /&gt;
* Record required parameters: Test file, platform, test machine (builder), and related GitHub pull request number&lt;br /&gt;
* Query the stored results given a particular test file name&lt;br /&gt;
* Use the known intermittent issue tracker as an example of a simple Flask server&lt;br /&gt;
&lt;br /&gt;
Subsequent steps (for the final project) include:&lt;br /&gt;
* Add the ability to query the service by a date range, to find out which failures occurred most often&lt;br /&gt;
* Build an HTML front-end to the service that queries using JS and reports the results&lt;br /&gt;
** Links to GitHub&lt;br /&gt;
** Sorting&lt;br /&gt;
* Make [https://github.com/servo/servo/blob/master/python/servo/testing_commands.py#L508-L574 filter-intermittents] command record a separate failure for each intermittent failure encountered&lt;br /&gt;
* Propagate the required information for recording failures in [https://github.com/servo/saltfs/issues/597 saltfs]&lt;br /&gt;
&lt;br /&gt;
== Design ==&lt;br /&gt;
&lt;br /&gt;
===Design Pattern===&lt;br /&gt;
&lt;br /&gt;
Servo and this project's code follow the [https://en.wikipedia.org/wiki/Service_layers_pattern Service Layer] design pattern. This pattern breaks functionality up into smaller &amp;quot;services&amp;quot; and applies each service at the topmost &amp;quot;layer&amp;quot; of the project that needs it.&lt;br /&gt;
&lt;br /&gt;
===Application Flow===&lt;br /&gt;
&lt;br /&gt;
==== Saving a Test ====&lt;br /&gt;
The Servo build agent calls a webhook (a way for an app to provide other applications with real-time information) inside the test tracker. The webhook then calls a handler that contains any business logic necessary to transform the request. Finally, the handler persists the request to the database, in this case a JSON file. This flow can be seen in the diagram below, followed by a short code sketch.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                    +---------------------------------------------+&lt;br /&gt;
                    |       Intermittent Test Failure Tracker     |&lt;br /&gt;
                    |                                             |&lt;br /&gt;
+--------------+    | +-----------+      +---------+    +------+  |&lt;br /&gt;
|              |    | |           |      |         |    |      |  |      +--------+&lt;br /&gt;
|    Servo     |    | |           |      |         |    |      |  |      |        |&lt;br /&gt;
|    Build     +------&amp;gt;  webhook  +------&amp;gt; handler +----&amp;gt;  db  +---------&amp;gt;  json  |&lt;br /&gt;
|    Server    |    | |           |      |         |    |      |  |      |  file  |&lt;br /&gt;
|              |    | |           |      |         |    |      |  |      |        |&lt;br /&gt;
+--------------+    | +-----------+      +---------+    +------+  |      +--------+&lt;br /&gt;
                    |                                             |&lt;br /&gt;
                    +---------------------------------------------+&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
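Below is a minimal sketch, using Flask and TinyDB, of how this webhook-handler-db flow could look. The route path, the handler and file names, and the field handling are illustrative assumptions, not the exact code in the project repository.&lt;br /&gt;
&lt;br /&gt;
  # Illustrative webhook, handler, and db flow; names are assumptions&lt;br /&gt;
  from flask import Flask, request, jsonify&lt;br /&gt;
  from tinydb import TinyDB&lt;br /&gt;
  &lt;br /&gt;
  app = Flask(__name__)&lt;br /&gt;
  db = TinyDB('intermittent_failures.json')    # the JSON file acting as the db&lt;br /&gt;
  &lt;br /&gt;
  FIELDS = ('test_file', 'platform', 'builder', 'number', 'fail_date')&lt;br /&gt;
  &lt;br /&gt;
  def handle_record(payload):&lt;br /&gt;
      # handler layer: keep only the fields defined in the data model&lt;br /&gt;
      return dict((k, payload.get(k)) for k in FIELDS)&lt;br /&gt;
  &lt;br /&gt;
  @app.route('/failures', methods=['POST'])&lt;br /&gt;
  def record_failure():&lt;br /&gt;
      record = handle_record(request.get_json())   # webhook calls the handler&lt;br /&gt;
      db.insert(record)                            # handler persists to the db&lt;br /&gt;
      return jsonify(record), 201&lt;br /&gt;
&lt;br /&gt;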
==Subsequent Steps (Round 2)==&lt;br /&gt;
The first request is to add the ability to query the service by a date range, to find out which failures were most frequent. Since the fail_date is included in the addition call as an ISO date string, we should be able to build a function that filters on this date using standard date functions and a range supplied by the user. This will require a new query function that takes the range boundaries as parameters.&lt;br /&gt;
&lt;br /&gt;
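As a rough sketch of such a query function (query_by_date_range is a hypothetical name, not code from the repository), assuming the TinyDB datastore described under Implementation; ISO date strings compare correctly as plain strings, so no date parsing is strictly required:&lt;br /&gt;
&lt;br /&gt;
  # Hypothetical date-range query over the TinyDB store&lt;br /&gt;
  # ISO date strings (e.g. '2017-04-06') sort lexicographically in&lt;br /&gt;
  # chronological order, so plain string comparison is sufficient here.&lt;br /&gt;
  from tinydb import TinyDB, Query&lt;br /&gt;
  &lt;br /&gt;
  def query_by_date_range(db, start_date, end_date):&lt;br /&gt;
      Failure = Query()&lt;br /&gt;
      return db.search((Failure.fail_date &amp;gt;= start_date) &amp;amp;&lt;br /&gt;
                       (Failure.fail_date &amp;lt;= end_date))&lt;br /&gt;
  &lt;br /&gt;
  # Example usage:&lt;br /&gt;
  # db = TinyDB('intermittent_failures.json')&lt;br /&gt;
  # recent = query_by_date_range(db, '2017-03-01', '2017-04-01')&lt;br /&gt;
&lt;br /&gt;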
The second request is to build an HTML front-end to the service that queries it using JS and reports the results in a useful manner (linking to GitHub, sorting, etc.). For this we should be able to repurpose the testing web pages that we built in the first round. Polishing these up and giving them the required JS request mechanism should suffice.&lt;br /&gt;
&lt;br /&gt;
The last two steps are the full integration of this product into the Servo pipeline and will require forking the Servo project on GitHub.&lt;br /&gt;
&lt;br /&gt;
In the given testing_commands module we have to make the filter-intermittents command record a separate failure for each intermittent failure encountered. This is the actual integration point where the Servo framework talks to this tracking system.&lt;br /&gt;
&lt;br /&gt;
The second integration into the Servo project for this tracker will be to propagate the required information for recording failures in saltfs. This will also require a testing setup for saltfs, or at least a mock setup that mimics the integration into saltfs.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
The implementation is driven entirely by the request; the Servo team clearly defines what the service should do and how it should be built.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Data model ===&lt;br /&gt;
The model for an intermittent test is defined mostly by the request with a few additions to help with querying in later steps of the OSS request.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Name&lt;br /&gt;
! Type&lt;br /&gt;
! Description&lt;br /&gt;
|-&lt;br /&gt;
| test_file&lt;br /&gt;
| String&lt;br /&gt;
| Name of the intermittent test file &lt;br /&gt;
|-&lt;br /&gt;
| platform&lt;br /&gt;
| String&lt;br /&gt;
| Platform the test failed on&lt;br /&gt;
|-&lt;br /&gt;
| builder&lt;br /&gt;
| String&lt;br /&gt;
| The test machine (builder) the test failed on&lt;br /&gt;
|-&lt;br /&gt;
| number&lt;br /&gt;
| Integer&lt;br /&gt;
| The GitHub pull request number&lt;br /&gt;
|-&lt;br /&gt;
| fail_date&lt;br /&gt;
| ISO date (String)&lt;br /&gt;
| Date of the failure&lt;br /&gt;
|}&lt;br /&gt;
=== Datastore ===&lt;br /&gt;
To store the intermittent test failures, a library called [https://tinydb.readthedocs.io/en/latest/ TinyDB] is used. TinyDB is a pure-Python library that provides convenient [https://en.wikipedia.org/wiki/SQL SQL]-like query helpers around a [https://www.w3schools.com/js/js_json_syntax.asp JSON] file so it can more easily be used like a database. The format of the JSON file is simply an array of JSON objects, making the file easily human-readable.&lt;br /&gt;
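&lt;br /&gt;
For illustration only, storing and looking up one record shaped like the data model above might look as follows; the field values and file name are made up, while insert, Query, and search are standard TinyDB calls:&lt;br /&gt;
&lt;br /&gt;
  # Illustrative only: one record shaped like the data model, stored via TinyDB&lt;br /&gt;
  from tinydb import TinyDB, Query&lt;br /&gt;
  &lt;br /&gt;
  db = TinyDB('intermittent_failures.json')   # the JSON file acts as the database&lt;br /&gt;
  db.insert({&lt;br /&gt;
      'test_file': 'tests/wpt/example.html',  # made-up example values&lt;br /&gt;
      'platform': 'linux',&lt;br /&gt;
      'builder': 'linux-rel-css',&lt;br /&gt;
      'number': 12345,&lt;br /&gt;
      'fail_date': '2017-04-06'&lt;br /&gt;
  })&lt;br /&gt;
  &lt;br /&gt;
  # Query the stored results by test file name, as required by the OSS steps&lt;br /&gt;
  Failure = Query()&lt;br /&gt;
  matches = db.search(Failure.test_file == 'tests/wpt/example.html')&lt;br /&gt;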
&lt;br /&gt;
=== Flask Service ===&lt;br /&gt;
[http://flask.pocoo.org/ Flask] is a [https://en.wikipedia.org/wiki/Microservices microservice] framework written in Python. A Flask service is a REST (representational state transfer) API that maps URLs and HTTP verbs to Python functions. Some basic examples of Flask routes:&lt;br /&gt;
  &lt;br /&gt;
  @app.route('/')&lt;br /&gt;
  def index():&lt;br /&gt;
    return 'Index page'&lt;br /&gt;
  &lt;br /&gt;
  @app.route('/user/&amp;lt;username&amp;gt;')&lt;br /&gt;
  def show_user(username):&lt;br /&gt;
    return db.lookup(username)&lt;br /&gt;
&lt;br /&gt;
The first method returns 'Index page' at the root URL. The second method accepts a URL parameter after /user/ and returns the matching user from a database.&lt;br /&gt;
&lt;br /&gt;
== Test Plan ==&lt;br /&gt;
&lt;br /&gt;
=== Functional Testing ===&lt;br /&gt;
As a convenience to testers, this code base includes a set of [http://csc517oss.zachncst.com/ testing web applications], intended only to illustrate the project's functionality.&lt;br /&gt;
This simple set of forms allows a tester to exercise the functionality of the [https://en.wikipedia.org/wiki/Representational_state_transfer REST] endpoints without having to write any REST code.&lt;br /&gt;
The links on the page lead to demonstrations of the query and record handlers, as well as a display of the JSON file containing all the Intermittent Test Failure records.&lt;br /&gt;
All of these are usable for thorough integration testing.&lt;br /&gt;
&lt;br /&gt;
===Unit Testing===&lt;br /&gt;
The unit tests included in the code exercise the major functions of this system. The tests cover the addition of a record into the database, the removal of a record given a filename, the retrieval of a record, and the assertion that a record will not be added if any of the record parameters (test_file, platform, builder, number) is missing. All unit tests are in tests.py.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | Unit Test Summary&lt;br /&gt;
|-&lt;br /&gt;
! Test Purpose&lt;br /&gt;
! Function Tested&lt;br /&gt;
! Parameters&lt;br /&gt;
|-&lt;br /&gt;
| Add a record to a database &lt;br /&gt;
| db.add&lt;br /&gt;
| params[:self, :test_file, :platform, :builder, :number, :fail_date]&lt;br /&gt;
|-&lt;br /&gt;
| Delete a record from database&lt;br /&gt;
| db.remove&lt;br /&gt;
| params[:test_file]&lt;br /&gt;
|-&lt;br /&gt;
| Record a new Intermittent failure&lt;br /&gt;
| handlers.record&lt;br /&gt;
| params[:db, :test_file, :platform, :builder, :number]&lt;br /&gt;
|-&lt;br /&gt;
| Query the Intermittent failure records &lt;br /&gt;
| handlers.query&lt;br /&gt;
| params[:db, :test_file]&lt;br /&gt;
|-&lt;br /&gt;
| Record a new Intermittent failure with invalid values - 4 tests, each leaving one required input blank&lt;br /&gt;
| handlers.record&lt;br /&gt;
| params[:db, :test_file, :platform, :builder, :number]&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
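&lt;br /&gt;
For illustration, a minimal version of one of the invalid-value tests might look like the sketch below. It assumes handlers.record returns a falsy value when a required field is blank and that the arguments follow the parameter order in the table above; both are assumptions, not taken from tests.py.&lt;br /&gt;
&lt;br /&gt;
  # Hypothetical unit test: recording should fail when a required field is blank&lt;br /&gt;
  import unittest&lt;br /&gt;
  from tinydb import TinyDB&lt;br /&gt;
  from tinydb.storages import MemoryStorage&lt;br /&gt;
  import handlers&lt;br /&gt;
  &lt;br /&gt;
  class RecordValidationTest(unittest.TestCase):&lt;br /&gt;
      def test_blank_platform_is_rejected(self):&lt;br /&gt;
          db = TinyDB(storage=MemoryStorage)   # in-memory db keeps the test isolated&lt;br /&gt;
          result = handlers.record(db, 'example.html', '', 'linux-rel', 12345)&lt;br /&gt;
          self.assertFalse(result)             # assumed falsy return on invalid input&lt;br /&gt;
          self.assertEqual(len(db.all()), 0)   # nothing should have been persisted&lt;br /&gt;
  &lt;br /&gt;
  if __name__ == '__main__':&lt;br /&gt;
      unittest.main()&lt;br /&gt;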
&lt;br /&gt;
====Running Unit Tests and the App====&lt;br /&gt;
&lt;br /&gt;
Before attempting either of the following, clone the [https://github.com/adamw17/csc517ossproject repo].&lt;br /&gt;
&lt;br /&gt;
=====To Run Unit Tests=====&lt;br /&gt;
* In the cloned repo folder, use the command &amp;lt;code&amp;gt;python test.py&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====To Run The App Locally=====&lt;br /&gt;
* In the cloned repo folder, use the command &amp;lt;code&amp;gt;python -m flask_server&amp;lt;/code&amp;gt;&lt;br /&gt;
* To access the running app, go to http://localhost:5000&lt;br /&gt;
&lt;br /&gt;
== Submission/Pull Requests ==&lt;br /&gt;
&lt;br /&gt;
There is no Pull Request because Servo manager Josh Matthews requested that we start a new (non-branched) repository for this project. The work has been started in a new GitHub repo located [https://github.com/adamw17/csc517ossproject/tree/832969c1cf01d94be340731c744854c25fdbb441 here]. When Servo developers are ready, the project will be pulled into the Servo project on GitHub. In the interim, we shared our repo with Josh, whose reply was &amp;quot;this looks really great! Thanks for tackling it!&amp;quot;&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/OSS_M1706_Tracking_intermittent_test_failures_over_time&amp;diff=107936</id>
		<title>CSC/ECE 517 Spring 2017/OSS M1706 Tracking intermittent test failures over time</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/OSS_M1706_Tracking_intermittent_test_failures_over_time&amp;diff=107936"/>
		<updated>2017-04-06T02:29:42Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
This wiki provides details on new functionality programmed for the Servo OSS project.&lt;br /&gt;
&lt;br /&gt;
===Background===&lt;br /&gt;
&amp;quot;[https://github.com/servo/servo/wiki/Design Servo] is a project to develop a new Web browser engine. Our goal is to create an architecture that takes advantage of parallelism at many levels while eliminating common sources of bugs and security vulnerabilities associated with incorrect memory management and data races.&amp;quot; Servo can be used through Browser.html, embedded in a website, or natively in Mozilla Firefox. It is designed to load web pages more efficiently and more securely. &lt;br /&gt;
&lt;br /&gt;
===Motivation===&lt;br /&gt;
This project is a request from the Servo OSS project to reduce the impact intermittent test failures have on the software. The [https://github.com/servo/servo/wiki/Tracking-intermittent-failures-over-time-project request] made is for a [http://flask.pocoo.org/docs/0.12/ Flask] service using [https://en.wikipedia.org/wiki/Python_(programming_language) Python 2.7]. The intermittent test failure tracker stores information regarding a test that fails intermittently and also provides means to quickly query for tests that have failed.&lt;br /&gt;
&lt;br /&gt;
===Tasks===&lt;br /&gt;
The initial steps for the intermittent test failure tracker (for the OSS project) include:&lt;br /&gt;
* Build a Flask service &lt;br /&gt;
* Use a JSON file to store information&lt;br /&gt;
* Record required parameters: Test file, platform, test machine (builder), and related GitHub pull request number&lt;br /&gt;
* Query the stored results given a particular test file name&lt;br /&gt;
* Use the known intermittent issue tracker as an example of a simple Flask server&lt;br /&gt;
&lt;br /&gt;
Subsequent steps (for the final project) include:&lt;br /&gt;
* Add the ability to query the service by a date range, to find out which failures occurred most often&lt;br /&gt;
* Build an HTML front-end to the service that queries using JS and reports the results&lt;br /&gt;
** Links to GitHub&lt;br /&gt;
** Sorting&lt;br /&gt;
* Make [https://github.com/servo/servo/blob/master/python/servo/testing_commands.py#L508-L574 filter-intermittents] command record a separate failure for each intermittent failure encountered&lt;br /&gt;
* Propagate the required information for recording failures in [https://github.com/servo/saltfs/issues/597 saltfs]&lt;br /&gt;
&lt;br /&gt;
== Design ==&lt;br /&gt;
&lt;br /&gt;
===Design Pattern===&lt;br /&gt;
&lt;br /&gt;
The Servo and this project's code follow a [https://en.wikipedia.org/wiki/Service_layers_pattern Service Layer] design pattern. This design pattern breaks up functionality into smaller &amp;quot;services&amp;quot; and applies the services to the topmost &amp;quot;layer&amp;quot; of the project for which they are needed.&lt;br /&gt;
&lt;br /&gt;
===Application Flow===&lt;br /&gt;
&lt;br /&gt;
==== Saving a Test ====&lt;br /&gt;
The Servo build agent calls a webhook (a way for an app to provide other applications with real-time information) inside the test tracker. The webhook then calls a handler that contains any business logic necessary to transform the request. Finally the handler persists the request into the db, in this case a json file. This flow can be seen in the graph below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                    +---------------------------------------------+&lt;br /&gt;
                    |       Intermittent Test Failure Tracker     |&lt;br /&gt;
                    |                                             |&lt;br /&gt;
+--------------+    | +-----------+      +---------+    +------+  |&lt;br /&gt;
|              |    | |           |      |         |    |      |  |      +--------+&lt;br /&gt;
|    Servo     |    | |           |      |         |    |      |  |      |        |&lt;br /&gt;
|    Build     +------&amp;gt;  webhook  +------&amp;gt; handler +----&amp;gt;  db  +---------&amp;gt;  json  |&lt;br /&gt;
|    Server    |    | |           |      |         |    |      |  |      |  file  |&lt;br /&gt;
|              |    | |           |      |         |    |      |  |      |        |&lt;br /&gt;
+--------------+    | +-----------+      +---------+    +------+  |      +--------+&lt;br /&gt;
                    |                                             |&lt;br /&gt;
                    +---------------------------------------------+&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Subsequent Steps (Round 2)==&lt;br /&gt;
The first request is to add the ability to query the service by a date range, to find out which failures were most frequent.  Given the fail_date is included in the addition call as an ISO date string, we should be able to build a function to query this date using standard date functions and a range given from the user.&lt;br /&gt;
&lt;br /&gt;
Build an HTML front-end to the service that queries it using JS and reports the results in a useful manner (linking to GitHub, sorting, etc.). For this we should be able to repurpose the testing web pages that we built in the first round. Polishing these up and giving them the required JS request mechanism should suffice.&lt;br /&gt;
&lt;br /&gt;
These last two steps are the full integration of this product into the Servo pipeline.&lt;br /&gt;
&lt;br /&gt;
Make the filter-intermittents command record a separate failure for each intermittent failure encountered. This will require forking the Servo repo, since this is handled in a different part of the codebase than the actual service for tracking the intermittent failures.&lt;br /&gt;
&lt;br /&gt;
Propagate the required information for recording failures in saltfs. Again, this appears to be in the separate testing_commands file in the larger Servo repo.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
The implementation is driven entirely by the request; the Servo team clearly defines what the service should do and how it should be built.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Data model ===&lt;br /&gt;
The model for an intermittent test is defined mostly by the request with a few additions to help with querying in later steps of the OSS request.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Name&lt;br /&gt;
! Type&lt;br /&gt;
! Description&lt;br /&gt;
|-&lt;br /&gt;
| test_file&lt;br /&gt;
| String&lt;br /&gt;
| Name of the intermittent test file &lt;br /&gt;
|-&lt;br /&gt;
| platform&lt;br /&gt;
| String&lt;br /&gt;
| Platform the test failed on&lt;br /&gt;
|-&lt;br /&gt;
| builder&lt;br /&gt;
| String&lt;br /&gt;
| The test machine (builder) the test failed on&lt;br /&gt;
|-&lt;br /&gt;
| number&lt;br /&gt;
| Integer&lt;br /&gt;
| The GitHub pull request number&lt;br /&gt;
|-&lt;br /&gt;
| fail_date&lt;br /&gt;
| ISO date (String)&lt;br /&gt;
| Date of the failure&lt;br /&gt;
|}&lt;br /&gt;
=== Datastore ===&lt;br /&gt;
To store the intermittent test failures, a library called [https://tinydb.readthedocs.io/en/latest/ TinyDB] is used. This library is a native python library that provides convenient [https://en.wikipedia.org/wiki/SQL SQL] command like helpers around a [https://www.w3schools.com/js/js_json_syntax.asp JSON] file to more easily use it like a database. The format of the JSON file is simply an array of JSON objects, making the file easily human-readable.&lt;br /&gt;
&lt;br /&gt;
=== Flask Service ===&lt;br /&gt;
[http://flask.pocoo.org/ Flask] is a [https://en.wikipedia.org/wiki/Microservices microservice] framework written in Python. A Flask service is a REST (representational state transfer) API that maps URLs and HTTP verbs to Python functions. Some basic examples of Flask routes:&lt;br /&gt;
  &lt;br /&gt;
  @app.route('/')&lt;br /&gt;
  def index():&lt;br /&gt;
    return 'Index page'&lt;br /&gt;
  &lt;br /&gt;
  @app.route('/user/&amp;lt;username&amp;gt;')&lt;br /&gt;
  def show_user(username):&lt;br /&gt;
    return db.lookup(username)&lt;br /&gt;
&lt;br /&gt;
The first method returns 'Index page' at the root URL. The second method accepts a URL parameter after /user/ and returns the matching user from a database.&lt;br /&gt;
&lt;br /&gt;
== Test Plan ==&lt;br /&gt;
&lt;br /&gt;
=== Functional Testing ===&lt;br /&gt;
As a convenience to testers, this code base includes a set of [http://csc517oss.zachncst.com/ testing web applications], intended only to illustrate the project's functionality.&lt;br /&gt;
This simple set of forms allows a tester to exercise the functionality of the [https://en.wikipedia.org/wiki/Representational_state_transfer REST] endpoints without having to write any REST code.&lt;br /&gt;
The links on the page lead to demonstrations of the query and record handlers, as well as a display of the JSON file containing all the Intermittent Test Failure records.&lt;br /&gt;
All of these are usable for thorough integration testing.&lt;br /&gt;
&lt;br /&gt;
===Unit Testing===&lt;br /&gt;
The Unit Tests included in the code exercise the major functions of this system. The tests exercise the addition of a record into the database, the removal of a record given a filename, the retrieval of a record, and the assertion that a record will not be added if any of the record parameters (test_file, platform, builder, number) is missing. All unit tests are in tests.py.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | Unit Test Summary&lt;br /&gt;
|-&lt;br /&gt;
! Test Purpose&lt;br /&gt;
! Function Tested&lt;br /&gt;
! Parameters&lt;br /&gt;
|-&lt;br /&gt;
| Add a record to a database &lt;br /&gt;
| db.add&lt;br /&gt;
| params[:self, :test_file, :platform, :builder, :number, :fail_date]&lt;br /&gt;
|-&lt;br /&gt;
| Delete a record from database&lt;br /&gt;
| db.remove&lt;br /&gt;
| params[:test_file]&lt;br /&gt;
|-&lt;br /&gt;
| Record a new Intermittent failure&lt;br /&gt;
| handlers.record&lt;br /&gt;
| params[:db, :test_file, :platform, :builder, :number]&lt;br /&gt;
|-&lt;br /&gt;
| Query the Intermittent failure records &lt;br /&gt;
| handlers.query&lt;br /&gt;
| params[:db, :test_file]&lt;br /&gt;
|-&lt;br /&gt;
| Record a new Intermittent failure, test invalid values - 4 tests for blanks for each input item&lt;br /&gt;
| handlers.record&lt;br /&gt;
| params[:db, :test_file, :platform, :builder, :number]&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Running Unit Tests and the App====&lt;br /&gt;
&lt;br /&gt;
Before attempting either of the following, clone the [https://github.com/adamw17/csc517ossproject repo].&lt;br /&gt;
&lt;br /&gt;
=====To Run Unit Tests=====&lt;br /&gt;
* In the cloned repo folder, use the command &amp;lt;code&amp;gt;python test.py&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====To Run The App Locally=====&lt;br /&gt;
* In the cloned repo folder, use the command &amp;lt;code&amp;gt;python -m flask_server&amp;lt;/code&amp;gt;&lt;br /&gt;
* To launch the app, go to http://localhost:5000&lt;br /&gt;
&lt;br /&gt;
== Submission/Pull Requests ==&lt;br /&gt;
&lt;br /&gt;
There is no Pull Request because Servo manager Josh Matthews requested that we start a new (non-branched) repository for this project. The work has been started in a new GitHub repo located [https://github.com/adamw17/csc517ossproject/tree/832969c1cf01d94be340731c744854c25fdbb441 here]. When Servo developers are ready, the project will be pulled in to the Servo project on GitHub. In the interim, we shared our repo with Josh, whose reply was &amp;quot;this looks really great! Thanks for tackling it!&amp;quot;&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/OSS_M1706_Tracking_intermittent_test_failures_over_time&amp;diff=107935</id>
		<title>CSC/ECE 517 Spring 2017/OSS M1706 Tracking intermittent test failures over time</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/OSS_M1706_Tracking_intermittent_test_failures_over_time&amp;diff=107935"/>
		<updated>2017-04-06T02:23:38Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
This wiki provides details on new functionality programmed for the Servo OSS project.&lt;br /&gt;
&lt;br /&gt;
===Background===&lt;br /&gt;
&amp;quot;[https://github.com/servo/servo/wiki/Design Servo] is a project to develop a new Web browser engine. Our goal is to create an architecture that takes advantage of parallelism at many levels while eliminating common sources of bugs and security vulnerabilities associated with incorrect memory management and data races.&amp;quot; Servo can be used through Browser.html, embedded in a website, or natively in Mozilla Firefox. It is designed to load web pages more efficiently and more securely. &lt;br /&gt;
&lt;br /&gt;
===Motivation===&lt;br /&gt;
This project is a request from the Servo OSS project to reduce the impact intermittent test failures have on the software. The [https://github.com/servo/servo/wiki/Tracking-intermittent-failures-over-time-project request] made is for a [http://flask.pocoo.org/docs/0.12/ Flask] service using [https://en.wikipedia.org/wiki/Python_(programming_language) Python 2.7]. The intermittent test failure tracker stores information regarding a test that fails intermittently and also provides means to quickly query for tests that have failed.&lt;br /&gt;
&lt;br /&gt;
===Tasks===&lt;br /&gt;
The initial steps for the intermittent test failure tracker (for the OSS project) include:&lt;br /&gt;
* Build a Flask service &lt;br /&gt;
* Use a JSON file to store information&lt;br /&gt;
* Record required parameters: Test file, platform, test machine (builder), and related GitHub pull request number&lt;br /&gt;
* Query the stored results given a particular test file name&lt;br /&gt;
* Use the known intermittent issue tracker as an example of a simple Flask server&lt;br /&gt;
&lt;br /&gt;
Subsequent steps (for the final project) include:&lt;br /&gt;
* Add the ability to query the service by a date range, to find out which failures occurred most often&lt;br /&gt;
* Build an HTML front-end to the service that queries using JS and reports the results&lt;br /&gt;
** Links to GitHub&lt;br /&gt;
** Sorting&lt;br /&gt;
* Make [https://github.com/servo/servo/blob/master/python/servo/testing_commands.py#L508-L574 filter-intermittents] command record a separate failure for each intermittent failure encountered&lt;br /&gt;
* Propagate the required information for recording failures in [https://github.com/servo/saltfs/issues/597 saltfs]&lt;br /&gt;
&lt;br /&gt;
== Design ==&lt;br /&gt;
&lt;br /&gt;
===Design Pattern===&lt;br /&gt;
&lt;br /&gt;
The Servo and this project's code follow a [https://en.wikipedia.org/wiki/Service_layers_pattern Service Layer] design pattern. This design pattern breaks up functionality into smaller &amp;quot;services&amp;quot; and applies the services to the topmost &amp;quot;layer&amp;quot; of the project for which they are needed.&lt;br /&gt;
&lt;br /&gt;
===Application Flow===&lt;br /&gt;
&lt;br /&gt;
==== Saving a Test ====&lt;br /&gt;
The Servo build agent calls a webhook (a way for an app to provide other applications with real-time information) inside the test tracker. The webhook then calls a handler that contains any business logic necessary to transform the request. Finally the handler persists the request into the db, in this case a json file. This flow can be seen in the graph below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                    +---------------------------------------------+&lt;br /&gt;
                    |       Intermittent Test Failure Tracker     |&lt;br /&gt;
                    |                                             |&lt;br /&gt;
+--------------+    | +-----------+      +---------+    +------+  |&lt;br /&gt;
|              |    | |           |      |         |    |      |  |      +--------+&lt;br /&gt;
|    Servo     |    | |           |      |         |    |      |  |      |        |&lt;br /&gt;
|    Build     +------&amp;gt;  webhook  +------&amp;gt; handler +----&amp;gt;  db  +---------&amp;gt;  json  |&lt;br /&gt;
|    Server    |    | |           |      |         |    |      |  |      |  file  |&lt;br /&gt;
|              |    | |           |      |         |    |      |  |      |        |&lt;br /&gt;
+--------------+    | +-----------+      +---------+    +------+  |      +--------+&lt;br /&gt;
                    |                                             |&lt;br /&gt;
                    +---------------------------------------------+&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
The implementation is driven entirely by the request; the Servo team clearly defines what the service should do and how it should be built.&lt;br /&gt;
&lt;br /&gt;
==Subsequent Steps (Round 2)==&lt;br /&gt;
The first request is to add the ability to query the service by a date range, to find out which failures were most frequent.  Given the fail_date is included in the addition call as an ISO date string, we should be able to build a function to query this date using standard date functions and a range given from the user.&lt;br /&gt;
&lt;br /&gt;
Build an HTML front-end to the service that queries it using JS and reports the results in a useful manner (linking to GitHub, sorting, etc.). For this we should be able to repurpose the testing web pages that we built in the first round. Polishing these up and giving them the required JS request mechanism should suffice.&lt;br /&gt;
&lt;br /&gt;
These last two steps are the full integration of this product into the Servo pipeline.&lt;br /&gt;
&lt;br /&gt;
Make the filter-intermittents command record a separate failure for each intermittent failure encountered. This will require forking the Servo repo, since this is handled in a different part of the codebase than the actual service for tracking the intermittent failures.&lt;br /&gt;
&lt;br /&gt;
Propagate the required information for recording failures in saltfs. Again, this appears to be in the separate testing_commands file in the larger Servo repo.&lt;br /&gt;
&lt;br /&gt;
=== Data model ===&lt;br /&gt;
The model for an intermittent test is defined mostly by the request with a few additions to help with querying in later steps of the OSS request.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Name&lt;br /&gt;
! Type&lt;br /&gt;
! Description&lt;br /&gt;
|-&lt;br /&gt;
| test_file&lt;br /&gt;
| String&lt;br /&gt;
| Name of the intermittent test file &lt;br /&gt;
|-&lt;br /&gt;
| platform&lt;br /&gt;
| String&lt;br /&gt;
| Platform the test failed on&lt;br /&gt;
|-&lt;br /&gt;
| builder&lt;br /&gt;
| String&lt;br /&gt;
| The test machine (builder) the test failed on&lt;br /&gt;
|-&lt;br /&gt;
| number&lt;br /&gt;
| Integer&lt;br /&gt;
| The GitHub pull request number&lt;br /&gt;
|-&lt;br /&gt;
| fail_date&lt;br /&gt;
| ISO date (String)&lt;br /&gt;
| Date of the failure&lt;br /&gt;
|}&lt;br /&gt;
=== Datastore ===&lt;br /&gt;
To store the intermittent test failures, a library called [https://tinydb.readthedocs.io/en/latest/ TinyDB] is used. This library is a native python library that provides convenient [https://en.wikipedia.org/wiki/SQL SQL] command like helpers around a [https://www.w3schools.com/js/js_json_syntax.asp JSON] file to more easily use it like a database. The format of the JSON file is simply an array of JSON objects, making the file easily human-readable.&lt;br /&gt;
&lt;br /&gt;
=== Flask Service ===&lt;br /&gt;
[http://flask.pocoo.org/ Flask] is a [https://en.wikipedia.org/wiki/Microservices microservice] framework written in Python. A Flask service is a REST (representational state transfer) API that maps URLs and HTTP verbs to Python functions. Some basic examples of Flask routes:&lt;br /&gt;
  &lt;br /&gt;
  @app.route('/')&lt;br /&gt;
  def index():&lt;br /&gt;
    return 'Index page'&lt;br /&gt;
  &lt;br /&gt;
  @app.route('/user/&amp;lt;username&amp;gt;')&lt;br /&gt;
  def show_user(username):&lt;br /&gt;
    return db.lookup(username)&lt;br /&gt;
&lt;br /&gt;
The first method returns 'Index page' at the root URL. The second method accepts a URL parameter after /user/ and returns the matching user from a database.&lt;br /&gt;
&lt;br /&gt;
== Test Plan ==&lt;br /&gt;
&lt;br /&gt;
=== Functional Testing ===&lt;br /&gt;
As a convenience to testers, this code base includes a set of [http://csc517oss.zachncst.com/ testing web applications], intended only to illustrate the project's functionality.&lt;br /&gt;
This simple set of forms allows a tester to exercise the functionality of the [https://en.wikipedia.org/wiki/Representational_state_transfer REST] endpoints without having to write any REST code.&lt;br /&gt;
The links on the page lead to demonstrations of the query and record handlers, as well as a display of the JSON file containing all the Intermittent Test Failure records.&lt;br /&gt;
All of these are usable for thorough integration testing.&lt;br /&gt;
&lt;br /&gt;
===Unit Testing===&lt;br /&gt;
The Unit Tests included in the code exercise the major functions of this system. The tests exercise the addition of a record into the database, the removal of a record given a filename, the retrieval of a record, and the assertion that a record will not be added if any of the record parameters (test_file, platform, builder, number) is missing. All unit tests are in tests.py.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;3&amp;quot; | Unit Test Summary&lt;br /&gt;
|-&lt;br /&gt;
! Test Purpose&lt;br /&gt;
! Function Tested&lt;br /&gt;
! Parameters&lt;br /&gt;
|-&lt;br /&gt;
| Add a record to a database &lt;br /&gt;
| db.add&lt;br /&gt;
| params[:self, :test_file, :platform, :builder, :number, :fail_date]&lt;br /&gt;
|-&lt;br /&gt;
| Delete a record from database&lt;br /&gt;
| db.remove&lt;br /&gt;
| params[:test_file]&lt;br /&gt;
|-&lt;br /&gt;
| Record a new Intermittent failure&lt;br /&gt;
| handlers.record&lt;br /&gt;
| params[:db, :test_file, :platform, :builder, :number]&lt;br /&gt;
|-&lt;br /&gt;
| Query the Intermittent failure records &lt;br /&gt;
| handlers.query&lt;br /&gt;
| params[:db, :test_file]&lt;br /&gt;
|-&lt;br /&gt;
| Record a new Intermittent failure, test invalid values - 4 tests for blanks for each input item&lt;br /&gt;
| handlers.record&lt;br /&gt;
| params[:db, :test_file, :platform, :builder, :number]&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Running Unit Tests and the App====&lt;br /&gt;
&lt;br /&gt;
Before attempting either of the following, clone the [https://github.com/adamw17/csc517ossproject repo].&lt;br /&gt;
&lt;br /&gt;
=====To Run Unit Tests=====&lt;br /&gt;
* In the cloned repo folder, use the command &amp;lt;code&amp;gt;python test.py&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====To Run The App Locally=====&lt;br /&gt;
* In the cloned repo folder, use the command &amp;lt;code&amp;gt;python -m flask_server&amp;lt;/code&amp;gt;&lt;br /&gt;
* To launch the app, go to http://localhost:5000&lt;br /&gt;
&lt;br /&gt;
== Submission/Pull Requests ==&lt;br /&gt;
&lt;br /&gt;
There is no Pull Request because Servo manager Josh Matthews requested that we start a new (non-branched) repository for this project. The work has been started in a new GitHub repo located [https://github.com/adamw17/csc517ossproject/tree/832969c1cf01d94be340731c744854c25fdbb441 here]. When Servo developers are ready, the project will be pulled in to the Servo project on GitHub. In the interim, we shared our repo with Josh, whose reply was &amp;quot;this looks really great! Thanks for tackling it!&amp;quot;&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/oss_M1706&amp;diff=107934</id>
		<title>CSC/ECE 517 Spring 2017/oss M1706</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/oss_M1706&amp;diff=107934"/>
		<updated>2017-04-06T02:11:57Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Description===&lt;br /&gt;
The purpose of this project is to provide additional testing infrastructure for the Servo OSS project. &amp;quot;[https://github.com/servo/servo/wiki/Design Servo] is a project to develop a new Web browser engine. The goal is to create an architecture that takes advantage of parallelism at many levels while eliminating common sources of bugs and security vulnerabilities associated with incorrect memory management and data races.&amp;quot; Servo can be used through Browser.html, embedded in a website, or natively in Mozilla Firefox. It is designed to load web pages more efficiently and more securely.&lt;br /&gt;
&lt;br /&gt;
In particular, this project is a request from the Servo OSS project to reduce the impact intermittent test failures have on the software. Intermittent failures frequently occur but are normally ignored during continuous integration. The frequency of each intermittent failure signature, though not currently logged, would be useful in allowing developers to identify and resolve the most prevalent issues.&lt;br /&gt;
&lt;br /&gt;
===Tasks to be completed===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Current Implementation===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===UML Diagram===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Proposed Design===&lt;br /&gt;
&lt;br /&gt;
'''TASK 1''' - &lt;br /&gt;
&lt;br /&gt;
'''TASK 2''' - &lt;br /&gt;
&lt;br /&gt;
==== Design Pattern Used ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Features to be added====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Testing Plan===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Files Changed and Added====&lt;br /&gt;
&lt;br /&gt;
====Models====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Controllers====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Views====&lt;br /&gt;
&lt;br /&gt;
===Round 2===&lt;br /&gt;
The first request is to add the ability to query the service by a date range, to find out which failures were most frequent.  Given the fail_date is included in the addition call as an ISO date string, we should be able to build a function to query this date using standard date functions and a range given from the user.&lt;br /&gt;
&lt;br /&gt;
Build an HTML front-end to the service that queries it using JS and reports the results in a useful manner (linking to GitHub, sorting, etc.). For this we should be able to repurpose the testing web pages that we built in the first round. Polishing these up and giving them the required JS request mechanism should suffice.&lt;br /&gt;
&lt;br /&gt;
These last two steps are the full integration of this product into the Servo pipeline.&lt;br /&gt;
&lt;br /&gt;
Make the filter-intermittents command record a separate failure for each intermittent failure encountered. This will require forking the Servo repo, since this is handled in a different part of the codebase than the actual service for tracking the intermittent failures.&lt;br /&gt;
&lt;br /&gt;
Propagate the required information for recording failures in saltfs. Again, this appears to be in the separate testing_commands file in the larger Servo repo.&lt;br /&gt;
&lt;br /&gt;
===Important Links===&lt;br /&gt;
&lt;br /&gt;
Link to Github repository :  https://github.com/adamw17/csc517ossproject&lt;br /&gt;
&lt;br /&gt;
Link to Pull request : The contact for this project asked us to create an entirely new repository.  A pull request is not applicable.&lt;br /&gt;
&lt;br /&gt;
===References===&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/oss_M1706&amp;diff=107933</id>
		<title>CSC/ECE 517 Spring 2017/oss M1706</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/oss_M1706&amp;diff=107933"/>
		<updated>2017-04-06T02:11:33Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Description===&lt;br /&gt;
The purpose of this project is to provide additional testing infrastructure for the Servo OSS project. &amp;quot;[https://github.com/servo/servo/wiki/Design Servo] is a project to develop a new Web browser engine. The goal is to create an architecture that takes advantage of parallelism at many levels while eliminating common sources of bugs and security vulnerabilities associated with incorrect memory management and data races.&amp;quot; Servo can be used through Browser.html, embedded in a website, or natively in Mozilla Firefox. It is designed to load web pages more efficiently and more securely.&lt;br /&gt;
&lt;br /&gt;
In particular, this project is a request from the Servo OSS project to reduce the impact intermittent test failures have on the software. Intermittent failures frequently occur but are normally ignored during continuous integration. The frequency of each intermittent failure signature, though not currently logged, would be useful in allowing developers to identify and resolve the most prevalent issues.&lt;br /&gt;
&lt;br /&gt;
===Tasks to be completed===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Current Implementation===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===UML Diagram===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Proposed Design===&lt;br /&gt;
&lt;br /&gt;
'''TASK 1''' - &lt;br /&gt;
&lt;br /&gt;
'''TASK 2''' - &lt;br /&gt;
&lt;br /&gt;
==== Design Pattern Used ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Features to be added====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Testing Plan===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Files Changed and Added====&lt;br /&gt;
&lt;br /&gt;
====Models====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Controllers====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Views====&lt;br /&gt;
&lt;br /&gt;
====Round 2====&lt;br /&gt;
The first request is to add the ability to query the service by a date range, to find out which failures were most frequent.  Given the fail_date is included in the addition call as an ISO date string, we should be able to build a function to query this date using standard date functions and a range given from the user.&lt;br /&gt;
&lt;br /&gt;
Build an HTML front-end to the service that queries it using JS and reports the results in a useful manner (linking to GitHub, sorting, etc.). For this we should be able to repurpose the testing web pages that we built in the first round. Polishing these up and giving them the required JS request mechanism should suffice.&lt;br /&gt;
&lt;br /&gt;
These last two steps are the full integration of this product into the Servo pipeline.&lt;br /&gt;
&lt;br /&gt;
Make the filter-intermittents command record a separate failure for each intermittent failure encountered. This will require forking the Servo repo, since this is handled in a different part of the codebase than the actual service for tracking the intermittent failures.&lt;br /&gt;
&lt;br /&gt;
Propagate the required information for recording failures in saltfs. Again, this appears to be in the separate testing_commands file in the larger Servo repo.&lt;br /&gt;
&lt;br /&gt;
===Important Links===&lt;br /&gt;
&lt;br /&gt;
Link to Github repository :  https://github.com/adamw17/csc517ossproject&lt;br /&gt;
&lt;br /&gt;
Link to Pull request : The contact for this project asked us to create an entirely new repository.  A pull request is not applicable.&lt;br /&gt;
&lt;br /&gt;
===References===&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/oss_M1706&amp;diff=107930</id>
		<title>CSC/ECE 517 Spring 2017/oss M1706</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/oss_M1706&amp;diff=107930"/>
		<updated>2017-04-06T00:58:36Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Description===&lt;br /&gt;
The purpose of this project is to provide additional testing infrastructure for the Mozilla Servo OSS project. &amp;quot;[https://github.com/servo/servo/wiki/Design Servo] is a project to develop a new Web browser engine. The goal is to create an architecture that takes advantage of parallelism at many levels while eliminating common sources of bugs and security vulnerabilities associated with incorrect memory management and data races.&amp;quot; Servo can be used through Browser.html, embedded in a website, or natively in Mozilla Firefox. It is designed to load web pages more efficiently and more securely.&lt;br /&gt;
&lt;br /&gt;
In particular, this project is a request from the Servo OSS project to reduce the impact intermittent test failures have on the software. Intermittent failures frequently occur but are normally ignored during continuous integration. The frequency of each intermittent failure signature, though not currently logged, would be useful in allowing developers to identify and resolve the most prevalent issues.&lt;br /&gt;
&lt;br /&gt;
===Tasks to be completed===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Current Implementation===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===UML Diagram===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Proposed Design===&lt;br /&gt;
&lt;br /&gt;
'''TASK 1''' - &lt;br /&gt;
&lt;br /&gt;
'''TASK 2''' - &lt;br /&gt;
&lt;br /&gt;
==== Design Pattern Used ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Features to be added====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Testing Plan===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Files Changed and Added====&lt;br /&gt;
&lt;br /&gt;
====Models====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Controllers====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Views====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Important Links===&lt;br /&gt;
&lt;br /&gt;
Link to Github repository :  https://github.com/adamw17/csc517ossproject&lt;br /&gt;
&lt;br /&gt;
Link to Pull request : The contact for this project asked us to create an entirely new repository.  A pull request is not applicable.&lt;br /&gt;
&lt;br /&gt;
===References===&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/oss_M1706&amp;diff=107929</id>
		<title>CSC/ECE 517 Spring 2017/oss M1706</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/oss_M1706&amp;diff=107929"/>
		<updated>2017-04-06T00:58:05Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: remove some unneeded sections&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Description===&lt;br /&gt;
The purpose of this project is to provide additional testing infrastructure for the Mozilla Servo OSS project. &amp;quot;[https://github.com/servo/servo/wiki/Design Servo] is a project to develop a new Web browser engine. The goal is to create an architecture that takes advantage of parallelism at many levels while eliminating common sources of bugs and security vulnerabilities associated with incorrect memory management and data races.&amp;quot; Servo can be used through Browser.html, embedded in a website, or natively in Mozilla Firefox. It is designed to load web pages more efficiently and more securely.&lt;br /&gt;
&lt;br /&gt;
In particular, this project is a request from the Servo OSS project to reduce the impact intermittent test failures have on the software. Intermittent failures frequently occur but are normally ignored during continuous integration. The frequency of each intermittent failure signature, though not currently logged, would be useful in allowing developers to identify and resolve the most prevalent issues.&lt;br /&gt;
&lt;br /&gt;
===Tasks to be completed===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Current Implementation===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===UML Diagram===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Proposed Design===&lt;br /&gt;
&lt;br /&gt;
'''TASK 1''' - &lt;br /&gt;
&lt;br /&gt;
'''TASK 2''' - &lt;br /&gt;
&lt;br /&gt;
==== Design Pattern Used ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Features to be added====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Metrics View===&lt;br /&gt;
&lt;br /&gt;
===Testing Plan===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Files Changed and Added====&lt;br /&gt;
&lt;br /&gt;
====Models====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Controllers====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Views====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Important Links===&lt;br /&gt;
&lt;br /&gt;
Link to Github repository :  https://github.com/adamw17/csc517ossproject&lt;br /&gt;
&lt;br /&gt;
Link to Pull request : The contact for this project asked us to create an entirely new repository.  A pull request is not applicable.&lt;br /&gt;
&lt;br /&gt;
===References===&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/OSS_M1706_Tracking_intermittent_test_failures_over_time&amp;diff=107583</id>
		<title>CSC/ECE 517 Spring 2017/OSS M1706 Tracking intermittent test failures over time</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Spring_2017/OSS_M1706_Tracking_intermittent_test_failures_over_time&amp;diff=107583"/>
		<updated>2017-03-30T01:50:51Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Test Plan */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
This wiki provides details on new functionality programmed for the Servo OSS project.&lt;br /&gt;
&lt;br /&gt;
===Background===&lt;br /&gt;
&amp;quot;[https://github.com/servo/servo/wiki/Design Servo] is a project to develop a new Web browser engine. Our goal is to create an architecture that takes advantage of parallelism at many levels while eliminating common sources of bugs and security vulnerabilities associated with incorrect memory management and data races.&amp;quot; Servo can be used through Browser.html, embedded in a website, or natively in Mozilla Firefox. It is designed to load web pages more efficiently and more securely. &lt;br /&gt;
&lt;br /&gt;
===Motivation===&lt;br /&gt;
This project is a request from the Servo OSS project to reduce the impact intermittent test failures have on the software. The [https://github.com/servo/servo/wiki/Tracking-intermittent-failures-over-time-project request] made is for a [http://flask.pocoo.org/docs/0.12/ Flask] service using [https://en.wikipedia.org/wiki/Python_(programming_language) Python 2.7]. The intermittent test failure tracker stores information regarding a test that fails intermittently and also provides means to quickly query for tests that have failed.&lt;br /&gt;
&lt;br /&gt;
===Tasks===&lt;br /&gt;
The initial steps for the intermittent test failure tracker (for the OSS project) include:&lt;br /&gt;
* Build a Flask service &lt;br /&gt;
* Use a JSON file to store information&lt;br /&gt;
* Record required parameters: Test file, platform, test machine (builder), and related GitHub pull request number&lt;br /&gt;
* Query the stored results given a particular test file name&lt;br /&gt;
* Use the known intermittent issue tracker as an example of a simple Flask server&lt;br /&gt;
&lt;br /&gt;
Subsequent steps (for the final project) include:&lt;br /&gt;
* Add the ability to query the service by a date range, to find out which failures occurred most often&lt;br /&gt;
* Build an HTML front-end to the service that queries using JS and reports the results&lt;br /&gt;
** Links to GitHub&lt;br /&gt;
** Sorting&lt;br /&gt;
* Make [https://github.com/servo/servo/blob/master/python/servo/testing_commands.py#L508-L574 filter-intermittents] command record a separate failure for each intermittent failure encountered&lt;br /&gt;
* Propagate the required information for recording failures in [https://github.com/servo/saltfs/issues/597 saltfs]&lt;br /&gt;
&lt;br /&gt;
== Design ==&lt;br /&gt;
&lt;br /&gt;
===Design Pattern===&lt;br /&gt;
&lt;br /&gt;
The Servo and this project's code follow a [https://en.wikipedia.org/wiki/Service_layers_pattern Service Layer] design pattern. This design pattern breaks up functionality into smaller &amp;quot;services&amp;quot; and applies the services to the topmost &amp;quot;layer&amp;quot; of the project for which they are needed.&lt;br /&gt;
&lt;br /&gt;
===Application Flow===&lt;br /&gt;
&lt;br /&gt;
==== Saving a Test ====&lt;br /&gt;
The Servo build agent calls a webhook (a way for an app to provide other applications with real-time information) inside the test tracker. The webhook then calls a handler that contains any business logic necessary to transform the request. Finally the handler persists the request into the db, in this case a json file. This flow can be seen in the graph below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
                    +---------------------------------------------+&lt;br /&gt;
                    |       Intermittent Test Failure Tracker     |&lt;br /&gt;
                    |                                             |&lt;br /&gt;
+--------------+    | +-----------+      +---------+    +------+  |&lt;br /&gt;
|              |    | |           |      |         |    |      |  |      +--------+&lt;br /&gt;
|    Servo     |    | |           |      |         |    |      |  |      |        |&lt;br /&gt;
|    Build     +------&amp;gt;  webhook  +------&amp;gt; handler +----&amp;gt;  db  +---------&amp;gt;  json  |&lt;br /&gt;
|    Server    |    | |           |      |         |    |      |  |      |  file  |&lt;br /&gt;
|              |    | |           |      |         |    |      |  |      |        |&lt;br /&gt;
+--------------+    | +-----------+      +---------+    +------+  |      +--------+&lt;br /&gt;
                    |                                             |&lt;br /&gt;
                    +---------------------------------------------+&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
The implementation is driven entirely by the request; the Servo team clearly defines what the service should do and how it should be built.&lt;br /&gt;
&lt;br /&gt;
=== Data model ===&lt;br /&gt;
The model for an intermittent test is defined mostly by the request with a few additions to help with querying in later steps of the OSS request.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Name&lt;br /&gt;
! Type&lt;br /&gt;
! Description&lt;br /&gt;
|-&lt;br /&gt;
| test_file&lt;br /&gt;
| String&lt;br /&gt;
| Name of the intermittent test file &lt;br /&gt;
|-&lt;br /&gt;
| platform&lt;br /&gt;
| String&lt;br /&gt;
| Platform the test failed on&lt;br /&gt;
|-&lt;br /&gt;
| builder&lt;br /&gt;
| String&lt;br /&gt;
| The test machine (builder) the test failed on&lt;br /&gt;
|-&lt;br /&gt;
| number&lt;br /&gt;
| Integer&lt;br /&gt;
| The GitHub pull request number&lt;br /&gt;
|-&lt;br /&gt;
| fail_date&lt;br /&gt;
| ISO date (String)&lt;br /&gt;
| Date of the failure&lt;br /&gt;
|}&lt;br /&gt;
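&lt;br /&gt;
For illustration, a single stored record might look like the following (the field values here are made up):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{&lt;br /&gt;
  &amp;quot;test_file&amp;quot;: &amp;quot;tests/wpt/example_test.html&amp;quot;,&lt;br /&gt;
  &amp;quot;platform&amp;quot;: &amp;quot;linux&amp;quot;,&lt;br /&gt;
  &amp;quot;builder&amp;quot;: &amp;quot;linux-rel&amp;quot;,&lt;br /&gt;
  &amp;quot;number&amp;quot;: 12345,&lt;br /&gt;
  &amp;quot;fail_date&amp;quot;: &amp;quot;2017-03-29&amp;quot;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;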
=== Datastore ===&lt;br /&gt;
To store the intermittent test failures, a library called [https://tinydb.readthedocs.io/en/latest/ TinyDB] is used. TinyDB is a pure-Python library that provides convenient, [https://en.wikipedia.org/wiki/SQL SQL]-like query helpers around a [https://www.w3schools.com/js/js_json_syntax.asp JSON] file so that it can be used like a small database. The format of the JSON file is simply an array of JSON objects, making the file easily human readable.&lt;br /&gt;
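&lt;br /&gt;
A minimal sketch of how TinyDB might be used for this datastore (the file name and field values are illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from tinydb import TinyDB, Query&lt;br /&gt;
&lt;br /&gt;
db = TinyDB('intermittents.json')   # the JSON file backing the store&lt;br /&gt;
&lt;br /&gt;
# insert a failure record&lt;br /&gt;
db.insert({'test_file': 'example_test.html', 'platform': 'linux',&lt;br /&gt;
           'builder': 'linux-rel', 'number': 12345,&lt;br /&gt;
           'fail_date': '2017-03-29'})&lt;br /&gt;
&lt;br /&gt;
# query all stored failures for a given test file&lt;br /&gt;
Failure = Query()&lt;br /&gt;
results = db.search(Failure.test_file == 'example_test.html')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;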
&lt;br /&gt;
=== Flask Service ===&lt;br /&gt;
[http://flask.pocoo.org/ Flask] is a lightweight Python web framework commonly used to build [https://en.wikipedia.org/wiki/Microservices microservices]. A Flask service is a REST (representational state transfer) API that maps URLs and HTTP verbs to Python functions. Some basic examples of Flask routes:&lt;br /&gt;
  &lt;br /&gt;
  @app.route('/')&lt;br /&gt;
  def index():&lt;br /&gt;
    return 'Index page'&lt;br /&gt;
  &lt;br /&gt;
  @app.route('/user/&amp;lt;username&amp;gt;')&lt;br /&gt;
  def show_user(username):&lt;br /&gt;
    return db.lookup(username)&lt;br /&gt;
&lt;br /&gt;
The first method returns 'Index page' at the root URL. The second method accepts a URL parameter after /user/ and returns the matching user from a database.&lt;br /&gt;
&lt;br /&gt;
== Test Plan ==&lt;br /&gt;
&lt;br /&gt;
=== Functional Testing ===&lt;br /&gt;
As a convenience to testers, this code base includes a set of [http://csc517oss.zachncst.com/ testing web applications], intended only to illustrate the project's functionality.&lt;br /&gt;
This simple set of forms allows a tester to exercise the functionality of the REST endpoints without having to write any REST code.&lt;br /&gt;
The links on the page lead to demonstrations of the query and record handlers, as well as a display of the JSON file containing all the intermittent test failure records.&lt;br /&gt;
All of these can be used for thorough integration testing.&lt;br /&gt;
&lt;br /&gt;
===Unit Testing===&lt;br /&gt;
The unit tests included in the code exercise the major functions of this system. The tests cover adding a record to the database, removing a record given a file name, and finally retrieving a record.&lt;br /&gt;
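&lt;br /&gt;
A sketch of what such tests might look like, assuming a hypothetical db module exposing add_record, remove_record, and get_records helpers (the real module and function names may differ):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import unittest&lt;br /&gt;
from db import add_record, remove_record, get_records   # hypothetical module under test&lt;br /&gt;
&lt;br /&gt;
class TestIntermittentStore(unittest.TestCase):&lt;br /&gt;
    def test_add_and_retrieve(self):&lt;br /&gt;
        add_record({'test_file': 'a.html', 'platform': 'linux',&lt;br /&gt;
                    'builder': 'linux-rel', 'number': 1})&lt;br /&gt;
        self.assertEqual(len(get_records('a.html')), 1)&lt;br /&gt;
&lt;br /&gt;
    def test_remove(self):&lt;br /&gt;
        add_record({'test_file': 'b.html', 'platform': 'mac',&lt;br /&gt;
                    'builder': 'mac-rel', 'number': 2})&lt;br /&gt;
        remove_record('b.html')&lt;br /&gt;
        self.assertEqual(get_records('b.html'), [])&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    unittest.main()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;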
&lt;br /&gt;
== Submission/Pull Requests ==&lt;br /&gt;
&lt;br /&gt;
There is no Pull Request because Servo manager Josh Matthews requested that we start a new (non-branched) repository for this project. The work has been started in a new GitHub repo located [https://github.com/adamw17/csc517ossproject/tree/832969c1cf01d94be340731c744854c25fdbb441 here]. When Servo developers are ready, the project will be pulled in to the Servo project on GitHub.&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93992</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93992"/>
		<updated>2015-02-17T02:29:23Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Challenges */ update refs&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void *input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
//Output: key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values&lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
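&lt;br /&gt;
The same computation can be expressed as a small, runnable single-process Python sketch that mimics the map, shuffle/group, and reduce phases. This is only an illustration of the model, not a distributed implementation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def map_phase(doc):&lt;br /&gt;
    # emit a (word, 1) pair for every word in the document&lt;br /&gt;
    return [(word, 1) for word in doc.split()]&lt;br /&gt;
&lt;br /&gt;
def shuffle(pairs):&lt;br /&gt;
    # group intermediate values by key, as the runtime would&lt;br /&gt;
    groups = defaultdict(list)&lt;br /&gt;
    for key, value in pairs:&lt;br /&gt;
        groups[key].append(value)&lt;br /&gt;
    return groups&lt;br /&gt;
&lt;br /&gt;
def reduce_phase(key, values):&lt;br /&gt;
    # sum the counts for one key&lt;br /&gt;
    return (key, sum(values))&lt;br /&gt;
&lt;br /&gt;
docs = ['the quick brown fox', 'the lazy dog', 'the fox']&lt;br /&gt;
intermediate = [pair for doc in docs for pair in map_phase(doc)]&lt;br /&gt;
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())&lt;br /&gt;
print(counts)   # {'the': 3, 'fox': 2, 'quick': 1, ...}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;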
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Clustered Systems ===&lt;br /&gt;
The de facto standard for using MapReduce is in a clustered environment of many separate machines. The purpose of MapReduce is to transform a large set of data into another large set of data and possibly reduce the output. The cost of clustered environments is communication latency, which leaves them best suited for tasks where immediate feedback isn't necessary. Log analysis, data transformation and other such problems are solved using the clustered environment implementations.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: communication between the MapReduce nodes is a significant overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully distributed MapReduce cluster like Hadoop is inefficient. Problem sets that are expressed in key-value pairs best fit the shared-memory model.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited for the use of a MapReduce application on distributed memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) are loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data are compared to each node, with the winning node being the one that most closely matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed memory machine, because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
# Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
# [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
# MapReduce-MPI and KMR implement MapReduce for distributed memory systems.&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
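&lt;br /&gt;
The partitioning step can be illustrated with a few lines of Python (the value of R and the keys are arbitrary examples):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
R = 4   # number of reduce partitions, chosen by the user&lt;br /&gt;
&lt;br /&gt;
def partition(key, R):&lt;br /&gt;
    # assign an intermediate key to one of the R reduce tasks&lt;br /&gt;
    return hash(key) % R&lt;br /&gt;
&lt;br /&gt;
for key in ['apple', 'banana', 'cherry']:&lt;br /&gt;
    print('%s -&amp;gt; reduce task %d' % (key, partition(key, R)))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;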
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and ''R'' reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementation of Map-Reduce can be scaled to large clusters of machines comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on how applications can be expressed within the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published the papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the model. The important thing to note here is that Apache made this framework open source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality: the idea that the network is slow and the data so large that it would take significantly longer to transfer the data over the network to a centralized processor than to bring the computation to the location of the data. In some cases the data is so large that this is the only processing option. Data in Hadoop is stored in a file system called the Hadoop Distributed File System (HDFS)&amp;lt;ref&amp;gt;http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html&amp;lt;/ref&amp;gt;. MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
# Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
# The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
# JobTracker determines appropriate jobs based on how busy the TaskTracker is. &lt;br /&gt;
# TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
# The buffer is eventually flushed into two files. &lt;br /&gt;
# After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
# When done, the JobTracker notifies TaskTracker to jump to reduce phase. This again follows same method where reduce task is forked. &lt;br /&gt;
# The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
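&lt;br /&gt;
From the developer's point of view, a Hadoop job can be as simple as two small scripts. For instance, the classic word count can be written in Python and run with Hadoop Streaming, which pipes HDFS data through the scripts via standard input and output; the scripts below are a sketch, and the exact submission options depend on the cluster:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
# mapper.py: emit a (word, 1) pair for every word read from stdin&lt;br /&gt;
import sys&lt;br /&gt;
&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    for word in line.split():&lt;br /&gt;
        print('%s\t%s' % (word, 1))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
# reducer.py: sum the counts for each word (Hadoop sorts the input by key)&lt;br /&gt;
import sys&lt;br /&gt;
&lt;br /&gt;
current, total = None, 0&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    word, count = line.rstrip('\n').split('\t')&lt;br /&gt;
    if word != current:&lt;br /&gt;
        if current is not None:&lt;br /&gt;
            print('%s\t%d' % (current, total))&lt;br /&gt;
        current, total = word, 0&lt;br /&gt;
    total += int(count)&lt;br /&gt;
if current is not None:&lt;br /&gt;
    print('%s\t%d' % (current, total))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These scripts would typically be submitted with the Hadoop Streaming jar, passing -input, -output, -mapper and -reducer options.&lt;br /&gt;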
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 implementation; notably, heavy processing load would make the JobTracker a large bottleneck. To remove this bottleneck, YARN was implemented. YARN is an application framework that solely does resource management for Hadoop clusters: not only can you run MapReduce jobs, you can also put other in-cluster frameworks under YARN resource management, allowing you to properly allocate resources across your cluster. At its simplest, YARN is the separation of the work that the JobTracker would do into two new processes: the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports; however, the execution of the job changes significantly. YARN does work in units called containers, which represent a unit of work that can be done on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a DataNode in the cluster; to run it, the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container. The ApplicationMaster then determines, based on the input splits, the number of map tasks to create. Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager, which decides where to run the map tasks based on the locality of data and the available resources. The ApplicationMaster then asks the NodeManagers on the assigned nodes to start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the available memory on the nodes in the cluster and starts job execution immediately, where MapReduce waits until the code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming data ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. This allows data to be read into memory on a cluster and iterations of an algorithm to run over the same data in memory, instead of reading it from disk repeatedly.&lt;br /&gt;
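&lt;br /&gt;
For comparison with the Hadoop Streaming scripts above, a sketch of the same word count using Spark's Python API (PySpark) might look like this; the input and output paths are only examples:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from pyspark import SparkContext&lt;br /&gt;
&lt;br /&gt;
sc = SparkContext(appName='WordCount')&lt;br /&gt;
&lt;br /&gt;
counts = (sc.textFile('hdfs:///data/docs')           # read the input from HDFS&lt;br /&gt;
            .flatMap(lambda line: line.split())      # map: emit one word per record&lt;br /&gt;
            .map(lambda word: (word, 1))             # pair each word with a count of 1&lt;br /&gt;
            .reduceByKey(lambda a, b: a + b))        # reduce: sum the counts per word&lt;br /&gt;
&lt;br /&gt;
counts.saveAsTextFile('hdfs:///data/word_counts')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;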
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop; both are directed-acyclic-graph (DAG) engines. Based on the Microsoft Dryad paper, the DAG execution engine allows applications to represent tasks as nodes in a graph. Like Spark, it improves execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of the available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared-memory system can show a significant increase in speed over cluster/disk-based systems because there is little to no IO overhead. However, a few challenges present themselves in the shared-memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Intermediate output is stored in memory, which requires a lot of memory for large problem sets.&lt;br /&gt;
* The ratio of key-value pairs to the number of distinct pairs strongly affects performance.&lt;br /&gt;
* The execution time of the reduce phase is affected by task queue overhead.&lt;br /&gt;
* The size and shape of the data structure used to store the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split them across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. Combiners contribute to better data locality and lower memory-allocation pressure, which makes a substantial number of applications scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Due to shared memory, key-value storage is inefficient, since containers must provide fast lookup and retrieval over a potentially large data set, all the while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than the memory traffic. Combiners fail to reduce the memory-allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables the user-implemented optimizations described in the previous two sections, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated because of the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
===Challenges===&lt;br /&gt;
&lt;br /&gt;
#Susceptible to network outages.&lt;br /&gt;
#Node failure has to be handled and work rescheduled.&lt;br /&gt;
#There has to be a system that knows of all the workers and where they are.&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI. &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt; Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++. The major downfall of this implementation is a lack of fault tolerance: the implementation's MPI library does not detect machines that are no longer part of the cluster very well.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the functions process keys and values like standard MapReduce implementations. The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language they've built called OINK.&lt;br /&gt;
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI. KMR is more robust than MR-MPI, at the cost of being slightly more complex for building your MapReduce application. There isn't much that is very distinct about this implementation: it provides the ability to assign functions for mapping and for the shuffle.&lt;br /&gt;
&lt;br /&gt;
'''Examples'''&lt;br /&gt;
&lt;br /&gt;
See [http://mt.aics.riken.jp/kmr/docs/kmr-1.5/html/index.html#overview KMR Overview]&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are the following three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.cse.ust.hk/gpuqp/Mars.html Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks and automatically assigns each thread a small number of key/value pairs to work on, so the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars has a lock-free scheme with low runtime overhead on top of the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
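&lt;br /&gt;
The idea behind this two-step design can be illustrated with a small Python sketch: a first counting pass determines how much output each thread would produce, an exclusive prefix sum turns those counts into non-overlapping write offsets, and a second pass writes the results without needing atomics or locks. This is only an illustration of the scheme, not Mars code:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def count_pass(inputs, map_count):&lt;br /&gt;
    # step 1: each &amp;quot;thread&amp;quot; reports the size of its output&lt;br /&gt;
    return [map_count(x) for x in inputs]&lt;br /&gt;
&lt;br /&gt;
def prefix_sum(counts):&lt;br /&gt;
    # exclusive prefix sum gives each thread its starting write offset&lt;br /&gt;
    offsets, total = [], 0&lt;br /&gt;
    for c in counts:&lt;br /&gt;
        offsets.append(total)&lt;br /&gt;
        total += c&lt;br /&gt;
    return offsets, total&lt;br /&gt;
&lt;br /&gt;
def output_pass(inputs, map_fn, offsets, total):&lt;br /&gt;
    # step 2: each &amp;quot;thread&amp;quot; writes into its own slice of the output buffer&lt;br /&gt;
    out = [None] * total&lt;br /&gt;
    for x, off in zip(inputs, offsets):&lt;br /&gt;
        for i, item in enumerate(map_fn(x)):&lt;br /&gt;
            out[off + i] = item&lt;br /&gt;
    return out&lt;br /&gt;
&lt;br /&gt;
# toy example: each input word emits one (word, 1) pair&lt;br /&gt;
words = ['gpu', 'map', 'reduce']&lt;br /&gt;
counts = count_pass(words, lambda w: 1)&lt;br /&gt;
offsets, total = prefix_sum(counts)&lt;br /&gt;
print(output_pass(words, lambda w: [(w, 1)], offsets, total))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;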
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you want to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or an HTTP page. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to have the mapper count the terms for its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts and then combined for a final result is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and then emits the result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
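&lt;br /&gt;
A single-machine Python analogue of this pattern uses a process pool to stand in for the mappers; the calculate function here is just a placeholder computation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from multiprocessing import Pool&lt;br /&gt;
&lt;br /&gt;
def calculate(spec):&lt;br /&gt;
    # placeholder: in practice this would solve one sub-problem&lt;br /&gt;
    return spec * spec&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    specs = range(10)                          # the problem split into specifications&lt;br /&gt;
    pool = Pool(4)&lt;br /&gt;
    results = pool.map(calculate, specs)       # &amp;quot;map&amp;quot;: run each spec in parallel&lt;br /&gt;
    pool.close()&lt;br /&gt;
    print(sum(results))                        # &amp;quot;reduce&amp;quot;: combine the partial results&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;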
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors; this state can be the distance between nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way: on each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration is terminated by some condition, such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting the node object itself and a message for each of its neighbors, using the adjacent node's ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with the new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be handled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
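&lt;br /&gt;
The pattern can be made concrete with a small, runnable Python sketch of the iteration over an in-memory graph; the node IDs and edges are made up:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
INF = float('inf')&lt;br /&gt;
# adjacency list and current distances; node 'a' is the source&lt;br /&gt;
graph = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}&lt;br /&gt;
state = {'a': 0, 'b': INF, 'c': INF, 'd': INF}&lt;br /&gt;
&lt;br /&gt;
def bfs_iteration(graph, state):&lt;br /&gt;
    # map: every node sends (its distance + 1) to each of its neighbors&lt;br /&gt;
    messages = defaultdict(list)&lt;br /&gt;
    for node, dist in state.items():&lt;br /&gt;
        for neighbor in graph[node]:&lt;br /&gt;
            messages[neighbor].append(dist + 1)&lt;br /&gt;
    # reduce: each node keeps the minimum of its old state and incoming messages&lt;br /&gt;
    return {node: min([dist] + messages[node]) for node, dist in state.items()}&lt;br /&gt;
&lt;br /&gt;
for _ in range(len(graph)):   # enough iterations to reach every node&lt;br /&gt;
    state = bfs_iteration(graph, state)&lt;br /&gt;
print(state)                  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;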
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
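&lt;br /&gt;
As a minimal, runnable illustration of the last of these, the inverted index can be sketched in Python as follows; the documents are made up:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def map_doc(doc_id, text):&lt;br /&gt;
    # emit a (word, document ID) pair for every word in the document&lt;br /&gt;
    return [(word, doc_id) for word in text.split()]&lt;br /&gt;
&lt;br /&gt;
def reduce_word(word, doc_ids):&lt;br /&gt;
    # sort and de-duplicate the document IDs for this word&lt;br /&gt;
    return (word, sorted(set(doc_ids)))&lt;br /&gt;
&lt;br /&gt;
docs = {1: 'map tasks emit pairs', 2: 'reduce tasks merge pairs'}&lt;br /&gt;
pairs = [p for doc_id, text in docs.items() for p in map_doc(doc_id, text)]&lt;br /&gt;
&lt;br /&gt;
grouped = defaultdict(list)&lt;br /&gt;
for word, doc_id in pairs:&lt;br /&gt;
    grouped[word].append(doc_id)&lt;br /&gt;
&lt;br /&gt;
index = dict(reduce_word(w, ids) for w, ids in grouped.items())&lt;br /&gt;
print(index)   # {'pairs': [1, 2], 'tasks': [1, 2], 'map': [1], ...}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;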
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer can provide a simple, functional expression of the algorithm and leave parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is comparable to that of parallel code written with the Pthreads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort. This difficulty is even greater for complex, performance-critical tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With a GPU-based framework, developers write their code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded the patent for MapReduce, but it can be argued that this technology is similar to many others that already exist. There are programming models that are similar to MapReduce, such as Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic skeletons are a high-level programming model for parallel and distributed computing, and frameworks and libraries implementing them are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93989</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93989"/>
		<updated>2015-02-17T02:21:15Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Apache’s Hadoop MapReduce */ fix markup&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
A program to count the number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
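&lt;br /&gt;
For concreteness, the pseudo-code can be turned into a small runnable sketch. The following Python fragment is an illustration only, not a distributed implementation; the grouping of intermediate pairs by key, which a real runtime performs, is written out explicitly.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# A minimal, single-process Python sketch of the word-count program above.&lt;br /&gt;
# The grouping of intermediate pairs by key is done by the small 'runtime' below.&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def map_func(document):&lt;br /&gt;
    # Input: a document; intermediate output: (word, 1) pairs.&lt;br /&gt;
    for word in document.split():&lt;br /&gt;
        yield (word, 1)&lt;br /&gt;
&lt;br /&gt;
def reduce_func(key, values):&lt;br /&gt;
    # Intermediate input: a word and all of its counts; output: (word, occurrences).&lt;br /&gt;
    return (key, sum(values))&lt;br /&gt;
&lt;br /&gt;
def run_mapreduce(documents):&lt;br /&gt;
    intermediate = defaultdict(list)&lt;br /&gt;
    for doc in documents:&lt;br /&gt;
        for key, value in map_func(doc):&lt;br /&gt;
            intermediate[key].append(value)   # group pairs that share a key&lt;br /&gt;
    return sorted(reduce_func(k, v) for k, v in intermediate.items())&lt;br /&gt;
&lt;br /&gt;
print(run_mapreduce(['the quick brown fox', 'the lazy dog', 'the fox']))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;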
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Clustered Systems ===&lt;br /&gt;
The de facto standard for using MapReduce is in a clustered environment of many separate machines. The purpose of MapReduce is to transform a large set of data into another large set of data and possibly reduce the output. The cost of clustered environments is the latency of communication. This leaves clustered environments best suited for tasks where immediate feedback isn't necessary. Log analysis, data transformation and other types of problems are solved using the clustered environment implementations.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system. The communication between the MapReduce nodes is a significant overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient. Problem sets that are expressed in key-value pairs best fit into the shared-memory model.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited for the use of a MapReduce application on distributed memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. The learning phase is where data (vectors) is loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data will be compared to each node, with the winning node being the one that most matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited for the MapReduce structure on a distributed memory machine. This is because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
# Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
# [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
# MapReduce-MPI and KMR implement MapReduce for distributed memory systems.&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the mapreduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file . They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
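&lt;br /&gt;
The numbered steps can also be mimicked sequentially in a few lines of Python. The sketch below is an illustration only (the real system distributes the ''M'' map tasks and ''R'' reduce tasks across machines and buffers intermediate data on local disks): the input is split into M pieces, intermediate pairs are assigned to reduce partitions with hash(key) mod R, and each partition is sorted before reduction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sequential Python illustration of the execution flow: M map splits, R reduce partitions.&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
M, R = 4, 2&lt;br /&gt;
&lt;br /&gt;
def map_func(split):&lt;br /&gt;
    for word in split.split():&lt;br /&gt;
        yield (word, 1)&lt;br /&gt;
&lt;br /&gt;
def reduce_func(key, values):&lt;br /&gt;
    return (key, sum(values))&lt;br /&gt;
&lt;br /&gt;
text = 'to be or not to be that is the question'&lt;br /&gt;
words = text.split()&lt;br /&gt;
splits = [' '.join(words[i::M]) for i in range(M)]   # step 1: split the input into M pieces&lt;br /&gt;
&lt;br /&gt;
partitions = [defaultdict(list) for _ in range(R)]   # R intermediate regions&lt;br /&gt;
for split in splits:                                 # steps 3-4: map, then partition by key&lt;br /&gt;
    for key, value in map_func(split):&lt;br /&gt;
        partitions[hash(key) % R][key].append(value)&lt;br /&gt;
&lt;br /&gt;
outputs = []                                         # steps 5-6: sort each partition, then reduce&lt;br /&gt;
for r in range(R):&lt;br /&gt;
    outputs.append([reduce_func(k, partitions[r][k]) for k in sorted(partitions[r])])&lt;br /&gt;
&lt;br /&gt;
print(outputs)                                       # step 8: one output list per reduce task&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;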
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementations of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on the way problems can be expressed. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published the papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same. The important thing to note here is that Apache made this framework open-source. This framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality.  The idea is that the network is slow and the data so large that it would take significantly longer to transfer the data over the network to a centralized processor than to bring the computation to the location of the data.  In some cases the data is so large that this is the only processing option.  Data is stored in Hadoop in the filesystem called the Hadoop Distributed File System (HDFS)&amp;lt;ref&amp;gt;http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html&amp;lt;/ref&amp;gt;.  MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
# Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
# The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
# JobTracker determines appropriate jobs based on how busy the TaskTracker is. &lt;br /&gt;
# TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
# The buffer is eventually flushed into two files. &lt;br /&gt;
# After all the MapTasks complete (all splits are done), the TaskTracker will notify the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
# When done, the JobTracker notifies TaskTracker to jump to reduce phase. This again follows same method where reduce task is forked. &lt;br /&gt;
# The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
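&lt;br /&gt;
As a concrete, hypothetical example of a job that runs through this flow, the word-count mapper and reducer below are written for Hadoop Streaming, which feeds input lines to the mapper on standard input and delivers the mapper output to the reducer already sorted by key. The file names are placeholders.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# mapper.py -- word-count mapper for Hadoop Streaming; reads lines from stdin and&lt;br /&gt;
# writes one tab-separated (word, 1) pair per line to stdout.&lt;br /&gt;
import sys&lt;br /&gt;
&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    for word in line.split():&lt;br /&gt;
        print(word + '\t' + '1')&lt;br /&gt;
&lt;br /&gt;
# reducer.py -- word-count reducer; Hadoop Streaming delivers the mapper output&lt;br /&gt;
# sorted by key, so equal words arrive consecutively.&lt;br /&gt;
import sys&lt;br /&gt;
&lt;br /&gt;
current_word, current_count = None, 0&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    word, count = line.rstrip('\n').split('\t')&lt;br /&gt;
    if word == current_word:&lt;br /&gt;
        current_count += int(count)&lt;br /&gt;
    else:&lt;br /&gt;
        if current_word is not None:&lt;br /&gt;
            print(current_word + '\t' + str(current_count))&lt;br /&gt;
        current_word, current_count = word, int(count)&lt;br /&gt;
if current_word is not None:&lt;br /&gt;
    print(current_word + '\t' + str(current_count))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Such scripts are typically submitted with the Hadoop Streaming jar, passing the mapper, reducer, input path, and output path as options; the JobTracker (or YARN, described below) then schedules the map and reduce tasks along the lines of the numbered steps above.&lt;br /&gt;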
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product did show some pain points in the MRV1 implementation.  Notably, heavy processing&lt;br /&gt;
load would cause the JobTracker to become a large bottleneck.  In order to help remove this bottleneck, YARN was implemented.  YARN is an application framework that solely does&lt;br /&gt;
resource management for Hadoop clusters.  Not only can MapReduce jobs be run, but other in-cluster frameworks can also be placed under YARN resource management,&lt;br /&gt;
allowing resources to be properly allocated across the cluster.  YARN at its simplest is the separation of the work that the JobTracker would do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports.  However, the execution of the job changes significantly.  YARN does work in units called containers.&lt;br /&gt;
Containers represent a unit of work that can be done on a cluster.  Upon job submission, the ResourceManager allocates a container for the ApplicationMaster.  This ApplicationMaster&lt;br /&gt;
runs on a DataNode in the cluster.  To run the ApplicationMaster, the ResourceManager requests that a NodeManager launch it in that container.  The ApplicationMaster then&lt;br /&gt;
determines, based on the input splits, the number of map tasks to create.  Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of data and available resources, the ResourceManager decides where to run the map tasks.  The ApplicationMaster then asks the NodeManagers on the assigned nodes to&lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce.  Spark takes greater advantage of available memory on the nodes in the cluster and will start job execution immediately, whereas MapReduce waits until the code has been distributed to all the nodes.  Spark also adds a number of things into the framework, such as streaming and ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it.  This allows data to be read into memory on a cluster and iterations of an algorithm to run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
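&lt;br /&gt;
As a rough illustration of the difference in programming style, a word count in Spark's Python API can be written as a short chain of transformations. This is only a sketch: it assumes a local Spark installation, and the input and output paths are placeholders.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Hypothetical PySpark word count; assumes a local Spark installation and an input file.&lt;br /&gt;
from pyspark import SparkContext&lt;br /&gt;
&lt;br /&gt;
sc = SparkContext('local[*]', 'wordcount')&lt;br /&gt;
&lt;br /&gt;
counts = (sc.textFile('input.txt')                 # read the input as an RDD of lines&lt;br /&gt;
            .flatMap(lambda line: line.split())    # emit one word per record&lt;br /&gt;
            .map(lambda word: (word, 1))           # produce (word, 1) pairs&lt;br /&gt;
            .reduceByKey(lambda a, b: a + b))      # sum the counts for each word&lt;br /&gt;
&lt;br /&gt;
counts.saveAsTextFile('counts_out')                # one output file per partition&lt;br /&gt;
sc.stop()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here reduceByKey aggregates values on the map side before the shuffle, playing a role similar to a combiner, and because the data set stays in memory across transformations, iterative algorithms avoid re-reading the input from disk.&lt;br /&gt;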
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop and a directed-acyclic-graph (DAG) engine.  Based on the Microsoft Dryad paper, the DAG execution engine allows applications to express tasks as nodes in a graph.  Like Spark, it offers gains in execution speed and attempts to make more efficient use of available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to make a more efficient computation engine that can sit on top of Apache Hadoop.  Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared memory system can show a significant increase in speed over cluster/disk-based systems due to little to no I/O overhead. However, a few challenges present themselves in the shared memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Intermediate output is stored in memory, which requires a large amount of memory for large problem sets.&lt;br /&gt;
* The ratio of key-value pairs relative to the number of distinct pairs highly affects performance.&lt;br /&gt;
* The execution time of the reduce phase is affected by task queue overhead.&lt;br /&gt;
* The size and shape of the data structure used to store the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
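&lt;br /&gt;
This flow can be approximated in a few lines of Python using a process pool. The sketch is only an analogue for illustration, not the Phoenix API: the pool plays the role of the scheduler, pool workers run the Map and Reduce tasks, and a dictionary stands in for the Partition step that sends all values of a key to the same unit.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# A rough shared-memory analogue of the Phoenix flow using a process pool (not the Phoenix API).&lt;br /&gt;
from multiprocessing import Pool&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def map_task(split):&lt;br /&gt;
    # Map task: processes one split of the input, emits intermediate (key, value) pairs.&lt;br /&gt;
    return [(word, 1) for word in split.split()]&lt;br /&gt;
&lt;br /&gt;
def reduce_task(item):&lt;br /&gt;
    # Reduce task: all values for one key are processed by a single task.&lt;br /&gt;
    key, values = item&lt;br /&gt;
    return (key, sum(values))&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    splits = ['map reduce on shared memory', 'shared memory map reduce']&lt;br /&gt;
    with Pool() as pool:&lt;br /&gt;
        intermediate = defaultdict(list)&lt;br /&gt;
        for pairs in pool.map(map_task, splits):             # Map stage, tasks run in parallel&lt;br /&gt;
            for key, value in pairs:&lt;br /&gt;
                intermediate[key].append(value)              # Partition: same key, same unit&lt;br /&gt;
        final = pool.map(reduce_task, intermediate.items())  # Reduce stage&lt;br /&gt;
    print(sorted(final))                                     # merged output, sorted by key&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The real Phoenix runtime additionally spawns one worker per core, sizes and resizes the intermediate buffers, and recovers from faults, none of which is modelled here.&lt;br /&gt;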
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Key-value storage in shared memory is inefficient, since containers must provide fast lookup and retrieval over a potentially large data set, all the while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than the memory traffic. Combiners fail to reduce the memory allocation pressure, since generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables the user-implemented optimizations described above. However, it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
===Challenges===&lt;br /&gt;
&lt;br /&gt;
#Susceptible to network outages.&lt;br /&gt;
#Node failure has to be handled and work rescheduled.&lt;br /&gt;
#There has to be a system that knows of all the workers and where they are.&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI. &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt;  Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++.  The major downfall of this implementation is a lack of fault tolerance: the implementation's MPI library does not reliably detect machines that are no longer part of the cluster.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code, where the functions process keys and values like standard MapReduce implementations.  The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language they've built called OINK.&lt;br /&gt;
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI.  KMR is more robust than MR-MPI, at the cost of making it slightly more complex to build a MapReduce application.  There isn't much that is very distinct about this implementation: it provides the ability to assign functions for mapping, for the shuffle, and for reducing.&lt;br /&gt;
&lt;br /&gt;
'''Examples'''&lt;br /&gt;
&lt;br /&gt;
See [http://mt.aics.riken.jp/kmr/docs/kmr-1.5/html/index.html#overview KMR Overview]&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are the following three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid any conflict between concurrent writes, Mars has a lock-free scheme with low runtime overhead on the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
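&lt;br /&gt;
The two-step design can be illustrated with a small sketch (plain Python, not Mars code): a first pass only counts each task's output size, an exclusive prefix sum over those counts gives every task a private, non-overlapping write offset, and a second pass then writes the results without atomic operations or locks.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustration of the lock-free two-step output scheme (not Mars code).&lt;br /&gt;
inputs = ['gpu map reduce', 'lock free output', 'two step emit']&lt;br /&gt;
&lt;br /&gt;
counts = [len(chunk.split()) for chunk in inputs]        # pass 1: the MAP_COUNT analogue&lt;br /&gt;
&lt;br /&gt;
offsets, total = [], 0                                   # exclusive prefix sum of the counts&lt;br /&gt;
for c in counts:&lt;br /&gt;
    offsets.append(total)&lt;br /&gt;
    total += c&lt;br /&gt;
&lt;br /&gt;
output = [None] * total                                  # one buffer, allocated exactly once&lt;br /&gt;
for i, chunk in enumerate(inputs):                       # pass 2: the MAP analogue&lt;br /&gt;
    for j, word in enumerate(chunk.split()):&lt;br /&gt;
        output[offsets[i] + j] = (word, 1)               # each task writes only its own region&lt;br /&gt;
&lt;br /&gt;
print(output)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;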
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files or HTTP pages. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document possesses and then have the reducer add the counts up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms for its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
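&lt;br /&gt;
In a concrete implementation, the per-document counting of the second variant can be done with an ordinary dictionary inside the mapper. The hypothetical Hadoop Streaming mapper below aggregates counts for its whole input split before emitting anything; the combiner variant is instead configured in the framework, for example by reusing the reducer as a combiner when the operation is associative and commutative.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# mapper.py with in-mapper combining: count words for the whole input split before&lt;br /&gt;
# emitting, so far fewer (word, count) pairs cross the network (hypothetical sketch).&lt;br /&gt;
import sys&lt;br /&gt;
from collections import Counter&lt;br /&gt;
&lt;br /&gt;
counts = Counter()&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    counts.update(line.split())&lt;br /&gt;
&lt;br /&gt;
for word, count in counts.items():&lt;br /&gt;
    print(word + '\t' + str(count))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;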
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts and then combined together for a final result is a standard Map-Reduce problem. The problem is split into a set of specifications and specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation and then emits the results.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
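&lt;br /&gt;
A minimal runnable sketch of this pattern is shown below (Python, with a sum of squares standing in for the arbitrary computation performed for each specification).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Distributed task execution in miniature: each specification describes one&lt;br /&gt;
# independent slice of a computation; the reduce step combines the partial results.&lt;br /&gt;
from multiprocessing import Pool&lt;br /&gt;
&lt;br /&gt;
def map_task(spec):&lt;br /&gt;
    start, count = spec&lt;br /&gt;
    return sum(i * i for i in range(start, start + count))   # the per-spec computation&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    specs = [(0, 250), (250, 250), (500, 250), (750, 250)]    # the problem split into equal parts&lt;br /&gt;
    with Pool() as pool:&lt;br /&gt;
        partial_results = pool.map(map_task, specs)           # each mapper emits one result&lt;br /&gt;
    print(sum(partial_results))                               # reduce: combine into the final result&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;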
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is calculating a state for each node using the properties of its neighbors. This state can be the distance between nodes, characteristics of density and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors. Each neighbor will then update its state based on the messages it received. The iteration is terminated based on some condition like a fixed number of iterations or minor changes in state. The Mapper is responsible for emitting messages for each node, using the adjacent node ID as a key. The Reducer is responsible for recomputing the state and rewriting the node with the new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be fulfilled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
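&lt;br /&gt;
A single iteration of this breadth-first-search instantiation can be sketched in plain Python (an illustration only; the real pattern runs one MapReduce job per iteration until the distances stop changing).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# One MapReduce iteration of the breadth-first-search instantiation above.&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
INF = float('inf')&lt;br /&gt;
graph = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}   # outgoing relations&lt;br /&gt;
dist = {'a': 0, 'b': INF, 'c': INF, 'd': INF}                # state: distance from the source 'a'&lt;br /&gt;
&lt;br /&gt;
def bfs_iteration(graph, dist):&lt;br /&gt;
    grouped = defaultdict(list)&lt;br /&gt;
    for n in graph:                                   # map: re-emit the node and send messages&lt;br /&gt;
        grouped[n].append(dist[n])&lt;br /&gt;
        for m in graph[n]:&lt;br /&gt;
            grouped[m].append(dist[n] + 1)            # getMessage(N) is N.State + 1&lt;br /&gt;
    new_dist = {}&lt;br /&gt;
    for m, values in grouped.items():                 # reduce: recompute each node's state&lt;br /&gt;
        new_dist[m] = min(values)                     # calculateState keeps the smallest distance&lt;br /&gt;
    return new_dist&lt;br /&gt;
&lt;br /&gt;
dist = bfs_iteration(graph, dist)&lt;br /&gt;
print(dist)   # after one iteration: a is 0, b and c are 1, d is still INF&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;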
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
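&lt;br /&gt;
As one concrete instance, the inverted index can be written in the same single-process Python style as the earlier word-count sketch (illustrative only).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Inverted index in the same single-process style as the earlier word-count sketch.&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
documents = {1: 'map reduce on clusters', 2: 'reduce on gpus', 3: 'map on shared memory'}&lt;br /&gt;
&lt;br /&gt;
index = defaultdict(list)&lt;br /&gt;
for doc_id, text in documents.items():            # map: emit (word, document ID) pairs&lt;br /&gt;
    for word in set(text.split()):&lt;br /&gt;
        index[word].append(doc_id)&lt;br /&gt;
&lt;br /&gt;
inverted = {word: sorted(ids) for word, ids in index.items()}   # reduce: sort the document IDs&lt;br /&gt;
print(inverted)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;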
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer can provide a simple, functional expression of the algorithm and leave parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors. Phoenix automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written with the Pthreads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort. The difficulty is even greater for complex and performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces. The runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded the patent for MapReduce, but it can be argued that this technology is similar to many other already existing ones. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic skeletons are a high-level programming model for parallel and distributed computing, and skeleton framework libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers. Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop and includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93988</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93988"/>
		<updated>2015-02-17T02:18:26Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Tez */ clearer and now with more words!&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Clustered Systems ===&lt;br /&gt;
The de facto standard for using MapReduce is in a clustered environment of many separate machines. The purpose of MapReduce is to transform a large set of data into another large set of data and possibly reduce the output. The cost of clustered environments is the latency of communication. This leaves clustered environments best suited for tasks where immediate feedback isn't necessary. Log analysis, data transformation and other types of problems are solved using the clustered environment implementations.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system. The communication between the MapReduce nodes is a significant overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient. Problem sets that are expressed in key-value pairs best fit into the shared-memory model.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited for the use of a MapReduce application on distributed memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. The learning phase is where data (vectors) is loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data will be compared to each node, with the winning node being the one that most matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited for the MapReduce structure on a distributed memory machine. This is because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
# Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
# [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
# MapReduce-MPI and KMR implement MapReduce for distributed memory systems.&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and ''R'' reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
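&lt;br /&gt;
The self-contained C++ sketch below walks the same flow in miniature inside a single process: the input is split into ''M'' pieces, mapped to &amp;lt;word, 1&amp;gt; pairs, partitioned with hash(key) mod R, grouped by key, and reduced. The names and the single-process structure are illustrative only and are not Google's implementation.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Single-process word count that mirrors the flow above in miniature. Illustrative only;&lt;br /&gt;
// a real MapReduce run distributes these phases across many machines.&lt;br /&gt;
#include &amp;lt;functional&amp;gt;&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;map&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    std::vector&amp;lt;std::string&amp;gt; splits;                       // M = 3 input splits&lt;br /&gt;
    splits.push_back(&amp;quot;the quick brown fox&amp;quot;);&lt;br /&gt;
    splits.push_back(&amp;quot;jumps over&amp;quot;);&lt;br /&gt;
    splits.push_back(&amp;quot;the lazy dog the end&amp;quot;);&lt;br /&gt;
    const size_t R = 2;                                      // number of reduce partitions&lt;br /&gt;
    std::vector&amp;lt;std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt; &amp;gt; &amp;gt; buffered(R);&lt;br /&gt;
&lt;br /&gt;
    // Map phase: parse each split and emit a &amp;lt;word, 1&amp;gt; pair into partition hash(key) mod R.&lt;br /&gt;
    for (size_t m = 0; m &amp;lt; splits.size(); ++m) {&lt;br /&gt;
        std::istringstream in(splits[m]);&lt;br /&gt;
        std::string word;&lt;br /&gt;
        while (in &amp;gt;&amp;gt; word)&lt;br /&gt;
            buffered[std::hash&amp;lt;std::string&amp;gt;()(word) % R].push_back(std::make_pair(word, 1));&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
    // Reduce phase: per partition, group the buffered pairs by key (std::map sorts them), then sum.&lt;br /&gt;
    for (size_t r = 0; r &amp;lt; R; ++r) {&lt;br /&gt;
        std::map&amp;lt;std::string, int&amp;gt; grouped;&lt;br /&gt;
        for (size_t i = 0; i &amp;lt; buffered[r].size(); ++i)&lt;br /&gt;
            grouped[buffered[r][i].first] += buffered[r][i].second;&lt;br /&gt;
        for (std::map&amp;lt;std::string, int&amp;gt;::const_iterator it = grouped.begin(); it != grouped.end(); ++it)&lt;br /&gt;
            std::cout &amp;lt;&amp;lt; &amp;quot;partition &amp;quot; &amp;lt;&amp;lt; r &amp;lt;&amp;lt; &amp;quot;: &amp;quot; &amp;lt;&amp;lt; it-&amp;gt;first &amp;lt;&amp;lt; &amp;quot; &amp;quot; &amp;lt;&amp;lt; it-&amp;gt;second &amp;lt;&amp;lt; std::endl;&lt;br /&gt;
    }&lt;br /&gt;
    return 0;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;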
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
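&lt;br /&gt;
A rough C++ sketch of this bookkeeping is shown below; the field names are assumptions made for illustration and do not come from Google's code.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Illustrative per-task records a master might keep: task state, the worker the task is on,&lt;br /&gt;
// and (for completed map tasks) the locations and sizes of the R intermediate regions.&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
enum TaskState { IDLE, IN_PROGRESS, COMPLETED };&lt;br /&gt;
&lt;br /&gt;
struct MapTaskInfo {&lt;br /&gt;
    TaskState state;&lt;br /&gt;
    std::string worker;                          // identity of the worker machine (non-idle tasks)&lt;br /&gt;
    std::vector&amp;lt;std::string&amp;gt; region_files;        // R intermediate file locations, filled on completion&lt;br /&gt;
    std::vector&amp;lt;std::size_t&amp;gt; region_sizes;        // sizes of those R regions, pushed to reduce workers&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
struct ReduceTaskInfo {&lt;br /&gt;
    TaskState state;&lt;br /&gt;
    std::string worker;&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;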
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
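&lt;br /&gt;
The sketch below illustrates the ping/timeout idea in C++; the data structures and the timeout policy are simplifications for exposition, not the actual library.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Illustrative failure handling: if a worker misses its ping deadline, its completed map&lt;br /&gt;
// tasks (whose output lived on its local disk) and its in-progress tasks are returned to&lt;br /&gt;
// the idle pool so the master can reschedule them on other workers.&lt;br /&gt;
#include &amp;lt;chrono&amp;gt;&lt;br /&gt;
#include &amp;lt;map&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
struct WorkerStatus {&lt;br /&gt;
    std::chrono::steady_clock::time_point last_response;&lt;br /&gt;
    std::vector&amp;lt;int&amp;gt; completed_map_tasks;&lt;br /&gt;
    std::vector&amp;lt;int&amp;gt; in_progress_tasks;&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
void reap_failed_workers(std::map&amp;lt;std::string, WorkerStatus&amp;gt;&amp;amp; workers,&lt;br /&gt;
                         std::vector&amp;lt;int&amp;gt;&amp;amp; idle_tasks,&lt;br /&gt;
                         std::chrono::seconds timeout) {&lt;br /&gt;
    std::chrono::steady_clock::time_point now = std::chrono::steady_clock::now();&lt;br /&gt;
    for (std::map&amp;lt;std::string, WorkerStatus&amp;gt;::iterator it = workers.begin(); it != workers.end(); ++it) {&lt;br /&gt;
        WorkerStatus&amp;amp; w = it-&amp;gt;second;&lt;br /&gt;
        if (now - w.last_response &amp;lt;= timeout) continue;        // worker is still healthy&lt;br /&gt;
        idle_tasks.insert(idle_tasks.end(), w.completed_map_tasks.begin(), w.completed_map_tasks.end());&lt;br /&gt;
        idle_tasks.insert(idle_tasks.end(), w.in_progress_tasks.begin(), w.in_progress_tasks.end());&lt;br /&gt;
        w.completed_map_tasks.clear();&lt;br /&gt;
        w.in_progress_tasks.clear();&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;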
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementations of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on the way applications can be expressed. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the framework. The important thing to note here is that Apache made this framework open source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality: the network is slow and the data so large that it would take significantly longer to transfer the data over the network to a centralized processor than to bring the computation to the location of the data. In some cases the data is so large that this is the only processing option. Data in Hadoop is stored in a filesystem called the Hadoop Distributed File System (HDFS)&amp;lt;ref&amp;gt;http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html&amp;lt;/ref&amp;gt;. MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
# The client program uploads files to a Hadoop Distributed File System (HDFS) location and notifies the JobTracker, which in turn returns the Job ID to the client. &lt;br /&gt;
# The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
# The JobTracker determines appropriate jobs based on how busy each TaskTracker is. &lt;br /&gt;
# The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function, filling a buffer with key/value pairs until it is full. &lt;br /&gt;
# The buffer is eventually flushed into two files. &lt;br /&gt;
# After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
# When done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This follows the same method, where a ReduceTask is forked. &lt;br /&gt;
# The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
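&lt;br /&gt;
The pull model itself can be sketched as follows. This is conceptual C++ only (Hadoop itself is Java) and ignores data locality, heartbeats and load, so treat every name as an assumption made for illustration.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Conceptual sketch of the pull model: TaskTrackers poll the JobTracker for work&lt;br /&gt;
// instead of the JobTracker pushing tasks to them.&lt;br /&gt;
#include &amp;lt;mutex&amp;gt;&lt;br /&gt;
#include &amp;lt;queue&amp;gt;&lt;br /&gt;
&lt;br /&gt;
struct Task { int id; bool is_map; };&lt;br /&gt;
&lt;br /&gt;
class JobTracker {&lt;br /&gt;
public:&lt;br /&gt;
    void add_task(const Task&amp;amp; t) {&lt;br /&gt;
        std::lock_guard&amp;lt;std::mutex&amp;gt; guard(lock_);&lt;br /&gt;
        pending_.push(t);&lt;br /&gt;
    }&lt;br /&gt;
    // Called from a TaskTracker poll; returns true and fills 'out' if work is pending.&lt;br /&gt;
    bool poll(Task&amp;amp; out) {&lt;br /&gt;
        std::lock_guard&amp;lt;std::mutex&amp;gt; guard(lock_);&lt;br /&gt;
        if (pending_.empty()) return false;&lt;br /&gt;
        out = pending_.front();&lt;br /&gt;
        pending_.pop();&lt;br /&gt;
        return true;&lt;br /&gt;
    }&lt;br /&gt;
private:&lt;br /&gt;
    std::mutex lock_;&lt;br /&gt;
    std::queue&amp;lt;Task&amp;gt; pending_;&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;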
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points. Notably, a heavy processing load would make the JobTracker a large bottleneck. To help remove this bottleneck, YARN was implemented. YARN is an application framework that solely does resource management for Hadoop clusters: not only can MapReduce jobs run under it, but other in-cluster frameworks can also be placed under YARN resource management, allowing resources to be allocated properly across the cluster. YARN, at its simplest, is the separation of the work that the JobTracker used to do into two new processes: the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports. However, the execution of the job changes significantly. YARN does work in units called containers, which represent a unit of work that can be done on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a DataNode in the cluster; the ResourceManager requests that the NodeManager on that node launch the ApplicationMaster in the container. The ApplicationMaster then determines, based on the input splits, the number of map tasks to create. Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager. Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster asks the NodeManagers on the assigned nodes to start them.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that removes some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until the code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. This allows data to be read into memory on a cluster and iterations of an algorithm to run over the same in-memory data instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop and, like Spark, a directed-acyclic-graph (DAG) engine. Based on the Microsoft Dryad paper, the DAG execution engine allows applications to express tasks as nodes in a graph. Like Spark, it offers gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt at a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared-memory system can show a significant increase in speed over cluster/disk-based systems due to little to no I/O overhead. However, a few challenges present themselves in the shared-memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Intermediate output is stored in memory, requiring a large amount of it for large problem sets.&lt;br /&gt;
* The ratio of total key-value pairs to distinct pairs strongly affects performance.&lt;br /&gt;
* The execution time of the reduce phase is affected by task-queue overhead.&lt;br /&gt;
* The size and shape of the data structure used to store the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, consisting of two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
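&lt;br /&gt;
A rough C/C++ sketch of such a function-pointer configuration block is given below purely for orientation; the field names here are assumptions, and the actual definition of ''scheduler_args_t'' is in the linked MapReduce header file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/* Illustrative configuration block in the spirit of scheduler_args_t: the user hands the&lt;br /&gt;
   runtime the input buffer plus pointers to the Splitter, Map, Partition, key-comparison&lt;br /&gt;
   and Reduce functions. Every field name here is invented for this sketch. */&lt;br /&gt;
#include &amp;lt;stddef.h&amp;gt;&lt;br /&gt;
&lt;br /&gt;
typedef struct {&lt;br /&gt;
    void   *input_data;                                    /* raw input buffer              */&lt;br /&gt;
    size_t  data_size;                                     /* input size in bytes           */&lt;br /&gt;
    void   *output_data;                                   /* user-allocated output buffer  */&lt;br /&gt;
    int   (*splitter)(void *data, size_t unit, void **out_chunk);            /* carve map units */&lt;br /&gt;
    void  (*map)(void *chunk);                                               /* required Map    */&lt;br /&gt;
    int   (*partition)(size_t num_reduce_tasks, void *key, size_t key_size); /* key routing     */&lt;br /&gt;
    int   (*key_cmp)(const void *key_a, const void *key_b);                  /* key comparison  */&lt;br /&gt;
    void  (*reduce)(void *key, void **vals, size_t val_count);               /* optional Reduce */&lt;br /&gt;
} example_scheduler_args_t;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;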
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of the user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
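&lt;br /&gt;
One simple way to picture the dynamic assignment of Map or Reduce tasks to worker threads is a shared atomic counter, as in the C++ sketch below; this illustrates the idea only and is not Phoenix's actual scheduler.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Worker threads repeatedly claim the next unprocessed task index from an atomic counter,&lt;br /&gt;
// which naturally balances uneven task costs across the threads.&lt;br /&gt;
#include &amp;lt;atomic&amp;gt;&lt;br /&gt;
#include &amp;lt;thread&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
void run_stage(int num_tasks, int num_workers, void (*task_fn)(int)) {&lt;br /&gt;
    std::atomic&amp;lt;int&amp;gt; next_task(0);&lt;br /&gt;
    std::vector&amp;lt;std::thread&amp;gt; workers;&lt;br /&gt;
    for (int w = 0; w &amp;lt; num_workers; ++w) {&lt;br /&gt;
        workers.push_back(std::thread([&amp;amp;next_task, num_tasks, task_fn]() {&lt;br /&gt;
            for (int t = next_task.fetch_add(1); t &amp;lt; num_tasks; t = next_task.fetch_add(1))&lt;br /&gt;
                task_fn(t);                     // run one Map (or Reduce) task&lt;br /&gt;
        }));&lt;br /&gt;
    }&lt;br /&gt;
    for (int i = 0; i &amp;lt; num_workers; ++i)&lt;br /&gt;
        workers[i].join();                      // barrier: the stage ends only when all tasks finish&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;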
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointers are manipulated rather than the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
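&lt;br /&gt;
The intermediate buffers can be pictured as one keyed container per worker, as in the small C++ sketch below; this mirrors the behaviour described above (values for the same key land in the same buffer, kept in key order) but is not Phoenix's actual data structure.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Per-worker intermediate buffer: emit_intermediate appends a value under its key, and the&lt;br /&gt;
// ordered map keeps keys sorted, which helps the later per-key grouping and final merge.&lt;br /&gt;
#include &amp;lt;map&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
struct WorkerBuffer {&lt;br /&gt;
    std::map&amp;lt;std::string, std::vector&amp;lt;int&amp;gt; &amp;gt; by_key;   // all values for a key stay together&lt;br /&gt;
&lt;br /&gt;
    void emit_intermediate(const std::string&amp;amp; key, int value) {&lt;br /&gt;
        by_key[key].push_back(value);&lt;br /&gt;
    }&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;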
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. Combiners contribute to better data locality and lower memory-allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Inefficient key-value storage: because of the shared memory, containers must provide fast lookup and retrieval over a potentially large data set while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiner: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
===Challenges===&lt;br /&gt;
&lt;br /&gt;
#Susceptible to network outages.&lt;br /&gt;
#Node failure has to be handled and work rescheduled.&lt;br /&gt;
#There has to be a system that knows of all the workers and where they are.&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt; Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++. The major downfall of this implementation is a lack of fault tolerance: the implementation's MPI library does not detect machines that are no longer part of the cluster very well.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the functions process keys and values as in standard MapReduce implementations. The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language the authors built called OINK.&lt;br /&gt;
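&lt;br /&gt;
A sketch of what the ''mymap'' and ''myreduce'' callbacks referenced above might look like is given below; the exact parameter lists should be checked against the MR-MPI documentation, so treat the signatures and the stand-in ''KeyValue'' declaration as assumptions made for illustration.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Callback shapes assumed for illustration; in a real build these use the KeyValue class&lt;br /&gt;
// from the MR-MPI headers and its add() method to emit pairs.&lt;br /&gt;
class KeyValue;   // stand-in for the MR-MPI KeyValue type&lt;br /&gt;
&lt;br /&gt;
// Called once per map task: e.g., read file number 'itask' and emit a &amp;lt;word, 1&amp;gt; pair&lt;br /&gt;
// for every word found, via the kv object.&lt;br /&gt;
void mymap(int itask, KeyValue *kv, void *ptr) {&lt;br /&gt;
    // open input file #itask, tokenize it, emit one pair per word&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Called once per unique key after collate(): all values for that key arrive together,&lt;br /&gt;
// so for word counting the answer is simply nvalues.&lt;br /&gt;
void myreduce(char *key, int keybytes, char *multivalue,&lt;br /&gt;
              int nvalues, int *valuebytes, KeyValue *kv, void *ptr) {&lt;br /&gt;
    // emit &amp;lt;key, nvalues&amp;gt; as the word's final count&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;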
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI. KMR is more robust than MR-MPI, at the cost of being slightly more complex for building a MapReduce application. There isn't much that is very distinct about this implementation: it provides the ability to assign functions for the map, shuffle, and reduce phases.&lt;br /&gt;
&lt;br /&gt;
'''Examples'''&lt;br /&gt;
&lt;br /&gt;
See [http://mt.aics.riken.jp/kmr/docs/kmr-1.5/html/index.html#overview KMR Overview]&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to the architectural differences, there are the following three technical challenges in implementing the MapReduce framework on the GPU: &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid any conflict between concurrent writes, Mars has a lock-free scheme with low runtime overhead on the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
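&lt;br /&gt;
As an illustration of the two-step design, the word-count Map functions below first report how much output they will produce (MAP_COUNT) and then emit the same pairs (MAP), using the APIs listed above. The tokenization is simplified, and the exact emit conventions should be checked against the Mars sources; this is a sketch, not the framework's sample code.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Declarations of the system-provided emit functions listed above.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
void EMIT_INTERMEDIATE(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
// First pass: report the size of every &amp;lt;word, 1&amp;gt; pair MAP will emit, so the runtime&lt;br /&gt;
// can pre-allocate exact output space without atomic operations.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize) {&lt;br /&gt;
    const char *s = (const char *) val;&lt;br /&gt;
    for (int i = 0; i &amp;lt; valSize; ) {&lt;br /&gt;
        while (i &amp;lt; valSize &amp;amp;&amp;amp; s[i] == ' ') ++i;      // skip separators&lt;br /&gt;
        int start = i;&lt;br /&gt;
        while (i &amp;lt; valSize &amp;amp;&amp;amp; s[i] != ' ') ++i;      // scan one word&lt;br /&gt;
        if (i &amp;gt; start) EMIT_INTERMEDIATE_COUNT(i - start, (int) sizeof(int));&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Second pass: emit the actual &amp;lt;word, 1&amp;gt; pairs into the pre-sized buffers.&lt;br /&gt;
void MAP(void *key, void *val, int keySize, int valSize) {&lt;br /&gt;
    static int one = 1;&lt;br /&gt;
    const char *s = (const char *) val;&lt;br /&gt;
    for (int i = 0; i &amp;lt; valSize; ) {&lt;br /&gt;
        while (i &amp;lt; valSize &amp;amp;&amp;amp; s[i] == ' ') ++i;&lt;br /&gt;
        int start = i;&lt;br /&gt;
        while (i &amp;lt; valSize &amp;amp;&amp;amp; s[i] != ' ') ++i;&lt;br /&gt;
        if (i &amp;gt; start) EMIT_INTERMEDIATE((void *) (s + start), &amp;amp;one, i - start, (int) sizeof(int));&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;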
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files or HTTP pages. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to have the mapper count the terms within its document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, with the partial results then combined into a final result, is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation and emits the result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a density characteristic, and so on. Conceptually, MapReduce jobs are performed iteratively: on each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration is terminated by some condition such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting a message for each node, using the adjacent node's ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with the new state based on the messages from its incoming nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be handled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, defining these functions as follows yields a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
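&lt;br /&gt;
To make one of these concrete, the inverted-index example can be sketched as a pair of functions; the in-memory style and the names below are illustrative only.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Illustrative in-memory inverted index: map emits a &amp;lt;word, docID&amp;gt; pair per word, and&lt;br /&gt;
// reduce turns all docIDs for one word into a sorted, de-duplicated posting list.&lt;br /&gt;
#include &amp;lt;set&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;utility&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Map: called once per document.&lt;br /&gt;
void map_document(int doc_id, const std::string&amp;amp; text,&lt;br /&gt;
                  std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt; &amp;gt;&amp;amp; out) {&lt;br /&gt;
    std::istringstream in(text);&lt;br /&gt;
    std::string word;&lt;br /&gt;
    while (in &amp;gt;&amp;gt; word)&lt;br /&gt;
        out.push_back(std::make_pair(word, doc_id));&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Reduce: called once per word with every docID emitted for it.&lt;br /&gt;
std::vector&amp;lt;int&amp;gt; reduce_word(const std::vector&amp;lt;int&amp;gt;&amp;amp; doc_ids) {&lt;br /&gt;
    std::set&amp;lt;int&amp;gt; unique_sorted(doc_ids.begin(), doc_ids.end());&lt;br /&gt;
    return std::vector&amp;lt;int&amp;gt;(unique_sorted.begin(), unique_sorted.end());&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;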
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors. Phoenix automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written directly with the Pthreads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex and performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce framework for these applications. With the GPU-based framework, developers write their code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from them by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has attracted criticism as well. Google was awarded the patent for MapReduce, but it can be argued that the technology is similar to many that already existed. There are programming models similar to MapReduce, such as Algorithm Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithm Skeletons are a high-level programming model for parallel and distributed computing; its framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop and includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93986</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93986"/>
		<updated>2015-02-17T02:12:49Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Apache’s Hadoop MapReduce */ hdfs ref and explaination&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in Input&lt;br /&gt;
       EmitIntermediate(w,1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Clustered Systems ===&lt;br /&gt;
The de facto standard for using MapReduce is a clustered environment of many separate machines. The purpose of MapReduce is to transform a large set of data into another large set of data and possibly reduce the output. The cost of a clustered environment is the latency of communication, which leaves clusters best suited for tasks where immediate feedback isn't necessary. Log analysis, data transformation and other such problems are solved using the clustered-environment implementations.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: the communication between the MapReduce nodes is a significant overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully distributed MapReduce cluster like Hadoop is inefficient. Problem sets that are expressed in key-value pairs best fit the shared-memory model.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited to a MapReduce application on distributed-memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) is loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data is compared to each node, with the winning node being the one that most closely matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed-memory machine, because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
# Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
# [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
# MapReduce-MPI and KMR implement MapReduce for distributed-memory systems.&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the mapreduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file . They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
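&lt;br /&gt;
These re-scheduling rules can be summarized in a short, hedged sketch (illustrative Python, reusing the same task fields as the master sketch above):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustrative sketch of the worker-failure handling described above.&lt;br /&gt;
MAP, IDLE, IN_PROGRESS = 'map', 'idle', 'in-progress'&lt;br /&gt;
&lt;br /&gt;
def handle_worker_failure(tasks, failed_worker):&lt;br /&gt;
    for task in tasks:&lt;br /&gt;
        if task.worker != failed_worker:&lt;br /&gt;
            continue&lt;br /&gt;
        if task.kind == MAP:&lt;br /&gt;
            # completed or in-progress map tasks become idle again, because&lt;br /&gt;
            # their output lives on the failed machine's local disk&lt;br /&gt;
            task.state, task.worker = IDLE, None&lt;br /&gt;
        elif task.state == IN_PROGRESS:&lt;br /&gt;
            # in-progress reduce tasks are rescheduled; completed reduce&lt;br /&gt;
            # output is already in the global file system and is kept&lt;br /&gt;
            task.state, task.worker = IDLE, None&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;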
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# The implementation of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model constrains how applications can be expressed. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality: the network is slow and the data so large that it would take significantly longer to transfer the data over the network to a centralized processor than to bring the computation to the location of the data.  In some cases the data is so large that this is the only practical processing option.  Data is stored in Hadoop in a filesystem called the Hadoop Distributed File System (HDFS)&amp;lt;ref&amp;gt;http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html&amp;lt;/ref&amp;gt;.  MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of a job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; A hedged job-submission sketch follows the list. &lt;br /&gt;
# Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
# The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
# The JobTracker chooses appropriate tasks based on how busy each TaskTracker is. &lt;br /&gt;
# TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
# The buffer is eventually flushed into two files. &lt;br /&gt;
# After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
# When the map phase is done, the JobTracker notifies the TaskTrackers to start the reduce phase; reduce tasks are forked in the same way. &lt;br /&gt;
# The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
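&lt;br /&gt;
One common way to submit such a job, shown here as a hedged sketch, is Hadoop Streaming, where the map and reduce functions are ordinary scripts that read stdin and write stdout. The file names below are hypothetical and the exact streaming jar path depends on the installation.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Submitted roughly as (path to the streaming jar varies by installation):&lt;br /&gt;
#   hadoop jar hadoop-streaming.jar -input in -output out \&lt;br /&gt;
#          -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py&lt;br /&gt;
&lt;br /&gt;
# ---- mapper.py: emit one count per word on stdin ----&lt;br /&gt;
import sys&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    for word in line.split():&lt;br /&gt;
        print(word + '\t1')&lt;br /&gt;
&lt;br /&gt;
# ---- reducer.py: input arrives sorted by key, so counts for a word are adjacent ----&lt;br /&gt;
import sys&lt;br /&gt;
current, total = None, 0&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    word, count = line.rstrip('\n').split('\t')&lt;br /&gt;
    if word != current:&lt;br /&gt;
        if current is not None:&lt;br /&gt;
            print(current + '\t' + str(total))&lt;br /&gt;
        current, total = word, 0&lt;br /&gt;
    total += int(count)&lt;br /&gt;
if current is not None:&lt;br /&gt;
    print(current + '\t' + str(total))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;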
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 implementation.  Notably, heavy processing&lt;br /&gt;
load could make the JobTracker a large bottleneck.  In order to remove this bottleneck, YARN was implemented.  YARN is an application framework that solely performs&lt;br /&gt;
resource management for Hadoop clusters.  Not only can you run MapReduce jobs, but you can also place other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing you to properly allocate resources across your cluster.  YARN at its simplest is the separation of the work that the JobTracker would do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports.  However, the execution of the job changes significantly.  YARN does work in units called containers.&lt;br /&gt;
Containers represent a unit of work that can be done on a cluster.  Upon job submission, the ResourceManager allocates a container for the ApplicationMaster.  This ApplicationMaster &lt;br /&gt;
runs on a DataNode in the cluster.  To start the application, the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container.  The ApplicationMaster then &lt;br /&gt;
determines, based on the input splits, the number of map tasks to create.  Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of data and available resources, the ResourceManager decides where to run the map tasks.  The ApplicationMaster then asks the NodeManagers on the assigned nodes to  &lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce.  Spark takes greater advantage of available memory on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until code has been distributed to all the nodes.  Spark also adds a number of things to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it.  This allows data to be read into memory on a cluster and iterations of an algorithm to run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
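&lt;br /&gt;
As a small, hedged illustration of the programming style, the classic word count in Spark's Python API keeps intermediate data in memory as RDDs between stages; the input and output paths here are hypothetical.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal PySpark word-count sketch (assumes a working Spark installation).&lt;br /&gt;
from pyspark import SparkContext&lt;br /&gt;
&lt;br /&gt;
sc = SparkContext(appName='WordCountSketch')&lt;br /&gt;
counts = (sc.textFile('hdfs:///data/input')              # read splits&lt;br /&gt;
            .flatMap(lambda line: line.split())          # map: words&lt;br /&gt;
            .map(lambda word: (word, 1))                 # map: (word, 1) pairs&lt;br /&gt;
            .reduceByKey(lambda a, b: a + b))            # reduce: sum per word&lt;br /&gt;
counts.saveAsTextFile('hdfs:///data/output')&lt;br /&gt;
sc.stop()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;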
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop.  Tez is a DAG (directed-acyclic-graph) engine.  Based on the Microsoft Dryad paper, the DAG execution engine allows applications to represent tasks as nodes in a graph.  Like Spark it improves execution speed and attempts to make more efficient use of available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop.  Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared-memory system can show a significant increase in speed over cluster/disk-based systems because there is little to no I/O overhead. However, a few challenges present themselves in the shared-memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Intermediate output is stored in memory, requiring a large amount of it for large problem sets.&lt;br /&gt;
* The ratio of key-value pairs relative to the number of distinct pairs highly affects performance.&lt;br /&gt;
* Execution time of reduce phase is affected by task queue overhead.&lt;br /&gt;
* The size and shape of the data structure used to store the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, consisting of two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, it is ultimately the user's task to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
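&lt;br /&gt;
To make this data flow concrete, here is a small, hedged simulation in plain Python (not the Phoenix C/C++ API): each worker thread fills its own intermediate buffer, the buffers are grouped by key after the Map stage, and the Reduce outputs are merged into a final buffer sorted by key.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustrative shared-memory MapReduce sketch using Python threads.&lt;br /&gt;
# Phoenix itself is a C/C++ runtime built on POSIX threads.&lt;br /&gt;
import threading&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def run(splits, map_fn, reduce_fn, num_workers=4):&lt;br /&gt;
    buffers = [defaultdict(list) for _ in range(num_workers)]  # per-worker buffers&lt;br /&gt;
&lt;br /&gt;
    def map_worker(wid):&lt;br /&gt;
        for split in splits[wid::num_workers]:     # each worker gets a share of the splits&lt;br /&gt;
            for key, value in map_fn(split):&lt;br /&gt;
                buffers[wid][key].append(value)    # emit an intermediate pair&lt;br /&gt;
&lt;br /&gt;
    threads = [threading.Thread(target=map_worker, args=(w,)) for w in range(num_workers)]&lt;br /&gt;
    for t in threads:&lt;br /&gt;
        t.start()&lt;br /&gt;
    for t in threads:&lt;br /&gt;
        t.join()                                   # barrier: Map stage must finish first&lt;br /&gt;
&lt;br /&gt;
    # Partition: all values for a key must reach the same Reduce task&lt;br /&gt;
    merged = defaultdict(list)&lt;br /&gt;
    for buf in buffers:&lt;br /&gt;
        for key, values in buf.items():&lt;br /&gt;
            merged[key].extend(values)&lt;br /&gt;
&lt;br /&gt;
    # Reduce each key, then produce the final output sorted by key&lt;br /&gt;
    return sorted((key, reduce_fn(key, values)) for key, values in merged.items())&lt;br /&gt;
&lt;br /&gt;
# usage sketch: word count over a few input splits&lt;br /&gt;
result = run(['a b a', 'b c'],&lt;br /&gt;
             map_fn=lambda text: [(w, 1) for w in text.split()],&lt;br /&gt;
             reduce_fn=lambda key, values: sum(values))&lt;br /&gt;
# result == [('a', 2), ('b', 2), ('c', 1)]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;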
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, making a substantial number of applications scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Key-value storage is inefficient in shared memory, since containers must provide fast lookup and retrieval over a potentially large data set while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiner: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables the user-implemented optimizations described above, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code that deals with chunks. Second, if the user leverages the exposed chunk size to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
===Challenges===&lt;br /&gt;
&lt;br /&gt;
#Susceptible to network outages.&lt;br /&gt;
#Node failure has to be handled and work rescheduled.&lt;br /&gt;
#There has to be a system that knows of all the workers and where they are.&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI. &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt;  Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++.  The major downfall of this implementation is a lack of fault tolerance: the underlying MPI library does not reliably detect machines that are no longer part of the cluster.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code, where the functions process keys and values like standard MapReduce implementations.  The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language they've built called OINK.&lt;br /&gt;
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI.  KMR is more robust than MR-MPI, at the cost of being slightly more complex to build your MapReduce application with.  There isn't much that's very distinct about this implementation: it provides the ability to assign functions for the map, shuffle, and reduce steps.&lt;br /&gt;
&lt;br /&gt;
'''Examples'''&lt;br /&gt;
&lt;br /&gt;
See [http://mt.aics.riken.jp/kmr/docs/kmr-1.5/html/index.html#overview KMR Overview]&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, the following three technical challenges arise when implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
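&lt;br /&gt;
The count-then-output design can be illustrated with a short, hedged sketch (plain Python for clarity; Mars itself does this with GPU kernels): a first pass records each thread's output size, a prefix sum turns the sizes into non-overlapping write offsets, and a second pass writes the results without locks or atomic operations.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustrative sketch of a lock-free, two-pass output scheme like the one Mars uses.&lt;br /&gt;
def two_pass_emit(per_thread_results):&lt;br /&gt;
    # Pass 1 (the *_COUNT functions): each thread reports its output size.&lt;br /&gt;
    sizes = [len(results) for results in per_thread_results]&lt;br /&gt;
&lt;br /&gt;
    # A prefix sum gives every thread a private, non-overlapping write offset.&lt;br /&gt;
    offsets, running = [], 0&lt;br /&gt;
    for size in sizes:&lt;br /&gt;
        offsets.append(running)&lt;br /&gt;
        running += size&lt;br /&gt;
&lt;br /&gt;
    # Pass 2 (MAP/REDUCE with the EMIT functions): write into one preallocated&lt;br /&gt;
    # array; no locks are needed because the regions are disjoint.&lt;br /&gt;
    output = [None] * running&lt;br /&gt;
    for tid, results in enumerate(per_thread_results):&lt;br /&gt;
        for i, record in enumerate(results):&lt;br /&gt;
            output[offsets[tid] + i] = record&lt;br /&gt;
    return output&lt;br /&gt;
&lt;br /&gt;
# usage sketch&lt;br /&gt;
print(two_pass_emit([[('a', 1)], [('b', 1), ('c', 1)], []]))&lt;br /&gt;
# [('a', 1), ('b', 1), ('c', 1)]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;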
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or an HTTP page. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to have the mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it's better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
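&lt;br /&gt;
A runnable, single-process Python rendering of the same mapper/combiner/reducer pattern may help make the combiner's role concrete; this is an illustrative sketch, not a framework API.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustrative single-process version of the word-count pattern above.&lt;br /&gt;
from collections import Counter&lt;br /&gt;
&lt;br /&gt;
def map_doc(doc):                     # Mapper: emit (term, 1) per term&lt;br /&gt;
    return [(term, 1) for term in doc.split()]&lt;br /&gt;
&lt;br /&gt;
def combine(pairs):                   # Combiner: pre-sum counts per mapper&lt;br /&gt;
    summed = Counter()&lt;br /&gt;
    for term, count in pairs:&lt;br /&gt;
        summed[term] += count&lt;br /&gt;
    return list(summed.items())&lt;br /&gt;
&lt;br /&gt;
def reduce_all(combined_outputs):     # Reducer: sum the partial counts&lt;br /&gt;
    totals = Counter()&lt;br /&gt;
    for pairs in combined_outputs:&lt;br /&gt;
        for term, count in pairs:&lt;br /&gt;
            totals[term] += count&lt;br /&gt;
    return dict(totals)&lt;br /&gt;
&lt;br /&gt;
docs = ['to be or not to be', 'to map is to reduce']&lt;br /&gt;
print(reduce_all([combine(map_doc(d)) for d in docs]))&lt;br /&gt;
# {'to': 4, 'be': 2, 'or': 1, 'not': 1, 'map': 1, 'is': 1, 'reduce': 1}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;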
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts and then combined together for a final result is a standard Map-Reduce problem. The problem is split into a set of specifications and specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation and then emits the results.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is calculating a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors. Each neighbor then updates its state based on the messages it received. The iteration is terminated by some condition, such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting a message for each node, using the adjacent node's ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with the new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definitions of the state object and the calculateState and getMessage functions, several other use cases can be addressed with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions. A hedged sketch of this example follows the list.&lt;br /&gt;
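&lt;br /&gt;
The inverted-index example, as a single-process illustrative sketch in Python (not a framework API):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustrative single-process inverted index following the description above.&lt;br /&gt;
def map_doc(doc_id, text):&lt;br /&gt;
    return [(word, doc_id) for word in text.split()]&lt;br /&gt;
&lt;br /&gt;
def reduce_word(word, doc_ids):&lt;br /&gt;
    return (word, sorted(set(doc_ids)))&lt;br /&gt;
&lt;br /&gt;
def inverted_index(docs):&lt;br /&gt;
    grouped = {}&lt;br /&gt;
    for doc_id, text in docs.items():&lt;br /&gt;
        for word, did in map_doc(doc_id, text):&lt;br /&gt;
            grouped.setdefault(word, []).append(did)&lt;br /&gt;
    return dict(reduce_word(w, ids) for w, ids in grouped.items())&lt;br /&gt;
&lt;br /&gt;
print(inverted_index({1: 'map and reduce', 2: 'map the index'}))&lt;br /&gt;
# {'map': [1, 2], 'and': [1], 'reduce': [1], 'the': [2], 'index': [2]}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;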
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors. Phoenix automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written with the P-threads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which P-threads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort. The difficulty is even greater for complex and performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces. The GPU runtime is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that this technology is similar to many others that already exist. There are programming models similar to MapReduce, such as Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level parallel programming model for parallel and distributed computing; their framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers. Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users, over 180 analytic functions, and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93984</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93984"/>
		<updated>2015-02-17T02:10:38Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Apache’s Hadoop MapReduce */ clear up data locality.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Clustered Systems ===&lt;br /&gt;
The de facto standard for using MapReduce is in a clustered environment of many separate machines. The purpose of MapReduce is to transform a large set of data into another large set of data and possibly reduce the output. The cost of clustered environments is the latency of communication. This leaves clustered environments best suited for tasks where immediate feedback isn't necessary. Log analysis, data transformation, and other such problems are solved using the clustered-environment implementations.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: the communication between the MapReduce nodes is a significant overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient. Problem sets that are expressed in key-value pairs best fit the shared-memory model.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited to a MapReduce application on distributed memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) are loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data are compared to each node, with the winning node being the one that most closely matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed memory machine, because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so the memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
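&lt;br /&gt;
As a hedged illustration of the mapping step described above (finding the winning node for an input vector), here is a small Python sketch; the distance metric and data are hypothetical, and in a MapReduce formulation each map task would search one region of the SOM while the reduce step picks the global winner.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustrative best-matching-unit search for the SOM mapping step.&lt;br /&gt;
def squared_distance(a, b):&lt;br /&gt;
    return sum((x - y) ** 2 for x, y in zip(a, b))&lt;br /&gt;
&lt;br /&gt;
def best_matching_unit(som_nodes, input_vector):&lt;br /&gt;
    # som_nodes: list of (node_id, weight_vector); returns the winning node id&lt;br /&gt;
    return min(som_nodes, key=lambda node: squared_distance(node[1], input_vector))[0]&lt;br /&gt;
&lt;br /&gt;
nodes = [('n0', [0.0, 0.0]), ('n1', [1.0, 1.0])]&lt;br /&gt;
print(best_matching_unit(nodes, [0.9, 0.8]))   # 'n1'&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;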
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
# Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
# [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
# MapReduce-MPI and KMR implement MapReduce for distributed memory systems.&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# The implementation of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model constrains how applications can be expressed. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality: the network is slow and the data so large that it would take significantly longer to transfer the data over the network to a centralized processor than to bring the computation to the location of the data.  In some cases the data is so large that this is the only practical processing option.  Data is stored in Hadoop in the filesystem called HDFS.  MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
# Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
# The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
# The JobTracker chooses appropriate tasks based on how busy each TaskTracker is. &lt;br /&gt;
# TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
# The buffer is eventually flushed into two files. &lt;br /&gt;
# After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
# When the map phase is done, the JobTracker notifies the TaskTrackers to start the reduce phase; reduce tasks are forked in the same way. &lt;br /&gt;
# The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 implementation.  Notably, heavy processing&lt;br /&gt;
load could make the JobTracker a large bottleneck.  In order to remove this bottleneck, YARN was implemented.  YARN is an application framework that solely performs&lt;br /&gt;
resource management for Hadoop clusters.  Not only can you run MapReduce jobs, but you can also place other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing you to properly allocate resources across your cluster.  YARN at its simplest is the separation of the work that the JobTracker would do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change their imports; however, the execution of a job changes significantly. YARN does its work in units called containers, which represent a slice of cluster resources in which work can run. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster and asks a NodeManager to launch the ApplicationMaster in that container, on a DataNode in the cluster. The ApplicationMaster then determines, based on the input splits, how many map tasks to create, and requests container resources for them from the ResourceManager. Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster then asks the NodeManagers on the assigned nodes to start them.&lt;br /&gt;
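&lt;br /&gt;
Whether a job runs under classic MRV1 or under YARN is determined by cluster configuration rather than by application code. A minimal driver sketch for the word-count classes shown earlier, written against the Hadoop 2.x Java API (input and output paths are taken from the command line, and all names are illustrative), might look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import org.apache.hadoop.conf.Configuration;&lt;br /&gt;
import org.apache.hadoop.fs.Path;&lt;br /&gt;
import org.apache.hadoop.io.IntWritable;&lt;br /&gt;
import org.apache.hadoop.io.Text;&lt;br /&gt;
import org.apache.hadoop.mapreduce.Job;&lt;br /&gt;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;&lt;br /&gt;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;&lt;br /&gt;
&lt;br /&gt;
public class WordCountDriver {&lt;br /&gt;
  public static void main(String[] args) throws Exception {&lt;br /&gt;
    Configuration conf = new Configuration();            // picks up the cluster's MRV1/YARN settings&lt;br /&gt;
    Job job = Job.getInstance(conf, &amp;quot;word count&amp;quot;);&lt;br /&gt;
    job.setJarByClass(WordCountDriver.class);&lt;br /&gt;
    job.setMapperClass(WordCount.TokenizerMapper.class);&lt;br /&gt;
    job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local pre-aggregation&lt;br /&gt;
    job.setReducerClass(WordCount.IntSumReducer.class);&lt;br /&gt;
    job.setOutputKeyClass(Text.class);&lt;br /&gt;
    job.setOutputValueClass(IntWritable.class);&lt;br /&gt;
    FileInputFormat.addInputPath(job, new Path(args[0]));&lt;br /&gt;
    FileOutputFormat.setOutputPath(job, new Path(args[1]));&lt;br /&gt;
    System.exit(job.waitForCompletion(true) ? 0 : 1);&lt;br /&gt;
  }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;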
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until the code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. This allows data to be read into memory across the cluster once, with each iteration of an algorithm running over the same in-memory data instead of re-reading it from disk.&lt;br /&gt;
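&lt;br /&gt;
As a rough illustration of that style (a sketch assuming the Spark 2.x Java API, with illustrative names and paths), the program below caches the tokenized input in memory and then runs two separate passes over it without re-reading the files from disk:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import java.util.Arrays;&lt;br /&gt;
import org.apache.spark.SparkConf;&lt;br /&gt;
import org.apache.spark.api.java.JavaPairRDD;&lt;br /&gt;
import org.apache.spark.api.java.JavaRDD;&lt;br /&gt;
import org.apache.spark.api.java.JavaSparkContext;&lt;br /&gt;
import scala.Tuple2;&lt;br /&gt;
&lt;br /&gt;
public class SparkWordCount {&lt;br /&gt;
  public static void main(String[] args) {&lt;br /&gt;
    SparkConf conf = new SparkConf().setAppName(&amp;quot;SparkWordCount&amp;quot;);&lt;br /&gt;
    JavaSparkContext sc = new JavaSparkContext(conf);&lt;br /&gt;
&lt;br /&gt;
    // Load the input once and keep the tokenized data in cluster memory.&lt;br /&gt;
    JavaRDD&amp;lt;String&amp;gt; words = sc.textFile(args[0])&lt;br /&gt;
        .flatMap(line -&amp;gt; Arrays.asList(line.split(&amp;quot; &amp;quot;)).iterator())&lt;br /&gt;
        .cache();&lt;br /&gt;
&lt;br /&gt;
    // First pass over the cached data: the classic word count.&lt;br /&gt;
    JavaPairRDD&amp;lt;String, Integer&amp;gt; counts = words&lt;br /&gt;
        .mapToPair(w -&amp;gt; new Tuple2&amp;lt;&amp;gt;(w, 1))&lt;br /&gt;
        .reduceByKey((a, b) -&amp;gt; a + b);&lt;br /&gt;
    counts.saveAsTextFile(args[1]);&lt;br /&gt;
&lt;br /&gt;
    // Second pass reuses the in-memory RDD instead of re-reading from disk.&lt;br /&gt;
    long distinct = words.distinct().count();&lt;br /&gt;
    System.out.println(&amp;quot;distinct words: &amp;quot; + distinct);&lt;br /&gt;
&lt;br /&gt;
    sc.stop();&lt;br /&gt;
  }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;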
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a DAG (directed-acyclic-graph) engine: based on the Microsoft Dryad paper, the DAG execution engine allows applications to express each task as a node in a graph. Like Spark, it gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared-memory system can show a significant increase in speed over cluster/disk-based systems because there is little to no I/O overhead. However, a few challenges present themselves in the shared-memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Intermediate output is stored in memory, requiring a large amount of memory for large problem sets.&lt;br /&gt;
* The ratio of total key-value pairs to the number of distinct pairs strongly affects performance.&lt;br /&gt;
* The execution time of the reduce phase is affected by task-queue overhead.&lt;br /&gt;
* The size and shape of the data structure used to store the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, consisting of two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split them across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs; each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the emit-intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix achieves good speedup and scales well across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory-allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Inefficient key-value storage: because everything lives in shared memory, containers must provide fast lookup and retrieval over a potentially large data set, all while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiner: on SMP machines, memory-allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory-allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory-access penalties.&lt;br /&gt;
# Exposed task chunking: Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for handling chunks is pushed into user code, the map function becomes more complicated. Second, if the user leverages the exposed chunk size to improve performance, the framework can no longer freely adjust it, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
===Challenges===&lt;br /&gt;
&lt;br /&gt;
#Susceptible to network outages.&lt;br /&gt;
#Node failure has to be handled and work rescheduled.&lt;br /&gt;
#There has to be a system that knows of all the workers and where they are.&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI. &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt;  Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++.  The major downfall of this implementation is a lack of fault tolerance: the implementation's MPI library does not detect machines that are no longer part of the cluster very well.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate()                                    // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the callback functions process keys and values just as in standard MapReduce implementations.  The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language the developers built called OINK.&lt;br /&gt;
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI.  KMR is more robust than MR-MPI, at the cost of being slightly more complex to build a MapReduce application with.  There isn't much that is very distinct about this implementation: it lets you assign functions for the map, for the shuffle, and for the reduce phases.&lt;br /&gt;
&lt;br /&gt;
'''Examples'''&lt;br /&gt;
&lt;br /&gt;
See [http://mt.aics.riken.jp/kmr/docs/kmr-1.5/html/index.html#overview KMR Overview]&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are the following three technical challenges in implementing the MapReduce framework on the GPU: &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid any conflict between concurrent writes, Mars uses a lock-free scheme with low runtime overhead on top of the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files or web pages, for example. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to have the mapper count the terms within its own document first.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts can be accumulated across more than one document handled by the same node.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, with the partial results then combined into a final result, is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the corresponding computation, and emits its result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors; this state can be the distance to other nodes, a density characteristic, and so on. Conceptually, the MapReduce jobs are performed iteratively: on each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration terminates on some condition, such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting a message for each neighbor of a node, using the adjacent node ID as the key. The Reducer is responsible for recomputing the state and re-emitting the node with its new state, based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be addressed with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
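&lt;br /&gt;
As a minimal sketch of the last example (the class names are illustrative, and it assumes the default TextInputFormat so that the input split can be cast to a FileSplit to recover the document name), an inverted index in Hadoop's Java API might look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import java.io.IOException;&lt;br /&gt;
import java.util.StringTokenizer;&lt;br /&gt;
import java.util.TreeSet;&lt;br /&gt;
import org.apache.hadoop.io.LongWritable;&lt;br /&gt;
import org.apache.hadoop.io.Text;&lt;br /&gt;
import org.apache.hadoop.mapreduce.Mapper;&lt;br /&gt;
import org.apache.hadoop.mapreduce.Reducer;&lt;br /&gt;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;&lt;br /&gt;
&lt;br /&gt;
public class InvertedIndex {&lt;br /&gt;
  // Emits &amp;lt;word, document name&amp;gt; for every word in the input split.&lt;br /&gt;
  public static class IndexMapper extends Mapper&amp;lt;LongWritable, Text, Text, Text&amp;gt; {&lt;br /&gt;
    private final Text word = new Text();&lt;br /&gt;
    private final Text docId = new Text();&lt;br /&gt;
&lt;br /&gt;
    @Override&lt;br /&gt;
    protected void map(LongWritable key, Text value, Context context)&lt;br /&gt;
        throws IOException, InterruptedException {&lt;br /&gt;
      docId.set(((FileSplit) context.getInputSplit()).getPath().getName());&lt;br /&gt;
      StringTokenizer itr = new StringTokenizer(value.toString());&lt;br /&gt;
      while (itr.hasMoreTokens()) {&lt;br /&gt;
        word.set(itr.nextToken());&lt;br /&gt;
        context.write(word, docId);&lt;br /&gt;
      }&lt;br /&gt;
    }&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
  // Emits &amp;lt;word, sorted list of distinct document names&amp;gt;.&lt;br /&gt;
  public static class IndexReducer extends Reducer&amp;lt;Text, Text, Text, Text&amp;gt; {&lt;br /&gt;
    @Override&lt;br /&gt;
    protected void reduce(Text key, Iterable&amp;lt;Text&amp;gt; values, Context context)&lt;br /&gt;
        throws IOException, InterruptedException {&lt;br /&gt;
      TreeSet&amp;lt;String&amp;gt; docs = new TreeSet&amp;lt;&amp;gt;();&lt;br /&gt;
      for (Text v : values) {&lt;br /&gt;
        docs.add(v.toString());&lt;br /&gt;
      }&lt;br /&gt;
      context.write(key, new Text(String.join(&amp;quot;,&amp;quot;, docs)));&lt;br /&gt;
    }&lt;br /&gt;
  }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;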
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance on both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written directly with the P-threads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which P-threads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With a GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, while the GPU runtime is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that this technology is similar to many others that already existed. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level programming model for parallel and distributed computing, and skeleton frameworks and libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing of data stored in Sector, and Sector/Sphere is notable for its ability to operate in a wide-area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users (with over 180 analytic functions), and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93983</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93983"/>
		<updated>2015-02-17T02:05:42Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: it's vs. its&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in Input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Clustered Systems ===&lt;br /&gt;
The de facto standard environment for MapReduce is a cluster of many separate machines. The purpose of MapReduce is to transform a large set of data into another large set of data and possibly reduce the output. The cost of clustered environments is the latency of communication, which leaves them best suited to tasks where immediate feedback isn't necessary. Log analysis, data transformation, and other problems of that kind are solved using the clustered-environment implementations.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: the communication between the MapReduce nodes is a significant overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset fits into memory, running a fully distributed MapReduce cluster like Hadoop is inefficient. Problem sets that are expressed as key-value pairs fit best into the shared-memory model.&lt;br /&gt;
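&lt;br /&gt;
As a rough in-memory analogue (this is plain Java 8, not Phoenix or any of the shared-memory implementations discussed later on this page), a shared-memory word count can be expressed with parallel streams, which perform the map-style tokenization and the reduce-style grouping entirely within one machine's memory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import java.util.Arrays;&lt;br /&gt;
import java.util.List;&lt;br /&gt;
import java.util.Map;&lt;br /&gt;
import java.util.stream.Collectors;&lt;br /&gt;
&lt;br /&gt;
public class InMemoryWordCount {&lt;br /&gt;
  public static void main(String[] args) {&lt;br /&gt;
    List&amp;lt;String&amp;gt; lines = Arrays.asList(&amp;quot;to be or not to be&amp;quot;, &amp;quot;that is the question&amp;quot;);&lt;br /&gt;
&lt;br /&gt;
    // &amp;quot;Map&amp;quot; each line to words, then &amp;quot;reduce&amp;quot; by grouping and counting,&lt;br /&gt;
    // all inside the shared memory of a single machine.&lt;br /&gt;
    Map&amp;lt;String, Long&amp;gt; counts = lines.parallelStream()&lt;br /&gt;
        .flatMap(line -&amp;gt; Arrays.stream(line.split(&amp;quot; &amp;quot;)))&lt;br /&gt;
        .collect(Collectors.groupingByConcurrent(w -&amp;gt; w, Collectors.counting()));&lt;br /&gt;
&lt;br /&gt;
    counts.forEach((word, count) -&amp;gt; System.out.println(word + &amp;quot; &amp;quot; + count));&lt;br /&gt;
  }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;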
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited to a MapReduce application on distributed-memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) is loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data is compared to each node, with the winning node being the one that most closely matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed-memory machine, because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so the memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
# Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
# [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
# MapReduce-MPI and KMR implement MapReduce for distributed-memory systems.&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and ''R'' reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the mapreduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file . They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# An implementation of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on the way you can implement applications within the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same model. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications, and Hadoop has prominent users such as Yahoo! and Facebook.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality: the network is slow and the data is plentiful, so while many processing frameworks bring the data to the computation, Hadoop brings the computation to the data. In some cases the data is so large that this is the only practical processing option. Data in Hadoop is stored in a filesystem called HDFS, and MapReduce provides the framework for processing the data where it resides.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
# The client program uploads its files to a Hadoop Distributed File System (HDFS) location and notifies the JobTracker, which in turn returns a Job ID to the client. &lt;br /&gt;
# The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
# The JobTracker decides how much work to assign to each TaskTracker based on how busy it is. &lt;br /&gt;
# A TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function; the map output fills an in-memory buffer with key/value pairs until it is full. &lt;br /&gt;
# The buffer is eventually flushed into two files. &lt;br /&gt;
# After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
# When the map phase is done, the JobTracker notifies the TaskTrackers to move on to the reduce phase; a ReduceTask is forked in the same way. &lt;br /&gt;
# The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 implementation: notably, heavy processing load would make the JobTracker a large bottleneck. To help remove this bottleneck, YARN was introduced. YARN is a framework that does only resource management for Hadoop clusters, so not only can you run MapReduce jobs, you can also place other in-cluster frameworks under the same YARN resource management, which lets you allocate resources across the cluster appropriately. At its simplest, YARN is the separation of the work the JobTracker used to do into two new processes: the resource manager (ResourceManager) and the per-job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change their imports; however, the execution of a job changes significantly. YARN does its work in units called containers, which represent a slice of cluster resources in which work can run. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster and asks a NodeManager to launch the ApplicationMaster in that container, on a DataNode in the cluster. The ApplicationMaster then determines, based on the input splits, how many map tasks to create, and requests container resources for them from the ResourceManager. Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster then asks the NodeManagers on the assigned nodes to start them.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until the code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. This allows data to be read into memory across the cluster once, with each iteration of an algorithm running over the same in-memory data instead of re-reading it from disk.&lt;br /&gt;
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a DAG (directed-acyclic-graph) engine: based on the Microsoft Dryad paper, the DAG execution engine allows applications to express each task as a node in a graph. Like Spark, it gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared-memory system can show a significant increase in speed over cluster/disk-based systems because there is little to no I/O overhead. However, a few challenges present themselves in the shared-memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Intermediate output is stored in memory, requiring a large amount of memory for large problem sets.&lt;br /&gt;
* The ratio of total key-value pairs to the number of distinct pairs strongly affects performance.&lt;br /&gt;
* The execution time of the reduce phase is affected by task-queue overhead.&lt;br /&gt;
* The size and shape of the data structure used to store the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, consisting of two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, ultimately it is the user's responsibility to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
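&lt;br /&gt;
To make this flow concrete, below is a minimal, single-process C++ sketch of the same stages for a word count: a splitter hands each Map task a chunk of input, Map emits intermediate &amp;lt;key,value&amp;gt; pairs that are grouped by key (playing the role of the Partition function), and Reduce sums the values for each key. This is only a conceptual stand-in written for this page; it is not Phoenix code, whose actual API is described in the linked header and paper.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Minimal single-process stand-in for the Map/Partition/Reduce flow described&lt;br /&gt;
// above (word count).  This is NOT Phoenix code; all names here are ours.&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;map&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    // &amp;quot;Splitter&amp;quot;: one chunk of the input per Map task.&lt;br /&gt;
    std::vector&amp;lt;std::string&amp;gt; chunks = {&amp;quot;the quick brown fox&amp;quot;, &amp;quot;the lazy dog&amp;quot;, &amp;quot;the fox&amp;quot;};&lt;br /&gt;
&lt;br /&gt;
    // Map stage: each task emits &amp;lt;word, 1&amp;gt;; grouping values for the same key&lt;br /&gt;
    // in one buffer plays the role of the Partition function.&lt;br /&gt;
    std::map&amp;lt;std::string, std::vector&amp;lt;int&amp;gt;&amp;gt; intermediate;&lt;br /&gt;
    for (const std::string &amp;amp;chunk : chunks) {&lt;br /&gt;
        std::istringstream words(chunk);&lt;br /&gt;
        std::string w;&lt;br /&gt;
        while (words &amp;gt;&amp;gt; w)&lt;br /&gt;
            intermediate[w].push_back(1);              // emit_intermediate(w, 1)&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
    // Reduce stage: all values for one key are handled by one task; std::map&lt;br /&gt;
    // keeps the merged output sorted by key.&lt;br /&gt;
    for (const auto &amp;amp;entry : intermediate) {&lt;br /&gt;
        int sum = 0;&lt;br /&gt;
        for (int v : entry.second) sum += v;&lt;br /&gt;
        std::cout &amp;lt;&amp;lt; entry.first &amp;lt;&amp;lt; &amp;quot; &amp;quot; &amp;lt;&amp;lt; sum &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;; // emit(key, sum)&lt;br /&gt;
    }&lt;br /&gt;
    return 0;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;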
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointers are manipulated instead of copying the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Because of the shared memory, key-value storage is inefficient: containers must provide fast lookup and retrieval over a potentially large data set, all the while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
===Challenges===&lt;br /&gt;
&lt;br /&gt;
#Susceptible to network outages.&lt;br /&gt;
#Node failure has to be handled and work rescheduled.&lt;br /&gt;
#There has to be a system that knows of all the workers and where they are.&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI. &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt;  Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++.  The major downfall of this implementation is a lack of fault tolerance: the underlying MPI library does not detect machines that are no longer part of the cluster very well.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the functions process keys and values as in standard MapReduce implementations.  The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language they've built called OINK.&lt;br /&gt;
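&lt;br /&gt;
For completeness, here is a hedged sketch of what the mymap and myreduce callbacks referenced in the example above might look like for a word count. The signatures below reflect our reading of the MR-MPI documentation and should be treated as approximations; consult the linked MR-MPI pages for the authoritative interfaces.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Sketch of user callbacks for the MR-MPI example above (word count).&lt;br /&gt;
// Signatures are our best reading of the MR-MPI docs; verify before use.&lt;br /&gt;
#include &amp;quot;mapreduce.h&amp;quot;   // headers shipped with the MR-MPI library&lt;br /&gt;
#include &amp;quot;keyvalue.h&amp;quot;&lt;br /&gt;
&lt;br /&gt;
// Called once per map task; itask identifies which input file to read.&lt;br /&gt;
void mymap(int itask, MAPREDUCE_NS::KeyValue *kv, void *ptr)&lt;br /&gt;
{&lt;br /&gt;
    // ... open and parse file number itask, then for every word found:&lt;br /&gt;
    // kv-&amp;gt;add(word, strlen(word) + 1, NULL, 0);   // key = word, empty value&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Called once per unique key after collate(); nvalues is how many values&lt;br /&gt;
// were collected for this key.&lt;br /&gt;
void myreduce(char *key, int keybytes, char *multivalue,&lt;br /&gt;
              int nvalues, int *valuebytes, MAPREDUCE_NS::KeyValue *kv, void *ptr)&lt;br /&gt;
{&lt;br /&gt;
    int count = nvalues;                             // occurrences of the word&lt;br /&gt;
    kv-&amp;gt;add(key, keybytes, (char *) &amp;amp;count, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;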
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI.  KMR is more robust than MR-MPI, at the cost of being slightly more complex to build your MapReduce application with.  There isn't much that's very distinct about this implementation: it provides the ability to assign functions for the mapping, shuffle, and reduction steps.&lt;br /&gt;
&lt;br /&gt;
'''Examples'''&lt;br /&gt;
&lt;br /&gt;
See [http://mt.aics.riken.jp/kmr/docs/kmr-1.5/html/index.html#overview KMR Overview]&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to the architectural differences, there are the following three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid any conflict between concurrent writes, Mars has a lock-free scheme with low runtime overhead on the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
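&lt;br /&gt;
To illustrate this two-step discipline, the sketch below pairs MAP_COUNT with MAP for a simple case in which each input record's value is a single word and the map emits a &amp;lt;word, counter&amp;gt; pair. Only the API names listed above come from Mars; the function bodies are our own illustration, and in the real framework this code executes on the GPU.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Two-step output: first report sizes (MAP_COUNT), then write the pair (MAP).&lt;br /&gt;
// The bodies below are illustrative; in Mars these run as GPU code.&lt;br /&gt;
&lt;br /&gt;
// Pass 1: declare how much space the intermediate pair will occupy.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize)&lt;br /&gt;
{&lt;br /&gt;
    // We will emit &amp;lt;word, int&amp;gt;; the word is this record's value.&lt;br /&gt;
    EMIT_INTERMEDIATE_COUNT(valSize, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Pass 2: write the pair into the buffer pre-allocated from those counts.&lt;br /&gt;
void MAP(void *key, void *val, int keySize, int valSize)&lt;br /&gt;
{&lt;br /&gt;
    int one = 1;&lt;br /&gt;
    EMIT_INTERMEDIATE(val, &amp;amp;one, valSize, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;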
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or a web page. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms within its own document first.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it's better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts and then combined for a final result is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and then emits the result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance between nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors. Each neighbor then updates its state based on the messages it received. The iteration is terminated based on some condition, such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting messages for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing the node's state and rewriting the node with the new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be handled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
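&lt;br /&gt;
As a concrete illustration of the last example, the following is a toy, single-process C++ version of the inverted index: the map step emits &amp;lt;word, document ID&amp;gt; pairs and the reduce step collects the sorted set of document IDs for each word. It is framework-independent sketch code, not tied to any of the implementations discussed above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Toy single-process inverted index: map emits &amp;lt;word, doc ID&amp;gt;, reduce&lt;br /&gt;
// gathers the sorted document IDs for each word.&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;map&amp;gt;&lt;br /&gt;
#include &amp;lt;set&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    std::vector&amp;lt;std::string&amp;gt; docs = {&amp;quot;to be or not to be&amp;quot;, &amp;quot;to do is to be&amp;quot;};&lt;br /&gt;
&lt;br /&gt;
    std::map&amp;lt;std::string, std::set&amp;lt;int&amp;gt;&amp;gt; index;   // word -&amp;gt; sorted doc IDs&lt;br /&gt;
    for (int id = 0; id &amp;lt; (int)docs.size(); id++) {&lt;br /&gt;
        std::istringstream words(docs[id]);&lt;br /&gt;
        std::string w;&lt;br /&gt;
        while (words &amp;gt;&amp;gt; w)&lt;br /&gt;
            index[w].insert(id);                  // &amp;quot;emit&amp;quot; &amp;lt;word, doc ID&amp;gt;&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
    for (const auto &amp;amp;entry : index) {             // one &amp;quot;reduce&amp;quot; per word&lt;br /&gt;
        std::cout &amp;lt;&amp;lt; entry.first &amp;lt;&amp;lt; &amp;quot;:&amp;quot;;&lt;br /&gt;
        for (int id : entry.second) std::cout &amp;lt;&amp;lt; &amp;quot; &amp;quot; &amp;lt;&amp;lt; id;&lt;br /&gt;
        std::cout &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;&lt;br /&gt;
    }&lt;br /&gt;
    return 0;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;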
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer can provide a simple, functional expression of the algorithm and leave parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors. Phoenix automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written directly with the P-threads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which P-threads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort. This difficulty is even greater for complex and performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces. The runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has also drawn criticism. Google was awarded the patent for MapReduce, but it can be argued that this technology is similar to many already existing ones. There are programming models that are similar to MapReduce, such as Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level parallel programming model for parallel and distributed computing; these framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers. Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop and includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93732</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93732"/>
		<updated>2015-02-14T02:21:20Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Challenges */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Clustered Systems ===&lt;br /&gt;
The de facto standard for using MapReduce is in a clustered environment of many separate machines. The purpose of MapReduce is to transform a large set of data into another large set of data and possibly reduce the output. The cost of clustered environments is the latency of communication. This leaves clustered environments best suited for tasks where immediate feedback isn't necessary. Log analysis, data transformation and other such problems are solved using the clustered environment implementations.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: the communication between the MapReduce nodes is a significant overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient. Problem sets that are expressed in key-value pairs best fit the shared-memory model.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One problem particularly suited to a MapReduce application on distributed memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. The learning phase is where data (vectors) are loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data are compared to each node, with the winning node being the one that best matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed memory machine, because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
# Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
# [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
# MapReduce-MPI and KMR implement MapReduce for distributed memory systems.&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the mapreduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
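&lt;br /&gt;
As a small illustration of the partitioning function mentioned above (hash(key) mod R), the following C++ sketch assigns an intermediate key to one of the ''R'' reduce regions; std::hash stands in here for whatever hash function a real implementation would use.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Choose which of the R reduce regions an intermediate key belongs to,&lt;br /&gt;
// using the hash(key) mod R rule described above.  std::hash is only a&lt;br /&gt;
// stand-in for the implementation's real hash function.&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;functional&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
&lt;br /&gt;
int partition_for(const std::string &amp;amp;key, int R) {&lt;br /&gt;
    return (int)(std::hash&amp;lt;std::string&amp;gt;()(key) % (std::size_t)R);&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;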
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
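&lt;br /&gt;
Below is a minimal sketch of the bookkeeping just described, with hypothetical names (the real data structures are internal to Google's implementation): per-task state, the assigned worker, and, for completed map tasks, the locations and sizes of the ''R'' intermediate file regions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Hypothetical sketch of the master's per-task bookkeeping described above.&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
enum class TaskState { Idle, InProgress, Completed };&lt;br /&gt;
&lt;br /&gt;
struct MapTaskInfo {&lt;br /&gt;
    TaskState state = TaskState::Idle;&lt;br /&gt;
    std::string worker;                  // identity of the assigned worker&lt;br /&gt;
    // Filled in when the task completes: location and size of each of the&lt;br /&gt;
    // R intermediate file regions, pushed incrementally to reduce workers.&lt;br /&gt;
    std::vector&amp;lt;std::string&amp;gt; region_locations;&lt;br /&gt;
    std::vector&amp;lt;std::size_t&amp;gt; region_sizes;&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
struct ReduceTaskInfo {&lt;br /&gt;
    TaskState state = TaskState::Idle;&lt;br /&gt;
    std::string worker;&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;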
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# The Map-Reduce implementation scales to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# Restricted programming model puts bounds on the way you implement the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published the papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same. The important thing to note here is that Apache made this framework open-source. This framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality.  The idea is that the network is slow and data is plentiful: many processing frameworks bring the data to the computation, while&lt;br /&gt;
Hadoop brings the computation to the data.  In some cases the data is so large that this is the only processing option.  Data is stored in Hadoop in the filesystem called HDFS.  MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* JobTracker determines appropriate jobs based on how busy the TaskTracker is. &lt;br /&gt;
* TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTasks complete (all splits are done), the TaskTracker will notify the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTracker to jump to the reduce phase. This again follows the same method, where a reduce task is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product did show some pain points in the MRV1 implementation.  Notably, heavy processing&lt;br /&gt;
load could cause the JobTracker to become a large bottleneck.  In order to help remove this bottleneck, YARN was implemented.  YARN is an application framework that solely does&lt;br /&gt;
resource management for Hadoop clusters.  Now not only can you run MapReduce jobs, but you can also put other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing you to properly allocate resources across your cluster.  YARN at its simplest is the separation of the work that the JobTracker would do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports.  However, the execution of the job changes significantly.  YARN does work in units called containers.&lt;br /&gt;
Containers represent a unit of work that can be done on a cluster.  Upon job submission, the ResourceManager allocates a container for the ApplicationMaster.  This ApplicationMaster &lt;br /&gt;
runs on a DataNode in the cluster.  To run it, the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container.  The ApplicationMaster then &lt;br /&gt;
determines, based on the input splits, the number of map tasks to create.  Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of data and available resources, the ResourceManager decides where to run the map tasks.  The ApplicationMaster then asks the NodeManagers on the assigned nodes to &lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a new execution framework that helps remove some of the inefficiencies and startup latency of MapReduce.  Spark takes greater advantage of available memory on the nodes in the cluster and will start job execution immediately, whereas MapReduce waits until code has been distributed to all the nodes.  Spark also adds a number of things to the framework, such as streaming data ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it.  This allows data to be read into memory on a cluster and iterations of an algorithm to be run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop.  Tez is a DAG (directed-acyclic-graph) engine.  Based on the Microsoft Dryad paper, the DAG execution engine allows applications to express their tasks as nodes in a graph.  Like Spark, it offers gains in execution speed and attempts to make more efficient use of available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop.  Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared memory system can show a significant increase in speed over cluster/disk-based systems because there is little to no I/O overhead. However, a few challenges present themselves in the shared memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
# Intermediate output is stored in memory, requiring a large amount of memory for large problem sets.&lt;br /&gt;
# The ratio of key-value pairs relative to the number of distinct pairs highly affects performance.&lt;br /&gt;
# Execution time of reduce phase is affected by task queue overhead.&lt;br /&gt;
# The size and shape of the data structure used to store the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, consisting of two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, ultimately it is the user's responsibility to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointers are manipulated instead of copying the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Because of the shared memory, key-value storage is inefficient: containers must provide fast lookup and retrieval over a potentially large data set, all the while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
===Challenges===&lt;br /&gt;
&lt;br /&gt;
#Susceptible to network outages.&lt;br /&gt;
#Node failure has to be handled and work rescheduled.&lt;br /&gt;
#There has to be a system that knows of all the workers and where they are.&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI. &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt;  Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++.  The major downfall of this implementation is a lack of fault tolerance: the implementation's MPI library does not reliably detect machines that are no longer part of the cluster.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the callback functions process keys and values as in standard MapReduce implementations.  The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language the authors built called OINK.&lt;br /&gt;
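&lt;br /&gt;
The mymap and myreduce callbacks referenced in the example above follow roughly the shapes sketched below. The exact signatures and the KeyValue interface should be verified against the MR-MPI documentation; this is an approximation, not a definitive listing.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Approximate callback shapes for MR-MPI (verify against the library headers).&lt;br /&gt;
// mymap is invoked once per map task and adds key/value pairs to kv,&lt;br /&gt;
// e.g. by reading the itask-th input file.&lt;br /&gt;
void mymap(int itask, KeyValue *kv, void *ptr) {&lt;br /&gt;
    // parse input and add pairs via the KeyValue add(key, keybytes, value, valuebytes) call&lt;br /&gt;
}&lt;br /&gt;
 &lt;br /&gt;
// myreduce receives one key together with all of its values and may emit new pairs.&lt;br /&gt;
void myreduce(char *key, int keybytes, char *multivalue, int nvalues,&lt;br /&gt;
              int *valuebytes, KeyValue *kv, void *ptr) {&lt;br /&gt;
    // combine the nvalues values stored back-to-back in multivalue, then emit the result&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;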
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI.  KMR is more robust than MR-MPI, at the cost of being slightly more complex to build a MapReduce application with.  There is not much that is very distinct about this implementation: it provides the ability to assign functions for the map step, for the shuffle, and for the reduce step.&lt;br /&gt;
&lt;br /&gt;
'''Examples'''&lt;br /&gt;
&lt;br /&gt;
See [http://mt.aics.riken.jp/kmr/docs/kmr-1.5/html/index.html#overview KMR Overview]&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to the architectural differences, the following three technical challenges arise in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme that keeps runtime overhead low despite the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
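&lt;br /&gt;
To illustrate this two-step design, a hedged sketch of the paired user functions for a word-count-style job is given below. The input layout (one word per map record) is a hypothetical assumption and CUDA-specific qualifiers are omitted; only the count-then-emit pairing reflects the Mars API listed above.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Illustrative sketch only: assume each map record carries one word in val.&lt;br /&gt;
// Step 1: MAP_COUNT declares the sizes the real output will occupy.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize)&lt;br /&gt;
{&lt;br /&gt;
    EMIT_INTERMEDIATE_COUNT(valSize, sizeof(int));    // key = the word, value = a count&lt;br /&gt;
}&lt;br /&gt;
 &lt;br /&gt;
// Step 2: MAP writes the actual intermediate pair into the pre-allocated space.&lt;br /&gt;
void MAP(void *key, void *val, int keySize, int valSize)&lt;br /&gt;
{&lt;br /&gt;
    int one = 1;&lt;br /&gt;
    EMIT_INTERMEDIATE(val, &amp;amp;one, valSize, sizeof(int));   // emits (word, 1)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;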
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or an HTTP page. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, with the partial results then combined into a final result, is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and emits the results.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is calculating a state for each node using the properties of its neighbors. This state can be the distance between nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration is terminated based on some condition, such as a fixed number of iterations or only minor changes in state between iterations. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with the new state, based on the messages received from its neighbors.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and the calculateState and getMessage functions, several other use cases can be fulfilled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index (see the sketch after this list). It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
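&lt;br /&gt;
As a concrete illustration of the inverted-index example above, here is a minimal sequential analogue (an assumption-laden sketch: in-memory documents and a single process stand in for the distributed Map and Reduce tasks).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;map&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
// Sequential analogue of the inverted-index job: Map emits (word, docID) pairs,&lt;br /&gt;
// Reduce groups the document IDs per word; std::map keeps the words sorted.&lt;br /&gt;
// (Duplicate IDs for repeated words could be removed in the reduce step.)&lt;br /&gt;
std::map&amp;lt;std::string, std::vector&amp;lt;int&amp;gt; &amp;gt; invertedIndex(const std::vector&amp;lt;std::string&amp;gt; &amp;amp;docs) {&lt;br /&gt;
    std::map&amp;lt;std::string, std::vector&amp;lt;int&amp;gt; &amp;gt; index;&lt;br /&gt;
    for (int docId = 0; docId &amp;lt; (int)docs.size(); ++docId) {&lt;br /&gt;
        std::istringstream words(docs[docId]);&lt;br /&gt;
        std::string word;&lt;br /&gt;
        while (words &amp;gt;&amp;gt; word) {&lt;br /&gt;
            index[word].push_back(docId);   // the grouping a reducer would perform&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    return index;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;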
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer can provide a simple, functional expression of the algorithm and leave parallelization and scheduling to the runtime system. Phoenix leads to scalable performance on both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written with the Pthreads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty grows for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce framework for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the GPU runtime is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many other, already existing ones. There are programming models similar to MapReduce, such as algorithmic skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic skeletons are a high-level programming model for parallel and distributed computing, and skeleton frameworks and libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93726</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93726"/>
		<updated>2015-02-14T02:15:11Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Map Reduce on Distributed Memory Machines */ adding challenges&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in Input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Clustered Systems ===&lt;br /&gt;
The de facto standard for using MapReduce is a clustered environment of many separate machines. The purpose of MapReduce is to transform a large set of data into another large set of data and possibly reduce the output. The cost of clustered environments is the latency of communication, which leaves them best suited for tasks where immediate feedback is not necessary. Log analysis, data transformation, and other such problems are solved using the clustered-environment implementations.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: communication between the MapReduce nodes is a significant overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient. Problem sets that are expressed in key-value pairs fit the shared-memory model best.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited to a MapReduce application on distributed memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography, and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) is loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data is compared to each node, with the winning node being the one that most closely matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed memory machine, because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
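&lt;br /&gt;
The partitioning function mentioned above (hash(key) mod R) is simple enough to sketch directly; the hash choice below is an illustrative assumption.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;functional&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
// Assign an intermediate key to one of R reduce partitions,&lt;br /&gt;
// mirroring the hash(key) mod R scheme described above.&lt;br /&gt;
std::size_t partitionForKey(const std::string &amp;amp;key, std::size_t R) {&lt;br /&gt;
    return std::hash&amp;lt;std::string&amp;gt;()(key) % R;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;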
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
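&lt;br /&gt;
A schematic of the bookkeeping described above is sketched below; the names and types are assumptions for illustration, not Google's actual implementation.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
// Illustrative only: per-task state kept by the master.&lt;br /&gt;
enum TaskState { IDLE, IN_PROGRESS, COMPLETED };&lt;br /&gt;
 &lt;br /&gt;
// Location and size of one of the R intermediate file regions a map task produced.&lt;br /&gt;
struct FileRegion {&lt;br /&gt;
    std::string location;&lt;br /&gt;
    std::size_t size;&lt;br /&gt;
};&lt;br /&gt;
 &lt;br /&gt;
// For each map task: its state, the worker it ran on (for non-idle tasks),&lt;br /&gt;
// and the R intermediate regions once the task completes.&lt;br /&gt;
struct MapTaskInfo {&lt;br /&gt;
    TaskState state;&lt;br /&gt;
    std::string worker;&lt;br /&gt;
    std::vector&amp;lt;FileRegion&amp;gt; regions;&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;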
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Large variety of problems are easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# The implementation of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on how computations can be expressed within the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same ideas. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality.  The idea is that the network is slow and data is plentiful: while many processing frameworks bring the data to the computation,&lt;br /&gt;
Hadoop brings the computation to the data.  In some cases the data is so large that this is the only practical option.  Data is stored in Hadoop in the filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* JobTracker determines appropriate jobs based on how busy the TaskTracker is. &lt;br /&gt;
* TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This follows the same method, where a ReduceTask is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product did show some pain points in the MRV1 implementation.  Notably, heavy processing&lt;br /&gt;
load would cause the JobTracker to become a large bottleneck.  In order to help remove this bottleneck, YARN was implemented.  YARN is an application framework that solely does&lt;br /&gt;
resource management for Hadoop clusters.  Now not only can you run MapReduce jobs, but you can also put other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing you to properly allocate resources across your cluster.  YARN at its simplest is the separation of the work that the JobTracker did into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports; however, the execution of the job changes significantly.  YARN does work in units called containers,&lt;br /&gt;
which represent a unit of work that can be done on the cluster.  Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a DataNode &lt;br /&gt;
in the cluster; the ResourceManager asks a NodeManager to launch the ApplicationMaster in that container.  The ApplicationMaster then &lt;br /&gt;
determines, based on the input splits, the number of map tasks to create.  Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of data and available resources, the ResourceManager decides where to run the map tasks.  The ApplicationMaster then asks the NodeManagers on the assigned nodes to  &lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce.  Spark takes greater advantage of the available memory on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until the code has been distributed to all the nodes.  Spark also adds a number of capabilities to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it.  This allows data to be read into memory on a cluster and iterations of an algorithm to run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop.  Tez is a DAG (directed acyclic graph) engine.  Based on the Microsoft Dryad paper, the DAG execution engine allows applications to express tasks as nodes in a graph.  Like Spark, it gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to make a more efficient computation engine that can sit on top of Apache Hadoop.  Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared memory system can show a significant increase in speed over cluster/disk-based systems due to little to no I/O overhead. However, a few challenges present themselves in the shared memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
# Intermediate output is stored in memory, requiring a large amount of memory for large problem sets.&lt;br /&gt;
# The ratio of key-value pairs relative to the number of distinct pairs highly affects performance.&lt;br /&gt;
# The execution time of the reduce phase is affected by task queue overhead.&lt;br /&gt;
# The size and shape of the data structure used to store the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++; the API consists of two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
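&lt;br /&gt;
A schematic of the kind of information ''scheduler_args_t'' carries is sketched below. The field names are illustrative assumptions, not the actual Phoenix header; see the linked MapReduceScheduler.h for the real definition.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Illustrative only: field names are assumptions, not Phoenix's real header.&lt;br /&gt;
typedef struct {&lt;br /&gt;
    void *input_data;                                  // pointer to the input buffer&lt;br /&gt;
    int   data_size;                                   // total input size in bytes&lt;br /&gt;
    int   unit_size;                                   // bytes handed to each Map task&lt;br /&gt;
    void (*map)(void *args);                           // user-defined Map function&lt;br /&gt;
    void (*reduce)(void *args);                        // user-defined Reduce function&lt;br /&gt;
    int  (*splitter)(void *data, int req, void *out);  // divides input for Map tasks&lt;br /&gt;
    int  (*key_cmp)(const void *a, const void *b);     // key comparison for sorting&lt;br /&gt;
    void *result;                                      // final, sorted output pairs&lt;br /&gt;
} scheduler_args_sketch_t;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;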
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), pointers are manipulated rather than the actual pairs, which may be large. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# In shared memory, key-value storage is inefficient: containers must provide fast lookup and retrieval over potentially large data sets, all the while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiner: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated because of the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
===Challenges===&lt;br /&gt;
&lt;br /&gt;
#Susceptible to network outages.&lt;br /&gt;
#Node failure has to be handled and work rescheduled.&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI. &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt;  Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++.  The major downfall of this implementation is a lack of fault tolerance: the implementation's MPI library does not reliably detect machines that are no longer part of the cluster.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the callback functions process keys and values as in standard MapReduce implementations.  The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language the authors built called OINK.&lt;br /&gt;
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI.  KMR is more robust than MR-MPI, at the cost of being slightly more complex to build a MapReduce application with.  There is not much that is very distinct about this implementation: it provides the ability to assign functions for the map step, for the shuffle, and for the reduce step.&lt;br /&gt;
&lt;br /&gt;
'''Examples'''&lt;br /&gt;
&lt;br /&gt;
See [http://mt.aics.riken.jp/kmr/docs/kmr-1.5/html/index.html#overview KMR Overview]&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to the architectural differences, the following three technical challenges arise in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme that keeps runtime overhead low despite the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
&lt;br /&gt;
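To make the two-step output design concrete, the following is a minimal Python sketch that emulates the idea on the CPU; it is not the actual CUDA implementation, and the record data and helper names are only illustrative. A first pass plays the role of MAP_COUNT and reports how much output each input record will produce, a prefix sum turns those sizes into non-overlapping write offsets, and a second pass plays the role of MAP and writes its results into a preallocated buffer without any locking.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Emulation of the count-then-emit scheme (illustrative only).&lt;br /&gt;
records = [b'one two', b'three', b'four five six']&lt;br /&gt;
 &lt;br /&gt;
def map_count(record):&lt;br /&gt;
    # Report how many bytes this record will emit.&lt;br /&gt;
    return sum(len(w) for w in record.split())&lt;br /&gt;
 &lt;br /&gt;
def map_emit(record, buf, offset):&lt;br /&gt;
    # Write results into a private, precomputed slice of the shared buffer.&lt;br /&gt;
    for w in record.split():&lt;br /&gt;
        buf[offset:offset + len(w)] = w&lt;br /&gt;
        offset += len(w)&lt;br /&gt;
 &lt;br /&gt;
sizes = [map_count(r) for r in records]&lt;br /&gt;
 &lt;br /&gt;
# Exclusive prefix sum: per-record sizes become per-record write offsets.&lt;br /&gt;
offsets, total = [], 0&lt;br /&gt;
for s in sizes:&lt;br /&gt;
    offsets.append(total)&lt;br /&gt;
    total += s&lt;br /&gt;
 &lt;br /&gt;
buf = bytearray(total)&lt;br /&gt;
for r, off in zip(records, offsets):&lt;br /&gt;
    map_emit(r, buf, off)&lt;br /&gt;
 &lt;br /&gt;
print(bytes(buf))   # b'onetwothreefourfivesix'&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Because every record knows its write offset in advance, no two writers ever touch the same bytes, which is the property the lock-free GPU scheme relies on in place of atomic operations.&lt;br /&gt;
&lt;br /&gt;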
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files, web pages, and so on. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit a count of &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts can be accumulated across more than one document before they reach the reducer.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
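&lt;br /&gt;
The three pseudocode variants above can be exercised locally. Below is a small, self-contained Python sketch that simulates the map, combine, shuffle, and reduce steps for the word count; the two-document input and the function names are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
 &lt;br /&gt;
docs = {1: 'a rose is a rose', 2: 'a map is a map is a map'}&lt;br /&gt;
 &lt;br /&gt;
def map_fn(doc):&lt;br /&gt;
    for term in doc.split():&lt;br /&gt;
        yield term, 1&lt;br /&gt;
 &lt;br /&gt;
def combine(pairs):&lt;br /&gt;
    # Partial sums per mapper, so fewer pairs reach the shuffle.&lt;br /&gt;
    sums = defaultdict(int)&lt;br /&gt;
    for term, count in pairs:&lt;br /&gt;
        sums[term] += count&lt;br /&gt;
    return sums.items()&lt;br /&gt;
 &lt;br /&gt;
def reduce_fn(term, counts):&lt;br /&gt;
    return term, sum(counts)&lt;br /&gt;
 &lt;br /&gt;
# Shuffle: group the combiner output by key, as the framework would.&lt;br /&gt;
grouped = defaultdict(list)&lt;br /&gt;
for docid, doc in docs.items():&lt;br /&gt;
    for term, count in combine(map_fn(doc)):&lt;br /&gt;
        grouped[term].append(count)&lt;br /&gt;
 &lt;br /&gt;
print(dict(reduce_fn(t, c) for t, c in grouped.items()))&lt;br /&gt;
# {'a': 5, 'rose': 2, 'is': 3, 'map': 3}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;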
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, with the partial results then combined into a final result, is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and emits its result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
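&lt;br /&gt;
As a concrete, if contrived, instance of this pattern, the sketch below estimates a large sum by splitting the index range into specifications, letting each mapper compute its partial result, and summing the partials in the reducer. The ranges and helper names are purely illustrative, and the mappers would run in parallel on a real cluster rather than in a list comprehension.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Each specification describes one independent slice of work.&lt;br /&gt;
specs = [(0, 250000), (250000, 500000), (500000, 750000), (750000, 1000000)]&lt;br /&gt;
 &lt;br /&gt;
def map_fn(spec):&lt;br /&gt;
    lo, hi = spec&lt;br /&gt;
    # The expensive computation: here, a partial sum of squares.&lt;br /&gt;
    return sum(i * i for i in range(lo, hi))&lt;br /&gt;
 &lt;br /&gt;
def reduce_fn(results):&lt;br /&gt;
    return sum(results)&lt;br /&gt;
 &lt;br /&gt;
partials = [map_fn(s) for s in specs]   # would run in parallel on a real cluster&lt;br /&gt;
print(reduce_fn(partials) == sum(i * i for i in range(1000000)))   # True&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;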
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a measure of density, and so on. Conceptually, MapReduce jobs are performed iteratively. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration terminates on some condition, such as a fixed number of iterations or negligible change in state. The Mapper is responsible for emitting a message for each node, using the adjacent node's ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with the new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be fulfilled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
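&lt;br /&gt;
The sketch below simulates the iterative map/reduce rounds for this breadth-first search on a small hand-made graph. The graph, the node states, and the helper names are illustrative; a real job would run each iteration as a separate MapReduce round over the stored node objects.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
INF = float('inf')&lt;br /&gt;
# Adjacency list and per-node state (distance from source node 'a').&lt;br /&gt;
graph = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}&lt;br /&gt;
state = {'a': 0, 'b': INF, 'c': INF, 'd': INF}&lt;br /&gt;
 &lt;br /&gt;
def map_fn(node, dist):&lt;br /&gt;
    yield node, dist                  # pass the node's own state through&lt;br /&gt;
    for m in graph[node]:&lt;br /&gt;
        yield m, dist + 1             # getMessage(N) = N.State + 1&lt;br /&gt;
 &lt;br /&gt;
def reduce_fn(node, values):&lt;br /&gt;
    return node, min(values)          # calculateState = min(...)&lt;br /&gt;
 &lt;br /&gt;
for _ in range(len(graph)):           # enough rounds for distances to settle&lt;br /&gt;
    grouped = {}&lt;br /&gt;
    for n, d in state.items():&lt;br /&gt;
        for key, value in map_fn(n, d):&lt;br /&gt;
            grouped.setdefault(key, []).append(value)&lt;br /&gt;
    state = dict(reduce_fn(n, vals) for n, vals in grouped.items())&lt;br /&gt;
 &lt;br /&gt;
print(state)   # {'a': 0, 'b': 1, 'c': 1, 'd': 2}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;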
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
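&lt;br /&gt;
As one worked example, the inverted index above can be prototyped in a few lines of Python; the two-document corpus and variable names are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
 &lt;br /&gt;
docs = {'doc1': 'to be or not to be', 'doc2': 'to map is to reduce'}&lt;br /&gt;
 &lt;br /&gt;
# Map: emit a (word, document ID) pair for every word occurrence.&lt;br /&gt;
pairs = [(word, docid) for docid, text in docs.items() for word in text.split()]&lt;br /&gt;
 &lt;br /&gt;
# Reduce: collect, de-duplicate, and sort the document IDs for each word.&lt;br /&gt;
index = defaultdict(set)&lt;br /&gt;
for word, docid in pairs:&lt;br /&gt;
    index[word].add(docid)&lt;br /&gt;
 &lt;br /&gt;
for word in sorted(index):&lt;br /&gt;
    print(word, sorted(index[word]))&lt;br /&gt;
# be ['doc1']&lt;br /&gt;
# is ['doc2']&lt;br /&gt;
# map ['doc2']&lt;br /&gt;
# not ['doc1']&lt;br /&gt;
# or ['doc1']&lt;br /&gt;
# reduce ['doc2']&lt;br /&gt;
# to ['doc1', 'doc2']&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;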
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance on both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written with the P-threads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, and for these, P-threads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the GPU runtime is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has also drawn criticism. Google was awarded the patent for MapReduce, but it can be argued that the technology is similar to many that already existed. Programming models similar to MapReduce include Algorithm Skeletons (Parallelism Patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithm Skeletons are a high-level programming model for parallel and distributed computing, and skeleton frameworks and libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide-area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users (with over 180 analytic functions), and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93723</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93723"/>
		<updated>2015-02-14T02:08:33Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Map Reduce on Distributed Memory Machines */ adding challenges&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count the number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
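&lt;br /&gt;
For readers who want to run the example, here is a direct, sequential Python translation of the pseudocode above; the grouping step that a real runtime performs between Map and Reduce is simulated with a dictionary, and the sample documents are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
 &lt;br /&gt;
documents = ['the quick brown fox', 'the lazy dog the end']&lt;br /&gt;
 &lt;br /&gt;
def map_fn(document):&lt;br /&gt;
    # Input: a document; intermediate output: (word, 1) pairs.&lt;br /&gt;
    for word in document.split():&lt;br /&gt;
        yield word, 1&lt;br /&gt;
 &lt;br /&gt;
def reduce_fn(key, values):&lt;br /&gt;
    # Output: (word, number of occurrences).&lt;br /&gt;
    return key, sum(values)&lt;br /&gt;
 &lt;br /&gt;
# The runtime's job: group all intermediate values by key.&lt;br /&gt;
intermediate = defaultdict(list)&lt;br /&gt;
for doc in documents:&lt;br /&gt;
    for word, one in map_fn(doc):&lt;br /&gt;
        intermediate[word].append(one)&lt;br /&gt;
 &lt;br /&gt;
counts = dict(reduce_fn(w, ones) for w, ones in intermediate.items())&lt;br /&gt;
print(counts)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1, 'end': 1}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;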
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: communication between the MapReduce nodes is a major overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset fits into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited to a MapReduce application on distributed memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography, and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) are loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data are compared to each node, with the winning node being the one that best matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed memory machine, because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special: ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and ''R'' reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
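&lt;br /&gt;
The partitioning function mentioned above, hash(key) mod R, can be made concrete with a few lines of Python. This is only an illustration: the key set and R = 3 are arbitrary, and Python's built-in hash() stands in for whatever hash function a real implementation uses.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
R = 3   # number of reduce tasks / output partitions&lt;br /&gt;
 &lt;br /&gt;
def partition(key):&lt;br /&gt;
    # hash(key) mod R decides which reduce task receives this key.&lt;br /&gt;
    return hash(key) % R&lt;br /&gt;
 &lt;br /&gt;
intermediate_keys = ['apple', 'banana', 'cherry', 'apple', 'date']&lt;br /&gt;
 &lt;br /&gt;
regions = {r: [] for r in range(R)}&lt;br /&gt;
for key in intermediate_keys:&lt;br /&gt;
    regions[partition(key)].append(key)&lt;br /&gt;
 &lt;br /&gt;
# Every occurrence of the same key lands in the same region, so a single&lt;br /&gt;
# reduce task will see all intermediate values for that key.&lt;br /&gt;
print(regions)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;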
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
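&lt;br /&gt;
A minimal sketch of this bookkeeping might look like the following Python fragment; the field and variable names are invented for illustration and are not taken from Google's implementation.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from dataclasses import dataclass, field&lt;br /&gt;
 &lt;br /&gt;
@dataclass&lt;br /&gt;
class TaskInfo:&lt;br /&gt;
    state: str = 'idle'               # 'idle', 'in-progress', or 'completed'&lt;br /&gt;
    worker: str = 'unassigned'        # identity of the worker machine&lt;br /&gt;
 &lt;br /&gt;
@dataclass&lt;br /&gt;
class MapOutputInfo:&lt;br /&gt;
    locations: list = field(default_factory=list)   # one region per reduce task&lt;br /&gt;
    sizes: list = field(default_factory=list)       # size of each region&lt;br /&gt;
 &lt;br /&gt;
# The master tracks the state of every map and reduce task ...&lt;br /&gt;
map_tasks = {i: TaskInfo() for i in range(4)}        # M = 4&lt;br /&gt;
reduce_tasks = {i: TaskInfo() for i in range(2)}     # R = 2&lt;br /&gt;
 &lt;br /&gt;
# ... and, for each completed map task, where its R intermediate regions live.&lt;br /&gt;
map_tasks[0] = TaskInfo(state='completed', worker='worker-7')&lt;br /&gt;
map_outputs = {0: MapOutputInfo(locations=['worker-7:m0-r0', 'worker-7:m0-r1'],&lt;br /&gt;
                                sizes=[1024, 2048])}&lt;br /&gt;
print(map_tasks[0], map_outputs[0])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;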
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementation of Map-Reduce can be scaled to large clusters of machines comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on the way applications can be expressed within the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same ideas. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality.  The idea is that the network is slow and the data plentiful: while many processing frameworks bring the data to the computation,&lt;br /&gt;
Hadoop brings the computation to the data.  In some cases the data is so large that this is the only processing option.  Data is stored in Hadoop in a filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* JobTracker determines appropriate jobs based on how busy the TaskTracker is. &lt;br /&gt;
* TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTracker to move to the reduce phase. This again follows the same method, where a reduce task is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
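&lt;br /&gt;
What the user-provided map and reduce code looks like depends on the API in use. As one illustration, the Hadoop Streaming interface lets them be ordinary scripts that read from standard input and write tab-separated key/value lines to standard output; the framework handles the splitting, sorting, and task scheduling described above. A minimal word-count pair, written as two separate files, might look like this (the file names are arbitrary).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# mapper.py -- reads raw text lines, emits one tab-separated (word, 1) line per word&lt;br /&gt;
import sys&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    for word in line.split():&lt;br /&gt;
        print(word + '\t1')&lt;br /&gt;
 &lt;br /&gt;
# reducer.py -- receives lines already sorted by key, sums the counts per word&lt;br /&gt;
import sys&lt;br /&gt;
current, total = None, 0&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    word, count = line.rsplit('\t', 1)&lt;br /&gt;
    if word != current:&lt;br /&gt;
        if current is not None:&lt;br /&gt;
            print(current + '\t' + str(total))&lt;br /&gt;
        current, total = word, 0&lt;br /&gt;
    total += int(count)&lt;br /&gt;
if current is not None:&lt;br /&gt;
    print(current + '\t' + str(total))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;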
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product did show some pain points in the MRV1 implementation.  Notably, heavy processing&lt;br /&gt;
load would cause the JobTracker to become a large bottleneck.  In order to help remove this bottleneck, YARN was implemented.  YARN is an application framework that solely does&lt;br /&gt;
resource management for Hadoop clusters.  Not only can you run MapReduce jobs, you can also place other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing you to properly allocate resources across your cluster.  YARN at its simplest is the separation of the work that the JobTracker would do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports.  However, the execution of the job changes significantly.  YARN does work in units called containers.&lt;br /&gt;
Containers represent a unit of work that can be done on a cluster.  Upon job submission, the ResourceManager allocates a container for the ApplicationMaster.  This ApplicationMaster&lt;br /&gt;
runs on a DataNode in the cluster; to run it, the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container.  The ApplicationMaster then&lt;br /&gt;
determines, based on the input splits, the number of map tasks to create.  Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of data and available resources, the ResourceManager decides where to run the map tasks.  The ApplicationMaster then asks the NodeManagers on the assigned nodes to&lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce.  Spark takes greater advantage of available memory on the nodes in the cluster and will start job execution immediately, whereas MapReduce waits until code has been distributed to all the nodes.  Spark also adds a number of features to the framework, such as streaming ingestion and the ability to issue SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in memory nature of Spark there are a good number of machine learning frameworks that are being built on top of Spark.  This allows data to be read into memory on a cluster and iterations of an algorithm run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
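&lt;br /&gt;
As a brief illustration of the difference in style, the same word count expressed against Spark's RDD API keeps intermediate results in memory across operations. This sketch assumes a working PySpark installation, and the input path is only a placeholder.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from operator import add&lt;br /&gt;
from pyspark import SparkContext&lt;br /&gt;
 &lt;br /&gt;
sc = SparkContext(appName='wordcount-sketch')&lt;br /&gt;
counts = (sc.textFile('hdfs:///data/books')           # placeholder input path&lt;br /&gt;
            .flatMap(lambda line: line.split())&lt;br /&gt;
            .map(lambda word: (word, 1))&lt;br /&gt;
            .reduceByKey(add))                         # kept in memory where possible&lt;br /&gt;
for word, n in counts.take(10):&lt;br /&gt;
    print(word, n)&lt;br /&gt;
sc.stop()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;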
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop.  Tez is a DAG (directed-acyclic-graph) engine.  Based on the Microsoft Dryad paper, the DAG execution engine allows applications to express tasks as nodes in a graph.  Like Spark, it offers gains in execution speed and attempts to make more efficient use of available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop.  Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared memory system can show a significant increase in speed over cluster/disk-based systems because there is little to no I/O overhead. However, a few challenges present themselves in the shared memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
# Intermediate output is stored in memory, requiring a large amount of it for large problem sets.&lt;br /&gt;
# The ratio of key-value pairs to the number of distinct pairs highly affects performance.&lt;br /&gt;
# Execution time of the reduce phase is affected by task queue overhead.&lt;br /&gt;
# The size and shape of the data structure for storing the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of the user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
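&lt;br /&gt;
The buffer discipline described above can be mimicked in a few lines of Python to make the data flow visible. This is a conceptual emulation rather than Phoenix's C implementation, and all names and the toy input are invented.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
 &lt;br /&gt;
# Each worker keeps its own keyed map buffers; emitting an intermediate pair&lt;br /&gt;
# appends the value to that key's buffer for that worker.&lt;br /&gt;
workers = 2&lt;br /&gt;
map_buffers = [defaultdict(list) for _ in range(workers)]&lt;br /&gt;
 &lt;br /&gt;
def emit_intermediate(worker, key, value):&lt;br /&gt;
    map_buffers[worker][key].append(value)&lt;br /&gt;
 &lt;br /&gt;
for w, words in enumerate([['b', 'a', 'b'], ['a', 'c']]):&lt;br /&gt;
    for word in words:&lt;br /&gt;
        emit_intermediate(w, word, 1)&lt;br /&gt;
 &lt;br /&gt;
# Partition: all values for the same key end up in the same reduce unit.&lt;br /&gt;
reduce_units = defaultdict(list)&lt;br /&gt;
for buf in map_buffers:&lt;br /&gt;
    for key, values in buf.items():&lt;br /&gt;
        reduce_units[key].extend(values)&lt;br /&gt;
 &lt;br /&gt;
# Reduce, then merge the outputs into one buffer sorted by key.&lt;br /&gt;
output = sorted((key, sum(values)) for key, values in reduce_units.items())&lt;br /&gt;
print(output)   # [('a', 2), ('b', 2), ('c', 1)]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;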
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Inefficient key-value storage: because memory is shared, containers must provide fast lookup and retrieval over a potentially large data set, all the while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than the memory traffic. Combiners fail to reduce the memory allocation pressure, since generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Exposed task chunking: Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
===Challenges===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt;  Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++.  The major downfall of this implementation is a lack of fault tolerance: the underlying MPI library does not detect machines that are no longer part of the cluster very well.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate()                                    // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the map and reduce functions process keys and values just as in standard MapReduce implementations.  The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language built by the project called OINK.&lt;br /&gt;
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI.  KMR is more robust than MR-MPI, at the cost of making it slightly more complex to build a MapReduce application.  There isn't much that is very distinct about this implementation: it lets the programmer assign functions for the map phase, for the shuffle, and for the reduce phase.&lt;br /&gt;
&lt;br /&gt;
'''Examples'''&lt;br /&gt;
&lt;br /&gt;
See [http://mt.aics.riken.jp/kmr/docs/kmr-1.5/html/index.html#overview KMR Overview]&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU: &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead, which guarantees the correctness of parallel execution with little synchronization cost. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files, web pages, and so on. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit a count of &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts can be accumulated across more than one document before they reach the reducer.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, with the partial results then combined into a final result, is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and emits its result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a density characteristic, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. Iteration terminates on some condition, such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with its new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and the calculateState and getMessage functions, several other use cases can be addressed with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
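&lt;br /&gt;
A hedged Python sketch of one iteration of this breadth-first-search variant is shown below. The tiny adjacency list, the use of float('inf') as the initial distance, and the in-process grouping step are illustrative assumptions rather than part of any framework.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
INF = float('inf')&lt;br /&gt;
&lt;br /&gt;
# Hypothetical graph: node id mapped to [current distance, outgoing relations].&lt;br /&gt;
nodes = {&lt;br /&gt;
    'a': [0, ['b', 'c']],      # source node starts at distance 0&lt;br /&gt;
    'b': [INF, ['d']],&lt;br /&gt;
    'c': [INF, ['d']],&lt;br /&gt;
    'd': [INF, []],&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
def map_node(node_id, state):&lt;br /&gt;
    distance, neighbors = state&lt;br /&gt;
    yield node_id, ('node', state)            # pass the node object through&lt;br /&gt;
    for m in neighbors:&lt;br /&gt;
        yield m, ('message', distance + 1)    # getMessage(N) = N.State + 1&lt;br /&gt;
&lt;br /&gt;
def reduce_node(node_id, values):&lt;br /&gt;
    node = None&lt;br /&gt;
    messages = []&lt;br /&gt;
    for kind, payload in values:&lt;br /&gt;
        if kind == 'node':&lt;br /&gt;
            node = payload&lt;br /&gt;
        else:                                 # payload is a message&lt;br /&gt;
            messages.append(payload)&lt;br /&gt;
    node[0] = min([node[0]] + messages)       # calculateState keeps the smallest distance&lt;br /&gt;
    return node_id, node&lt;br /&gt;
&lt;br /&gt;
# One MapReduce iteration; in practice this repeats until distances stop changing.&lt;br /&gt;
grouped = defaultdict(list)&lt;br /&gt;
for node_id, state in nodes.items():&lt;br /&gt;
    for key, value in map_node(node_id, state):&lt;br /&gt;
        grouped[key].append(value)&lt;br /&gt;
nodes = dict(reduce_node(k, v) for k, v in grouped.items())&lt;br /&gt;
print(nodes['b'][0])    # 1 after the first iteration; 'd' needs a second iteration&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;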
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
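&lt;br /&gt;
Of these, the inverted index is easy to try directly. Below is a small, self-contained Python sketch; the two sample documents and the in-process grouping step are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
# Hypothetical documents: document id mapped to its text.&lt;br /&gt;
docs = {&lt;br /&gt;
    1: 'mapreduce makes counting words easy',&lt;br /&gt;
    2: 'an inverted index maps words to documents',&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# Map: emit a (word, document id) pair for every word occurrence.&lt;br /&gt;
pairs = [(word, doc_id) for doc_id, text in docs.items() for word in text.split()]&lt;br /&gt;
&lt;br /&gt;
# Shuffle: group document ids by word.&lt;br /&gt;
postings = defaultdict(set)&lt;br /&gt;
for word, doc_id in pairs:&lt;br /&gt;
    postings[word].add(doc_id)&lt;br /&gt;
&lt;br /&gt;
# Reduce: sort the document ids for each word.&lt;br /&gt;
inverted_index = {word: sorted(ids) for word, ids in postings.items()}&lt;br /&gt;
print(inverted_index['words'])    # [1, 2]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;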
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix delivers scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results show that the performance of Phoenix is close to that of parallel code written directly with the Pthreads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With a GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has attracted criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many that already existed. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level parallel programming model for parallel and distributed computing, and skeleton framework libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is notable for its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93719</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93719"/>
		<updated>2015-02-14T01:57:24Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* KMR */  add example link&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
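&lt;br /&gt;
The same program can be written in ordinary Python; the sketch below simulates the runtime's grouping step in-process. The sample input and the grouping dictionary are stand-ins for what a real MapReduce runtime would provide.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def map_words(document):&lt;br /&gt;
    # Map: emit an intermediate (word, 1) pair for every word.&lt;br /&gt;
    for word in document.split():&lt;br /&gt;
        yield word, 1&lt;br /&gt;
&lt;br /&gt;
def reduce_words(word, values):&lt;br /&gt;
    # Reduce: sum all counts emitted for this word.&lt;br /&gt;
    return word, sum(values)&lt;br /&gt;
&lt;br /&gt;
documents = ['to be or not to be', 'to map is to reduce']   # hypothetical input&lt;br /&gt;
&lt;br /&gt;
grouped = defaultdict(list)          # the runtime's grouping of pairs by key&lt;br /&gt;
for doc in documents:&lt;br /&gt;
    for word, count in map_words(doc):&lt;br /&gt;
        grouped[word].append(count)&lt;br /&gt;
&lt;br /&gt;
for word in sorted(grouped):&lt;br /&gt;
    print(reduce_words(word, grouped[word]))    # e.g. ('be', 2) ... ('to', 4)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;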
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: communication between the MapReduce nodes is a major overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One problem particularly suited to a MapReduce application on distributed-memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography, and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) are loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data are compared to each node, with the winning node being the one that best matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed-memory machine, because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement MapReduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
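&lt;br /&gt;
To make the data flow concrete, the in-process Python sketch below mimics steps 1 through 6 on a single machine. The number of splits M, the number of partitions R, and the sample input are arbitrary assumptions; hash(key) mod R is the partitioning function mentioned above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
M, R = 3, 2                                  # number of map splits and reduce partitions&lt;br /&gt;
words = 'the cat sat on the mat the end'.split()&lt;br /&gt;
splits = [words[i::M] for i in range(M)]     # step 1: divide the input into M splits&lt;br /&gt;
&lt;br /&gt;
# Steps 3-4: map tasks emit (word, 1) pairs, bucketed into R regions by hash(key) mod R.&lt;br /&gt;
regions = [defaultdict(list) for _ in range(R)]&lt;br /&gt;
for split in splits:                         # each loop iteration stands in for one map task&lt;br /&gt;
    for word in split:&lt;br /&gt;
        regions[hash(word) % R][word].append(1)&lt;br /&gt;
&lt;br /&gt;
# Steps 5-6: each reduce task sorts its keys and applies the user Reduce function.&lt;br /&gt;
output_files = []&lt;br /&gt;
for r in range(R):&lt;br /&gt;
    output_files.append([(w, sum(c)) for w, c in sorted(regions[r].items())])&lt;br /&gt;
&lt;br /&gt;
print(output_files)                          # R output 'files'; 'the' appears with count 3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;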
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Large variety of problems are easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementation of Map-Reduce can be scaled to large clusters of machines comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# Restricted programming model puts bounds on the way you implement the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same ideas. The important thing to note here is that Apache made this framework open source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. The idea is that the network is slow and the data plentiful: many processing frameworks bring the data to the computation, whereas&lt;br /&gt;
Hadoop brings the computation to the data. In some cases the data is so large that this is the only practical processing option. Data is stored in Hadoop in a filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* JobTracker determines appropriate jobs based on how busy the TaskTracker is. &lt;br /&gt;
* TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When they are done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This follows the same method, where a ReduceTask is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product revealed some pain points in the MRV1 implementation. Notably, heavy processing&lt;br /&gt;
load could cause the JobTracker to become a large bottleneck. To help remove this bottleneck, YARN was implemented. YARN is an application framework that does nothing but&lt;br /&gt;
resource management for Hadoop clusters. Not only can you run MapReduce jobs, but you can also put other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing you to properly allocate resources across your cluster. YARN, at its simplest, is the separation of the work the JobTracker would do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change their imports. However, the execution of the job changes significantly. YARN does its work in units called containers.&lt;br /&gt;
A container represents a unit of work that can be done on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which&lt;br /&gt;
runs on a DataNode in the cluster. To launch the ApplicationMaster, the ResourceManager asks a NodeManager to start it in that container. The ApplicationMaster then&lt;br /&gt;
determines, based on the input splits, the number of map tasks to create. Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks. The ApplicationMaster then asks the NodeManagers on the assigned nodes to&lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that removes some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until its code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in memory nature of Spark there are a good number of machine learning frameworks that are being built on top of Spark.  This allows data to be read into memory on a cluster and iterations of an algorithm run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a DAG (directed acyclic graph) engine. Based on the Microsoft Dryad paper, the DAG execution engine allows applications to model tasks as nodes in a graph. Like Spark, it offers gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared-memory system can show a significant increase in speed over cluster- and disk-based systems, because there is little to no I/O overhead. However, a few challenges present themselves in the shared-memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
# Intermediate output is stored in memory, which requires a large amount of memory for large problem sets.&lt;br /&gt;
# The ratio of key-value pairs to the number of distinct keys strongly affects performance.&lt;br /&gt;
# The execution time of the reduce phase is affected by task-queue overhead.&lt;br /&gt;
# The size and shape of the data structure used to store the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split them across tasks), pointers are manipulated instead of copying the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
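&lt;br /&gt;
The buffering idea described above, per-worker keyed buffers that are sorted at the end of a Map task, can be sketched generically in Python. This is an illustration of the idea only and does not show Phoenix's actual C data structures or API.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class MapWorkerBuffer:&lt;br /&gt;
    # Per-worker intermediate buffer: all values for a key live together.&lt;br /&gt;
    def __init__(self):&lt;br /&gt;
        self.pairs = {}&lt;br /&gt;
&lt;br /&gt;
    def emit_intermediate(self, key, value):&lt;br /&gt;
        self.pairs.setdefault(key, []).append(value)&lt;br /&gt;
&lt;br /&gt;
    def finish_map_task(self):&lt;br /&gt;
        # At the end of the Map task the buffer is sorted by key,&lt;br /&gt;
        # which makes the later Partition and Reduce steps cheaper.&lt;br /&gt;
        return sorted(self.pairs.items())&lt;br /&gt;
&lt;br /&gt;
buf = MapWorkerBuffer()                  # hypothetical usage&lt;br /&gt;
for word in 'a map of a map'.split():&lt;br /&gt;
    buf.emit_intermediate(word, 1)&lt;br /&gt;
print(buf.finish_map_task())             # [('a', [1, 1]), ('map', [1, 1]), ('of', [1])]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;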
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory-allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Due to shared memory, key-value storage is inefficient: containers must provide fast lookup and retrieval over potentially large data sets while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiner: on SMP machines, memory-allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory-allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory-access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables the user-implemented optimizations described in the previous two sections, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI. &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt;  Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++. The major downside of this implementation is its lack of fault tolerance: the implementation's MPI library does not detect machines that are no longer part of the cluster very well.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the functions process keys and values as in standard MapReduce implementations. The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language they have built called OINK.&lt;br /&gt;
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI. KMR is more robust than MR-MPI, at the cost of being slightly more complex to build a MapReduce application with. There is not much that is very distinct about this implementation: it contains the ability to assign functions for mapping, functions for reducing, and a function to control the shuffle.&lt;br /&gt;
&lt;br /&gt;
'''Examples'''&lt;br /&gt;
&lt;br /&gt;
See [http://mt.aics.riken.jp/kmr/docs/kmr-1.5/html/index.html#overview KMR Overview]&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to the architectural differences, there are following three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce frameworks. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead on top of the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
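&lt;br /&gt;
A common way to obtain lock-free output of this kind is to run a counting pass, turn the per-thread counts into write offsets with a prefix sum, and then let every thread write into its own disjoint slice of a preallocated buffer. The Python sketch below illustrates that general idea only; it is not the Mars API, and the per-thread outputs are made up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from itertools import accumulate&lt;br /&gt;
&lt;br /&gt;
# Hypothetical per-thread results, as a GPU kernel might produce them.&lt;br /&gt;
per_thread_output = [['k1', 'k2'], ['k3'], [], ['k4', 'k5', 'k6']]&lt;br /&gt;
&lt;br /&gt;
# Step 1: the counting pass records how many results each thread will emit.&lt;br /&gt;
counts = [len(out) for out in per_thread_output]&lt;br /&gt;
&lt;br /&gt;
# A prefix sum turns the counts into starting offsets, so writes never overlap.&lt;br /&gt;
offsets = [0] + list(accumulate(counts))[:-1]&lt;br /&gt;
&lt;br /&gt;
# Step 2: the output pass writes into a preallocated buffer without any atomics.&lt;br /&gt;
buffer = [None] * sum(counts)&lt;br /&gt;
for tid, out in enumerate(per_thread_output):&lt;br /&gt;
    for i, item in enumerate(out):&lt;br /&gt;
        buffer[offsets[tid] + i] = item&lt;br /&gt;
&lt;br /&gt;
print(buffer)    # ['k1', 'k2', 'k3', 'k4', 'k5', 'k6']&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;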
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you want to count the number of occurrences of each word in a set of documents. The documents could be anything, such as log files or web pages. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term a document contains and then have the reducer add the counts up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. One way to reduce this overhead is to have the mapper count the terms within its own document first.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To extend this idea, it is better to use a combiner so that counts can be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, with the partial results then combined into a final result, is a standard MapReduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and emits its result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a density characteristic, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. Iteration terminates on some condition, such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with its new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and the calculateState and getMessage functions, several other use cases can be addressed with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix delivers scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results show that the performance of Phoenix is close to that of parallel code written directly with the Pthreads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and this difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With such a framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many already existing ones. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic skeletons are a high-level programming model for parallel and distributed computing, and skeleton framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing of data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93717</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93717"/>
		<updated>2015-02-14T01:55:56Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* KMR */ bit more about kmr&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is their reliance on a distributed file system: the communication between the MapReduce nodes is a major overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One problem particularly suited to a MapReduce application on distributed-memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) are loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data are compared to each node, with the winning node being the one that most closely matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed-memory machine: the synchronization overheads are best avoided by segmenting the SOM into multiple regions, so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement MapReduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
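&lt;br /&gt;
As a minimal sketch of the partitioning step described above (illustrative Python, not Google's code; R = 4 is an arbitrary choice), routing an intermediate key to one of the ''R'' reduce partitions is simply a stable hash of the key modulo ''R'':&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Route each intermediate key to one of R reduce partitions.&lt;br /&gt;
# A stable hash is used so that the same key always lands in the same partition.&lt;br /&gt;
import hashlib&lt;br /&gt;
&lt;br /&gt;
R = 4   # number of reduce tasks, chosen by the user&lt;br /&gt;
&lt;br /&gt;
def partition(key, r=R):&lt;br /&gt;
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()&lt;br /&gt;
    return int(digest, 16) % r&lt;br /&gt;
&lt;br /&gt;
print(partition('the'), partition('quick'), partition('the'))   # same key, same partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;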
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
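&lt;br /&gt;
A rough Python sketch of this bookkeeping (hypothetical field names, not Google's actual data structures; M = 8 and R = 4 are arbitrary) might look like the following:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from dataclasses import dataclass, field&lt;br /&gt;
&lt;br /&gt;
@dataclass&lt;br /&gt;
class MapTaskInfo:&lt;br /&gt;
    state: str = 'idle'         # 'idle', 'in-progress', or 'completed'&lt;br /&gt;
    worker: str = 'unassigned'  # identity of the worker machine for non-idle tasks&lt;br /&gt;
    regions: list = field(default_factory=list)  # locations/sizes of the R regions&lt;br /&gt;
&lt;br /&gt;
@dataclass&lt;br /&gt;
class ReduceTaskInfo:&lt;br /&gt;
    state: str = 'idle'&lt;br /&gt;
    worker: str = 'unassigned'&lt;br /&gt;
&lt;br /&gt;
# One record per task; as map tasks complete, their region locations are&lt;br /&gt;
# pushed incrementally to workers with in-progress reduce tasks.&lt;br /&gt;
map_tasks = [MapTaskInfo() for _ in range(8)]        # M = 8 input splits&lt;br /&gt;
reduce_tasks = [ReduceTaskInfo() for _ in range(4)]  # R = 4 partitions&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;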
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems are easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# The Map-Reduce implementation can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on how computations can be expressed within the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own open-source implementation. This framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. The network is slow and the data are plentiful, so while many processing frameworks bring the data to the computation, Hadoop brings the computation to the data. In some cases the data are so large that this is the only practical processing option. Data are stored in Hadoop in a filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapReduce (MRV1) is based on a “pull” model in which multiple “TaskTrackers” poll the “JobTracker” for tasks (either map tasks or reduce tasks).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* The client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker, which in turn returns a Job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker determines the appropriate tasks to assign based on how busy each TaskTracker is. &lt;br /&gt;
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function, filling a buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When the map phase is done, the JobTracker notifies the TaskTrackers to move to the reduce phase; reduce tasks are forked in the same way. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
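&lt;br /&gt;
For concreteness, the user-provided map and reduce code can be as small as the following Python sketch written in the style of Hadoop Streaming (the file name, the command-line convention, and the single-file layout are illustrative assumptions; Streaming normally runs the mapper and reducer as separate scripts over standard input and output):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# wordcount_streaming.py -- illustrative sketch only.&lt;br /&gt;
import sys&lt;br /&gt;
from itertools import groupby&lt;br /&gt;
&lt;br /&gt;
def mapper(lines):&lt;br /&gt;
    # Emit one (word, 1) pair per token, tab-separated.&lt;br /&gt;
    for line in lines:&lt;br /&gt;
        for word in line.split():&lt;br /&gt;
            print('%s\t%d' % (word, 1))&lt;br /&gt;
&lt;br /&gt;
def reducer(lines):&lt;br /&gt;
    # Streaming delivers mapper output sorted by key, so consecutive&lt;br /&gt;
    # lines with the same word can be summed with groupby.&lt;br /&gt;
    pairs = (line.rstrip('\n').split('\t') for line in lines)&lt;br /&gt;
    for word, group in groupby(pairs, key=lambda kv: kv[0]):&lt;br /&gt;
        total = sum(int(count) for _, count in group)&lt;br /&gt;
        print('%s\t%d' % (word, total))&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    (mapper if sys.argv[1] == 'map' else reducer)(sys.stdin)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;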
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial MRV1 implementation on Hadoop was successful, heavy use of the product revealed some pain points: notably, a heavy processing load would make the JobTracker a large bottleneck. YARN was implemented to help remove this bottleneck. YARN is an application framework that handles resource management for Hadoop clusters; not only can MapReduce jobs run under it, but other in-cluster frameworks can also be placed under YARN resource management, allowing resources to be allocated properly across the cluster. YARN, at its simplest, is the separation of the work the JobTracker used to do into two new processes: the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change their imports; however, the execution of a job changes significantly. YARN does work in units called containers, which represent a unit of work that can be done on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a DataNode in the cluster; to start the application, the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container. The ApplicationMaster then determines, based on the input splits, the number of map tasks to create. Once this is known, the ApplicationMaster requests container resources from the ResourceManager. Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster then asks the NodeManagers on the assigned nodes to start them.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. This allows data to be read into memory on a cluster and iterations of an algorithm to be run over the same data in memory, instead of reading it from disk repeatedly.&lt;br /&gt;
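&lt;br /&gt;
As an illustration of how compact the same word count becomes under Spark (a minimal PySpark sketch assuming a local Spark installation; the input path is purely hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from pyspark import SparkContext&lt;br /&gt;
&lt;br /&gt;
sc = SparkContext('local[*]', 'wordcount-sketch')&lt;br /&gt;
&lt;br /&gt;
counts = (sc.textFile('hdfs:///tmp/input.txt')     # hypothetical input path&lt;br /&gt;
            .flatMap(lambda line: line.split())    # map: emit one word per token&lt;br /&gt;
            .map(lambda word: (word, 1))           # pair each word with a count of 1&lt;br /&gt;
            .reduceByKey(lambda a, b: a + b))      # reduce: sum the counts per word&lt;br /&gt;
&lt;br /&gt;
for word, total in counts.collect():&lt;br /&gt;
    print(word, total)&lt;br /&gt;
&lt;br /&gt;
sc.stop()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;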
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a DAG (directed-acyclic-graph) engine. Based on the Microsoft Dryad paper, the DAG execution engine allows an application's tasks to form the nodes of a graph. Like Spark, it offers gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared-memory system can show a significant increase in speed over cluster/disk-based systems because there is little to no I/O overhead. However, a few challenges present themselves in the shared-memory environment &amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
# Intermediate output is stored in memory, which requires a large amount of memory for large problem sets.&lt;br /&gt;
# The ratio of key-value pairs to the number of distinct pairs strongly affects performance.&lt;br /&gt;
# The execution time of the reduce phase is affected by task queue overhead.&lt;br /&gt;
# The size and shape of the data structure used for storing the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, consisting of two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure that valid data are communicated between user and run-time code, it is ultimately the user's task to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
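&lt;br /&gt;
The stage structure described above can be mimicked in ordinary Python (a sequential, conceptual sketch of the split/map/partition/reduce/merge flow, not the Phoenix C API or its threading):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def run_mapreduce(inputs, splitter, map_fn, reduce_fn, num_workers=4):&lt;br /&gt;
    # Splitter: divide the input into roughly equal units, one per Map task.&lt;br /&gt;
    units = splitter(inputs, num_workers)&lt;br /&gt;
&lt;br /&gt;
    # Map stage: each unit emits intermediate (key, value) pairs.&lt;br /&gt;
    intermediate = []&lt;br /&gt;
    for unit in units:&lt;br /&gt;
        intermediate.extend(map_fn(unit))&lt;br /&gt;
&lt;br /&gt;
    # Partition stage: all values for the same key end up in the same bucket.&lt;br /&gt;
    buckets = defaultdict(list)&lt;br /&gt;
    for key, value in intermediate:&lt;br /&gt;
        buckets[key].append(value)&lt;br /&gt;
&lt;br /&gt;
    # Reduce stage, followed by a final merge sorted by key.&lt;br /&gt;
    return sorted((key, reduce_fn(key, values)) for key, values in buckets.items())&lt;br /&gt;
&lt;br /&gt;
# Word count expressed against this toy runtime:&lt;br /&gt;
def split_evenly(docs, n):&lt;br /&gt;
    return [docs[i::n] for i in range(n)]&lt;br /&gt;
&lt;br /&gt;
def count_map(docs):&lt;br /&gt;
    return [(word, 1) for doc in docs for word in doc.split()]&lt;br /&gt;
&lt;br /&gt;
def count_reduce(word, counts):&lt;br /&gt;
    return sum(counts)&lt;br /&gt;
&lt;br /&gt;
print(run_mapreduce(['a b a', 'b c'], split_evenly, count_map, count_reduce))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;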
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), pointers are manipulated instead of copying the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Key-value storage is inefficient in shared memory, since containers must provide fast lookup and retrieval over a potentially large data set, all the while coordinating accesses across multiple threads.&lt;br /&gt;
#Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
#Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code that deals with chunks. Second, if the user leverages the exposed chunk size to improve performance, the framework can no longer freely adjust it, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI. &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt; Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++. The major downfall of this implementation is a lack of fault tolerance: the implementation's MPI library does not detect machines that are no longer part of the cluster very well.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the functions process keys and values just as in standard MapReduce implementations. The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language the authors have built called OINK.&lt;br /&gt;
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI. KMR is more robust than MR-MPI, at the cost of being slightly more complex to build a MapReduce application with. There is not much that is very distinct about this implementation: it provides the ability to assign your own functions for the map, the shuffle, and the reduce steps.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU: &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism of the GPU is well utilized. To avoid conflicts between concurrent writes, Mars has a lock-free scheme with low runtime overhead on the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
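&lt;br /&gt;
The effect of that two-step design can be sketched in plain Python (an illustration of the idea only, not GPU code): a first pass counts each thread's output size, a prefix sum over the counts yields non-overlapping write offsets, and a second pass writes the results without locks or atomic operations:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Step 1: every 'thread' counts how many outputs it will produce (as in MAP_COUNT).&lt;br /&gt;
per_thread_inputs = [['a', 'bb'], ['ccc'], ['dd', 'e', 'f']]&lt;br /&gt;
counts = [len(chunk) for chunk in per_thread_inputs]&lt;br /&gt;
&lt;br /&gt;
# A prefix sum over the counts gives each thread a private write offset,&lt;br /&gt;
# so concurrent writes can never overlap.&lt;br /&gt;
offsets = [0]&lt;br /&gt;
for c in counts[:-1]:&lt;br /&gt;
    offsets.append(offsets[-1] + c)&lt;br /&gt;
&lt;br /&gt;
# Step 2: every 'thread' writes its results at its own offset (as in MAP).&lt;br /&gt;
output = [None] * sum(counts)&lt;br /&gt;
for chunk, offset in zip(per_thread_inputs, offsets):&lt;br /&gt;
    for i, item in enumerate(chunk):&lt;br /&gt;
        output[offset + i] = (item, len(item))&lt;br /&gt;
&lt;br /&gt;
print(output)   # [('a', 1), ('bb', 2), ('ccc', 3), ('dd', 2), ('e', 1), ('f', 1)]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;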
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or an HTTP page. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires a large number of dummy counters to be emitted by the mapper. A way to clean this up is to make the mapper count the terms for its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
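&lt;br /&gt;
A runnable Python rendering of the in-mapper combining variant above (a sketch using only the standard library; the two example documents are invented):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import Counter&lt;br /&gt;
&lt;br /&gt;
documents = {'d1': 'to be or not to be', 'd2': 'to do is to be'}&lt;br /&gt;
&lt;br /&gt;
def map_with_local_counts(docid, doc):&lt;br /&gt;
    # In-mapper combining: count terms locally before emitting anything,&lt;br /&gt;
    # so only one pair per distinct term leaves this mapper.&lt;br /&gt;
    return list(Counter(doc.split()).items())&lt;br /&gt;
&lt;br /&gt;
def reduce_counts(term, counts):&lt;br /&gt;
    return term, sum(counts)&lt;br /&gt;
&lt;br /&gt;
# Emulate the shuffle: group the intermediate pairs by term.&lt;br /&gt;
partials = {}&lt;br /&gt;
for docid, doc in documents.items():&lt;br /&gt;
    for term, count in map_with_local_counts(docid, doc):&lt;br /&gt;
        partials.setdefault(term, []).append(count)&lt;br /&gt;
&lt;br /&gt;
print(sorted(reduce_counts(term, counts) for term, counts in partials.items()))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;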
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, with the partial results then combined into a final result, is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and then emits the result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
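&lt;br /&gt;
On a single machine, the same pattern can be sketched with Python's multiprocessing pool standing in for the mappers (the calculate function here is a placeholder for the real per-specification computation):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from multiprocessing import Pool&lt;br /&gt;
&lt;br /&gt;
def calculate(spec):&lt;br /&gt;
    # Placeholder for the real computation described by one specification.&lt;br /&gt;
    return spec * spec&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    specs = range(10)                # each element is one work specification&lt;br /&gt;
    with Pool(processes=4) as pool:&lt;br /&gt;
        results = pool.map(calculate, specs)   # the map step, run in parallel&lt;br /&gt;
    print(sum(results))              # the reduce step: combine the partial results&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;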
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between the nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration is terminated by some condition, such as a fixed number of iterations or negligible change in state. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with the new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be covered by this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search (a runnable sketch follows the pseudo-code below).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
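&lt;br /&gt;
A runnable Python sketch of the same breadth-first search, iterating a map step and a reduce step over an adjacency list until the distances stop changing (the example graph is invented):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
INF = float('inf')&lt;br /&gt;
graph = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}&lt;br /&gt;
state = {node: (0 if node == 'a' else INF) for node in graph}  # distance from 'a'&lt;br /&gt;
&lt;br /&gt;
def map_step(node, neighbors):&lt;br /&gt;
    # Re-emit the node itself plus a candidate distance for each neighbor.&lt;br /&gt;
    yield node, state[node]&lt;br /&gt;
    for m in neighbors:&lt;br /&gt;
        yield m, state[node] + 1&lt;br /&gt;
&lt;br /&gt;
def reduce_step(node, candidates):&lt;br /&gt;
    return min(candidates)   # calculateState for BFS is simply the minimum&lt;br /&gt;
&lt;br /&gt;
changed = True&lt;br /&gt;
while changed:&lt;br /&gt;
    grouped = {}&lt;br /&gt;
    for node, neighbors in graph.items():&lt;br /&gt;
        for key, value in map_step(node, neighbors):&lt;br /&gt;
            grouped.setdefault(key, []).append(value)&lt;br /&gt;
    new_state = {node: reduce_step(node, vals) for node, vals in grouped.items()}&lt;br /&gt;
    changed = new_state != state&lt;br /&gt;
    state = new_state&lt;br /&gt;
&lt;br /&gt;
print(state)   # {'a': 0, 'b': 1, 'c': 1, 'd': 2}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;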
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
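&lt;br /&gt;
As a small runnable illustration of the inverted-index example above (a sketch; the two documents are invented), the computation amounts to emitting (word, document ID) pairs and grouping them by word:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
documents = {1: 'map reduce on clusters', 2: 'reduce overhead on clusters'}&lt;br /&gt;
&lt;br /&gt;
def map_fn(doc_id, text):&lt;br /&gt;
    # Emit one (word, document ID) pair per token.&lt;br /&gt;
    return [(word, doc_id) for word in text.split()]&lt;br /&gt;
&lt;br /&gt;
# Shuffle: collect all document IDs seen for each word.&lt;br /&gt;
index = {}&lt;br /&gt;
for doc_id, text in documents.items():&lt;br /&gt;
    for word, emitted_id in map_fn(doc_id, text):&lt;br /&gt;
        index.setdefault(word, []).append(emitted_id)&lt;br /&gt;
&lt;br /&gt;
# Reduce: sort the document IDs for each word.&lt;br /&gt;
inverted = {word: sorted(set(ids)) for word, ids in index.items()}&lt;br /&gt;
print(inverted)   # e.g. {'map': [1], 'reduce': [1, 2], ...}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;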
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix achieves scalable performance on both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite its runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written with the Pthreads API. Nevertheless, there are applications that do not fit naturally into the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and this difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With such a framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many already existing ones. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic skeletons are a high-level programming model for parallel and distributed computing, and skeleton framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing of data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93716</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93716"/>
		<updated>2015-02-14T01:51:36Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Map Reduce on Distributed Memory Machines */ start on KMR&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
The program counts the number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
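&lt;br /&gt;
As a small illustration of one such decision, the following sketch shows how a run-time might divide an input of a given size into fixed-size work units for the Map step; the 64 MB unit size mirrors the split sizes mentioned later for Google's implementation, and all names here are illustrative rather than taken from any real runtime.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;algorithm&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// One contiguous piece of the input, handed to a single Map task.&lt;br /&gt;
struct Split { std::size_t offset; std::size_t length; };&lt;br /&gt;
&lt;br /&gt;
// Divide total_bytes of input into units of at most unit_bytes each.&lt;br /&gt;
std::vector&amp;lt;Split&amp;gt; make_splits(std::size_t total_bytes, std::size_t unit_bytes) {&lt;br /&gt;
    std::vector&amp;lt;Split&amp;gt; splits;&lt;br /&gt;
    for (std::size_t off = 0; off &amp;lt; total_bytes; off += unit_bytes) {&lt;br /&gt;
        std::size_t len = std::min(unit_bytes, total_bytes - off);&lt;br /&gt;
        splits.push_back({off, len});&lt;br /&gt;
    }&lt;br /&gt;
    return splits;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    // 1 GB of input with 64 MB units yields 16 Map tasks.&lt;br /&gt;
    std::vector&amp;lt;Split&amp;gt; s = make_splits(1024u * 1024 * 1024, 64u * 1024 * 1024);&lt;br /&gt;
    std::cout &amp;lt;&amp;lt; s.size() &amp;lt;&amp;lt; &amp;quot; map tasks\n&amp;quot;;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;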
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: communication between the MapReduce nodes is a major overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset fits into memory, running a fully distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited to a MapReduce application on distributed memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) is loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data is compared to each node, with the winning node being the one that most closely matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed memory machine, because synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
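&lt;br /&gt;
For instance, a partitioning function of the hash(key) mod R form could look like the following minimal C++ sketch; std::hash is used purely for illustration and is not the hash function used by Google.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;functional&amp;gt;&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Assign an intermediate key to one of R reduce partitions: hash(key) mod R.&lt;br /&gt;
std::size_t partition_for(const std::string &amp;amp;key, std::size_t R) {&lt;br /&gt;
    return std::hash&amp;lt;std::string&amp;gt;{}(key) % R;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    // Every occurrence of the same key is sent to the same reduce partition.&lt;br /&gt;
    std::cout &amp;lt;&amp;lt; partition_for(&amp;quot;example&amp;quot;, 4) &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;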
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition. (A minimal sketch of this grouping appears after this list.)&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
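&lt;br /&gt;
Because the data is sorted, runs of equal keys are contiguous, so the reduce worker in step 6 can walk the data once and call Reduce with each key and its value list. Below is a minimal single-process C++ sketch of that grouping, with a summing Reduce standing in for the user's function; this is an illustration, not Google's code.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;algorithm&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;utility&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Walk sorted (key, value) pairs and invoke a Reduce step once per unique key.&lt;br /&gt;
void reduce_all(std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt; &amp;gt; pairs) {&lt;br /&gt;
    std::sort(pairs.begin(), pairs.end());           // step 5: sort by key&lt;br /&gt;
    std::size_t i = 0;&lt;br /&gt;
    while (i &amp;lt; pairs.size()) {&lt;br /&gt;
        std::size_t j = i;&lt;br /&gt;
        int sum = 0;                                  // the Reduce here just sums&lt;br /&gt;
        while (j &amp;lt; pairs.size() &amp;amp;&amp;amp; pairs[j].first == pairs[i].first) {&lt;br /&gt;
            sum += pairs[j].second;&lt;br /&gt;
            ++j;&lt;br /&gt;
        }&lt;br /&gt;
        std::cout &amp;lt;&amp;lt; pairs[i].first &amp;lt;&amp;lt; &amp;quot; &amp;quot; &amp;lt;&amp;lt; sum &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;   // append to this partition's output&lt;br /&gt;
        i = j;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt; &amp;gt; input;&lt;br /&gt;
    input.push_back(std::make_pair(std::string(&amp;quot;b&amp;quot;), 1));&lt;br /&gt;
    input.push_back(std::make_pair(std::string(&amp;quot;a&amp;quot;), 1));&lt;br /&gt;
    input.push_back(std::make_pair(std::string(&amp;quot;b&amp;quot;), 1));&lt;br /&gt;
    reduce_all(input);                                // prints: a 1 then b 2&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;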
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
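&lt;br /&gt;
A hedged sketch of this bookkeeping is shown below; the type and field names are invented for illustration and are not the definitions used in Google's implementation.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
enum class TaskState { Idle, InProgress, Completed };&lt;br /&gt;
&lt;br /&gt;
// Per-task record kept by the master.&lt;br /&gt;
struct TaskInfo {&lt;br /&gt;
    TaskState state = TaskState::Idle;&lt;br /&gt;
    int worker_id = -1;                      // machine running the task, if any&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
// For a completed map task: where its R intermediate regions live.&lt;br /&gt;
struct MapOutputInfo {&lt;br /&gt;
    std::vector&amp;lt;std::string&amp;gt; locations;      // R file locations on the map worker&lt;br /&gt;
    std::vector&amp;lt;std::size_t&amp;gt; sizes;          // R region sizes in bytes&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
struct MasterState {&lt;br /&gt;
    std::vector&amp;lt;TaskInfo&amp;gt; map_tasks;         // M entries&lt;br /&gt;
    std::vector&amp;lt;TaskInfo&amp;gt; reduce_tasks;      // R entries&lt;br /&gt;
    std::vector&amp;lt;MapOutputInfo&amp;gt; map_outputs;  // filled in as map tasks complete&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    MasterState master;&lt;br /&gt;
    master.map_tasks.resize(8);              // M = 8&lt;br /&gt;
    master.reduce_tasks.resize(2);           // R = 2&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;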
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
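&lt;br /&gt;
A minimal sketch of this ping-and-reset logic appears below; the timeout, data structures, and names are assumptions made for illustration rather than details from the paper.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
struct Worker {&lt;br /&gt;
    double last_ping_seconds = 0.0;   // when the last ping response arrived&lt;br /&gt;
    bool failed = false;&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
enum class TaskState { Idle, InProgress, Completed };&lt;br /&gt;
&lt;br /&gt;
struct MapTask {&lt;br /&gt;
    TaskState state = TaskState::Idle;&lt;br /&gt;
    int worker_id = -1;               // worker the task runs (or ran) on&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
// Mark unresponsive workers as failed and reset their map tasks to idle so&lt;br /&gt;
// they become eligible for rescheduling; completed reduce output lives in a&lt;br /&gt;
// global file system and would not be reset this way.&lt;br /&gt;
void check_workers(std::vector&amp;lt;Worker&amp;gt; &amp;amp;workers, std::vector&amp;lt;MapTask&amp;gt; &amp;amp;maps,&lt;br /&gt;
                   double now, double timeout) {&lt;br /&gt;
    for (std::size_t w = 0; w &amp;lt; workers.size(); ++w) {&lt;br /&gt;
        if (workers[w].failed || now - workers[w].last_ping_seconds &amp;lt; timeout)&lt;br /&gt;
            continue;&lt;br /&gt;
        workers[w].failed = true;&lt;br /&gt;
        for (MapTask &amp;amp;t : maps)&lt;br /&gt;
            if (t.worker_id == static_cast&amp;lt;int&amp;gt;(w))&lt;br /&gt;
                t.state = TaskState::Idle;   // re-execute on another worker&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
int main() {}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;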
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems can be easily expressed as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementations of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on how computations can be implemented. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS&amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same ideas. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. Because the network is slow and data is plentiful, many processing frameworks bring the data to the computation; Hadoop instead brings the computation to the data. In some cases the data is so large that this is the only practical option. Data is stored in Hadoop in the filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker determines appropriate tasks based on how busy each TaskTracker is. &lt;br /&gt;
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function, filling a buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files (a minimal sketch of this buffering and spilling appears after this list). &lt;br /&gt;
* After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This follows the same method, with a ReduceTask being forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
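&lt;br /&gt;
Below is a minimal sketch of the map-side buffering and spilling mentioned in the list; the threshold, the names, and the in-memory stand-in for a spill file are illustrative assumptions, not Hadoop code.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;algorithm&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;utility&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
typedef std::pair&amp;lt;std::string, int&amp;gt; KV;&lt;br /&gt;
&lt;br /&gt;
class MapOutputBuffer {&lt;br /&gt;
public:&lt;br /&gt;
    explicit MapOutputBuffer(std::size_t limit) : limit_(limit) {}&lt;br /&gt;
&lt;br /&gt;
    // Called by the user-provided map function for every emitted pair.&lt;br /&gt;
    void emit(const std::string &amp;amp;key, int value) {&lt;br /&gt;
        buffer_.push_back(KV(key, value));&lt;br /&gt;
        if (buffer_.size() &amp;gt;= limit_) spill();&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
    // Flush whatever is left when the map task finishes.&lt;br /&gt;
    void close() { if (!buffer_.empty()) spill(); }&lt;br /&gt;
&lt;br /&gt;
private:&lt;br /&gt;
    void spill() {&lt;br /&gt;
        std::sort(buffer_.begin(), buffer_.end());   // sort by key before writing&lt;br /&gt;
        spills_.push_back(buffer_);                  // stands in for a spill file&lt;br /&gt;
        buffer_.clear();&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
    std::size_t limit_;&lt;br /&gt;
    std::vector&amp;lt;KV&amp;gt; buffer_;&lt;br /&gt;
    std::vector&amp;lt;std::vector&amp;lt;KV&amp;gt; &amp;gt; spills_;&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    MapOutputBuffer out(2);&lt;br /&gt;
    out.emit(&amp;quot;b&amp;quot;, 1); out.emit(&amp;quot;a&amp;quot;, 1); out.emit(&amp;quot;c&amp;quot;, 1);&lt;br /&gt;
    out.close();                                     // two spills: {a,b} and {c}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;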
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product revealed some pain points. Notably, heavy processing load would cause the JobTracker to become a large bottleneck. To help remove this bottleneck, YARN was implemented. YARN is an application framework that solely does resource management for Hadoop clusters. Not only can you run MapReduce jobs, you can also place other in-cluster frameworks under YARN resource management, allowing you to properly allocate resources across your cluster. At its simplest, YARN is the separation of the work that the JobTracker used to do into two new processes: the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports; however, the execution of the job changes significantly. YARN does work in units called containers. A container represents a unit of work that can be done on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a DataNode in the cluster; the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container. The ApplicationMaster then determines, based on the input splits, the number of map tasks to create. Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager. Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster then asks the NodeManagers on the assigned nodes to start them.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the available memory on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until the code has been distributed to all of the nodes. Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. This allows data to be read into memory on a cluster and iterations of an algorithm to run over the same in-memory data instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a directed-acyclic-graph (DAG) engine. Based on the Microsoft Dryad paper, the DAG execution engine allows applications to express each task as a node in a graph. Like Spark, it offers gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Running MapReduce on a shared-memory system can show a significant increase in speed over cluster/disk-based systems because there is little to no I/O overhead. However, a few challenges present themselves in the shared-memory environment&amp;lt;ref&amp;gt;http://www4.ncsu.edu/~dtiwari2/Papers/2012_IPDPS_Devesh_MapReduce.pdf&amp;lt;/ref&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
# Intermediate output is stored in memory, requiring a large amount of memory for large problem sets.&lt;br /&gt;
# The ratio of key-value pairs relative to the number of distinct pairs highly affects performance.&lt;br /&gt;
# Execution time of reduce phase is affected by task queue overhead.&lt;br /&gt;
# The size and shape of the data structure used for storing the intermediate output affect the map and reduce phases differently.&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, organized into two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
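&lt;br /&gt;
As a rough illustration of the general shape of such a structure, the sketch below shows an input buffer plus the function pointers handed to the runtime; the field names and types here are hypothetical and are not the actual scheduler_args_t definition, which is in the linked header.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Hypothetical illustration of a Phoenix-style argument structure; field names&lt;br /&gt;
// are invented, so consult MapReduceScheduler.h for the real scheduler_args_t.&lt;br /&gt;
typedef void (*map_fn)(void *chunk);&lt;br /&gt;
typedef void (*reduce_fn)(void *key, void **vals, int val_count);&lt;br /&gt;
typedef int (*splitter_fn)(void *data, int unit_size, void **chunk_out);&lt;br /&gt;
typedef int (*key_cmp_fn)(const void *a, const void *b);&lt;br /&gt;
&lt;br /&gt;
struct example_scheduler_args {&lt;br /&gt;
    void *input_data;          // user-allocated input buffer&lt;br /&gt;
    std::size_t data_size;     // size of the input in bytes&lt;br /&gt;
    int unit_size;             // granularity handed to each Map task&lt;br /&gt;
    map_fn map;                // required user function&lt;br /&gt;
    reduce_fn reduce;          // optional merging step&lt;br /&gt;
    splitter_fn splitter;      // partitions the input before the Map step&lt;br /&gt;
    key_cmp_fn key_cmp;        // key comparison for the final sort&lt;br /&gt;
    void *output_data;         // user-allocated output buffer&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
int main() {}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;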
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split them across tasks), pointers are manipulated instead of copying the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
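&lt;br /&gt;
A minimal sketch of a keyed intermediate buffer of this kind, using standard containers as a simplification rather than the actual Phoenix buffer layout:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;map&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Per-worker intermediate buffer: all values for a key are stored together,&lt;br /&gt;
// and iterating a std::map visits keys in sorted order, standing in for the&lt;br /&gt;
// end-of-task sort described above.&lt;br /&gt;
struct WorkerBuffer {&lt;br /&gt;
    std::map&amp;lt;std::string, std::vector&amp;lt;int&amp;gt; &amp;gt; pairs;&lt;br /&gt;
&lt;br /&gt;
    void emit_intermediate(const std::string &amp;amp;key, int value) {&lt;br /&gt;
        pairs[key].push_back(value);   // buffer grows dynamically as needed&lt;br /&gt;
    }&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    WorkerBuffer w;&lt;br /&gt;
    w.emit_intermediate(&amp;quot;the&amp;quot;, 1);&lt;br /&gt;
    w.emit_intermediate(&amp;quot;a&amp;quot;, 1);&lt;br /&gt;
    w.emit_intermediate(&amp;quot;the&amp;quot;, 1);    // the -&amp;gt; {1, 1}, a -&amp;gt; {1}&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;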
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. Combiners contribute to better data locality and lower memory allocation pressure, allowing a substantial number of applications to scale well.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Inefficient key-value storage: containers must provide fast lookup and retrieval over a potentially large data set, all while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt; Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++. The major downfall of this implementation is its lack of fault tolerance: the underlying MPI library does not handle machines dropping out of the cluster very well.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code, where the functions process keys and values as in standard MapReduce implementations. The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language the developers built called OINK.&lt;br /&gt;
&lt;br /&gt;
====KMR====&lt;br /&gt;
&lt;br /&gt;
KMR is another MapReduce implementation based on MPI. KMR is more robust than MR-MPI, at the cost of making it slightly more complex to build your MapReduce application.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU: &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism of the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
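&lt;br /&gt;
The effect of the two-step design can be illustrated off the GPU: a first pass counts each thread's output size, an exclusive prefix sum over those counts gives every thread a private write offset, and a second pass writes results with no atomic operations. The following is a minimal sequential C++ sketch of that idea, not Mars code.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;numeric&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    // Counting pass: suppose four threads report these output sizes.&lt;br /&gt;
    std::vector&amp;lt;std::size_t&amp;gt; counts = {2, 0, 3, 1};&lt;br /&gt;
&lt;br /&gt;
    // Exclusive prefix sum: each thread's private offset into the output.&lt;br /&gt;
    std::vector&amp;lt;std::size_t&amp;gt; offsets(counts.size(), 0);&lt;br /&gt;
    std::partial_sum(counts.begin(), counts.end() - 1, offsets.begin() + 1);&lt;br /&gt;
&lt;br /&gt;
    // Output pass: every thread writes into its own reserved slice, so no&lt;br /&gt;
    // atomics or locks are needed even if the writes happen concurrently.&lt;br /&gt;
    std::vector&amp;lt;int&amp;gt; output(offsets.back() + counts.back());&lt;br /&gt;
    for (std::size_t t = 0; t &amp;lt; counts.size(); ++t)&lt;br /&gt;
        for (std::size_t k = 0; k &amp;lt; counts[t]; ++k)&lt;br /&gt;
            output[offsets[t] + k] = static_cast&amp;lt;int&amp;gt;(t);   // stand-in result&lt;br /&gt;
&lt;br /&gt;
    for (std::size_t i = 0; i &amp;lt; output.size(); ++i)&lt;br /&gt;
        std::cout &amp;lt;&amp;lt; output[i] &amp;lt;&amp;lt; &amp;quot; &amp;quot;;                     // 0 0 2 2 2 3&lt;br /&gt;
    std::cout &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;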
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or a web page. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
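&lt;br /&gt;
For readers who want to run the counting pattern outside a cluster, the following is a minimal single-process C++ analogue; it illustrates the map, group, and reduce logic only and performs no distribution or shuffling.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;map&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;utility&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Map: split one document into (term, 1) pairs.&lt;br /&gt;
std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt; &amp;gt; map_doc(const std::string &amp;amp;doc) {&lt;br /&gt;
    std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt; &amp;gt; out;&lt;br /&gt;
    std::istringstream in(doc);&lt;br /&gt;
    std::string word;&lt;br /&gt;
    while (in &amp;gt;&amp;gt; word) out.push_back(std::make_pair(word, 1));&lt;br /&gt;
    return out;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    std::vector&amp;lt;std::string&amp;gt; docs = {&amp;quot;the cat sat&amp;quot;, &amp;quot;the dog sat&amp;quot;};&lt;br /&gt;
&lt;br /&gt;
    // Shuffle/sort stand-in: group the emitted values by term.&lt;br /&gt;
    std::map&amp;lt;std::string, std::vector&amp;lt;int&amp;gt; &amp;gt; grouped;&lt;br /&gt;
    for (const std::string &amp;amp;doc : docs)&lt;br /&gt;
        for (const auto &amp;amp;kv : map_doc(doc))&lt;br /&gt;
            grouped[kv.first].push_back(kv.second);&lt;br /&gt;
&lt;br /&gt;
    // Reduce: sum the counts for each term.&lt;br /&gt;
    for (const auto &amp;amp;entry : grouped) {&lt;br /&gt;
        int sum = 0;&lt;br /&gt;
        for (int c : entry.second) sum += c;&lt;br /&gt;
        std::cout &amp;lt;&amp;lt; entry.first &amp;lt;&amp;lt; &amp;quot; &amp;quot; &amp;lt;&amp;lt; sum &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;   // e.g. cat 1, the 2&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;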
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts and then combined for a final result is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and emits the result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between the nodes. The problem is calculating a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration is terminated based on some condition, such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as a key. The Reducer is responsible for recomputing the state and rewriting the node with its new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definitions of the state object and the calculateState and getMessage functions, several other use cases can be fulfilled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
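&lt;br /&gt;
The breadth-first-search specialization above can be simulated in a single process to see the pattern converge; the sketch below iterates until no distance changes, and the graph, names, and termination check are illustrative assumptions.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;algorithm&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;limits&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    const int INF = std::numeric_limits&amp;lt;int&amp;gt;::max();&lt;br /&gt;
    // Adjacency list for a simple chain of nodes: 0 to 1, 1 to 2, 2 to 3.&lt;br /&gt;
    std::vector&amp;lt;std::vector&amp;lt;int&amp;gt; &amp;gt; out_edges(4);&lt;br /&gt;
    out_edges[0].push_back(1);&lt;br /&gt;
    out_edges[1].push_back(2);&lt;br /&gt;
    out_edges[2].push_back(3);&lt;br /&gt;
    std::vector&amp;lt;int&amp;gt; state = {0, INF, INF, INF};     // distance; source is node 0&lt;br /&gt;
&lt;br /&gt;
    bool changed = true;&lt;br /&gt;
    while (changed) {                                // one MapReduce pass per loop&lt;br /&gt;
        changed = false;&lt;br /&gt;
        // Map: each reached node sends getMessage(N) = N.State + 1 to neighbors.&lt;br /&gt;
        std::vector&amp;lt;std::vector&amp;lt;int&amp;gt; &amp;gt; inbox(state.size());&lt;br /&gt;
        for (std::size_t n = 0; n &amp;lt; state.size(); ++n)&lt;br /&gt;
            if (state[n] != INF)&lt;br /&gt;
                for (int m : out_edges[n]) inbox[m].push_back(state[n] + 1);&lt;br /&gt;
        // Reduce: calculateState takes the minimum over the received messages.&lt;br /&gt;
        for (std::size_t m = 0; m &amp;lt; state.size(); ++m) {&lt;br /&gt;
            if (inbox[m].empty()) continue;&lt;br /&gt;
            int best = *std::min_element(inbox[m].begin(), inbox[m].end());&lt;br /&gt;
            if (best &amp;lt; state[m]) { state[m] = best; changed = true; }&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    for (int d : state) std::cout &amp;lt;&amp;lt; d &amp;lt;&amp;lt; &amp;quot; &amp;quot;;       // prints: 0 1 2 3&lt;br /&gt;
    std::cout &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;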
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occur by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective for distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance on both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written directly with the P-threads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which P-threads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and this difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the GPU runtime is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has attracted criticism as well. Google was awarded the patent for MapReduce, but it can be argued that this technology is similar to many others that already existed. There are programming models similar to MapReduce, such as Algorithm Skeletons (Parallelism Patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithm Skeletons are a high-level programming model for parallel and distributed computing, and their framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop; it includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93709</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93709"/>
		<updated>2015-02-14T01:35:11Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* MapReduce-MPI */ add more about function calls.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
The program counts the number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: communication between the MapReduce nodes is a major overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset fits into memory, running a fully distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited to a MapReduce application on distributed memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) is loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data is compared to each node, with the winning node being the one that most closely matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed memory machine, because synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
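&lt;br /&gt;
To make the numbered flow above concrete, the following is a minimal single-process Python sketch (not Google's code) of the same sequence: the input is split into ''M'' pieces, map output is buffered into ''R'' partitions using hash(key) mod R, and each reduce task sorts its partition by key before reducing. The function names my_map, my_reduce and run_mapreduce are illustrative only.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal single-process sketch of the execution flow described above:&lt;br /&gt;
# M input splits are mapped, partitioned by hash(key) % R, sorted, then reduced.&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def my_map(split):                       # user-defined Map: emit (word, 1) pairs&lt;br /&gt;
    for line in split:&lt;br /&gt;
        for word in line.split():&lt;br /&gt;
            yield word, 1&lt;br /&gt;
&lt;br /&gt;
def my_reduce(key, values):              # user-defined Reduce: sum the counts&lt;br /&gt;
    yield key, sum(values)&lt;br /&gt;
&lt;br /&gt;
def run_mapreduce(splits, R=3):&lt;br /&gt;
    # Map phase: each split produces pairs, buffered into R partitions.&lt;br /&gt;
    partitions = [defaultdict(list) for _ in range(R)]&lt;br /&gt;
    for split in splits:                 # one map task per split (serial here)&lt;br /&gt;
        for key, value in my_map(split):&lt;br /&gt;
            partitions[hash(key) % R][key].append(value)&lt;br /&gt;
    # Reduce phase: each partition is sorted by key, then reduced.&lt;br /&gt;
    outputs = []                         # one output list per reduce task&lt;br /&gt;
    for r in range(R):&lt;br /&gt;
        out = []&lt;br /&gt;
        for key in sorted(partitions[r]):&lt;br /&gt;
            out.extend(my_reduce(key, partitions[r][key]))&lt;br /&gt;
        outputs.append(out)&lt;br /&gt;
    return outputs&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    splits = [['the quick brown fox'], ['the lazy dog'], ['the fox']]&lt;br /&gt;
    for r, out in enumerate(run_mapreduce(splits)):&lt;br /&gt;
        print('reduce output', r, out)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;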
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
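&lt;br /&gt;
As a rough illustration (not Google's actual code), this bookkeeping could be modelled with a per-task state record plus a table of intermediate-region locations, as in the hypothetical Python sketch below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Hypothetical sketch of the master's bookkeeping described above.&lt;br /&gt;
from dataclasses import dataclass, field&lt;br /&gt;
&lt;br /&gt;
@dataclass&lt;br /&gt;
class TaskState:&lt;br /&gt;
    state: str = 'idle'      # 'idle', 'in-progress', or 'completed'&lt;br /&gt;
    worker: str = ''         # identity of the worker machine (non-idle tasks)&lt;br /&gt;
&lt;br /&gt;
@dataclass&lt;br /&gt;
class Master:&lt;br /&gt;
    map_tasks: dict = field(default_factory=dict)      # task id to TaskState&lt;br /&gt;
    reduce_tasks: dict = field(default_factory=dict)   # task id to TaskState&lt;br /&gt;
    # For each completed map task, the locations and sizes of its R regions.&lt;br /&gt;
    regions: dict = field(default_factory=dict)        # map id to list of (location, size)&lt;br /&gt;
&lt;br /&gt;
    def complete_map(self, map_id, region_info):&lt;br /&gt;
        self.map_tasks[map_id].state = 'completed'&lt;br /&gt;
        self.regions[map_id] = region_info&lt;br /&gt;
        # The real system pushes this information incrementally to workers&lt;br /&gt;
        # that have in-progress reduce tasks.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;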
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
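&lt;br /&gt;
A hedged sketch of this ping-and-reset policy follows; the timeout value and the task/ping dictionaries are illustrative, not part of the actual implementation.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Hypothetical failure handling: reset tasks assigned to workers that stop&lt;br /&gt;
# responding, following the rules described above.&lt;br /&gt;
import time&lt;br /&gt;
&lt;br /&gt;
PING_TIMEOUT = 30.0   # seconds without a ping response before a worker is failed&lt;br /&gt;
&lt;br /&gt;
def handle_failures(tasks, last_ping, now=None):&lt;br /&gt;
    # tasks: task id mapped to {'kind', 'state', 'worker'} dictionaries&lt;br /&gt;
    # last_ping: worker id mapped to the time of its last ping response&lt;br /&gt;
    now = time.time() if now is None else now&lt;br /&gt;
    failed = {w for w, t in last_ping.items() if now - t &amp;gt;= PING_TIMEOUT}&lt;br /&gt;
    for task in tasks.values():&lt;br /&gt;
        if task['worker'] not in failed:&lt;br /&gt;
            continue&lt;br /&gt;
        # Completed map tasks are reset because their output lives on the failed&lt;br /&gt;
        # machine's local disk; completed reduce output is in the global file&lt;br /&gt;
        # system and does not need to be redone.&lt;br /&gt;
        if task['kind'] == 'map' or task['state'] == 'in-progress':&lt;br /&gt;
            task['state'], task['worker'] = 'idle', ''&lt;br /&gt;
    return failed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;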
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementations of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model places bounds on how computations can be expressed. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the model. The important thing to note is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications, and Hadoop has prominent users such as Yahoo! and Facebook&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality.  The network is slow and the data plentiful; many processing frameworks therefore bring the data to the computation, whereas&lt;br /&gt;
Hadoop brings the computation to the data.  In some cases the data is so large that this is the only practical option.  Data is stored in Hadoop in the filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker decides which tasks to assign based on how busy each TaskTracker is. &lt;br /&gt;
* TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When the map phase is done, the JobTracker notifies the TaskTrackers to move to the reduce phase; the same method is followed, with a ReduceTask being forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
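&lt;br /&gt;
The same word-count job can be expressed against this flow without writing Java by using Hadoop Streaming, which pipes each split through arbitrary executables. The sketch below is only indicative; the streaming jar path and file names vary by installation.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
# wordcount.py -- Hadoop Streaming style mapper and reducer in one file.&lt;br /&gt;
# Rough, installation-dependent invocation:&lt;br /&gt;
#   hadoop jar hadoop-streaming.jar -input in -output out \&lt;br /&gt;
#       -mapper 'wordcount.py map' -reducer 'wordcount.py reduce'&lt;br /&gt;
import sys&lt;br /&gt;
&lt;br /&gt;
def run_mapper():&lt;br /&gt;
    # Emit one tab-separated (word, 1) pair per word read from stdin.&lt;br /&gt;
    for line in sys.stdin:&lt;br /&gt;
        for word in line.split():&lt;br /&gt;
            print('%s\t%d' % (word, 1))&lt;br /&gt;
&lt;br /&gt;
def run_reducer():&lt;br /&gt;
    # Input arrives sorted by key, so all counts for a word are adjacent.&lt;br /&gt;
    current, total = None, 0&lt;br /&gt;
    for line in sys.stdin:&lt;br /&gt;
        word, count = line.rstrip('\n').split('\t')&lt;br /&gt;
        if word != current:&lt;br /&gt;
            if current is not None:&lt;br /&gt;
                print('%s\t%d' % (current, total))&lt;br /&gt;
            current, total = word, 0&lt;br /&gt;
        total += int(count)&lt;br /&gt;
    if current is not None:&lt;br /&gt;
        print('%s\t%d' % (current, total))&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    run_mapper() if sys.argv[1:] == ['map'] else run_reducer()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;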
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 implementation; notably, heavy processing&lt;br /&gt;
load could make the JobTracker a large bottleneck.  To help remove this bottleneck, YARN was implemented.  YARN is an application framework that is solely responsible for&lt;br /&gt;
resource management on Hadoop clusters.  Not only can you run MapReduce jobs, you can also place other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing resources to be allocated properly across the cluster.  YARN at its simplest is the separation of the work that the JobTracker used to do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports.  However, the execution of the job changes significantly.  YARN does work in units called containers.&lt;br /&gt;
A container represents a unit of work that can be run on the cluster.  Upon job submission, the ResourceManager allocates a container for the ApplicationMaster.  This ApplicationMaster &lt;br /&gt;
runs on a DataNode in the cluster.  To launch it, the ResourceManager requests that a NodeManager start the ApplicationMaster in that container.  The ApplicationMaster then &lt;br /&gt;
determines, based on the input splits, the number of map tasks to create.  Once this is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks.  The ApplicationMaster then asks the NodeManagers on the assigned nodes to &lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce.  Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until the code has been distributed to all the nodes.  Spark also adds a number of capabilities to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it.  This allows data to be read into memory on a cluster once, with successive iterations of an algorithm running over the same in-memory data instead of reading it from disk repeatedly.&lt;br /&gt;
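&lt;br /&gt;
For comparison with the word-count example used elsewhere in this article, a minimal PySpark sketch of the same computation looks roughly like the following (assuming a local Spark installation; 'input.txt' is a placeholder path).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal PySpark word count: map and reduce expressed over an in-memory RDD.&lt;br /&gt;
from pyspark import SparkContext&lt;br /&gt;
&lt;br /&gt;
sc = SparkContext('local[*]', 'wordcount-sketch')&lt;br /&gt;
&lt;br /&gt;
counts = (sc.textFile('input.txt')                  # placeholder input path&lt;br /&gt;
            .flatMap(lambda line: line.split())     # emit individual words&lt;br /&gt;
            .map(lambda word: (word, 1))            # pair each word with a 1&lt;br /&gt;
            .reduceByKey(lambda a, b: a + b))       # sum the counts per word&lt;br /&gt;
&lt;br /&gt;
print(counts.take(10))                              # a few (word, count) pairs&lt;br /&gt;
sc.stop()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;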
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop.  Tez is a directed-acyclic-graph (DAG) engine.  Based on the Microsoft Dryad paper, the DAG execution engine allows an application to express its tasks as nodes in a graph.  Like Spark, it offers gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop.  Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
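&lt;br /&gt;
Phoenix itself is a C/C++ runtime, but the splitter/Map/Partition/Reduce flow just described can be illustrated with a much simplified shared-memory analogue in Python; the thread count, the splitter and the hash-based grouping below are illustrative choices, not the Phoenix API.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Simplified shared-memory analogue of the Phoenix flow: a splitter hands each&lt;br /&gt;
# worker a chunk, Map emits into per-worker keyed buffers, the buffers are&lt;br /&gt;
# merged so all values of a key form one unit, and Reduce runs per key.&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
from concurrent.futures import ThreadPoolExecutor&lt;br /&gt;
&lt;br /&gt;
def splitter(data, n):&lt;br /&gt;
    step = max(1, len(data) // n)&lt;br /&gt;
    return [data[i:i + step] for i in range(0, len(data), step)]&lt;br /&gt;
&lt;br /&gt;
def map_task(chunk):&lt;br /&gt;
    buf = defaultdict(list)              # per-worker intermediate buffer&lt;br /&gt;
    for line in chunk:&lt;br /&gt;
        for word in line.split():&lt;br /&gt;
            buf[word].append(1)          # emit an intermediate (word, 1) pair&lt;br /&gt;
    return buf&lt;br /&gt;
&lt;br /&gt;
def reduce_task(key, values):&lt;br /&gt;
    return key, sum(values)&lt;br /&gt;
&lt;br /&gt;
def run(data, workers=4):&lt;br /&gt;
    with ThreadPoolExecutor(max_workers=workers) as pool:&lt;br /&gt;
        buffers = list(pool.map(map_task, splitter(data, workers)))&lt;br /&gt;
        grouped = defaultdict(list)      # all values of a key go to one unit&lt;br /&gt;
        for buf in buffers:&lt;br /&gt;
            for key, values in buf.items():&lt;br /&gt;
                grouped[key].extend(values)&lt;br /&gt;
        results = list(pool.map(lambda kv: reduce_task(*kv), grouped.items()))&lt;br /&gt;
    return sorted(results)               # final output merged and sorted by key&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    print(run(['a rose is a rose', 'is a rose']))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;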
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs; each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Inefficient key-value storage: because of the shared-memory design, containers must provide fast lookup and retrieval over a potentially large data set while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Exposed chunking: Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated because of the extra code needed to deal with chunks. Second, if the user leverages the exposed chunks to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt;.  Unlike other implementations of MapReduce, which are mostly in Java, MapReduce-MPI is implemented in C++.  The major downfall of this implementation is a lack of fault tolerance: the implementation's MPI library does not detect machines that are no longer part of the cluster very well.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The map function is the same in this implementation as in others.  The collate function is the shuffle and sort of data that occurs after all the keys have been output by the mappers, and the reduce function is the same implementation that one would expect in any standard MapReduce implementation.&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the functions process keys and values as in standard MapReduce implementations.  The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language built for it called OINK.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to the architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead on the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
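&lt;br /&gt;
This two-step design can be illustrated without a GPU: a first pass only counts each task's output size, a prefix sum over those counts gives every task a private write offset, and a second pass then writes its results without atomic operations. The Python sketch below is purely conceptual; the real Mars runtime performs these passes with GPU threads.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Conceptual illustration of the two-pass, lock-free output scheme:&lt;br /&gt;
# pass 1 counts each task's output size, a prefix sum assigns write offsets,&lt;br /&gt;
# pass 2 writes results into a preallocated array with no atomics needed.&lt;br /&gt;
from itertools import accumulate&lt;br /&gt;
&lt;br /&gt;
def map_count(record):&lt;br /&gt;
    return len(record.split())               # MAP_COUNT: number of pairs emitted&lt;br /&gt;
&lt;br /&gt;
def map_emit(record, out, offset):&lt;br /&gt;
    for i, word in enumerate(record.split()):&lt;br /&gt;
        out[offset + i] = (word, 1)          # MAP: write at a private offset&lt;br /&gt;
&lt;br /&gt;
def run(records):&lt;br /&gt;
    counts = [map_count(r) for r in records]            # pass 1&lt;br /&gt;
    offsets = [0] + list(accumulate(counts))[:-1]       # exclusive prefix sum&lt;br /&gt;
    out = [None] * sum(counts)                          # preallocated output&lt;br /&gt;
    for record, offset in zip(records, offsets):        # pass 2 (parallel on a GPU)&lt;br /&gt;
        map_emit(record, out, offset)&lt;br /&gt;
    return out&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    print(run(['to be or', 'not to be']))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;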
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files or web pages, for example. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit a &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to have the mapper count the terms within its own document first.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts can be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, whose partial results are then combined into a final result, is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, performs the computation, and emits its result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between the nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a measure of density, and so on. Conceptually, the MapReduce jobs are performed iteratively. On each iteration, a node sends messages to its neighbors, and each neighbor then updates its state based on the messages it receives. The iteration is terminated on some condition, such as a fixed number of iterations or only minor changes in state between iterations. The Mapper is responsible for emitting a message for each of a node's outgoing relations, using the adjacent node's ID as the key. The Reducer is responsible for recomputing the state and re-emitting the node with its new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be covered by this pattern, including availability propagation through a category tree and breadth-first search. For instance, the definitions below implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
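&lt;br /&gt;
A small runnable sketch of this breadth-first search, iterating the map and reduce steps until the distances stop changing, is shown below; the adjacency-list graph and the driver loop are illustrative.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Iterative BFS over an adjacency list using the map/reduce pattern above:&lt;br /&gt;
# map emits the node itself plus distance-plus-one messages to its neighbors,&lt;br /&gt;
# reduce keeps each node's minimum known distance.&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
INF = float('inf')&lt;br /&gt;
&lt;br /&gt;
def bfs(graph, source):&lt;br /&gt;
    dist = {n: (0 if n == source else INF) for n in graph}&lt;br /&gt;
    while True:&lt;br /&gt;
        # Map: emit the current state and a message per outgoing relation.&lt;br /&gt;
        emitted = defaultdict(list)&lt;br /&gt;
        for node, d in dist.items():&lt;br /&gt;
            emitted[node].append(('state', d))&lt;br /&gt;
            for m in graph[node]:&lt;br /&gt;
                emitted[m].append(('message', d + 1))&lt;br /&gt;
        # Reduce: the new state is the minimum over the state and all messages.&lt;br /&gt;
        new_dist = {node: min(v for _, v in values)&lt;br /&gt;
                    for node, values in emitted.items()}&lt;br /&gt;
        if new_dist == dist:             # termination condition: no change&lt;br /&gt;
            return new_dist&lt;br /&gt;
        dist = new_dist&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    graph = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}&lt;br /&gt;
    print(bfs(graph, 'a'))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;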
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
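&lt;br /&gt;
For instance, the inverted-index example from the last bullet can be sketched as follows; the document IDs and contents are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Inverted index: map emits (word, doc_id) pairs, reduce sorts the IDs per word.&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def map_doc(doc_id, text):&lt;br /&gt;
    for word in text.split():&lt;br /&gt;
        yield word, doc_id&lt;br /&gt;
&lt;br /&gt;
def build_index(docs):&lt;br /&gt;
    grouped = defaultdict(set)&lt;br /&gt;
    for doc_id, text in docs.items():&lt;br /&gt;
        for word, d in map_doc(doc_id, text):&lt;br /&gt;
            grouped[word].add(d)&lt;br /&gt;
    # Reduce: emit (word, sorted list of document IDs)&lt;br /&gt;
    return {word: sorted(ids) for word, ids in grouped.items()}&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    docs = {1: 'map reduce on clusters', 2: 'reduce memory pressure'}&lt;br /&gt;
    print(build_index(docs))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;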
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance on both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite its runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written directly with the Pthreads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has attracted criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many that already existed. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level programming model for parallel and distributed computing, and skeleton framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide-area network (WAN) setting. The Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93706</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93706"/>
		<updated>2015-02-14T01:29:04Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* MapReduce-MPI */ more formatting exmples etc.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: the communication between the MapReduce nodes is a major overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited to a MapReduce application on distributed memory machines is the Self-Organizing Map (SOM)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) is loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data is compared to each node, with the winning node being the one that best matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed memory machine: the synchronization overheads are best avoided by segmenting the SOM into multiple regions, so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementations of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model places bounds on how computations can be expressed. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the model. The important thing to note is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications, and Hadoop has prominent users such as Yahoo! and Facebook&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality.  The network is slow and the data plentiful; many processing frameworks therefore bring the data to the computation, whereas&lt;br /&gt;
Hadoop brings the computation to the data.  In some cases the data is so large that this is the only practical option.  Data is stored in Hadoop in the filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker decides which tasks to assign based on how busy each TaskTracker is. &lt;br /&gt;
* TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTask completes (all splits are done), the TaskTracker will notify the JobTracker which keeps track of the overall progress of job.&lt;br /&gt;
* When done, the JobTracker notifies TaskTracker to jump to reduce phase. This again follows same method where reduce task is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
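&lt;br /&gt;
The user-facing side of this flow is just the map and reduce code that each forked MapTask and ReduceTask invokes. The sketch below is modeled on the classic word-count example for Hadoop's C++ Pipes API; the HadoopPipes and HadoopUtils names follow that example but should be checked against the Hadoop release in use, so treat it as an illustrative sketch rather than a definitive listing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
#include &amp;quot;hadoop/Pipes.hh&amp;quot;            // Pipes task framework&lt;br /&gt;
#include &amp;quot;hadoop/TemplateFactory.hh&amp;quot;&lt;br /&gt;
#include &amp;quot;hadoop/StringUtils.hh&amp;quot;&lt;br /&gt;
&lt;br /&gt;
// Mapper: emit a &amp;lt;word, &amp;quot;1&amp;quot;&amp;gt; pair for every word in the input line.&lt;br /&gt;
class WordCountMap : public HadoopPipes::Mapper {&lt;br /&gt;
public:&lt;br /&gt;
  WordCountMap(HadoopPipes::TaskContext&amp;amp;) {}&lt;br /&gt;
  void map(HadoopPipes::MapContext&amp;amp; context) {&lt;br /&gt;
    std::vector&amp;lt;std::string&amp;gt; words =&lt;br /&gt;
        HadoopUtils::splitString(context.getInputValue(), &amp;quot; &amp;quot;);&lt;br /&gt;
    for (size_t i = 0; i &amp;lt; words.size(); ++i)&lt;br /&gt;
      context.emit(words[i], &amp;quot;1&amp;quot;);&lt;br /&gt;
  }&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
// Reducer: sum the counts received for one word.&lt;br /&gt;
class WordCountReduce : public HadoopPipes::Reducer {&lt;br /&gt;
public:&lt;br /&gt;
  WordCountReduce(HadoopPipes::TaskContext&amp;amp;) {}&lt;br /&gt;
  void reduce(HadoopPipes::ReduceContext&amp;amp; context) {&lt;br /&gt;
    int sum = 0;&lt;br /&gt;
    while (context.nextValue())&lt;br /&gt;
      sum += HadoopUtils::toInt(context.getInputValue());&lt;br /&gt;
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));&lt;br /&gt;
  }&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
  // The TaskTracker launches this binary for every map and reduce task.&lt;br /&gt;
  return HadoopPipes::runTask(&lt;br /&gt;
      HadoopPipes::TemplateFactory&amp;lt;WordCountMap, WordCountReduce&amp;gt;());&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;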
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points: under heavy processing load the JobTracker became a significant bottleneck. To help remove this bottleneck, YARN was introduced. YARN is an application framework that is solely responsible for resource management on Hadoop clusters. Not only can MapReduce jobs run under it, but other in-cluster frameworks can also be placed under YARN resource management, allowing resources to be allocated properly across the cluster. At its simplest, YARN separates the work the JobTracker used to do into two new processes: the resource manager (ResourceManager) and the per-job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change their imports; the execution of a job, however, changes significantly. YARN does its work in units called containers, each representing a unit of work that can run on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a DataNode in the cluster: the ResourceManager asks a NodeManager to launch the ApplicationMaster in that container. The ApplicationMaster then determines, based on the input splits, the number of map tasks to create. Once this is known, the ApplicationMaster requests container resources from the ResourceManager. Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster asks the NodeManagers on the assigned nodes to start them.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that removes some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until its code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in memory nature of Spark there are a good number of machine learning frameworks that are being built on top of Spark.  This allows data to be read into memory on a cluster and iterations of an algorithm run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a DAG (directed-acyclic-graph) engine: based on the Microsoft Dryad paper, its execution engine lets an application express each task as a node in a graph. Like Spark, it offers gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt at a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, consisting of two sets of functions: &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
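&lt;br /&gt;
A rough sketch of how application code might fill in ''scheduler_args_t'' and hand control to the Phoenix scheduler is shown below. The field names and the map_reduce_scheduler() entry point follow the linked MapReduceScheduler.h as best understood, but they differ between Phoenix releases, so this is only an illustration of the calling convention and not a drop-in example.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;cstring&amp;gt;&lt;br /&gt;
#include &amp;quot;MapReduceScheduler.h&amp;quot;   // Phoenix header linked above&lt;br /&gt;
&lt;br /&gt;
// User-supplied callbacks (bodies omitted; exact signatures are release-specific).&lt;br /&gt;
void my_splitter(void *data, int req_units, map_args_t *out); // carve out one Map unit&lt;br /&gt;
void my_map(map_args_t *args);                   // calls emit_intermediate() per pair&lt;br /&gt;
void my_reduce(void *key, void **vals, int len); // calls emit() with the merged pair&lt;br /&gt;
int  my_key_cmp(const void *k1, const void *k2); // ordering used for the final sort&lt;br /&gt;
&lt;br /&gt;
int run_word_count(char *buf, int nbytes) {&lt;br /&gt;
  scheduler_args_t args;&lt;br /&gt;
  std::memset(&amp;amp;args, 0, sizeof(args));&lt;br /&gt;
  args.task_data = buf;          // input buffer shared with the run-time (no copy)&lt;br /&gt;
  args.data_size = nbytes;&lt;br /&gt;
  args.splitter  = my_splitter;&lt;br /&gt;
  args.map       = my_map;&lt;br /&gt;
  args.reduce    = my_reduce;&lt;br /&gt;
  args.key_cmp   = my_key_cmp;&lt;br /&gt;
  // ... output buffer and tuning fields omitted ...&lt;br /&gt;
  return map_reduce_scheduler(&amp;amp;args); // returns once the merged, sorted output is ready&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;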
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointer manipulation is done instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce- Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, allowing a substantial number of applications to scale well.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Key-value storage is inefficient in shared memory: the containers must provide fast lookup and retrieval over a potentially large data set while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored; further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of standardized MPI.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Message_Passing_Interface&amp;lt;/ref&amp;gt; Unlike most other implementations of MapReduce, which are written in Java, MapReduce-MPI is implemented in C++. The major downside of this implementation is its lack of fault tolerance: the underlying MPI library does not reliably detect machines that are no longer part of the cluster.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Example MR-MPI code''' &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Program.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                   // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
From this interface one writes MapReduce code in which the user-supplied functions process keys and values as in standard MapReduce implementations; a sketch of such callbacks is shown below. The framework also allows MapReduce-MPI jobs to be written in C, Python, and a scripting language built on top of it called OINK.&lt;br /&gt;
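&lt;br /&gt;
The mymap and myreduce callbacks referenced in the skeleton above are plain C++ functions that receive a KeyValue object to add pairs to. The sketch below follows the word-frequency example in the MR-MPI documentation; read_words() is a hypothetical tokenizer, and the callback signatures should be confirmed against the library version in use.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;cstddef&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
#include &amp;quot;mapreduce.h&amp;quot;&lt;br /&gt;
#include &amp;quot;keyvalue.h&amp;quot;&lt;br /&gt;
using namespace MAPREDUCE_NS;&lt;br /&gt;
&lt;br /&gt;
std::vector&amp;lt;std::string&amp;gt; read_words(int itask);  // hypothetical: tokenize file #itask&lt;br /&gt;
&lt;br /&gt;
// Map callback: called once per input file; emits one &amp;lt;word, NULL&amp;gt; pair per word.&lt;br /&gt;
void mymap(int itask, KeyValue *kv, void *ptr) {&lt;br /&gt;
  std::vector&amp;lt;std::string&amp;gt; words = read_words(itask);&lt;br /&gt;
  for (size_t i = 0; i &amp;lt; words.size(); i++)&lt;br /&gt;
    kv-&amp;gt;add((char *) words[i].c_str(), words[i].size() + 1, NULL, 0);&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Reduce callback: called once per unique key after collate(); nvalues is the count.&lt;br /&gt;
void myreduce(char *key, int keybytes, char *multivalue,&lt;br /&gt;
              int nvalues, int *valuebytes, KeyValue *kv, void *ptr) {&lt;br /&gt;
  kv-&amp;gt;add(key, keybytes, (char *) &amp;amp;nvalues, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;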
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead that preserves this massive thread parallelism. The scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
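&lt;br /&gt;
For example, a word-count Map stage written against the APIs above might look like the following sketch. It shows the two-step pattern: MAP_COUNT only reports output sizes through EMIT_INTERMEDIATE_COUNT so that the runtime can pre-allocate write locations, and MAP then writes the actual pairs with EMIT_INTERMEDIATE. The next_word() tokenizer is hypothetical (Mars leaves input parsing to the user), and GPU-specific qualifiers are omitted for brevity.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Hypothetical helper: returns the next word in [*cursor, end) and its length,&lt;br /&gt;
// advancing *cursor; returns nullptr when the record is exhausted.&lt;br /&gt;
char *next_word(char **cursor, char *end, int *wordLen);&lt;br /&gt;
&lt;br /&gt;
// Pass 1: report the size of every pair that MAP will emit for this record.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize) {&lt;br /&gt;
  char *cur = (char *) val, *end = cur + valSize, *w;&lt;br /&gt;
  int len;&lt;br /&gt;
  while ((w = next_word(&amp;amp;cur, end, &amp;amp;len)) != nullptr)&lt;br /&gt;
    EMIT_INTERMEDIATE_COUNT(len + 1, sizeof(int));    // size of one &amp;lt;word, 1&amp;gt; pair&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Pass 2: emit the &amp;lt;word, 1&amp;gt; pairs into the buffers sized by pass 1.&lt;br /&gt;
void MAP(void *key, void *val, int keySize, int valSize) {&lt;br /&gt;
  char *cur = (char *) val, *end = cur + valSize, *w;&lt;br /&gt;
  int len, one = 1;&lt;br /&gt;
  while ((w = next_word(&amp;amp;cur, end, &amp;amp;len)) != nullptr)&lt;br /&gt;
    EMIT_INTERMEDIATE(w, &amp;amp;one, len + 1, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;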
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files or web pages, for example. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to have the mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner, so that counts can be accumulated across more than one document before reaching the reducer.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts whose results are then combined into a final answer is a standard Map-Reduce problem. The problem is split into a set of specifications, which are stored as the input data for the mappers. Each mapper takes a specification, executes the computation, and emits the result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors; this state can be the distance to other nodes, a measure of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way: on each iteration, a node sends messages to its neighbors, and each neighbor updates its state based on the messages it receives. The iteration is terminated by some condition, such as a fixed number of iterations or a negligible change in state. The Mapper is responsible for emitting a message for each node, using the adjacent node's ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with its new state, based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definitions of the state object and the calculateState and getMessage functions, several other use cases can be handled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
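&lt;br /&gt;
As an illustration of the last item above, the inverted index needs only a Map function that emits one &amp;lt;word, document ID&amp;gt; pair per occurrence and a Reduce function that sorts and de-duplicates the document IDs for each word. A minimal, framework-independent C++ sketch follows; emit() is a stand-in for whatever output call the chosen MapReduce implementation provides.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;algorithm&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Stand-in for the framework's emit call (an assumption, not a real API).&lt;br /&gt;
void emit(const std::string &amp;amp;key, const std::string &amp;amp;value);&lt;br /&gt;
&lt;br /&gt;
// Map: one &amp;lt;word, docID&amp;gt; pair per word occurrence in the document.&lt;br /&gt;
void map_inverted_index(const std::string &amp;amp;docId, const std::string &amp;amp;text) {&lt;br /&gt;
  std::istringstream in(text);&lt;br /&gt;
  std::string word;&lt;br /&gt;
  while (in &amp;gt;&amp;gt; word)&lt;br /&gt;
    emit(word, docId);&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Reduce: sort and de-duplicate the IDs for one word, then emit &amp;lt;word, list(docID)&amp;gt;.&lt;br /&gt;
void reduce_inverted_index(const std::string &amp;amp;word, std::vector&amp;lt;std::string&amp;gt; docIds) {&lt;br /&gt;
  std::sort(docIds.begin(), docIds.end());&lt;br /&gt;
  docIds.erase(std::unique(docIds.begin(), docIds.end()), docIds.end());&lt;br /&gt;
  std::string list;&lt;br /&gt;
  for (size_t i = 0; i &amp;lt; docIds.size(); ++i)&lt;br /&gt;
    list += (i ? &amp;quot;,&amp;quot; : &amp;quot;&amp;quot;) + docIds[i];&lt;br /&gt;
  emit(word, list);&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;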
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance on both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written directly with the P-threads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which P-threads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty only grows for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce framework can be used for these applications. With such a framework, the developer writes code using the simple and familiar MapReduce interfaces, while the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many that already existed. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level parallel programming model for parallel and distributed computing, and libraries based on them are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is notable for its ability to operate in a wide-area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization features including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93703</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93703"/>
		<updated>2015-02-14T01:21:01Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* MapReduce-MPI */ more formatting&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: communication between the MapReduce nodes is a major overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset fits into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One problem particularly suited to a MapReduce application on distributed-memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. A Self-Organizing Map is &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography, and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data vectors are loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data is compared to each node, and the winning node is the one that most closely matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed-memory machine, because the synchronization overheads are best avoided by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
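&lt;br /&gt;
A minimal sketch of the partitioning function used in step 4 (e.g., hash(key) mod R, as described above) could look like the following; std::hash simply stands in for whatever hash function a real implementation uses.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;functional&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Map an intermediate key to one of the R reduce partitions.&lt;br /&gt;
int partition(const std::string &amp;amp;key, int R) {&lt;br /&gt;
  return static_cast&amp;lt;int&amp;gt;(std::hash&amp;lt;std::string&amp;gt;{}(key) % R);&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;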
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations.&lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# The Map-Reduce implementation scales to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model constrains how a computation can be structured.&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Network_bandwidth/ Network bandwidth] is scarce, so a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own, open-source implementation of the same model: Hadoop.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt; The framework transparently provides both reliability and data motion to applications, and it has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality: the network is slow and the data plentiful. While many processing frameworks bring the data to the computation, Hadoop brings the computation to the data. In some cases the data is so large that this is the only practical option. Data is stored in Hadoop in a filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* The client program uploads files to the Hadoop Distributed File System (HDFS) and notifies the JobTracker, which in turn returns a Job ID to the client.&lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers.&lt;br /&gt;
* The JobTracker assigns tasks based on how busy each TaskTracker is.&lt;br /&gt;
* A TaskTracker forks a MapTask, which extracts its input data and invokes the user-provided &amp;quot;map&amp;quot; function, filling a buffer with key/value pairs until it is full.&lt;br /&gt;
* The buffer is eventually flushed into two files.&lt;br /&gt;
* After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When the map phase is done, the JobTracker notifies the TaskTrackers to move on to the reduce phase, which follows the same scheme with a forked ReduceTask.&lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temporary file is atomically renamed to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points: under heavy processing load the JobTracker became a significant bottleneck. To help remove this bottleneck, YARN was introduced. YARN is an application framework that is solely responsible for resource management on Hadoop clusters. Not only can MapReduce jobs run under it, but other in-cluster frameworks can also be placed under YARN resource management, allowing resources to be allocated properly across the cluster. At its simplest, YARN separates the work the JobTracker used to do into two new processes: the resource manager (ResourceManager) and the per-job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change their imports; the execution of a job, however, changes significantly. YARN does its work in units called containers, each representing a unit of work that can run on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a DataNode in the cluster: the ResourceManager asks a NodeManager to launch the ApplicationMaster in that container. The ApplicationMaster then determines, based on the input splits, the number of map tasks to create. Once this is known, the ApplicationMaster requests container resources from the ResourceManager. Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster asks the NodeManagers on the assigned nodes to start them.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that removes some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until its code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in memory nature of Spark there are a good number of machine learning frameworks that are being built on top of Spark.  This allows data to be read into memory on a cluster and iterations of an algorithm run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a DAG (directed-acyclic-graph) engine: based on the Microsoft Dryad paper, its execution engine lets an application express each task as a node in a graph. Like Spark, it offers gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt at a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, consisting of two sets of functions: &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
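&lt;br /&gt;
A user-supplied ''Partition'' function typically just hashes the key modulo the number of Reduce tasks, so that every pair with the same key lands in the same unit. The minimal sketch below captures the idea; the signature and the byte-wise hash are illustrative assumptions, not the exact Phoenix prototype.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Illustrative partition: pairs with equal keys map to the same Reduce task.&lt;br /&gt;
int default_partition(int num_reduce_tasks, void *key, int key_size)&lt;br /&gt;
{&lt;br /&gt;
    unsigned int h = 5381;                           // simple byte-wise hash&lt;br /&gt;
    unsigned char *bytes = (unsigned char *) key;&lt;br /&gt;
    for (int i = 0; i &amp;lt; key_size; i++)&lt;br /&gt;
        h = h * 33 + bytes[i];&lt;br /&gt;
    return (int) (h % (unsigned int) num_reduce_tasks);&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;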
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, allowing a substantial number of applications to scale well.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Due to shared memory, key-value storage is inefficient, since containers must provide fast lookup and retrieval over a potentially large data set while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiner: on shared-memory (SMP) machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored; further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user relies on the exposed chunk size to improve performance, the framework can no longer freely adjust that size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of the standardized MPI.  Unlike other implementations of MapReduce, MapReduce-MPI is implemented in C++.  The major drawback of this implementation is its lack of fault tolerance: the implementation's MPI library does not reliably detect machines that are no longer part of the cluster.&amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/doc/Background.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Example MR-MPI code&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                  // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
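&lt;br /&gt;
The ''mymap'' and ''myreduce'' callbacks referenced above are ordinary user functions. The sketch below shows roughly what they could look like for a word-frequency count; the callback signatures, the header path, and the KeyValue::add call follow the library's documented shape only approximately, so treat them as assumptions and check the MR-MPI documentation for the exact variants.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Approximate MR-MPI callback shapes (verify against the library docs).&lt;br /&gt;
#include &amp;lt;cstdio&amp;gt;&lt;br /&gt;
#include &amp;lt;cstring&amp;gt;&lt;br /&gt;
#include &amp;quot;keyvalue.h&amp;quot;   // MR-MPI header; exact path is an assumption&lt;br /&gt;
&lt;br /&gt;
void mymap(int itask, KeyValue *kv, void *ptr)&lt;br /&gt;
{&lt;br /&gt;
    // Read input file number itask (I/O omitted), then for every word w found:&lt;br /&gt;
    //   kv-&amp;gt;add(w, strlen(w) + 1, (char *) &amp;quot;1&amp;quot;, 2);&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
void myreduce(char *key, int keybytes, char *multivalue,&lt;br /&gt;
              int nvalues, int *valuebytes, KeyValue *kv, void *ptr)&lt;br /&gt;
{&lt;br /&gt;
    // All values for one key arrive together; emit the pair &amp;lt;word, count&amp;gt;.&lt;br /&gt;
    char count[16];&lt;br /&gt;
    sprintf(count, &amp;quot;%d&amp;quot;, nvalues);&lt;br /&gt;
    kv-&amp;gt;add(key, keybytes, count, (int) strlen(count) + 1);&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;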
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead on the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
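&lt;br /&gt;
As a hedged illustration of that two-step design, the sketch below counts key occurrences using only the eight APIs listed above: each *_COUNT function first declares the sizes of the pairs it will produce so the runtime can reserve output space without atomics, and the matching function then writes the pairs into that reserved space.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Sketch: MAP emits one &amp;lt;key, 1&amp;gt; pair per input pair; REDUCE emits &amp;lt;key, count&amp;gt;.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize)&lt;br /&gt;
{&lt;br /&gt;
    EMIT_INTERMEDIATE_COUNT(keySize, sizeof(int));       // reserve output space first&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
void MAP(void *key, void *val, int keySize, int valSize)&lt;br /&gt;
{&lt;br /&gt;
    int one = 1;&lt;br /&gt;
    EMIT_INTERMEDIATE(key, &amp;amp;one, keySize, sizeof(int));  // write into the reserved slot&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount)&lt;br /&gt;
{&lt;br /&gt;
    EMIT_COUNT(keySize, sizeof(int));                    // one output pair per key&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
void REDUCE(void *key, void *vals, int keySize, int valCount)&lt;br /&gt;
{&lt;br /&gt;
    int total = valCount;            // valCount values were emitted for this key&lt;br /&gt;
    EMIT(key, &amp;amp;total, keySize, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;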
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files or HTTP pages, for example. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to reduce this overhead is to have each mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
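&lt;br /&gt;
A minimal C++ rendering of this in-mapper aggregation is sketched below; the document type and the way pairs are emitted (returned as a vector) are assumptions made only for the illustration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;unordered_map&amp;gt;&lt;br /&gt;
#include &amp;lt;utility&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Sketch: count terms locally per document, then emit one pair per distinct&lt;br /&gt;
// term instead of one pair per occurrence.&lt;br /&gt;
std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt; &amp;gt; map_document(const std::string &amp;amp;doc)&lt;br /&gt;
{&lt;br /&gt;
    std::unordered_map&amp;lt;std::string, int&amp;gt; counts;&lt;br /&gt;
    std::istringstream words(doc);&lt;br /&gt;
    std::string term;&lt;br /&gt;
    while (words &amp;gt;&amp;gt; term)&lt;br /&gt;
        counts[term] += 1;                        // local (in-mapper) aggregation&lt;br /&gt;
    return std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt; &amp;gt;(counts.begin(), counts.end());&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;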
&lt;br /&gt;
To take this idea further, a combiner can be used so that counts are accumulated across more than one document handled by the same node.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts and then combined together for a final result is a standard Map-Reduce problem. The problem is split into a set of specifications and specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation and then emits the results.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
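&lt;br /&gt;
The same divide-compute-combine shape can be written directly in standard C++. The sketch below is only an analogy on a single machine: calculate() is a placeholder assumption for the per-specification computation, each specification is fanned out to an asynchronous task (the &amp;quot;map&amp;quot;), and the partial results are summed (the &amp;quot;reduce&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;future&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Placeholder for the expensive per-specification computation.&lt;br /&gt;
double calculate(int spec) { return static_cast&amp;lt;double&amp;gt;(spec) * spec; }&lt;br /&gt;
&lt;br /&gt;
double run(const std::vector&amp;lt;int&amp;gt; &amp;amp;specs)&lt;br /&gt;
{&lt;br /&gt;
    std::vector&amp;lt;std::future&amp;lt;double&amp;gt; &amp;gt; partials;&lt;br /&gt;
    for (int spec : specs)                         // &amp;quot;map&amp;quot;: one task per specification&lt;br /&gt;
        partials.push_back(std::async(std::launch::async, calculate, spec));&lt;br /&gt;
&lt;br /&gt;
    double sum = 0.0;&lt;br /&gt;
    for (std::future&amp;lt;double&amp;gt; &amp;amp;r : partials)       // &amp;quot;reduce&amp;quot;: combine the results&lt;br /&gt;
        sum += r.get();&lt;br /&gt;
    return sum;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;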
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors; this state can be the distance to other nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends messages to its neighbors, and each neighbor then updates its state based on the messages it receives. The iteration is terminated by some condition such as a fixed number of iterations or negligible change in state. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with the new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and the calculateState and getMessage functions, several other use cases can be handled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer can provide a simple, functional expression of the algorithm and leave parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is comparable to that of parallel code written with the P-threads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which P-threads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With a GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has also drawn criticism. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many pre-existing ones. Programming models similar to MapReduce include Algorithm Skeletons (Parallelism Patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithm Skeletons are a high-level programming model for parallel and distributed computing, and skeleton libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users (with over 180 analytic functions), and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93701</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93701"/>
		<updated>2015-02-14T01:18:49Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Current MapReduce Implementations */ start on MPI section&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Parallel Processing meets MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A major disadvantage of popular MapReduce implementations is the distributed file system: communication between the MapReduce nodes is a major overhead. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset fits into memory, running a fully distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Distributed Memory Machines ===&lt;br /&gt;
One particular problem suited to a MapReduce application on distributed memory machines is Self-Organizing Maps (SOMs)&amp;lt;ref&amp;gt;http://www.hicomb.org/papers/HICOMB2011-01.pdf&amp;lt;/ref&amp;gt;. Self-Organizing Maps are &amp;quot;a type of artificial neural network trained using unsupervised learning.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Self-organizing_map&amp;lt;/ref&amp;gt;&amp;quot; SOMs are used in meteorology, oceanography, and bioinformatics. Processing in a SOM occurs in two steps: learning and mapping. In the learning phase, data (vectors) is loaded into the SOM to teach it to respond similarly to certain input patterns. During mapping, the input data is compared to each node, with the winning node being the one that best matches the input vector. SOMs have three major synchronization points&amp;lt;ref&amp;gt;http://www.ifs.tuwien.ac.at/ifs/research/pub_html/tom_hpcn2000/tom_hpcn2000.html&amp;lt;/ref&amp;gt; that are well suited to the MapReduce structure on a distributed-memory machine; these synchronization overheads are best reduced by segmenting the SOM into multiple regions so that memory usage can be spread effectively over multiple nodes.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== MapReduce for Clustered Systems==&lt;br /&gt;
&lt;br /&gt;
=== Google's MapReduce ===&lt;br /&gt;
&lt;br /&gt;
==== Execution Overview ====&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names specified by the user). Typically, users do not need to combine these ''R'' output files into one file; they often pass them as input to another MapReduce call, or use them from another distributed application that can deal with input partitioned into multiple files.&lt;br /&gt;
&lt;br /&gt;
==== Data Structures: Master ====&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
==== Fault Tolerance ====&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# The implementation of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on how applications can be expressed within the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system must be targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same ideas. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality.  The idea is that the network is slow and data is plentiful: many processing frameworks bring the data to the computation, whereas&lt;br /&gt;
Hadoop brings the computation to the data.  In some cases the data is so large that this is the only practical option.  Data is stored in Hadoop in a filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* JobTracker determines appropriate jobs based on how busy the TaskTracker is. &lt;br /&gt;
* TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This follows the same method, where a ReduceTask is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product did expose some pain points in the MRV1 implementation.  Notably, heavy processing&lt;br /&gt;
load would cause the JobTracker to become a large bottleneck.  In order to remove this bottleneck, YARN was introduced.  YARN is an application framework that solely handles&lt;br /&gt;
resource management for Hadoop clusters.  Not only can you run MapReduce jobs, but you can also place other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing you to properly allocate resources across your cluster.  YARN at its simplest is the separation of the work that the JobTracker used to do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports; however, the execution of a job changes significantly.  YARN does its work in units called containers.&lt;br /&gt;
Containers represent a unit of work that can be done on the cluster.  Upon job submission, the ResourceManager allocates a container for the ApplicationMaster.  This ApplicationMaster &lt;br /&gt;
runs on a DataNode in the cluster; to start it, the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container.  The ApplicationMaster then &lt;br /&gt;
determines, based on the input splits, the number of map tasks to create.  Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks.  The ApplicationMaster then asks the NodeManagers on the assigned nodes to  &lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce.  Spark takes greater advantage of available memory on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until code has been distributed to all the nodes.  Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it.  This allows data to be read into memory on a cluster once, with iterations of an algorithm running over the same in-memory data instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop.  Tez is a DAG (directed acyclic graph) engine.  Based on the Microsoft Dryad paper, the DAG execution engine allows an application's tasks to be expressed as nodes in a graph.  Like Spark, it offers gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop.  Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory Systems ==&lt;br /&gt;
&lt;br /&gt;
=== Current Map Reduce Implementations ===&lt;br /&gt;
# [http://pdos.csail.mit.edu/metis/ Metis]&lt;br /&gt;
# [http://csl.stanford.edu/~christos/sw/phoenix/ Phoenix++]&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, organized into two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set consists of the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, it is ultimately the user's task to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, allowing a substantial number of applications to scale well.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Due to shared memory, key-value storage is inefficient, since containers must provide fast lookup and retrieval over a potentially large data set while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiner: on shared-memory (SMP) machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored; further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user relies on the exposed chunk size to improve performance, the framework can no longer freely adjust that size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Distributed Memory Machines ==&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations ===&lt;br /&gt;
# [http://mapreduce.sandia.gov/papers.html MapReduce-MPI]&lt;br /&gt;
# [http://mt.aics.riken.jp/kmr/ KMR]&lt;br /&gt;
&lt;br /&gt;
====MapReduce-MPI====&lt;br /&gt;
&lt;br /&gt;
MapReduce-MPI is an implementation of MapReduce on top of the standardized MPI.  Unlike other implementations of MapReduce, MapReduce-MPI is implemented in C++.  The major drawback of this implementation is its lack of fault tolerance: the implementation's MPI library does not reliably detect machines that are no longer part of the cluster.[http://mapreduce.sandia.gov/doc/Background.html]&lt;br /&gt;
&lt;br /&gt;
Example MR-MPI code&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);   // instantiate an MR object&lt;br /&gt;
mr-&amp;gt;map(nfiles,&amp;amp;mymap);                          // parallel map&lt;br /&gt;
mr-&amp;gt;collate();                                  // collate keys&lt;br /&gt;
mr-&amp;gt;reduce(&amp;amp;myreduce);                           // parallel reduce&lt;br /&gt;
delete mr;                                       // delete the MR object&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# [http://www.cse.ust.hk/gpuqp/Mars.html Mars]&lt;br /&gt;
# [http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf StreamMR ]&lt;br /&gt;
# [https://code.google.com/p/gpmr/ GPMR]&lt;br /&gt;
# [https://code.google.com/p/mapcg/source/browse/trunk/README MapCG]&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid any conflict between concurrent writes, Mars uses a lock-free scheme with low runtime overhead. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
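&lt;br /&gt;
To make this two-step design concrete, below is a hedged sketch of a user-implemented MAP_COUNT/MAP pair for a word-count style workload. It uses only the signatures listed above; treating ''val'' as a space-separated character buffer and the inline word-scanning loops are illustrative assumptions, not part of the Mars API, and real Mars map functions run as GPU device code.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// First pass: MAP_COUNT reports how much output MAP will produce, so the&lt;br /&gt;
// runtime can precompute write offsets without atomic operations.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize)&lt;br /&gt;
{&lt;br /&gt;
    char *text = (char *) val;&lt;br /&gt;
    int i = 0;&lt;br /&gt;
    while (i &amp;lt; valSize) {&lt;br /&gt;
        while (i &amp;lt; valSize &amp;amp;&amp;amp; text[i] == ' ') i++;   // skip delimiters&lt;br /&gt;
        int start = i;&lt;br /&gt;
        while (i &amp;lt; valSize &amp;amp;&amp;amp; text[i] != ' ') i++;   // scan one word&lt;br /&gt;
        if (i &amp;gt; start)&lt;br /&gt;
            EMIT_INTERMEDIATE_COUNT(i - start, sizeof(int));&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Second pass: MAP emits the actual &amp;lt;word, 1&amp;gt; pairs into the space&lt;br /&gt;
// reserved using the counts from MAP_COUNT.&lt;br /&gt;
void MAP(void *key, void *val, int keySize, int valSize)&lt;br /&gt;
{&lt;br /&gt;
    char *text = (char *) val;&lt;br /&gt;
    int one = 1;&lt;br /&gt;
    int i = 0;&lt;br /&gt;
    while (i &amp;lt; valSize) {&lt;br /&gt;
        while (i &amp;lt; valSize &amp;amp;&amp;amp; text[i] == ' ') i++;&lt;br /&gt;
        int start = i;&lt;br /&gt;
        while (i &amp;lt; valSize &amp;amp;&amp;amp; text[i] != ' ') i++;&lt;br /&gt;
        if (i &amp;gt; start)&lt;br /&gt;
            EMIT_INTERMEDIATE(text + start, &amp;amp;one, i - start, sizeof(int));&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;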
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or an HTTP page, for example. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms in its own document first.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, with the partial results then combined into a final result, is a standard MapReduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and then emits the result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is calculating a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed iteratively. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration is terminated by some condition, such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing state and rewriting the node with the new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be handled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
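&lt;br /&gt;
As a sketch of the last item, below is a self-contained single-machine version of the inverted index; an in-memory map stands in for the distributed shuffle, and the two hard-coded documents are purely illustrative.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;algorithm&amp;gt;&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;map&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    // Illustrative input: document ID -&amp;gt; document text.&lt;br /&gt;
    std::vector&amp;lt;std::pair&amp;lt;int, std::string&amp;gt;&amp;gt; docs = {&lt;br /&gt;
        {1, &amp;quot;map reduce on clusters&amp;quot;},&lt;br /&gt;
        {2, &amp;quot;map reduce on gpus&amp;quot;}&lt;br /&gt;
    };&lt;br /&gt;
&lt;br /&gt;
    // Map phase: emit &amp;lt;word, document ID&amp;gt; pairs.&lt;br /&gt;
    std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt;&amp;gt; pairs;&lt;br /&gt;
    for (const auto &amp;amp;doc : docs) {&lt;br /&gt;
        std::istringstream in(doc.second);&lt;br /&gt;
        std::string word;&lt;br /&gt;
        while (in &amp;gt;&amp;gt; word) pairs.push_back({word, doc.first});&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
    // Shuffle stand-in: group the pairs by word.&lt;br /&gt;
    std::map&amp;lt;std::string, std::vector&amp;lt;int&amp;gt;&amp;gt; index;&lt;br /&gt;
    for (const auto &amp;amp;p : pairs) index[p.first].push_back(p.second);&lt;br /&gt;
&lt;br /&gt;
    // Reduce phase: sort and deduplicate the posting list for each word.&lt;br /&gt;
    for (auto &amp;amp;entry : index) {&lt;br /&gt;
        std::sort(entry.second.begin(), entry.second.end());&lt;br /&gt;
        entry.second.erase(std::unique(entry.second.begin(), entry.second.end()),&lt;br /&gt;
                           entry.second.end());&lt;br /&gt;
        std::cout &amp;lt;&amp;lt; entry.first &amp;lt;&amp;lt; &amp;quot; -&amp;gt;&amp;quot;;&lt;br /&gt;
        for (int id : entry.second) std::cout &amp;lt;&amp;lt; &amp;quot; &amp;quot; &amp;lt;&amp;lt; id;&lt;br /&gt;
        std::cout &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;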
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written with the Pthreads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort. The difficulty is even greater for complex and performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded the patent for MapReduce, but it can be argued that this technology is similar to many other already existing ones. There are programming models that are similar to MapReduce, such as Algorithm Skeletons (Parallelism Patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithm skeletons are a high-level programming model for parallel and distributed computing, and skeleton framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop and includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93683</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93683"/>
		<updated>2015-02-13T02:18:02Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: add links per the assignment description.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;WriteUp: https://docs.google.com/document/d/1dyv2TU7PsDe78rMq8gWE788II_KjmK3yIS8Wm_F0Z-c/edit&amp;lt;br&amp;gt;&lt;br /&gt;
StartingDoc: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/3b_xz&lt;br /&gt;
&lt;br /&gt;
= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
A program to count the number of occurrences of each word in a collection of documents:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in Input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
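&lt;br /&gt;
The following is a hedged, single-machine C++ sketch of the same word count, with an in-memory map standing in for the runtime's grouping step; the two hard-coded documents are illustrative only.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;map&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
int main() {&lt;br /&gt;
    std::vector&amp;lt;std::string&amp;gt; documents = {&amp;quot;map reduce map&amp;quot;, &amp;quot;reduce&amp;quot;};  // illustrative input&lt;br /&gt;
&lt;br /&gt;
    // Map phase: emit an intermediate &amp;lt;word, 1&amp;gt; pair for every word.&lt;br /&gt;
    std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt;&amp;gt; intermediate;&lt;br /&gt;
    for (const std::string &amp;amp;doc : documents) {&lt;br /&gt;
        std::istringstream in(doc);&lt;br /&gt;
        std::string word;&lt;br /&gt;
        while (in &amp;gt;&amp;gt; word) intermediate.push_back({word, 1});&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
    // Grouping by key (done by the runtime in a real MapReduce system).&lt;br /&gt;
    std::map&amp;lt;std::string, std::vector&amp;lt;int&amp;gt;&amp;gt; grouped;&lt;br /&gt;
    for (const auto &amp;amp;kv : intermediate) grouped[kv.first].push_back(kv.second);&lt;br /&gt;
&lt;br /&gt;
    // Reduce phase: sum the values for each word and emit the result.&lt;br /&gt;
    for (const auto &amp;amp;entry : grouped) {&lt;br /&gt;
        int result = 0;&lt;br /&gt;
        for (int v : entry.second) result += v;&lt;br /&gt;
        std::cout &amp;lt;&amp;lt; entry.first &amp;lt;&amp;lt; &amp;quot; &amp;quot; &amp;lt;&amp;lt; result &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;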
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Problems Solved by MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
The types of problems that a shared-memory MapReduce implementation solves are problems with large files. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== Distributed Memory Machines ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== LANs ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://www.teradata.com/Teradata-Aster-SQL-MapReduce/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
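&lt;br /&gt;
As a small illustration of the partitioning step described above, a minimal C++ sketch of a hash(key) mod R partitioning function (the function name is illustrative) is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;functional&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Maps an intermediate key to one of R reduce partitions, so that all&lt;br /&gt;
// pairs with the same key are handled by the same reduce task.&lt;br /&gt;
int partition(const std::string &amp;amp;key, int R) {&lt;br /&gt;
    return static_cast&amp;lt;int&amp;gt;(std::hash&amp;lt;std::string&amp;gt;{}(key) % R);&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;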
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
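&lt;br /&gt;
A hedged C++ sketch of this bookkeeping is shown below; the type and field names are illustrative, not Google's actual data structures.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;cstdint&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
enum class TaskState { Idle, InProgress, Completed };&lt;br /&gt;
&lt;br /&gt;
// Per-task record kept by the master.&lt;br /&gt;
struct TaskInfo {&lt;br /&gt;
    TaskState state = TaskState::Idle;&lt;br /&gt;
    std::string worker;                       // assigned worker (non-idle tasks)&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
// For each completed map task, the master records the location and size of&lt;br /&gt;
// the R intermediate file regions it produced, one per reduce partition.&lt;br /&gt;
struct MapOutputInfo {&lt;br /&gt;
    std::vector&amp;lt;std::string&amp;gt; region_locations;&lt;br /&gt;
    std::vector&amp;lt;uint64_t&amp;gt; region_sizes;&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;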
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as MapReduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementation of Map-Reduce can be scaled to large clusters of machines comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on the way problems can be expressed and implemented within the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system must be targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same. The important thing to note here is that Apache made this framework open-source. This framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. The idea is that the network is slow and data is plentiful: many processing frameworks bring the data to the computation, whereas Hadoop brings the computation to the data. In some cases the data is so large that this is the only practical processing option. Data is stored in Hadoop in a filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapReduce is based on a “pull” model in which multiple “TaskTrackers” poll the “JobTracker” for tasks (either map tasks or reduce tasks).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker determines appropriate tasks based on how busy each TaskTracker is. &lt;br /&gt;
* TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTask completes (all splits are done), the TaskTracker will notify the JobTracker which keeps track of the overall progress of job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This follows the same method, where a reduce task is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
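&lt;br /&gt;
The user-provided &amp;quot;map&amp;quot; and &amp;quot;reduce&amp;quot; logic mentioned above is usually written against the Java API, but Hadoop Streaming also allows any executable that reads lines on stdin and writes key/value lines to stdout. Below is a hedged word-count sketch in C++; the file names are illustrative, and the two programs are separate source files shown in one block.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// mapper.cpp: read text on stdin, emit one &amp;quot;word&amp;lt;TAB&amp;gt;1&amp;quot; line per word.&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
int main() {&lt;br /&gt;
    std::string word;&lt;br /&gt;
    while (std::cin &amp;gt;&amp;gt; word) std::cout &amp;lt;&amp;lt; word &amp;lt;&amp;lt; &amp;quot;\t&amp;quot; &amp;lt;&amp;lt; 1 &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// reducer.cpp: streaming delivers lines sorted by key, so counts for the&lt;br /&gt;
// same word arrive consecutively and can be summed on the fly.&lt;br /&gt;
#include &amp;lt;iostream&amp;gt;&lt;br /&gt;
#include &amp;lt;sstream&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
int main() {&lt;br /&gt;
    std::string line, word, current;&lt;br /&gt;
    long sum = 0;&lt;br /&gt;
    while (std::getline(std::cin, line)) {&lt;br /&gt;
        std::istringstream in(line);&lt;br /&gt;
        long count = 0;&lt;br /&gt;
        in &amp;gt;&amp;gt; word &amp;gt;&amp;gt; count;&lt;br /&gt;
        if (word != current &amp;amp;&amp;amp; !current.empty()) {&lt;br /&gt;
            std::cout &amp;lt;&amp;lt; current &amp;lt;&amp;lt; &amp;quot;\t&amp;quot; &amp;lt;&amp;lt; sum &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;&lt;br /&gt;
            sum = 0;&lt;br /&gt;
        }&lt;br /&gt;
        current = word;&lt;br /&gt;
        sum += count;&lt;br /&gt;
    }&lt;br /&gt;
    if (!current.empty()) std::cout &amp;lt;&amp;lt; current &amp;lt;&amp;lt; &amp;quot;\t&amp;quot; &amp;lt;&amp;lt; sum &amp;lt;&amp;lt; &amp;quot;\n&amp;quot;;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The compiled binaries would be submitted with the hadoop-streaming jar, passing them as the -mapper and -reducer options; the exact invocation depends on the Hadoop version.&lt;br /&gt;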
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product revealed some pain points in the MRV1 implementation. Notably, a heavy processing load would cause the JobTracker to become a large bottleneck. In order to help remove this bottleneck, YARN was implemented. YARN is an application framework that solely does resource management for Hadoop clusters. Now, not only can you run MapReduce jobs, but you can also place other in-cluster frameworks under YARN resource management, allowing you to properly allocate resources across your cluster. YARN, at its simplest, is the separation of the work that the JobTracker would do into two new processes: the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports. However, the execution of the job changes significantly. YARN does work in units called containers, which represent a unit of work that can be done on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a node in the cluster; to launch it, the ResourceManager asks a NodeManager to start the ApplicationMaster in that container. The ApplicationMaster then determines, based on the input splits, the number of map tasks to create. Once this is known, the ApplicationMaster requests the container resources from the ResourceManager. Based on the locality of data and available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster then asks the NodeManagers on the assigned nodes to start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of available memory on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming data ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. This allows data to be read into memory on a cluster and iterations of an algorithm to run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a DAG (directed acyclic graph) engine. Based on the Microsoft Dryad paper, the DAG execution engine allows applications to model tasks as nodes in a graph. Like Spark, it offers gains in execution speed and attempts to make more efficient use of available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory  ==&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, it is ultimately the task of the user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs, and each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
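&lt;br /&gt;
A hedged, generic C++ sketch of such per-worker intermediate buffers is shown below; it is not the actual Phoenix data structure, and the names are illustrative.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;functional&amp;gt;&lt;br /&gt;
#include &amp;lt;string&amp;gt;&lt;br /&gt;
#include &amp;lt;utility&amp;gt;&lt;br /&gt;
#include &amp;lt;vector&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Each worker thread owns one of these, so emitting intermediate pairs&lt;br /&gt;
// requires no locking; later stages move pointers/indices, not the pairs.&lt;br /&gt;
struct WorkerBuffers {&lt;br /&gt;
    // One bucket per Reduce task; all values for a key land in the same bucket.&lt;br /&gt;
    std::vector&amp;lt;std::vector&amp;lt;std::pair&amp;lt;std::string, int&amp;gt;&amp;gt;&amp;gt; buckets;&lt;br /&gt;
&lt;br /&gt;
    explicit WorkerBuffers(int num_reduce_tasks) : buckets(num_reduce_tasks) {}&lt;br /&gt;
&lt;br /&gt;
    void emit_intermediate(const std::string &amp;amp;key, int value) {&lt;br /&gt;
        auto r = std::hash&amp;lt;std::string&amp;gt;{}(key) % buckets.size();&lt;br /&gt;
        buckets[r].push_back({key, value});   // resized dynamically as needed&lt;br /&gt;
    }&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;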
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory-allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Inefficient key-value storage: due to shared memory, containers must provide fast lookup and retrieval over a potentially large data set, all while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiner: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory-allocation pressure, since generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory-access penalties.&lt;br /&gt;
# Phoenix groups tasks into chunks to reduce scheduling costs and amortize per-task overhead, and this chunking is exposed to user code. The design enables certain user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunks to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three main technical challenges in implementing a MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current Available MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# Mars &amp;lt;ref&amp;gt;http://www.cse.ust.hk/gpuqp/Mars.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
# StreamMR &amp;lt;ref&amp;gt;http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
# GPMR &amp;lt;ref&amp;gt;https://code.google.com/p/gpmr/&amp;lt;/ref&amp;gt;&lt;br /&gt;
# MapCG &amp;lt;ref&amp;gt;https://code.google.com/p/mapcg/source/browse/trunk/README&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid any conflict between concurrent writes, Mars uses a lock-free scheme with low runtime overhead. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or an HTTP page, for example. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms in its own document first.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, with the partial results then combined into a final result, is a standard MapReduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and then emits the result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is calculating a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed iteratively. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration is terminated by some condition, such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing state and rewriting the node with the new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be handled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
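&lt;br /&gt;
As a concrete illustration of the last example above, here is a minimal, single-process Python sketch of an inverted index; the toy documents and their IDs are made up, and the in-memory grouping stands in for the framework's shuffle and sort.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
docs = {1: 'map reduce on clusters', 2: 'reduce network traffic', 3: 'map of clusters'}&lt;br /&gt;
&lt;br /&gt;
def map_fn(doc_id, text):&lt;br /&gt;
    for word in text.split():&lt;br /&gt;
        yield word, doc_id                    # emit (word, document ID)&lt;br /&gt;
&lt;br /&gt;
def reduce_fn(word, doc_ids):&lt;br /&gt;
    return word, sorted(set(doc_ids))         # emit (word, list of document IDs)&lt;br /&gt;
&lt;br /&gt;
# Shuffle: group the intermediate pairs by key, as the runtime would.&lt;br /&gt;
groups = defaultdict(list)&lt;br /&gt;
for doc_id, text in docs.items():&lt;br /&gt;
    for word, did in map_fn(doc_id, text):&lt;br /&gt;
        groups[word].append(did)&lt;br /&gt;
&lt;br /&gt;
index = dict(reduce_fn(w, ids) for w, ids in groups.items())&lt;br /&gt;
print(index['clusters'])                      # [1, 3]&lt;br /&gt;
print(index['reduce'])                        # [1, 2]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;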
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occur by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective for distributed computing, it leads to very high overheads if used with shared-memory systems, which facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is comparable to that of parallel code written with the Pthreads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many that already existed. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level programming model for parallel and distributed computing, and skeleton framework libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93682</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93682"/>
		<updated>2015-02-13T02:05:51Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Apache’s Hadoop MapReduce */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       Emit Intermediate(w,1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Problems Solved by MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A shared-memory MapReduce implementation is well suited to problems with large files. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== Distributed Memory Machines ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== LANs ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://www.teradata.com/Teradata-Aster-SQL-MapReduce/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the mapreduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file . They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
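&lt;br /&gt;
As a small sketch of the partitioning function mentioned above (hash(key) mod R), the snippet below routes intermediate keys to one of R reduce partitions; the key list and the choice of R = 4 are arbitrary, and a stable hash is used because Python's built-in hash() is randomized per process, whereas every map worker must send a given key to the same reduce task.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import hashlib&lt;br /&gt;
&lt;br /&gt;
R = 4   # number of reduce tasks, chosen by the user&lt;br /&gt;
&lt;br /&gt;
def partition(key, r=R):&lt;br /&gt;
    # Stable hash so that all workers agree on where a key belongs.&lt;br /&gt;
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()&lt;br /&gt;
    return int(digest, 16) % r&lt;br /&gt;
&lt;br /&gt;
for key in ['apple', 'banana', 'cherry', 'apple']:&lt;br /&gt;
    print(key, partition(key))   # identical keys always land in the same partition&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;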
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
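&lt;br /&gt;
The bookkeeping described above can be pictured with a small sketch; the class and field names below are hypothetical and only mirror the description, not Google's actual data structures.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from dataclasses import dataclass, field&lt;br /&gt;
&lt;br /&gt;
@dataclass&lt;br /&gt;
class TaskInfo:&lt;br /&gt;
    state: str = 'idle'     # 'idle', 'in-progress', or 'completed'&lt;br /&gt;
    worker: str = ''        # identity of the worker machine (non-idle tasks)&lt;br /&gt;
&lt;br /&gt;
@dataclass&lt;br /&gt;
class MasterState:&lt;br /&gt;
    map_tasks: dict = field(default_factory=dict)      # map task id to TaskInfo&lt;br /&gt;
    reduce_tasks: dict = field(default_factory=dict)   # reduce task id to TaskInfo&lt;br /&gt;
    # For each completed map task: location and size of its R intermediate regions.&lt;br /&gt;
    region_info: dict = field(default_factory=dict)&lt;br /&gt;
&lt;br /&gt;
master = MasterState()&lt;br /&gt;
master.map_tasks[0] = TaskInfo('in-progress', 'worker-17')&lt;br /&gt;
# ...later, when the map task finishes, the worker reports its output regions:&lt;br /&gt;
master.map_tasks[0].state = 'completed'&lt;br /&gt;
master.region_info[0] = [('worker-17:/spill/m0-r' + str(r), 4096) for r in range(4)]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;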
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Large variety of problems are easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementation of Map-Reduce can be scaled to large clusters of machines comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# Restricted programming model puts bounds on the way you implement the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published the papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the framework. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. The network is slow and the data are plentiful: many processing frameworks bring the data to the computation, whereas&lt;br /&gt;
Hadoop brings the computation to the data. In some cases the data is so large that this is the only practical option. Data is stored in Hadoop in the file system called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* JobTracker determines appropriate jobs based on how busy the TaskTracker is. &lt;br /&gt;
* TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This again follows the same method, where a reduce task is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
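&lt;br /&gt;
MRV1 jobs are typically written in Java, but the flow above can also be illustrated with the Hadoop Streaming utility, which runs the map and reduce steps as external scripts that read standard input and write tab-separated key/value pairs. The word-count mapper and reducer below are a minimal sketch; the script names, input/output paths, and streaming jar location are placeholders.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
# mapper.py: emit (word, 1) for every word read from standard input.&lt;br /&gt;
import sys&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    for word in line.split():&lt;br /&gt;
        print('%s\t%d' % (word, 1))&lt;br /&gt;
&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
# reducer.py: input arrives sorted by key, so all counts for a word are adjacent.&lt;br /&gt;
import sys&lt;br /&gt;
current, total = None, 0&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    word, count = line.rstrip('\n').split('\t')&lt;br /&gt;
    if word != current:&lt;br /&gt;
        if current is not None:&lt;br /&gt;
            print('%s\t%d' % (current, total))&lt;br /&gt;
        current, total = word, 0&lt;br /&gt;
    total += int(count)&lt;br /&gt;
if current is not None:&lt;br /&gt;
    print('%s\t%d' % (current, total))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A job like this would be submitted with something along the lines of ''hadoop jar hadoop-streaming.jar -input /docs -output /counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py'', with the TaskTrackers forking the scripts in place of Java map and reduce tasks.&lt;br /&gt;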
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product did expose some pain points in the MRV1 implementation. Notably, heavy processing&lt;br /&gt;
load would cause the JobTracker to become a large bottleneck. In order to help remove this bottleneck, YARN was implemented. YARN is an application framework that solely does&lt;br /&gt;
resource management for Hadoop clusters. Now not only can you run MapReduce jobs, but you can also put other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing you to properly allocate resources across your cluster. YARN, at its simplest, is the separation of the work that the JobTracker would do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports. However, the execution of the job changes significantly. YARN does work in units called containers.&lt;br /&gt;
Containers represent a unit of work that can be done on a cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster. This ApplicationMaster&lt;br /&gt;
runs on a DataNode in the cluster. To run the application, the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container. The ApplicationMaster then&lt;br /&gt;
determines, based on the input splits, the number of map tasks to create. Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of data and available resources, the ResourceManager decides where to run the map tasks. The ApplicationMaster then asks the NodeManagers on the assigned nodes to&lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
=====Spark=====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that aims to remove some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the available memory on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming data ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. This allows data to be read into cluster memory once, with the iterations of an algorithm running over the same in-memory data instead of rereading it from disk repeatedly.&lt;br /&gt;
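&lt;br /&gt;
As a hedged sketch of that in-memory reuse, the PySpark snippet below caches a parsed data set once and then makes two passes over it without rereading from disk; the input path is a placeholder.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from pyspark import SparkContext&lt;br /&gt;
&lt;br /&gt;
sc = SparkContext(appName='TwoPassStats')&lt;br /&gt;
# Parse one number per line and keep the parsed RDD in cluster memory.&lt;br /&gt;
nums = sc.textFile('hdfs:///data/measurements.txt').map(float).cache()&lt;br /&gt;
&lt;br /&gt;
n = nums.count()                                    # first pass materializes the cache&lt;br /&gt;
mean = nums.sum() / n                               # served from the cached partitions&lt;br /&gt;
variance = nums.map(lambda x: (x - mean) ** 2).sum() / n   # second pass, still in memory&lt;br /&gt;
print(mean, variance)&lt;br /&gt;
sc.stop()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;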
&lt;br /&gt;
=====Tez=====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a DAG (directed acyclic graph) engine. Based on the Microsoft Dryad paper, the DAG execution engine allows applications to model their tasks as nodes in a graph. Like Spark, it offers gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
=====Flink=====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt at a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory  ==&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, organized into two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointers are manipulated instead of copying the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Due to shared memory, key-value storage is inefficient, since containers must provide fast lookup and retrieval over a potentially large data set while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables certain user-implemented optimizations. However, it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunking to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to the architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current Available MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# Mars &amp;lt;ref&amp;gt;http://www.cse.ust.hk/gpuqp/Mars.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
# StreamMR &amp;lt;ref&amp;gt;http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
# GPMR &amp;lt;ref&amp;gt;https://code.google.com/p/gpmr/&amp;lt;/ref&amp;gt;&lt;br /&gt;
# MapCG &amp;lt;ref&amp;gt;https://code.google.com/p/mapcg/source/browse/trunk/README&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. Run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid any conflict between concurrent writes,  Mars has a lock-free scheme with low runtime overhead on the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
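&lt;br /&gt;
The two-step output can be pictured on the CPU with the sketch below: a first pass (in the spirit of MAP_COUNT) only reports result sizes, a prefix sum turns those sizes into non-overlapping write offsets, and a second pass (in the spirit of MAP) writes the results. The prefix-sum step is an assumption about how the counts are combined; Mars itself runs these passes as GPU threads rather than a Python loop.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
inputs = ['foo', 'hi', 'mapreduce', 'gpu']&lt;br /&gt;
&lt;br /&gt;
# Pass 1: each logical thread reports only how many bytes it will emit.&lt;br /&gt;
sizes = [len(s) for s in inputs]&lt;br /&gt;
&lt;br /&gt;
# Exclusive prefix sum gives every thread a private write offset,&lt;br /&gt;
# so no atomic operations or locks are needed when writing.&lt;br /&gt;
offsets, running = [], 0&lt;br /&gt;
for s in sizes:&lt;br /&gt;
    offsets.append(running)&lt;br /&gt;
    running += s&lt;br /&gt;
&lt;br /&gt;
# Pass 2: each thread writes its result at its precomputed offset.&lt;br /&gt;
out = bytearray(running)&lt;br /&gt;
for text, off in zip(inputs, offsets):&lt;br /&gt;
    out[off:off + len(text)] = text.encode('ascii')&lt;br /&gt;
&lt;br /&gt;
print(out.decode('ascii'))   # foohimapreducegpu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;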
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files or HTTP pages, for example. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to have the mapper count the terms within its own document first.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts can be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts and then combined for a final result is a standard MapReduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and then emits the result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
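&lt;br /&gt;
A runnable, single-process sketch of this pattern is shown below; Monte Carlo estimation of pi stands in for the specification-driven computation and is an illustrative choice, with each specification being simply a number of random samples to draw.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import random&lt;br /&gt;
&lt;br /&gt;
# Each specification is just (id, number of samples to draw).&lt;br /&gt;
specs = [(i, 100000) for i in range(8)]&lt;br /&gt;
&lt;br /&gt;
def map_fn(spec_id, n_samples):&lt;br /&gt;
    # Count samples falling inside the unit quarter-circle.&lt;br /&gt;
    hits = 0&lt;br /&gt;
    for _ in range(n_samples):&lt;br /&gt;
        x, y = random.random(), random.random()&lt;br /&gt;
        if x * x + y * y &amp;lt;= 1.0:&lt;br /&gt;
            hits += 1&lt;br /&gt;
    return hits&lt;br /&gt;
&lt;br /&gt;
def reduce_fn(results):&lt;br /&gt;
    return sum(results)&lt;br /&gt;
&lt;br /&gt;
partial = [map_fn(sid, n) for sid, n in specs]   # each call could run on a different worker&lt;br /&gt;
total_hits = reduce_fn(partial)&lt;br /&gt;
total_samples = sum(n for _, n in specs)&lt;br /&gt;
print(4.0 * total_hits / total_samples)          # roughly 3.14&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;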
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is calculating a state for each node based on the properties of its neighbors. This state can be the distance to other nodes, a density characteristic, and so on. Conceptually, MapReduce jobs for this problem are performed iteratively: on each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration is terminated by some condition such as a fixed number of iterations or negligible change in state. The Mapper is responsible for emitting a message for each node, using the adjacent node's ID as the key. The Reducer is responsible for recomputing the state from the incoming messages and rewriting the node with the new state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be handled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions yield a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occur by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective for distributed computing, it leads to very high overheads if used with shared-memory systems, which facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is comparable to that of parallel code written with the Pthreads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many that already existed. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level programming model for parallel and distributed computing, and skeleton framework libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93681</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93681"/>
		<updated>2015-02-13T01:59:31Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Mars API */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       Emit Intermediate(w,1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Problems Solved by MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A shared-memory MapReduce implementation is well suited to problems with large files. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== Distributed Memory Machines ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== LANs ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://www.teradata.com/Teradata-Aster-SQL-MapReduce/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and ''R'' reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files. A minimal single-process sketch of the overall split-and-partition flow follows this list.&lt;br /&gt;
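&lt;br /&gt;
The end-to-end flow above can be pictured with a minimal, single-process Python sketch. This is only an illustration under simplifying assumptions; the names M, R, map_fn and reduce_fn are ours, not Google's. The input is cut into ''M'' splits, each map task buffers its output into ''R'' partitions chosen by hash(key) mod R, and each reduce task sorts and reduces one partition.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal single-process sketch of the M-split / R-partition flow (illustrative only).&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def map_fn(record):&lt;br /&gt;
    # word count: emit (word, 1) for every word in the record&lt;br /&gt;
    for word in record.split():&lt;br /&gt;
        yield (word, 1)&lt;br /&gt;
&lt;br /&gt;
def reduce_fn(key, values):&lt;br /&gt;
    return (key, sum(values))&lt;br /&gt;
&lt;br /&gt;
def run(records, M=3, R=2):&lt;br /&gt;
    # 1. split the input into M pieces&lt;br /&gt;
    splits = [records[i::M] for i in range(M)]&lt;br /&gt;
    # 2.-4. every map task writes its output into R partitions via hash(key) mod R&lt;br /&gt;
    partitions = [defaultdict(list) for _ in range(R)]&lt;br /&gt;
    for split in splits:&lt;br /&gt;
        for record in split:&lt;br /&gt;
            for key, value in map_fn(record):&lt;br /&gt;
                partitions[hash(key) % R][key].append(value)&lt;br /&gt;
    # 5.-6. every reduce task sorts its partition by key and reduces each group&lt;br /&gt;
    outputs = []&lt;br /&gt;
    for r in range(R):&lt;br /&gt;
        output_r = [reduce_fn(k, partitions[r][k]) for k in sorted(partitions[r])]&lt;br /&gt;
        outputs.append(output_r)   # one output 'file' per reduce task&lt;br /&gt;
    return outputs&lt;br /&gt;
&lt;br /&gt;
print(run(['the cat sat', 'the dog sat', 'the cat ran']))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;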
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
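&lt;br /&gt;
As a hedged illustration only (the class and field names below are hypothetical and are not taken from Google's implementation), the master's bookkeeping can be pictured as a few maps keyed by task:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustrative sketch of the master's bookkeeping (names are hypothetical).&lt;br /&gt;
IDLE, IN_PROGRESS, COMPLETED = 'idle', 'in-progress', 'completed'&lt;br /&gt;
&lt;br /&gt;
class Master:&lt;br /&gt;
    def __init__(self, num_map_tasks, R):&lt;br /&gt;
        self.R = R&lt;br /&gt;
        self.state = {t: IDLE for t in range(num_map_tasks)}  # per-task state&lt;br /&gt;
        self.worker = {}        # task id to worker machine (non-idle tasks only)&lt;br /&gt;
        self.map_outputs = {}   # completed map task to its R (location, size) pairs&lt;br /&gt;
&lt;br /&gt;
    def assign(self, task, worker_id):&lt;br /&gt;
        self.state[task] = IN_PROGRESS&lt;br /&gt;
        self.worker[task] = worker_id&lt;br /&gt;
&lt;br /&gt;
    def map_completed(self, task, regions):&lt;br /&gt;
        # 'regions' holds the R intermediate file regions produced by the map task;&lt;br /&gt;
        # in the real system this is pushed incrementally to in-progress reduce workers&lt;br /&gt;
        self.state[task] = COMPLETED&lt;br /&gt;
        self.map_outputs[task] = regions&lt;br /&gt;
&lt;br /&gt;
m = Master(num_map_tasks=4, R=2)&lt;br /&gt;
m.assign(0, 'worker-7')&lt;br /&gt;
m.map_completed(0, [('/local/disk/part-0', 120), ('/local/disk/part-1', 98)])&lt;br /&gt;
print(m.state, m.map_outputs)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;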
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
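&lt;br /&gt;
A minimal sketch of this re-scheduling policy, assuming a hypothetical heartbeat table and timeout value (none of the names below come from Google's code):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustrative sketch of worker-failure handling on the master (assumed structures).&lt;br /&gt;
import time&lt;br /&gt;
&lt;br /&gt;
PING_TIMEOUT = 30.0   # assumed timeout in seconds&lt;br /&gt;
last_heartbeat = {'worker-1': time.time(), 'worker-2': time.time() - 120}&lt;br /&gt;
task_state = {'map-0': 'completed', 'map-1': 'in-progress', 'reduce-0': 'in-progress'}&lt;br /&gt;
task_worker = {'map-0': 'worker-2', 'map-1': 'worker-2', 'reduce-0': 'worker-1'}&lt;br /&gt;
&lt;br /&gt;
def check_workers(now):&lt;br /&gt;
    for worker, seen in last_heartbeat.items():&lt;br /&gt;
        if now - seen &amp;gt; PING_TIMEOUT:   # no ping response in time: mark failed&lt;br /&gt;
            for task, w in task_worker.items():&lt;br /&gt;
                if w != worker:&lt;br /&gt;
                    continue&lt;br /&gt;
                # completed map tasks are redone (their output lived on local disk);&lt;br /&gt;
                # completed reduce output is in the global file system and is kept&lt;br /&gt;
                if task.startswith('map-') or task_state[task] == 'in-progress':&lt;br /&gt;
                    task_state[task] = 'idle'&lt;br /&gt;
&lt;br /&gt;
check_workers(time.time())&lt;br /&gt;
print(task_state)   # map-0 and map-1 become idle again; reduce-0 is untouched&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;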
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementations of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model limits the way computations can be expressed. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the model. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. The idea is that the network is slow and data is plentiful: many processing frameworks bring the data to the computation, whereas Hadoop brings the computation to the data. In some cases the data is so large that this is the only practical processing option. Data is stored in Hadoop in a filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker determines the appropriate tasks based on how busy each TaskTracker is. &lt;br /&gt;
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function, filling a buffer with key/value pairs until it is full (a minimal Streaming-style sketch of such user-provided map and reduce functions follows this list). &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When the map phase is done, the JobTracker notifies the TaskTrackers to move to the reduce phase; this follows the same method, with a ReduceTask being forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
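&lt;br /&gt;
The user-provided map and reduce functions referenced in the list above can also be written as plain scripts and run through Hadoop Streaming, which pipes records over standard input and output. A minimal, illustrative word-count pair might look like the following; the input paths and the exact submission command are omitted, and the scripts would typically be handed to the hadoop-streaming jar via its -mapper and -reducer options.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
# mapper.py -- emit one 'word TAB 1' line for every word read from stdin&lt;br /&gt;
import sys&lt;br /&gt;
&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    for word in line.split():&lt;br /&gt;
        print('%s\t%d' % (word, 1))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
# reducer.py -- input arrives sorted by key, so counts for a word are adjacent&lt;br /&gt;
import sys&lt;br /&gt;
&lt;br /&gt;
current, count = None, 0&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    word, value = line.rstrip('\n').split('\t')&lt;br /&gt;
    if word == current:&lt;br /&gt;
        count += int(value)&lt;br /&gt;
    else:&lt;br /&gt;
        if current is not None:&lt;br /&gt;
            print('%s\t%d' % (current, count))&lt;br /&gt;
        current, count = word, int(value)&lt;br /&gt;
if current is not None:&lt;br /&gt;
    print('%s\t%d' % (current, count))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;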
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 implementation. Notably, heavy processing load would make the JobTracker a large bottleneck. In order to help remove this bottleneck, YARN was implemented. YARN is an application framework that solely performs resource management for Hadoop clusters. Not only can you run MapReduce jobs, you can also place other in-cluster frameworks under YARN resource management, allowing you to properly allocate resources across your cluster. YARN, at its simplest, is the separation of the work that the JobTracker used to do into two new processes: the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports. However, the execution of the job changes significantly. YARN does work in units called containers; a container represents a unit of work that can be done on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a DataNode in the cluster; the ResourceManager asks a NodeManager to launch the ApplicationMaster in that container. The ApplicationMaster then determines, based on the input splits, the number of map tasks to create. Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager. Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster then asks the NodeManagers on the assigned nodes to start the map tasks.&lt;br /&gt;
&lt;br /&gt;
====Spark====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that removes some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until the code has been distributed to all of the nodes. Spark also adds a number of features to the framework, such as streaming ingestion of data and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. This allows data to be read into memory on a cluster and iterations of an algorithm to run over the same in-memory data instead of reading it from disk repeatedly.&lt;br /&gt;
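&lt;br /&gt;
As a hedged illustration of the difference in programming style, the classic word count is only a few lines in Spark's Python API; the input and output paths below are placeholders.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustrative PySpark word count; 'input.txt' and 'counts_out' are placeholder paths.&lt;br /&gt;
from operator import add&lt;br /&gt;
from pyspark import SparkContext&lt;br /&gt;
&lt;br /&gt;
sc = SparkContext(appName='WordCount')&lt;br /&gt;
counts = (sc.textFile('input.txt')&lt;br /&gt;
            .flatMap(lambda line: line.split())&lt;br /&gt;
            .map(lambda word: (word, 1))&lt;br /&gt;
            .reduceByKey(add))&lt;br /&gt;
counts.saveAsTextFile('counts_out')&lt;br /&gt;
sc.stop()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Because Spark keeps RDDs in memory, an iterative algorithm can call cache() on an RDD once and reuse it across iterations instead of re-reading the data from disk each time.&lt;br /&gt;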
&lt;br /&gt;
====Tez====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a directed-acyclic-graph (DAG) engine. Based on the Microsoft Dryad paper, the DAG execution engine allows applications to express their tasks as nodes in a graph. Like Spark, it improves execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
====Flink====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory  ==&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++, organized into two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, it is ultimately the user's task to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
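&lt;br /&gt;
The Splitter/Map/Partition/Reduce data flow described above can be mimicked with a small sketch using ordinary Python threads. This is only an illustration of the data flow, not the Phoenix C API; the chunking and buffer layout are simplified assumptions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Toy shared-memory MapReduce: per-worker intermediate buffers keyed by word,&lt;br /&gt;
# then a grouping (partition) pass, a reduce pass and a final key-sorted merge.&lt;br /&gt;
import threading&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
def word_count_map(chunk, buffer):&lt;br /&gt;
    for word in chunk.split():&lt;br /&gt;
        buffer[word].append(1)   # emit an intermediate (word, 1) pair&lt;br /&gt;
&lt;br /&gt;
def run(text, num_workers=4):&lt;br /&gt;
    # Splitter: one roughly equal chunk of input per Map task&lt;br /&gt;
    words = text.split()&lt;br /&gt;
    step = max(1, len(words) // num_workers)&lt;br /&gt;
    chunks = [' '.join(words[i:i + step]) for i in range(0, len(words), step)]&lt;br /&gt;
    buffers = [defaultdict(list) for _ in chunks]   # one buffer per worker&lt;br /&gt;
    threads = [threading.Thread(target=word_count_map, args=(c, b))&lt;br /&gt;
               for c, b in zip(chunks, buffers)]&lt;br /&gt;
    for t in threads: t.start()&lt;br /&gt;
    for t in threads: t.join()&lt;br /&gt;
    # Partition: all values for the same key end up in the same Reduce task&lt;br /&gt;
    grouped = defaultdict(list)&lt;br /&gt;
    for b in buffers:&lt;br /&gt;
        for key, values in b.items():&lt;br /&gt;
            grouped[key].extend(values)&lt;br /&gt;
    # Reduce, then merge the output sorted by key&lt;br /&gt;
    return sorted((key, sum(values)) for key, values in grouped.items())&lt;br /&gt;
&lt;br /&gt;
print(run('a rose is a rose is a rose'))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;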
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split across tasks), pointers are manipulated instead of copying the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs; each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Inefficient key-value storage: because the store is shared, containers must provide fast lookup and retrieval over a potentially large data set, all while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Chunking exposed to user code: Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU: &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current Available MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# Mars &amp;lt;ref&amp;gt;http://www.cse.ust.hk/gpuqp/Mars.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
# StreamMR &amp;lt;ref&amp;gt;http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
# GPMR &amp;lt;ref&amp;gt;https://code.google.com/p/gpmr/&amp;lt;/ref&amp;gt;&lt;br /&gt;
# MapCG &amp;lt;ref&amp;gt;https://code.google.com/p/mapcg/source/browse/trunk/README&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
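&lt;br /&gt;
The two-step count-then-emit design can be illustrated abstractly with a prefix-sum sketch in Python. This is a generic illustration of the idea, not Mars' GPU code: each thread first reports how much output it will produce, an exclusive prefix sum turns those counts into non-overlapping write offsets, and the second pass then writes into disjoint slots without needing atomics or locks.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Generic illustration of a two-step (count, then emit) output scheme.&lt;br /&gt;
def count_pass(inputs, map_count):&lt;br /&gt;
    return [map_count(x) for x in inputs]        # sizes only, like MAP_COUNT&lt;br /&gt;
&lt;br /&gt;
def prefix_sum(sizes):&lt;br /&gt;
    offsets, total = [], 0&lt;br /&gt;
    for s in sizes:&lt;br /&gt;
        offsets.append(total)                    # exclusive prefix sum&lt;br /&gt;
        total += s&lt;br /&gt;
    return offsets, total&lt;br /&gt;
&lt;br /&gt;
def emit_pass(inputs, map_fn, offsets, total):&lt;br /&gt;
    out = [None] * total                         # pre-allocated output array&lt;br /&gt;
    for x, base in zip(inputs, offsets):&lt;br /&gt;
        for j, item in enumerate(map_fn(x)):&lt;br /&gt;
            out[base + j] = item                 # disjoint slots: no write conflicts&lt;br /&gt;
    return out&lt;br /&gt;
&lt;br /&gt;
docs = ['a b', 'c', 'd e f']&lt;br /&gt;
sizes = count_pass(docs, lambda d: len(d.split()))&lt;br /&gt;
offsets, total = prefix_sum(sizes)&lt;br /&gt;
print(emit_pass(docs, lambda d: [(w, 1) for w in d.split()], offsets, total))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;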
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files or HTTP pages, for example. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add the counts up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated across more than one document; a runnable sketch of these stages follows the pseudocode below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
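&lt;br /&gt;
The following is a runnable, single-process Python rendering of the same stages, with the combiner collapsing each mapper's output before grouping. It illustrates the pattern only and is not Hadoop code.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Runnable sketch of map + combine + reduce for word counting (illustrative).&lt;br /&gt;
from collections import Counter, defaultdict&lt;br /&gt;
&lt;br /&gt;
def mapper(doc):&lt;br /&gt;
    return [(word, 1) for word in doc.split()]&lt;br /&gt;
&lt;br /&gt;
def combiner(pairs):&lt;br /&gt;
    partial = Counter()&lt;br /&gt;
    for word, count in pairs:&lt;br /&gt;
        partial[word] += count&lt;br /&gt;
    return list(partial.items())   # far fewer pairs leave each mapper&lt;br /&gt;
&lt;br /&gt;
def reducer(grouped):&lt;br /&gt;
    return {word: sum(counts) for word, counts in grouped.items()}&lt;br /&gt;
&lt;br /&gt;
docs = ['to be or not to be', 'to do is to be']&lt;br /&gt;
combined = [pair for doc in docs for pair in combiner(mapper(doc))]&lt;br /&gt;
grouped = defaultdict(list)&lt;br /&gt;
for word, count in combined:&lt;br /&gt;
    grouped[word].append(count)&lt;br /&gt;
print(reducer(grouped))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;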
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, whose results are then combined into a final result, is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation and then emits the result; a runnable sketch using a process pool follows the pseudocode below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = calculate(specid id, spec s)&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
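&lt;br /&gt;
A hedged sketch of this pattern using a local process pool, where calculate() merely stands in for the expensive, independent per-specification computation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Each specification is computed independently (map) and the results are summed (reduce).&lt;br /&gt;
from multiprocessing import Pool&lt;br /&gt;
&lt;br /&gt;
def calculate(spec):&lt;br /&gt;
    # placeholder for an expensive, independent computation&lt;br /&gt;
    return spec * spec&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    specs = list(range(10))&lt;br /&gt;
    with Pool(processes=4) as pool:&lt;br /&gt;
        results = pool.map(calculate, specs)   # map step, run in parallel&lt;br /&gt;
    print(sum(results))                        # reduce step&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;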
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends messages to its neighbors, and each neighbor updates its state based on the messages it receives. The iteration is terminated by some condition, such as a fixed number of iterations or a negligible change in state. The Mapper is responsible for emitting a message for each node, using the adjacent node's ID as the key. The Reducer is responsible for recomputing the node's state and rewriting the node with the new state, based on the messages received from its incoming nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be fulfilled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions yield a breadth-first search; a small runnable driver for this pattern appears after the pseudocode.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
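&lt;br /&gt;
A small runnable driver for this pattern is shown below; the toy graph, the source node and the termination test are our own illustration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Iterative MapReduce-style BFS: each round maps messages to neighbours,&lt;br /&gt;
# then reduces each node's state to the minimum distance seen so far.&lt;br /&gt;
INF = float('inf')&lt;br /&gt;
&lt;br /&gt;
graph = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}&lt;br /&gt;
state = {n: (0 if n == 'a' else INF) for n in graph}   # distance from source 'a'&lt;br /&gt;
&lt;br /&gt;
def map_phase(state):&lt;br /&gt;
    messages = []&lt;br /&gt;
    for node, dist in state.items():&lt;br /&gt;
        for neighbour in graph[node]:&lt;br /&gt;
            messages.append((neighbour, dist + 1))   # getMessage(N) = N.State + 1&lt;br /&gt;
    return messages&lt;br /&gt;
&lt;br /&gt;
def reduce_phase(state, messages):&lt;br /&gt;
    new_state = dict(state)&lt;br /&gt;
    for node, dist in messages:&lt;br /&gt;
        new_state[node] = min(new_state[node], dist)   # calculateState = min(...)&lt;br /&gt;
    return new_state&lt;br /&gt;
&lt;br /&gt;
while True:&lt;br /&gt;
    new_state = reduce_phase(state, map_phase(state))&lt;br /&gt;
    if new_state == state:   # stop when no state changes any more&lt;br /&gt;
        break&lt;br /&gt;
    state = new_state&lt;br /&gt;
print(state)   # {'a': 0, 'b': 1, 'c': 1, 'd': 2}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;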
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations; a runnable sketch of the inverted-index example appears after the list.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
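&lt;br /&gt;
For instance, the inverted-index example can be sketched as a runnable, single-process Python illustration (the toy documents are our own):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Inverted index: map emits (word, doc_id); reduce sorts the document ids per word.&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
docs = {1: 'the cat sat', 2: 'the dog sat', 3: 'a cat ran'}&lt;br /&gt;
&lt;br /&gt;
def mapper(doc_id, text):&lt;br /&gt;
    return [(word, doc_id) for word in text.split()]&lt;br /&gt;
&lt;br /&gt;
def reducer(word, doc_ids):&lt;br /&gt;
    return (word, sorted(set(doc_ids)))&lt;br /&gt;
&lt;br /&gt;
grouped = defaultdict(list)&lt;br /&gt;
for doc_id, text in docs.items():&lt;br /&gt;
    for word, d in mapper(doc_id, text):&lt;br /&gt;
        grouped[word].append(d)&lt;br /&gt;
&lt;br /&gt;
index = dict(reducer(w, ids) for w, ids in grouped.items())&lt;br /&gt;
print(index)   # e.g. 'cat' maps to [1, 3]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;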
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occur by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer can provide a simple, functional expression of the algorithm and leave parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is almost similar to that of parallel code written with the P-threads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which P-threads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort. The difficulty is even greater for complex and performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has also drawn criticism. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many other already existing ones. There are programming models similar to MapReduce, such as Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level parallel programming model for parallel and distributed computing; these framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers, and Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide-area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93680</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93680"/>
		<updated>2015-02-13T01:42:26Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* MapReduce1 (MRV1) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in Input&lt;br /&gt;
       Emit Intermediate(w,1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(w, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Problems Solved by MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
The types of problems that a shared-memory MapReduce implementation solves are problems with large files. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad core processors) has the advantage to load large files into memory and outperform a 15-node cluster of similar sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== Distributed Memory Machines ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== LANs ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://www.teradata.com/Teradata-Aster-SQL-MapReduce/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the mapreduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file . They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Large variety of problems are easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementation of Map-Reduce can be scaled to large clusters of machines comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# Restricted programming model puts bounds on the way you implement the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimization in the system are therefore targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
Apache, after Google published the paper on MapReduce and Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;) introduced it's own implementation of the same. The important thing to note here is that Apache made this framework open-source. This framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality.  The idea that network is slow and data plentiful, in a lot of processing frameworks you will bring the data to the computation&lt;br /&gt;
Hadoop brings the computation to the data.  In some cases the data is so large that this is the only processing option.  Data is stored in Hadoop in the filesystem called HDFS.  Map Reduce provides the framework of processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* JobTracker determines appropriate jobs based on how busy the TaskTracker is. &lt;br /&gt;
* TaskTracker forks MapTask which extracts input data and invokes the user provided &amp;quot;map&amp;quot; function which fills in the buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTask completes (all splits are done), the TaskTracker will notify the JobTracker which keeps track of the overall progress of job.&lt;br /&gt;
* When done, the JobTracker notifies TaskTracker to jump to reduce phase. This again follows same method where reduce task is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product did expose some pain points in the MRV1 implementation.  Notably, a heavy processing&lt;br /&gt;
load would make the JobTracker a large bottleneck.  In order to help remove this bottleneck, YARN was implemented.  YARN is an application framework that solely does&lt;br /&gt;
resource management for Hadoop clusters.  Not only can you run MapReduce jobs, you can also place other in-cluster frameworks under YARN resource management,&lt;br /&gt;
which allows you to properly allocate resources across your cluster.  YARN at its simplest is the separation of the work that the JobTracker used to do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports.  However, the execution of the job changes significantly.  YARN does work in units called containers.&lt;br /&gt;
Containers represent a unit of work that can be done on a cluster.  Upon job submission, the ResourceManager allocates a container for the ApplicationMaster.  This ApplicationMaster &lt;br /&gt;
runs on a DataNode in the cluster.  To run the application, the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container.  The ApplicationMaster then &lt;br /&gt;
determines, based on the input splits, the number of map tasks to create.  Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of data and available resources, the ResourceManager decides where to run the map tasks.  The ApplicationMaster then asks the NodeManagers on the assigned nodes to  &lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
====Spark====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that removes some of the inefficiencies and startup latency of MapReduce.  Spark takes greater advantage of available memory on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until code has been distributed to all the nodes.  Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it.  This allows data to be read into memory on a cluster and iterations of an algorithm to run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
====Tez====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop.  Tez is a directed-acyclic-graph (DAG) engine.  Based on the Microsoft Dryad paper, the DAG execution engine allows an application's tasks to be modelled as nodes in a graph.  Like Spark, it improves execution speed and attempts to make more efficient use of available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
====Flink====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop.  Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory  ==&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. The API consists of two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, it is ultimately the task of the user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
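&lt;br /&gt;
The division of labour described above can be mimicked with a short, single-process Python sketch. The field names in the argument structure below are invented for this illustration and do not match the actual ''scheduler_args_t'' header; the sketch only shows the shape of the contract: the user supplies the Splitter, Map, Partition and Reduce functions, and the runtime drives the Map and Reduce stages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Single-process sketch of the Phoenix-style contract (names are illustrative).&lt;br /&gt;
def splitter(data):&lt;br /&gt;
    for unit in data:                    # called once per Map task&lt;br /&gt;
        yield unit&lt;br /&gt;
&lt;br /&gt;
def map_fn(unit):&lt;br /&gt;
    return [(word, 1) for word in unit.split()]&lt;br /&gt;
&lt;br /&gt;
def partition(key, num_reduce):&lt;br /&gt;
    return hash(key) % num_reduce        # all values of one key go to one unit&lt;br /&gt;
&lt;br /&gt;
def reduce_fn(key, values):&lt;br /&gt;
    return (key, sum(values))&lt;br /&gt;
&lt;br /&gt;
scheduler_args = {                       # analogue of the scheduler_args_t structure&lt;br /&gt;
    'data': ['map tasks emit pairs', 'reduce tasks merge pairs'],&lt;br /&gt;
    'num_reduce': 2,&lt;br /&gt;
    'splitter': splitter, 'map': map_fn,&lt;br /&gt;
    'partition': partition, 'reduce': reduce_fn,&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
def run(args):&lt;br /&gt;
    # Map stage: split the input, run Map tasks, partition intermediate pairs by key.&lt;br /&gt;
    units = [dict() for _ in range(args['num_reduce'])]&lt;br /&gt;
    for chunk in args['splitter'](args['data']):&lt;br /&gt;
        for key, value in args['map'](chunk):&lt;br /&gt;
            unit = units[args['partition'](key, args['num_reduce'])]&lt;br /&gt;
            unit.setdefault(key, []).append(value)&lt;br /&gt;
    # Reduce stage: one Reduce call per key; merge the sorted outputs at the end.&lt;br /&gt;
    output = []&lt;br /&gt;
    for unit in units:&lt;br /&gt;
        for key in sorted(unit):&lt;br /&gt;
            output.append(args['reduce'](key, unit[key]))&lt;br /&gt;
    return sorted(output)&lt;br /&gt;
&lt;br /&gt;
print(run(scheduler_args))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;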
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split them across tasks), pointers are manipulated instead of copying the actual pairs, which may be large. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, allowing a substantial number of applications to scale.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Key-value storage is inefficient: containers must provide fast lookup and retrieval over a potentially large data set while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on shared-memory (SMP) machines memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-level optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to the architectural differences, there are the following three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current Available MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# Mars &amp;lt;ref&amp;gt;http://www.cse.ust.hk/gpuqp/Mars.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
# StreamMR &amp;lt;ref&amp;gt;http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
# GPMR &amp;lt;ref&amp;gt;https://code.google.com/p/gpmr/&amp;lt;/ref&amp;gt;&lt;br /&gt;
# MapCG &amp;lt;ref&amp;gt;https://code.google.com/p/mapcg/source/browse/trunk/README&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid any conflict between concurrent writes, Mars has a lock-free scheme with low runtime overhead on the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
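&lt;br /&gt;
The two-step design can be illustrated with a small Python sketch: a first pass runs the counting function to learn how many pairs each thread will emit, an exclusive prefix sum turns those counts into non-overlapping write offsets, and the second pass writes results into a single pre-allocated buffer without any locks. This is a single-threaded illustration of the idea only, not Mars code, and the input data is invented.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch of the two-pass, lock-free output scheme (illustration only).&lt;br /&gt;
inputs = [['a', 'b', 'a'], ['c'], ['b', 'b', 'a', 'c']]   # one chunk per GPU thread&lt;br /&gt;
&lt;br /&gt;
def map_count(chunk):&lt;br /&gt;
    return len(chunk)                    # MAP_COUNT analogue: size of the future output&lt;br /&gt;
&lt;br /&gt;
def map_emit(chunk):&lt;br /&gt;
    return [(key, 1) for key in chunk]   # MAP analogue: the actual pairs&lt;br /&gt;
&lt;br /&gt;
# Pass 1: every thread reports how many pairs it will write.&lt;br /&gt;
counts = [map_count(chunk) for chunk in inputs]&lt;br /&gt;
&lt;br /&gt;
# Exclusive prefix sum turns counts into non-overlapping write offsets.&lt;br /&gt;
offsets = [0] * len(counts)&lt;br /&gt;
for i in range(1, len(counts)):&lt;br /&gt;
    offsets[i] = offsets[i - 1] + counts[i - 1]&lt;br /&gt;
&lt;br /&gt;
# Pass 2: every thread writes into its own slice of one pre-allocated buffer.&lt;br /&gt;
output = [None] * sum(counts)&lt;br /&gt;
for tid, chunk in enumerate(inputs):&lt;br /&gt;
    pos = offsets[tid]&lt;br /&gt;
    for pair in map_emit(chunk):&lt;br /&gt;
        output[pos] = pair               # no locks needed: slices never overlap&lt;br /&gt;
        pos += 1&lt;br /&gt;
&lt;br /&gt;
print(output)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;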
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: log files or HTTP pages. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach makes the mapper emit a large number of dummy counters. A way to clean this up is to have the mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated over more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
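&lt;br /&gt;
The three variants above can be condensed into one runnable, single-process Python sketch. It is an illustration only: the shuffle that a real framework performs between mappers and reducers is simulated here with a dictionary, and the print statements simply show how much the combiner shrinks the intermediate data.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Single-process illustration of mapper / combiner / reducer for word counting.&lt;br /&gt;
documents = ['to be or not to be', 'to map and to reduce', 'to be mapped']&lt;br /&gt;
&lt;br /&gt;
def mapper(doc):&lt;br /&gt;
    return [(term, 1) for term in doc.split()]&lt;br /&gt;
&lt;br /&gt;
def combiner(pairs):&lt;br /&gt;
    partial = {}&lt;br /&gt;
    for term, count in pairs:&lt;br /&gt;
        partial[term] = partial.get(term, 0) + count   # per-mapper partial sums&lt;br /&gt;
    return list(partial.items())&lt;br /&gt;
&lt;br /&gt;
def reducer(term, counts):&lt;br /&gt;
    return (term, sum(counts))&lt;br /&gt;
&lt;br /&gt;
raw = [mapper(doc) for doc in documents]&lt;br /&gt;
combined = [combiner(pairs) for pairs in raw]&lt;br /&gt;
print('intermediate pairs without combiner:', sum(len(p) for p in raw))&lt;br /&gt;
print('intermediate pairs with combiner:', sum(len(p) for p in combined))&lt;br /&gt;
&lt;br /&gt;
# Simulated shuffle: group intermediate pairs by key, then reduce each group.&lt;br /&gt;
grouped = {}&lt;br /&gt;
for pairs in combined:&lt;br /&gt;
    for term, count in pairs:&lt;br /&gt;
        grouped.setdefault(term, []).append(count)&lt;br /&gt;
print(sorted(reducer(term, counts) for term, counts in grouped.items()))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;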
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, computed independently, and then combined for a final result is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and emits its result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance to other nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed iteratively. On each iteration, a node sends messages to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration terminates on some condition, such as a fixed number of iterations or negligible change in state. The Mapper is responsible for emitting a message for each node, using the adjacent node's ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with the new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be covered by this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
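&lt;br /&gt;
Putting the generic pattern and the breadth-first-search definitions together, the following single-process Python sketch runs the iteration until distances stop changing. The graph and node names are invented for illustration; a real deployment would run each iteration as a separate MapReduce job.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Single-process simulation of iterative MapReduce breadth-first search (illustration only).&lt;br /&gt;
INF = float('inf')&lt;br /&gt;
graph = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}   # adjacency lists&lt;br /&gt;
state = {'a': 0, 'b': INF, 'c': INF, 'd': INF}               # distance from source node 'a'&lt;br /&gt;
&lt;br /&gt;
def get_message(node):&lt;br /&gt;
    return state[node] + 1            # getMessage(N): my distance plus one hop&lt;br /&gt;
&lt;br /&gt;
def calculate_state(old, messages):&lt;br /&gt;
    return min([old] + messages)      # calculateState: keep the smallest distance seen&lt;br /&gt;
&lt;br /&gt;
changed = True&lt;br /&gt;
while changed:                        # one MapReduce job per iteration&lt;br /&gt;
    # Map: every node sends a message to each outgoing neighbour.&lt;br /&gt;
    inbox = {node: [] for node in graph}&lt;br /&gt;
    for node, neighbours in graph.items():&lt;br /&gt;
        for m in neighbours:&lt;br /&gt;
            inbox[m].append(get_message(node))&lt;br /&gt;
    # Reduce: every node recomputes its state from the incoming messages.&lt;br /&gt;
    changed = False&lt;br /&gt;
    for node, messages in inbox.items():&lt;br /&gt;
        new_state = calculate_state(state[node], messages)&lt;br /&gt;
        if new_state != state[node]:&lt;br /&gt;
            state[node] = new_state&lt;br /&gt;
            changed = True&lt;br /&gt;
&lt;br /&gt;
print(state)   # {'a': 0, 'b': 1, 'c': 1, 'd': 2}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;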
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
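&lt;br /&gt;
As one concrete instance of the examples above, a minimal single-process Python sketch of the inverted-index pattern might look as follows; the document names and contents are invented for illustration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Single-process sketch of the inverted-index pattern (illustration only).&lt;br /&gt;
documents = {'doc1': 'map reduce on clusters', 'doc2': 'reduce network traffic', 'doc3': 'map tasks'}&lt;br /&gt;
&lt;br /&gt;
def mapper(doc_id, text):&lt;br /&gt;
    return [(word, doc_id) for word in text.split()]     # emit (word, document ID) pairs&lt;br /&gt;
&lt;br /&gt;
def reducer(word, doc_ids):&lt;br /&gt;
    return (word, sorted(set(doc_ids)))                  # (word, sorted list of document IDs)&lt;br /&gt;
&lt;br /&gt;
# Simulated shuffle: group intermediate pairs by word.&lt;br /&gt;
grouped = {}&lt;br /&gt;
for doc_id, text in documents.items():&lt;br /&gt;
    for word, value in mapper(doc_id, text):&lt;br /&gt;
        grouped.setdefault(word, []).append(value)&lt;br /&gt;
&lt;br /&gt;
index = dict(reducer(word, ids) for word, ids in grouped.items())&lt;br /&gt;
print(index['map'])      # ['doc1', 'doc3']&lt;br /&gt;
print(index['reduce'])   # ['doc1', 'doc2']&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;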
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors. Phoenix automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is comparable to that of parallel code written directly with the Pthreads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, even more so for complex and performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that this technology is similar to many already existing ones. There are programming models similar to MapReduce, such as Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level parallel programming model for parallel and distributed computing, and skeleton libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide-area-network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop and includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts, and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93679</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93679"/>
		<updated>2015-02-13T01:40:17Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Apache’s Hadoop MapReduce */ touch up description&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Problems Solved by MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
The types of problems that a shared-memory MapReduce implementation solves are problems with large files. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) can load large files into memory and outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== Distributed Memory Machines ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== LANs ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://www.teradata.com/Teradata-Aster-SQL-MapReduce/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
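&lt;br /&gt;
The partitioning in step 4 can be illustrated with a few lines of Python: each intermediate pair is routed to one of the R regions by hashing its key, so within a run every occurrence of a key lands in the region read by the same reduce worker. The value of R and the pairs below are illustrative only.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustration of partitioning intermediate pairs into R regions: hash(key) mod R.&lt;br /&gt;
R = 4                                          # number of reduce tasks, chosen by the user&lt;br /&gt;
pairs = [('apple', 1), ('pear', 1), ('apple', 1), ('plum', 1)]&lt;br /&gt;
&lt;br /&gt;
regions = [[] for _ in range(R)]&lt;br /&gt;
for key, value in pairs:&lt;br /&gt;
    regions[hash(key) % R].append((key, value))   # same key, same region, same reducer&lt;br /&gt;
&lt;br /&gt;
for index, region in enumerate(regions):&lt;br /&gt;
    print('region', index, region)             # each region becomes one reduce worker's input&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;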
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
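&lt;br /&gt;
A minimal Python sketch of this bookkeeping might look like the following. The field and method names are invented for illustration; Google's actual implementation is not public, so this only mirrors the description above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch of the master's bookkeeping as described above (names invented for illustration).&lt;br /&gt;
IDLE, IN_PROGRESS, COMPLETED = 'idle', 'in-progress', 'completed'&lt;br /&gt;
&lt;br /&gt;
class MasterState:&lt;br /&gt;
    def __init__(self, num_map, num_reduce):&lt;br /&gt;
        self.map_tasks = {i: (IDLE, None) for i in range(num_map)}       # task id: (state, worker)&lt;br /&gt;
        self.reduce_tasks = {i: (IDLE, None) for i in range(num_reduce)}&lt;br /&gt;
        self.map_outputs = {}    # completed map task id: list of (region, location, size)&lt;br /&gt;
&lt;br /&gt;
    def map_completed(self, task_id, worker, regions):&lt;br /&gt;
        self.map_tasks[task_id] = (COMPLETED, worker)&lt;br /&gt;
        self.map_outputs[task_id] = regions      # pushed incrementally to in-progress reducers&lt;br /&gt;
&lt;br /&gt;
master = MasterState(num_map=3, num_reduce=2)&lt;br /&gt;
master.map_completed(0, worker='worker-7',&lt;br /&gt;
                     regions=[(0, '/local/m0-r0', 128), (1, '/local/m0-r1', 96)])&lt;br /&gt;
print(master.map_tasks[0], master.map_outputs[0])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;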
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementation of Map-Reduce can be scaled to large clusters of machines comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model places bounds on the way applications can be expressed within the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the framework. The important thing to note here is that Apache made this framework open-source. This framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(For a general overview, see the Apache Hadoop article.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. Because the network is slow and the data plentiful, many processing frameworks bring the data to the computation; Hadoop instead brings the computation to the data. In some cases the data is so large that this is the only practical option. Data is stored in Hadoop in the filesystem called HDFS, and MapReduce provides the framework for processing the data in HDFS.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker chooses appropriate tasks based on how busy each TaskTracker is. &lt;br /&gt;
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function; the function fills a buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This follows the same method, where a ReduceTask is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product did expose some pain points in the MRV1 implementation.  Notably, a heavy processing&lt;br /&gt;
load would make the JobTracker a large bottleneck.  In order to help remove this bottleneck, YARN was implemented.  YARN is an application framework that solely does&lt;br /&gt;
resource management for Hadoop clusters.  Not only can you run MapReduce jobs, you can also place other in-cluster frameworks under YARN resource management,&lt;br /&gt;
which allows you to properly allocate resources across your cluster.  YARN at its simplest is the separation of the work that the JobTracker used to do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports.  However, the execution of the job changes significantly.  YARN does work in units called containers.&lt;br /&gt;
Containers represent a unit of work that can be done on a cluster.  Upon job submission, the ResourceManager allocates a container for the ApplicationMaster.  This ApplicationMaster &lt;br /&gt;
runs on a DataNode in the cluster.  To run the application, the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container.  The ApplicationMaster then &lt;br /&gt;
determines, based on the input splits, the number of map tasks to create.  Once this information is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of data and available resources, the ResourceManager decides where to run the map tasks.  The ApplicationMaster then asks the NodeManagers on the assigned nodes to  &lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
====Spark====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that removes some of the inefficiencies and startup latency of MapReduce.  Spark takes greater advantage of available memory on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until code has been distributed to all the nodes.  Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it.  This allows data to be read into memory on a cluster and iterations of an algorithm to run over the same data in memory instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
====Tez====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop.  Tez is a directed-acyclic-graph (DAG) engine.  Based on the Microsoft Dryad paper, the DAG execution engine allows an application's tasks to be modelled as nodes in a graph.  Like Spark, it improves execution speed and attempts to make more efficient use of available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
====Flink====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt to build a more efficient computation engine that can sit on top of Apache Hadoop.  Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory  ==&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. The API consists of two sets of functions. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, it is ultimately the task of the user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split them across tasks), pointers are manipulated instead of copying the actual pairs, which may be large. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, allowing a substantial number of applications to scale.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Key-value storage is inefficient: containers must provide fast lookup and retrieval over a potentially large data set while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on shared-memory (SMP) machines memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-level optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current Available MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# Mars &amp;lt;ref&amp;gt;http://www.cse.ust.hk/gpuqp/Mars.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
# StreamMR &amp;lt;ref&amp;gt;http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
# GPMR &amp;lt;ref&amp;gt;https://code.google.com/p/gpmr/&amp;lt;/ref&amp;gt;&lt;br /&gt;
# MapCG &amp;lt;ref&amp;gt;https://code.google.com/p/mapcg/source/browse/trunk/README&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
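&lt;br /&gt;
To make the two-step output more concrete, the sketch below (in Python, not the actual C/C++ Mars code; all names and the data layout are invented for illustration) shows one common way such a count-then-write scheme can avoid write conflicts: the per-thread counts from the first pass are turned into non-conflicting write offsets with an exclusive prefix sum, an implementation detail assumed here rather than taken from the text.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Illustrative sketch of two-step output (not actual Mars code).&lt;br /&gt;
# Pass 1: each 'thread' only counts how many results it will emit.&lt;br /&gt;
def map_count(record):&lt;br /&gt;
    return len(record.split())          # e.g. one result per word&lt;br /&gt;
&lt;br /&gt;
# Exclusive prefix sum: converts per-thread counts into write offsets.&lt;br /&gt;
def exclusive_scan(counts):&lt;br /&gt;
    offsets, total = [], 0&lt;br /&gt;
    for c in counts:&lt;br /&gt;
        offsets.append(total)&lt;br /&gt;
        total += c&lt;br /&gt;
    return offsets, total&lt;br /&gt;
&lt;br /&gt;
records = ['a rose is a rose', 'to be or not to be']&lt;br /&gt;
counts = [map_count(r) for r in records]&lt;br /&gt;
offsets, total = exclusive_scan(counts)&lt;br /&gt;
&lt;br /&gt;
# Pass 2: the output buffer is pre-allocated; each 'thread' writes its&lt;br /&gt;
# results starting at its own offset, so no two writes conflict.&lt;br /&gt;
output = [None] * total&lt;br /&gt;
for tid, record in enumerate(records):&lt;br /&gt;
    pos = offsets[tid]&lt;br /&gt;
    for word in record.split():&lt;br /&gt;
        output[pos] = (word, 1)&lt;br /&gt;
        pos += 1&lt;br /&gt;
print(output)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;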
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or an HTTP page, for example. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counters may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts and then combined for a final result is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation and then emits the results.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance to other nodes, an indicator of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it receives. The iteration is terminated by some condition, such as a fixed number of iterations or negligible changes in state. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with its new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be addressed with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
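&lt;br /&gt;
As a concrete illustration of the last example above, the following small Python sketch (illustrative only; a real deployment would use a framework such as Hadoop, and the function names here are invented) builds a simple inverted index with a map and a reduce function, with the shuffle step written out by hand.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
&lt;br /&gt;
# Map: emit (word, document ID) for every word in the document.&lt;br /&gt;
def map_invert(doc_id, text):&lt;br /&gt;
    return [(word, doc_id) for word in text.split()]&lt;br /&gt;
&lt;br /&gt;
# Reduce: collect and sort the document IDs for one word.&lt;br /&gt;
def reduce_invert(word, doc_ids):&lt;br /&gt;
    return (word, sorted(set(doc_ids)))&lt;br /&gt;
&lt;br /&gt;
documents = {1: 'map reduce on clusters', 2: 'reduce memory traffic'}&lt;br /&gt;
&lt;br /&gt;
# Shuffle step: group intermediate pairs by key (word).&lt;br /&gt;
grouped = defaultdict(list)&lt;br /&gt;
for doc_id, text in documents.items():&lt;br /&gt;
    for word, d in map_invert(doc_id, text):&lt;br /&gt;
        grouped[word].append(d)&lt;br /&gt;
&lt;br /&gt;
index = dict(reduce_invert(w, ids) for w, ids in grouped.items())&lt;br /&gt;
print(index)   # e.g. the key 'reduce' maps to [1, 2]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;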
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written directly with the Pthreads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort. The difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has attracted criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many that already existed. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level programming model for parallel and distributed computing, and skeleton frameworks and libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93678</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93678"/>
		<updated>2015-02-13T01:29:48Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Apache’s Hadoop MapReduce */  flink tez sections&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
    Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Problems Solved by MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
The types of problems that a shared-memory MapReduce implementation solves are problems with large files. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) has the advantage of loading large files into memory and can outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully-distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== Distributed Memory Machines ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== LANs ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://www.teradata.com/Teradata-Aster-SQL-MapReduce/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
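&lt;br /&gt;
The partitioning function mentioned above (e.g., hash(key) mod R) is what decides which of the ''R'' reduce tasks receives a given intermediate key. A minimal Python sketch of this routing step, with names invented purely for illustration:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Route each intermediate key to one of R reduce partitions.&lt;br /&gt;
# All pairs with the same key always land in the same partition.&lt;br /&gt;
def partition(key, R):&lt;br /&gt;
    return hash(key) % R&lt;br /&gt;
&lt;br /&gt;
R = 4&lt;br /&gt;
pairs = [('cat', 1), ('dog', 1), ('cat', 1)]&lt;br /&gt;
regions = {r: [] for r in range(R)}&lt;br /&gt;
for key, value in pairs:&lt;br /&gt;
    regions[partition(key, R)].append((key, value))&lt;br /&gt;
# Every occurrence of 'cat' is now buffered in the same region,&lt;br /&gt;
# so exactly one reduce worker will see all of its values.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;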
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
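&lt;br /&gt;
A tiny sketch of the bookkeeping described above (illustrative Python only; Google's implementation is not public, and these structures and field names are invented for illustration):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from dataclasses import dataclass, field&lt;br /&gt;
&lt;br /&gt;
IDLE, IN_PROGRESS, COMPLETED = 'idle', 'in-progress', 'completed'&lt;br /&gt;
&lt;br /&gt;
@dataclass&lt;br /&gt;
class MapTask:&lt;br /&gt;
    state: str = IDLE&lt;br /&gt;
    worker: str = ''               # identity of the worker machine&lt;br /&gt;
    # For a completed map task: location and size of each of the R&lt;br /&gt;
    # intermediate file regions it produced on local disk.&lt;br /&gt;
    regions: list = field(default_factory=list)&lt;br /&gt;
&lt;br /&gt;
@dataclass&lt;br /&gt;
class ReduceTask:&lt;br /&gt;
    state: str = IDLE&lt;br /&gt;
    worker: str = ''&lt;br /&gt;
&lt;br /&gt;
# The master pushes region locations of completed map tasks&lt;br /&gt;
# incrementally to workers running in-progress reduce tasks.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;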
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
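&lt;br /&gt;
The asymmetry between map and reduce tasks on a worker failure can be summarized in a short sketch (illustrative Python with invented structures; the real system is far more involved):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def handle_worker_failure(tasks, failed_worker):&lt;br /&gt;
    # tasks: list of objects assumed to have .kind ('map' or 'reduce'),&lt;br /&gt;
    # .state ('idle', 'in-progress', 'completed') and .worker fields.&lt;br /&gt;
    for t in tasks:&lt;br /&gt;
        if t.worker != failed_worker:&lt;br /&gt;
            continue&lt;br /&gt;
        if t.state == 'in-progress':&lt;br /&gt;
            t.state = 'idle'          # reschedule on another worker&lt;br /&gt;
        elif t.state == 'completed' and t.kind == 'map':&lt;br /&gt;
            # Map output lives on the failed machine's local disk,&lt;br /&gt;
            # so completed map tasks must be re-executed.&lt;br /&gt;
            t.state = 'idle'&lt;br /&gt;
        # Completed reduce tasks keep their state: their output&lt;br /&gt;
        # is already stored in the global file system.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;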
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementations of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on the way computations can be expressed. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same ideas. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. Because the network is slow and the data plentiful, many processing frameworks bring the data to the processing; Hadoop instead brings the computation to the data. In some cases the data is so large that this is the only practical option. MapReduce provides the framework for processing data on a Hadoop cluster that is stored in the Hadoop Distributed File System (HDFS).&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker determines appropriate tasks based on how busy each TaskTracker is. &lt;br /&gt;
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function; the map function fills a buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This follows the same method: a ReduceTask is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
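&lt;br /&gt;
As an illustration of the “pull” model underlying the flow above, a TaskTracker's main loop amounts to repeatedly asking the JobTracker for work. The sketch below is simplified Python with invented names (heartbeat, fork_and_run, and the placeholder helpers), not Hadoop's actual Java code.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import time&lt;br /&gt;
&lt;br /&gt;
def task_tracker_loop(job_tracker, num_slots):&lt;br /&gt;
    # A TaskTracker periodically polls (heartbeats) the JobTracker,&lt;br /&gt;
    # reporting free slots and receiving map or reduce tasks to run.&lt;br /&gt;
    while True:&lt;br /&gt;
        free = num_slots - count_running_tasks()&lt;br /&gt;
        tasks = job_tracker.heartbeat(free_slots=free)&lt;br /&gt;
        for task in tasks:&lt;br /&gt;
            fork_and_run(task)        # MapTask or ReduceTask&lt;br /&gt;
        time.sleep(3)                 # heartbeat interval&lt;br /&gt;
&lt;br /&gt;
def count_running_tasks():&lt;br /&gt;
    return 0                          # placeholder for illustration&lt;br /&gt;
&lt;br /&gt;
def fork_and_run(task):&lt;br /&gt;
    pass                              # placeholder for illustration&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;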
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 implementation. Notably, a heavy processing load would cause the JobTracker to become a large bottleneck. In order to remove this bottleneck, YARN was introduced. YARN is an application framework that does resource management for Hadoop clusters. Not only can you run MapReduce jobs, you can also put other in-cluster frameworks under YARN resource management, allowing resources to be allocated properly across the cluster. At its simplest, YARN separates the work that the JobTracker used to do into two new processes: the resource manager (ResourceManager) and the per-job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports. However, the execution of a job changes significantly. YARN does work in units called containers, which represent a unit of work that can be done on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a DataNode in the cluster: the ResourceManager asks a NodeManager to launch the ApplicationMaster in that container. The ApplicationMaster then determines, based on the input splits, the number of map tasks to create. Once this is known, the ApplicationMaster requests the container resources from the ResourceManager. Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster asks the NodeManagers on the assigned nodes to start them.&lt;br /&gt;
&lt;br /&gt;
====Spark====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that removes some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until the code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. Data can be read into memory across the cluster, and iterations of an algorithm can run over the same in-memory data instead of reading it from disk repeatedly.&lt;br /&gt;
&lt;br /&gt;
====Tez====&lt;br /&gt;
Tez, like Spark, is a second-generation computation framework for Apache Hadoop. Tez is a directed-acyclic-graph (DAG) engine. Based on the Microsoft Dryad paper, the DAG execution engine allows applications to express tasks as nodes in a graph. Like Spark, it offers gains in execution speed and attempts to make more efficient use of the available resources on the cluster.&lt;br /&gt;
&lt;br /&gt;
====Flink====&lt;br /&gt;
Flink, like Spark and Tez, is another attempt at a more efficient computation engine that can sit on top of Apache Hadoop. Flink is also a DAG processor that attempts to reduce latency and make better use of available resources.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce for Shared Memory  ==&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), pointers are manipulated instead of copying the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the emit-intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, helping a substantial number of applications scale well.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Key-value storage in shared memory is inefficient: containers must provide fast lookup and retrieval over potentially large data sets while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce this allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables certain user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk size to improve performance, the framework can no longer freely adjust that size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current Available MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# Mars &amp;lt;ref&amp;gt;http://www.cse.ust.hk/gpuqp/Mars.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
# StreamMR &amp;lt;ref&amp;gt;http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
# GPMR &amp;lt;ref&amp;gt;https://code.google.com/p/gpmr/&amp;lt;/ref&amp;gt;&lt;br /&gt;
# MapCG &amp;lt;ref&amp;gt;https://code.google.com/p/mapcg/source/browse/trunk/README&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or an HTTP page, for example. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms within its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counters may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts and then combined for a final result is a standard Map-Reduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation and then emits the results.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is to calculate a state for each node using the properties of its neighbors. This state can be the distance to other nodes, an indicator of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it receives. The iteration is terminated by some condition, such as a fixed number of iterations or negligible changes in state. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with its new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be addressed with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
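&lt;br /&gt;
For instance, the inverted-index example above can be sketched in a few lines of Python; the mapper and reducer are shown as ordinary functions over in-memory data, and the shuffle step is simulated with a dictionary (the names and the sample documents are illustrative, not part of any particular framework).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal sketch of the inverted-index pattern described above.&lt;br /&gt;
def map_fn(doc_id, text):&lt;br /&gt;
    # emit a (word, document ID) pair for every word in the document&lt;br /&gt;
    return [(word, doc_id) for word in text.split()]&lt;br /&gt;
&lt;br /&gt;
def reduce_fn(word, doc_ids):&lt;br /&gt;
    # sort the document IDs and emit a (word, list of document IDs) pair&lt;br /&gt;
    return (word, sorted(set(doc_ids)))&lt;br /&gt;
&lt;br /&gt;
docs = {1: 'map reduce on clusters', 2: 'the reduce phase sorts keys'}&lt;br /&gt;
shuffle = {}                                   # simulated grouping by key&lt;br /&gt;
for doc_id, text in docs.items():&lt;br /&gt;
    for word, d in map_fn(doc_id, text):&lt;br /&gt;
        shuffle.setdefault(word, []).append(d)&lt;br /&gt;
index = [reduce_fn(w, ids) for w, ids in sorted(shuffle.items())]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;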
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce, uses shared memory and minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance on both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is almost similar to that of parallel code written with the Pthreads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded the patent for MapReduce, but it can be argued that this technology is similar to many other already existing ones. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level programming model for parallel and distributed computing, and skeleton framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is notable for its ability to operate in a wide area network (WAN) setting. The Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93677</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93677"/>
		<updated>2015-02-13T01:21:49Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: /* Spark */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in Input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Problems Solved by MapReduce =&lt;br /&gt;
&lt;br /&gt;
=== MapReduce for Shared-Memory Machines ===&lt;br /&gt;
&lt;br /&gt;
A shared-memory MapReduce implementation is well suited to problems involving large files. A shared-memory MapReduce implementation on a large machine (24 GB of RAM, two quad-core processors) has the advantage of loading large files into memory, and it can outperform a 15-node cluster of similarly sized nodes &amp;lt;ref&amp;gt;http://www.vldb.org/pvldb/vol6/p1354-kumar.pdf&amp;lt;/ref&amp;gt;. If the dataset can fit into memory, running a fully distributed MapReduce cluster like Hadoop is inefficient.&lt;br /&gt;
&lt;br /&gt;
=== Distributed Memory Machines ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://mapreduce.sandia.gov/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== LANs ===&lt;br /&gt;
To be continued &amp;lt;ref&amp;gt;http://www.teradata.com/Teradata-Aster-SQL-MapReduce/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
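&lt;br /&gt;
To make the partitioning step concrete, the following is a minimal Python sketch of how one map worker's buffered intermediate pairs can be split into ''R'' regions with the default partitioning function hash(key) mod R described above; the variable names are illustrative, and a production system would use a stable, deterministic hash rather than Python's built-in one.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
R = 4  # number of reduce tasks, chosen by the user&lt;br /&gt;
&lt;br /&gt;
def partition(key, r=R):&lt;br /&gt;
    # default partitioning function: hash(key) mod R&lt;br /&gt;
    return hash(key) % r&lt;br /&gt;
&lt;br /&gt;
# buffered intermediate pairs produced by one map worker&lt;br /&gt;
intermediate = [('apple', 1), ('pear', 1), ('apple', 1), ('plum', 1)]&lt;br /&gt;
&lt;br /&gt;
regions = [[] for _ in range(R)]&lt;br /&gt;
for key, value in intermediate:&lt;br /&gt;
    regions[partition(key)].append((key, value))&lt;br /&gt;
# every pair with the same key lands in the same region, so all of its&lt;br /&gt;
# values are later read by exactly one reduce worker&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;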
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as MapReduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementation of Map-Reduce can be scaled to large clusters of machines comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on the way problems can be expressed in the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the framework. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. The network is slow and the data are plentiful, so while many processing frameworks bring the data to the processing, Hadoop brings the computation to the data. In some cases the data are so large that this is the only practical option. MapReduce provides the framework for processing data on a Hadoop cluster that is stored in the Hadoop Distributed File System (HDFS).&lt;br /&gt;
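&lt;br /&gt;
To illustrate what a Hadoop MapReduce job looks like from the programmer's point of view, the following is a minimal word-count sketch written as Hadoop Streaming-style filters: the mapper and reducer read from standard input and write tab-separated key/value lines to standard output, while the framework handles the shuffle and sort between them. The script names are illustrative and the streaming invocation itself is omitted.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# mapper.py -- emit one 'word \t 1' line for every word read from stdin&lt;br /&gt;
import sys&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    for word in line.split():&lt;br /&gt;
        print('%s\t%d' % (word, 1))&lt;br /&gt;
&lt;br /&gt;
# reducer.py -- input arrives sorted by key, so counts can be summed&lt;br /&gt;
# over each run of equal keys&lt;br /&gt;
import sys&lt;br /&gt;
current, total = None, 0&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    word, count = line.rsplit('\t', 1)&lt;br /&gt;
    if word != current:&lt;br /&gt;
        if current is not None:&lt;br /&gt;
            print('%s\t%d' % (current, total))&lt;br /&gt;
        current, total = word, 0&lt;br /&gt;
    total += int(count)&lt;br /&gt;
if current is not None:&lt;br /&gt;
    print('%s\t%d' % (current, total))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;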
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The Jobtracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* JobTracker determines appropriate jobs based on how busy the TaskTracker is. &lt;br /&gt;
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function, filling a buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTrackers to move to the reduce phase, which follows the same method: a ReduceTask is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 design. Notably, a heavy processing load would make the JobTracker a large bottleneck. YARN was implemented to help remove this bottleneck. YARN is an application framework that does only resource management for Hadoop clusters: not only can MapReduce jobs run under it, but other in-cluster frameworks can also be placed under YARN resource management, allowing resources to be allocated properly across the cluster. At its simplest, YARN separates the work that the JobTracker used to do into two new processes: the resource manager (ResourceManager) and the job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change imports; however, the execution of a job changes significantly. YARN does its work in units called containers, which represent a unit of work that can be done on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which runs on a DataNode in the cluster; the ResourceManager requests that a NodeManager launch the ApplicationMaster in that container. The ApplicationMaster then determines, based on the input splits, the number of map tasks to create. Once this information is known, the ApplicationMaster requests container resources from the ResourceManager. Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks, and the ApplicationMaster asks the NodeManagers on the assigned nodes to start them.&lt;br /&gt;
&lt;br /&gt;
====Spark====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that helps remove some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until its code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
&lt;br /&gt;
Due to the in-memory nature of Spark, a good number of machine learning frameworks are being built on top of it. This allows data to be read into memory on a cluster and iterations of an algorithm to be run over the same in-memory data instead of reading it from disk repeatedly.&lt;br /&gt;
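&lt;br /&gt;
For comparison, a word count expressed in Spark's Python API (PySpark) is a short pipeline; this is a minimal sketch and the input and output paths are illustrative.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from pyspark import SparkContext&lt;br /&gt;
&lt;br /&gt;
sc = SparkContext(appName='WordCount')&lt;br /&gt;
counts = (sc.textFile('hdfs:///data/docs')        # read input lines&lt;br /&gt;
            .flatMap(lambda line: line.split())   # one word per record&lt;br /&gt;
            .map(lambda word: (word, 1))          # emit (word, 1) pairs&lt;br /&gt;
            .reduceByKey(lambda a, b: a + b))     # sum the counts per word&lt;br /&gt;
counts.saveAsTextFile('hdfs:///data/word_counts') # write the results&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;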
&lt;br /&gt;
== Map Reduce for Shared Memory  ==&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
Phoenix &amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
*After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
==== Buffer Management ====&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split them across tasks), pointers are manipulated instead of copying the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
==== Pros and Cons ====&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Inefficient key-value storage: the shared-memory containers must provide fast lookup and retrieval over a potentially large data set, all the while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Exposed task chunking: Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead, which enables user-implemented optimizations. However, this design has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU: &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current Available MapReduce Implementations for GPU ===&lt;br /&gt;
&lt;br /&gt;
Several implementations of MapReduce for the GPU exist:&lt;br /&gt;
# Mars &amp;lt;ref&amp;gt;http://www.cse.ust.hk/gpuqp/Mars.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
# StreamMR &amp;lt;ref&amp;gt;http://synergy.cs.vt.edu/pubs/papers/elteir-icpads11-steammr.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
# GPMR &amp;lt;ref&amp;gt;https://code.google.com/p/gpmr/&amp;lt;/ref&amp;gt;&lt;br /&gt;
# MapCG &amp;lt;ref&amp;gt;https://code.google.com/p/mapcg/source/browse/trunk/README&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead that guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
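&lt;br /&gt;
The following is a minimal Python sketch of how such a two-step output scheme can avoid write conflicts: each task first reports only the size of its output, an exclusive prefix sum over those sizes gives each task a private write offset, and a second pass then writes into a preallocated buffer without locks. The helper names and the prefix-sum bookkeeping are illustrative assumptions, not Mars' actual code.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from itertools import accumulate&lt;br /&gt;
&lt;br /&gt;
def two_step_emit(tasks, count_fn, emit_fn):&lt;br /&gt;
    sizes = [count_fn(t) for t in tasks]            # step 1: sizes only&lt;br /&gt;
    offsets = [0] + list(accumulate(sizes))[:-1]    # exclusive prefix sum&lt;br /&gt;
    out = [None] * sum(sizes)                       # preallocated output&lt;br /&gt;
    for task, off in zip(tasks, offsets):           # step 2: conflict-free writes&lt;br /&gt;
        for i, rec in enumerate(emit_fn(task)):&lt;br /&gt;
            out[off + i] = rec&lt;br /&gt;
    return out&lt;br /&gt;
&lt;br /&gt;
# example: each 'task' emits one (word, 1) pair per word in its document&lt;br /&gt;
docs = ['a rose is a rose', 'map reduce']&lt;br /&gt;
pairs = two_step_emit(docs,&lt;br /&gt;
                      count_fn=lambda d: len(d.split()),&lt;br /&gt;
                      emit_fn=lambda d: [(w, 1) for w in d.split()])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;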
&lt;br /&gt;
= Examples =&lt;br /&gt;
&lt;br /&gt;
== Basic MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
=== Counting and Summing ===&lt;br /&gt;
&lt;br /&gt;
Suppose you wanted to count the number of occurrences of each word in a set of documents. The documents could be anything: a log file or a web page. &lt;br /&gt;
&lt;br /&gt;
The simplest approach is to emit a &amp;quot;1&amp;quot; for each term that a document contains and then have the reducer add them up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
      method Map(docid id, doc d)&lt;br /&gt;
         for all term t in doc d do&lt;br /&gt;
            Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, this approach requires the mapper to emit a large number of dummy counters. A way to clean this up is to make the mapper count the terms for its own document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      H = new AssociativeArray&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
          H{t} = H{t} + 1&lt;br /&gt;
      for all term t in H do&lt;br /&gt;
         Emit(term t, count H{t})&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To expand on this idea, it is better to use a combiner so that counts may be accumulated across more than one document.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(docid id, doc d)&lt;br /&gt;
      for all term t in doc d do&lt;br /&gt;
         Emit(term t, count 1)&lt;br /&gt;
 &lt;br /&gt;
class Combiner&lt;br /&gt;
   method Combine(term t, [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(term t, counts [c1, c2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all count c in [c1, c2,...] do&lt;br /&gt;
          sum = sum + c&lt;br /&gt;
      Emit(term t, count sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Distributed Task Execution ===&lt;br /&gt;
&lt;br /&gt;
A large computational problem that can be divided into equal parts, with the partial results then combined into a final result, is a standard MapReduce problem. The problem is split into a set of specifications, and the specifications are stored as input data for the mappers. Each mapper takes a specification, executes the computation, and then emits its result.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(specid id, spec s)&lt;br /&gt;
      result = 0&lt;br /&gt;
      result = calculate(specid id, spec s )&lt;br /&gt;
      Emit(result r)&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(results [r1, r2,...])&lt;br /&gt;
      sum = 0&lt;br /&gt;
      for all result r in [r1, r2,...] do&lt;br /&gt;
          sum = sum + r&lt;br /&gt;
      Emit(result sum)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advanced MapReduce Patterns &amp;lt;ref&amp;gt;https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
=== Graph Processing ===&lt;br /&gt;
&lt;br /&gt;
In a network of entities, relationships exist between nodes. The problem is calculating a state for each node using the properties of its neighbors. This state can be the distance between nodes, a characteristic of density, and so on. Conceptually, MapReduce jobs are performed in an iterative way. On each iteration, a node sends a message to its neighbors, and each neighbor then updates its state based on the messages it received. The iteration is terminated based on some condition, such as a fixed number of iterations or only minor changes in state. The Mapper is responsible for emitting a message for each node, using the adjacent node ID as the key. The Reducer is responsible for recomputing the state and rewriting the node with its new state based on the incoming messages.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Mapper&lt;br /&gt;
   method Map(id n, object N)&lt;br /&gt;
      Emit(id n, object N)&lt;br /&gt;
      for all id m in N.OutgoingRelations do&lt;br /&gt;
         Emit(id m, message getMessage(N))&lt;br /&gt;
 &lt;br /&gt;
class Reducer&lt;br /&gt;
   method Reduce(id m, [s1, s2,...])&lt;br /&gt;
      M = null&lt;br /&gt;
      messages = []&lt;br /&gt;
      for all s in [s1, s2,...] do&lt;br /&gt;
          if IsObject(s) then&lt;br /&gt;
             M = s&lt;br /&gt;
          else               // s is a message&lt;br /&gt;
             messages.add(s)&lt;br /&gt;
      M.State = calculateState(messages)&lt;br /&gt;
      Emit(id m, item M)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By changing the definition of the state object and of the calculateState and getMessage functions, several other use cases can be fulfilled with this pattern, including availability propagation through a category tree and breadth-first search. For instance, the following definitions implement a breadth-first search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class N&lt;br /&gt;
   State is distance, initialized 0 for source node, INFINITY for all other nodes&lt;br /&gt;
 &lt;br /&gt;
method getMessage(N)&lt;br /&gt;
   return N.State + 1&lt;br /&gt;
 &lt;br /&gt;
method calculateState(state s, data [d1, d2,...])&lt;br /&gt;
   return min( [d1, d2,...] )&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
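&lt;br /&gt;
The following is a minimal Python sketch of the breadth-first-search variant above, driving one map/reduce pass per iteration over a small in-memory graph; the graph and the loop structure are illustrative and not tied to any particular framework.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
INF = float('inf')&lt;br /&gt;
&lt;br /&gt;
graph = {                  # adjacency lists; node 'a' is the source&lt;br /&gt;
    'a': ['b', 'c'],&lt;br /&gt;
    'b': ['d'],&lt;br /&gt;
    'c': ['d'],&lt;br /&gt;
    'd': [],&lt;br /&gt;
}&lt;br /&gt;
state = {n: (0 if n == 'a' else INF) for n in graph}&lt;br /&gt;
&lt;br /&gt;
changed = True&lt;br /&gt;
while changed:             # stop when no state changes between iterations&lt;br /&gt;
    # map: each node forwards getMessage(N) = N.State + 1 to its neighbors&lt;br /&gt;
    messages = {n: [] for n in graph}&lt;br /&gt;
    for n, neighbors in graph.items():&lt;br /&gt;
        for m in neighbors:&lt;br /&gt;
            messages[m].append(state[n] + 1)&lt;br /&gt;
    # reduce: each node recomputes its state as the minimum it has seen&lt;br /&gt;
    new_state = {n: min([state[n]] + msgs) for n, msgs in messages.items()}&lt;br /&gt;
    changed = new_state != state&lt;br /&gt;
    state = new_state&lt;br /&gt;
# state now holds the hop count from the source to every reachable node&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;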
&lt;br /&gt;
== Further examples ==&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
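&lt;br /&gt;
As one more concrete illustration, the distributed grep example above can be sketched in a few lines of Python; the pattern and the function names are illustrative.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
PATTERN = re.compile(r'error')   # illustrative search pattern&lt;br /&gt;
&lt;br /&gt;
def map_fn(filename, lines):&lt;br /&gt;
    # map: emit a (filename, line) pair for every line matching the pattern&lt;br /&gt;
    return [(filename, line) for line in lines if PATTERN.search(line)]&lt;br /&gt;
&lt;br /&gt;
def reduce_fn(filename, matching_lines):&lt;br /&gt;
    # reduce: identity, copy the supplied intermediate data to the output&lt;br /&gt;
    return (filename, matching_lines)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;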
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce, uses shared memory and minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance on both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is almost similar to that of parallel code written with the Pthreads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded the patent for MapReduce, but it can be argued that this technology is similar to many other already existing ones. Programming models similar to MapReduce include Algorithmic Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level programming model for parallel and distributed computing, and skeleton framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is notable for its ability to operate in a wide area network (WAN) setting. The Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93645</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93645"/>
		<updated>2015-02-10T03:57:09Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: add a spark section&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in Input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and ''R'' reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
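&lt;br /&gt;
As a rough illustration of the partitioning step described above, the sketch below shows how a function of the form hash(key) mod R routes every intermediate pair with the same key to the same reduce partition. The hash function chosen here is illustrative, not Google's.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import hashlib&lt;br /&gt;
&lt;br /&gt;
R = 4   # number of reduce tasks, chosen by the user&lt;br /&gt;
&lt;br /&gt;
def partition(key, r=R):&lt;br /&gt;
    # Default-style partitioning: hash(key) mod R. A stable hash is used so&lt;br /&gt;
    # the same key always lands in the same partition, on any machine.&lt;br /&gt;
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()&lt;br /&gt;
    return int(digest, 16) % r&lt;br /&gt;
&lt;br /&gt;
regions = {i: [] for i in range(R)}&lt;br /&gt;
for key, value in [('fox', 1), ('dog', 1), ('fox', 1)]:&lt;br /&gt;
    regions[partition(key)].append((key, value))&lt;br /&gt;
# Both ('fox', 1) pairs end up in the same region, so a single reduce task sees them all.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;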
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
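&lt;br /&gt;
A minimal sketch of this bookkeeping, assuming a plain in-memory representation (the field names are illustrative, not taken from Google's implementation):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Per-task state kept by the master: 'idle', 'in-progress', or 'completed'.&lt;br /&gt;
map_tasks = {&lt;br /&gt;
    0: {'state': 'idle', 'worker': None},&lt;br /&gt;
    1: {'state': 'in-progress', 'worker': 'worker-7'},&lt;br /&gt;
}&lt;br /&gt;
reduce_tasks = {&lt;br /&gt;
    0: {'state': 'idle', 'worker': None},&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# For each completed map task, the master records where its R intermediate&lt;br /&gt;
# file regions live, so it can forward the locations to reduce workers.&lt;br /&gt;
intermediate_regions = {}   # maps map_task_id to a list of (location, size) pairs of length R&lt;br /&gt;
&lt;br /&gt;
def complete_map_task(task_id, worker, regions):&lt;br /&gt;
    map_tasks[task_id] = {'state': 'completed', 'worker': worker}&lt;br /&gt;
    intermediate_regions[task_id] = regions&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;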
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
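&lt;br /&gt;
Continuing the bookkeeping sketch from the master data-structures section, failure handling amounts to resetting the affected tasks so the scheduler can hand them out again (again purely illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def handle_worker_failure(failed_worker):&lt;br /&gt;
    # Completed map tasks must be re-run: their output lived on the failed&lt;br /&gt;
    # worker's local disk and is now unreachable.&lt;br /&gt;
    for task_id, task in map_tasks.items():&lt;br /&gt;
        if task['worker'] == failed_worker:&lt;br /&gt;
            map_tasks[task_id] = {'state': 'idle', 'worker': None}&lt;br /&gt;
            intermediate_regions.pop(task_id, None)&lt;br /&gt;
    # In-progress reduce tasks are reset too; completed reduce tasks are not,&lt;br /&gt;
    # because their output already sits in the global file system.&lt;br /&gt;
    for task_id, task in reduce_tasks.items():&lt;br /&gt;
        if task['worker'] == failed_worker and task['state'] == 'in-progress':&lt;br /&gt;
            reduce_tasks[task_id] = {'state': 'idle', 'worker': None}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;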
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementations of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on the kinds of computations that can be expressed and on how they are implemented. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system must be targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the model. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. The idea is that the network is slow and data is plentiful: many processing frameworks bring the data to the processing, whereas&lt;br /&gt;
Hadoop brings the computation to the data. In some cases the data is so large that this is the only practical option. MapReduce provides the framework for processing&lt;br /&gt;
data stored in the Hadoop Distributed File System (HDFS) on a Hadoop cluster.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapReduce (MRV1) is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map tasks or reduce tasks).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* The client program uploads files to the Hadoop Distributed File System (HDFS) and notifies the JobTracker, which in turn returns the job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker decides which tasks are appropriate based on how busy each TaskTracker is. &lt;br /&gt;
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function, which fills a buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed to local disk as two files. &lt;br /&gt;
* After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When the map phase is done, the JobTracker notifies the TaskTrackers to move to the reduce phase, which follows the same pattern: a ReduceTask is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
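&lt;br /&gt;
The TaskTracker/JobTracker flow above is independent of how the user writes the map and reduce code. One common, language-agnostic way to exercise it is Hadoop Streaming, where the map and reduce functions are ordinary programs that read stdin and write stdout. The sketch below is a rough word-count pair; in practice the two functions would live in separate scripts passed to the hadoop-streaming jar, and the exact invocation depends on the cluster setup.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import sys&lt;br /&gt;
&lt;br /&gt;
# mapper.py -- reads input lines from stdin, emits tab-separated (word, 1) pairs.&lt;br /&gt;
def run_mapper():&lt;br /&gt;
    for line in sys.stdin:&lt;br /&gt;
        for word in line.split():&lt;br /&gt;
            sys.stdout.write(word + '\t' + '1' + '\n')&lt;br /&gt;
&lt;br /&gt;
# reducer.py -- Hadoop sorts the mapper output by key before the reducer runs,&lt;br /&gt;
# so all counts for a given word arrive on consecutive lines.&lt;br /&gt;
def run_reducer():&lt;br /&gt;
    current, total = None, 0&lt;br /&gt;
    for line in sys.stdin:&lt;br /&gt;
        word, count = line.rstrip('\n').split('\t')&lt;br /&gt;
        if word != current:&lt;br /&gt;
            if current is not None:&lt;br /&gt;
                sys.stdout.write(current + '\t' + str(total) + '\n')&lt;br /&gt;
            current, total = word, 0&lt;br /&gt;
        total += int(count)&lt;br /&gt;
    if current is not None:&lt;br /&gt;
        sys.stdout.write(current + '\t' + str(total) + '\n')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;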
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 design. Notably, heavy processing&lt;br /&gt;
load could cause the JobTracker to become a large bottleneck. In order to remove this bottleneck, YARN was introduced. YARN is an application framework that is solely responsible for&lt;br /&gt;
resource management on Hadoop clusters. Not only can you run MapReduce jobs, you can also place other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing you to properly allocate resources across your cluster. YARN, at its simplest, is the separation of the work that the JobTracker used to do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the per-job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change their imports. However, the execution of a job changes significantly. YARN does its work in units called containers;&lt;br /&gt;
a container represents a unit of work that can be done on the cluster. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which&lt;br /&gt;
runs on a DataNode in the cluster: the ResourceManager asks a NodeManager on that node to launch the ApplicationMaster in the container. The ApplicationMaster then&lt;br /&gt;
determines, based on the input splits, the number of map tasks to create. Once this is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks. The ApplicationMaster then asks the NodeManagers on the assigned nodes to&lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
====Spark====&lt;br /&gt;
Just as YARN was implemented to address some of the shortcomings of MRV1, Spark is a newer execution framework that removes some of the inefficiencies and startup latency of MapReduce. Spark takes greater advantage of the memory available on the nodes in the cluster and starts job execution immediately, whereas MapReduce waits until the code has been distributed to all the nodes. Spark also adds a number of features to the framework, such as streaming data ingestion and the ability to run SQL queries within applications.&lt;br /&gt;
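&lt;br /&gt;
For comparison with the MRV1 flow above, here is a hedged sketch of the same word count expressed against the Spark RDD API in Python. The input and output paths are illustrative, and the SparkContext setup depends on the cluster.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from pyspark import SparkContext&lt;br /&gt;
&lt;br /&gt;
sc = SparkContext(appName='WordCount')&lt;br /&gt;
&lt;br /&gt;
counts = (sc.textFile('hdfs:///data/documents')    # illustrative input path&lt;br /&gt;
            .flatMap(lambda line: line.split())    # Map: one record per word&lt;br /&gt;
            .map(lambda word: (word, 1))           # emit (word, 1) pairs&lt;br /&gt;
            .reduceByKey(lambda a, b: a + b))      # Reduce: sum counts per word&lt;br /&gt;
&lt;br /&gt;
counts.saveAsTextFile('hdfs:///data/word_counts')  # illustrative output path&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;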
&lt;br /&gt;
== Phoenix ==&lt;br /&gt;
Phoenix&amp;lt;ref&amp;gt;&lt;br /&gt;
http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. The API consists of two sets of functions: &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, eventually it is the task of the user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
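&lt;br /&gt;
Phoenix itself is a C runtime, but the data flow just described (split, parallel Map, partition by key, parallel Reduce, final merge) can be sketched with a shared-memory thread pool. The Python below is only a conceptual illustration of that flow, not the Phoenix API.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from concurrent.futures import ThreadPoolExecutor&lt;br /&gt;
&lt;br /&gt;
def run_map_reduce(chunks, map_fn, reduce_fn, num_workers=4):&lt;br /&gt;
    # Map stage: the splitter hands each worker one non-overlapping chunk.&lt;br /&gt;
    with ThreadPoolExecutor(max_workers=num_workers) as pool:&lt;br /&gt;
        per_chunk_pairs = list(pool.map(lambda c: list(map_fn(c)), chunks))&lt;br /&gt;
&lt;br /&gt;
    # Partition stage: all values for the same key go to the same unit.&lt;br /&gt;
    units = {}&lt;br /&gt;
    for pairs in per_chunk_pairs:&lt;br /&gt;
        for key, value in pairs:&lt;br /&gt;
            units.setdefault(key, []).append(value)&lt;br /&gt;
&lt;br /&gt;
    # Reduce stage: keys can be reduced independently, so they too run in parallel.&lt;br /&gt;
    with ThreadPoolExecutor(max_workers=num_workers) as pool:&lt;br /&gt;
        results = list(pool.map(lambda kv: reduce_fn(kv[0], kv[1]), units.items()))&lt;br /&gt;
&lt;br /&gt;
    # Final merge: a single output buffer sorted by key.&lt;br /&gt;
    return sorted(results)&lt;br /&gt;
&lt;br /&gt;
# Example: run_map_reduce(['a b a', 'b c'], map_fn, reduce_fn) with the word-count&lt;br /&gt;
# map_fn/reduce_fn from earlier yields [('a', 2), ('b', 2), ('c', 1)].&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;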
&lt;br /&gt;
=== Buffer Management ===&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., split them across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Key-value storage is inefficient: because memory is shared, the containers must provide fast lookup and retrieval over a potentially large data set, all the while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than the memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk size to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are the following three technical challenges in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead that exploits the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
&lt;br /&gt;
=== Implementation Details ===&lt;br /&gt;
&lt;br /&gt;
* Since the GPU does not support [http://en.wikipedia.org/wiki/Dynamic_memory_allocation#Dynamic_memory_allocation/ dynamic memory allocation] on the device memory during the execution of the GPU code, arrays are used as the main data structure. &lt;br /&gt;
* The input data, the intermediate result and the final result are stored in three kinds of arrays, i.e., the key array, the value array and the directory index. The directory index consists of an entry of &amp;lt;key offset, key size, value offset, value size&amp;gt; for each key/value pair. &lt;br /&gt;
* Given a directory index entry, the key or the value at the corresponding offset in the key array or the value array is fetched. &lt;br /&gt;
* With the array structure, the space on the device memory for the input data as well as for the result output is allocated before executing the GPU program. However, the sizes of the output from the map and the reduce stages are not known in advance. The output scheme described below is for the map stage; the scheme for the reduce stage is similar.&lt;br /&gt;
&lt;br /&gt;
First, each map task outputs three counts, i.e., the number of intermediate results, the total size of keys (in bytes) and the total size of values (in bytes) generated by the map task. Based on the key sizes (or value sizes) of all map tasks, the run-time system computes a prefix sum on these sizes and produces an array of write locations. A write location is the start location in the output array for the corresponding map task to write to. Based on the number of intermediate results, the run-time system computes a prefix sum and produces an array of start locations in the output directory index for the corresponding map task. Through these prefix sums, the sizes of the arrays for the intermediate result are also known. Thus, the run-time allocates arrays in the device memory with the exact size for storing the intermediate results.&lt;br /&gt;
&lt;br /&gt;
Second, each map task outputs the intermediate key/value pairs to the output array and updates the directory index. Since each map has its deterministic and non-overlapping positions to write to, the write conflicts are avoided. This two-step scheme does not require the hardware support of atomic functions. It is suitable for the massive thread parallelism on the GPU. However, it doubles the map computation in the worst case. The overhead of this scheme is application dependent, and is usually much smaller than that in the worst case.&lt;br /&gt;
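&lt;br /&gt;
The two-step output scheme can be made concrete with a small sketch: given the per-task counts from the first pass, an exclusive prefix sum yields a deterministic, non-overlapping write location for every map task. This is a conceptual illustration, not Mars source code.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def exclusive_prefix_sum(sizes):&lt;br /&gt;
    # Start offsets: task i writes at the sum of all earlier tasks' sizes.&lt;br /&gt;
    offsets, running = [], 0&lt;br /&gt;
    for size in sizes:&lt;br /&gt;
        offsets.append(running)&lt;br /&gt;
        running += size&lt;br /&gt;
    return offsets, running   # per-task start offsets and total array size&lt;br /&gt;
&lt;br /&gt;
# First pass (MAP_COUNT): each map task reports how many bytes of keys it will emit.&lt;br /&gt;
key_bytes_per_task = [12, 0, 7, 20]&lt;br /&gt;
&lt;br /&gt;
write_offsets, total_key_bytes = exclusive_prefix_sum(key_bytes_per_task)&lt;br /&gt;
# write_offsets is [0, 12, 12, 19] and total_key_bytes is 39: the runtime&lt;br /&gt;
# allocates exactly 39 bytes, task i writes starting at write_offsets[i],&lt;br /&gt;
# and no two tasks ever touch overlapping positions, so no atomics are needed.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;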
&lt;br /&gt;
=== Optimization Techniques ===&lt;br /&gt;
==== Memory Optimizations ====&lt;br /&gt;
Two memory optimizations are used to reduce the number of memory requests in order to improve the memory bandwidth utilization. &lt;br /&gt;
* '''Coalesced accesses'''&lt;br /&gt;
The GPU feature of coalesced accesses is utilized to improve the memory performance. The memory accesses of each thread to the data arrays are designed according to the coalesced access pattern when applicable. Suppose there are T threads in total and the number of key/value pairs is N in the map stage. Thread i processes the (i + T • k )th (k=0,..,N/T) key/value pair. Due to the SIMD property of the GPU, the memory addresses from the threads within a thread group are consecutive and these accesses are coalesced into one. The figure below illustrates the map stage with and without the coalesced access optimization.&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:Mars.jpg]]&amp;lt;br&amp;gt;&lt;br /&gt;
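&lt;br /&gt;
The index arithmetic behind the coalesced layout is small enough to spell out: with T threads and N pairs, thread i touches pairs i, i + T, i + 2T, and so on, so at every step the threads of a group read consecutive elements. The snippet below only illustrates the two assignment patterns.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
T, N = 4, 12   # 4 threads, 12 key/value pairs (toy sizes)&lt;br /&gt;
&lt;br /&gt;
# Coalesced assignment: at step k, threads 0..T-1 touch pairs k*T .. k*T + T-1,&lt;br /&gt;
# which are consecutive in memory, so the hardware merges them into one request.&lt;br /&gt;
coalesced = {i: [i + T * k for k in range(N // T)] for i in range(T)}&lt;br /&gt;
&lt;br /&gt;
# Blocked assignment: each thread walks its own contiguous block, so at any&lt;br /&gt;
# given step the accesses from different threads are far apart and cannot merge.&lt;br /&gt;
blocked = {i: [i * (N // T) + k for k in range(N // T)] for i in range(T)}&lt;br /&gt;
&lt;br /&gt;
print(coalesced[0])   # [0, 4, 8]&lt;br /&gt;
print(blocked[0])     # [0, 1, 2]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;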
&lt;br /&gt;
* '''Accesses using built-in vector types'''&lt;br /&gt;
Accessing the values in the device memory can be costly, because the data values are often&lt;br /&gt;
of different sizes and the accesses are hardly coalesced. Fortunately, GPUs such as G80 support built-in vector types such as char4 and int4. Reading built-in vectors fetches the entire vector&lt;br /&gt;
in a single memory request. Compared with reading char or int, the number of memory requests is greatly reduced and the memory performance is improved.&lt;br /&gt;
&lt;br /&gt;
==== Thread parallelism ====&lt;br /&gt;
The thread configuration, i.e., the number of thread groups and the number of threads per thread group, is related to multiple factors including, (1) the hardware configuration such as the number of multiprocessors and the on-chip computation resources such as the number of registers on each multiprocessor, (2) the computation characteristics of the map and the reduce tasks, e.g., they are memory- or computation-intensive. Since the map and the reduce functions are implemented by the developer, and their costs are unknown to the runtime system, it is difficult to find the optimal setting for the thread configuration at&lt;br /&gt;
run time.&lt;br /&gt;
&lt;br /&gt;
==== Handling variable-sized types ==== &lt;br /&gt;
The variable-sized types are supported with the directory index. If two key/value pairs need to be swapped, their corresponding entries in the directory index are swapped without modifying the key and the value arrays. This choice is to save the swapping cost since the directory entries are typically much smaller than the key/value pairs. Even though swapping changes the order of entries in the directory index, the array layout is preserved and therefore accesses to the directory index can still be coalesced after swaps. Since strings are a typical variable-sized type, and string processing is common in web data analysis tasks, a GPU-based string manipulation library was developed for Mars. The operations in the library include strcmp, strcat, memset and so on. The APIs of these operations are consistent with those in C/C++ library on the CPU. The difference is that simple algorithms for these GPU-based string operations were used, since they usually handle small strings within a map or a reduce task. In addition, char4 is used to implement strings to optimize the memory performance.&lt;br /&gt;
&lt;br /&gt;
==== Hashing ====&lt;br /&gt;
[http://en.wikipedia.org/wiki/Hash_function/ Hashing] is used in the sort algorithm to store the results with the same key value consecutively. In that case, the results do not need to appear in strictly ascending or descending order of their key values. A hashing technique that hashes a key into a 32-bit integer is used, and the records are sorted according to their hash values. When two records are compared, their hash values are compared first. Only when their hash values are the same are their keys fetched and compared. Given a good hash function, the probability of having to compare the keys is low.&lt;br /&gt;
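&lt;br /&gt;
A hedged sketch of this hash-then-compare ordering: records are sorted primarily by a 32-bit hash of the key, and the full keys are only consulted to break ties between equal hashes. The hash function used below is purely illustrative, not the one Mars uses.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import zlib&lt;br /&gt;
&lt;br /&gt;
def key_hash(key):&lt;br /&gt;
    # Any stable 32-bit hash works; crc32 is used here only for illustration.&lt;br /&gt;
    return zlib.crc32(key.encode('utf-8'))&lt;br /&gt;
&lt;br /&gt;
records = [('banana', 3), ('apple', 5), ('banana', 1), ('cherry', 2)]&lt;br /&gt;
&lt;br /&gt;
# Sort by (hash, key): records sharing a key become consecutive, and the&lt;br /&gt;
# comparatively expensive key comparison only matters when two hashes collide.&lt;br /&gt;
grouped = sorted(records, key=lambda kv: (key_hash(kv[0]), kv[0]))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;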
&lt;br /&gt;
==== File manipulation ====&lt;br /&gt;
Currently, the GPU cannot directly access the data in the hard disk. Thus, the file manipulation with the assistance of the CPU is performed in three phases. First, the file I/O on the CPU is performed and the file data is loaded into a buffer in the main memory. To reduce the I/O stall, multiple threads are used to perform the I/O task. Second, the preprocessing on the buffered data is performed and the input key/value pairs are obtained. Finally, the input key/value pairs are copied to the GPU device memory.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Provides a performance [http://en.wikipedia.org/wiki/Speedup/ speedup] when accessing data by using built-in vector types. These vector types reduce the number of memory requests and improve the bandwidth utilization.&lt;br /&gt;
# Applications written on Mars may omit the reduce stage entirely, which improves speedup for such workloads.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# GPU-based applications are much more complex to develop.&lt;br /&gt;
# Mars currently handles data that can fit into the device memory, and has not yet been shown to support massive data sets that exceed it.&lt;br /&gt;
&lt;br /&gt;
= More Examples =&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
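&lt;br /&gt;
The last example can be written out in the same style as the word counter. Below is a minimal Python sketch; the document IDs and contents are made up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def map_fn(doc_id, text):&lt;br /&gt;
    # Emit a (word, document ID) pair for every word occurrence.&lt;br /&gt;
    for word in text.split():&lt;br /&gt;
        yield (word, doc_id)&lt;br /&gt;
&lt;br /&gt;
def reduce_fn(word, doc_ids):&lt;br /&gt;
    # Sort (and de-duplicate) the IDs of the documents containing the word.&lt;br /&gt;
    return (word, sorted(set(doc_ids)))&lt;br /&gt;
&lt;br /&gt;
corpus = {1: 'the quick fox', 2: 'the lazy dog', 3: 'quick quick dog'}&lt;br /&gt;
&lt;br /&gt;
postings = {}&lt;br /&gt;
for doc_id, text in corpus.items():&lt;br /&gt;
    for word, value in map_fn(doc_id, text):&lt;br /&gt;
        postings.setdefault(word, []).append(value)&lt;br /&gt;
&lt;br /&gt;
inverted_index = dict(reduce_fn(w, ids) for w, ids in postings.items())&lt;br /&gt;
# {'the': [1, 2], 'quick': [1, 3], 'fox': [1], 'lazy': [2], 'dog': [2, 3]}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;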
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared-memory systems, uses shared memory and minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written directly with the P-threads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which P-threads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, the developer needs knowledge of the GPU architecture and must spend significant effort developing GPU applications. The difficulty is even greater for complex and performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce framework for these applications. With the GPU-based framework, the developer writes code using the simple and familiar MapReduce interfaces, and the runtime on the GPU is completely hidden from the developer by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has attracted criticism as well. Google was awarded the patent for MapReduce, but it can be argued that this technology is similar to many other already existing ones. There are programming models that are similar to MapReduce, such as Algorithm Skeletons (Parallelism Patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithm Skeletons are a high-level programming model for parallel and distributed computing, and libraries implementing this framework are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide-area network (WAN) setting. The Datameer Analytics Solution (DAS) is a business-integration platform for Hadoop and includes data-source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93644</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93644"/>
		<updated>2015-02-10T03:32:04Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: placeholder for work to be done.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
A program to count the number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output (input to Reduce): key = word, values = list of 1s&lt;br /&gt;
//Output: key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and ''R'' reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementations of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on the kinds of computations that can be expressed and on how they are implemented. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system must be targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the model. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. The idea is that the network is slow and data is plentiful: many processing frameworks bring the data to the processing, whereas&lt;br /&gt;
Hadoop brings the computation to the data. In some cases the data is so large that this is the only practical option. MapReduce provides the framework for processing&lt;br /&gt;
data stored in the Hadoop Distributed File System (HDFS) on a Hadoop cluster.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapReduce (MRV1) is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map tasks or reduce tasks).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* The client program uploads files to the Hadoop Distributed File System (HDFS) and notifies the JobTracker, which in turn returns the job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker decides which tasks are appropriate based on how busy each TaskTracker is. &lt;br /&gt;
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function, which fills a buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed to local disk as two files. &lt;br /&gt;
* After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When the map phase is done, the JobTracker notifies the TaskTrackers to move to the reduce phase, which follows the same pattern: a ReduceTask is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 design. Notably, heavy processing&lt;br /&gt;
load could cause the JobTracker to become a large bottleneck. In order to remove this bottleneck, YARN was introduced. YARN is an application framework that is solely responsible for&lt;br /&gt;
resource management on Hadoop clusters. Not only can you run MapReduce jobs, you can also place other in-cluster frameworks under YARN resource management,&lt;br /&gt;
allowing you to properly allocate resources across your cluster. YARN, at its simplest, is the separation of the work that the JobTracker used to do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the per-job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change their imports; however, the execution of a job changes significantly. YARN does its work in units called containers.&lt;br /&gt;
A container represents a slice of cluster resources in which a unit of work can run. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which&lt;br /&gt;
runs on a node in the cluster; the ResourceManager asks a NodeManager to launch the ApplicationMaster in that container. The ApplicationMaster then&lt;br /&gt;
determines, based on the input splits, the number of map tasks to create. Once this is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks. The ApplicationMaster then asks the NodeManagers on the assigned nodes to&lt;br /&gt;
start the map tasks.&lt;br /&gt;
====SPARK====&lt;br /&gt;
&lt;br /&gt;
== Phoenix ==&lt;br /&gt;
Phoenix&amp;lt;ref&amp;gt;&lt;br /&gt;
http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. The API consists of two sets of functions: &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, ultimately it is the user's task to provide functionally correct code. &lt;br /&gt;
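&lt;br /&gt;
As a rough illustration of how an application hands its functions and buffers to the Phoenix scheduler, the sketch below fills in a ''scheduler_args_t''-style structure for a word-count job. The field names, function-pointer signatures, and scheduler entry point shown here are simplified assumptions made for illustration; the authoritative definitions are in the MapReduceScheduler.h header linked above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/* Illustrative sketch only: this structure and entry point are simplified&lt;br /&gt;
   stand-ins for the real Phoenix definitions in MapReduceScheduler.h. */&lt;br /&gt;
#include &amp;lt;string.h&amp;gt;&lt;br /&gt;
&lt;br /&gt;
typedef struct {&lt;br /&gt;
    void *task_data;                              /* input buffer handed to the Splitter */&lt;br /&gt;
    int   data_size;                              /* total input size in bytes           */&lt;br /&gt;
    void (*map)(void *);                          /* user Map function                   */&lt;br /&gt;
    void (*reduce)(void *);                       /* user Reduce function                */&lt;br /&gt;
    int  (*splitter)(void *, int, void *);        /* carves the input into map units     */&lt;br /&gt;
    int  (*key_cmp)(const void *, const void *);  /* key comparison used for sorting     */&lt;br /&gt;
} scheduler_args_t;                               /* stand-in for the Phoenix structure  */&lt;br /&gt;
&lt;br /&gt;
int map_reduce_scheduler(scheduler_args_t *args); /* assumed entry point */&lt;br /&gt;
&lt;br /&gt;
/* User-defined functions (definitions omitted in this sketch). */&lt;br /&gt;
void wc_map(void *map_args);&lt;br /&gt;
void wc_reduce(void *reduce_args);&lt;br /&gt;
int  wc_splitter(void *data, int unit, void *out);&lt;br /&gt;
int  wc_keycmp(const void *a, const void *b);&lt;br /&gt;
&lt;br /&gt;
void run_word_count(char *text, int len) {&lt;br /&gt;
    scheduler_args_t args;&lt;br /&gt;
    memset(&amp;amp;args, 0, sizeof(args));&lt;br /&gt;
    args.task_data = text;        /* the document whose words are counted      */&lt;br /&gt;
    args.data_size = len;&lt;br /&gt;
    args.map       = wc_map;      /* emits &amp;lt;word, 1&amp;gt; intermediate pairs   */&lt;br /&gt;
    args.reduce    = wc_reduce;   /* sums the 1s for each distinct word        */&lt;br /&gt;
    args.splitter  = wc_splitter; /* splits the text on word boundaries        */&lt;br /&gt;
    args.key_cmp   = wc_keycmp;   /* strcmp-style comparison of word keys      */&lt;br /&gt;
    map_reduce_scheduler(&amp;amp;args);  /* run-time spawns workers and runs the job  */&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;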
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
=== Buffer Management ===&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Inefficient key-value storage: because everything runs in shared memory, the key-value containers must provide fast lookup and retrieval over a potentially large data set, all the while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than the memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Chunking exposed to user code: Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables some user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to the architectural differences, the following three technical challenges arise in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead under the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
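&lt;br /&gt;
To show how this two-pass design looks in user code, the following is a hedged sketch of a word-occurrence counter written against the Mars-style APIs listed above: the counting pass only reports output sizes through EMIT_INTERMEDIATE_COUNT and EMIT_COUNT, and the second pass writes the actual pairs through EMIT_INTERMEDIATE and EMIT. The assumption that every value is a 4-byte integer stored contiguously is made purely for illustration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Illustrative Mars-style user functions for counting word occurrences.&lt;br /&gt;
// Assumption (for illustration): every value is a 4-byte int, and the values&lt;br /&gt;
// for one key are stored contiguously when REDUCE sees them.&lt;br /&gt;
&lt;br /&gt;
// Pass 1: report how much intermediate output this map task will produce.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize) {&lt;br /&gt;
    EMIT_INTERMEDIATE_COUNT(keySize, sizeof(int));   // one &amp;lt;word, 1&amp;gt; pair&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Pass 2: write the pair into the pre-allocated output arrays.&lt;br /&gt;
void MAP(void *key, void *val, int keySize, int valSize) {&lt;br /&gt;
    int one = 1;&lt;br /&gt;
    EMIT_INTERMEDIATE(key, &amp;amp;one, keySize, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Pass 1 of reduce: the final result is one &amp;lt;word, total&amp;gt; pair.&lt;br /&gt;
void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount) {&lt;br /&gt;
    EMIT_COUNT(keySize, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// Pass 2 of reduce: sum the per-occurrence counts for this word.&lt;br /&gt;
void REDUCE(void *key, void *vals, int keySize, int valCount) {&lt;br /&gt;
    int total = 0;&lt;br /&gt;
    for (int i = 0; i &amp;lt; valCount; i++)&lt;br /&gt;
        total += ((int *)vals)[i];&lt;br /&gt;
    EMIT(key, &amp;amp;total, keySize, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;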
&lt;br /&gt;
=== Implementation Details ===&lt;br /&gt;
&lt;br /&gt;
* Since the GPU does not support [http://en.wikipedia.org/wiki/Dynamic_memory_allocation#Dynamic_memory_allocation/ dynamic memory allocation] on the device memory during the execution of the GPU code, arrays are used as the main data structure. &lt;br /&gt;
* The input data, the intermediate result and the final result are stored in three kinds of arrays, i.e., the key array, the value array and the directory index. The directory index consists of an entry of &amp;lt;key offset, key size, value offset, value size&amp;gt; for each key/value pair. &lt;br /&gt;
* Given a directory index entry, the key or the value at the corresponding offset in the key array or the value array is fetched. &lt;br /&gt;
* With this array structure, the device-memory space for the input data as well as for the result output is allocated before the GPU program executes. However, the sizes of the output from the map and reduce stages are not known in advance; a two-step output scheme, described below for the map stage, solves this (the scheme for the reduce stage is similar).&lt;br /&gt;
&lt;br /&gt;
First, each map task outputs three counts: the number of intermediate results, the total size of the keys (in bytes) and the total size of the values (in bytes) generated by the map task. Based on the key sizes (or value sizes) of all map tasks, the run-time system computes a prefix sum on these sizes and produces an array of write locations. A write location is the start location in the output array at which the corresponding map task will write. Based on the numbers of intermediate results, the run-time system computes another prefix sum and produces an array of start locations in the output directory index for the corresponding map tasks. Through these prefix sums, the sizes of the arrays for the intermediate results are also known. Thus, the run-time allocates arrays in the device memory with the exact size needed to store the intermediate results.&lt;br /&gt;
&lt;br /&gt;
Second, each map task outputs the intermediate key/value pairs to the output array and updates the directory index. Since each map has its deterministic and non-overlapping positions to write to, the write conflicts are avoided. This two-step scheme does not require the hardware support of atomic functions. It is suitable for the massive thread parallelism on the GPU. However, it doubles the map computation in the worst case. The overhead of this scheme is application dependent, and is usually much smaller than that in the worst case.&lt;br /&gt;
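&lt;br /&gt;
The write-location computation described above amounts to an exclusive prefix sum over the counts reported in the first step. The sequential, host-side sketch below illustrates the idea only; in Mars the scan itself runs on the GPU, and the function and variable names here are assumptions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Sequential illustration of turning per-task output counts into&lt;br /&gt;
// non-overlapping write offsets (an exclusive prefix sum).&lt;br /&gt;
#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
&lt;br /&gt;
void compute_write_offsets(const int *counts, int num_tasks, int *offsets) {&lt;br /&gt;
    int running = 0;&lt;br /&gt;
    for (int i = 0; i &amp;lt; num_tasks; i++) {&lt;br /&gt;
        offsets[i] = running;   // task i starts writing at this position&lt;br /&gt;
        running += counts[i];   // the next task starts after task i's output&lt;br /&gt;
    }&lt;br /&gt;
    // 'running' now holds the total output size, used to allocate the array.&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
int main(void) {&lt;br /&gt;
    int counts[] = {3, 0, 5, 2};   // results produced by each of four map tasks&lt;br /&gt;
    int offsets[4];&lt;br /&gt;
    compute_write_offsets(counts, 4, offsets);&lt;br /&gt;
    for (int i = 0; i &amp;lt; 4; i++)&lt;br /&gt;
        printf(&amp;quot;task %d writes at offset %d\n&amp;quot;, i, offsets[i]);   // 0, 3, 3, 8&lt;br /&gt;
    return 0;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;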
&lt;br /&gt;
=== Optimization Techniques ===&lt;br /&gt;
==== Memory Optimizations ====&lt;br /&gt;
Two memory optimizations are used to reduce the number of memory requests in order to improve the memory bandwidth utilization. &lt;br /&gt;
* '''Coalesced accesses'''&lt;br /&gt;
The GPU feature of coalesced accesses is utilized to improve the memory performance. The memory accesses of each thread to the data arrays are designed according to the coalesced access pattern when applicable. Suppose there are T threads in total and the number of key/value pairs in the map stage is N. Thread i processes the (i + T • k)-th key/value pair (k = 0, ..., N/T), as sketched in the short example at the end of this subsection. Due to the SIMD property of the GPU, the memory addresses from the threads within a thread group are consecutive and these accesses are coalesced into one. The figure below illustrates the map stage with and without the coalesced access optimization.&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:Mars.jpg]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* '''Accesses using built-in vector types'''&lt;br /&gt;
Accessing the values in the device memory can be costly, because the data values are often&lt;br /&gt;
of different sizes and the accesses are hardly coalesced. Fortunately, GPUs such as G80 support built-in vector types such as char4 and int4. Reading built-in vectors fetches the entire vector&lt;br /&gt;
in a single memory request. Compared with reading char or int, the number of memory requests is greatly reduced and the memory performance is improved.&lt;br /&gt;
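&lt;br /&gt;
To make the strided assignment used for coalesced accesses concrete, the small host-side program below prints which key/value indices each of T threads touches: in round k, thread i reads element i + T • k, so in every round the T threads read T consecutive elements, which is what allows the GPU hardware to coalesce the loads. The values of N and T are illustrative.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Illustration of the coalesced assignment: in round k, thread i reads&lt;br /&gt;
// element (i + T*k), so threads 0..T-1 touch a contiguous block each round.&lt;br /&gt;
#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
&lt;br /&gt;
int main(void) {&lt;br /&gt;
    int N = 16;   // number of key/value pairs (illustrative)&lt;br /&gt;
    int T = 4;    // number of threads (illustrative)&lt;br /&gt;
    for (int i = 0; i &amp;lt; T; i++) {&lt;br /&gt;
        printf(&amp;quot;thread %d reads indices:&amp;quot;, i);&lt;br /&gt;
        for (int k = 0; i + T * k &amp;lt; N; k++)&lt;br /&gt;
            printf(&amp;quot; %d&amp;quot;, i + T * k);&lt;br /&gt;
        printf(&amp;quot;\n&amp;quot;);&lt;br /&gt;
    }&lt;br /&gt;
    return 0;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;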
&lt;br /&gt;
==== Thread parallelism ====&lt;br /&gt;
The thread configuration, i.e., the number of thread groups and the number of threads per thread group, depends on multiple factors, including (1) the hardware configuration, such as the number of multiprocessors and the on-chip computation resources (e.g., the number of registers on each multiprocessor), and (2) the computation characteristics of the map and reduce tasks, e.g., whether they are memory- or computation-intensive. Since the map and reduce functions are implemented by the developer, and their costs are unknown to the runtime system, it is difficult to find the optimal thread configuration at&lt;br /&gt;
run time.&lt;br /&gt;
&lt;br /&gt;
==== Handling variable-sized types ==== &lt;br /&gt;
The variable-sized types are supported with the directory index. If two key/value pairs need to be swapped, their corresponding entries in the directory index are swapped without modifying the key and the value arrays. This choice is to save the swapping cost since the directory entries are typically much smaller than the key/value pairs. Even though swapping changes the order of entries in the directory index, the array layout is preserved and therefore accesses to the directory index can still be coalesced after swaps. Since strings are a typical variable-sized type, and string processing is common in web data analysis tasks, a GPU-based string manipulation library was developed for Mars. The operations in the library include strcmp, strcat, memset and so on. The APIs of these operations are consistent with those in C/C++ library on the CPU. The difference is that simple algorithms for these GPU-based string operations were used, since they usually handle small strings within a map or a reduce task. In addition, char4 is used to implement strings to optimize the memory performance.&lt;br /&gt;
&lt;br /&gt;
==== Hashing ====&lt;br /&gt;
[http://en.wikipedia.org/wiki/Hash_function/ Hashing] is used in the sort algorithm to store the results with the same key value consecutively. In that case, the results do not need to appear in strict ascending or descending key order. A hashing technique that hashes each key into a 32-bit integer is used, and the records are sorted according to their hash values. When two records are compared, their hash values are compared first; only when the hash values are equal are the keys themselves fetched and compared. Given a good hash function, the probability of having to compare the keys is low.&lt;br /&gt;
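&lt;br /&gt;
A minimal sketch of the comparison rule described above: each record carries a precomputed 32-bit hash of its key, and the keys themselves are fetched and compared only when the hashes collide. The record layout and names are illustrative, not Mars source code.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Compare two records by their 32-bit hashes first; touch the keys only on a collision.&lt;br /&gt;
#include &amp;lt;string.h&amp;gt;&lt;br /&gt;
&lt;br /&gt;
typedef struct {&lt;br /&gt;
    unsigned int hash;     // 32-bit hash of the key&lt;br /&gt;
    const char  *key;      // pointer into the key array&lt;br /&gt;
    int          keySize;  // length of the key in bytes&lt;br /&gt;
} record_t;&lt;br /&gt;
&lt;br /&gt;
int record_cmp(const record_t *a, const record_t *b) {&lt;br /&gt;
    if (a-&amp;gt;hash != b-&amp;gt;hash)                    // cheap path: no key access&lt;br /&gt;
        return (a-&amp;gt;hash &amp;lt; b-&amp;gt;hash) ? -1 : 1;&lt;br /&gt;
    int n = (a-&amp;gt;keySize &amp;lt; b-&amp;gt;keySize) ? a-&amp;gt;keySize : b-&amp;gt;keySize;&lt;br /&gt;
    int c = memcmp(a-&amp;gt;key, b-&amp;gt;key, n);         // rare path: hashes collided&lt;br /&gt;
    return (c != 0) ? c : (a-&amp;gt;keySize - b-&amp;gt;keySize);&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;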
&lt;br /&gt;
==== File manipulation ====&lt;br /&gt;
Currently, the GPU cannot directly access the data in the hard disk. Thus, the file manipulation with the assistance of the CPU is performed in three phases. First, the file I/O on the CPU is performed and the file data is loaded into a buffer in the main memory. To reduce the I/O stall, multiple threads are used to perform the I/O task. Second, the preprocessing on the buffered data is performed and the input key/value pairs are obtained. Finally, the input key/value pairs are copied to the GPU device memory.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Provides a performance [http://en.wikipedia.org/wiki/Speedup/ speedup] when accessing data by using built-in vector types. These vector types reduce the number of memory requests and improve bandwidth utilization.&lt;br /&gt;
# Applications written on Mars may omit the reduce stage entirely, which further improves speedup for map-only workloads.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# GPU-based applications are much more complex to develop and debug.&lt;br /&gt;
# Mars currently handles only data that fits into the device memory and has not yet been shown to support massive data sets.&lt;br /&gt;
&lt;br /&gt;
= More Examples =&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
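&lt;br /&gt;
As an illustration, the &amp;quot;Count of URL Access Frequency&amp;quot; example can be written in the same C-like pseudocode style used elsewhere on this page; ExtractURL is a hypothetical helper that pulls the requested URL out of one log record.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : one web server log record&lt;br /&gt;
//Intermediate Output: key = URL, value = 1&lt;br /&gt;
Map(void * record){&lt;br /&gt;
   URL u = ExtractURL(record)   //hypothetical log-parsing helper&lt;br /&gt;
   EmitIntermediate(u, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output: key = URL, value = 1&lt;br /&gt;
//Output : key = URL, value = total access count&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int total = 0;&lt;br /&gt;
   for each v in values&lt;br /&gt;
       total += v&lt;br /&gt;
   Emit(key, total)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;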
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer can provide a simple, functional expression of the algorithm and leave parallelization and scheduling to the runtime system. Phoenix leads to scalable performance on both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written directly with the P-threads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which P-threads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With such a framework, developers write their code using the simple and familiar MapReduce interfaces, and the GPU runtime is completely hidden from them by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent on MapReduce, but it can be argued that the technology is similar to many already existing ones. There are programming models similar to MapReduce, such as Algorithmic Skeletons (parallelism patterns)&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithmic Skeletons are a high-level programming model for parallel and distributed computing, and skeleton framework libraries are used for a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector, and Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. The Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users (with over 180 analytic functions), and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93643</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93643"/>
		<updated>2015-02-10T03:21:22Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: Add more about Yarn and section off mrv1&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
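&lt;br /&gt;
The partitioning function mentioned above (e.g., hash(key) mod ''R'') is what decides which of the ''R'' reduce tasks receives each intermediate key. A minimal sketch follows, assuming string keys and a simple illustrative hash; the function names are assumptions, not Google's code.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// Minimal illustration of a hash(key) mod R partitioning function:&lt;br /&gt;
// the same key always lands in the same reduce partition 0..R-1.&lt;br /&gt;
#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
&lt;br /&gt;
unsigned int hash_key(const char *key) {&lt;br /&gt;
    unsigned int h = 5381;                         // djb2-style string hash&lt;br /&gt;
    while (*key) h = h * 33 + (unsigned char)*key++;&lt;br /&gt;
    return h;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
int partition(const char *key, int R) {&lt;br /&gt;
    return (int)(hash_key(key) % (unsigned int)R);&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
int main(void) {&lt;br /&gt;
    const char *keys[] = {&amp;quot;apple&amp;quot;, &amp;quot;banana&amp;quot;, &amp;quot;apple&amp;quot;};&lt;br /&gt;
    for (int i = 0; i &amp;lt; 3; i++)&lt;br /&gt;
        printf(&amp;quot;%s goes to reduce task %d of 4\n&amp;quot;, keys[i], partition(keys[i], 4));&lt;br /&gt;
    return 0;   // both occurrences of &amp;quot;apple&amp;quot; map to the same partition&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;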
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
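&lt;br /&gt;
A rough sketch of this bookkeeping as a pair of C declarations is shown below; the names and the fixed-size arrays are illustrative simplifications, not Google's actual implementation.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/* Illustrative sketch of the master's bookkeeping, not Google's actual code. */&lt;br /&gt;
typedef enum { TASK_IDLE, TASK_IN_PROGRESS, TASK_COMPLETED } task_state_t;&lt;br /&gt;
&lt;br /&gt;
#define MAX_R 64                   /* illustrative bound on reduce partitions R */&lt;br /&gt;
&lt;br /&gt;
typedef struct {&lt;br /&gt;
    task_state_t state;            /* idle, in-progress, or completed           */&lt;br /&gt;
    int          worker_id;        /* identity of the worker machine (non-idle) */&lt;br /&gt;
    /* for a completed map task: location and size of each of its R regions     */&lt;br /&gt;
    char         region_path[MAX_R][256];&lt;br /&gt;
    long         region_size[MAX_R];&lt;br /&gt;
} map_task_info_t;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;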
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling&lt;br /&gt;
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as MapReduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# Implementation of Map-Reduce can be scaled to large clusters of machines comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on how applications can be expressed within the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system must be targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same ideas. The important thing to note here is that Apache made this framework open source. The framework transparently provides both reliability and data motion to applications, and Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(A good further read: &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
The key to the whole system is data locality. The underlying idea is that the network is slow and the data are plentiful: many processing frameworks bring the data to the processing, whereas&lt;br /&gt;
Hadoop brings the computation to the data. In some cases the data set is so large that this is the only practical option. MapReduce provides the framework for processing&lt;br /&gt;
data that is stored in the Hadoop Distributed File System (HDFS) on a Hadoop cluster.&lt;br /&gt;
&lt;br /&gt;
====MapReduce1 (MRV1)====&lt;br /&gt;
&lt;br /&gt;
Hadoop MapReduce v1 is based on a “pull” model in which multiple “TaskTrackers” poll the “JobTracker” for tasks (either map tasks or reduce tasks).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* The client program uploads its files to a Hadoop Distributed File System (HDFS) location and notifies the JobTracker, which in turn returns the job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker assigns tasks based on how busy each TaskTracker is. &lt;br /&gt;
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function; the map output fills an in-memory buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed to local disk as two files. &lt;br /&gt;
* After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When the map phase is done, the JobTracker notifies the TaskTrackers to move to the reduce phase; the same mechanism is followed, with a ReduceTask being forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temporary output file is renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
====YARN====&lt;br /&gt;
&lt;br /&gt;
While the initial implementation of MRV1 on Hadoop was successful, heavy use of the product exposed some pain points in the MRV1 design. Notably, a heavy processing&lt;br /&gt;
load would make the JobTracker a large bottleneck. To help remove this bottleneck, YARN was introduced. YARN is an application framework that is solely responsible for&lt;br /&gt;
resource management on Hadoop clusters. Not only can MapReduce jobs run under it, but other in-cluster frameworks can also be placed under YARN resource management,&lt;br /&gt;
allowing resources to be allocated properly across the cluster. YARN, at its simplest, is the separation of the work the JobTracker used to do into two new processes:&lt;br /&gt;
the resource manager (ResourceManager) and the per-job scheduling and monitoring task (ApplicationMaster).&lt;br /&gt;
&lt;br /&gt;
The MapReduce API changes only in that applications need to change their imports; however, the execution of a job changes significantly. YARN does its work in units called containers.&lt;br /&gt;
A container represents a slice of cluster resources in which a unit of work can run. Upon job submission, the ResourceManager allocates a container for the ApplicationMaster, which&lt;br /&gt;
runs on a node in the cluster; the ResourceManager asks a NodeManager to launch the ApplicationMaster in that container. The ApplicationMaster then&lt;br /&gt;
determines, based on the input splits, the number of map tasks to create. Once this is known, the ApplicationMaster requests the container resources from the ResourceManager.&lt;br /&gt;
Based on the locality of the data and the available resources, the ResourceManager decides where to run the map tasks. The ApplicationMaster then asks the NodeManagers on the assigned nodes to&lt;br /&gt;
start the map tasks.&lt;br /&gt;
&lt;br /&gt;
== Phoenix ==&lt;br /&gt;
Phoenix&amp;lt;ref&amp;gt;&lt;br /&gt;
http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. The API consists of two sets of functions: &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, ultimately it is the user's task to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
=== Buffer Management ===&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), pointers are manipulated instead of the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scales well across a wide range of workloads.&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Inefficient key-value storage: because everything runs in shared memory, the key-value containers must provide fast lookup and retrieval over a potentially large data set, all the while coordinating accesses across multiple threads.&lt;br /&gt;
# Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than the memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
# Chunking exposed to user code: Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables some user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to the architectural differences, the following three technical challenges arise in implementing the MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead under the massive thread parallelism of the GPU. This scheme guarantees the correctness of parallel execution with little synchronization overhead. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented with C/C++. ''void*'' type has been used so that the developer can manipulate strings and other complex data types conveniently.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
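&lt;br /&gt;
As an illustration of this two-step design, below is a hedged word-count sketch written against the user-implemented and system-provided APIs listed above. The assumption that each input key/value pair holds exactly one word, and the trivial summation in REDUCE, are simplifications made for this example and are not part of the published Mars code.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//A hypothetical word-count map/reduce pair for Mars.  This sketch assumes&lt;br /&gt;
//the input has already been split so that each key/value pair holds&lt;br /&gt;
//exactly one word; that splitting is not part of Mars itself.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize)&lt;br /&gt;
{&lt;br /&gt;
    //Step 1: only report how much output space this task will need.&lt;br /&gt;
    EMIT_INTERMEDIATE_COUNT(valSize, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
void MAP(void *key, void *val, int keySize, int valSize)&lt;br /&gt;
{&lt;br /&gt;
    //Step 2: write the identical pair into the pre-allocated arrays.&lt;br /&gt;
    int one[1] = {1};&lt;br /&gt;
    EMIT_INTERMEDIATE(val, one, valSize, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount)&lt;br /&gt;
{&lt;br /&gt;
    //The reduce output is one integer count per distinct word.&lt;br /&gt;
    EMIT_COUNT(keySize, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
void REDUCE(void *key, void *vals, int keySize, int valCount)&lt;br /&gt;
{&lt;br /&gt;
    //Mars groups all values for one key; every value is 1 here, so the&lt;br /&gt;
    //total is simply the number of grouped values.&lt;br /&gt;
    int total[1] = {valCount};&lt;br /&gt;
    EMIT(key, total, keySize, sizeof(int));&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;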
&lt;br /&gt;
=== Implementation Details ===&lt;br /&gt;
&lt;br /&gt;
* Since the GPU does not support [http://en.wikipedia.org/wiki/Dynamic_memory_allocation#Dynamic_memory_allocation/ dynamic memory allocation] on the device memory during the execution of the GPU code, arrays are used as the main data structure. &lt;br /&gt;
* The input data, the intermediate result and the final result are stored in three kinds of arrays, i.e., the key array, the value array and the directory index. The directory index consists of an entry of &amp;lt;key offset, key size, value offset, value size&amp;gt; for each key/value pair. &lt;br /&gt;
* Given a directory index entry, the key or the value at the corresponding offset in the key array or the value array is fetched. &lt;br /&gt;
* With the array structure, the space on the device memory for the input data as well as for the result output is allocated before executing the GPU program. However, the sizes of the output from the map and the reduce stages are unknown in advance, so a two-step output scheme is used; the scheme for the reduce stage is similar to that for the map stage, which is described below.&lt;br /&gt;
&lt;br /&gt;
First, each map task outputs three counts, i.e., the number of intermediate results, the total size of keys (in bytes) and the total size of values (in bytes) generated by the map task. Based on the key sizes (or value sizes) of all map tasks, the run-time system computes a prefix sum on these sizes and produces an array of write locations. A write location is the start location in the output array at which the corresponding map task writes. Based on the number of intermediate results, the run-time system computes another prefix sum and produces an array of start locations in the output directory index for the corresponding map task. Through these prefix sums, the sizes of the arrays for the intermediate result are also known. Thus, the run-time allocates arrays in the device memory of exactly the right size for storing the intermediate results.&lt;br /&gt;
&lt;br /&gt;
Second, each map task outputs the intermediate key/value pairs to the output array and updates the directory index. Since each map task has deterministic and non-overlapping positions to write to, write conflicts are avoided. This two-step scheme does not require hardware support for atomic functions and is suitable for the massive thread parallelism on the GPU. However, it doubles the map computation in the worst case. The overhead of this scheme is application dependent, and is usually much smaller than the worst case.&lt;br /&gt;
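&lt;br /&gt;
To make the allocation step concrete, below is a simplified sketch (sequential C with illustrative names) of how the per-task counts reported by MAP_COUNT could be turned into non-overlapping write locations through an exclusive prefix sum; the actual Mars runtime performs this scan in parallel on the GPU.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Turn per-task byte counts into start offsets; all names are illustrative.&lt;br /&gt;
void compute_write_locations(int numTasks,&lt;br /&gt;
                             const int *keyBytesPerTask, //from MAP_COUNT&lt;br /&gt;
                             int *keyWriteOffset,        //out: start offsets&lt;br /&gt;
                             int *totalKeyBytes)         //out: allocation size&lt;br /&gt;
{&lt;br /&gt;
    int running = 0;&lt;br /&gt;
    int t;&lt;br /&gt;
    for (t = 0; t != numTasks; t++) {&lt;br /&gt;
        keyWriteOffset[t] = running;   //this task starts writing here&lt;br /&gt;
        running = running + keyBytesPerTask[t];&lt;br /&gt;
    }&lt;br /&gt;
    *totalKeyBytes = running;          //exact size of the key output array&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;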
&lt;br /&gt;
=== Optimization Techniques ===&lt;br /&gt;
==== Memory Optimizations ====&lt;br /&gt;
Two memory optimizations are used to reduce the number of memory requests in order to improve the memory bandwidth utilization. &lt;br /&gt;
* '''Coalesced accesses'''&lt;br /&gt;
The GPU feature of coalesced accesses is utilized to improve the memory performance. The memory accesses of each thread to the data arrays are designed according to the coalesced access pattern when applicable. Suppose there are T threads in total and the number of key/value pairs is N in the map stage. Thread i processes the (i + T • k )th (k=0,..,N/T) key/value pair. Due to the SIMD property of the GPU, the memory addresses from the threads within a thread group are consecutive and these accesses are coalesced into one. The figure below illustrates the map stage with and without the coalesced access optimization.&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:Mars.jpg]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* '''Accesses using built-in vector types'''&lt;br /&gt;
Accessing the values in the device memory can be costly, because the data values are often of different sizes and the accesses can hardly be coalesced. Fortunately, GPUs such as the G80 support built-in vector types such as char4 and int4. Reading a built-in vector fetches the entire vector in a single memory request. Compared with reading char or int, the number of memory requests is greatly reduced and the memory performance is improved.&lt;br /&gt;
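&lt;br /&gt;
The two optimizations can be combined, as in the sketch below; the function and helper names are assumptions made for illustration, while int4 (with its .x/.y/.z/.w fields) is the CUDA built-in vector type mentioned above. Thread i strides through pairs i, i+T, i+2T, ..., so neighbouring threads in a SIMD group issue neighbouring, coalescable loads, and each directory entry is fetched with a single vector read.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Illustrative only: the per-thread access pattern for the map stage.&lt;br /&gt;
//dirIndex holds one int4 entry per key/value pair.&lt;br /&gt;
void map_thread(const int4 *dirIndex, int numPairs, int i, int T)&lt;br /&gt;
{&lt;br /&gt;
    int p;&lt;br /&gt;
    for (p = i; p &amp;lt; numPairs; p = p + T) {  //thread i handles i, i+T, ...&lt;br /&gt;
        int4 entry = dirIndex[p];              //one coalesced vector load&lt;br /&gt;
        //entry.x = key offset, entry.y = key size,&lt;br /&gt;
        //entry.z = value offset, entry.w = value size&lt;br /&gt;
        process_pair(entry.x, entry.y, entry.z, entry.w);  //assumed user work&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;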
&lt;br /&gt;
==== Thread parallelism ====&lt;br /&gt;
The thread configuration, i.e., the number of thread groups and the number of threads per thread group, is related to multiple factors, including (1) the hardware configuration, such as the number of multiprocessors and the on-chip computation resources (e.g., the number of registers on each multiprocessor), and (2) the computation characteristics of the map and the reduce tasks, e.g., whether they are memory- or computation-intensive. Since the map and the reduce functions are implemented by the developer and their costs are unknown to the runtime system, it is difficult to find the optimal thread configuration at run time.&lt;br /&gt;
&lt;br /&gt;
==== Handling variable-sized types ==== &lt;br /&gt;
Variable-sized types are supported through the directory index. If two key/value pairs need to be swapped, their corresponding entries in the directory index are swapped without modifying the key and the value arrays. This saves swapping cost, since the directory entries are typically much smaller than the key/value pairs. Even though swapping changes the order of entries in the directory index, the array layout is preserved and therefore accesses to the directory index can still be coalesced after swaps. Since strings are a typical variable-sized type, and string processing is common in web data analysis tasks, a GPU-based string manipulation library was developed for Mars. The operations in the library include strcmp, strcat, memset and so on. The APIs of these operations are consistent with those of the C/C++ library on the CPU. The difference is that simple algorithms are used for these GPU-based string operations, since they usually handle small strings within a map or a reduce task. In addition, char4 is used to implement strings to optimize the memory performance.&lt;br /&gt;
&lt;br /&gt;
==== Hashing ====&lt;br /&gt;
[http://en.wikipedia.org/wiki/Hash_function/ Hashing] is used in the sort algorithm to store results with the same key value consecutively; in that case, the results do not need to be in strictly ascending or descending key order. A hashing technique that hashes a key into a 32-bit integer is used, and the records are sorted according to their hash values. When two records are compared, their hash values are compared first; only when the hash values are the same are their keys fetched and compared. Given a good hash function, the probability of having to compare the keys is low.&lt;br /&gt;
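&lt;br /&gt;
A hedged sketch of such a comparison routine is shown below; the record layout and the compare_keys helper are assumptions made for illustration, not the actual Mars sort code.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Each record carries a precomputed 32-bit hash of its key, so most&lt;br /&gt;
//comparisons never touch the variable-sized keys themselves.&lt;br /&gt;
int compare_records(unsigned int hashA, const void *keyA, int keySizeA,&lt;br /&gt;
                    unsigned int hashB, const void *keyB, int keySizeB)&lt;br /&gt;
{&lt;br /&gt;
    if (hashA != hashB)                        //the common, cheap case&lt;br /&gt;
        return (hashA &amp;lt; hashB) ? -1 : 1;&lt;br /&gt;
    //Hashes are equal (the keys match or collide): fall back to the keys.&lt;br /&gt;
    return compare_keys(keyA, keySizeA, keyB, keySizeB);  //assumed helper&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;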
&lt;br /&gt;
==== File manipulation ====&lt;br /&gt;
Currently, the GPU cannot directly access the data in the hard disk. Thus, the file manipulation with the assistance of the CPU is performed in three phases. First, the file I/O on the CPU is performed and the file data is loaded into a buffer in the main memory. To reduce the I/O stall, multiple threads are used to perform the I/O task. Second, the preprocessing on the buffered data is performed and the input key/value pairs are obtained. Finally, the input key/value pairs are copied to the GPU device memory.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Provides a performance [http://en.wikipedia.org/wiki/Speedup/ speedup] when accessing data by using built-in vector types. These vector types reduce the number of memory requests and improve bandwidth utilization.&lt;br /&gt;
# Applications written on Mars can omit the reduce stage when it is not needed, which improves speedup.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# GPU-based applications are much more complex to develop.&lt;br /&gt;
# Mars currently handles only data that fits into the device memory and has not yet been shown to support massive data sets.&lt;br /&gt;
&lt;br /&gt;
= More Examples =&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
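&lt;br /&gt;
For instance, the inverted-index computation from the last item can be sketched in MapReduce-style pseudo-code; the identifiers below are illustrative only.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input: key = document ID, value = document contents&lt;br /&gt;
//Intermediate output: key = word, value = document ID&lt;br /&gt;
Map(String docID, String contents){&lt;br /&gt;
   for each word w in contents&lt;br /&gt;
       EmitIntermediate(w, docID)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Output: key = word, value = sorted list of document IDs&lt;br /&gt;
Reduce(String word, Iterator docIDs){&lt;br /&gt;
   list result = empty&lt;br /&gt;
   for each id in docIDs&lt;br /&gt;
       append id to result&lt;br /&gt;
   sort result&lt;br /&gt;
   Emit(word, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;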
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is comparable to that of parallel code written with the Pthreads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With the GPU-based framework, developers write their code using the simple and familiar MapReduce interfaces, and the GPU runtime is completely hidden from them by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many that already existed. Programming models similar to MapReduce include Algorithm Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithm Skeletons are a high-level programming model for parallel and distributed computing, and their framework libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide-area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Suggested Reading =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93640</id>
		<title>CSC/ECE 506 Spring 2015/3b az</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2015/3b_az&amp;diff=93640"/>
		<updated>2015-02-10T03:12:29Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: Copy of old page.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction to MapReduce =&lt;br /&gt;
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]&lt;br /&gt;
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers. &lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model consists of two major steps: &lt;br /&gt;
&lt;br /&gt;
* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers. &lt;br /&gt;
&lt;br /&gt;
* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Overview of the Programming Model =&lt;br /&gt;
&lt;br /&gt;
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations. &lt;br /&gt;
&lt;br /&gt;
The input data format is application-specific, and is specified by the user. The output is a set of &amp;lt;key,value&amp;gt; pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate &amp;lt;key,value&amp;gt; pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.&lt;br /&gt;
&lt;br /&gt;
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of [http://en.wikipedia.org/wiki/Concurrency_(computer_science)/ concurrency] management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for [http://en.wikipedia.org/wiki/Locality_of_reference/ locality].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Sample Code''&lt;br /&gt;
&lt;br /&gt;
The following pseudo-code shows the basic structure of a MapReduce program. &lt;br /&gt;
&lt;br /&gt;
Program to count number of occurrences of each word in a collection of documents.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Input : a Document&lt;br /&gt;
//Intermediate Output: key = word, value = 1&lt;br /&gt;
Map(void * input){&lt;br /&gt;
   for each word w in input&lt;br /&gt;
       EmitIntermediate(w, 1)&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
//Intermediate Output key = word, value = 1&lt;br /&gt;
//Output : key = word, value = occurrences&lt;br /&gt;
Reduce(String key, Iterator values){&lt;br /&gt;
   int result = 0;&lt;br /&gt;
   for each v in values &lt;br /&gt;
       result += v&lt;br /&gt;
   Emit(key, result)&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Role of the Run-time System =&lt;br /&gt;
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
= Implementations =&lt;br /&gt;
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines. &lt;br /&gt;
&lt;br /&gt;
* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement map reduce for large clusters of commodity PCs connected together with switched Ethernet.&lt;br /&gt;
&lt;br /&gt;
* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.&lt;br /&gt;
&lt;br /&gt;
* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).&lt;br /&gt;
&lt;br /&gt;
== Google's MapReduce ==&lt;br /&gt;
&lt;br /&gt;
=== Execution Overview ===&lt;br /&gt;
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. &amp;lt;ref&amp;gt;http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Google Map Reduce.jpg|center|Google's MapReduce]] &amp;lt;br&amp;gt;&lt;br /&gt;
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):&lt;br /&gt;
&lt;br /&gt;
# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;br /&gt;
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and ''R'' reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;br /&gt;
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory. &lt;br /&gt;
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the  map workers.&lt;br /&gt;
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;br /&gt;
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;br /&gt;
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;br /&gt;
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.&lt;br /&gt;
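&lt;br /&gt;
As a hedged illustration of the partitioning function mentioned above (e.g., hash(key) mod R), a minimal sketch in C is shown below; hash_key is an assumed helper, not part of Google's implementation.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Assign an intermediate key to one of the R reduce partitions.&lt;br /&gt;
int partition_for_key(const char *key, int keySize, int R)&lt;br /&gt;
{&lt;br /&gt;
    unsigned int h = hash_key(key, keySize);   //any well-mixing hash&lt;br /&gt;
    return (int)(h % R);                       //same key, same partition&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;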
&lt;br /&gt;
=== Data Structures: Master ===&lt;br /&gt;
&lt;br /&gt;
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).&lt;br /&gt;
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;br /&gt;
&lt;br /&gt;
=== Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.&lt;br /&gt;
&lt;br /&gt;
* '''Master Failure'''&lt;br /&gt;
&lt;br /&gt;
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;br /&gt;
&lt;br /&gt;
* '''Worker Failure'''&lt;br /&gt;
&lt;br /&gt;
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# A large variety of problems is easily expressible as Map-Reduce computations. &lt;br /&gt;
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and [http://en.wikipedia.org/wiki/Load_balancing_(computing)/ load balancing]. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.&lt;br /&gt;
# The implementation of Map-Reduce can be scaled to large clusters comprising thousands of machines.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# The restricted programming model puts bounds on what can be implemented with the framework. &lt;br /&gt;
# Since [http://en.wikipedia.org/wiki/Network_bandwidth/ network bandwidth] is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.&lt;br /&gt;
&lt;br /&gt;
=== Apache’s Hadoop MapReduce ===&lt;br /&gt;
&lt;br /&gt;
After Google published its papers on MapReduce and the Google File System (GFS &amp;lt;ref&amp;gt;http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf Google File System&amp;lt;/ref&amp;gt;), Apache introduced its own implementation of the same ideas. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.&lt;br /&gt;
(Good Read!&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop&amp;lt;/ref&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
Hadoop MapRed is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map task or reduce task).&lt;br /&gt;
&lt;br /&gt;
[[File:HMR.png|center|Apache Hadoop MapReduce]]&lt;br /&gt;
The figure above depicts the execution of the job.&amp;lt;ref&amp;gt;http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Pragmatic Guide&amp;lt;/ref&amp;gt; &lt;br /&gt;
* Client program uploads files to the Hadoop Distributed File System (HDFS) location and notifies the JobTracker which in turn returns the Job ID to the client. &lt;br /&gt;
* The JobTracker allocates map tasks to the TaskTrackers. &lt;br /&gt;
* The JobTracker determines appropriate tasks based on how busy each TaskTracker is. &lt;br /&gt;
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided &amp;quot;map&amp;quot; function, filling a buffer with key/value pairs until it is full. &lt;br /&gt;
* The buffer is eventually flushed into two files. &lt;br /&gt;
* After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.&lt;br /&gt;
* When done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This follows the same method, where a ReduceTask is forked. &lt;br /&gt;
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temp output file will be renamed atomically to its final output filename.&lt;br /&gt;
&lt;br /&gt;
== Phoenix ==&lt;br /&gt;
Phoenix&amp;lt;ref&amp;gt;&lt;br /&gt;
http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems&amp;lt;/ref&amp;gt; implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.&lt;br /&gt;
&lt;br /&gt;
=== Phoenix API ===&lt;br /&gt;
The current Phoenix implementation provides an API for C and C++. &amp;lt;br&amp;gt;&lt;br /&gt;
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as [http://en.wikipedia.org/wiki/Void_pointer#C_and_C.2B.2B/ void pointers] wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.&lt;br /&gt;
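&lt;br /&gt;
Below is a hypothetical sketch of how application code might fill in ''scheduler_args_t'' and invoke the scheduler. The field and function names used here (task_data, data_size, splitter, key_cmp, map_reduce_scheduler, and the word_count_* user functions) are illustrative assumptions; the authoritative declarations are in the linked MapReduceScheduler.h.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Hypothetical setup of a Phoenix word-count run.&lt;br /&gt;
void run_word_count(char *input_buffer, long input_length)&lt;br /&gt;
{&lt;br /&gt;
    scheduler_args_t args;&lt;br /&gt;
    memset(&amp;amp;args, 0, sizeof(args));       //requires string.h&lt;br /&gt;
    args.task_data = input_buffer;         //pointer to the input data&lt;br /&gt;
    args.data_size = input_length;         //total input size in bytes&lt;br /&gt;
    args.map       = word_count_map;       //required user function&lt;br /&gt;
    args.reduce    = word_count_reduce;    //optional user function&lt;br /&gt;
    args.splitter  = word_count_splitter;  //divides input into Map task units&lt;br /&gt;
    args.key_cmp   = word_key_compare;     //key comparison used for sorting&lt;br /&gt;
    map_reduce_scheduler(&amp;amp;args);          //runs the Map and Reduce stages&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;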
&lt;br /&gt;
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use [http://en.wikipedia.org/wiki/Stack_(abstract_data_type)/ stack]-allocated and [http://en.wikipedia.org/wiki/Heap_(data_structure)/ heap]-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, it is ultimately the task of the user to provide functionally correct code. &lt;br /&gt;
&lt;br /&gt;
[[File:Phoenix.jpg|center|Phoenix MapReduce]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system. &lt;br /&gt;
&lt;br /&gt;
* The run-time is controlled by the scheduler, which is initiated by user code. &lt;br /&gt;
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication. &lt;br /&gt;
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure. &lt;br /&gt;
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.&lt;br /&gt;
&lt;br /&gt;
To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate &amp;lt;key,value&amp;gt; pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.&lt;br /&gt;
&lt;br /&gt;
'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.&lt;br /&gt;
&lt;br /&gt;
=== Buffer Management ===&lt;br /&gt;
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), pointers are manipulated instead of copying the actual pairs, which may be large in size. The intermediate buffers are not directly visible to user code. Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order. Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Phoenix is fast and scalable across all workloads&lt;br /&gt;
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&amp;lt;ref&amp;gt;http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Due to shared memory, key-value storage is inefficient, since containers must provide fast lookup and retrieval over potentially large data sets while coordinating accesses across multiple threads.&lt;br /&gt;
#Ineffective combiners: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.&lt;br /&gt;
#Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables the user-implemented optimizations described in the previous two sections, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated because of the extra code needed to deal with chunks. Second, if the user exploits the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.&lt;br /&gt;
&lt;br /&gt;
== Map Reduce on Graphics Processors ==&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
&lt;br /&gt;
Compared with CPUs, the hardware architecture of [http://en.wikipedia.org/wiki/GPU/ GPU]s differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks. &lt;br /&gt;
&lt;br /&gt;
Due to these architectural differences, there are the following three technical challenges in implementing a MapReduce framework on the GPU. &lt;br /&gt;
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.&lt;br /&gt;
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism. &lt;br /&gt;
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.&lt;br /&gt;
&lt;br /&gt;
'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.&amp;lt;ref&amp;gt;http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Mars API ===&lt;br /&gt;
&lt;br /&gt;
Mars provides a small set of [http://en.wikipedia.org/wiki/API/ API]s that are similar to those of CPU-based MapReduce. The run-time system utilizes a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism of the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low runtime overhead, which guarantees the correctness of parallel execution with little synchronization cost. &lt;br /&gt;
&lt;br /&gt;
Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls. &lt;br /&gt;
&lt;br /&gt;
* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can conveniently manipulate strings and other complex data types.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//MAP_COUNT counts result size of the map function.&lt;br /&gt;
void MAP_COUNT(void *key, void *val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//The map function.&lt;br /&gt;
void MAP(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//REDUCE_COUNT counts result size of the reduce function.&lt;br /&gt;
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&lt;br /&gt;
//The reduce function.&lt;br /&gt;
void REDUCE(void* key, void* vals, int keySize, int valCount);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
//Emit the key size and the value size in MAP_COUNT.&lt;br /&gt;
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit an intermediate result in MAP.&lt;br /&gt;
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit the key size and the value size in REDUCE_COUNT.&lt;br /&gt;
void EMIT_COUNT(int keySize, int valSize);&lt;br /&gt;
&lt;br /&gt;
//Emit a final result in REDUCE.&lt;br /&gt;
void EMIT(void *key, void* val, int keySize, int valSize);&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.&lt;br /&gt;
&lt;br /&gt;
=== Implementation Details ===&lt;br /&gt;
&lt;br /&gt;
* Since the GPU does not support [http://en.wikipedia.org/wiki/Dynamic_memory_allocation#Dynamic_memory_allocation/ dynamic memory allocation] on the device memory during the execution of the GPU code, arrays are used as the main data structure. &lt;br /&gt;
* The input data, the intermediate result and the final result are stored in three kinds of arrays, i.e., the key array, the value array and the directory index. The directory index consists of an entry of &amp;lt;key offset, key size, value offset, value size&amp;gt; for each key/value pair. &lt;br /&gt;
* Given a directory index entry, the key or the value at the corresponding offset in the key array or the value array is fetched. &lt;br /&gt;
* With the array structure, the space on the device memory for the input data as well as for the result output is allocated before executing the GPU program. However, the sizes of the output from the map and the reduce stages are unknown in advance, so a two-step output scheme is used; the scheme for the reduce stage is similar to that for the map stage, which is described below.&lt;br /&gt;
&lt;br /&gt;
First, each map task outputs three counts, i.e., the number of intermediate results, the total size of keys (in bytes) and the total size of values (in bytes) generated by the map task. Based on the key sizes (or value sizes) of all map tasks, the run-time system computes a prefix sum on these sizes and produces an array of write locations. A write location is the start location in the output array at which the corresponding map task writes. Based on the number of intermediate results, the run-time system computes another prefix sum and produces an array of start locations in the output directory index for the corresponding map task. Through these prefix sums, the sizes of the arrays for the intermediate result are also known. Thus, the run-time allocates arrays in the device memory of exactly the right size for storing the intermediate results.&lt;br /&gt;
&lt;br /&gt;
Second, each map task outputs the intermediate key/value pairs to the output array and updates the directory index. Since each map task has deterministic and non-overlapping positions to write to, write conflicts are avoided. This two-step scheme does not require hardware support for atomic functions and is suitable for the massive thread parallelism on the GPU. However, it doubles the map computation in the worst case. The overhead of this scheme is application dependent, and is usually much smaller than the worst case.&lt;br /&gt;
&lt;br /&gt;
=== Optimization Techniques ===&lt;br /&gt;
==== Memory Optimizations ====&lt;br /&gt;
Two memory optimizations are used to reduce the number of memory requests in order to improve the memory bandwidth utilization. &lt;br /&gt;
* '''Coalesced accesses'''&lt;br /&gt;
The GPU feature of coalesced accesses is utilized to improve the memory performance. The memory accesses of each thread to the data arrays are designed according to the coalesced access pattern when applicable. Suppose there are T threads in total and the number of key/value pairs is N in the map stage. Thread i processes the (i + T • k )th (k=0,..,N/T) key/value pair. Due to the SIMD property of the GPU, the memory addresses from the threads within a thread group are consecutive and these accesses are coalesced into one. The figure below illustrates the map stage with and without the coalesced access optimization.&amp;lt;br&amp;gt;&lt;br /&gt;
[[File:Mars.jpg]]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* '''Accesses using built-in vector types'''&lt;br /&gt;
Accessing the values in the device memory can be costly, because the data values are often of different sizes and the accesses can hardly be coalesced. Fortunately, GPUs such as the G80 support built-in vector types such as char4 and int4. Reading a built-in vector fetches the entire vector in a single memory request. Compared with reading char or int, the number of memory requests is greatly reduced and the memory performance is improved.&lt;br /&gt;
&lt;br /&gt;
==== Thread parallelism ====&lt;br /&gt;
The thread configuration, i.e., the number of thread groups and the number of threads per thread group, is related to multiple factors, including (1) the hardware configuration, such as the number of multiprocessors and the on-chip computation resources (e.g., the number of registers on each multiprocessor), and (2) the computation characteristics of the map and the reduce tasks, e.g., whether they are memory- or computation-intensive. Since the map and the reduce functions are implemented by the developer and their costs are unknown to the runtime system, it is difficult to find the optimal thread configuration at run time.&lt;br /&gt;
&lt;br /&gt;
==== Handling variable-sized types ==== &lt;br /&gt;
Variable-sized types are supported through the directory index. If two key/value pairs need to be swapped, their corresponding entries in the directory index are swapped without modifying the key and the value arrays. This saves swapping cost, since the directory entries are typically much smaller than the key/value pairs. Even though swapping changes the order of entries in the directory index, the array layout is preserved and therefore accesses to the directory index can still be coalesced after swaps. Since strings are a typical variable-sized type, and string processing is common in web data analysis tasks, a GPU-based string manipulation library was developed for Mars. The operations in the library include strcmp, strcat, memset and so on. The APIs of these operations are consistent with those of the C/C++ library on the CPU. The difference is that simple algorithms are used for these GPU-based string operations, since they usually handle small strings within a map or a reduce task. In addition, char4 is used to implement strings to optimize the memory performance.&lt;br /&gt;
&lt;br /&gt;
==== Hashing ====&lt;br /&gt;
[http://en.wikipedia.org/wiki/Hash_function/ Hashing] is used in the sort algorithm to store results with the same key value consecutively; in that case, the results do not need to be in strictly ascending or descending key order. A hashing technique that hashes a key into a 32-bit integer is used, and the records are sorted according to their hash values. When two records are compared, their hash values are compared first; only when the hash values are the same are their keys fetched and compared. Given a good hash function, the probability of having to compare the keys is low.&lt;br /&gt;
&lt;br /&gt;
==== File manipulation ====&lt;br /&gt;
Currently, the GPU cannot directly access the data in the hard disk. Thus, the file manipulation with the assistance of the CPU is performed in three phases. First, the file I/O on the CPU is performed and the file data is loaded into a buffer in the main memory. To reduce the I/O stall, multiple threads are used to perform the I/O task. Second, the preprocessing on the buffered data is performed and the input key/value pairs are obtained. Finally, the input key/value pairs are copied to the GPU device memory.&lt;br /&gt;
&lt;br /&gt;
=== Pros and Cons ===&lt;br /&gt;
* '''Advantages'''&lt;br /&gt;
&lt;br /&gt;
# Provides a performance [http://en.wikipedia.org/wiki/Speedup/ speedup] when accessing data by using built-in vector types. These vector types reduce the number of memory requests and improve bandwidth utilization.&lt;br /&gt;
# Applications written on Mars can omit the reduce stage when it is not needed, which improves speedup.&lt;br /&gt;
&lt;br /&gt;
* '''Disadvantages'''&lt;br /&gt;
&lt;br /&gt;
# GPU-based applications are much more complex to develop.&lt;br /&gt;
# Mars currently handles only data that fits into the device memory and has not yet been shown to support massive data sets.&lt;br /&gt;
&lt;br /&gt;
= More Examples =&lt;br /&gt;
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.&lt;br /&gt;
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. &amp;lt;br&amp;gt;&lt;br /&gt;
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs &amp;lt;URL, 1&amp;gt;. The reduce function adds together all values for the same URL and emits a &amp;lt;URL, total count&amp;gt; pair. &amp;lt;br&amp;gt;&lt;br /&gt;
*[http://books.google.com/books?id=gJrmszNHQV4C&amp;amp;pg=PA376&amp;amp;lpg=PA376&amp;amp;dq=what+is+reverse+web+link+graph&amp;amp;source=bl&amp;amp;ots=rLQ2yuV6oc&amp;amp;sig=wimcG_7MR7d9g-ePGXkEK1ANmws&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=BtxBT5HkN42DtgefhbXRBQ&amp;amp;ved=0CEwQ6AEwBg#v=onepage&amp;amp;q=what%20is%20reverse%20web%20link%20graph&amp;amp;f=false Reverse Web-Link Graph]: The map function outputs &amp;lt;target, source&amp;gt; pairs for each link to a target URL found in a page named &amp;quot;source&amp;quot;. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: &amp;lt;target, list(source)&amp;gt;.&amp;lt;br&amp;gt;&lt;br /&gt;
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of &amp;lt;word, frequency&amp;gt; pairs. The map function emits a &amp;lt;hostname, term vector&amp;gt; pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final &amp;lt;hostname, term vector&amp;gt; pair.&amp;lt;br&amp;gt;&lt;br /&gt;
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of &amp;lt;word, document ID&amp;gt; pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a &amp;lt;word, list(document ID)&amp;gt; pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.&lt;br /&gt;
&lt;br /&gt;
= Summary =&lt;br /&gt;
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occurs by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective with distributed computing, it leads to very high overheads if used with shared-memory systems that facilitate communication through memory and are typically of much smaller scale.&lt;br /&gt;
&lt;br /&gt;
Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is comparable to that of parallel code written with the Pthreads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which Pthreads code performs significantly better.&lt;br /&gt;
&lt;br /&gt;
Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort, and the difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, a GPU-based MapReduce can be used for these applications. With the GPU-based framework, developers write their code using the simple and familiar MapReduce interfaces, and the GPU runtime is completely hidden from them by the framework.&lt;br /&gt;
&lt;br /&gt;
The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many that already existed. Programming models similar to MapReduce include Algorithm Skeletons (parallelism patterns) &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries&amp;lt;/ref&amp;gt;, Sector/Sphere &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Sector/Sphere&amp;lt;/ref&amp;gt;, and the Datameer Analytics Solution &amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Datameer&amp;lt;/ref&amp;gt;. Algorithm Skeletons are a high-level programming model for parallel and distributed computing, and their framework libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers; Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide-area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Interesting Read =&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=93639</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=93639"/>
		<updated>2015-02-10T03:11:00Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1a sp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_ss]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/1a_ag]]&lt;br /&gt;
* Chapter 3a [[CSC/ECE_506_Spring_2013/3a_bs]]&lt;br /&gt;
* Chapter 6a [[CSC/ECE_506_Spring_2013/6a_cs]]&lt;br /&gt;
* Chapter 5a [[CSC/ECE_506_Spring_2013/5a_ks]]&lt;br /&gt;
* Chapter 8a [[CSC/ECE_506_Spring_2013/8a_an]]&lt;br /&gt;
* Chapter 7a [[CSC/ECE_506_Spring_2013/7a_bs]]&lt;br /&gt;
* Chapter 8b [[CSC/ECE_506_Spring_2013/8b_ap]]&lt;br /&gt;
* Chapter 8c [[CSC/ECE_506_Spring_2013/8c_da]]&lt;br /&gt;
* Chapter 10a [[CSC/ECE_506_Spring_2013/10a_os]]&lt;br /&gt;
* Chapter 10c [[CSC/ECE_506_Spring_2013/10c_ks]]&lt;br /&gt;
* Chapter 11a [[CSC/ECE_506_Spring_2013/11a_ad]]&lt;br /&gt;
* Chapter 12a [[CSC/ECE_506_Spring_2013/12a_cm]]&lt;br /&gt;
* Chapter 12b [[CSC/ECE_506_Spring_2013/12b_dj]]&lt;br /&gt;
* Chapter 12b [[CSC/ECE_506_Spring_2013/12b_sl]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/9b_sc]]&lt;br /&gt;
*[[ECE506_Spring_2014_new_problems_A_Comparing_Shared_Memory_And_Message_Passing_Models]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/1b ms]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2014/2a]]&lt;br /&gt;
*[http://wiki.expertiza.ncsu.edu/index.php/User:Ufmuhamm1 CSC/ECE_506_Spring_2014/1a]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/4a ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/3a ns]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/7b ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/7b ss]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/7b jj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/9b vn]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/12b ds]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2015/1b DL]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2015/37 mr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2015/3b az]]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=93638</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=93638"/>
		<updated>2015-02-10T03:10:40Z</updated>

		<summary type="html">&lt;p&gt;Acweber2: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1a sp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_ss]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/1a_ag]]&lt;br /&gt;
* Chapter 3a [[CSC/ECE_506_Spring_2013/3a_bs]]&lt;br /&gt;
* Chapter 6a [[CSC/ECE_506_Spring_2013/6a_cs]]&lt;br /&gt;
* Chapter 5a [[CSC/ECE_506_Spring_2013/5a_ks]]&lt;br /&gt;
* Chapter 8a [[CSC/ECE_506_Spring_2013/8a_an]]&lt;br /&gt;
* Chapter 7a [[CSC/ECE_506_Spring_2013/7a_bs]]&lt;br /&gt;
* Chapter 8b [[CSC/ECE_506_Spring_2013/8b_ap]]&lt;br /&gt;
* Chapter 8c [[CSC/ECE_506_Spring_2013/8c_da]]&lt;br /&gt;
* Chapter 10a [[CSC/ECE_506_Spring_2013/10a_os]]&lt;br /&gt;
* Chapter 10c [[CSC/ECE_506_Spring_2013/10c_ks]]&lt;br /&gt;
* Chapter 11a [[CSC/ECE_506_Spring_2013/11a_ad]]&lt;br /&gt;
* Chapter 12a [[CSC/ECE_506_Spring_2013/12a_cm]]&lt;br /&gt;
* Chapter 12b [[CSC/ECE_506_Spring_2013/12b_dj]]&lt;br /&gt;
* Chapter 12b [[CSC/ECE_506_Spring_2013/12b_sl]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/9b_sc]]&lt;br /&gt;
*[[ECE506_Spring_2014_new_problems_A_Comparing_Shared_Memory_And_Message_Passing_Models]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/1b ms]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2014/2a]]&lt;br /&gt;
*[http://wiki.expertiza.ncsu.edu/index.php/User:Ufmuhamm1 CSC/ECE_506_Spring_2014/1a]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/4a ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/3a ns]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/7b ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/7b ss]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/7b jj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/9b vn]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2014/12b ds]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2015/1b DL]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2015/37 mr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b az]]&lt;/div&gt;</summary>
		<author><name>Acweber2</name></author>
	</entry>
</feed>