Hadoop Streaming is a generic API which allows writing Mappers and Reduces in any language. But the basic concept remains the same. Mappers and Reducers receive their input and output on stdin and stdout as (key, value) pairs. Apache Hadoop uses streams as per UNIX standard between your application and Hadoop system. Streaming is the best fit for text processing. The data view is line oriented and processed as a key/value pair separated by 'tab' character. The program reads each line and processes it as per the requirement.
Working with Hadoop Streams
In any MapReduce job, we have input and output as key/value pairs. The same concept is true for Streaming API. In Streaming, input and output are always represented as text. The 'tab' character is used to separate key and value. The Streaming program uses the 'tab' character to split a line into key/value pair. The same procedure is followed for output. The Streaming program writes its output on stdout following the same format as mentioned below.
key1 \t value1 \n key2 \t value2 \n key3 \t value3 \n
In this process, each line contains only one key/value pair. So the input to the reducer is sorted so that all the same keys are placed adjacent to one another.
Any program or tool can be used as Mapper and Reducer if it is capable of handling input in text format as described above. Other scripts like Perl, Python or Bash can also be used for this purpose, provided all the nodes have an interpreter to understand the language.
The Hadoop streaming utility allows any script or executable to work as Mapper/Reducer provided they can work with stdin and stdout. In this section I will describe the implementation steps of the Streaming utility. I will assume two sample programs to work as Mapper and Reducer.
First, let us check the following command to run a Streaming job. The command does not have any arguments so it will show different usage options as shown below.
Figure 1: Showing Streaming command and usage
Now let us assume streamMapProgram and streamReduceProgramwill work as Mapper and Reducer. These programs can be scripts, executables or any other component capable of taking input from stdin and producing output at stdout. The following command will show how the Mapper and Reducer arguments can be combined with the Streaming command.
Figure 2: Showing input and output arguments
It is assumed that the Mapper and Reducer programs are present in all the nodes before starting the Streaming job.
First, the Mapper task converts the input into lines and places it into the stdin of the process. After this the Mapper collects the output of the process from stdout and converts it into key/value pair. These key/value pairs are the actual output from the Mapper task. The key is the value till the first 'tab' character and the remaining portion of the line is considered as value. If there is no 'tab' character then the total line is considered as key with value as 'null'.
The same process is followed when the reducer task runs. First it converts the key/value pairs into lines and put it into the stdin of the process. Then the reducer collects the line output from the stdout of the process and prepare key/value pairs as the final output of the reduce job. The key/value is separated the same way following the first 'tab' character.
The following diagram shows the process flow in a streaming job
Figure 3: Streaming process flow
Design Difference Between Java API and Streaming
There is a design difference between the Java MapReduce API and Hadoop Streaming. The difference is mainly in the processing implementation. In the standard Java API, the mechanism is to process each record, one at a time. So the framework is implemented to call the map () method (on your Mapper) for each record. But with the Streaming API, the map program can control the processing of input. It can also read and process multiple lines at a time as it can control the reading mechanism. In Java, the same can be implemented but with the help of some other mechanism such as using instance variables to accumulate multiple lines and then process it.
The Hadoop Streaming API supports both streaming and generic command options. Some important streaming command options are listed below.
Additional Configuration Variables
In a streaming job, additional configuration variables can be mentioned with
–D option ("-D <property>=<value>").
- The following command can be used to change local temp directory
- The following command can be used to specify additional local temp directories
- The following command can be used to specify Map-Only job
- The following command can be used to specify number of reducers
- The following command can be used to specify line split options
-D stream.map.output.field.separator=. \ -D stream.num.map.output.key.fields=6\
The Apache Hadoop framework and MapReduce programming are the industry standard for processing a large volume of data. The MapReduce programming framework is used to do the actual processing and logic implementation. The Java MapReduce API is the standard option for writing MapReduce programs. But the Hadoop Streaming API provides options to write MapReduce jobs in other languages. This is one of the best examples of flexibility available to MapReduce programmers who have experience in other languages apart from Java. Even executables can be used with the Streaming API to work as a MapReduce job. The only condition is that the program/executable should be able to take input from STDIN and produce output at STDOUT.
Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, java/j2ee, Open source and big data technologies. You can find more of his work at www.techalpine.com and you can email him here.