[hadoop] Hadoop Streaming Notes 1

How Streaming Works

These notes are mainly my own reading of the official documentation.

Main references:
Apache Hadoop documentation
"Hadoop: The Definitive Guide" (Chinese edition: Hadoop权威指南)

Mapper

When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized.

When a mapper is initialized, it starts the executable as a separate process.

As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper.

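While the mapper task runs, its input is converted into lines
(for a text file, these are the lines of the file's content;
for a directory, they are the lines of the files inside it, not the file names or paths).
These lines are fed to the stdin of the executable's process. Meanwhile, the mapper reads the process's stdout and converts each line into a key/value pair, which becomes the mapper's output.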

【inputs】 --converts--> 【lines】 ---->【stdin of process】 --->【stdout of process】---(collected by mapper)-->【lines】---(converted by mapper)-->【key/value pair】 ---【collected by mapper(output of mapper)】
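
To make the flow above concrete, here is a minimal sketch of what a streaming mapper executable could look like in Python. The word-count logic and the name run_mapper are assumptions chosen for illustration; they are not from the official documentation. The demonstration uses in-memory streams so it can be run locally without a cluster.

```python
import io


def run_mapper(stdin, stdout):
    """Read raw input lines and write one 'word<TAB>1' line per word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


# Local demonstration with in-memory streams instead of a real task.
# In an actual streaming job the script would import sys and call
# run_mapper(sys.stdin, sys.stdout).
demo_out = io.StringIO()
run_mapper(io.StringIO("hello world\nhello hadoop\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Each line written to stdout is what the mapper task later turns into a key/value pair.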

By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) will be the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.

By default, each line of the process's stdout is split at the first \t: the part on the left is the key and the part on the right is the value.
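
The default splitting rule can be sketched in a few lines of Python. This is only a model of the rule as described, not Hadoop's own code; the function name split_key_value is hypothetical, and None stands in for the "null" value.

```python
def split_key_value(line, separator="\t"):
    """Split one output line into (key, value) at the first separator."""
    line = line.rstrip("\n")
    key, sep, value = line.partition(separator)
    # No separator: the whole line is the key and the value is null (None here).
    return (key, value) if sep else (key, None)


print(split_key_value("apple\t3\tred"))   # ('apple', '3\tred')
print(split_key_value("no-tab-here"))     # ('no-tab-here', None)
```

Note that only the first tab matters: any later tabs stay inside the value.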

Reducer

When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized.
Like the mapper, a reducer task also starts a process for the executable, if one is specified.

As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process.

In the meantime, the reducer collects the line oriented outputs from the stdout of the process, converts each line into a key/value pair, which is collected as the output of the reducer.

【key/value pair】 --converted by reducer--> 【lines】 ---> 【stdin of process】-> 【stdout of process】--(collected by reducer)-->【lines】--converted by reducers-->【key/value pair(output of reducer)】

By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.
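
A reducer executable sees its input as plain "key<TAB>value" lines. The sketch below sums integer values per key; it relies on the fact that the framework sorts map output by key before the reduce phase, so lines with the same key arrive next to each other. The summing logic and the name run_reducer are assumptions made for illustration.

```python
import io
from itertools import groupby


def run_reducer(stdin, stdout):
    """Sum the integer values of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
    # Lines arrive sorted by key, so equal keys are adjacent.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        stdout.write(f"{key}\t{total}\n")


# Local demonstration with in-memory streams; a real script would
# import sys and call run_reducer(sys.stdin, sys.stdout).
demo_out = io.StringIO()
run_reducer(io.StringIO("hadoop\t1\nhello\t1\nhello\t1\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Unlike a Java reducer, the script receives no grouped iterator: it has to detect key boundaries itself, which is what groupby does here.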

# The lines fed to the mapper are the contents of the files under myInputDirs.
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc
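
As a rough mental model of what this job computes, the following Python sketch imitates /bin/cat as the mapper and /bin/wc as the reducer, assuming a single reducer. The function simulate_cat_wc is hypothetical. It reports only line and word counts; the byte count that wc also prints is left out because it depends on exactly how the framework writes key/value pairs back as lines.

```python
def simulate_cat_wc(input_lines):
    """Roughly mimic -mapper /bin/cat and -reducer /bin/wc with one reducer."""
    mapped = list(input_lines)   # /bin/cat copies every input line unchanged
    shuffled = sorted(mapped)    # the framework sorts map output by key
    line_count = len(shuffled)
    word_count = sum(len(line.split()) for line in shuffled)
    return line_count, word_count


print(simulate_cat_wc(["hello world\n", "hello hadoop\n", "bye\n"]))  # (3, 5)
```

With more than one reducer, each reducer would run wc on its own share of the lines and write its own output file.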

Packaging Files With Job Submissions

You can specify any executable as the mapper and/or the reducer. The executables do not need to pre-exist on the machines in the cluster; however, if they don't, you will need to use the "-file" option to tell the framework to pack your executable files as a part of job submission. For example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py

The above example specifies a user-defined Python executable as the mapper. The option "-file myPythonScript.py" causes the Python executable to be shipped to the cluster machines as a part of job submission.

The -file option ships the file to the machines in the cluster; it is removed once the job finishes.

In addition to executable files, you can also package other auxiliary files (such as dictionaries, configuration files, etc) that may be used by the mapper and/or the reducer. For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py \
    -file myDictionary.txt  # other files, such as a dictionary, can be shipped too
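
A mapper can then read the shipped auxiliary file at startup. The sketch below assumes the file is readable by its bare name from the task's working directory and that it holds one word per line; both are assumptions for illustration, as are the names load_dictionary and run_mapper. For a local run, a temporary file stands in for the shipped myDictionary.txt.

```python
import io
import os
import tempfile


def load_dictionary(path):
    """Return the set of non-empty, stripped lines found in the file."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def run_mapper(stdin, stdout, dictionary):
    """Emit 'word<TAB>1' only for words that appear in the dictionary."""
    for line in stdin:
        for word in line.split():
            if word in dictionary:
                stdout.write(f"{word}\t1\n")


# Local demonstration: a temporary file stands in for the shipped dictionary.
with tempfile.TemporaryDirectory() as workdir:
    dict_path = os.path.join(workdir, "myDictionary.txt")
    with open(dict_path, "w", encoding="utf-8") as handle:
        handle.write("hello\nhadoop\n")
    words = load_dictionary(dict_path)

demo_out = io.StringIO()
run_mapper(io.StringIO("hello world\nhello hadoop\n"), demo_out, words)
print(demo_out.getvalue(), end="")
```

In a real job the script would call load_dictionary("myDictionary.txt") once, before reading any input, so the file is not reopened for every line.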