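cascading is a workflow framework for data processing: it plans a processing flow you have written into hadoop map/reduce jobs and submits them to hadoop for execution. The cascading home page is:
http://www.cascading.org/
OS: ubuntu 8.10
(Steps 1 & 2 follow:
http://hadoop.apache.org/core/docs/r0.19.0/quickstart.html )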
0. java & ant
$ sudo apt-get install sun-java6-jdk
Download ant and unpack it to ~/data/w/ant
1. Set up passwordless ssh
$ ssh-keygen -t dsa -P ''
$ cd ~/.ssh
$ cat id_dsa.pub >> authorized_keys
# Test it; this should not ask for a password
$ ssh localhost
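If ssh localhost still asks for a password, one common cause is loose permissions on ~/.ssh: with its default StrictModes setting, sshd ignores an authorized_keys file that other users could write to. A minimal sketch of a fix follows. The function name fix_ssh_perms is made up for this example and is not part of the original steps.

```shell
# Tighten permissions on an ssh directory so that sshd accepts the key.
# Usage: fix_ssh_perms DIR   (DIR would normally be "$HOME/.ssh")
fix_ssh_perms() {
    dir=$1
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # Only the owner may read, write or enter the directory.
    chmod 700 "$dir" || return 1
    # Only the owner may read or write the key list, if it exists.
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys" || return 1
    fi
}
```

Calling fix_ssh_perms ~/.ssh and then retrying ssh localhost is the intended use.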
2. Configure and start hadoop
Download hadoop to ~/data/w/
$ tar xfz hadoop-0.19.0.tar.gz
$ ln -s hadoop-0.19.0 hadoop
Edit hadoop/conf/hadoop-env.sh and add:
export JAVA_HOME=/usr
Edit hadoop/conf/hadoop-site.xml and add:
<configuration>
<property>
<name>fs.default.name</name>
<value>localhost:54320</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54321</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
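To double-check what ended up in hadoop-site.xml, the values can be read back with a few lines of shell. This is only a sketch: get_hadoop_prop is a hypothetical helper, and it assumes each tag sits on its own line as in the listing above. It is not a real XML parser.

```shell
# Print the <value> that follows a given <name> in a hadoop site file.
# Usage: get_hadoop_prop FILE PROPERTY
get_hadoop_prop() {
    file=$1
    name=$2
    awk -v n="<name>$name</name>" '
        # Remember that the wanted <name> line has been seen.
        index($0, n) { found = 1; next }
        # The next <value> line belongs to it: strip the tags and print.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$file"
}
```

For the configuration above, get_hadoop_prop ~/data/w/hadoop/conf/hadoop-site.xml mapred.job.tracker should print localhost:54321.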
Edit ~/.bash_profile and add the environment variables:
export PATH=$PATH:~/data/w/hadoop/bin:~/data/w/ant/bin
export HADOOP_HOME=~/data/w/hadoop
$ hadoop namenode -format
$ start-all.sh
Test it. Open in a browser:
* NameNode - http://localhost:50070/
* JobTracker - http://localhost:50030/
3. Configure and use cascading
Download cascading-1.0.10-hadoop-0.19.0+.tgz
$ tar xvfz cascading-1.0.10-hadoop-0.19.0+.tgz
$ ln -s cascading-1.0.10-hadoop-0.19.0+ cascading
Download logparser-11-24-08.tgz
$ tar xvfz logparser-11-24-08.tgz
$ cd logparser
$ ant -Dhadoop.home=../hadoop -Dcascading.home=../cascading jar
$ hadoop jar ./build/logparser.jar data/apache.200.txt output
$ hadoop fs -get output .
Check the results in output/part-00000
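A job normally writes one part-NNNNN file per reduce task, so the output directory may hold more than part-00000. A small sketch for peeking at all of them at once follows. The helper name show_parts is an assumption for this example.

```shell
# Print the first N lines (default 20) of the combined part-* files in DIR.
# Usage: show_parts DIR [N]
show_parts() {
    dir=$1
    n=${2:-20}
    # Fail early if the job left no part files behind.
    ls "$dir"/part-* >/dev/null 2>&1 || {
        echo "no part files in $dir" >&2
        return 1
    }
    cat "$dir"/part-* | head -n "$n"
}
```

After the hadoop fs -get step, show_parts output 10 would show the first ten result lines.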