Thursday, November 22, 2012

Apache Pig, a better way to map reduce

Chinese version: Apache Pig, a convenient way to map reduce
I had used Apache Pig for about a year before I started writing anything about it.
First, you won't want to use Pig or MapReduce if your data is on the scale of megabytes or even gigabytes; it will be no more efficient than Python. Big data, though, is beyond the ability of a single machine, so you have to think about parallel processing, which is what MapReduce is for, and Pig is the tool that makes MapReduce easier and more efficient.
Second, I have to say it's very handy and elegant: you can do simple statistics or data transformations with only a few lines of code. E.g., if you want to count the query views (qv) for each query in a search log, the code looks like this:
---pig example for qv---
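-- load the search log; each record has a single field, the query
log = load 'xxx' as (query);
-- collect all records that share the same query
grp = group log by query;
-- emit each query together with its number of views
qv = foreach grp generate group as query, COUNT(log) as qv;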
------
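To make clear what those few lines compute, here is a minimal single-machine sketch of the same group-and-count in plain Java. It is only an illustration: the class name `QueryViewCount` and the in-memory sample log are hypothetical, it assumes no null queries, and it says nothing about how Pig actually executes the script on a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Pig or Hadoop: counts query views in memory.
class QueryViewCount {

    // Roughly what "group log by query" followed by COUNT(log) produces,
    // assuming every record holds a non-null query string.
    static Map<String, Long> count(List<String> queries) {
        Map<String, Long> qv = new TreeMap<>();
        for (String query : queries) {
            // Start at 1 for a new query, otherwise add 1 to its running total.
            qv.merge(query, 1L, Long::sum);
        }
        return qv;
    }

    public static void main(String[] args) {
        // Made-up sample log with one query per record.
        List<String> log = Arrays.asList("pig", "hadoop", "pig", "hive", "pig");
        for (Map.Entry<String, Long> e : count(log).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

This works fine while the whole log fits in one machine's memory, which is exactly the small-data case where Pig is not needed.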
If you write MapReduce directly, you have to follow the boilerplate and copy-paste dozens of lines of code before you start writing your own logic. The Java-on-Hadoop equivalent is appended at the end of this post, because it's really too long.
Third, Pig, like any other scripting language, is very convenient to test and run: just modify the script and run it. With Java, you have to modify, compile, package the source as a jar and then execute.
Fourth, when using Java or even streaming on Hadoop, you have to run your code on a real cluster to test it. With Pig, you can use "-x local" to run locally against the local file system, which is not only faster but also makes it easier to check the results than using HDFS. You can use this feature to process small data if you want to.
Fifth, the syntax check with "-c" makes me more comfortable while coding Pig: you don't have to worry that one missing letter will force you to run the test again, or that a bug will go unnoticed for lack of unit tests, as can happen with Python.
Sixth, Pig has a very extensible API, with which you can implement your own UDFs (user-defined functions) to handle more complex per-element processing, e.g. transforming strings, calculating the variance of a bag of data, or reading/storing data from/into a custom format. Furthermore, Java functions from the JDK can be used directly without additional code.
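As an illustration of the kind of logic a variance UDF would carry, the sketch below computes the population variance of a list of numbers in plain Java. It deliberately leaves out the Pig side: a real UDF would wrap something like this in a class extending Pig's `EvalFunc` and read the values from a bag. The class and method names here are hypothetical, and the choice of population (rather than sample) variance is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: the numeric core a variance UDF might delegate to.
class BagVariance {

    // Population variance computed in one pass with Welford's update,
    // which avoids holding the sum of squares of large values.
    static double variance(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("variance of an empty bag is undefined");
        }
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the mean
        long n = 0;
        for (double v : values) {
            n++;
            double delta = v - mean;
            mean += delta / n;
            m2 += delta * (v - mean);
        }
        return m2 / n;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for the contents of a bag.
        System.out.println(variance(Arrays.asList(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)));
    }
}
```

A one-pass formulation is used because a bag is normally consumed as a stream of tuples rather than as an indexable array.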
---mr example for qv---
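import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class QueryCount extends Configured implements Tool {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // map logic
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // reduce logic
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // newer Hadoop releases prefer Job.getInstance(getConf())
        Job job = new Job(getConf());
        // some settings on job

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(new QueryCount(), args);
        System.exit(ret);
    }
}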
------