Tuesday, December 18, 2012

an openssl sha512 example


/*--------cpp code-------------*/

#include <openssl/sha.h>
#include <stdio.h>
int main(){
        // SHA-512 produces a 64-byte (512-bit) digest
        unsigned char md[SHA512_DIGEST_LENGTH];
        // hash the 1-byte message "a"
        SHA512((const unsigned char*)"a", 1, md);
        // print the digest as lowercase hex
        for(int i=0; i<SHA512_DIGEST_LENGTH; ++i){
                printf("%02x", md[i]);
        }
        printf("\n");
        return 0;
}

compile using:
>g++ -I /usr/local/ssl/include -L /usr/local/ssl/lib sha512test.c -lcrypto
>./a.out
1f40fc92da241694750979ee6cf582f2d5d7d28e18335de05abc54d0560e0f5302860c652bf08d560252aa5e74210546f369fbbbce8c12cfc7957b2652fe9a75

Thursday, December 6, 2012

a study of community detection in networks

Recently I've been thinking about whether I can find interest preferences within a bunch of online novel reading logs. The idea is intuitive: people are interested in certain kinds of novels, such as love stories, magic, sci-fi or military, so two novels of the same kind should be read by the same person more frequently than two novels of different kinds. This seems easy to verify for coarse categories, since I already have a category tag for each novel, but identifying interests within finer subcategories, or even within a totally different hierarchy of categories such as the interests of different age groups, seems much more fun.
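To make this concrete, the reading logs can be turned into a graph: novels are nodes, and the weight of an edge is the number of readers the two novels share. Below is a minimal sketch of that step (my own illustration, not from the original analysis), assuming the logs have already been grouped into a set of novels per user:
---co-read graph sketch (java)---
import java.util.*;

public class CoReadGraph {
    // edge key "a|b" -> number of users who read both novel a and novel b
    public static Map<String, Integer> edgeWeights(Map<String, Set<String>> novelsByUser) {
        Map<String, Integer> weights = new HashMap<>();
        for (Set<String> novels : novelsByUser.values()) {
            List<String> list = new ArrayList<>(novels);
            Collections.sort(list);                    // canonical order for the edge key
            for (int i = 0; i < list.size(); ++i) {
                for (int j = i + 1; j < list.size(); ++j) {
                    // every pair of novels read by the same user adds 1 to that edge
                    weights.merge(list.get(i) + "|" + list.get(j), 1, Integer::sum);
                }
            }
        }
        return weights;
    }
}
------
Edges whose weight passes some threshold would then serve as the links discussed below.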
The first thing that came into my head was clustering; classical algorithms include hierarchical clustering, k-means, canopy, Gaussian mixture models, Dirichlet process clustering, etc. The limitation is that, using only novel reading logs, a novel cannot be represented as a point in some space, so k-means, center-based hierarchical clustering and Gaussian mixture models are not suitable. Besides, normal clustering algorithms only consider links between two nodes, and rarely consider the links among all nodes within a cluster.
The clustering coefficient is a good way to describe this attribute of a cluster, but exhaustively enumerating all possible clusters and calculating the clustering coefficient of each is clearly unacceptable.
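As a reminder of what that measure is: the local clustering coefficient of a node is the fraction of pairs of its neighbors that are themselves connected. A small sketch (my own illustration, assuming an adjacency-set representation in which every node appears as a key):
---clustering coefficient sketch (java)---
import java.util.*;

public class ClusteringCoefficient {
    public static double local(String node, Map<String, Set<String>> adj) {
        List<String> nbrs = new ArrayList<>(adj.get(node));
        int k = nbrs.size();
        if (k < 2) return 0.0;                       // fewer than 2 neighbors: define as 0
        int links = 0;
        for (int i = 0; i < k; ++i) {
            for (int j = i + 1; j < k; ++j) {
                // count the edges that actually exist between pairs of neighbors
                if (adj.get(nbrs.get(i)).contains(nbrs.get(j))) links++;
            }
        }
        return 2.0 * links / (k * (k - 1));          // existing links / possible links C(k,2)
    }
}
------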
The paper "Finding and evaluating community structure in networks" (M. E. J. Newman and M. Girvan, Physical Review E, 2004) tries to solve a similar problem. The features of its method are: first, it is a divisive method rather than an agglomerative one; second, it splits the graph using edge "betweenness", which is defined for an edge as the (weighted) number of shortest paths between any two nodes that pass through that edge.
But it is different from what I had in mind.
First, in the real world a node can belong to several communities at the same time.
Second, it is fine if some nodes stay alone or if a cluster is very small; the medium-sized, very cohesive clusters are what I'm looking for. So I still prefer an agglomerative style of clustering.

Recently I've been thinking through a clique-based method:

  • First, generate all maximal cliques of the whole graph. 
  • Second, use each maximal clique as a core, and attach other nodes to it to form a cluster whenever a node is close enough to most of the nodes in the clique. 

As in the picture below, the red nodes and the deep green nodes are two maximal cliques; the orange nodes are attached to the red clique to form one cluster, and the grass-green nodes are attached to the deep green clique to form another. Some white nodes are not in any cluster, and the middle node, colored both orange and grass green, belongs to both clusters. A minimal sketch of the attachment step follows.
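Here is a rough sketch of that attachment step (my own illustration, not a definitive implementation); the closeness rule, "adjacent to at least some threshold fraction of the clique's members", and the adjacency-set representation are assumptions:
---clique attachment sketch (java)---
import java.util.*;

public class CliqueAttach {
    // grow one cluster around one maximal clique core
    public static Set<String> growCluster(Set<String> clique,
                                          Map<String, Set<String>> adj,
                                          double threshold) {
        Set<String> cluster = new HashSet<>(clique);
        for (String node : adj.keySet()) {
            if (clique.contains(node)) continue;
            int linked = 0;                            // clique members this node touches
            for (String member : clique) {
                if (adj.get(node).contains(member)) linked++;
            }
            if ((double) linked / clique.size() >= threshold) {
                cluster.add(node);                     // close enough: attach it
            }
        }
        return cluster;
    }
}
------
Since growCluster runs once per clique core, a node can end up in several clusters, which matches the overlapping-community requirement above.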
But the problem is that finding cliques is an NP-complete problem, and the number of maximal cliques can grow exponentially with the graph, so when the graph is as large as a social network a sequential method will be too slow. Here are some useful things I found:

  1. "On Computing All Maximal Cliques Distributedly", Fábio Protti, Felipe M. G. França, and Jayme Luiz Szwarcfiter: a divide-and-conquer method that recursively divides the graph into V1 and V2, then computes the cliques within V1, within V2, and across V1 and V2.
  2. xrime, a graph algorithm package, which includes a maximal clique method. It generates the neighbors-of-neighbors set for each node and finds the maximal cliques within each such set. The method is clean and tricky, but emitting the full adjacency list to all neighbors seems too heavy.
Here is a map-reduce friendly algorithm I worked out to find maximal cliques level by level; see the blog post "a simple Maximal Clique algorithm for map reduce" for details.
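For comparison, the classical sequential way to enumerate maximal cliques is usually some variant of the Bron-Kerbosch recursion; a bare-bones version without pivoting (my own illustration) looks like this, and its exhaustive branching is exactly what becomes too slow on a social-network-sized graph:
---sequential maximal clique sketch (java)---
import java.util.*;

public class MaximalCliques {
    // R: current clique, P: candidate nodes, X: nodes already handled
    static void bronKerbosch(Set<Integer> r, Set<Integer> p, Set<Integer> x,
                             Map<Integer, Set<Integer>> adj, List<Set<Integer>> out) {
        if (p.isEmpty() && x.isEmpty()) {
            out.add(new HashSet<>(r));                 // R cannot be extended: maximal clique
            return;
        }
        for (Integer v : new ArrayList<>(p)) {
            Set<Integer> nv = adj.get(v);
            Set<Integer> r2 = new HashSet<>(r); r2.add(v);
            Set<Integer> p2 = new HashSet<>(p); p2.retainAll(nv);
            Set<Integer> x2 = new HashSet<>(x); x2.retainAll(nv);
            bronKerbosch(r2, p2, x2, adj, out);        // all maximal cliques containing v
            p.remove(v);                               // done with v: move it from P to X
            x.add(v);
        }
    }

    public static List<Set<Integer>> all(Map<Integer, Set<Integer>> adj) {
        List<Set<Integer>> out = new ArrayList<>();
        bronKerbosch(new HashSet<Integer>(), new HashSet<Integer>(adj.keySet()),
                     new HashSet<Integer>(), adj, out);
        return out;
    }
}
------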



Thursday, November 22, 2012

Apache Pig, a better way to map reduce

Chinese version: Apache Pig, 便捷的map reduce之道
I had used Apache Pig for about a year before I started to write anything about it.
First, you won't want to use Pig or map reduce if your data is on the scale of MBs or even GBs; it will be no more efficient than Python. But big data is beyond the ability of a single machine, so you have to think about parallel processing, which is what map reduce is for, and Pig is the tool that makes MR easier and more efficient.
Second, I have to say it is very handy and elegant that you can do simple statistics or data transformations with only a few lines of code. E.g., if you want to count the query views (QV) for each query in a search log, the code looks like this:
---pig example for qv---
log = load 'xxx' as (query);
grp = group log by query;
qv = foreach grp generate group as query, COUNT(log) as qv;
------
If you write map reduce, you have to follow the boilerplate and copy-paste tens of lines of code before you can start to write your own logic; e.g. the Java-on-Hadoop equivalent is appended at the end of this post, because it is really too long.
Third, Pig, like other scripting languages, is very convenient to test and run: just modify the script and run it. With Java, you have to modify, compile, package the source as a jar and then execute.
Fourth, when using Java or even streaming on Hadoop, you have to run your code on a real cluster to test it, but with Pig you can use "-x local" to run locally against the local file system, which is not only faster but also makes it easier to check results than using HDFS. You can even use this feature to process small data if you want to.
Fifth, the grammar check with "-c" makes me more comfortable while coding Pig; you don't have to worry that one missing letter will force another round of testing, or that a bug slips through for lack of unit tests, as can happen with Python.
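For illustration, a typical loop is a grammar check followed by a local run, using the flags described above (the script name qv.pig is just a made-up example):
>pig -c qv.pig
>pig -x local qv.pig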
Sixth, Pig has a very extensible API with which you can implement your own UDFs (user defined functions) to handle more complex element-level processing, e.g. string transforms, calculating the variance of a bag of data, or reading/storing data in a custom format. Furthermore, Java functions from the JDK can be used directly without additional code.
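As a taste of that API, here is a minimal sketch of an eval UDF that upper-cases a chararray field; it is my own illustration, and the package name myudfs and class name UpperString are made up:
---udf example (sketch)---
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperString extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // behave like the built-ins: null or empty input gives null output
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // the actual element-level transform: upper-case the first field
        return input.get(0).toString().toUpperCase();
    }
}
------
After "register myudfs.jar;" in a script, it can be called like a built-in, e.g. "up = foreach log generate myudfs.UpperString(query);".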
---mr example for qv---
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class QueryCount extends Configured implements Tool {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // map logic: treat each log line as a query and emit (query, 1)
            context.write(new Text(value.toString()), ONE);
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // reduce logic: sum up the counts for each query
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        // some settings on job: classes, output types and input/output paths
        job.setJarByClass(QueryCount.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(new QueryCount(), args);
        System.exit(ret);
    }
}
------