wonderyl's blog: graph

Suppose there is a 6-node graph, as in the picture below. I will explain how my algorithm finds all of its maximal cliques in a map-reduce-friendly way.

Section 1. find all cliques level by level

Step 0. Generate a node pair for each edge, with the two nodes of each pair sorted in order. For this example:

(1,2) (1,3) (1,4) (2,3) (2,4) (2,5) (2,6) (3,4) (3,5) (3,6) (4,5) (4,6) (5,6)

Actually, these are all the cliques with 2 nodes.
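
A minimal Python sketch of Step 0, assuming the graph arrives as an edge list whose endpoints may be in any order and may contain duplicates. The names `raw_edges` and `normalize_edges` are made up for illustration.

```python
# Hypothetical input for the 6-node example: endpoints in mixed order,
# with one duplicate edge, to show why normalization is needed.
raw_edges = [(2, 1), (1, 3), (4, 1), (3, 2), (2, 4), (5, 2), (2, 6),
             (4, 3), (3, 5), (6, 3), (4, 5), (6, 4), (5, 6), (6, 5)]

def normalize_edges(edges):
    """Return each edge once, as a sorted tuple, in sorted order."""
    return sorted({tuple(sorted(edge)) for edge in edges})

clique2 = normalize_edges(raw_edges)
print(clique2)
```

Sorting the nodes inside each pair matters for every later step, because the clustering and the join both rely on a clique having exactly one tuple representation.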

Step 1. From here on, suppose we have all cliques with N nodes; call this set CliqueN. Cluster the cliques by their first N-1 nodes. For this example:

(1: {(1,2) (1,3) (1,4)})

(2: {(2,3) (2,4) (2,5) (2,6)})

(3: {(3,4) (3,5) (3,6)})

(4: {(4,5) (4,6)})

(5: {(5,6)})
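
The clustering in Step 1 can be sketched in Python as a group-by on the tuple prefix. This is an in-memory illustration only; `cluster_by_prefix` is a hypothetical helper name, and each clique is assumed to be a sorted tuple.

```python
from collections import defaultdict

def cluster_by_prefix(cliques):
    """Group sorted N-node cliques by their first N-1 nodes."""
    clusters = defaultdict(list)
    for clique in cliques:
        clusters[clique[:-1]].append(clique)
    return dict(clusters)

clique2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (2, 6),
           (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
clusters = cluster_by_prefix(clique2)
for prefix, members in sorted(clusters.items()):
    print(prefix, members)
```

The keys are one-element tuples such as `(1,)` here; at the next level they become pairs such as `(1, 2)`.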

Step 2. For each cluster, combine every pair of N-node cliques into one (N+1)-node candidate,

for this example:

(1,2,3) (1,2,4) (1,3,4)

(2,3,4) (2,3,5) (2,3,6) (2,4,5) (2,4,6) (2,5,6)

(3,4,5) (3,4,6) (3,5,6)

(4,5,6)
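
A possible Python sketch of the candidate generation in Step 2, assuming the clusters are held in a dict keyed by the shared prefix. The function name `generate_candidates` is hypothetical.

```python
from itertools import combinations

def generate_candidates(clusters):
    """Join every pair of cliques in a cluster into one (N+1)-node candidate."""
    candidates = []
    for prefix, members in sorted(clusters.items()):
        for a, b in combinations(sorted(members), 2):
            # a and b share the prefix and differ only in their last node;
            # because members are sorted, a[-1] < b[-1] and the result is sorted.
            candidates.append(prefix + (a[-1], b[-1]))
    return candidates

clusters = {
    (1,): [(1, 2), (1, 3), (1, 4)],
    (2,): [(2, 3), (2, 4), (2, 5), (2, 6)],
    (3,): [(3, 4), (3, 5), (3, 6)],
    (4,): [(4, 5), (4, 6)],
    (5,): [(5, 6)],
}
candidates = generate_candidates(clusters)
print(candidates)
```

A cluster with a single clique, like `(5,)` here, produces no candidates.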

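Step 3. Map every candidate using its last N nodes as the key, map every clique in CliqueN using all of its nodes as the key, and reduce the candidates and CliqueN together. Each entry below shows the key, then the candidate set, then the CliqueN set. For this example: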

((1,2): {} {(1,2)})

((1,3): {} {(1,3)})

((1,4): {} {(1,4)})

((2,3): {(1,2,3)} {(2,3)})

((2,4): {(1,2,4)} {(2,4)})

((2,5): {} {(2,5)})

((2,6): {} {(2,6)})

((3,4): {(1,3,4) (2,3,4)} {(3,4)})

((3,5): {(2,3,5)} {(3,5)})

((3,6): {(2,3,6)} {(3,6)})

((4,5): {(2,4,5) (3,4,5)} {(4,5)})

((4,6): {(2,4,6) (3,4,6)} {(4,6)})

((5,6): {(2,5,6) (3,5,6) (4,5,6)} {(5,6)})

Step 4. Keep only the entries in which both the candidate set and the CliqueN set are non-empty. All candidates from the remaining entries become CliqueN+1. For this example:

(1,2,3) (1,2,4) (1,3,4) (2,3,4) (2,3,5) (2,3,6) (2,4,5) (3,4,5) (2,4,6) (3,4,6) (2,5,6) (3,5,6) (4,5,6)
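
Steps 3 and 4 amount to a join on a shared key. Below is a hedged Python sketch of that join. In the 6-node example every candidate survives, so the sketch uses a smaller, made-up 4-node graph (edges 1-2, 1-3, 1-4, 2-3) in which two candidates are rejected. `validate_candidates` is a hypothetical name.

```python
from collections import defaultdict

def validate_candidates(candidates, cliques):
    """Keep the candidates whose last N nodes form a known N-node clique."""
    table = defaultdict(lambda: ([], []))    # key -> (candidate set, clique set)
    for cand in candidates:
        table[cand[1:]][0].append(cand)      # Step 3: key is the last N nodes
    for clique in cliques:
        table[clique][1].append(clique)      # Step 3: key is the whole clique
    survivors = []
    for cands, known in table.values():
        if cands and known:                  # Step 4: both sets non-empty
            survivors.extend(cands)
    return sorted(survivors)

# Made-up graph, not the one from the post.
small_clique2 = [(1, 2), (1, 3), (1, 4), (2, 3)]
small_candidates = [(1, 2, 3), (1, 2, 4), (1, 3, 4)]
print(validate_candidates(small_candidates, small_clique2))
```

Here only (1,2,3) is kept, because (2,4) and (3,4) are not edges of the made-up graph.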

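Step 5. Repeat steps 1, 2, 3 and 4 until CliqueN+1 is empty or contains at most N+1 cliques. (An (N+2)-node clique would need N+2 distinct (N+1)-node sub-cliques, so no larger clique can exist in that case.)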

For this example:

redo step 1:

((1,2): {(1,2,3) (1,2,4)})

((1,3): {(1,3,4)})

((2,3): {(2,3,4) (2,3,5) (2,3,6)})

((2,4): {(2,4,5) (2,4,6)})

((2,5): {(2,5,6)})

((3,4): {(3,4,5) (3,4,6)})

((3,5): {(3,5,6)})

((4,5): {(4,5,6)})

redo step 2:
(1,2,3,4) (2,3,4,5) (2,3,4,6) (2,3,5,6) (2,4,5,6) (3,4,5,6)
redo step 3:
((1,2,3): {} {(1,2,3)})
((1,2,4): {} {(1,2,4)})
((1,3,4): {} {(1,3,4)})
((2,3,4): {(1,2,3,4)} {(2,3,4)})
((2,3,5): {} {(2,3,5)})
((2,3,6): {} {(2,3,6)})
((2,4,5): {} {(2,4,5)})
((2,4,6): {} {(2,4,6)})
((2,5,6): {} {(2,5,6)})
((3,4,5): {(2,3,4,5)} {(3,4,5)})
((3,4,6): {(2,3,4,6)} {(3,4,6)})
((3,5,6): {(2,3,5,6)} {(3,5,6)})
((4,5,6): {(2,4,5,6) (3,4,5,6)} {(4,5,6)})
redo step 4:
(1,2,3,4) (2,3,4,5) (2,3,4,6) (2,3,5,6) (2,4,5,6) (3,4,5,6)
These are the cliques with 4 nodes.
Going on like this finds all cliques. In this example, one more round yields the single 5-node clique (2,3,4,5,6), and then the iteration stops.
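
The whole level-by-level loop can be sketched in a few lines of single-machine Python. This is only an illustration: the join of Step 3 is replaced by a set lookup, and the names `EDGES`, `next_level` and `cliques_by_level` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (2, 6),
         (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]

def next_level(cliques):
    """Steps 1 to 4: build all (N+1)-node cliques from all N-node cliques."""
    known = set(cliques)
    clusters = defaultdict(list)
    for clique in cliques:                               # Step 1
        clusters[clique[:-1]].append(clique)
    result = []
    for prefix, members in clusters.items():
        for a, b in combinations(sorted(members), 2):    # Step 2
            cand = prefix + (a[-1], b[-1])
            if cand[1:] in known:                        # Steps 3 and 4
                result.append(cand)
    return sorted(result)

def cliques_by_level(edges):
    """Return {size: sorted list of cliques} for every size >= 2."""
    level = sorted({tuple(sorted(e)) for e in edges})    # Step 0
    levels, n = {}, 2
    while level:
        levels[n] = level
        if len(level) <= n:   # Step 5: too few cliques to form a larger one
            break
        level = next_level(level)
        n += 1
    return levels

levels = cliques_by_level(EDGES)
for size, found in levels.items():
    print(size, len(found))
```

For the example graph this reports 13 cliques of size 2, 13 of size 3, 6 of size 4 and 1 of size 5.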

How to implement it in map-reduce?

As you can see, almost every operation in these steps maps records to key-value pairs and aggregates them by key, which is exactly what map-reduce does.
Steps 1 and 2 fit in one map-reduce job, and steps 3 and 4 fit in another.
So each level of cliques needs only 2 map-reduce jobs.
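
To make the two-job structure concrete, here is a sketch that simulates map-reduce in memory. `run_mapreduce` is a toy stand-in written for this example, not the API of any real framework, and the mapper and reducer names are also hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def run_mapreduce(records, mapper, reducer):
    """Toy in-memory map-reduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Job 1 (steps 1 and 2): cluster by prefix, emit candidates.
def map_prefix(clique):
    yield clique[:-1], clique

def reduce_candidates(prefix, cliques):
    for a, b in combinations(sorted(cliques), 2):
        yield prefix + (a[-1], b[-1])

# Job 2 (steps 3 and 4): join tagged candidates with tagged cliques.
def map_join(record):
    tag, nodes = record
    yield (nodes[1:] if tag == "cand" else nodes), record

def reduce_join(key, records):
    if any(tag == "clique" for tag, _ in records):
        for tag, nodes in records:
            if tag == "cand":
                yield nodes

def next_level(cliques):
    candidates = run_mapreduce(cliques, map_prefix, reduce_candidates)
    tagged = [("cand", c) for c in candidates] + [("clique", c) for c in cliques]
    return sorted(run_mapreduce(tagged, map_join, reduce_join))

clique2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (2, 6),
           (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
clique3 = next_level(clique2)
clique4 = next_level(clique3)
print(clique4)
```

Tagging each record as a candidate or a clique is one common way to feed two inputs into a single reduce; a real job might use separate input paths instead.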

Why does this algorithm work?

It is based on the fact that if two n-node cliques share n-1 nodes and their two unshared nodes are connected, then the union of the two cliques is also a clique, with n+1 nodes.
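
This fact can be checked on the example graph with a short sketch. `is_clique` tests every pair of nodes, while `union_is_clique` tests only the two unshared nodes; both names are hypothetical.

```python
from itertools import combinations

EDGES = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (2, 6),
         (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)}

def is_clique(nodes):
    """Brute-force check: every pair of nodes must be an edge."""
    return all((min(a, b), max(a, b)) in EDGES
               for a, b in combinations(nodes, 2))

def union_is_clique(c1, c2):
    """For two n-cliques sharing n-1 nodes, test only the unshared pair."""
    (u,) = set(c1) - set(c2)
    (v,) = set(c2) - set(c1)
    return (min(u, v), max(u, v)) in EDGES

# (1,2,3) and (1,2,4) share (1,2); nodes 3 and 4 are connected.
print(union_is_clique((1, 2, 3), (1, 2, 4)), is_clique((1, 2, 3, 4)))
# (1,2,3) and (2,3,5) share (2,3); nodes 1 and 5 are not connected.
print(union_is_clique((1, 2, 3), (2, 3, 5)), is_clique((1, 2, 3, 5)))
```

In both cases the single-edge test agrees with the brute-force test, which is why Step 3 only needs one lookup per candidate.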

Section 2. find all maximal cliques level by level

As you can see, the steps above find all cliques, not just the maximal ones.

So what is a maximal clique?

It's a clique that is not a subset of any other clique.

Put the other way round: if a clique of N nodes is not maximal, then it must be a subset of some (N+1)-node clique.

Then the job is easy: enumerate all N-node subsets of the (N+1)-node cliques and use them to check the N-node cliques. Any N-node clique that appears among those subsets is not maximal. This is easy to implement in one map-reduce job, in the same way as steps 3 and 4 above.
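
A minimal sketch of this maximality check, assuming the cliques of two adjacent levels are already available as sorted tuples. `maximal_at_level` is a hypothetical name, and a set lookup stands in for the map-reduce join.

```python
from itertools import combinations

def maximal_at_level(cliques_n, cliques_n_plus_1):
    """Return the N-node cliques not contained in any (N+1)-node clique."""
    covered = set()
    for big in cliques_n_plus_1:
        # Every N-node subset of an (N+1)-node clique is non-maximal.
        covered.update(combinations(big, len(big) - 1))
    return [c for c in cliques_n if c not in covered]

clique4 = [(1, 2, 3, 4), (2, 3, 4, 5), (2, 3, 4, 6),
           (2, 3, 5, 6), (2, 4, 5, 6), (3, 4, 5, 6)]
clique5 = [(2, 3, 4, 5, 6)]

print(maximal_at_level(clique4, clique5))   # maximal 4-node cliques
print(maximal_at_level(clique5, []))        # top level: everything is maximal
```

For the example graph this leaves (1,2,3,4) at level 4 and (2,3,4,5,6) at level 5.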

Section 3. advantages and pitfalls

Advantages

1. Most of the heavy work is done by hashing, as in step 3, which scales well under map-reduce.
2. The algorithm is intuitive and easy to implement.
3. Maximal cliques are found level by level, so you can skip the small maximal cliques if you only care about large ones. It is a pity, though, that finding the cliques themselves cannot skip levels.

Pitfalls

1. Step 2, which generates candidates, emits about n^2/2 records (exactly n(n-1)/2, where n is the number of cliques in a cluster) for each cluster, which is gigantic when the graph is dense. This matters especially under map-reduce: the reduce size is estimated from the map size, but here the reduce size will be far larger than the map size, so in practice the reduce size has to be set manually. However, real-life large-scale graphs and networks are usually sparse; in a social network, for example, people only have hundreds or thousands of friends.

2. Cliques are calculated level by level, so it takes m iterations to complete, where m is the size of the largest clique. Map-reduce has overhead to prepare and start each job; at, say, 1 minute per iteration, 30 iterations (which is not rare) cost half an hour of overhead.
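
To get a feel for the first pitfall, the number of candidates emitted for one cluster can be computed directly. The cluster sizes below are arbitrary illustrations, and `candidates_emitted` is a hypothetical name.

```python
from math import comb

def candidates_emitted(cluster_size):
    """Pairs generated in step 2 for one cluster: n * (n - 1) / 2."""
    return comb(cluster_size, 2)

for n in (10, 1_000, 100_000):
    print(n, candidates_emitted(n))
```

A cluster of 1,000 cliques already emits 499,500 candidates, and the count grows quadratically from there.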

For why I am thinking about this algorithm, see my other blog post:

study in community detection in network

Recently I have been wondering whether I can find interest preferences in a bunch of online novel reading logs. The idea is intuitive: people may be interested in some kind of novel, such as love stories, magic, sci-fi or military, so two novels of the same kind are read by the same person more frequently than two novels of different kinds. This seems easy for coarse categories, as I already have a category tag for each novel, but identifying interests within finer subcategories, or even within a totally different hierarchy of categories such as interests by age group, seems much more fun.
The first thing that comes to mind is clustering. The classical algorithms are hierarchical clustering, k-means, canopy, Gaussian mixture models, Dirichlet process clustering, etc. The limitation is that, with only novel reading logs, a novel cannot be represented as a point in space, so k-means, center-based hierarchical clustering and Gaussian mixture models are not suitable. Besides this, ordinary clustering algorithms only consider links between 2 nodes and rarely consider the links among all nodes within a cluster.
The clustering coefficient is a good way to describe a cluster, but exhaustively enumerating all possible clusters and calculating the clustering coefficient of each is clearly unacceptable.
The paper "Finding and evaluating community structure in networks" (MEJ Newman, M Girvan, Physical Review E, 2004, APS) tries to solve a similar problem. Its method has two features: first, it is a divisive method, not an agglomerative one; second, it divides the graph using "betweenness", which is defined on an edge as the weighted sum of the shortest paths between any 2 nodes that pass through this edge.
But it is different from what I have in mind.
First, in the real world a node can belong to several communities at the same time.
Second, it is fine if some nodes are alone or some clusters are very small; what I am looking for is the mid-size clusters that are very cohesive. So I still prefer an agglomerative clustering method.

Recently I have been thinking through a clique-based method:

First, generate all maximal cliques of the whole graph.
Second, use each maximal clique as a core and attach other nodes to it to form a cluster, if the node is close enough to most of the nodes in the clique.

As in the picture below, the red nodes and the deep green nodes are 2 maximal cliques. The orange nodes are attached to the red clique to form one cluster, and the grass green nodes are attached to the deep green clique to form another. Some white nodes are not in any cluster, and the middle node, colored both orange and grass green, belongs to both clusters.
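
The attachment step is not specified in detail here, so the following is only one possible reading. It assumes "close enough to most of the nodes" means "adjacent to at least a given fraction of the clique's nodes". The threshold `min_fraction`, the function `attach_nodes` and the small adjacency map are all assumptions made for illustration.

```python
def attach_nodes(core, adjacency, min_fraction=0.6):
    """Grow a cluster from a clique core.

    A non-core node joins the cluster when it is adjacent to at least
    min_fraction of the core's nodes (an assumed closeness rule).
    """
    core = set(core)
    cluster = set(core)
    for node, neighbors in adjacency.items():
        if node in core:
            continue
        if len(neighbors & core) >= min_fraction * len(core):
            cluster.add(node)
    return cluster

# Made-up graph: core (1,2,3); node 4 touches two core nodes, node 5 only one.
adjacency = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 5},
    4: {1, 2},
    5: {3},
}
print(attach_nodes((1, 2, 3), adjacency))
```

Because each core is grown independently, a node can end up in several clusters, which matches the overlapping middle node described above.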

But the problem is that finding cliques is an NP-complete problem; when the graph is as large as a social network, a sequential method will be too slow. Here are some useful things I found:

"On Computing All Maximal Cliques Distributedly", Fábio Protti, Felipe M. G. França, and Jayme Luiz Szwarcfiter: a divide-and-conquer method that recursively divides the graph into v1 and v2 and computes the cliques in v1, in v2, and between v1 and v2.
xrime, a graph algorithm package that includes a maximal clique method. It generates the neighbors of neighbors for each node and finds the maximal cliques in each set of neighbors of neighbors. The method is clean and tricky, but emitting every adjacency list to all neighbors seems too heavy.

Here is a map-reduce-friendly algorithm I worked out to find maximal cliques level by level; see the blog post "a simple Maximal Clique algorithm for map reduce" for details.

Thursday, January 10, 2013

a simple Maximal Clique algorithm for map reduce