Thursday, January 10, 2013

a simple Maximal Clique algorithm for map reduce

Suppose there is a 6-node-graph as the pic below,  and I will explain how my algorithm finds all Maximal Cliques from it in a map-reduce friendly way.

Section 1. find all cliques level by level

Step 0. generate all node pairs for each edge and sort node in order. for this example:
(1,2) (1,3) (1,4) (2,3) (2,4) (2,5) (2,6) (3,4) (3,5) (3,6) (4,5) (4,6) (5, 6)
actually, these are all cliques with 2 nodes.
Step 1. from here we can suppose we have all cliques with N nodes, name it CliqueNs. Then cluster each clique according to the first N-1 node. for this example:
(1: {(1,2) (1,3) (1,4)}) 
(2: {(2,3) (2,4) (2,5) (2,6) }) 
(3: (3,4) (3,5) (3,6)) 
(4: (4,5) (4,6)) (5: {(5, 6)})
Step 2. for each cluster generate all pairs of clique-n as a clique-n+1 candidate,
for this example: 
(1,2,3) (1,2,4) (1,3,4)    
(2,3,4) (2,3,5) (2,3,6) (2,4,5) (2,4,6) (2,5,6)    
(3,4,5) (3,4,6) (3,5,6)    
(4,5,6)
Step 3. map all candidate using last n-1 nodes as key, map all nodes in CliqueN using all nodes as key, and reduce candidates and CliqueN together. for this example:
((1,2): {} {(1,2)}) 
((1,3): {} {(1,3)}) 
((1,4): {} {(1,4)}) 
((2,3): {(1,2,3)} {(2,3)}) 
((2,4): {(1,2,4)} {(2,4)}) 
((2,5): {} {(2,5)}) 
((2,6): {} {(2,6)}) 
((3,4): {(1,3,4) (2,3,4)} {(3,4)}) 
((3,5): {(2,3,5)} {(3,5)}) 
((3,6): {(2,3,6)} {(3,6)}) 
((4,5): {(2,4,5) (3,4,5)} {(4,5)}) 
((4,6): {(2,4,6) (3,4,6)} {(4,6)}) 
((5,6): {(2,5,6) (3,5,6) (4,5,6)} {(5,6)}) 
Step 4. only keep the entries that both candidate set and CliqueN set is not empty, and all candidates from remained entry becomes CliqueN+1, for this example:
(1,2,3) (1,2,4) (1,3,4) (2,3,4) (2,3,5) (2,3,6) (2,4,5) (3,4,5) (2,4,6) (3,4,6) (2,5,6) (3,5,6) (4,5,6)
Step 5. repeat step 2, 3, 4 until there is no CliqueN+1 or numbers in CliqueN+1 <= n+1

For this example:
redo step 1:
((1,2): {(1,2,3) (1,2,4)})
((1,3): {(1,3,4)})
((2,3): {(2,3,4) (2,3,5) (2,3,6)})
((2,4): {((2,4,5) (2,4,6))})
((2,5): {(2,5,6)})
((3,4): {(3,4,5) (3,4,6)})
((3,5): {(3,5,6)})
((4,5): {(4,5,6)})
redo step 2:
(1,2,3,4) (2,3,4,5) (2,3,4,6) (2,3,5,6) (2,4,5,6) (3,4,5,6)
redo step 3:
((1,2,3): {} {(1,2,3)})
((1,2,4): {} {(1,2,4)})
((1,3,4): {} {(1,3,4)})
((2,3,4): {(1,2,3,4)} {(2,3,4)})
((2,3,5): {} {(2,3,5)})
((2,3,6): {} {(2,3,6)})
((2,4,5): {} {(2,4,5)})
((2,4,6): {} {(2,4,6)})
((2,5,6): {} {(2,5,6)})
((3,4,5): {(2,3,4,5)} {(3,4,5)})
((3,4,6): {(2,3,4,6)} {(3,4,6)})
((3,5,6): {(2,3,5,6)} {(3,5,6)})
((4,5,6): {(2,4,5,6) (3,4,5,6)} {(4,5,6)})
redo step 4:
(1,2,3,4) (2,3,4,5) (2,3,4,6) (2,3,5,6) (2,4,5,6) (3,4,5,6)
these are clique with 4 nodes
go on like this you will eventually find all cliques.

How to implement it in map-reduce?

As you can see, almost all operations in the steps are mapping into key-values and aggregate by keys, which is the nature of map-reduce.
Step 1 and step 2 can be done within 1 map-reduce, and step 3 and step 4 can be done within 1 map-reduce.
Then we can implement this with only 2 map-reduce for each level of cliques.

Why does this algorithm work?

It bases on the fact that if 2 n-node-clique share n-1 nodes and if the 2 not shared nodes are connected, then the union of the 2 clique is also a clique, which is a n+1 node clique.



Section 2. find all maximal cliques level by level

As you can see, above steps only finds all clique but not maximal clique. 
So what is maximal clique? 
It's a clique that is not a subset of any other clique.
Think otherwise, if a clique of N nodes is not maximal, then is must be a subset of some N+1 node clique.
Then the job is easy, enumerate all n-node-subsets of n+1-node cliques and  use them to validate n-node cliques. It's easy to implement within 1 map-reduce job as step 3 and step 4 above.


Section 3. advances and pitfalls 

Advances

1. most heavy duty is implemented by hashing like step 3, which is scalable by using map-reduce
2. the algorithm is intutive and easy to implement
3. maximal cliques are found level by level, you can skip small maximal cliques if you only cares about large ones. but it's pity that find cliques can not skip.

pitfalls

1. in step 2, which is generating candidates, will generate n^2/2 (n is the size of a set of cliques) emit for each set of cliques, which is gigantic when the graph is dense. especially when implemented using map-reduce, the reduce size is estimate by map size, but in this case, the reduce size will far larger then map size, in practice it will need manual settings of reduce size. However, real-life large scale graphs or networks are usually sparse, like social network, people only have hundreds or thousands of friends. 
2. cliques is calculated level by level, it will need m (m is the size of the largest clique) iterations to complete. Map-reduce have overhead to prepare and start, suppose 1 minutes per iteration, it will take half an hour overhead for a 30 iterations, which is not rare.

about why I'm thinking about this algorithm, see my another blog  

2 comments:

  1. Don't you have the source code related to this

    ReplyDelete
    Replies
    1. This was a few years ago, so I can't remember clearly, maybe you can find something from here https://github.com/wonderyl/MyPigPractice/tree/master/maximal%20clique

      Delete