Sunday, July 14, 2013

Is a Promise actually a Wish in C++11

A typical usage of std::promise is like below:

#include <future>
#include <thread>

// future from a promise
std::promise<int> p;
std::future<int> f3 = p.get_future();
std::thread([](std::promise<int>& p){ p.set_value(9); },  // fulfill the promise
            std::ref(p)).detach();
f3.wait();  // block until the detached thread calls set_value (or set_exception)

It's confusing, because in real life a promise is something you commit to doing for others, not something you ask others to do.
If we instead call the promise here a wish, it all makes sense:
one makes a wish, hands the wish to another (a thread), and waits for the wish to come true. The wish may not come true, because the thread can call set_exception rather than set_value, and the stored exception is rethrown when the waiter calls get().

Friday, April 26, 2013

What is a big data problem?

Days ago I attended a presentation by Tao Shi introducing how to detect communities in a directed graph using SVD.
He described a definition of a big data problem from a statistician's perspective; he said, and I quote, "A small data problem is a problem that can be solved by sampling, and a big data problem cannot."

Thursday, April 25, 2013

A Good Approximation for L0 Norm Regularization?

In machine learning, regularization is usually used to prevent overfitting, and different regularizers serve different purposes. Usually the L2 norm or the L1 norm is used.
The L2 norm is differentiable everywhere, but converges slowly.
The L1 norm is non-differentiable only at 0, and converges faster than L2.
Optimization with the L0 norm is thought to be NP-hard.
The formula for L0 is
$y=\sum _{n=0}^N1\left(W_n\right)$
where 1() is an indicator function of whether $W_n$ is non-zero.
Not only is it non-differentiable at 0, the derivative at every other point is 0, so it gives gradient descent no signal and cannot be used as a regularizer directly.
Think about the function below:
$y=-e^{-b\left|x\right|}+1$
where b is a parameter; the bigger b is, the narrower the dip at 0 is.
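A quick check of why this approximates 1() (with the convention 1(0)=0): the function is exactly 0 at x=0 for every b, and tends to 1 everywhere else as b grows:
$\lim _{b\to \infty }\left(1-e^{-b\left|x\right|}\right)=\begin{cases}0 & x=0\\1 & x\neq 0\end{cases}$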
(figure: approximation of 1())
This would be a good approximation of 1(): it is non-differentiable only at 0, and the derivative at every other point is
$y'=be^{-b\left|x\right|}\operatorname{sign}\left(x\right)$
which looks like an inverted impulse.



Wednesday, April 10, 2013

Sparse Coding and Matrix Factorization

I just read an article about sparse coding for deep learning, which reminded me that it looks a little like matrix factorization for recommender systems.

  • Both try to find a sparse representation of the data.
  • Both balance two kinds of objects: data representations vs. features in sparse coding, items vs. users in matrix factorization.
  • Both need an alternating optimization algorithm: fix one set of variables and optimize the other, back and forth.

Tuesday, April 2, 2013

This Community Definition is What I'm Thinking

As discussed in my previous post, study in community detection in network, I'm looking for a definition/algorithm that extracts communities from a network with the properties below:
1. high intrinsic clustering
2. communities can overlap on nodes (e.g., your family community and your friends community overlap on you)
3. it's OK for some nodes to be single islands

Here is a paper that describes just what I wanted:
Uncovering the overlapping community structure of complex networks in nature and society
I found it via cfinder.org.

As in the paper:
"we define a community, or more precisely, a k-clique-community as a union of all k-cliques (complete sub-graphs of size k) that can be reached from each other through a series of adjacent k-cliques  (where adjacency means sharing k-1 nodes) "
Every node in a community is a member of at least one k-clique, which gives the high intrinsic clustering.
A node can belong to different cliques, and those cliques may share fewer than k-1 nodes, in which case they belong to different communities. This lets communities overlap.
Finally, nodes that don't belong to any k-clique form islands.
It's tricky to choose k. If k is too high, there will be too many islands.
If k is 2, every connected component becomes one community, which kills the overlap property.
So 3 or 4 is probably a good choice.
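Just to make the adjacency test in the definition concrete, here is a tiny sketch (my own illustrative helper, not from the paper):

import java.util.HashSet;
import java.util.Set;

// Two k-cliques are adjacent iff they share exactly k-1 nodes.
static boolean adjacent(Set<Integer> cliqueA, Set<Integer> cliqueB, int k) {
    Set<Integer> shared = new HashSet<>(cliqueA);
    shared.retainAll(cliqueB);        // intersection of the two node sets
    return shared.size() == k - 1;
}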



Tuesday, March 12, 2013

Polymorphism of Apache Pig's Eval Function

Polymorphism is not necessary in a programming language, but it makes our code more beautiful and clean; at the very least it saves us several lines of code.
In Apache Pig, an Eval Function is a class that extends EvalFunc, rather than a plain function, so we can't leverage Java's polymorphism for functions. But there are 2 back doors left/designed by the EvalFunc designers:
1. The input of an EvalFunc is a Tuple, which makes input polymorphism possible.
2. EvalFunc is generic, which makes output polymorphism possible.

Input Polymorphism

Input polymorphism refers to variance in the input.
The input of an EvalFunc is a Tuple whose elements are Objects, so you can put any object into the tuple and pass it to the EvalFunc.
For example:

public class Add extends EvalFunc<Double> {

    @Override
    public Double exec(Tuple input) throws IOException {
        Object a = input.get(0);
        Object b = input.get(1);
        Double da, db;
        if (a instanceof String) {
            da = Double.parseDouble((String) a);  // parse strings into doubles
        } else {
            da = (Double) a;                      // otherwise assume it's already a Double
        }
        if (b instanceof String) {
            db = Double.parseDouble((String) b);
        } else {
            db = (Double) b;
        }
        return da + db;
    }
}
In the example above, the Add function parses strings into doubles, so adding two strings, or a string and a double, both work.

Output Polymorphism

Output polymorphism refers to variance in the output.
Usually you have to designate the output type of an Eval Function; in the example above, Double is the return type. If you want the return type to vary, you can simply use Object as the return type.
For example:

public class AorB extends EvalFunc<Object> {

    @Override
    public Object exec(Tuple input) throws IOException {
        Object a = input.get(0);
        Object b = input.get(1);
        if (a != null) {
            return a;
        } else {
            return b;
        }
    }
}
In the example above, AorB returns a if a is not null, and b otherwise.

Of course, the combination of input and output polymorphism makes Eval Functions even more flexible and powerful.



Friday, March 8, 2013

R is interesting and powerful

Recently I'm taking an online course, Data Analysis by Jeff Leek, which introduces some basic ideas of data analysis. But the best part of it is that through this course I finally started to use R.
Sometimes at work I need to draw graphs: usually I use Excel for simple lines, bars, or pies, and desmos to explore functions. I have also used MATLAB to process data.
I heard about R long ago, but there was no motivation to just learn a new thing, until recently, when the data analysis course led me to explore R. Its grammar is like JavaScript, which makes it easier for me.
I think the reason I like to use R is mostly RStudio, which makes it a rival of MATLAB.

Tuesday, February 26, 2013

Practice on Matrix Factorization Using Pig

  1. Background

Recently I'm thinking about optimizing my recommender system. The online version uses item-based collaborative filtering, which basically recommends other items while a user is viewing an item.
The original use of matrix factorization in the Netflix-prize-winning algorithm was to predict ratings for unviewed items and recommend the highest-rated ones. But prediction takes a lot of effort, because you have to compute a score for every single item for every single user, which is O(m*n). This is not practical for a large system with tens of millions of users and millions of items.
My variation is to add matrix factorization on top of item-based collaborative filtering. That means:
first, use CF to generate recommendation candidates, usually tens or a hundred;
then, for each user, re-rank the candidates according to the features learned by MF.
This usage requires that the features learned by MF can be retrieved in real time, so it's best to minimize the size of the features, which means the learned features have to be as sparse as possible. I'm thinking of keeping about 10 significant features for each user or item.

2. Practice

According to Matrix Factorization Techniques for Recommender Systems, I chose the stochastic gradient descent algorithm to learn the features, which is:
1. easy to implement
2. relatively fast
3. easy to parallelize for large data sets

The loss function is given below:
loss = sum( (r_ui - q_i*p_u)^2 ) for (i, u) + lambda*( sum(||q_i||) for i + sum(||p_u||) for u )

Problem 1, which norm should be used for regularization?

According to Sparse methods for machine learning: Theory and algorithms,
using the l1-norm rather than the l2-norm used in the previous paper is better, as it performs model selection as well as regularization.

Define: e_ui = r_ui - q_i*p_u
Taking partial derivatives of the loss function with respect to q_i and p_u, the updates for q_i and p_u are as below:

q_i = q_i + LearningRate*( sum(e_ui*p_u) for u - lambda*sign(q_i) )
p_u = p_u + LearningRate*( sum(e_ui*q_i) for i - lambda*sign(p_u) )

Problem 2, how to initialize q_i and p_u?

Choosing the right initial point is very important for stochastic gradient descent. Even for a convex problem, a good initial point speeds up learning; for a non-convex problem like this one, a good choice also decides the quality of the local optimum.
Unfortunately, there is no general rule for choosing an initial point whose local optimum is close to the global optimum, but there are still ways to speed up learning and save a lot of time.
  • Is all zeros a good choice? Take a look at the update formula above: if the initial value is zero, the squared-error part is zero, and (taking sign(0) as 1) the regularization part makes every element of q_i and p_u equal to -LearningRate*lambda after the update. If that is what we want, we can just initialize q_i and p_u to -LearningRate*lambda and save all that calculation.
  • Is initializing every element with a small value like -LearningRate*lambda a good choice? I don't think so: if q_i and p_u are very small, their updates will also be very small in each round of training, so it takes many more steps to reach a point close to the optimum.
My intuition is to choose the initial point so that q_i*p_u is close to mean(r_ui), so that the start is close to the optimum.
Assume mean(r_ui) is 4, and q_i and p_u are both vectors with 100 elements; E() denotes expectation.
E(q_i*p_u) = 100*E(q_i_k)*E(p_u_k), as q_i_k and p_u_k are independent.
E(q_i*p_u) = 100*E(q_i_k)^2, assuming q_i_k and p_u_k have the same distribution.
Then E(q_i_k) = 0.2.
The question left is how to assign every q_i_k so that E(q_i_k) is 0.2.
A constant value is obviously not a good choice, because then every element in a vector gets the same update and all elements stay identical forever.
A simple alternative is to pick uniformly at random from [0, 0.4], or to use another distribution such as a Gaussian with mean 0.2. I didn't compare; I blindly chose uniform random assignment.
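As a minimal sketch of this initialization (Java; numFactors, meanRating, and the helper name are my own, not from the actual implementation):

import java.util.Random;

// Draw each element uniformly from [0, 2*E] so that its expectation is E,
// where E = sqrt(mean(r)/numFactors); for mean(r)=4 and 100 factors, E = 0.2.
static double[] initFactors(int numFactors, double meanRating, Random rnd) {
    double expected = Math.sqrt(meanRating / numFactors);
    double[] v = new double[numFactors];
    for (int k = 0; k < numFactors; k++) {
        v[k] = rnd.nextDouble() * 2 * expected;   // uniform on [0, 0.4] here
    }
    return v;
}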

Problem 3, how to choose the learning rate?

Theoretically, as long as the learning rate is small enough, stochastic gradient descent will eventually reach a local optimum. So a brute-force way is to start from 0.1 and shrink exponentially (0.01, 0.001, ...) until the loss decreases at every step.
In my experiment, I had to go down to 1e-5 to make the loss decrease, and with such a small learning rate the decrease is very, very slow: after about 100 rounds it had only decreased about 10%.
After thinking about this for a few days, I realized that the algorithm only takes account of observed data, which leads to a huge imbalance. As the popularity of items follows Zipf's law, there is a huge imbalance between items (and between users).
For example, item A may be viewed by 1000 users, while item B is viewed by only 1 user.
Then the squared-error part of the update of q_i for A is roughly 1000 times larger than for B. A uniform learning rate for both A and B is therefore inefficient. That is also why only such a tiny learning rate could make the loss decrease: after summing the errors from all users who viewed a popular item, the update could be very large.
To tackle the problem, I use avg instead of sum, so that the update has the same scale as q_i and p_u. The update functions become:

q_i = q_i + LearningRate*(avg(e_ui*p_u) for u - lambda*sign(q_i) )
p_u = p_u + LearningRate*(avg(e_ui*q_i) for i - lambda*sign(p_u) )

In this case, I only need a 0.01 learning rate and the learning speed is good: after 60 rounds, more than half the error was gone.
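For concreteness, here is a minimal in-memory sketch of one averaged update for a single item vector (Java; the names and data layout are my own illustration, not the actual implementation):

import java.util.List;

// Update item vector q_i using the average error over the users who rated
// item i, plus the l1 (sub)gradient: q_i += lr*(avg(e_ui*p_u) - lambda*sign(q_i)).
static void updateItem(double[] qi, List<double[]> userVectors,
                       List<Double> ratings, double lr, double lambda) {
    int f = qi.length;
    double[] sum = new double[f];
    for (int u = 0; u < userVectors.size(); u++) {
        double[] pu = userVectors.get(u);
        double pred = 0;
        for (int k = 0; k < f; k++) pred += qi[k] * pu[k];
        double e = ratings.get(u) - pred;              // e_ui = r_ui - q_i*p_u
        for (int k = 0; k < f; k++) sum[k] += e * pu[k];
    }
    for (int k = 0; k < f; k++) {
        double avg = sum[k] / userVectors.size();      // avg instead of sum
        qi[k] += lr * (avg - lambda * Math.signum(qi[k]));
    }
}

The update of p_u is symmetric, averaging over the items the user rated.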

Problem 4, how to make the feature vectors sparse?


As described in the background section, I want the feature vectors of users and items to be as sparse as possible, so that storing them requires less space. Furthermore, in reality a user usually has only a few interests and an item belongs to only a few categories, so sparseness is natural.
The idea is simple: to make the result sparse, we just add a regularization term that measures sparseness to the loss function.
The first thing that popped into my head was entropy:
H = sum( -p_i*log(p_i) )
where p_i is the probability mass of each element, i.e., for each element i of a vector
p_i = abs(x_i) / sum( abs(x_j) ) for j
According to information theory, entropy measures how many bits are needed to describe the vector: the sparser the vector, the lower its entropy. After adding this to the loss function, it becomes:
loss = sum( (r_ui - q_i*p_u)^2 ) for (i, u) + lambda_1*( sum(||q_i||) for i + sum(||p_u||) for u ) + lambda_2*( sum(H(q_i)) for i + sum(H(p_u)) for u )
Next, let's change the update function.
To simplify, assume sum( abs(x_j) ) is a constant; it can then be ignored, because magnitude is not important in gradient descent.
Then
dH/dp_i = -( log(p_i) + 1 )
dH/dx_i = -( log(x_i) + 1 ) if x_i >= 0
dH/dx_i = log(-x_i) + 1 if x_i < 0
The derivative is plotted in red and orange in the picture below:
(figure: Derivative of Entropy)
Look at the derivative: it becomes extremely large when the absolute value of x_i is near zero. If we apply this in gradient descent, we will subtract (gradient descent uses the negative of the derivative) a big number when x_i is just a little bigger than 0, which makes the next x_i a negative number with a very large absolute value. This makes x_i very unstable.
Now think about how entropy makes the vector sparse under gradient descent. When x_i is bigger than 0.1 (with log base 10, the derivative is zero at x_i = 0.1), the update adds a little to x_i, and the bigger x_i is, the bigger the addition. When x_i is smaller than 0.1, the update subtracts a little from x_i, and the smaller x_i is, the bigger the subtraction. So we can look for a function whose derivative behaves similarly but does not blow up when the absolute value is small.
The simplest form is
H(x_i) = -x_i^2 + |x_i|
dH/dx_i = -2*x_i + 1 if x_i >= 0
dH/dx_i = -2*x_i - 1 if x_i < 0
that is, dH/dx_i = -2*x_i + sign(x_i): the derivative is simply a linear function, shown in blue and black in the picture above.
In this case, the update functions become:

q_i = q_i + LearningRate*( avg(e_ui*p_u) for u - lambda_1*sign(q_i) - lambda_2*(-2*q_i + sign(q_i)) )
p_u = p_u + LearningRate*( avg(e_ui*q_i) for i - lambda_1*sign(p_u) - lambda_2*(-2*p_u + sign(p_u)) )
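Element-wise, the combined update looks like this (again just a sketch with my own names):

// Per-element update with the averaged error gradient, the l1 term,
// and the entropy surrogate H(x) = -x^2 + |x|, whose gradient is -2x + sign(x).
static double updateElement(double x, double avgErrGrad,
                            double lr, double lambda1, double lambda2) {
    double sign = Math.signum(x);
    return x + lr * (avgErrGrad - lambda1 * sign - lambda2 * (-2 * x + sign));
}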

In my experiment, the entropy term really did make the output vectors sparser. The picture below shows the same vector trained with and without entropy regularization, with elements sorted from small to big. As you can see, the blue one is steeper.



Thursday, January 10, 2013

A simple Maximal Clique algorithm for map-reduce

Suppose there is a 6-node graph as in the picture below. I will explain how my algorithm finds all maximal cliques from it in a map-reduce-friendly way.

Section 1. find all cliques level by level

Step 0. Generate a node pair for each edge, with the nodes of each pair sorted. For this example:
(1,2) (1,3) (1,4) (2,3) (2,4) (2,5) (2,6) (3,4) (3,5) (3,6) (4,5) (4,6) (5,6)
Actually, these are all the cliques with 2 nodes.
Step 1. From here on, suppose we have all cliques with N nodes; call them CliqueN. Cluster the cliques by their first N-1 nodes. For this example:
(1: {(1,2) (1,3) (1,4)})
(2: {(2,3) (2,4) (2,5) (2,6)})
(3: {(3,4) (3,5) (3,6)})
(4: {(4,5) (4,6)})
(5: {(5,6)})
Step 2. For each cluster, merge every pair of N-cliques into an (N+1)-node candidate,
for this example: 
(1,2,3) (1,2,4) (1,3,4)    
(2,3,4) (2,3,5) (2,3,6) (2,4,5) (2,4,6) (2,5,6)    
(3,4,5) (3,4,6) (3,5,6)    
(4,5,6)
Step 3. Map each candidate using its last N nodes as the key, map each clique in CliqueN using all N of its nodes as the key, and reduce candidates and CliqueN together. For this example:
((1,2): {} {(1,2)}) 
((1,3): {} {(1,3)}) 
((1,4): {} {(1,4)}) 
((2,3): {(1,2,3)} {(2,3)}) 
((2,4): {(1,2,4)} {(2,4)}) 
((2,5): {} {(2,5)}) 
((2,6): {} {(2,6)}) 
((3,4): {(1,3,4) (2,3,4)} {(3,4)}) 
((3,5): {(2,3,5)} {(3,5)}) 
((3,6): {(2,3,6)} {(3,6)}) 
((4,5): {(2,4,5) (3,4,5)} {(4,5)}) 
((4,6): {(2,4,6) (3,4,6)} {(4,6)}) 
((5,6): {(2,5,6) (3,5,6) (4,5,6)} {(5,6)}) 
Step 4. Keep only the entries where both the candidate set and the CliqueN set are non-empty; all candidates from the remaining entries become CliqueN+1. For this example:
(1,2,3) (1,2,4) (1,3,4) (2,3,4) (2,3,5) (2,3,6) (2,4,5) (3,4,5) (2,4,6) (3,4,6) (2,5,6) (3,5,6) (4,5,6)
Step 5. Repeat steps 2, 3, and 4 until no CliqueN+1 is produced (or CliqueN+1 is too small to generate new candidates).

For this example:
redo step 1:
((1,2): {(1,2,3) (1,2,4)})
((1,3): {(1,3,4)})
((2,3): {(2,3,4) (2,3,5) (2,3,6)})
((2,4): {(2,4,5) (2,4,6)})
((2,5): {(2,5,6)})
((3,4): {(3,4,5) (3,4,6)})
((3,5): {(3,5,6)})
((4,5): {(4,5,6)})
redo step 2:
(1,2,3,4) (2,3,4,5) (2,3,4,6) (2,3,5,6) (2,4,5,6) (3,4,5,6)
redo step 3:
((1,2,3): {} {(1,2,3)})
((1,2,4): {} {(1,2,4)})
((1,3,4): {} {(1,3,4)})
((2,3,4): {(1,2,3,4)} {(2,3,4)})
((2,3,5): {} {(2,3,5)})
((2,3,6): {} {(2,3,6)})
((2,4,5): {} {(2,4,5)})
((2,4,6): {} {(2,4,6)})
((2,5,6): {} {(2,5,6)})
((3,4,5): {(2,3,4,5)} {(3,4,5)})
((3,4,6): {(2,3,4,6)} {(3,4,6)})
((3,5,6): {(2,3,5,6)} {(3,5,6)})
((4,5,6): {(2,4,5,6) (3,4,5,6)} {(4,5,6)})
redo step 4:
(1,2,3,4) (2,3,4,5) (2,3,4,6) (2,3,5,6) (2,4,5,6) (3,4,5,6)
these are the cliques with 4 nodes.
Going on like this, you will eventually find all cliques.

How to implement it in map-reduce?

As you can see, almost all operations in these steps are mapping into key-values and aggregating by key, which is the nature of map-reduce.
Steps 1 and 2 can be done in one map-reduce job, and steps 3 and 4 in another.
So we can implement each level of cliques with only 2 map-reduce jobs.
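To make the per-level logic concrete, here is a minimal in-memory Java sketch of steps 2, 3, and 4, with plain collections standing in for the map and reduce phases (all names are mine, not an actual Hadoop job):

import java.util.*;

// One level: from the set of N-cliques (each a sorted list of node ids),
// produce the set of (N+1)-cliques.
static Set<List<Integer>> nextLevel(Set<List<Integer>> cliques) {
    // Steps 1-2: cluster by the first N-1 nodes, then merge pairs into candidates.
    Map<List<Integer>, List<List<Integer>>> clusters = new HashMap<>();
    for (List<Integer> c : cliques)
        clusters.computeIfAbsent(c.subList(0, c.size() - 1), k -> new ArrayList<>()).add(c);
    Set<List<Integer>> candidates = new HashSet<>();
    for (List<List<Integer>> group : clusters.values())
        for (int i = 0; i < group.size(); i++)
            for (int j = i + 1; j < group.size(); j++) {
                List<Integer> a = group.get(i), b = group.get(j);
                List<Integer> cand = new ArrayList<>(a);
                cand.add(b.get(b.size() - 1));   // union of the two cliques
                Collections.sort(cand);
                candidates.add(cand);
            }
    // Steps 3-4: keep a candidate only if its last N nodes form an existing clique.
    Set<List<Integer>> next = new HashSet<>();
    for (List<Integer> cand : candidates)
        if (cliques.contains(cand.subList(1, cand.size())))
            next.add(cand);
    return next;
}

On the real cluster, the clustering becomes a group-by on the first N-1 nodes, and the membership check becomes the join of candidates (keyed by their last N nodes) against CliqueN (keyed by all of its nodes) described in step 3.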

Why does this algorithm work?

It is based on the fact that if two N-node cliques share N-1 nodes and the two unshared nodes are connected, then the union of the two cliques is also a clique, with N+1 nodes.



Section 2. find all maximal cliques level by level

As you can see, the steps above find all cliques, but not maximal cliques.
So what is a maximal clique?
It's a clique that is not a subset of any other clique.
Think of it the other way: if a clique of N nodes is not maximal, it must be a subset of some clique with N+1 nodes.
Then the job is easy: enumerate all N-node subsets of the (N+1)-node cliques and use them to invalidate N-node cliques. It's easy to implement in one map-reduce job, just like steps 3 and 4 above.
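A minimal in-memory sketch of that filter (Java; names are mine):

// Keep only the N-cliques that are not contained in any (N+1)-clique:
// enumerate every N-node subset of each (N+1)-clique and subtract them.
static Set<List<Integer>> maximalOnly(Set<List<Integer>> cliquesN,
                                      Set<List<Integer>> cliquesN1) {
    Set<List<Integer>> covered = new HashSet<>();
    for (List<Integer> big : cliquesN1)
        for (int skip = 0; skip < big.size(); skip++) {
            List<Integer> sub = new ArrayList<>(big);
            sub.remove(skip);                 // drop one node -> an N-node subset
            covered.add(sub);
        }
    Set<List<Integer>> maximal = new HashSet<>(cliquesN);
    maximal.removeAll(covered);               // survivors are maximal
    return maximal;
}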


Section 3. advantages and pitfalls

Advantages

1. Most of the heavy lifting is done by hashing, as in step 3, which scales well with map-reduce.
2. The algorithm is intuitive and easy to implement.
3. Maximal cliques are found level by level, so you can skip small maximal cliques if you only care about large ones. It's a pity that finding the cliques themselves cannot be skipped.

Pitfalls

1. Step 2 (generating candidates) emits on the order of n^2/2 candidates per cluster (n is the size of a cluster of cliques), which is gigantic when the graph is dense. Especially when implemented on map-reduce, the reduce size is estimated from the map size, but here the reduce size can be far larger than the map size, so in practice it needs a manual setting of the number of reducers. However, real-life large-scale graphs and networks are usually sparse; in a social network, people only have hundreds or thousands of friends.
2. Cliques are calculated level by level, so it takes m iterations to complete (m is the size of the largest clique). Map-reduce has overhead to prepare and start; suppose 1 minute per iteration, then 30 iterations cost half an hour of overhead, which is not rare.

For why I'm thinking about this algorithm, see my other blog post.