Friday, April 26, 2013

What is a big data problem?

Days ago I attend a representation of Tao Shi introducing how to detect community from directed graph using svd.
And his described a definition of big data problem from a statistician's perspective, he said and I quote "A small data problem is a problem that can be solved by sampling, and a big data problem can not."

Thursday, April 25, 2013

A Good Approximation for L0 Norm Regularization ?

In machine learning, Regularization are usually used to prevent over fitting. Different regularization are used subject to different purpose. Usually, L2 norm and L1 norm are used.
L2 norm is fully derivable, but converge slowly
L1 norm is only broken on 0, and faster than L2.
L0 norm are thought to be NP-complete.
the formular for L0 is 
$y=\sum _{n=0}^N1\left(W_n\right)$
where 1() is a indicate function, whether Wn is non-zero
Not only it is broken on 0, the derivative of other point is 0, which is not able to be used as regularization.
Think about the function below:
$y=-e^{-b\left|x\right|}+1$
where b is a parameter, the bigger b is the thinner the pitch is.
Approximation of 1()
This would be a good approximation of 1(), as it's only broken on 0 and the derivative of other points are
$y=be^{-b\left|x\right|}\sign \left(x\right)$
It seems like a Reverse of an Impulse.



Wednesday, April 10, 2013

Sparse Coding and Matrix Fatorization

Just read a article about sparse coding for deep learning, which remained me that it seems a little like Matrix Factorization for recommender system.

  • Both are trying to find a sparse representation of data.
  • Both are trying to balance between 2 objects, data representation vs features in sparse coding, item vs user in matrix factorization 
  • both need alternative optimization algorithm to achieve global optimization 

Tuesday, April 2, 2013

This Community Definition is What I'm Thinking

As discussed in my previous post study in community detection in network , I'm looking for a definition/algorithm that address communities from a network that has several properties as below:
1. high intrinsic clustering
2. communities can overlap on nodes (like your family community and friend community can overlap on you)
3. It's ok for some node to be single island

here is a paper described just like what I wanted:
uncovering the overlapping community structure of complex networks in nature and society
And I found it from cfinder.org

As in the paper:
"we define a community, or more precisely, a k-clique-community as a union of all k-cliques (complete sub-graphs of size k) that can be reached from each other through a series of adjacent k-cliques  (where adjacency means sharing k-1 nodes) "
For each node in a community, it's at least a member of a k-clique, which leads to intrinsic clustering.
As a node can belong to different cliques and those cliques can share less then k-1 nodes which mean they belong to different community. This leads to overlap between communities
Finally, there must be some nodes that don't belong to any clique then they form islands.
It's tricky to decide k. if k is too high, there must be too much islands.
If k is 2, then all connected nodes form a community which block the overlap property.
so maybe 3 or 4 would be good choice.