Saturday, April 27, 2013
Friday, April 26, 2013
What is a big data problem?
Days ago I attend a representation of Tao Shi introducing how to detect community from directed graph using svd.
And his described a definition of big data problem from a statistician's perspective, he said and I quote "A small data problem is a problem that can be solved by sampling, and a big data problem can not."
And his described a definition of big data problem from a statistician's perspective, he said and I quote "A small data problem is a problem that can be solved by sampling, and a big data problem can not."
Thursday, April 25, 2013
A Good Approximation for L0 Norm Regularization ?
In machine learning, Regularization are usually used to prevent over fitting. Different regularization are used subject to different purpose. Usually, L2 norm and L1 norm are used.
L2 norm is fully derivable, but converge slowly
L1 norm is only broken on 0, and faster than L2.
L0 norm are thought to be NP-complete.
the formular for L0 is
$y=\sum _{n=0}^N1\left(W_n\right)$ |
where 1() is a indicate function, whether Wn is non-zero
Not only it is broken on 0, the derivative of other point is 0, which is not able to be used as regularization.
Think about the function below:
$y=-e^{-b\left|x\right|}+1$ |
where b is a parameter, the bigger b is the thinner the pitch is.
Approximation of 1() |
$y=be^{-b\left|x\right|}\sign \left(x\right)$ |
Wednesday, April 10, 2013
Sparse Coding and Matrix Fatorization
Just read a article about sparse coding for deep learning, which remained me that it seems a little like Matrix Factorization for recommender system.
- Both are trying to find a sparse representation of data.
- Both are trying to balance between 2 objects, data representation vs features in sparse coding, item vs user in matrix factorization
- both need alternative optimization algorithm to achieve global optimization
Tuesday, April 2, 2013
This Community Definition is What I'm Thinking
As discussed in my previous post study in community detection in network , I'm looking for a definition/algorithm that address communities from a network that has several properties as below:
1. high intrinsic clustering
2. communities can overlap on nodes (like your family community and friend community can overlap on you)
3. It's ok for some node to be single island
here is a paper described just like what I wanted:
uncovering the overlapping community structure of complex networks in nature and society
And I found it from cfinder.org
As in the paper:
"we define a community, or more precisely, a k-clique-community as a union of all k-cliques (complete sub-graphs of size k) that can be reached from each other through a series of adjacent k-cliques (where adjacency means sharing k-1 nodes) "
For each node in a community, it's at least a member of a k-clique, which leads to intrinsic clustering.
As a node can belong to different cliques and those cliques can share less then k-1 nodes which mean they belong to different community. This leads to overlap between communities
Finally, there must be some nodes that don't belong to any clique then they form islands.
It's tricky to decide k. if k is too high, there must be too much islands.
If k is 2, then all connected nodes form a community which block the overlap property.
so maybe 3 or 4 would be good choice.
1. high intrinsic clustering
2. communities can overlap on nodes (like your family community and friend community can overlap on you)
3. It's ok for some node to be single island
here is a paper described just like what I wanted:
uncovering the overlapping community structure of complex networks in nature and society
And I found it from cfinder.org
As in the paper:
"we define a community, or more precisely, a k-clique-community as a union of all k-cliques (complete sub-graphs of size k) that can be reached from each other through a series of adjacent k-cliques (where adjacency means sharing k-1 nodes) "
For each node in a community, it's at least a member of a k-clique, which leads to intrinsic clustering.
As a node can belong to different cliques and those cliques can share less then k-1 nodes which mean they belong to different community. This leads to overlap between communities
Finally, there must be some nodes that don't belong to any clique then they form islands.
It's tricky to decide k. if k is too high, there must be too much islands.
If k is 2, then all connected nodes form a community which block the overlap property.
so maybe 3 or 4 would be good choice.
Subscribe to:
Posts (Atom)