- Background
Recently I'm thinking about optimize my recommender system. The online version is using Item-based Collaborative Filtering, which basically recommend other items when user is viewing an item.
The Original usage of Matrix Factorization by Netflix-prize-winning algorithm is to predict rating for not viewed items and choose the highest rating items to recommend. But it take a lot of effort to predict because you have to calculate for every single item for every single user which is O(m*n). This is not practical for large system having 10s of million users and millions of items.
My variation of using Matrix Factorization is to add it on top of Item-based Collaborative Filtering. Which means:
first, using cf to generate recommend candidates, usually 10s or 100
then, for each single user, re-rank candidates according to features learned by mf.
this usage requires features learned by mf can be retrieved in real-time. the best to do is to minimize the size of features, which means the features learned have to be as sparse as possible. I'm thinking keep about 10 significant features for each user or item.
2. Practice
according to Matrix Factorization Techniques for Recommender System, I choose Stochastic Gradient Descend algorithm to learn the features which is:
1. easy to implement
2. relative fast running time
3. easy to parallel for large data set
the lost function is give below
lost = sum(r_ui - q_i*p_u)^2) for (i, u) + lambda*( sum(||q_i||) for i + sum(||p_u||) for u )
Problem 1, which norm should be used as regularization ?
According to Sparse methods for machine learning Theory and algorithms
using l1-norm rather than l2-norm used in the previous paper is better as it performs regularization as well as model selection.
define: e_iu = r_ui - q_i*p_u
take partial derivative for q_i and p_u for lost function then the update for q_i and p_u are as below:
q_i = q_i + LearningRate*(sum(e_ui*p_u) for u - lambda*sign(q_i) )
p_u = p_u + LearningRate*(sum(e_ui*q_i) for i - lambda*sign(p_u) )
In my experiment, I have to choose 10E-5 to make lost function result decrease. and with this small learning rate, the decrease of lost function result is very very slow, after run about 100 round, it only decrease about 10%.
After think about this for a few days. I realized that the algorithm only take account of observed data, which lead to huge unbalance. As the hotness of items coincides with Zipf's Law, there is huge unbalance between items or users.
For example, item A may be viewed by 1000 users, but item B may only be viewed by 1 user.
then the square error part of update function of q_i for A is supposed to be 1000 times larger then B. In this case, a uniform learning rate for both item A and B would be inefficient. That is also why only such little learning rate can decrease lost function result. It's because after sum errors from all users view a popular item, the update difference could be very large.
To tackle the problem, I use avg instead of sum so that the difference of update would be the same size as q_i or p_u. Then the update function becomes to:
q_i = q_i + LearningRate*(avg(e_ui*p_u) for u - lambda*sign(q_i) )
p_u = p_u + LearningRate*(avg(e_ui*q_i) for i - lambda*sign(p_u) )
In this case, I only need a 0.01 learning rate and the learning speed is good, after taken 60 round, more than half error is reduced.
As described in background section, I hope the feature vector of user and item to be as sparse as possible so that it will require less space to store those vectors. Further, in reality, a user usually has only several interests and an item is also belong to several categories, so spareness is necessary.
The idea is simple, to make the result spare, we just have to add a regularization element that evaluate spareness to lost function.
The first thing poped into my head is Entropy.
H=sum(-p_i*log(p_i))
where p is the probability of each element, which is for each element i in a vector
p_i = abs(x_i)/sum(x_i)
According to information theory, Entropy is a metrics for how many bits needed to describe the vector, the more sparse the vector is the less entropy is has. then after adding this into lost function, it becomes:
lost = sum(r_ui - q_i*p_u)^2) for (i, u) + lambda_1*( sum(||q_i||) for i + sum(||p_u||) for u ) + lambda_2*( sum(H(q_i)) for i + sum(H(p_j)) for j )
let's change update function next.
to simplify, we just assume sum(x_i) is a constant then it can be ignored because magnitude is not important in gradient descend.
then
d H/d p_i = -( log(p_i) + 1 )
d H/ d x_i = -( log(x_i) + 1 ) if x_i >=0
= log(-x_i) + 1 if x_i <0
the derivative is colored in red and orange in the picture below:
Let's take a look of the derivative. It becomes extreme large when the absolute value of x_i is near zero. If we apply this into gradient descend, we will subtract(gradient descend use the negative of derivative) a big number when x_i is just a little bigger then 0, which make the next x_i a negative number with a very big absolute value. This will make x_i very unstable.
Let's think about how entropy makes the vector sparse when applying gradient descend. When x_i is bigger than 0.1 which makes H(x_i)=0, it add a little bit to x_i, and the bigger x_i is, the bigger the addition is. When x_i is smaller than 0.1, it subtract a little bit to x_i, and the smaller x_i is, the bigger the subtraction is. Then we can find a function whose derivative has similar attributes and the derivative of it won't be extreme large when the absolute value is small.
The simplest form is
H(p_i) = -x^2+x
d H/d x_i = -2x+1 if x>=0
= -2x-1 if x<0
whose derivative is simply a linear function as showed in blue and black in the above picture.
In this case, the update functions become:
q_i = q_i + LearningRate*(avg(e_ui*p_u) for u - lambda_1*sign(q_i) - lambda_2*(-2x+sign(q_i)) )
p_u = p_u + LearningRate*(avg(e_ui*q_i) for i - lambda*sign(p_u) - lambda_2*(-2x+sign(p_j)) )
In my experiment, the entropy really make the output vector more sparse. In the picture below is the same vector trained with and without entropy regularization. Elements are displayed from small to big. As you can see the blue one is more steeper.
using l1-norm rather than l2-norm used in the previous paper is better as it performs regularization as well as model selection.
define: e_iu = r_ui - q_i*p_u
take partial derivative for q_i and p_u for lost function then the update for q_i and p_u are as below:
q_i = q_i + LearningRate*(sum(e_ui*p_u) for u - lambda*sign(q_i) )
p_u = p_u + LearningRate*(sum(e_ui*q_i) for i - lambda*sign(p_u) )
Problem 2, how to initiate q_i and p_u?
Choosing a right initial point is very import for stochastic gradient descent. even for a convex problem, choose a good initial point will speed up the learning process. for non-convex problem like this, a good choice also decide the quality of local optimum.
Unfortunately, there is not general rule to choose a good initial point that make a local optimum close to global optimum, but there is still way to speed up the learning process that save up a lot time.
- Is choosing all 0 a good choice? Take a look at the update formula above, if the initial is zero, the square error part would become zero, and the regularization part will become 1. after the update, all elements of q_i and p_u will be -LearningRate*lambda. if that is what we want, we can just initiate q_i and p_u with -LearningRate*lambda and save all calculation.
- Is initiate every elements with a small value like -LearningRate*lambda a good choice? I don't think so, if q_i and p_u is very small, then the update of them will also be very small for each round of training, then it will take several more step to came to a point close to optimum.
Assume mean(r_ui) is 4, and q_i and p_u both are vectors with 100 elements. E() is for Expectation in statistics
E(q_i*p_u) = 100*E(q_i_k) * E(p_u_k) as q_i_k and p_u_k are independent.
E(q_i*p_u) = 100*E(q_i_k) ^2 as assuming q_i_k and p_u_k have the same distribution.
then E(q_i_k) = 0.2
The question left is how to assign every q_i_k so that E(q_i_k) is 0.2.
Constant value obviously is not a good choice, because in this case, every element in a vector will update with the same difference and keep all elements the same forever.
A simple alternative is randomly pick from [0, 0.4], or use other distribution like Gaussian Distribution with mean = 0.2. I didn't try but blindly choose random assign.
Constant value obviously is not a good choice, because in this case, every element in a vector will update with the same difference and keep all elements the same forever.
A simple alternative is randomly pick from [0, 0.4], or use other distribution like Gaussian Distribution with mean = 0.2. I didn't try but blindly choose random assign.
Problem 3, how to choose Learning Rate ?
Theoretically as long as the Learning Rate is small enough, stochastic gradient descent will eventually get to local optimum. so a brute force way to do it is to start from 0.1 and exponentially shrink like 0.01, 0.001... until every step lost function result decreases.In my experiment, I have to choose 10E-5 to make lost function result decrease. and with this small learning rate, the decrease of lost function result is very very slow, after run about 100 round, it only decrease about 10%.
After think about this for a few days. I realized that the algorithm only take account of observed data, which lead to huge unbalance. As the hotness of items coincides with Zipf's Law, there is huge unbalance between items or users.
For example, item A may be viewed by 1000 users, but item B may only be viewed by 1 user.
then the square error part of update function of q_i for A is supposed to be 1000 times larger then B. In this case, a uniform learning rate for both item A and B would be inefficient. That is also why only such little learning rate can decrease lost function result. It's because after sum errors from all users view a popular item, the update difference could be very large.
To tackle the problem, I use avg instead of sum so that the difference of update would be the same size as q_i or p_u. Then the update function becomes to:
q_i = q_i + LearningRate*(avg(e_ui*p_u) for u - lambda*sign(q_i) )
p_u = p_u + LearningRate*(avg(e_ui*q_i) for i - lambda*sign(p_u) )
In this case, I only need a 0.01 learning rate and the learning speed is good, after taken 60 round, more than half error is reduced.
Problem 4, how to make feature vector sparse?
As described in background section, I hope the feature vector of user and item to be as sparse as possible so that it will require less space to store those vectors. Further, in reality, a user usually has only several interests and an item is also belong to several categories, so spareness is necessary.
The idea is simple, to make the result spare, we just have to add a regularization element that evaluate spareness to lost function.
The first thing poped into my head is Entropy.
H=sum(-p_i*log(p_i))
where p is the probability of each element, which is for each element i in a vector
p_i = abs(x_i)/sum(x_i)
According to information theory, Entropy is a metrics for how many bits needed to describe the vector, the more sparse the vector is the less entropy is has. then after adding this into lost function, it becomes:
lost = sum(r_ui - q_i*p_u)^2) for (i, u) + lambda_1*( sum(||q_i||) for i + sum(||p_u||) for u ) + lambda_2*( sum(H(q_i)) for i + sum(H(p_j)) for j )
let's change update function next.
to simplify, we just assume sum(x_i) is a constant then it can be ignored because magnitude is not important in gradient descend.
then
d H/d p_i = -( log(p_i) + 1 )
d H/ d x_i = -( log(x_i) + 1 ) if x_i >=0
= log(-x_i) + 1 if x_i <0
the derivative is colored in red and orange in the picture below:
Derivative of Entropy |
Let's think about how entropy makes the vector sparse when applying gradient descend. When x_i is bigger than 0.1 which makes H(x_i)=0, it add a little bit to x_i, and the bigger x_i is, the bigger the addition is. When x_i is smaller than 0.1, it subtract a little bit to x_i, and the smaller x_i is, the bigger the subtraction is. Then we can find a function whose derivative has similar attributes and the derivative of it won't be extreme large when the absolute value is small.
The simplest form is
H(p_i) = -x^2+x
d H/d x_i = -2x+1 if x>=0
= -2x-1 if x<0
whose derivative is simply a linear function as showed in blue and black in the above picture.
In this case, the update functions become:
q_i = q_i + LearningRate*(avg(e_ui*p_u) for u - lambda_1*sign(q_i) - lambda_2*(-2x+sign(q_i)) )
p_u = p_u + LearningRate*(avg(e_ui*q_i) for i - lambda*sign(p_u) - lambda_2*(-2x+sign(p_j)) )
In my experiment, the entropy really make the output vector more sparse. In the picture below is the same vector trained with and without entropy regularization. Elements are displayed from small to big. As you can see the blue one is more steeper.
No comments:
Post a Comment