SGD vs ALS for Matrix Factorization

Alex Egg,

When using a Matrix Factorization approach to implement a recommendation algorithm, you decompose your large user/item matrix into lower-dimensional user factors and item factors. In the simplest approach you can then estimate the user rating (or, more generally, preference) by multiplying those factors according to the following equation:

$$\hat{r}_{ui} = p_u^T q_i \tag{1}$$

In order to learn those factors you need to minimize the following quadratic loss function:

$$\min_{P,\,Q} \sum_{(u,i) \in \mathcal{K}} \left(r_{ui} - p_u^T q_i\right)^2 \tag{2}$$

where $\mathcal{K}$ is the set of $(u, i)$ pairs for which a rating is observed.

Note that for simplicity I am omitting the possible bias terms in the first equation and the regularization term in the second.

In an SGD (Stochastic Gradient Descent) approach, for each example in the dataset you compute the error $(r_{ui} - p^T_uq_i)$ and then update the parameters by a small step in the direction opposite to the gradient.
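As a minimal sketch of that update rule (function and parameter names are my own; regularization is exposed as an option but defaults to zero, matching the simplified loss above):

```python
import numpy as np

def sgd_mf(ratings, n_factors=8, lr=0.02, reg=0.0, n_epochs=1000, seed=0):
    """SGD matrix factorization on a list of (user, item, rating) triples.

    For each observed rating, compute the error e_ui = r_ui - p_u^T q_i
    and step both factor vectors opposite the gradient of the squared loss.
    Biases are omitted to match the prediction equation above.
    """
    rng = np.random.default_rng(seed)
    n_users = 1 + max(u for u, _, _ in ratings)
    n_items = 1 + max(i for _, i, _ in ratings)
    P = 0.1 * rng.standard_normal((n_users, n_factors))  # user factors
    Q = 0.1 * rng.standard_normal((n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]              # e_ui = r_ui - p_u^T q_i
            p_u = P[u].copy()                  # keep old p_u for q_i's update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q
```

Each pass touches only the observed ratings, which is why SGD works well on sparse explicit-feedback data.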

Alternating Least Squares (ALS) represents a different approach to optimizing the loss function. The key insight is that you can turn the non-convex optimization problem in Equation (2) into an "easy" quadratic problem if you fix either $p_u$ or $q_i$. ALS fixes each of them alternately: while one is held fixed, the other can be computed in closed form as an ordinary least-squares solve, and then the roles are swapped.
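The alternating solves can be sketched as follows (again with names of my own choosing; a small ridge term `reg` is included because the unregularized normal equations can be singular):

```python
import numpy as np

def als_mf(R, mask, n_factors=8, reg=0.1, n_iters=15, seed=0):
    """ALS on a rating matrix R, with a 0/1 mask of observed entries.

    With Q fixed, each user's p_u is the solution of a small least-squares
    problem over that user's observed items, and symmetrically for q_i.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.standard_normal((n_users, n_factors))
    Q = 0.1 * rng.standard_normal((n_items, n_factors))
    I = np.eye(n_factors)
    for _ in range(n_iters):
        # Fix Q, solve for each p_u:  (Q_u^T Q_u + reg*I) p_u = Q_u^T r_u
        for u in range(n_users):
            obs = mask[u] > 0
            P[u] = np.linalg.solve(Q[obs].T @ Q[obs] + reg * I,
                                   Q[obs].T @ R[u, obs])
        # Fix P, solve for each q_i:  (P_i^T P_i + reg*I) q_i = P_i^T r_i
        for i in range(n_items):
            obs = mask[:, i] > 0
            Q[i] = np.linalg.solve(P[obs].T @ P[obs] + reg * I,
                                   P[obs].T @ R[obs, i])
    return P, Q
```

Note that each user's solve depends only on the fixed item factors (and vice versa), which is exactly what makes the parallelization discussed below straightforward.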

There are two main benefits to this approach. First, it is very easy to parallelize: with the item factors fixed, every user's factors can be computed independently, and vice versa. Second, when dealing with implicit-feedback datasets, which are effectively dense (every user–item pair carries a signal), SGD is not practical: the number of user–item pairs can easily be in the order of billions. ALS is a much more efficient optimization technique in these cases.