Recommender Systems: Theory

Alex Egg,

This is a survey of modern recommender systems, particularly looking at the details of Matrix Factorization methods: Weighted-Regulated MF and Bayesian Personalized Ranking with respect to explicit and implicit feedback datasets.

Ranking Function

The ranking function $r_{ui}$ gives the ranking score of user $u$ for item $i$ where $R$ is the sparse matrix of user/item interactions.

The objective is to approximate a ranking function $\hat{r}_{ui}$ which is the unobserved ranking by user $u$ and for item $i$. How do we find a function $\hat{r}$ that approximates r?

There are two methods two popular methods to learn $\hat{r}$:

This paper will exclusively focus on Collaborative Filtering techniques.

Content-based Approach

Find other items w/ a low distance function. Finds similar items to past liked items (stuck in a bubble)

Pros: No cold-start problem.

Cons: Pigeon Hole Problem: Fails to surface orignal content

Collaborative Filtering

Find users who liked what I liked. Then recommend what they liked (collaboration). Based upon the assumption that those who agreed in the past tend to agree in the future.

Among CF methods, there are two main categories:

Memory Based - Neighborhood Models

Find similar users (user-based CF) or items (item-based CF) to predict missing ratings.

User-Based CF

  1. Find k nearest neighbors $S$ (user/item matrix)
  2. Generate recommendations based on items liked by k neighbors.



Problem: Memory-based. Expensive online similarity computation — does not scale.

Item-Based CF

Estimate rating by looking at simular items and running a weighted sum. First we must establish a simularity measure:

Using the similarity measure, we identify the k items rated by u, which are most similar to i. This set of k neighbors is denoted by S. The predicted value of $r_{ui}$ is taken as a weighted average of the ratings for neighboring items:

All item-oriented models share a disadvantage in regards to implicit feedback –they do not provide the flexibility to make a distinction between user preferences and the confidence we might have in those preferences.

Neighborhood Model Problems

Performance Implications

These methods are computationally expensive and grow w/ the number of users and items and therefore do not scale well. Similarity measure is a computational bottleneck. Time Complexity: Highly time consuming with millions of users and items in database.

The worst case complexity is $O(mn)$ (m customers and n products).

One solution is to break the neighborhood generation and production into discrete steps:

Some techniques such as clustering or K-means can help also. In other words do some initial clustering of users and only computer the similarity of users who are initially similar.

Sparcity Prolem
Sparcity Example: Netflix Challenge

If you represent the Netflix Challenge rating data as a user/item matrix you would get: 500k x 17k = 8.5B position matrix where only about 100M are not missing/0s!

Solution: Latent-Factor Models

Model Based - Latent Factor Models

In contrast to the previous, memory based system, which try to generate a a prediction using the entire user/item matrix, Model based methods will take a more sophisticated approach and try to develop a model of the user.

Matrix Factorization (MF) methods relate users and items by uncovering latent dimensions such that users have similar representations to items they rate highly, and are the basis of many state-of-the-art recommendation approaches (e.g. Rendle et al. (2009), Bell, Koren, and Volinsky (2007), Bennett and Lanning (2007)).

Ratings are deeply influenced by a set of factors that are very specific to the domain (e.g. amount of action in movies, compleixty of characters). These factors are in genreal not obvious, we might be able to think of some of them but it’s hard to estimate their impact on ratings. The goal is to infer those so called latent factors from the rating data by using mathametical techniques: Singular Value Decomposition (SVD).

SVD/Matrix Factorization

The basic idea is that we want collapse the sparse (full of 0s) user/item matrix into something much less sparse and lower dimensional. The idea behind matrix factorization is that we can do that by decomposing my original matrix into 3 components: user/item matrix, item/item matrix and a diagonal that maps factors into themselves. There are different ways to get to this decomposition, but the idea is that we reduce the dimensionality of the item space into a lower space where I have a latent number of factors and then mapping users to those factors and items to those factors.

Instead of computing this in a closed form mathematically way, you can do it iteratively using stochastic gradient descent.

All these methods look to optimze a cost function w/ respect the parameters $x$ and $y$ which are latent factors. The differences are the cost functions and optimization techniques.

Within the category of Model-based methods there can be two categories:

Point-wise methods

Weighed Regulated Matrix Factorization (WR-MF)

The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. An adaption of a SVD, which minimizes the squared loss. Their extensions are regularization to prevent overfitting and weights in the error function to increase the impact of positive feedback.

A typical model associates each user $u$ with a user-factors vector $x_u\in{R^f}$ , and each item $i$ with an item-factors vector $y_i\in{R^f}$. The prediction is done by taking an inner product, i.e.:

The more involved part is parameter estimation for $x$ and $y$. Many of the recent works, applied to explicit feedback datasets, suggested modeling directly only the observed ratings, while avoiding overfitting through an adequate regularized model, such as:

where the last term is the regularization term using the square of the L2 norm of the two parameters we are interested in optimizing: $x$ and $y$.

To estimate the parameters $x$ and $y$ we minimize the following regularized squared error cost function above. You can solve this w/ Alternating Least Squares (ALC) or Stochastic Gradient Descent (SGD). Below are partial derivatives updates for SGD:

where $e_{ui} = r_{ui} - \hat{r}_{ui}$ and $\eta$ is the learning rate and $\lambda$ is the regularization coefficient. These steps are performed over all the ratings of the trainset and repeated n_epochs times. Baselines are initialized to 0. User and item factors are initialized to 0.1, as recommended by Funk.

Note: If we use ALS to solve the cost function we can employ parallelism to speed up training. Spark implements the ALS solver natively.

Pair-wise methods

In contrast to point-wise methods, pairwise methods are based on a weaker but possibly more realistic assumption that positive feedback must only be ‘more preferable’ than non-observed feedback. Such methods directly optimize the ranking of the feedback and are to our knowledge state-of-the-art for implicit feedback datasets. Rendle et al.(2009) propose a generalized Bayesian Personalized Ranking (BPR) framework and experimentally show that BPR-MF (i.e., with MF as the underlying predictor) outperforms a variety of competitive baselines. [Julian VBPR paper]

Maximum Margin Matrix Factorization (MM-MF)

A pairwise MF model from Gantner et al.(2011), which is optimized for a hinge ranking loss on $x_{u,i,j}$ and trained using SGA.

Baysian Personalized Recommenadation - BRP MF

Optimization criterion for personalized ranking that is based on pairs of items (i.e. the user-specific order of two items. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.

A training set $D_S$ consists of triples of the form $(u, i, j)$, where $u$ denotes the user together with an item $i$ about which they expressed positive feedback, and a non-observed item $j$:

This dataset takes a different approach than the point-wise methods above which are based on a user/item matrix. Here, for each user, we build an item/item matrix which shows how user $u$ prefered item $i$ over item $j$. See the BPR cost function:

where $\sigma$ is the sigmoid function, $\Theta$ is the model parameters: $x$ and $y$ and $\hat{r}_{uij}$ is

The cost function above has an interesting curve, it is convex and not concave — so this will impose a requirement upon us to use Stochastic Gradient Ascent (SGA) as opposed to batch SGA:

cost function

Figure 1: The cost function which is maximized at around $\theta=2$

Thus, in order to maximize the cost function we will have to do gradient ascent.

First a triple $(u, i, j)$ is sampled from $D_S$ and then the learning algorithm updates parameters in the following fashion:

where $\eta$ is the learning rate and the partial derivatives for the parameters $\Theta$ are:

With the cost function minimized, reconstruct the $R$ matrix $\hat{R}$ and make a prediction:

Collaborative Filtering Limitations

Cold Start

There needs to be enough other users already in the system to find a match. new items need to get enough ratings.

New user problem: TO make accurate recommendations, the system must first learn the user’s preferences from the ratings. Several techniques proposed to address tech. Most use hybrid recommendation approach (see VBPR, Ruining, McAuley), which combines content-based and collaborative techniques. One solution to this is to recommend popular items until the user has enough ratings in the system for CF to give meaningful results.

New Item Problem: New items are added regularly to recommender system. Until the new item is rated by a substantial number of users, the recommender system is not able to recommend it.

Popularity Bias

Hard to recommend items to someone with unique tastes. Tends to recommend popular items (items from the tail do not get so much data)



Data gathered for recommendation systems is either in the form of explicit or implicit feedback. Explicit datasets are the typical situation while implicit datasets require some extra considering for modeling.

Explicit Feedback

Explicit feedback datasets are a collection of $(u,i,r)$ triples where $u$ is a user, $i$ is an item and $r$ is some type of quantifiable ranking/rating. A common example is star ratings from Netflix, MovieLens etc.

MovieLens sparse matrix, e.g.: 133,960x431,826 = 57.8 billion positions where only 100M are not 0’s!

Implicit Feedback

Unlike explicit feedback, the $(u,i,r)$ implicit feedback triples are not directly quantifiable, but rather indirect actions that imply some degree of preference, e.g:

k Parameter

We can choose how many factors to include in the model by tuning the hyper-paramter: $k$ . $k$ specifies how many eigenvectors/components we want to keep in $x$ and $y$. As $k$ decreases the quality of the approximation of $R$ will decrease in proportion to the quality of the missing rakings predictions. Tuning $k$ is a tradeoff between dimensionality and accuracy which is also connected to computational time and memory to train and run the model predictions.

Permalink: recommender-systems


Last edited by Alex Egg, 2017-05-09 23:09:08
View Revision History