Recommender Systems: Theory

Alex Egg

This is a survey of modern recommender systems, particularly looking at the details of two Matrix Factorization methods, Weighted Regularized Matrix Factorization (WR-MF) and Bayesian Personalized Ranking (BPR), with respect to explicit and implicit feedback datasets.

Ranking Function

The ranking function $r_{ui}$ gives the ranking score of user $u$ for item $i$, where $R$ is the sparse matrix of user/item interactions.

The objective is to approximate a ranking function $\hat{r}_{ui}$ that predicts the unobserved ranking by user $u$ for item $i$. How do we find a function $\hat{r}$ that approximates $r$?

There are two popular approaches to learning $\hat{r}$: the content-based approach and collaborative filtering.

This paper will exclusively focus on Collaborative Filtering techniques.

Content-based Approach

Find items with a small distance to items the user liked in the past, i.e. recommend items similar to previously liked items (which can leave the user stuck in a bubble).

Pros: No cold-start problem.

Cons: Fails to surface original content.

Collaborative Filtering

Find users who liked what I liked. Then recommend what they liked (collaboration). Based upon the assumption that those who agreed in the past tend to agree in the future.

Among CF methods, there are two main categories:

Memory Based - Neighborhood Models

Find similar users (user-based CF) or items (item-based CF) to predict missing ratings.

User-Based CF

  1. Find the $k$ nearest neighbors $S$ of the user in the user/item matrix.
  2. Generate recommendations based on the items liked by those $k$ neighbors (see the sketch below).


Cons: Memory-based. The expensive online similarity computation does not scale.
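Below is a minimal NumPy sketch of user-based scoring (the helper name and dense-matrix layout are assumptions for illustration, not from any particular library):

```python
import numpy as np

# A minimal sketch of user-based CF scoring. `R` is a user/item matrix with
# 0 for missing ratings; `u` is the index of the target user.
def user_based_scores(R, u, k=10):
    # Cosine similarity between user u and every other user.
    norms = np.linalg.norm(R, axis=1) + 1e-9
    sims = (R @ R[u]) / (norms * norms[u])
    sims[u] = -np.inf                    # exclude the target user
    S = np.argsort(sims)[-k:]            # indices of the k nearest neighbors
    # Score items by the similarity-weighted ratings of the neighbors.
    scores = sims[S] @ R[S]
    scores[R[u] > 0] = -np.inf           # don't re-recommend items u already rated
    return scores                        # recommend the highest-scoring items
```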

Item-Based CF

Estimate the rating by looking at similar items and computing a weighted sum. First we must establish a similarity measure:
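A common choice is the cosine similarity $s_{ij}$ between the rating vectors of items $i$ and $j$ (an assumption here; adjusted cosine or Pearson correlation are also common):

$$s_{ij} = \frac{\sum_{u \in U_{ij}} r_{ui}\, r_{uj}}{\sqrt{\sum_{u \in U_{ij}} r_{ui}^2}\, \sqrt{\sum_{u \in U_{ij}} r_{uj}^2}}$$

where $U_{ij}$ denotes the set of users who rated both $i$ and $j$.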

Using the similarity measure, we identify the $k$ items rated by $u$ that are most similar to $i$. This set of $k$ neighbors is denoted by $S$. The predicted value of $r_{ui}$ is taken as a weighted average of the ratings for neighboring items:
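$$\hat{r}_{ui} = \frac{\sum_{j \in S} s_{ij}\, r_{uj}}{\sum_{j \in S} s_{ij}}$$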

All item-oriented models share a disadvantage with regard to implicit feedback: they do not provide the flexibility to make a distinction between user preferences and the confidence we might have in those preferences.

Cons: Memory-based. The expensive online similarity computation does not scale.

Model Based - Latent Factor Models

All of these methods optimize a cost function with respect to the parameters $x$ and $y$, which are latent factors. The differences lie in the cost functions and optimization techniques.

Matrix Factorization (MF) methods relate users and items by uncovering latent dimensions such that users have similar representations to items they rate highly, and are the basis of many state-of-the-art recommendation approaches (e.g. Rendle et al. (2009), Bell, Koren, and Volinsky (2007), Bennett and Lanning (2007)).

Ratings are deeply influenced by a set of factors that are very specific to the domain (e.g. amount of action in movies, complexity of characters). These factors are in general not obvious; we might be able to think of some of them, but it is hard to estimate their impact on ratings. The goal is to infer these so-called latent factors from the rating data using mathematical techniques such as Singular Value Decomposition (SVD).

Model-based methods fall into two further categories: point-wise and pair-wise.

Point-wise methods

Weighted Regularized Matrix Factorization (WR-MF)

This is the famous SVD algorithm popularized by Simon Funk during the Netflix Prize: an adaptation of SVD that minimizes the squared loss. Its extensions add regularization to prevent overfitting and weights in the error function to increase the impact of positive feedback.

A typical model associates each user $u$ with a user-factors vector $x_u \in \mathbb{R}^f$, and each item $i$ with an item-factors vector $y_i \in \mathbb{R}^f$. The prediction is done by taking an inner product, i.e.:
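$$\hat{r}_{ui} = x_u^\top y_i$$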

The more involved part is parameter estimation for $x$ and $y$. Many recent works on explicit feedback datasets suggest directly modeling only the observed ratings, while avoiding overfitting through an adequately regularized model, such as:
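$$\min_{x,\, y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \lVert x_u \rVert^2 + \lVert y_i \rVert^2 \right)$$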

where $\mathcal{K}$ is the set of $(u,i)$ pairs with observed ratings and the last term is the regularization term using the square of the L2 norm of the two parameters we are interested in optimizing: $x$ and $y$.

To estimate the parameters $x$ and $y$ we minimize the regularized squared error cost function above. This can be solved with Alternating Least Squares (ALS) or Stochastic Gradient Descent (SGD). Below are the partial-derivative updates for SGD:
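$$x_u \leftarrow x_u + \eta \left( e_{ui} \cdot y_i - \lambda \cdot x_u \right)$$

$$y_i \leftarrow y_i + \eta \left( e_{ui} \cdot x_u - \lambda \cdot y_i \right)$$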

where $e_{ui} = r_{ui} - \hat{r}_{ui}$, $\eta$ is the learning rate, and $\lambda$ is the regularization coefficient. These steps are performed over all the ratings of the training set and repeated n_epochs times. Baselines are initialized to 0. User and item factors are initialized to 0.1, as recommended by Funk.
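As a concrete illustration, here is a minimal NumPy sketch of these updates (the function name and data layout are assumptions, not taken from any particular library):

```python
import numpy as np

# Minimal SGD for the regularized squared-error cost above.
# `ratings` is a list of (u, i, r) training triples.
def sgd_mf(ratings, n_users, n_items, k=40, eta=0.005, lam=0.02, n_epochs=20):
    rng = np.random.default_rng(0)
    # Factors near Funk's 0.1 initialization; a little noise breaks the
    # symmetry between the k dimensions.
    x = rng.normal(0.1, 0.01, (n_users, k))   # user factors
    y = rng.normal(0.1, 0.01, (n_items, k))   # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            e_ui = r - x[u] @ y[i]                  # prediction error
            x_u = x[u].copy()                       # update from old values
            x[u] += eta * (e_ui * y[i] - lam * x_u)
            y[i] += eta * (e_ui * x_u - lam * y[i])
    return x, y
```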

Note: If we use ALS to solve the cost function we can employ parallelism to speed up training. Spark implements the ALS solver natively.
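For example, a minimal PySpark sketch (the `ratings_df` DataFrame and its column names are assumptions):

```python
from pyspark.ml.recommendation import ALS

# Assumes a DataFrame `ratings_df` with (userId, itemId, rating) columns.
als = ALS(
    rank=40,          # number of latent factors k
    maxIter=10,
    regParam=0.1,     # lambda
    userCol="userId",
    itemCol="itemId",
    ratingCol="rating",
)
model = als.fit(ratings_df)
top_10 = model.recommendForAllUsers(10)  # top-10 items per user
```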

Pair-wise methods

In contrast to point-wise methods, pairwise methods are based on a weaker but possibly more realistic assumption that positive feedback must only be ‘more preferable’ than non-observed feedback. Such methods directly optimize the ranking of the feedback and are to our knowledge state-of-the-art for implicit feedback datasets. Rendle et al. (2009) propose a generalized Bayesian Personalized Ranking (BPR) framework and experimentally show that BPR-MF (i.e., with MF as the underlying predictor) outperforms a variety of competitive baselines (see He and McAuley's VBPR paper).

Maximum Margin Matrix Factorization (MM-MF)

A pairwise MF model from Gantner et al. (2011), which is optimized for a hinge ranking loss on $\hat{r}_{uij}$ and trained using Stochastic Gradient Ascent (SGA).

Bayesian Personalized Ranking (BPR-MF)

An optimization criterion for personalized ranking that is based on pairs of items (i.e. the user-specific order of two items). It maximizes the prediction difference between a positive example and a randomly chosen negative example. This is useful when only positive interactions are present and optimizing ROC AUC is desired.

A training set $D_S$ consists of triples of the form $(u, i, j)$, where $u$ denotes the user together with an item $i$ about which they expressed positive feedback, and a non-observed item $j$:
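$$D_S = \left\{ (u, i, j) \mid i \in I_u^+ \wedge j \in I \setminus I_u^+ \right\}$$

where $I_u^+$ is the set of items with positive feedback from user $u$ and $I$ is the set of all items.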

This dataset takes a different approach than the point-wise methods above, which are based on a user/item matrix. Here, for each user, we build an item/item matrix which shows how user $u$ preferred item $i$ over item $j$. See the BPR cost function:
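$$\text{BPR-OPT} = \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{r}_{uij}) - \lambda_\Theta \lVert \Theta \rVert^2$$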

where $\sigma$ is the sigmoid function, $\Theta$ denotes the model parameters $x$ and $y$, $\lambda_\Theta$ is the regularization coefficient, and $\hat{r}_{uij} = \hat{r}_{ui} - \hat{r}_{uj}$ is the difference between the predictions for the positive item $i$ and the negative item $j$.

The cost function above has an interesting shape: it is concave with a maximum rather than convex with a minimum, so it must be maximized. This imposes a requirement upon us to use Stochastic Gradient Ascent (SGA); and since the set of triples is far too large to enumerate, the triples are sampled stochastically rather than processed in full batches:

Figure 1: The cost function, which is maximized at around $\theta = 2$.

Thus, in order to maximize the cost function we will have to do gradient ascent.

First a triple $(u, i, j)$ is sampled from $D_S$ and then the learning algorithm updates parameters in the following fashion:
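$$\Theta \leftarrow \Theta + \eta \left( \sigma(-\hat{r}_{uij}) \cdot \frac{\partial \hat{r}_{uij}}{\partial \Theta} - \lambda_\Theta \cdot \Theta \right)$$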

where $\eta$ is the learning rate and the partial derivatives for the parameters $\Theta$ are:
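$$\frac{\partial \hat{r}_{uij}}{\partial \theta} = \begin{cases} y_i - y_j & \text{if } \theta = x_u \\ x_u & \text{if } \theta = y_i \\ -x_u & \text{if } \theta = y_j \\ 0 & \text{otherwise} \end{cases}$$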

With the cost function maximized, we reconstruct the approximation $\hat{R}$ of the ratings matrix $R$ and make predictions:
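$$\hat{R} = X Y^\top$$

where the rows of $X$ are the user factors $x_u$ and the rows of $Y$ are the item factors $y_i$.

Putting the pieces together, here is a minimal NumPy sketch of BPR-MF trained with SGA (the function name, data layout, and sampling scheme are assumptions for illustration):

```python
import numpy as np

# `pos_items` maps each user index to the set of items with positive
# feedback from that user.
def bpr_mf(pos_items, n_users, n_items, k=40, eta=0.05, lam=0.01, n_steps=100_000):
    rng = np.random.default_rng(0)
    x = rng.normal(0, 0.1, (n_users, k))   # user factors
    y = rng.normal(0, 0.1, (n_items, k))   # item factors
    users = [u for u in range(n_users) if pos_items[u]]
    for _ in range(n_steps):
        u = rng.choice(users)
        i = rng.choice(list(pos_items[u]))     # observed (positive) item
        j = int(rng.integers(n_items))         # sample a non-observed item
        while j in pos_items[u]:
            j = int(rng.integers(n_items))
        r_uij = x[u] @ (y[i] - y[j])
        g = 1.0 / (1.0 + np.exp(r_uij))        # sigma(-r_uij)
        x_u = x[u].copy()                      # update from old values
        x[u] += eta * (g * (y[i] - y[j]) - lam * x_u)
        y[i] += eta * (g * x_u - lam * y[i])
        y[j] += eta * (-g * x_u - lam * y[j])
    return x, y                                # predictions: x @ y.T
```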

Appendix

Datasets

Data gathered for recommendation systems is in the form of either explicit or implicit feedback. Explicit datasets are the typical case, while implicit datasets require some extra consideration when modeling.

Explicit Feedback

Explicit feedback datasets are a collection of $(u,i,r)$ triples where $u$ is a user, $i$ is an item and $r$ is some type of quantifiable ranking/rating. A common example is star ratings from Netflix, MovieLens etc.

MovieLens is an example of such a sparse matrix: 133,960 × 431,826 = 57.8 billion positions, of which only 100M are not 0’s!

Implicit Feedback

Unlike explicit feedback, the $(u,i,r)$ implicit feedback triples are not directly quantifiable, but are rather indirect actions that imply some degree of preference, e.g. clicks, page views, purchases, or play counts.

k Parameter

We can choose how many factors to include in the model by tuning the hyper-parameter $k$, which specifies how many eigenvectors/components we keep in $x$ and $y$. As $k$ decreases, the quality of the approximation of $R$ decreases, and with it the quality of the predictions of the missing rankings. Tuning $k$ is a tradeoff between dimensionality and accuracy, which is also connected to the computational time and memory needed to train the model and run predictions.
