Recommender Systems: Theory
This is a survey of modern recommender systems, focusing on the details of two Matrix Factorization methods: Weighted Regularized MF (WRMF) and Bayesian Personalized Ranking (BPR), with respect to explicit and implicit feedback datasets.
Ranking Function
The ranking function $r_{ui}$ gives the ranking score of user $u$ for item $i$ where $R$ is the sparse matrix of user/item interactions.
The objective is to approximate a ranking function $\hat{r}_{ui}$, which predicts the unobserved ranking by user $u$ for item $i$. How do we find a function $\hat{r}$ that approximates $r$?
There are two popular methods to learn $\hat{r}$:
- Content-based Approach
- Collaborative Filtering
This paper will exclusively focus on Collaborative Filtering techniques.
Content-based Approach
Find other items with a low distance under some distance function, i.e. recommend items similar to past liked items (which can trap the user in a bubble).
Pros: No cold-start problem.
Cons: Fails to surface original content.
Collaborative Filtering
Find users who liked what I liked. Then recommend what they liked (collaboration). Based upon the assumption that those who agreed in the past tend to agree in the future.
Among CF methods, there are two main categories:
- Memory-Based: Neighborhood Models
- Latent Factor Models: Matrix Factorization (state-of-the-art)
Memory-Based: Neighborhood Models
Find similar users (user-based CF) or items (item-based CF) to predict missing ratings.
User-Based CF
- Find the $k$ nearest neighbors $S$ of user $u$ in the user/item matrix.
- Generate recommendations based on items liked by those $k$ neighbors.
Problem: Memory-based. Expensive online similarity computation — does not scale.
Item-Based CF
Estimate the rating by looking at similar items and computing a weighted sum. First we must establish a similarity measure:
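A common choice (an assumption here, since any reasonable measure works) is cosine similarity between the rating columns of items $i$ and $j$:

```latex
s_{ij} = \frac{\sum_{u \in U_{ij}} r_{ui}\, r_{uj}}
              {\sqrt{\sum_{u \in U_{ij}} r_{ui}^2}\;\sqrt{\sum_{u \in U_{ij}} r_{uj}^2}}
```

where $U_{ij}$ is the set of users who rated both items.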
Using the similarity measure, we identify the $k$ items rated by $u$ which are most similar to $i$. This set of $k$ neighbors is denoted by $S$. The predicted value of $r_{ui}$ is taken as a weighted average of the ratings for neighboring items:
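In the standard form, weighting each neighboring rating by its similarity $s_{ij}$ to item $i$:

```latex
\hat{r}_{ui} = \frac{\sum_{j \in S} s_{ij}\, r_{uj}}{\sum_{j \in S} s_{ij}}
```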
All item-oriented models share a disadvantage with regard to implicit feedback: they do not provide the flexibility to distinguish between user preferences and the confidence we might have in those preferences.
Cons: Memory-based. Expensive online similarity computation — does not scale.
Model-Based: Latent Factor Models
All these methods optimize a cost function with respect to the parameters $x$ and $y$, which are latent factors. The differences lie in the cost functions and optimization techniques.
Matrix Factorization (MF) methods relate users and items by uncovering latent dimensions such that users have similar representations to items they rate highly, and are the basis of many state-of-the-art recommendation approaches (e.g. Rendle et al. (2009); Bell, Koren, and Volinsky (2007); Bennett and Lanning (2007)).
Ratings are deeply influenced by a set of factors that are very specific to the domain (e.g. amount of action in movies, complexity of characters). These factors are in general not obvious; we might be able to think of some of them, but it is hard to estimate their impact on ratings. The goal is to infer those so-called latent factors from the rating data by using mathematical techniques such as Singular Value Decomposition (SVD).
Within the category of Model-based methods there are two subcategories:
- Pointwise methods: These typically deal with explicit feedback datasets and minimize a cost function based on the error (difference) between the predicted and observed explicit ratings.
- Pairwise methods: These typically deal with implicit feedback datasets and maximize a cost function relating positive and non-observed observations.
Pointwise methods
Weighted Regularized Matrix Factorization (WRMF)
The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. It is an adaptation of SVD which minimizes the squared loss. Its extensions are regularization to prevent overfitting and weights in the error function to increase the impact of positive feedback.
A typical model associates each user $u$ with a user-factors vector $x_u \in \mathbb{R}^f$, and each item $i$ with an item-factors vector $y_i \in \mathbb{R}^f$. The prediction is done by taking an inner product, i.e.:
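In symbols, the inner-product prediction is:

```latex
\hat{r}_{ui} = x_u^{\top} y_i
```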
The more involved part is parameter estimation for $x$ and $y$. Many of the recent works, applied to explicit feedback datasets, suggested modeling directly only the observed ratings, while avoiding overfitting through an adequate regularized model, such as:
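A standard form of such a regularized model (writing $\kappa$ for the set of observed $(u, i)$ pairs) is:

```latex
\min_{x,\, y} \sum_{(u,i) \in \kappa} \left( r_{ui} - x_u^{\top} y_i \right)^2
            + \lambda \left( \lVert x_u \rVert^2 + \lVert y_i \rVert^2 \right)
```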
where the last term is the regularization term using the square of the L2 norm of the two parameters we are interested in optimizing: $x$ and $y$.
To estimate the parameters $x$ and $y$ we minimize the regularized squared error cost function above. This can be solved with Alternating Least Squares (ALS) or Stochastic Gradient Descent (SGD). Below are the partial-derivative updates for SGD:
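The standard SGD updates, stepping each parameter opposite its gradient, are:

```latex
x_u \leftarrow x_u + \eta \left( e_{ui}\, y_i - \lambda\, x_u \right)
\qquad
y_i \leftarrow y_i + \eta \left( e_{ui}\, x_u - \lambda\, y_i \right)
```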
where $e_{ui} = r_{ui} - \hat{r}_{ui}$, $\eta$ is the learning rate, and $\lambda$ is the regularization coefficient. These steps are performed over all the ratings of the trainset and repeated n_epochs times. Baselines are initialized to 0. User and item factors are initialized to 0.1, as recommended by Funk.
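As a minimal sketch (not a production implementation), the SGD loop above on made-up toy ratings might look like the following; the 0.1 initialization gets a little noise added to break symmetry between the factors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy explicit-feedback triples (user, item, rating) -- made-up data.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

n_users, n_items, f = 3, 3, 2           # f = number of latent factors
eta, lam, n_epochs = 0.05, 0.02, 2000   # learning rate, regularization, epochs

# Factors initialized around 0.1 (per Funk); small noise breaks symmetry.
x = 0.1 + 0.05 * rng.standard_normal((n_users, f))  # user factors
y = 0.1 + 0.05 * rng.standard_normal((n_items, f))  # item factors

for _ in range(n_epochs):
    for u, i, r in ratings:
        e = r - x[u] @ y[i]             # error e_ui = r_ui - predicted r_ui
        xu = x[u].copy()                # use the old x_u for both updates
        x[u] += eta * (e * y[i] - lam * xu)
        y[i] += eta * (e * xu - lam * y[i])

r_hat = x @ y.T                         # dense reconstruction of R
```

After training, `r_hat` fills in the unobserved cells of the toy matrix with predicted ratings.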
Note: If we use ALS to solve the cost function, we can employ parallelism to speed up training. Spark implements an ALS solver natively.
Pairwise methods
In contrast to pointwise methods, pairwise methods are based on a weaker but possibly more realistic assumption: positive feedback must only be 'more preferable' than non-observed feedback. Such methods directly optimize the ranking of the feedback and are, to our knowledge, state-of-the-art for implicit feedback datasets. Rendle et al. (2009) propose a generalized Bayesian Personalized Ranking (BPR) framework and experimentally show that BPR-MF (i.e., with MF as the underlying predictor) outperforms a variety of competitive baselines. [Julian VBPR paper]
Maximum Margin Matrix Factorization (MMMF)
A pairwise MF model from Gantner et al. (2011), which is optimized for a hinge ranking loss on $x_{u,i,j}$ and trained using Stochastic Gradient Ascent (SGA).
Bayesian Personalized Ranking: BPR-MF
An optimization criterion for personalized ranking that is based on pairs of items (i.e. the user-specific order of two items). It maximizes the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimizing ROC AUC is desired.
A training set $D_S$ consists of triples of the form $(u, i, j)$, where $u$ denotes the user together with an item $i$ about which they expressed positive feedback, and a non-observed item $j$:
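Writing $I_u^{+}$ for the set of items with observed positive feedback from user $u$ (notation assumed here), this is:

```latex
D_S = \left\{ (u, i, j) \;\middle|\; i \in I_u^{+} \;\wedge\; j \in I \setminus I_u^{+} \right\}
```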
This dataset takes a different approach from the pointwise methods above, which are based on a user/item matrix. Here, for each user, we build an item/item matrix which shows how user $u$ preferred item $i$ over item $j$. See the BPR cost function:
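In its standard form, the criterion to maximize is:

```latex
\text{BPR-OPT} = \sum_{(u,i,j) \in D_S} \ln \sigma\!\left( \hat{r}_{uij} \right)
                 - \lambda_{\Theta} \lVert \Theta \rVert^2
```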
where $\sigma$ is the sigmoid function, $\Theta$ denotes the model parameters ($x$ and $y$), and $\hat{r}_{uij}$ is:
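That is, the difference between the predicted scores of the positive and the non-observed items:

```latex
\hat{r}_{uij} = \hat{r}_{ui} - \hat{r}_{uj} = x_u^{\top} y_i - x_u^{\top} y_j
```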
The cost function above has an interesting curve: it is maximized rather than minimized, so instead of descending to a minimum we must ascend to a maximum. And because summing over all triples in $D_S$ is far too expensive, we use Stochastic Gradient Ascent (SGA) on sampled triples rather than batch gradient ascent:
Figure 1: The cost function which is maximized at around $\theta=2$
Thus, in order to maximize the cost function we will have to do gradient ascent.
First a triple $(u, i, j)$ is sampled from $D_S$ and then the learning algorithm updates parameters in the following fashion:
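The gradient-ascent step for each sampled triple has the standard BPR form:

```latex
\Theta \leftarrow \Theta + \eta \left( \frac{e^{-\hat{r}_{uij}}}{1 + e^{-\hat{r}_{uij}}}
        \cdot \frac{\partial \hat{r}_{uij}}{\partial \Theta} - \lambda_{\Theta}\, \Theta \right)
```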
where $\eta$ is the learning rate and the partial derivatives for the parameters $\Theta$ are:
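For the MF predictor, differentiating $\hat{r}_{uij} = x_u^{\top} y_i - x_u^{\top} y_j$ gives:

```latex
\frac{\partial \hat{r}_{uij}}{\partial \Theta} =
\begin{cases}
y_i - y_j & \text{if } \Theta = x_u \\
x_u       & \text{if } \Theta = y_i \\
-x_u      & \text{if } \Theta = y_j \\
0         & \text{otherwise}
\end{cases}
```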
With the cost function maximized, we reconstruct $\hat{R}$, the approximation of $R$, and make predictions via the inner product $\hat{r}_{ui} = x_u^{\top} y_i$.
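A minimal sketch of the BPR-MF training loop, on made-up implicit feedback (the data, sampling scheme, and hyperparameters here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy implicit feedback (hypothetical): items each user interacted with.
positives = {0: {0, 1}, 1: {1, 2}, 2: {0, 3}}
n_users, n_items, f = 3, 4, 2
eta, lam, n_steps = 0.05, 0.01, 8000   # learning rate, regularization, SGA steps

x = rng.normal(0, 0.1, (n_users, f))   # user factors x_u
y = rng.normal(0, 0.1, (n_items, f))   # item factors y_i

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(n_steps):
    # Sample a triple (u, i, j): i is observed, j is non-observed for u.
    u = int(rng.integers(n_users))
    i = int(rng.choice(list(positives[u])))
    j = int(rng.integers(n_items))
    while j in positives[u]:
        j = int(rng.integers(n_items))
    r_uij = x[u] @ (y[i] - y[j])       # r_uij = r_ui - r_uj
    g = sigmoid(-r_uij)                # d/d(r_uij) of ln sigma(r_uij)
    xu = x[u].copy()                   # keep old x_u for the item updates
    x[u] += eta * (g * (y[i] - y[j]) - lam * xu)
    y[i] += eta * (g * xu - lam * y[i])
    y[j] += eta * (-g * xu - lam * y[j])

scores = x @ y.T                       # rank items for each user by score
```

Per-user rankings come from sorting each row of `scores`; positive items should end up scored above non-observed ones.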
Appendix
Datasets
Data gathered for recommendation systems is in the form of either explicit or implicit feedback. Explicit datasets are the typical situation, while implicit datasets require some extra consideration when modeling.
Explicit Feedback
Explicit feedback datasets are a collection of $(u,i,r)$ triples where $u$ is a user, $i$ is an item and $r$ is some type of quantifiable ranking/rating. A common example is star ratings from Netflix, MovieLens etc.
MovieLens sparse matrix, e.g.: 133,960 × 431,826 = 57.8 billion positions, where only 100M are non-zero!
Implicit Feedback
Unlike explicit feedback, the $(u,i,r)$ implicit feedback triples are not directly quantifiable, but rather indirect actions that imply some degree of preference, e.g.:
- watched an episode of a show → loosely implies they like it
- reviewed an item on Amazon → implies they bought it, which implies they like it
- clicked a link → can imply something…
k Parameter
We can choose how many factors to include in the model by tuning the hyperparameter $k$. $k$ specifies how many eigenvectors/components we keep in $x$ and $y$. As $k$ decreases, the quality of the approximation of $R$ decreases, and with it the quality of the missing-rating predictions. Tuning $k$ is a trade-off between dimensionality and accuracy, which is also connected to the computational time and memory needed to train the model and run predictions.
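The trade-off can be seen with a truncated SVD on a small dense toy matrix (made-up data, standing in for a real sparse $R$): as $k$ grows, the rank-$k$ approximation error shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small dense toy "ratings" matrix; a real R is huge and sparse.
R = rng.integers(1, 6, size=(6, 5)).astype(float)

# Full SVD; we then keep only the top-k components for each k.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

def rank_k_approx(k):
    # Reconstruct R from the k largest singular values/vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Frobenius-norm approximation error for each k: shrinks as k grows.
errors = [np.linalg.norm(R - rank_k_approx(k)) for k in range(1, 6)]
```

At $k = \min(\text{rows}, \text{cols})$ the reconstruction is exact; smaller $k$ trades accuracy for a cheaper, lower-dimensional model.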
Permalink: recommender-systems