# Recommender Systems: Theory

Alex Egg,

This is a survey of modern recommender systems, particularly looking at the details of Matrix Factorization methods: Weighted-Regulated MF and Bayesian Personalized Ranking with respect to explicit and implicit feedback datasets.

## Ranking Function

The ranking function $r_{ui}$ gives the ranking score of user $u$ for item $i$ where $R$ is the sparse matrix of user/item interactions.

The objective is to approximate a ranking function $\hat{r}_{ui}$ which is the unobserved ranking by user $u$ and for item $i$. How do we find a function $\hat{r}$ that approximates r?

There are two methods two popular methods to learn $\hat{r}$:

• Content-based Approach
• Collaborative Filtering

This paper will exclusively focus on Collaborative Filtering techniques.

## Content-based Approach

Find other items w/ a low distance function. Finds similar items to past liked items (stuck in a bubble)

Pros: No cold-start problem.

Cons: Pigeon Hole Problem: Fails to surface orignal content

## Collaborative Filtering

Find users who liked what I liked. Then recommend what they liked (collaboration). Based upon the assumption that those who agreed in the past tend to agree in the future.

Among CF methods, there are two main categories:

• Memory Based - Neighborhood Models
• Latent Factor Models - Matrix Factorization (state-of-the-art)

### Memory Based - Neighborhood Models

Find similar users (user-based CF) or items (item-based CF) to predict missing ratings.

#### User-Based CF

1. Find k nearest neighbors $S$ (user/item matrix)
2. Generate recommendations based on items liked by k neighbors.  Problem: Memory-based. Expensive online similarity computation — does not scale.

#### Item-Based CF

Estimate rating by looking at simular items and running a weighted sum. First we must establish a simularity measure:

Using the similarity measure, we identify the k items rated by u, which are most similar to i. This set of k neighbors is denoted by S. The predicted value of $r_{ui}$ is taken as a weighted average of the ratings for neighboring items:

All item-oriented models share a disadvantage in regards to implicit feedback –they do not provide the flexibility to make a distinction between user preferences and the confidence we might have in those preferences.

#### Neighborhood Model Problems

##### Performance Implications

These methods are computationally expensive and grow w/ the number of users and items and therefore do not scale well. Similarity measure is a computational bottleneck. Time Complexity: Highly time consuming with millions of users and items in database.

The worst case complexity is $O(mn)$ (m customers and n products).

One solution is to break the neighborhood generation and production into discrete steps:

• “off-line component” / “model” — similarity computation, done earlier & stored in memory.
• “on-line component” — predication generation process

Some techniques such as clustering or K-means can help also. In other words do some initial clustering of users and only computer the similarity of users who are initially similar.

##### Sparcity Prolem
• Typically: large product sets, user ratings for a smaller percentage of them
• Example Amazon: millions of books and a user may have bought at most hundreds of books:
• the probability that two users that have bought 100s of books have a common book (in a catalog of 1 million books) is .01 (with 50 and 10 millions is 0.0002)
• Standard CF must have a number of users comparable to one tenth of the size of the product catalog. In other words if the number of users is at least 1/10 the size of the items in your catalog then Neighborhood based CF is feasible.
##### Sparcity Example: Netflix Challenge

If you represent the Netflix Challenge rating data as a user/item matrix you would get: 500k x 17k = 8.5B position matrix where only about 100M are not missing/0s!

Solution: Latent-Factor Models

### Model Based - Latent Factor Models

In contrast to the previous, memory based system, which try to generate a a prediction using the entire user/item matrix, Model based methods will take a more sophisticated approach and try to develop a model of the user.

Matrix Factorization (MF) methods relate users and items by uncovering latent dimensions such that users have similar representations to items they rate highly, and are the basis of many state-of-the-art recommendation approaches (e.g. Rendle et al. (2009), Bell, Koren, and Volinsky (2007), Bennett and Lanning (2007)).

Ratings are deeply influenced by a set of factors that are very specific to the domain (e.g. amount of action in movies, compleixty of characters). These factors are in genreal not obvious, we might be able to think of some of them but it’s hard to estimate their impact on ratings. The goal is to infer those so called latent factors from the rating data by using mathametical techniques: Singular Value Decomposition (SVD).

#### SVD/Matrix Factorization

The basic idea is that we want collapse the sparse (full of 0s) user/item matrix into something much less sparse and lower dimensional. The idea behind matrix factorization is that we can do that by decomposing my original matrix into 3 components: user/item matrix, item/item matrix and a diagonal that maps factors into themselves. There are different ways to get to this decomposition, but the idea is that we reduce the dimensionality of the item space into a lower space where I have a latent number of factors and then mapping users to those factors and items to those factors.

Instead of computing this in a closed form mathematically way, you can do it iteratively using stochastic gradient descent.

All these methods look to optimze a cost function w/ respect the parameters $x$ and $y$ which are latent factors. The differences are the cost functions and optimization techniques.

Within the category of Model-based methods there can be two categories:

• Point-wise methods: These typically deal w/ explicit feedback datasets and minimizes a cost function which relates the error (diff) between two explicit ratings.
• Pair-wise methods: These typically deal w/ implicit feedback datasets and maximizes a cost function which relates a positive and negative observations.

#### Point-wise methods

##### Weighed Regulated Matrix Factorization (WR-MF)

The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. An adaption of a SVD, which minimizes the squared loss. Their extensions are regularization to prevent overfitting and weights in the error function to increase the impact of positive feedback.

A typical model associates each user $u$ with a user-factors vector $x_u\in{R^f}$ , and each item $i$ with an item-factors vector $y_i\in{R^f}$. The prediction is done by taking an inner product, i.e.:

The more involved part is parameter estimation for $x$ and $y$. Many of the recent works, applied to explicit feedback datasets, suggested modeling directly only the observed ratings, while avoiding overfitting through an adequate regularized model, such as:

where the last term is the regularization term using the square of the L2 norm of the two parameters we are interested in optimizing: $x$ and $y$.

To estimate the parameters $x$ and $y$ we minimize the following regularized squared error cost function above. You can solve this w/ Alternating Least Squares (ALC) or Stochastic Gradient Descent (SGD). Below are partial derivatives updates for SGD:

where $e_{ui} = r_{ui} - \hat{r}_{ui}$ and $\eta$ is the learning rate and $\lambda$ is the regularization coefficient. These steps are performed over all the ratings of the trainset and repeated n_epochs times. Baselines are initialized to 0. User and item factors are initialized to 0.1, as recommended by Funk.

Note: If we use ALS to solve the cost function we can employ parallelism to speed up training. Spark implements the ALS solver natively.

#### Pair-wise methods

In contrast to point-wise methods, pairwise methods are based on a weaker but possibly more realistic assumption that positive feedback must only be ‘more preferable’ than non-observed feedback. Such methods directly optimize the ranking of the feedback and are to our knowledge state-of-the-art for implicit feedback datasets. Rendle et al.(2009) propose a generalized Bayesian Personalized Ranking (BPR) framework and experimentally show that BPR-MF (i.e., with MF as the underlying predictor) outperforms a variety of competitive baselines. [Julian VBPR paper]

##### Maximum Margin Matrix Factorization (MM-MF)

A pairwise MF model from Gantner et al.(2011), which is optimized for a hinge ranking loss on $x_{u,i,j}$ and trained using SGA.

##### Baysian Personalized Recommenadation - BRP MF

Optimization criterion for personalized ranking that is based on pairs of items (i.e. the user-specific order of two items. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.

A training set $D_S$ consists of triples of the form $(u, i, j)$, where $u$ denotes the user together with an item $i$ about which they expressed positive feedback, and a non-observed item $j$:

This dataset takes a different approach than the point-wise methods above which are based on a user/item matrix. Here, for each user, we build an item/item matrix which shows how user $u$ prefered item $i$ over item $j$. See the BPR cost function:

where $\sigma$ is the sigmoid function, $\Theta$ is the model parameters: $x$ and $y$ and $\hat{r}_{uij}$ is

The cost function above has an interesting curve, it is convex and not concave — so this will impose a requirement upon us to use Stochastic Gradient Ascent (SGA) as opposed to batch SGA: Figure 1: The cost function which is maximized at around $\theta=2$

Thus, in order to maximize the cost function we will have to do gradient ascent.

First a triple $(u, i, j)$ is sampled from $D_S$ and then the learning algorithm updates parameters in the following fashion:

where $\eta$ is the learning rate and the partial derivatives for the parameters $\Theta$ are:

With the cost function minimized, reconstruct the $R$ matrix $\hat{R}$ and make a prediction:

### Collaborative Filtering Limitations

#### Cold Start

There needs to be enough other users already in the system to find a match. new items need to get enough ratings.

New user problem: TO make accurate recommendations, the system must first learn the user’s preferences from the ratings. Several techniques proposed to address tech. Most use hybrid recommendation approach (see VBPR, Ruining, McAuley), which combines content-based and collaborative techniques. One solution to this is to recommend popular items until the user has enough ratings in the system for CF to give meaningful results.

New Item Problem: New items are added regularly to recommender system. Until the new item is rated by a substantial number of users, the recommender system is not able to recommend it.

#### Popularity Bias

Hard to recommend items to someone with unique tastes. Tends to recommend popular items (items from the tail do not get so much data)

## Appendix

### Datasets

Data gathered for recommendation systems is either in the form of explicit or implicit feedback. Explicit datasets are the typical situation while implicit datasets require some extra considering for modeling.

#### Explicit Feedback

Explicit feedback datasets are a collection of $(u,i,r)$ triples where $u$ is a user, $i$ is an item and $r$ is some type of quantifiable ranking/rating. A common example is star ratings from Netflix, MovieLens etc.

MovieLens sparse matrix, e.g.: 133,960x431,826 = 57.8 billion positions where only 100M are not 0’s!

#### Implicit Feedback

Unlike explicit feedback, the $(u,i,r)$ implicit feedback triples are not directly quantifiable, but rather indirect actions that imply some degree of preference, e.g:

• watched episode of show -> loosely implies they like it

• reviewed item on amazon -> implies they bought it which implies they like it

• clicked link -> can imply something…

#### k Parameter

We can choose how many factors to include in the model by tuning the hyper-paramter: $k$ . $k$ specifies how many eigenvectors/components we want to keep in $x$ and $y$. As $k$ decreases the quality of the approximation of $R$ will decrease in proportion to the quality of the missing rakings predictions. Tuning $k$ is a tradeoff between dimensionality and accuracy which is also connected to computational time and memory to train and run the model predictions.