RecSys 2018 Recap

Alex Egg

The annual ACM Recommender Systems Conference was held this year, 2018, in Vancouver, BC. Below are some of my notes from presented works that caught my interest.


Best Paper: Causal Embeddings (Criteo AI)

The paper is summarized well by the literature the Criteo AI booth was passing out at the conference:

You can always compute offline the ranking metrics of your recommendation models.

But is this really related to the online performance of the new model? What is the right offline metric that predicts online performance? Meet the new methods of counterfactual inference from observational data that can be used to improve our offline metrics' correlation with online performance. And of course, once these metrics become predictive of online behavior, why not directly optimize for them?

If my memory serves me, they also picked up the researchers who put out Field-Aware Factorization Machines, which won best paper a few years ago.

The paper tackles the task of increasing the desired outcome versus the organic user behavior, which is certainly a healthy way to evaluate recommenders in general.

I looked at the TensorFlow code they published, but I don’t understand it yet. They seem to do a full training routine on the LF model, and afterwards they run a causal evaluation on it in the compute_bootstraps function, which shows the gain over the baseline implicit LF model.

Implicit Point-wise CF

factorize a probability matrix instead of a binary interaction matrix.

The TF code has a Python base class implementation of a point-wise implicit CF model. They implement two losses: a binary cross entropy log loss and an MSE loss.

It takes users, items and the binary indicator. Then it tries to predict the indicator using the sigmoid of the dot product of the embeddings, $\sigma(\hat{y})$, where $\hat{y}$, the dot product, is the logit in a logistic framework. Minimize w/ binary cross entropy.

They also added an MSE loss.
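To make the point-wise setup concrete, here is a minimal pure-Python sketch of the prediction and the two losses. It is not the published TensorFlow code: the function names and the toy 2-d embeddings are made up for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(logit, y):
    """Binary cross entropy between the indicator y and sigmoid(logit)."""
    p = sigmoid(logit)
    eps = 1e-12  # guards against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def mse_loss(logit, y):
    """Squared error between the indicator y and sigmoid(logit)."""
    return (sigmoid(logit) - y) ** 2

# Hypothetical embeddings for one user and one item.
user_vec = [0.5, -0.2]
item_vec = [1.0, 0.3]
logit = dot(user_vec, item_vec)

print("BCE if interacted:", bce_loss(logit, 1))
print("MSE if interacted:", mse_loss(logit, 1))
```

In a real model both embedding tables would be trained by gradient descent on one of these losses summed over the logged (user, item, indicator) triples.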


Item embedding doesn’t contribute to the loss: $Q'$, this new item embedding reference, is read-only, AND they only read the 0-index element from the embedding.

creates a Counterfactual Loss and adds to main loss

It is added to the logistic loss above…

Why I like it: multi task learning for recommendation and explanation

user generated reviews as data (amazon).

jointly learning recommendations and explanations using a multitask approach combining matrix factorization and adversarial seq2seq.

pointwise explicit loss (MSE loss)

Interpreting User inaction in recsys

This paper was very popular w/ the audience. User inaction as feedback. I’ll have to go back and review.

Variational L2R (Salesforce)

Learning to Rank (LTR) -> optimal ranking

Variational Inference (VI) -> Optimize probability distribution

VI + LTR = optimize exploration of knowledge


Minimize ranking loss: don’t predict clicks, rank better!

Traditionally, LTR gives better rankings than click models


probability distribution instead of point estimates.

KL Div: compare distribution of data vs model reproduction

Reparameterization trick (Kingma)
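As a reminder of what the reparameterization trick and the KL term look like, here is a minimal sketch for a one-dimensional Gaussian. It assumes the usual VAE setup with a standard normal prior, which may not match the talk's exact objective, and the function names are illustrative.

```python
import math
import random

def sample_reparam(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives in eps, so z stays a differentiable
    function of mu and log_var.
    """
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) )."""
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)

rng = random.Random(0)
z = sample_reparam(mu=0.3, log_var=-1.0, rng=rng)
print("sample:", z, "KL:", kl_to_standard_normal(0.3, -1.0))
```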




IPS is popular to de-bias logged data (see BanditNet with SNIPS)

Early work. No evaluation numbers :(

Adapting Session based recommendation for features through transfer learning

all items cold, freshness is important. inventory changes daily.

An interaction is considered an impression (session)

geographically constrained (congregated search)

Listing Embeddings -> build user rep from respective listing embeddings -> dot product embedded prediction

Feature Engineering: They embed categorical variables (categorical feature embeddings)
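The listing-embedding pipeline noted above can be sketched in a few lines. This assumes the user representation is a plain average of the listings seen in the session, which is my guess at the aggregation; the embeddings and listing ids are made up.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_candidates(session_listings, candidates, listing_emb):
    """Score candidates by dot product against the averaged session embedding."""
    user_vec = mean_vector([listing_emb[l] for l in session_listings])
    return sorted(candidates, key=lambda c: dot(user_vec, listing_emb[c]), reverse=True)

# Hypothetical 2-d listing embeddings.
listing_emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [-1.0, 0.0]}
print(rank_candidates(["a", "c"], ["b", "d"], listing_emb))
```

Because the user vector is rebuilt from listing embeddings at request time, a brand-new user needs no trained parameters of their own, which fits the all-items-cold, inventory-changes-daily setting.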

Pinterest Hybrid Search: Incorporating contextual signals in recommendation

related pins powers 40% of engagement on Pinterest

Gradient Boosted Trees

Pinterest uses an oft-not-seen graph model for recs.

Flipboard: Learning content and usage factors simultaneously to reduce clickbait

cold news items

content and collab filtering. Tagging system for news articles: 16k topics

based on user clicks, they aggregate your affinity

Hybridize the model. Side information using Topic Modeling (LDA)

(Hu, Koren, Volinsky) implicit CF + topic modeling term as regularizer

One thing I found interesting is that, from my experience, in LF models when you increase the dimensionality of the embeddings past 10D, you don’t really get THAT much more expressive power. However, they reported impressive gains of 23% after increasing from 15D to 128D. I guess it’s a product of how dense your data is…

Multi-stakeholder Recommendation w/ provider constraints

This was an interesting, very theoretical approach to modeling what I call a hierarchical search problem in recsys. For example, you have Restaurants and each Restaurant has Menu Items. You want a model to recommend Restaurants, but you also want a model that can recommend Menu Items. Instead of handling these two tasks in isolation, they propose a method to learn them simultaneously.

They pose it as some type of integer optimization problem: hard to implement and scale.


I think this was my best paper award. It is from UCSD <3 and handles a popular trend this year, sequential data, in a very clean unified model. It is an extension of last year’s TransRec that changes the model formulation to FMs, which gives the model the ability to incorporate arbitrary side information.


TransRec did not include side-info, but is the foundation for the sequential modeling. It is a metric-learning approach which predicts a user’s recs by nearest neighbor search in the embedding space, where the embeddings are learned with the L2 distance and not the dot product interaction. Recommendations are made by taking the user’s previous item embedding and adding the user’s translation embedding to it; then, at that point in the embedding space, the nearest neighbor is the next rec!
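The translation step is easy to sketch. The snippet below only shows the inference idea (previous item plus user translation, then nearest neighbor under L2); it leaves out training and any bias terms the actual model may use, and the embeddings are invented.

```python
def next_item(prev_item, user_translation, item_emb):
    """Return the item closest (squared L2) to prev_item + user_translation."""
    prev = item_emb[prev_item]
    target = [p + t for p, t in zip(prev, user_translation)]

    def sq_dist(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, target))

    # Exclude the item the user just interacted with.
    candidates = [i for i in item_emb if i != prev_item]
    return min(candidates, key=lambda i: sq_dist(item_emb[i]))

# Hypothetical 2-d item embeddings and one user's translation vector.
item_emb = {"shirt": [0.0, 0.0], "pants": [1.0, 0.0], "shoes": [1.0, 1.0]}
user_translation = [0.9, 0.1]
print(next_item("shirt", user_translation, item_emb))
```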

Exploring recs under user-controlled data filtering

Important work to understand how performance degrades when data that users have filtered out can’t be used (due, for example, to GDPR)

Item Recommendation on Monotonic Behavior Chains


Interesting Idea of considering the whole spectrum of interactions (not just orders or clicks for example).

Whole Spectrum: click -> purchase -> review -> like

how can we use this whole spectrum to model users?

Tensor Factorization: Factorize the cube of [items x actions x users]

Day 2

HOP-Rec: High-Order Proximity for Implicit Recommendation

best paper nom and one of the ONLY authors to include WARP in their baselines for pair-wise MF methods.

Combination of LF Models & Graph Models


Only discriminates shallow observations within users/item interactions

Graph Based

explore high-order proximities with graph, but unreached items will not be affected remotely


Why do you have to treat both ideas in isolation?

The idea sounded promising, however, upon looking at eval numbers WARP looks just as good without the engineering effort? Maybe it is worth it for another dataset.

Calibrated Recommendations (Netflix)

We typically optimize for accuracy over calibration. How can you ensure that your recommendations match the distribution of interests for each user?

unbalanced recommendations are a result of uncalibrated models :

The proposed method is a post-processing step to re-adjust rec scores to match the user’s interests, namely, to offset popularity bias.

They mentioned one future direction and the holy grail for this project is Reinforcement Learning.

One interesting note is that Netflix was very secretive on sharing any evaluation results for this method or deployment details (confirming if it was even deployed at all!)

The general idea is to satisfy the constraint that products recommended to you should follow a distribution that is similar to the movies that you watched in the past. So for instance, if you’ve been watching 70% action movies and 30% romantic movies, then the recommender should show you (roughly) 70% action movies and 30% romantic movies, even though it could be a better option for Netflix to deliver a different distribution.
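One way to quantify "follows a similar distribution" is a KL divergence between the genre mix of the watch history and the genre mix of the recommended list. The sketch below only computes that score and does no re-ranking. The smoothing constant `alpha` and the function names are my own assumptions, not details confirmed in the talk.

```python
import math
from collections import Counter

def genre_distribution(genres):
    """Turn a list of genre labels into a probability distribution."""
    counts = Counter(genres)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def calibration_kl(history_dist, rec_dist, alpha=0.01):
    """KL(history || recs); lower means better calibrated.

    rec_dist is smoothed toward history_dist so the score stays
    finite when the list misses a genre entirely (an assumption).
    """
    kl = 0.0
    for genre, p in history_dist.items():
        q = (1 - alpha) * rec_dist.get(genre, 0.0) + alpha * p
        kl += p * math.log(p / q)
    return kl

history = genre_distribution(["action"] * 7 + ["romance"] * 3)
balanced = genre_distribution(["action"] * 7 + ["romance"] * 3)
skewed = genre_distribution(["action"] * 10)

print("70/30 list:", calibration_kl(history, balanced))
print("all-action list:", calibration_kl(history, skewed))
```

A post-processing step could then trade this score off against the original relevance scores when assembling the final list.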

Measuring operational quality of recommendations (Zalando)

Deployment talk. Described canary/feature-flags for deployment. Described some monitoring patterns for ML systems. Focused on online systems (vs offline).

Building Recommender Systems with Strict Privacy Boundaries (Slack)

First of all I was wondering what use-case is there at all for Slack to even do recommendations?

Training one Global MF model w/o mixing user data. No data sharing or leaking across boundaries (companies). For example, to do channel recs: a user/channel (user/item) matrix where there are off-limit cells which belong to different teams (customers).

The general feeling at the conference was that this was solid engineering work, in that they were able to do some clever feature engineering which allowed them to use a simple I2I CF model.

Their motivation for training a global model, and dealing w/ the complexity of the privacy boundaries in the U/I matrix, is that team/local data is too sparse. If they can learn global trends, the data becomes more dense and CF can work.

Also, an important note they clarified is that they don’t look at message text, i.e. they don’t train on words (NLP). This is to avoid data leakage (privacy). They described other ways to personalize, such as using meta-data of the message.

Someone asked a question about “Differential Privacy”, he said they didn’t use it b/c it’s complicated.

Artwork Personalization at Netflix

This artwork optimization work was earlier described extensively on their tech blog.

Post recommendation: what artwork to show. Bandits Method. The goal is to determine an incremental effect within an unknown reward distribution. This requires a multi-armed bandit solution as traditional machine learning can’t model this effect. They confirmed this method is deployed into production (which you can easily verify by looking at the same title under two different profiles), which means Netflix has a Bandits infrastructure in place: impressive.
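For intuition, here is a toy multi-armed bandit that picks between two artworks with Thompson sampling. It is non-contextual and uses made-up click rates, so it is far simpler than a personalized production system; Thompson sampling is my choice for the sketch, not something stated in the talk.

```python
import random

class BetaBernoulliBandit:
    """Thompson sampling with a Beta(1, 1) prior per arm."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        self.successes = {arm: 1 for arm in arms}
        self.failures = {arm: 1 for arm in arms}

    def choose(self):
        # Sample a plausible click rate per arm and play the best sample.
        samples = {
            arm: self.rng.betavariate(self.successes[arm], self.failures[arm])
            for arm in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, arm, clicked):
        if clicked:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

def simulate(true_rates, rounds=2000, seed=0):
    """Run the bandit against hypothetical true click rates; return pull counts."""
    env = random.Random(seed + 1)
    bandit = BetaBernoulliBandit(list(true_rates), seed=seed)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = bandit.choose()
        bandit.update(arm, env.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls

pulls = simulate({"art_a": 0.05, "art_b": 0.5})
print(pulls)
```

The bandit keeps learning from every impression, so it shifts traffic toward the better image without a separate offline training cycle.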

There are five aspects that classic matrix factorization is unable to handle:

New methods must enable continuous and fast learning, like multi-armed bandits.

He mentioned it’s mostly as good as just choosing the most popular image. This kinda reminds me of the work Expedia did in learning a model to choose the best image for a hotel listing (however, that was a much simpler, supervised and non-personalized approach).

Their online test on 125 million users showed that the artwork optimization is most beneficial for less known titles.

Robustness of Top-k rec

Analysis of IR metrics in the recsys space. Argument is a lot of these metrics are not robust in the recsys setting. Gives his stamp of approval for NDCG and [email protected]
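For reference, a minimal NDCG@k sketch. It assumes the ideal ranking is the same list sorted by relevance (i.e. every relevant item appears somewhere in the list) and uses the linear-gain form of DCG.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0], 3))  # relevant item ranked first
print(ndcg_at_k([0, 0, 1], 3))  # relevant item ranked third
```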

I want to review this work again, I think I missed a lot.

StreamingRec: A framework for benchmarking stream-based news recommenders

Software package for datasets w/ strong recency bias

He used an argument against BPR that you have to retrain it from scratch when there are updates. I think that’s weak, as you can certainly do incremental training: just deserialize the model from disk and train on the new items starting from the previous state of the model, et voilà, stochastic GD!
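To illustrate the incremental-training point, here is a sketch of a single BPR stochastic gradient step that updates existing embeddings in place. The learning rate, regularization and toy vectors are assumptions; loading the previous model state from disk is left out.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bpr_sgd_step(user_vec, pos_vec, neg_vec, lr=0.05, reg=0.01):
    """One SGD step on -log(sigmoid(u.pos - u.neg)) plus L2 regularization.

    The three lists are modified in place, so training can resume
    from whatever state the vectors were in.
    """
    x = dot(user_vec, pos_vec) - dot(user_vec, neg_vec)
    g = 1.0 / (1.0 + math.exp(x))  # sigmoid(-x)
    for f in range(len(user_vec)):
        u_f, p_f, n_f = user_vec[f], pos_vec[f], neg_vec[f]
        user_vec[f] += lr * (g * (p_f - n_f) - reg * u_f)
        pos_vec[f] += lr * (g * u_f - reg * p_f)
        neg_vec[f] += lr * (-g * u_f - reg * n_f)

# Pretend these came from a previously trained model.
user_vec, pos_vec, neg_vec = [0.1, 0.1], [0.1, 0.0], [0.0, 0.1]
margin_before = dot(user_vec, pos_vec) - dot(user_vec, neg_vec)
bpr_sgd_step(user_vec, pos_vec, neg_vec)
margin_after = dot(user_vec, pos_vec) - dot(user_vec, neg_vec)
print(margin_before, "->", margin_after)
```

A new item would just get a freshly initialized vector and then receive the same kind of updates as interactions with it arrive.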

Spectral CF

Latent Factor Model

First model your interaction data as graph: Bi-partite user/item interactions

transform/project graph into Spectral Domain (2d) using Fourier transform. (This confused me b/c typically when you see the FFT it puts time domain into frequency domain, but I guess the original signal was not in time domain)

Then, 2D convolve in spectral (spatial) domain to learn LF embeddings

Can optimize a BPR objective (bonus points b/c I feel like most people are not using pair-wise methods)

Reported results beat the NCF, ALS and BPR baselines

They also get extra bonus points for using Neural Collaborative Filtering (NCF) as a baseline even though it came out hardly a year ago.

Given that they present such solid baselines, I want to look at it a bit more, especially from the engineering side.

Categorical-Attributes-Based Item Classification for Recommender Systems

In this paper, the author addresses the recommendation-as-classification approach that uses an approximated softmax. I think he makes some argument that even an approximated softmax (negative sampling) is too slow. Their proposition is something they unfortunately call Hierarchical Softmax. Unfortunate, b/c that’s already a thing. In fact it’s a well-established softmax approximation in literature and in industry, so even though their method may deserve that name, I’m perplexed why they would go with it…

Anyways, they are making negative sampling faster by some grouping argument: a topic softmax instead of item softmax. If predicting a single item is hard, what about predicting a group of items first, then an item?
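The group-then-item idea can be sketched as a two-level softmax where P(item) = P(group) * P(item | group). This is a generic illustration with made-up scores and group names, not the paper's exact formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a dict of raw scores."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def two_level_prob(group_scores, item_scores_by_group):
    """P(item) = P(group) * P(item | group) for every item."""
    p_group = softmax(group_scores)
    probs = {}
    for group, item_scores in item_scores_by_group.items():
        for item, p in softmax(item_scores).items():
            probs[item] = p_group[group] * p
    return probs

# Hypothetical scores from some model.
group_scores = {"shoes": 1.0, "shirts": 0.0}
item_scores_by_group = {
    "shoes": {"sneaker": 2.0, "boot": 0.0},
    "shirts": {"tee": 0.5, "polo": 0.5},
}
probs = two_level_prob(group_scores, item_scores_by_group)
print(probs)
```

Scoring one item only needs a softmax over the groups plus a softmax over that item's own group, instead of one softmax over the whole catalog.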

I’ll have to revisit this paper, b/c I think it had more merits than just a computational boost. It seemed relatively popular w/ the audience.

Eliciting Pairwise Preferences in Recommender Systems

Pairwise study. Similar to Nawar’s work at trivago.

How algorithmic confounding in recommendation systems increases homogeneity and decreases utility

Long suspected, but my first time seeing it so well quantified – recommender algorithms homogenize human behavior significantly.



First hour was background on MF, NN, DL and CNNs. Also some background behind word2vec paradigms, particularly the adaptation into industry prod/item2vec methods where you take a sequence of user order history and learn to predict the next item using a word2vec arch. This learns an item embedding where you can recommend the next item, but it is NOT personalized. And some theory behind RNNs.
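The item2vec idea boils down to treating an order history like a sentence. A minimal sketch of generating skip-gram (target, context) training pairs from one user's sequence, with an assumed window size and invented item names:

```python
def skipgram_pairs(sequence, window=1):
    """Return (target, context) pairs for items within `window` positions."""
    pairs = []
    for i, target in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, sequence[j]))
    return pairs

order_history = ["coffee_maker", "filters", "beans"]
print(skipgram_pairs(order_history))
```

These pairs would then be fed to any word2vec-style trainer. Note that no user id appears in the pairs, which is why the resulting embedding is not personalized.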


Sequence-aware recommendation systems survey paper READ THIS


Matrix Completion > LTR


MF has a long-term bias – ignores recent history (bc it ignores recency). If your history is all shirts and pants and you only recently bought shoes, it will not recommend you shoes, b/c the shirts and pants will be weighted heavier.

Reconsider if your goal is to recommend the most similar item (movies MF) or the next compatible item (seq coffee maker on amzn)?

Session based vs session aware. Based only considers last session. Aware considers all sessions including historical.

Most existing works in sequential recommendations are in e-commerce, which is good confirmation of our approach

A good survey of the literature, although they were missing a technique from last year’s RecSys from Julian McAuley at UCSD called TransRec, which achieved session-based recommendations without RNNs.



Deep Learning from Logged Interventions (Cornell)

Thorsten Joachims, Adith Swaminathan

This was a keynote from Joachims, who has authored a lot of work on un-biasing recsys: methods like IPS, Counterfactual Inference and Bandits.

Learning from Bandit Feedback: Suggests training estimators w/ system log data rather than supervised labels a la ImageNet.

Logged contextual bandit feedback stands in contrast to the traditional labeled data used for supervised learning.

In system logs, whether from an ML system or, for example, a search engine, there is lots of implicit feedback from which to potentially learn. However, handled naively, this would be based on a false assumption that the data is unbiased. The data is biased because the user’s choice is limited and influenced by the algorithm; see Selection and Survivorship Bias. We can still, however, learn from this biased log information, or implicit feedback, by using tools from counterfactual risk minimization. This is commonly done using some type of Inverse Propensity Scoring (IPS) approach and is in fact what the authors propose. More precisely, they introduce some type of modification to a standard softmax layer in a NN, which means you can learn an empirical risk estimator instead of a standard variance-optimal estimator. They call this technique BanditNet and released some evaluations against a ResNet20 arch trained on ImageNet labels. They were able to reach the same accuracy in about 225k log observations and then even get a lower error rate than the full-information training. However, I’m not sure what concretely the logged data looks like for this ResNet test.
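To make IPS concrete, here is a small sketch of the plain IPS estimator and its self-normalized variant (SNIPS) on toy logged bandit data. The log format (context, action, reward, logging propensity) and the target policy are assumptions for illustration; this is not the BanditNet training code.

```python
def ips_estimate(logs, target_prob):
    """Average reward the target policy would get, reweighted by 1/propensity."""
    total = 0.0
    for context, action, reward, propensity in logs:
        total += reward * target_prob(context, action) / propensity
    return total / len(logs)

def snips_estimate(logs, target_prob):
    """Self-normalized IPS: divide by the sum of weights instead of n."""
    weights = [target_prob(c, a) / p for c, a, _, p in logs]
    weighted_reward = sum(w * r for w, (_, _, r, _) in zip(weights, logs))
    weight_sum = sum(weights)
    return weighted_reward / weight_sum if weight_sum > 0 else 0.0

# Toy logs: the logging policy showed "a" or "b" uniformly; only "a" gets clicks.
logs = [(None, "a", 1.0, 0.5), (None, "b", 0.0, 0.5),
        (None, "a", 1.0, 0.5), (None, "b", 0.0, 0.5)]

def always_a(context, action):
    """Hypothetical target policy that always shows 'a'."""
    return 1.0 if action == "a" else 0.0

print(ips_estimate(logs, always_a), snips_estimate(logs, always_a))
```

The reweighting corrects for the fact that the logging policy, not the user, decided which actions could be observed at all.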

A Collective Variational Auto-encoder for Top-N Recommendation with Side Information

Implicit point-wise w/ Side info

Prev work:

Argument is that most methods that incorporate side-info increase input dimensionality too much. His contribution is cVAE which doesn’t have a high-dim problem

References SLIM MF (which you don’t hear of much, I2I right?). Says it’s a user-side auto encoder… Using inspiration from this to make an item-side auto encoder and a feature-side AE.

Baselines are all against non-side-info models?

Github Source code +1

Item Recommendation with Variational Autoencoders and Heterogenous Priors

explicit feedback w/ side info

Linear LF models have limited modeling capacity. People have added side info and non-linear features, and now people are looking at NNs and even VAEs.



Using text of reviews for a user Prior (before interactions).

Seemed to suggest they use a full softmax for output (isn’t it too big/slow)????

Apparently the variational part of this VAE doesn’t contribute anything :(

This paper also wins the best graphic award, it was very visually intuitive! +1

Delayed Learning, Multi-objective Optimization, and Whole Slate Generation in Recommender Systems (DeepMind)

This keynote from Ray at DeepMind London highlighted some of their recent research directions. All of it was next-level stuff which was hard for me to grasp.

Research areas:

Slate Optimization

I didn’t know what Slate meant before this talk (although I did see it once in a Cornell Thorsten Joachims paper, but forgot the meaning), but I guess it’s the term in the literature for the whole page or display surface, meaning how to optimize recommendations for a grid page layout, i.e. not a list. Traditionally, when we look at recommendations, particularly L2R, we just think of simple search results: a list. But there are more complicated UI layouts that are not lists, so this Slate recommendation handles that. When the page layout goes beyond a simple list, e.g. a grid, the choice of placement (recommendation) explodes combinatorially.
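A quick back-of-the-envelope count shows how fast the slate space grows. The sketch assumes every slot is a distinct position and no item repeats; the catalog and slate sizes are arbitrary.

```python
def count_slates(n_items, n_slots):
    """Number of ways to fill n_slots distinct positions from n_items, no repeats."""
    total = 1
    for i in range(n_slots):
        total *= n_items - i
    return total

# Even a small catalog of 1,000 items and a 10-slot page is far too many to enumerate.
print(count_slates(1000, 10))
```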

Their work in this space is called List-CVAE (Conditional Variational Autoencoder) and is in a very new paper.

She said it’s hard to find public datasets for slate challenges, but one is from the 2015 RecSys Challenge. They ran their baselines on this.

NN Verification

There has been a good amount of activity in this space, especially around hacking NNs.

Solves the problem that after model deployment, we don’t know what recommendations it’ll make in all situations. Supervised learning systems are fragile.

3 Methods:

  1. normal test verification
  2. Adversarial test: if the model works well on an adversarial test set, deploy
  3. Model Level Verification

Long Term Value Prediction

Training over long time horizons introduces uncertainty (they observed this in the Netflix challenge too; see temporal dynamics).

Learning from delayed Signals

meaningful signals are often delayed

Factorizing predictions can mitigate the impact of delay:

They demonstrated a hypothetical book recommender where what page you’re on is the fast feedback that feeds into the second model. They then evaluated on the GitHub dataset, where commits were the fast feedback that fed into the second model.

Future Directions

Google and Netflix both see Reinforcement Learning as the holy grail of recsys.



Last edited by Alex Egg, 2018-10-09 17:17:17