RecSys 2018 Recap
The annual ACM Recommender Systems Conference was held this year, 2018, in Vancouver, BC. Below are some of my notes from presented works that caught my interest.
- ~800 Attendees. 73% of attendees at #recsys2018 are from industry. Upward trend.
- A lot of interest in sessions, meaning modeling Sequential and Ordered data.
- item2vec techniques are more popular than latent factors.
- Bandit methods are becoming popular (at least 7 papers)
- Mainstream adoption of Deep Learning recsys in industry (side effect: Deep Learning Workshop is closing). Among DL frameworks, TensorFlow is used exclusively in industry.
- The Deep Learning Workshop was almost exclusively Autoencoders (some were variational)
- BPR/WARP is still very competitive among CF solutions
- Saw trend in Metric learning (replace dot-products)
- Applications of Causal Inference seem popular (see Best Paper)
- Trend toward unbiased predictions (IPS, Counterfactual Inference)
- Big names (Google, Netflix) see Reinforcement Learning as holy grail of recsys. Criteo is invested in the space also w/ recogym.
- RecSys 2019 is in Copenhagen and Trivago is sponsoring the Challenge
Best Paper: Causal Embeddings (Criteo AI)
The paper is summarized well by the literature the Criteo AI booth was passing out at the conference:
You can always compute offline the ranking metrics of your recommendation models.
But is this really related to the online performance of the new model? What is the right offline metric that predicts online performance? Meet the new methods of counterfactual inference from observational data that can be used to improve our offline metrics' correlation with online performance. And of course, once these metrics become predictive of online behavior, why not directly optimize for them?
If my memory serves me they also picked up the researchers who put out Field-Aware Factorization Machines which won best paper a few years ago.
The paper tackles the task of increasing the desired outcome versus the organic user behavior, which is certainly a healthy way to evaluate recommenders in general.
I looked at the TensorFlow code they published: https://github.com/criteo-research/CausE but I don’t understand it yet. They seem to do a full training routine on the LF model and then afterwards they do a Causal Evaluation on it in the compute_bootstraps function, which will show the gain over the baseline implicit LF model.
Implicit Point-wise CF
factorize a probability matrix instead of a binary interaction matrix.
The TF code has a Python base class implementation of a point-wise implicit CF model. They implement two losses: a binary cross-entropy log loss and an MSE loss.
It takes users, items and the binary indicator. Then, it tries to predict the indicator using the sigmoid of the dot product of the user and item embeddings, where $\hat{y}$ (the dot product) is the logit in a logistic framework. This is minimized w/ binary cross entropy.
They also added an MSE loss in addition.
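To make the point-wise setup concrete, here is a minimal pure-Python sketch. It is not the CausE TensorFlow code: the embedding values and the helper names (`predict`, `bce_loss`, `mse_loss`) are made up for illustration, and it only shows how the logit, the sigmoid prediction and the two losses relate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(user_vec, item_vec):
    """Return (logit, probability) for one user/item pair."""
    logit = dot(user_vec, item_vec)   # y_hat: the raw score
    return logit, sigmoid(logit)      # squashed into (0, 1)

def bce_loss(y, p, eps=1e-12):
    """Binary cross entropy between the 0/1 indicator y and prediction p."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def mse_loss(y, p):
    """Squared error between the 0/1 indicator y and prediction p."""
    return (y - p) ** 2

# Hypothetical 3-d embeddings for one user and one item.
user_vec = [0.5, -0.2, 0.1]
item_vec = [0.4, 0.3, -0.6]
logit, prob = predict(user_vec, item_vec)
print(logit, prob, bce_loss(1, prob), mse_loss(1, prob))
```

In a real model both losses would be averaged over a batch of (user, item, indicator) triples and minimized with respect to the embeddings.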
Causal
Item embedding doesn’t contribute to the loss: $Q’$, this new item embedding reference, is read only AND they only read the element at index 0 of the embedding.
creates a Counterfactual Loss and adds to main loss
It is added to the logistic loss above…
Why I like it: multi task learning for recommendation and explanation
user generated reviews as data (amazon).
jointly learning recommendations and explanations using a multitask approach combining matrix factorization and adversarial seq2seq.
pointwise explicit loss (MSE loss)
Interpreting User inaction in recsys
This paper was very popular w/ the audience. User inaction as feedback. I’ll have to go back and review.
Variational L2R (Salesforce)
Learning to Rank (LTR) -> optimal ranking
Variational Inference (VI) -> Optimize probability distribution
VI+LTR = optimize exploration of knowledge
LTR
Minimize ranking loss: don’t predict clicks, rank better!
- point-wise -> pairwise -> list-wise
- List losses optimize: NDCG, MAP…
- RankNet, LambdaMART
Traditionally, LTR ranks better than click models
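Since NDCG keeps coming up as the list-wise target, here is a small sketch of how it is usually computed. It assumes the common log2 discount and linear gain, and it assumes the ranked list contains every relevant candidate (so the ideal ordering is just the same relevances sorted); the function names are mine.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k positions (rank 0 is the top)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of each shown item, in the order the model ranked them.
print(ndcg_at_k([1, 0, 0], k=3))  # relevant item ranked first
print(ndcg_at_k([0, 0, 1], k=3))  # same item ranked last scores lower
```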
VI
probability distribution instead of point estimates.
KL Div: compare distribution of data vs model reproduction
Reparameterization trick (Kingma)
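For my own reference, a minimal sketch of the two VI ingredients above for a single Gaussian latent dimension. It assumes the usual VAE setup where the KL term is taken against a standard normal prior, which may not be the exact divergence used in the talk; the function names are hypothetical.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives in eps, so z stays a differentiable function
    of mu and log_var.
    """
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one dimension."""
    return 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = random.Random(0)
z = sample_latent(mu=0.3, log_var=-1.0, rng=rng)
print(z, kl_to_standard_normal(0.3, -1.0))
```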
VLTR
Pair-wise
IPS is popular to de-bias logged data (see BanditNet with SNIPS)
Early work. No evaluation numbers :(
Adapting Session based recommendation for features through transfer learning (Realtor.com)
all items cold, freshness is important. inventory changes daily.
An interaction is considered an impression (session)
geographically constrained (congregated search)
Listing Embeddings -> build user rep from respective listing embeddings -> dot product embedded prediction
Feature Engineering: They embed categorical variables (categorical feature embeddings)
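The "listing embeddings -> user rep -> dot product" pipeline can be sketched roughly like this. The talk did not say how the user representation is pooled, so the plain average here is my assumption, and the listing names and 2-d vectors are invented.

```python
def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recommend(session, embeddings, top_k=1):
    """Rank unseen listings by dot product with the pooled session vector."""
    user_vec = mean_vector([embeddings[l] for l in session])
    candidates = [l for l in embeddings if l not in session]
    ranked = sorted(candidates, key=lambda l: dot(user_vec, embeddings[l]), reverse=True)
    return ranked[:top_k]

# Hypothetical listing embeddings.
listing_embeddings = {
    "condo_a": [1.0, 0.0],
    "condo_b": [0.9, 0.1],
    "condo_c": [0.8, 0.2],
    "farmhouse": [0.0, 1.0],
}
print(recommend(["condo_a", "condo_b"], listing_embeddings))
```

Because the user vector is built only from listing embeddings, a brand-new user gets recommendations as soon as they view a single listing.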
Pinterest Hybrid Search: Incorporating contextual signals in recommendation
related pins powers 40% of engagement on Pinterest
Gradient Boosted Trees
Pinterest uses an oft-not-seen graph model for recs.
Flipboard: Learning content and usage factors simultaneously to reduce clickbait
cold news items
content and collab filtering. Tagging system for news articles: 16k topics
based on user clicks, they aggregate your affinity
Hybridize the model. Side information using Topic Modeling (LDA)
(Hu, Koren, Volinsky) implicit CF + topic modeling term as regularizer
One thing I found interesting is that, from my experience, in LF models when you increase the dimensionality of the embeddings past 10D, you don’t really get THAT much more expressive power. However, they reported impressive gains of 23% after increasing from 15D to 128D. I guess it’s a product of how dense your data is…
Multi-stakeholder Recommendation w/ provider constraints
This was an interesting, very theoretical, approach to modeling what I call a hierarchical search problem in recsys. For example, you have Restaurants and each Restaurant has Menu Items. You want a model to recommend Restaurants, but you also want a model that can recommend Menu Items. Instead of handling these two tasks in isolation, they propose a method to learn them simultaneously.
They pose it as some type of integer optimization problem: hard to implement and scale.
TransFM
I think this was my best paper award. From UCSD <3, and it handles a popular trend this year, sequential data, in a very clean unified model. It is an extension of last year’s TransRec which changes the model formulation to FMs, giving the model the ability to incorporate arbitrary side information.
TransRec
TransRec did not include side-info, but is the foundation for the sequential modeling. It is a metric-learning approach which predicts a user’s recs by nearest neighbor search in the embedding space, where the embeddings were learned with the L2 distance and not the dot product interaction. Recommendations are made by taking the user’s previous item embedding and adding the user’s translation embedding to it; then, at that point in the embedding space, the nearest neighbor is the next rec!
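The translation idea is easy to sketch. This is a simplification under my own assumptions: the item names and 2-d vectors are invented, and any extra terms the real model may use (e.g. item biases) are left out, so treat it as the geometric intuition only.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def next_item(prev_item, user_translation, item_embeddings):
    """Translate the previous item by the user vector, then take the nearest item."""
    target = [p + t for p, t in zip(item_embeddings[prev_item], user_translation)]
    candidates = {i: v for i, v in item_embeddings.items() if i != prev_item}
    return min(candidates, key=lambda i: l2(target, candidates[i]))

# Hypothetical item embeddings and one user's translation vector.
items = {
    "tent": [0.0, 0.0],
    "sleeping_bag": [1.0, 0.1],
    "laptop": [-1.0, 2.0],
}
translation = [1.0, 0.0]
print(next_item("tent", translation, items))
```

Note the score is a distance (smaller is better), which is the "replace dot-products" metric learning trend in a nutshell.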
Exploring recs under user-controlled data filtering
Important work to understand how performance degrades when the recommender cannot use data that users have filtered out (due, for example, to GDPR).
Item Recommendation on Monotonic Behavior Chains
UCSD
Interesting Idea of considering the whole spectrum of interactions (not just orders or clicks for example).
Whole Spectrum: click -> purchase -> review -> like
how can we use this whole spectrum to model users?
Tensor Factorization: Factorize the cube of [ items x actions x users]
Day 2
HOP-Rec: High-Order Proximity for Implicit Recommendation
best paper nom and one of the ONLY authors to include WARP in their baselines for pair-wise MF methods.
Combination of LF Models & Graph Models
LF
Only discriminates shallow observations within users/item interactions
- MF
- BPR
- WARP
- kOS
Graph Based
explore high-order proximities with graph, but unreached items will not be affected remotely
- Pagerank
- itemrank
- s-step random walk dist
- popularity-based re-ranking (RP3β)
Why do you have to treat both ideas in isolation?
- LF + Graph complementary to each other
- Order aware objectives
- LF based
The idea sounded promising, however, upon looking at eval numbers WARP looks just as good without the engineering effort? Maybe it is worth it for another dataset.
Calibrated Recommendations (Netflix)
We typically optimize for accuracy over calibration. How can you ensure that your recommendations match the distribution of interests for each user?
unbalanced recommendations are a result of uncalibrated models :
- limited data
- accuracy metrics
The proposed method is a post-processing step to re-adjust rec scores to match users’ interests, namely, to offset popularity bias.
They mentioned one future direction and the holy grail for this project is Reinforcement Learning.
One interesting note is that Netflix was very secretive on sharing any evaluation results for this method or deployment details (confirming if it was even deployed at all!)
The general idea is to satisfy the constraint that products recommended to you should follow a distribution that is similar to the movies that you watched in the past. So for instance, if you’ve been watching 70% action movies and 30% romantic movies, then the recommender should show you (roughly) 70% action movies and 30% romantic movies, even though it could be a better option for Netflix to deliver a different distribution.
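One way to quantify "how far is this list from my 70/30 history" is a KL divergence between the two genre distributions. The sketch below is my own illustration, assuming one genre per movie and a small smoothing weight `alpha` so the divergence stays finite when a genre is missing from the list; it only measures miscalibration and does not do the re-ranking itself.

```python
import math
from collections import Counter

def genre_distribution(genres):
    """Turn a list of genre labels into a probability distribution."""
    counts = Counter(genres)
    total = len(genres)
    return {g: c / total for g, c in counts.items()}

def calibration_kl(p, q, alpha=0.01):
    """KL(p || q~), where q~ mixes a little of p into q to avoid division by zero."""
    kl = 0.0
    for genre, p_g in p.items():
        q_g = (1 - alpha) * q.get(genre, 0.0) + alpha * p_g
        kl += p_g * math.log(p_g / q_g)
    return kl

history = genre_distribution(["action"] * 7 + ["romance"] * 3)
calibrated = genre_distribution(["action"] * 7 + ["romance"] * 3)
all_action = genre_distribution(["action"] * 10)

print(calibration_kl(history, calibrated))  # essentially zero
print(calibration_kl(history, all_action))  # clearly larger
```

A post-processing re-ranker could then trade this penalty off against the original relevance scores when picking each slot.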
Measuring operational quality of recommendations (Zalando)
Deployment talk. Described canary/feature-flags for deployment. Described some monitoring patterns for ML systems. Focused on online systems (vs offline).
Building Recommender Systems with Strict Privacy Boundaries (Slack)
First of all I was wondering what use-case is there at all for Slack to even do recommendations?
- Channel Recommendations (to new users)
- Alerting (modeling if a user will be interested in given message)
Training one Global MF model w/o mixing user data. No data sharing or leaking across boundaries (companies). For example, to do channel recs: a user/channel (user/item) matrix where there are off-limit cells that belong to different teams (customers).
The general consensus at the conference was that this was solid engineering work, in that they were able to do some clever feature engineering which allowed them to use a simple I2I CF model.
Their motivation for training a global model, and dealing w/ the complexity of the privacy boundaries in the U/I matrix, is that team/local data is too sparse. If they can learn global trends the data becomes more dense and CF can work.
Also, an important note they clarify is that they don’t look at message text, i.e. they don’t train on words (NLP). This is to avoid data leakage (privacy). They describe other ways to personalize, such as the meta-data of the message.
Someone asked a question about “Differential Privacy”, he said they didn’t use it b/c it’s complicated.
Artwork Personalization at Netflix
This artwork optimization work was earlier described extensively on their tech blog.
Post recommendation: what artwork to show. Bandits Method. The goal is to determine an incremental effect within an unknown reward distribution. This requires a multi-armed bandit solution as traditional machine learning can’t model this effect. They confirmed this method is deployed into production (which you can easily verify by looking at the same title under two different profiles), which means Netflix has a Bandits infrastructure in place: impressive.
There are five aspects that classic matrix factorization is unable to handle:
- time sensitivity
- scarce feedback
- dynamic catalogue
- non-stationary member base
- country availability.
New methods must enable continuous and fast learning, like multi-armed bandits.
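To show what "continuous and fast learning" looks like in the simplest bandit, here is a Beta-Bernoulli Thompson sampling sketch. It is non-contextual and non-personalized, so it is definitely not Netflix's system; the artwork names and click-through rates are invented for the simulation.

```python
import random

class BetaBernoulliBandit:
    """Thompson sampling with a Beta(1, 1) prior per arm."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        self.successes = {arm: 1 for arm in arms}
        self.failures = {arm: 1 for arm in arms}

    def choose(self):
        # Sample a plausible CTR per arm and play the best sample.
        samples = {
            arm: self.rng.betavariate(self.successes[arm], self.failures[arm])
            for arm in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, arm, clicked):
        if clicked:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

def simulate(true_ctr, rounds=5000, seed=0):
    """Simulate impressions against hypothetical true CTRs; return pulls per arm."""
    rng = random.Random(seed)
    bandit = BetaBernoulliBandit(list(true_ctr), seed=seed + 1)
    pulls = {arm: 0 for arm in true_ctr}
    for _ in range(rounds):
        arm = bandit.choose()
        bandit.update(arm, rng.random() < true_ctr[arm])
        pulls[arm] += 1
    return pulls

print(simulate({"artwork_a": 0.05, "artwork_b": 0.15}))
```

Every impression updates the posterior immediately, so the model adapts without a retraining cycle, which is the property the list above is asking for.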
He mentioned it’s mostly as good as just choosing the most popular image. This kinda reminds me of the work Expedia did in learning a model to choose the best image for a hotel listing (however, that was a much simpler, supervised and non-personalized approach).
Their online test on 125 million users showed that the artwork optimization is most beneficial for less known titles.
Robustness of Top-k rec
Analysis of IR metrics in the recsys space. Argument is a lot of these metrics are not robust in the recsys setting. Gives his stamp of approval for NDCG and [email protected]
I want to review this work again, I think I missed a lot.
StreamingRec: A framework for benchmarking stream-based news recommenders
Software package for datasets w/ strong recency bias
He used an argument against BPR that you have to retrain it from scratch when there are updates. I think that’s weak, as you can certainly do incremental training: just deserialize the model from disk and train on the new item starting from the previous state of the model, et voilà, stochastic GD!
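Here is roughly what I mean by incremental training, as a sketch. The in-memory dict stands in for a model deserialized from disk, the embedding values and ids are invented, and the update is the standard BPR stochastic gradient step on a (user, positive item, negative item) triple.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bpr_step(model, user, pos_item, neg_item, lr=0.05, reg=0.01):
    """One SGD step on ln sigmoid(score(pos) - score(neg)); returns the pre-update margin."""
    p = model["users"][user]
    q_pos = model["items"][pos_item]
    q_neg = model["items"][neg_item]
    margin = dot(p, q_pos) - dot(p, q_neg)
    grad_scale = 1.0 / (1.0 + math.exp(margin))  # sigmoid(-margin)
    for f in range(len(p)):
        p_f, pos_f, neg_f = p[f], q_pos[f], q_neg[f]
        p[f] += lr * (grad_scale * (pos_f - neg_f) - reg * p_f)
        q_pos[f] += lr * (grad_scale * p_f - reg * pos_f)
        q_neg[f] += lr * (-grad_scale * p_f - reg * neg_f)
    return margin

# Pretend this state was loaded from a previous training run.
model = {"users": {"u1": [0.1, 0.2]}, "items": {"old_item": [0.3, -0.1]}}

# A new item arrives: give it a small initial vector and keep training.
model["items"]["new_item"] = [0.01, -0.02]
margin_before = bpr_step(model, "u1", "new_item", "old_item")
for _ in range(200):
    margin_after = bpr_step(model, "u1", "new_item", "old_item")
print(margin_before, margin_after)
```

Only the embeddings touched by the new interactions move, so the rest of the model carries over unchanged.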
Spectral CF
Latent Factor Model
First model your interaction data as graph: Bi-partite user/item interactions
transform/project graph into Spectral Domain (2d) using Fourier transform. (This confused me b/c typically when you see the FFT it puts time domain into frequency domain, but I guess the original signal was not in time domain)
Then, 2D convolve in spectral (spatial) domain to learn LF embeddings
Can optimize a BPR objective (bonus points b/c I feel like most people are not using pair-wise methods)
Reported Baselines beat NCF, ALS and BPR
They also get extra bonus points for using Neural Collaborative Filtering (NCF) as a baseline even though it came out barely a year ago.
Given that they present such solid baselines, I want to look at it a bit more, especially from the engineering side.
Categorical-Attributes-Based Item Classification for Recommender Systems
In this paper, the author addresses the recommendation-as-classification approach, which uses an approximated softmax. I think he makes some argument that even an approximated softmax (negative sampling) is too slow. Their proposition is something they unfortunately call Hierarchical Softmax. Unfortunate, b/c that’s already a thing. In fact it’s a well-established softmax approximation in the literature and in industry, so I’m perplexed why they would go with that name, even if their method may deserve it….
Anyways, they are making negative sampling faster by some grouping argument: a topic softmax instead of item softmax. If predicting a single item is hard, what about predicting a group of items first, then an item?
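The "group first, then item" idea can be sketched as a two-level softmax. This is my reading of the intuition, not the paper's actual model: the groups, items and scores are invented, and each item is assumed to belong to exactly one group.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a dict of label -> score."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def two_level_probs(group_scores, item_scores_by_group):
    """P(item) = P(group) * P(item | group)."""
    group_p = softmax(group_scores)
    probs = {}
    for group, item_scores in item_scores_by_group.items():
        for item, p_item in softmax(item_scores).items():
            probs[item] = group_p[group] * p_item
    return probs

group_scores = {"shoes": 1.0, "shirts": 0.0}
item_scores_by_group = {
    "shoes": {"sneaker": 0.5, "boot": 0.5},
    "shirts": {"tee": 2.0, "polo": 0.0},
}
probs = two_level_probs(group_scores, item_scores_by_group)
print(probs)
```

The saving comes from the fact that scoring one item only needs a softmax over the groups plus a softmax over that item's own group, instead of one over the full catalog.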
I’ll have to revisit this paper, b/c I think it had more merits than just a computational boost. It seemed relatively popular w/ the audience.
Eliciting Pairwise Preferences in Recommender Systems
Pairwise study. Similar to Nawar’s work at trivago.
How algorithmic confounding in recommendation systems increases homogeneity and decreases utility
Long suspected, but my first time seeing it so well quantified: recommender algorithms homogenize human behavior significantly.
Tutorials
FlipKart
First hour was background of MF, NN, DL and CNNs. Also some background behind word2vec paradigms, particularly the adaptation into industry prod/item2vec methods where you take a sequence of user order history and learn a next item using a word2vec arch. This learns an item embedding where you can recommend the next item, but it is NOT personalized. And some theory behind RNNs..
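The data prep for the prod/item2vec idea is simple enough to sketch: treat each user's order history as a "sentence" and emit skip-gram (target, context) pairs. The window size and product names below are my own choices; the resulting pairs would then feed whatever word2vec-style trainer you use.

```python
def skipgram_pairs(sequence, window=2):
    """Return (target, context) pairs for items within `window` positions of each other."""
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sequence[j]))
    return pairs

# One hypothetical user's order history, oldest first.
orders = ["phone", "phone_case", "charger"]
print(skipgram_pairs(orders, window=1))
```

Nothing in the pairs identifies the user, which is exactly why the learned item embeddings are not personalized.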
Sequential
Sequence-aware recommendation systems survey paper READ THIS
Historical:
Matrix Completion > LTR
Problems:
- ignores order of actions
- ignores taste changes (interest drift)
- ignores repeat interactions
MF has a long-term bias – ignores recent history (bc it ignores recency). If your history is all shirts and pants and you only recently bought shoes, it will not recommend you shoes, b/c the shirts and pants will be weighted heavier.
Reconsider if your goal is to recommend the most similar item (movies MF) or the next compatible item (seq coffee maker on amzn)?
Session based vs session aware. Based only considers last session. Aware considers all sessions including historical.
Most existing works in sequential recommendations are in e-commerce, which is good confirmation of our approach
A good survey of the literature, although they were missing a technique from last year’s RecSys from Julian McAuley at UCSD called TransRec, which achieves sequential recommendations with translation embeddings rather than RNNs.
Workshops
DL4R
- 4 auto-encoder talks out of 5 talks total!
- DeepMind is doing some next-level stuff that is way over my head…
- DL workshop closing b/c industry adoption is high (mission complete)
Deep Learning from Logged Interventions (Cornell)
Thorsten Joachims, Adith Swaminathan
This was a keynote from Joachims, who has authored a lot of work on un-biasing recsys. Methods like IPS, Counterfactual Inference and Bandits.
- Generative modeling w/ VAEs is on rise
- Causal Inference/Counter-factual inference
- Off-policy Risk Evaluation: IPS.
- Counterfactual Risk Minimization
- Unbiased MF: ICML16
Learning from Bandit Feedback: Suggests to train estimators w/ system log data rather than supervised labels a la ImageNet.
Logged Contextual Bandit Feedback is contrasted with the traditional labeled data used for supervised learning.
In the logs of a system, whether an ML system or, for example, a search engine, there is lots of implicit feedback from which to potentially learn. However, handled naively, this would be based on a false assumption that the data is unbiased. This is because the user’s choice is limited and influenced by the algorithm; see Selection and Survivorship Bias. We can still, however, learn from this biased log information, or implicit feedback, by using tools from counterfactual risk minimization. This is commonly done using some type of Inverse Propensity Scoring (IPS) approach and is in fact what the authors propose. More precisely, they introduce some type of modification to a standard softmax layer in a NN, which means you can learn an empirical risk estimator instead of a standard variance-optimal estimator. They call this technique BanditNet and released some evaluations against a ResNet20 arch trained on ImageNet labels. They were able to reach the same accuracy in about 225k log observations and then even get a lower error rate than the full-information training. However, I’m not sure what concretely the logged data looks like for this ResNet test.
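To make IPS less abstract, here is a tiny off-policy evaluation sketch with the plain IPS estimator and its self-normalized variant (SNIPS). This is only the estimator, not BanditNet's training procedure; the log format, the logged tuples and the "always show action a" policy are all invented for illustration.

```python
def ips_estimate(logs, new_policy_prob):
    """Average reward re-weighted by new-policy / logging-policy probability.

    logs: list of (context, action, reward, logging_propensity) tuples.
    """
    total = sum(r * new_policy_prob(x, a) / p for x, a, r, p in logs)
    return total / len(logs)

def snips_estimate(logs, new_policy_prob):
    """Same weights, but normalized by their sum instead of the log size."""
    weights = [new_policy_prob(x, a) / p for x, a, _, p in logs]
    weight_sum = sum(weights)
    if weight_sum == 0:
        return 0.0
    return sum(w * r for w, (_, _, r, _) in zip(weights, logs)) / weight_sum

# Hypothetical logged bandit feedback: the system chose `action` with
# probability `propensity` and observed a 0/1 reward.
logs = [
    ("u1", "a", 1.0, 0.5),
    ("u2", "b", 0.0, 0.5),
    ("u3", "a", 1.0, 0.25),
    ("u4", "b", 0.0, 0.75),
]

def always_a(context, action):
    """Candidate policy to evaluate: always show action 'a'."""
    return 1.0 if action == "a" else 0.0

print(ips_estimate(logs, always_a))    # 1.5
print(snips_estimate(logs, always_a))  # 1.0
```

On this toy log the plain IPS estimate lands above the maximum possible reward of 1, while the self-normalized version stays in range, which hints at why normalization is attractive on small logs.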
A Collective Variational Auto-encoder for Top-N Recommendation with Side Information
Implicit point-wise w/ Side info
Prev work:
- Denoise Auto-encoder
- CDL (Wang)
- mDA
- VA
- cfVAE (Li and She) (SOTA)
Argument is that most methods that incorporate side-info increase input dimensionality too much. His contribution is cVAE which doesn’t have a high-dim problem
References SLIM MF (which you don’t hear of much, I2I right?). Says it’s a user-side auto encoder… Using inspiration from this to make an item-side auto encoder and a feature-side AE.
Baselines are all against non-side-info models?
Github Source code +1
Item Recommendation with Variational Autoencoders and Heterogenous Priors
explicit feedback w/ side info
Linear LF models have limited modeling capacity. People have added side info and non-linear features, and now people are looking at NNs and even VAEs.
VAEs:
- have larger modeling capacity
- generalize linear latent factor models
Using text of reviews for a user Prior (before interactions).
Seemed to suggest they use a full softmax for output (isn’t it too big/slow)????
Apparently the variational part of this VAE doesn’t contribute anything :(
This paper also wins the best graphic award, it was very visually intuitive! +1
Delayed Learning, Multi-objective Optimization, and Whole Slate Generation in Recommender Systems (DeepMind)
This keynote from Ray at DeepMind London highlighted some of their recent research directions. All of it was next-level stuff which was hard for me to grasp.
Research areas:
- Slate Optimization
- Verification
- Long term value pred
- learning from delayed signals
Slate Optimization
I didn’t know what Slate meant before this talk (although I did see it once in a Cornell Thorsten Joachims paper, but forgot the meaning), but I guess it’s the term in the literature for the whole page or display surface, meaning how to optimize recommendations for a grid page layout, i.e. not a list. Traditionally, when we look at recommendations, particularly L2R, we just think of simple search results: a list. But there are more complicated UI layouts that are not lists, so this Slate recommendation handles that. When the page layout goes beyond a simple list, e.g. a grid, then the choice of placement (recommendation) explodes combinatorially.
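A quick back-of-the-envelope for that explosion, assuming a slate is an ordered placement of distinct items into a fixed number of slots (my simplification; the catalog size and slot count are arbitrary):

```python
import math

def ordered_slates(catalog_size, slots):
    """Number of ways to fill `slots` positions with distinct items, order mattering."""
    return math.perm(catalog_size, slots)

print(ordered_slates(100, 1))   # picking a single item
print(ordered_slates(100, 10))  # filling a 10-slot page from the same catalog
```

Even a tiny 100-item catalog yields more 10-slot slates than you could ever score one by one, which is why the slate has to be generated rather than enumerated.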
Their work in this space is called List-CVAE (a Conditional Variational Autoencoder), which I believe is this very new paper: https://arxiv.org/abs/1803.01682
She said it’s hard to find public datasets for slate challenges, but one is from the 2015 RecSys Challenge (http://2015.recsyschallenge.com). They ran their baselines on this.
NN Verification
There has been a good amount of activity in this space, especially around hacking NNs.
Solves the problem that after model deployment, we don’t know what recommendations it’ll make in all situations. Supervised learning systems are fragile.
3 Methods:
- normal test verification
- Adversarial test: if the model works well on an adversarial test set, deploy
- Model Level Verification
Long Term Value Prediction
Training over long time horizons introduces uncertainty (they observed this in the Netflix challenge too; see temporal dynamics).
Learning from delayed Signals
meaningful signals are often delayed
Factorizing predictions can mitigate the impact of delay:
- create 2 models
- exploit information before the label arrives
- One model updates quickly but depends on the specific item
- Second model updates slowly but generalizes
They demonstrated a hypothetical book recommender where what page you’re on is the fast feedback that feeds into the second model. They then evaluated on the GitHub dataset, where commits were the fast feedback that fed into the second model.
Future Directions
- ML Fairness in Recsys
- Simulating rec systems
- List-CVAE as a policy network in RL environments
Google and Netflix both see Reinforcement Learning as the holy grail of recsys.