SIGIR 2019 Recap
SIGIR ‘19
The annual ACM SIG Information Retrieval Conference was held this year, 2019, in Paris, France. Below are some of my notes from presented works that caught my interest.
- BERT
- Attention
- Fusion
- China
- Nobody in industry is debiasing
- Industry is using click data
- Academia/datasets are using relevance judgements
- Deep Learning
- Industry: Amazon, Spotify, Ebay, Microsoft, Bloomberg, Netflix, Google (Apple is big sponsor, but doesn’t publish)
- UvA IR Lab is a powerhouse
LTR Tutorial
Session I: Efficiency/Effectiveness Tradeoffs
A lot of this will talk about documents, whereas GH doesn't have documents per se; we have known items. So we don't need to generate text features, we have metadata.
Background on the classic LTR setting and evaluation:
- Query/document pairs
- Relevance labels: explicit
- Evaluation: binary/graded, DCG, RBP, ERR
- ERR correlates better with user satisfaction (clicks and editorials) [CMZG09]
Argument about how we have good metrics, but we can't actually directly optimize for them, e.g. DCG. We need an approximation of the quality function that is smooth and differentiable.
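As a reminder to myself (my own sketch, not from the slides): DCG is easy to compute, but the ordering inside it comes from a sort over model scores, and that sort is the non-differentiable step.

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    """DCG: graded relevance discounted by log2 of the rank position."""
    rels = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, len(rels) + 1)
    return np.sum((2 ** rels - 1) / np.log2(ranks + 1))

def ndcg_at_k(relevances, k=10):
    """Normalize by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# The sort that puts `relevances` into score order is what breaks gradients,
# hence the smooth surrogate losses discussed below.
```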
Pointwise Algorithms
Ranking as regression task. Documents considered independently (in isolation)
Gradient Boosting Regression Trees
Pointwise approach to learn a strong ranker from weak rankers: a sum of many regression trees
Pairwise Algos
Ranknet
Binary classification between pairs of docs.
Better than pointwise algos. However, the RankNet cost is not correlated w/ DCG! [he had a nice chart of this, put here] b/c we discard rank information.
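A minimal sketch of the RankNet idea (my own, not from the tutorial): a logistic loss on the score difference of a pair, which only cares about pairwise order, not where the pair sits in the list — which is exactly why the cost ignores rank information.

```python
import numpy as np

def ranknet_pair_loss(score_i, score_j, label_ij):
    """Logistic loss on the score difference.
    label_ij = 1 if doc i should rank above doc j, else 0."""
    diff = score_i - score_j
    p_ij = 1.0 / (1.0 + np.exp(-diff))  # P(i ranked above j)
    return -(label_ij * np.log(p_ij) + (1 - label_ij) * np.log(1 - p_ij))

# The loss is the same whether the pair sits at ranks (1, 2) or (99, 100),
# which is why RankNet's cost does not track DCG.
```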
Listwise Algos
Lambda Rank
Makes RankNet sensitive to ranking. Improvement over RankNet. Multiply the RankNet loss gradient (sigmoid) by $|\Delta NDCG|$, giving $\lambda_{ij} = -\lambda_{ji}$, where $\Delta NDCG$ is the change in quality after swapping the two documents of the pair.
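A hedged sketch of the $\Delta NDCG$ weight (my own reconstruction): compute the change in NDCG if the two documents of the pair were swapped in the current ranking, and scale the pairwise gradient by it, so mistakes near the top of the list count more.

```python
import numpy as np

def delta_ndcg(relevances, scores, i, j, k=10):
    """|ΔNDCG| if documents i and j were swapped in the ranking induced by `scores`."""
    def dcg(rels):
        rels = np.asarray(rels, dtype=float)[:k]
        return np.sum((2 ** rels - 1) / np.log2(np.arange(2, len(rels) + 2)))

    order = np.argsort(-np.asarray(scores, dtype=float))  # doc indices sorted by score
    ranked = np.asarray(relevances, dtype=float)[order]
    ideal = dcg(np.sort(relevances)[::-1])
    if ideal == 0:
        return 0.0
    pos_i = int(np.where(order == i)[0][0])  # current rank position of doc i
    pos_j = int(np.where(order == j)[0][0])
    swapped = ranked.copy()
    swapped[pos_i], swapped[pos_j] = swapped[pos_j], swapped[pos_i]
    return abs(dcg(swapped) - dcg(ranked)) / ideal

# LambdaRank multiplies each RankNet pairwise gradient by this quantity.
```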
Implementations: ListMLE
Gradient Boosting Regression Trees (GBRT)
This session will focus on GBRT.
Single Stage Ranking
Argument: online feature generation is expensive, for example a language model.
Two-Stage Ranking
Stage 1 (recall-based ranking) -> (query + top k (<1000s) docs) -> Stage 2 (precision-based ranking w/ expensive features)
Multi stage ranking
Stage 1 recall > stage 2 precision > stage 3 contextual (see Yahoo research)
Cheap Ranker (fast) > Accurate Ranker > Very Accurate Ranker….
Complex: GBRT, Lambda Mart
Simple: Regression
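A toy sketch of the cascade idea above, under my own assumptions (the function names and cutoffs are made up, not from the tutorial): a cheap recall-oriented scorer feeds a smaller candidate set to an expensive precision-oriented reranker.

```python
def cascade_rank(query, corpus, cheap_score, expensive_score, k1=1000, k2=100):
    """Stage 1: recall-oriented cheap ranker; Stage 2: precision-oriented reranker."""
    stage1 = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k1]
    stage2 = sorted(stage1, key=lambda d: expensive_score(query, d), reverse=True)[:k2]
    return stage2
```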
LTR Feature Selection
Talking about algorithms to select features “Fast feature selection for learning to rank”
Going over lots of literature about optimizing GBRTs… for example, pruning the forests
CLEAVER, XCLEAVER, DART, XDART
Tree Traversal
Going into performance details of these large tree ensembles
- Vectorized Trees (VPred)
- QuickScorer
- RapidScorer
Session II: TF Ranking
TF Ranking is used in Gmail instant search. Click/no-click labels.
Google Drive's Quick Access panel uses TF Ranking (no query), with noisy click data.
Context could mean query — a more general term
Question: how do you build a dataset from the google drive example? Pairwise comparisons?
Listwise loss is like pairwise, but over all docs, and then the probability of the permutation (Plackett-Luce model)
Groupwise Scoring
Score m documents w/ a scoring function. Runs a window over scores and pools.
Tries to capture interactions between documents to generate a score for each document.
Loss functions
(Listwise) softmax loss (aka ListNet) works very well internally for Google's click datasets. It bounds NDCG.
ApproxNDCG (better than listnet)
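A rough sketch of the listwise softmax loss they described, under my own assumptions (binary click labels, at least one click per list; this is not TF Ranking code): cross-entropy between the normalized click vector and the softmax over scores.

```python
import numpy as np

def softmax_listwise_loss(scores, clicks):
    """Cross-entropy between the normalized click vector and softmax over scores."""
    scores = np.asarray(scores, dtype=float)
    clicks = np.asarray(clicks, dtype=float)
    shifted = scores - scores.max()                       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())   # log-softmax over the list
    target = clicks / clicks.sum()                        # assumes >= 1 click in the list
    return -np.sum(target * log_probs)

# e.g. softmax_listwise_loss([2.0, 0.5, -1.0], [1, 0, 0])
```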
Empirical Results
Sebastian: “BERT is a glorified BM25”
They plan to be compatible w/ Keras
Going over the arch and protobuf EIE format.
Normalization/scaling goes in the transform function
Session III: Unbiased LTR
From Amsterdam crew
Counterfactual vs online LTR
Human judges are:
- Expensive: $10k at least (for TREC)
- Unethical (privacy)
- Impossible in personalized settings
- Stationary, cannot capture future changes in relevancy
- Not necessarily aligned w/ actual user preferences (bias)
User Interactions
- Solves problems for human annotations
- Free
- User behavior is indicative of their preferences
- However, interactions give implicit feedback :(
User Interactions
- Noise:
- Users click for unexpected reasons
- Often clicks occur not because of relevance
- Often clicks do not occur despite relevance
- Bias
- Interactions are affected by factors other than relevancy
- Position bias: higher-ranked documents get more attention (top-to-bottom)
- Item selection bias: interactions are limited to the presented documents (if we dont show to user, we dont get feedback)
- Presentation bias: results that are presented differently will be treated differently. For example, visually attractive elements will get more attention b/c they are visually attractive, not more relevant.
[Showed a heat map over a Google results page w/ clear position bias and said "golden triangle"]. On Netflix it makes a Z shape (find this!!)
TODO let’s make a grub hub click heatmap (do we record x,y coords?)
Counterfactual LTR
- Counterfactual Evaluation
- Propensity-weighted LTR
- Estimating Position Bias
- Practical Considerations
Counterfactual Evaluation
How well does ranker work? Can we eval a ranker w/o showing it to users?
Definition: evaluate a new ranking function $f_\theta$ using historical interaction data (e.g. clicks) collected from a previously deployed ranking function $f_{deploy}$.
Clicks
A click $c_i$ is a biased and noisy indicator that $d_i$ is relevant.
A missing click does not necessarily indicate non-relevance (missing not at random).
Examination User Model
Assumes clicks only occur on examined documents
IPS
Divide the biased evaluation by the probability of observation (position bias).
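A minimal sketch of the idea under my own naming (not code from the tutorial): each click's contribution to the metric gets divided by the propensity of the position it was logged at, which removes the position bias in expectation.

```python
import numpy as np

def ips_click_metric(clicked_ranks, propensities):
    """Propensity-weighted sum over clicked documents.
    clicked_ranks: rank (1-based) each clicked doc lands at under the new ranker.
    propensities: P(examined) for the position each click was originally logged at."""
    weights = 1.0 / np.asarray(propensities, dtype=float)
    # A DCG-style gain per click, reweighted by inverse propensity.
    gains = 1.0 / np.log2(np.asarray(clicked_ranks, dtype=float) + 1)
    return np.sum(weights * gains)
```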
IPS LTR
Given a click log, can we learn a ranker in an unbiased fashion?
Not differentiable due to the rank function. Instead optimize a differentiable upper bound (RankSVM, Joachims 2002).
Estimating Position Bias
How to find $P(o_i = 1 \mid R, d_i)$?
RandTop-n Algorithm
Show random top-n results to users. If you agg over many sessions it evens out.
Normalize the clicks at each rank to get propensities.
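A hedged sketch of the RandTop-n estimation (my own reconstruction): because documents are shuffled uniformly over the top n positions, the average click-through rate at each position, relative to position 1, estimates the examination propensity.

```python
import numpy as np

def estimate_propensities(click_log, n):
    """click_log: iterable of (position, clicked) pairs from randomized top-n sessions."""
    clicks = np.zeros(n)
    impressions = np.zeros(n)
    for position, clicked in click_log:   # positions are 1-based
        impressions[position - 1] += 1
        clicks[position - 1] += clicked
    ctr = clicks / np.maximum(impressions, 1)
    return ctr / ctr[0]                   # propensity relative to rank 1 (assumes rank 1 got clicks)
```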
Methods to minimize user impact over RandTop-n:
- RandPair
- Interventional Sets
Jointly Learning and Estimating
Instead of learning and getting propensity scores as two separate tasks, do it together.
Trust Bias
People trust the ranker to be good
Practical Considerations
These systems have high-variance:
- Not enough training data
- Extreme position bias and very small propensities
- Large amounts of noisy clicks on documents w/ small propensities
Check for outliers (very small propensities that will take over the loss): propensity clipping, where you bound the propensity from below.
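My own one-liner sketch of clipping: bound each propensity below by a threshold before inverting, trading a little bias for much lower variance.

```python
import numpy as np

def clipped_ips_weights(propensities, tau=0.01):
    """Bound each propensity below by tau so no single click dominates the loss."""
    return 1.0 / np.clip(np.asarray(propensities, dtype=float), tau, 1.0)
```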
Online LTR
Online Evaluation
Not trying to predict an exact metric, but just trying to tell differences A over B:
- AB Testing
- Interleaving
Interleaving
A/B testing is not specific to rankings. Specific aspects of interaction in ranking can be used for more efficient comparisons. Joachims 2013; this is from the Netflix blog post (team draft). This will give a signal over the two rankers over many trials.
Faster than AB testing
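A rough sketch of team-draft interleaving as I understand it (my reconstruction, not code from the talk): the two rankers alternately "draft" their next-best unpicked document into a single interleaved list, and clicks are credited to whichever ranker contributed the clicked document.

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Returns the interleaved list and which ranker contributed each slot."""
    interleaved, team, picked = [], [], set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(picked) < len(all_docs):
        order = [("A", ranking_a), ("B", ranking_b)]
        random.shuffle(order)                       # coin flip: who drafts first this round
        for name, ranking in order:
            doc = next((d for d in ranking if d not in picked), None)
            if doc is not None:
                interleaved.append(doc)
                team.append(name)
                picked.add(doc)
    return interleaved, team

# Credit a click at slot i to team[i]; over many sessions the ranker with
# more credited clicks wins the comparison.
```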
Dueling Bandit Gradient Descent
Yue & Joachims 2009
For each iteration of the bandit process, it samples from the model parameter space and returns the results from the new (sampled) model w/ the previous model in an interleaved format. If the user chose the new sampled model, the parameters are updated in that direction.
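As I understood it, one step looks roughly like this (my own sketch; `user_prefers_new` is a hypothetical callback standing in for the interleaved user comparison):

```python
import numpy as np

def dbgd_step(theta, user_prefers_new, delta=1.0, alpha=0.1):
    """One dueling-bandit gradient descent iteration.
    user_prefers_new(current, candidate) -> True if interleaved clicks favor the candidate."""
    direction = np.random.randn(*theta.shape)
    direction /= np.linalg.norm(direction)
    candidate = theta + delta * direction     # sampled neighbour in parameter space
    if user_prefers_new(theta, candidate):
        theta = theta + alpha * direction     # move toward the winning direction
    return theta
```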
I can’t see how this is different from just normal bandits….
Multileave GD
Extension of interleaving. Instead of comparing just 2 rankers, you can compare any n rankers. Speeds up learning by doing exploration efficiently, at the expense of computation.
Pairwise Differentiable Gradient Descent
A pairwise approach can be made unbiased, while being differentiable (no sampling like dueling bandit)
Plackett Luce is another term for Softmax
Called Skip-Above "very old" :( It is biased: some preferences are more likely to be inferred than others due to position/selection bias, and it inverts rankings. Seemed to say IPS came from this and that it is bad :(
Binary Relevance Labels
Listwise Loss (softmax)
Pointwise Scoring
Evaluation? Can’t be NDCG right? Only MRR
Poster Session: Search
BERT Ranker
Fine tune pre-trained BERT for the novel ranking task.
LTR Session 1
Counterfactual Eval
Aman Agarwal
Position-based click model to estimate propensities [see pic on phone]
SVM PropDCG: unbiased SVM ranker
The main thing I like about this pic is that there is a concrete example of position propensity scores (monotonically decreasing as a function of rank).
To model or Intervene (counterfactual stuff)
Rolf Jagerman UvA
User interactions suffer from position bias
Two solutions:
- Counterfactual LTR
- Learning from historical interactions
- Online LTR
- Learns by interactively optimizing the model directly w/ the user
Counterfactual
Recorded clicks are treated as relevance labels and debiased
A great work, To Model or to Intervene: A Comparison of Counterfactual and Online Learning to Rank from User Interactions, provided a comparison of counterfactual and online methods, two common strategies to deal with bias in the field of Learning to Rank (LTR for short). Counterfactual methods learn a ranking model offline from historical data and use a re-weighting strategy to debias the interaction data. Online methods optimize and update a ranking model after every interaction, combating the bias by displaying slightly modified rankings, i.e. interventions. Counterfactual methods have the advantage of avoiding the risk of showing untested rankings; on the flip side, this means the possible rankings are limited. Online learning allows exploring novel rankings and applies the learned behavior immediately, but this also increases the risk of hurting the user experience. In practice, the choice between these two approaches has a direct impact on performance and user satisfaction. The paper provides interesting guidelines on how LTR practitioners should choose which method to apply.
Comparison of Counterfactual and Online Learning to Rank from User Interactions
Monday Orals
Domain Adaptation for Enterprise Email Search (Google)
https://arxiv.org/abs/1906.07897 From Google Intern and Rama
Personal search (no shared queries)
Softmax cross-entropy loss for ranking on a binary click dataset (e.g. Google Drive Quick Access). Was the data biased? He said it's a proxy for MRR (Michael Bendersky said something like that too yesterday).
MRR b/c they can’t do NDCG b/c binary relevance labels
Weighted MRR to account for position bias: "Position Bias Estimation for Unbiased Learning to Rank in Personal Search", Wang et al. Acknowledged the click bias issue.
So his evaluation takes into account the bias, but his learning doesn’t???
Relational Collaborative Filtering
https://arxiv.org/abs/1904.12796
Incorporating item relations improves rec quality. Metadata through some knowledge graph.
Uses attention mechanism (need to learn attention!!!)
BPR loss
Lots of questions about user cold-start — classic factorization issues.
Noise Contrastive Estimation for One-Class Collaborative Filtering
http://delivery.acm.org/10.1145/3340000/3331201/p135-wu.pdf Scott Sanner
One-class CF: one only observes positive feedback: social media likes, purchases, or clicks
How do you choose negative examples? Assume missing is negative (I don’t see my sampling paper)
Projected Linear Recommendation (PLRec)
User input vocabulary (one-hot) > bottleneck (user rep) > output is the same vocabulary. Autoencoder arch. I think this was the thing in my book from Arthi. This is similar to SLIM. Weight matrices grow linearly w/ users and items: doesn't scale.
They introduce NCE but NCE is just random sampling, so I think they introduce some type of sampling heuristic.
TODO: read this, it could be interesting.
We are aware that dealing with product popularity biases is crucial to accurately embed less popular items. The authors presented an efficient method based on de-popularization of the implicit matrix by re-weighting with respect to the noise contrastive objective (similar to word2vec), and SVD. Check out this intriguing analysis of the ability to deal with popularity bias for the state-of-the-art methods:
Compositional Coding for CF
http://delivery.acm.org/10.1145/3340000/3331206/p145-liu.pdf Liu
Sublinear inference for MF (top k dot products)
Why not just use ANN?
Very nice clustering of item embeddings
Not sure I understand this paper…. It’s not just scoring, it’s a training routine too
An efficient adaptive transfer neural net for social aware recs
http://delivery.acm.org/10.1145/3340000/3331192/p225-chen.pdf
Using user’s social information
I keep seeing “Fusion” methods
Industry Talks
Alexa, can you help me shop? (Amazon)
Alexa demo is not working on stage (awkward)
Amazon uses the QA section on product pages as a QA dataset for Alexa!
Netflix
Challenges in search on streaming services (Netflix case study)
75% of searches are lexical: text-to-text match, e.g. "Marseilles"
25% are contextual, non-lexical recommendations, e.g. "Oscar nom", "plot twist"
13% of queries are out of catalog; show similar things (like ULD Expansion)
When using behavioral data it's important to debias it from offensive things
Instant Search
It ruins the dataset, b/c now you only have partial queries. How can you detect intent from only 3 characters ("ani") instead of the full query w/o autocomplete ("animals")? How are we handling this at G????
Netflix used to power search w/ Solr. They're in the process of migrating to NN models for ranking.
Comcast
Voice challenges
Query Ghosting (Amazon)
Contextualize query auto completion
40% of queries are autocompleted. 20m requests each day. 10ms latency requirement
Some type of UX pattern which highlights the autocompleted part of the query
Autocompletion uses context (past queries) to make completions. They take the cosine distance between the current recommendation and past queries (rough sketch below).
Autocomplete/ghosting can help w/ misspellings
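A minimal sketch of what I took the context scoring to be (my interpretation, not their code): rank candidate completions by cosine similarity between their embedding and an embedding of the user's recent queries.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def rank_completions(candidate_embs, past_query_embs):
    """candidate_embs: {completion: vector}; past_query_embs: list of vectors."""
    context = np.mean(past_query_embs, axis=0)   # crude session representation
    return sorted(candidate_embs.items(),
                  key=lambda kv: cosine(kv[1], context),
                  reverse=True)
```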
Stichfix
Human in the loop recsys
Not sure what he said, the model description was too fast, but they have a blog post you can checkout
Facebook groups “vertical” search engine
Embed queries [see pic]
Clicks have no correlation w/ long-term satisfaction, so they can't use them as LTR relevance labels.
searching-for-communities-a-facebook-way
From semantic retrieval to pairwise ranking (JD.com)
A 1% boost in revenue (GMV) was $1 billion.
A lot of queries are impossible to service w/ an inverted index: "cellphone around 1000 dollars", "large bottle of coca cola", "cell phone for grandpa"
The traditional technique to fix this is query rewriting w/ hand rules :(
Deep Retrieval System
Embed query tokens > Agg > MLP [see pic]
Queries and items in the same embedding space
Click logs
Positive labels: clicks
Negative labels: random non-clicks and user feedback
TF Serving
They do the hash table trick (like we are going to do) in order to save storage.
Deep pairwise ranking
RankNet-style Siamese network
No item cold start b/c the item representation is a composition of item meta-data
Ad Search(Baidu)
ANN search
BERT
Wed Orals
Bayesian personalized feature interaction selection for FMs
http://delivery.acm.org/10.1145/3340000/3331196/p665-chen.pdf
Generic supervised learning method that factorizes parameters. Given a set of features (hashtags, comics, avengers), take the Cartesian product of those as the new feature set.
Instead of $O(d^2)$ it's $O(dk)$
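For reference, the standard factorization machine identity that turns the pairwise interaction sum from $O(d^2)$ into $O(dk)$ (my own sketch, not code from the paper):

```python
import numpy as np

def fm_interaction_term(x, V):
    """x: (d,) feature vector; V: (d, k) factor matrix.
    Computes sum_{i<j} <V_i, V_j> x_i x_j in O(d*k) instead of O(d^2)."""
    linear = V.T @ x                   # (k,)  sum_i v_if * x_i      per factor f
    squares = (V ** 2).T @ (x ** 2)    # (k,)  sum_i v_if^2 * x_i^2  per factor f
    return 0.5 * np.sum(linear ** 2 - squares)
```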
CTRec
http://delivery.acm.org/10.1145/3340000/3331199/p675-bai.pdf
Sequential recommendations: is your recommender system able to distinguish products with a recurring need (such as toilet paper) from products typically only bought once (such as a toilet seat)? If not, you risk recommending irrelevant products after a purchase (see for example this viral tweet). At Criteo, we call this problem 'post-sale recommendation'. To address it, Ting Bai and co-authors propose to model the probability a user will buy a product over time as a Hawkes process parameterized by historical product features. The weight learned for each product represents its typical time of repurchase. Check out the full paper for more details: CTRec: A Long-Short Demands Evolution Model for Continuous-Time Recommendation.
CTRec: A Long-Short Demands Evolution Model for Continuous-Time Recommendation
"Dear Amazon, I bought a toilet seat because I needed one. Necessity, not desire. I do not collect them. I am not a toilet seat addict." The Amazon rec sys keeps suggesting it.... https://twitter.com/GirlFromBlupo/status/982156453396996096
Hashing Session
Online Multi-modal Hashing with Dynamic Query-adaption
https://doi.org/10.1145/3331184.3331217
Heterogeneous data into the same space.
Hamming
Uni-modal: using an image to get images
Cross-modal: using an image to get text
Multi-modal
This is the lead I need for my text to image search. I think it’s learning2hash
Supervised Hierarchical Cross-Modal Hashing
https://dl.acm.org/citation.cfm?doid=3331184.3331229
Supervised cross-modal hashing
Common hamming space connects images and text
See hash net?
OR cross modality embedding: getting images and words/queries in same space?
Unsupervised Neural Generative Semantic Hashing
https://dl.acm.org/citation.cfm?doid=3331184.3331255
Hash documents for fast retrieval
Create data-dependent hashing, where buckets have similar data
Semantic Hashing
Generating (semantic) bit-vector representations of objects (text documents in this work), also known as hash codes. Fast similarity computations via Hamming distance (hardware implementation).
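Why hash codes are fast to compare (my own sketch): Hamming distance between two bit vectors is just an XOR plus a popcount, which hardware does essentially for free.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hash codes packed into integers; distance = number of differing bits."""
    return bin(code_a ^ code_b).count("1")

# e.g. hamming_distance(0b101101, 0b100111) == 2
```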
LTR Session 2
LTR Bias
Position-based model
Bias curves are different per query
Variance Reduction in Gradient Exploration for Online Learning to Rank
https://dl.acm.org/citation.cfm?doid=3331184.3331264
Online LTR: DBGD
- BEST PAPER *
The Best Paper Award went to Variance Reduction in Gradient Exploration for Online Learning to Rank, by Huazheng Wang, Sonwoo Kim, Eric McCord-Snook, Qingyun Wu, and Hongning Wang. In the context of online learning to rank, the approach reduces the variance of the gradient estimation by projecting the selected update direction into the space spanned by the feature vectors of the examined documents under the current query. The paper proves that this method provides an unbiased estimate of the gradient and illustrates the benefits with significant improvements over several state-of-the-art models.
Variance Reduction in Gradient Exploration for Online Learning to Rank
Query Embedding
Embed queries and documents in same space
ECom Wkshp
Modeling User Attention and Engagement
Eugene Agichtein
Boring keynote from academia
Multi-objective Relevance Ranking (Amazon)
Amazon using binary relevance labels (click/purchase) for data
Original result:
Item, purchase
Item 1, 0
Item 2, 1
Item 3, 0
NDCG is < 1
Ranked result:
Item, purchase
Item 2, 1
Item 1, 0
Item 3, 0
NDCG is 1 (perfect)
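To make the example concrete, a quick check with my own numbers (not from the talk): the original order with the purchase at rank 2 scores about 0.63, while putting the purchased item first scores a perfect 1.0.

```python
import math

def ndcg_binary(labels):
    """NDCG for binary labels with the standard log2 discount."""
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(labels, start=1))
    ideal = sum(rel / math.log2(rank + 1)
                for rank, rel in enumerate(sorted(labels, reverse=True), start=1))
    return dcg / ideal if ideal else 0.0

print(ndcg_binary([0, 1, 0]))  # original order: ~0.63
print(ndcg_binary([1, 0, 0]))  # purchased item first: 1.0
```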
Nobody is debiasing in industry.... Also, interestingly enough, they are doing NDCG on this binary click list, which is the thing I was complaining to the Google guys about.
Learning Embeddings for Product Size Recommendations (ASOS)
They embed product/sizes. This allows them to recommend sizes for customers.
Take a look at their poster to understand the arch better; inputs were like YouTube's and the output was a softmax, but I wasn't sure of the label...
Leverage Implicit Feedback for Context-aware Product Search (Amazon)
In session online ranking across search pages
Click data in search sessions. Clicks are very specific to search intent.
This reminds me of Airbnb's real-time embeddings paper.
Embedding WORDS of search queries
Embedding WORDS of item titles
Embedding Users
Binary relevance labels (again) w/ no debiasing
NB: I just need to ditch skip-above and just use simple binary relevance labels.... So many data points for this: GDrive, Airbnb, Amazon x2.
Alexa, why do people buy irrelevant products?
AMZN
AMZN is very invested in Echo voice search
Getting people to describe fashion is hard (Zalando)
Google "zalando semantic-web-technologies" to learn about their knowledge graph
Building a Broad Knowledge Graph for Products (Amazon)
Product graph vs knowledge graph
Stores KG in RDF format
One of the summaries of this conference is that I finally need to learn attention.
AMZN learned a model that can use xpaths to find info in DOM trees
SigIR Challenge WKSHP
Posters
CEDR: Contextualized Embeddings for Document Ranking
On the Effect of Low-Frequency Terms on Neural IR Models
Item recommendation by Combining Relative and Absolute Feedback
I thought this was interesting b/c of the strange loss (BPR)
Deeper Text Understanding for IR with Contextual Neural Language Modeling
Relevance Matching Baselines:
- BOW
- BM25
- SDM (Sequence Dependency Model)
- Coor-Ascent
- a state-of-the-art listwise ranker
- Conv-KNRM
- DRMM (Neural)
- DRMM is the state-of-the-art interaction-based neural ranking model [9]. It performs histogram pooling on the embedding-based translation matrix and uses the binned soft-TF as the input to a ranking neural network. The embeddings used are pre-trained via word2vec [17] because the histograms are not differentiable and prohibit end-to-end learning.
- DSMM (Neural)
Semantic Matching:
- word2vec
- Glove
Permalink: sigir-2019-recap
Tags: