SIGIR 2019 Recap

Alex Egg


The annual ACM SIG Information Retrieval Conference was held this year, 2019, in Paris, France. Below are some of my notes from presented works that caught my interest.

LTR Tutorial

Session I: Efficiency/Effectiveness Tradeoffs

A lot of this tutorial talks about documents, whereas GH doesn't have documents per se; we have known items. So we don't need to generate text features: we have metadata.

Background on the classic LTR setting and evaluation:

The argument: we have good metrics, e.g. DCG, but we can't actually optimize for them directly. We need an approximation of the quality function that is smooth and differentiable.
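For reference, the metric in question (standard formulation, added by me, not from the slides):

$$DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \qquad NDCG@k = \frac{DCG@k}{IDCG@k}$$

The position $i$ of each document comes out of a sort over model scores, which is exactly the non-smooth, non-differentiable step.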

Pointwise Algorithms

Ranking as regression task. Documents considered independently (in isolation)

Gradient Boosting Regression Trees

Pointwise approach to learn a strong ranker from weak rankers: a sum of many regression trees.
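A minimal sketch of the pointwise setup (my example, not from the tutorial): each (query, doc) feature vector is regressed onto its graded relevance label independently, and ranking is just a sort by predicted score.

```python
# Pointwise LTR as plain regression with gradient-boosted trees.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))      # (query, doc) feature vectors
y = rng.integers(0, 5, size=1000)    # graded relevance labels 0..4

model = GradientBoostingRegressor(n_estimators=100).fit(X, y)

# Rank one query's candidates by descending predicted relevance.
candidates = rng.normal(size=(20, 10))
ranking = np.argsort(-model.predict(candidates))
```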

Pairwise Algos


Binary classification between pairs of docs.

Better than pointwise algos. However, the RankNet cost is not correlated with DCG [he had a nice chart of this] because it discards rank information.
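For reference, the RankNet cost for a pair where $d_i$ is preferred over $d_j$ (standard formulation, my addition):

$$P_{ij} = \frac{1}{1 + e^{-\sigma(s_i - s_j)}}, \qquad C_{ij} = \log\left(1 + e^{-\sigma(s_i - s_j)}\right)$$

The cost depends only on the score difference, not on where the pair sits in the ranked list, which is why it can be uncorrelated with DCG.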

Listwise Algos

LambdaRank

Makes RankNet sensitive to rank position; an improvement over RankNet. Multiply the pairwise (sigmoid) gradient $\lambda_{ij}$ by $|\Delta NDCG|$, where $\Delta NDCG$ is the change in quality after swapping the two documents of the pair.
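The usual statement of the trick (standard LambdaRank formulation, reconstructing my garbled note):

$$\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma(s_i - s_j)}}\,\left|\Delta NDCG_{ij}\right|$$

where $|\Delta NDCG_{ij}|$ is how much NDCG would change if documents $i$ and $j$ swapped positions in the current ranking.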

Implementations: ListMLE

Gradient Boosting Regression Trees (GBRT)

This session will focus on GBRT.

Single-Stage Ranking

Argument: online feature generation is expensive, for example a language model.

Two-Stage Ranking

Stage 1 (recall-based ranking) -> (query + top k (<1000s) docs) -> Stage 2 (precision-based ranking w/ expensive features)

Multi stage ranking

Stage 1 recall > stage 2 precision > stage 3 contextual (see Yahoo research)

Cheap Ranker (fast) > Accurate Ranker > Very Accurate Ranker….

Complex: GBRT, LambdaMART

Simple: Regression
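A minimal sketch of this cascade (my illustration; `index.retrieve` and the stage scorers are hypothetical stand-ins):

```python
# Multi-stage ranking: each stage re-ranks the survivors of the
# previous stage with a more expensive, more accurate scorer.
def cascade(query, index, stages):
    """stages = [(scorer, cutoff), ...], ordered cheap -> expensive."""
    candidates = index.retrieve(query)   # stage 1: recall-oriented
    for scorer, cutoff in stages:
        scored = sorted(candidates,
                        key=lambda d: scorer(query, d),
                        reverse=True)
        candidates = scored[:cutoff]     # keep top-k for the next stage
    return candidates
```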

LTR Feature Selection

Talking about algorithms to select features “Fast feature selection for learning to rank”

Going over lots of literature about optimizing GBRTs… for example, pruning the forests


Tree Traversal

Going into performance details of these large tree ensembles

Session II: TF Ranking

TF ranking used in gmail instant search. Click/no-click labels

Google drive quick access panel uses TF ranking (no query) noisy click data.

“Context” is their more general term, which could mean the query.

Question: how do you build a dataset from the google drive example? Pairwise comparisons?

Listwise loss is like pairwise, but over all docs: the probability of the permutation (Plackett-Luce model).

Groupwise Scoring

Score m documents w/ scoring function. Runs a window over scores and pools.

Tries to capture interactions between documents to generate a score for each document.

Loss functions

(Listwise) softmax loss (aka ListNet) works very well internally on Google's click datasets. It bounds NDCG.
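A small numpy sketch of that listwise softmax (ListNet-style) loss, my reconstruction rather than TF Ranking's actual implementation:

```python
import numpy as np

def softmax_listwise_loss(scores, relevance):
    """Cross-entropy between the softmax over scores and the
    normalized relevance distribution of one query's list."""
    p_scores = np.exp(scores - scores.max())
    p_scores /= p_scores.sum()
    p_rel = relevance / relevance.sum()
    return -np.sum(p_rel * np.log(p_scores + 1e-12))

# One query, 4 candidates, binary click labels.
print(softmax_listwise_loss(np.array([2.0, 0.5, 1.0, -1.0]),
                            np.array([1.0, 0.0, 1.0, 0.0])))
```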

ApproxNDCG (better than ListNet)

Empirical Results

Sebastian: “BERT is a glorified BM25”

They plan to be compatible w/ Keras

Going over the architecture and the protobuf EIE format.

Normalization/scaling goes in the transform function

Session III: Unbiased LTR

From Amsterdam crew

Counterfactual vs online LTR

Human judges are:

User Interactions

[Showed a heat map over a Google results page with clear position bias and called it the “golden triangle”.] On Netflix it makes a Z shape (find this!!)

TODO: let's make a Grubhub click heatmap (do we record x,y coords?)

Counterfactual LTR

Counterfactual Evaluation

How well does ranker work? Can we eval a ranker w/o showing it to users?

Definition: evaluate a new ranking function $f_\theta$ using historical interaction data (e.g. clicks) collected from a previously deployed ranking function $f_{deploy}$.


A click $c_i$ is a biased and noisy indicator that $d_i$ is relevant.

A missing click does not necessarily indicate non-relevance (missing not at random).

Examination User Model

Assumes clicks only occur on examined documents


Divide the biased evaluation by the probability of observation (position bias).
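In symbols (the standard IPS estimator from Joachims et al. 2017, not the exact slide): for a rank-based metric contribution $\lambda(\cdot)$,

$$\hat{\Delta}_{IPS}(f \mid q, c) = \sum_{d_i : c_i = 1} \frac{\lambda\big(rank(d_i \mid f)\big)}{P(o_i = 1)}$$

Clicks at positions users rarely examine get up-weighted, so the position bias cancels in expectation.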


Given a click log, can we learn a ranker in an unbiased fashion?

Not differentiable due to the rank function. Instead optimize a differentiable upper bound (RankSVM, Joachims 2002).

Estimating Position Bias

How to find $P(o_i = 1 \mid R, d_i)$?

RandTop-n Algorithm

Show randomized top-n results to users. If you aggregate over many sessions it evens out.

Normalize clicks per rank to get the propensities.

Methods to minimize user impact of RandTop-n.

Jointly Learning and Estimating

Instead of learning and getting propensity scores as two separate tasks, do it together.

Trust Bias

People trust the ranker to be good

Pratical Considerations

These systems have high variance:

Check for outliers (very small propensities that will dominate the loss); use propensity clipping, where you bound the propensity from below.
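Clipping in code (my sketch; the threshold `tau` trades a little bias for much lower variance):

```python
import numpy as np

def clipped_ips_weights(propensities, tau=0.01):
    # Bound propensities from below so 1/p cannot blow up the loss.
    return 1.0 / np.clip(propensities, tau, 1.0)
```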

Online LTR

Online Evaluation

Not trying to predict an exact metric, just trying to tell whether A is better than B:


A/B testing is not specific to rankings. Specific aspects of interaction in ranking can be used for more efficient comparisons: interleaving (Joachims 2013). This is the team-draft method from the Netflix blog post. It gives a signal between the two rankers over many trials.

Faster than AB testing

Dueling Bandit Gradient Descent

Yue & Joachims 2009

For each iteration of the bandit process, it samples from the model parameter space and shows the results of the new (sampled) model interleaved with the previous model's results. If the user chose the new sampled model, the parameters are updated in that direction.

I can’t see how this is different from just normal bandits….
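A sketch of one DBGD step as I understood it (my code; `interleave_and_ask_user` is a hypothetical stand-in for a live interleaving comparison):

```python
import numpy as np

def dbgd_step(theta, interleave_and_ask_user, delta=1.0, alpha=0.1):
    """One dueling-bandit update of a linear ranker's weights."""
    u = np.random.normal(size=theta.shape)
    u /= np.linalg.norm(u)           # random unit direction
    candidate = theta + delta * u    # perturbed (sampled) ranker
    # Show both rankers interleaved; True means the candidate won
    # the click-based comparison.
    if interleave_and_ask_user(theta, candidate):
        theta = theta + alpha * u    # step toward the winner
    return theta
```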

Multileave GD

Extension of interleaving. Instead of comparing just 2 rankers, you can compare any n rankers. Speeds up learning by doing exploration efficiently, at the expense of computation.

Pairwise Differentiable Gradient Descent

A pairwise approach can be made unbiased, while being differentiable (no sampling like dueling bandit)

Plackett Luce is another term for Softmax

He called skip-above “very old” :( it is biased. Some preferences are more likely to be inferred than others due to position/selection bias. It inverts rankings. He seemed to say IPS came from this and that it is bad :(

Binary Relevance Labels

Listwise Loss (softmax)

Pointwise Scoring

Evaluation? Can’t be NDCG right? Only MRR

Poster Session: Search

BERT Ranker


Fine-tune pre-trained BERT for the novel ranking task.

LTR Session 1

Counterfactual Eval

Aman Agarwal

Position-based click model to estimate propensities [see pic on phone]

SVM PropDCG: unbiased SVM ranker


The main thing I like about this pic is that there is a concrete example of position propensity scores (monotonically decreasing as a function of rank).

To Model or to Intervene (counterfactual stuff)

Rolf Jagerman UvA

User interactions suffer from position bias

Two solutions:


Recorded clicks are treated as relevance labels and debiased

A great work, To Model or to Intervene: A Comparison of Counterfactual and Online Learning to Rank from User Interactions, provided a comparison of counterfactual and online methods, two common strategies to deal with bias in the field of Learning to Rank (LTR for short). Counterfactual methods learn a ranking model offline from historical data and use a re-weighting strategy to debias the interaction data. Online methods optimize and update a ranking model after every interaction, combating the bias by displaying slightly modified rankings, i.e. interventions. The counterfactual methods have the advantage of avoiding the risk of showing untested rankings; on the flip side, this means the possible rankings are limited. Online learning can explore novel rankings and applies the learned behavior immediately, but this also increases the risk of hurting the user experience. In practice, the choice between these two approaches has a direct impact on performance and user satisfaction. This paper provides interesting guidelines on how LTR practitioners should choose which method to apply.



Monday Orals

Domain Adaptation for Enterprise Email Search (Google), from a Google intern and Rama

Personal search (no shared queries)

Softmax cross-entropy loss for ranking on a binary click dataset (e.g. Google Drive Quick Access). Was the data biased? He said the loss is a proxy for MRR (Michael Bendersky said something like that yesterday too).

MRR b/c they can’t do NDCG b/c binary relevance labels

Weighted MRR to account for position bias: “Position Bias Estimation for Unbiased Learning to Rank in Personal Search”, Wang et al. Acknowledged the click bias issue.

So his evaluation takes into account the bias, but his learning doesn’t???

Relational Collaborative Filtering

Incorporating item relations improves rec quality: metadata through some knowledge graph.

Uses attention mechanism (need to learn attention!!!)

BPR loss

Lots of questions about user cold-start — classic factorization issues.

Noise Contrastive Estimation for One-Class Collaborative Filtering (Scott Sanner)

One-class CF: one only observes positive feedback: social media likes, purchases, or clicks.

How do you choose negative examples? Assume missing is negative (I don’t see my sampling paper)

Projected Linear Recommendation (PLRec)

User input vocabulary (one-hot) > bottleneck (user representation) > output is the same vocabulary. Autoencoder architecture. I think this was the thing in my book from Arthi. This is similar to SLIM. Weight matrices are linear in users and items: doesn't scale.

They introduce NCE but NCE is just random sampling, so I think they introduce some type of sampling heuristic.

TODO: read this, it could be interesting.

We are aware that dealing with product popularity biases is crucial to accurately embed less popular items. The authors presented an efficient method based on de-popularization of the implicit matrix, by re-weighting with respect to the noise contrastive objective (similar to word2vec), plus SVD. Check out the paper's intriguing analysis of how well state-of-the-art methods deal with popularity bias.


Compositional Coding for CF (Liu)

Sublinear inference for MF (top k dot products)

Why not just use ANN?

Very nice clustering of item embeddings

Not sure I understand this paper…. It’s not just scoring, it’s a training routine too

An efficient adaptive transfer neural net for social aware recs

Uses the user's social information.

I keep seeing “Fusion” methods

Industry Talks

Alexa, can you help me shop? (Amazon)

Alexa demo is not working on stage (awkward)

Amazon uses the QA section on product pages as a QA dataset for Alexa!


Challenges in search on streaming services (Netflix case study)

75% of searches are lexical: text-to-text match, e.g. “Marseilles”.

25% are contextual, non-lexical recommendations, e.g. “Oscar nom”, “plot twist”.

13% of queries are out of catalog; show similar things (like ULD expansion).

When using behavioral data it's important to debias it away from offensive things.

Instant Search

It ruins the dataset, because now you only have partial queries. How can you detect intent from only 3 characters (“ani”) instead of the full query without autocomplete (“animals”)? How are we handling this at G????

Netflix used to power search with Solr. They're in the process of migrating to NN models for ranking.


Voice challenges

Query Ghosting (Amazon)

Contextualize query auto completion

40% of queries are autocompleted. 20m requests each day. 10ms latency requirement

Some type of UX pattern which highlights the autocompleted part of the query

Autocompletion uses context (past queries) to make the completion. They take the cosine distance between the current recommendation and past queries.

Autocomplete/ghosting can help w/ misspellings


Human in the loop recsys

Not sure what he said; the model description went by too fast, but they have a blog post you can check out.


Facebook groups “vertical” search engine

Embed queries [see pic]

Clicks have no correlation with long-term satisfaction, so they can't use them as LTR relevance labels.



From semantic retrieval to pairwise ranking

A 1% boost in revenue (GMV) is $1 billion.

A lot of queries are impossible to serve with an inverted index: “cellphone around 1000 dollars”, “large bottle of coca cola”, “cell phone for grandpa”.

The traditional fix is query rewriting with hand rules :(

Deep Retrieval System

Embed query tokens > aggregate > MLP [see pic]

Queries and items are in the same embedding space.

Click logs

Positive labels: clicks

Negative labels: random non-clicks and user feedback

TF Serving

They do the hash table trick, like we are going to do, in order to save storage.

Deep pairwise ranking

RankNet-style Siamese network.

No item cold-start, because the item representation is a composition of item metadata.


Ad Search (Baidu)

ANN search


Wed Orals

Bayesian personalized feature interaction selection for FMs

Generic supervised learning method that factorizes parameters. Given a set of features (hashtags: comics, avengers), take the Cartesian product of those as the new feature set.

Instead of $O(d^2)$ it's $O(dk)$.
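That complexity claim is the standard FM identity (Rendle 2010, added for reference): the pairwise interaction term

$$\sum_{i=1}^{d}\sum_{j=i+1}^{d} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{d} v_{i,f}\,x_i\right)^2 - \sum_{i=1}^{d} v_{i,f}^2\,x_i^2\right]$$

evaluates in $O(dk)$ rather than $O(d^2)$.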


Sequential Recommendations

Is your recommender system able to distinguish products with a recurring need (such as toilet paper) from products typically bought only once (such as a toilet seat)? If not, you risk recommending irrelevant products after a purchase (see for example this viral tweet). At Criteo, we call this problem ‘post-sale recommendation’. To address this problem, Ting Bai and her co-authors propose to model the probability a user will buy a product over time as a Hawkes process parameterized by historical product features. The weight learned for each product represents its typical time of repurchase. Check out the full paper for more details: CTRec: A Long-Short Demands Evolution Model for Continuous-Time Recommendation.



“Dear Amazon, I bought a toilet seat because I needed one. Necessity, not desire. I do not collect them. I am not a toilet seat addict.” The Amazon rec sys keeps suggesting it….


Hashing Session

Online Multi-modal Hashing with Dynamic Query-adaption

Heterogeneous data into the same space.


Uni-modal: using image to get image.

Cross-modal: using image to get text.


This is the lead I need for my text to image search. I think it’s learning2hash

Supervised Hierarchical Cross-Modal Hashing

Supervised cross-modal hashing

Common hamming space connects images and text

See HashNet?

Or cross-modality embedding: getting images and words/queries in the same space?

Unsupervised Neural Generative Semantic Hashing

Hash documents for fast retrieval.

Create data-dependent hashing, where buckets have similar data

Semantic Hashing

Generating (semantic) bit-vector representations of objects (text documents in this work), also known as hash codes. Similarity computation is fast via Hamming distance (hardware implementation).
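Why Hamming distance is fast (my illustration): with hash codes packed into machine words, similarity is an XOR plus a popcount, both single instructions on modern hardware.

```python
def hamming(a: int, b: int) -> int:
    # XOR marks the differing bits; popcount counts them.
    return (a ^ b).bit_count()  # int.bit_count() needs Python 3.10+

# Two 8-bit hash codes that differ in 2 positions.
print(hamming(0b10110100, 0b10010110))  # -> 2
```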

LTR Session 2

LTR Bias

Position-based model

Bias curves are different per query

Variance Reduction in Gradient Exploration for Online Learning to Rank

Online LTR: DBGD

The Best Paper Award went to Variance Reduction in Gradient Exploration for Online Learning to Rank, by Huazheng Wang, Sonwoo Kim, Eric McCord-Snook, Qingyun Wu, and Hongning Wang. In a context of online learning to rank, the approach reduces the variance of the gradient estimation by projecting the selected update direction into the space spanned by the feature vectors of the examined documents under the current query. The paper proves that this method provides an unbiased estimate of the gradient and illustrates the benefits with significant improvements over several state-of-the-art models.



Query Embedding

Embed queries and documents in the same space.

ECom Wkshp

Modeling User Attention and Engagement

Eugene Agichtein

Boring keynote from academia

Multi-objective Relevance Ranking (Amazon)

Amazon uses binary relevance labels from clicks as data:

Item, purchased
Item 1, 0
Item 2, 1
Item 3, 0
NDCG is < 1

Ranked result:

Item, purchased
Item 2, 1
Item 1, 0
Item 3, 0
NDCG is 1 (perfect)
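Working through the example above (my code; standard NDCG with binary gains, where only the position of the purchased item matters):

```python
import numpy as np

def dcg(labels):
    labels = np.asarray(labels, dtype=float)
    return np.sum((2**labels - 1) / np.log2(np.arange(2, len(labels) + 2)))

def ndcg(labels):
    return dcg(labels) / dcg(sorted(labels, reverse=True))

print(ndcg([0, 1, 0]))  # purchase at rank 2: ~0.63 (< 1)
print(ndcg([1, 0, 0]))  # purchase at rank 1: 1.0 (perfect)
```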

Nobody is debiasing in industry…. Also, interestingly enough, they are doing NDCG on this binary click list, which is the thing I was complaining to the Google guys about.

Learning Embeddings for Product Size Recommendations (ASOS)

They embed product/sizes. This allows them to recommend sizes for customers.

Take a look at their poster to understand the architecture better; the inputs were YouTube-style and the output was a softmax, but I wasn't sure of the label…

Leverage Implicit Feedback for Context-aware Product Search (Amazon)

In-session online ranking across search pages.

Click data in search sessions. Clicks are very specific to search intent.

This reminds me of Airbnb's real-time embeddings paper.

Embedding WORDS of search queries

Embedding WORDS of item titles

Embedding Users

Binary relevance labels (again) w/ no debiasing

NB: I just need to ditch skip-above and do simple binary relevance labels…. So many data points for this: GDrive, Airbnb, Amazon ×2.

Alexa, why do people buy irrelevant products?


AMZN is very invested in Echo voice search

Getting people to describe fashion is hard (Zalando)

Google “zalando semantic-web-technologies” to learn about their knowledge graph.

Building a Broad Knowledge Graph for Products (Amazon)

Product graph vs knowledge graph

Stores KG in RDF format

One of the summaries of this conference is that I finally need to learn attention.

AMZN learned a model that can use XPaths to find info in DOM trees.

SIGIR Challenge Wkshp



CEDR: Contextualized Embeddings for Document Ranking


On the Effect of Low-Frequency Terms on Neural IR Models


Item recommendation by Combining Relative and Absolute Feedback

I thought this was interesting b/c of the strange loss (BPR)


Deeper Text Understanding for IR with Contextual Neural Language Modeling

Relevance Matching Baselines:

Semantic Matching:

