Weighted Approximate-Rank Pairwise Loss (WARP)

Alex Egg,

There is no formal paper for WARP for it’s popular use-case in Matrix Factrorization in Recommender Systems, but rather an implementation detail in a classifier model in the WSABIE paper1 from Google.


They learn by $(u,j$) pairs in their training routine, where $u$ is a positive example and $j$ is a random negative. You keep sampling negatives until there is a ranking inversion between the $(I,j)$ pair. You then weight you loss by the number of samples required to cause an inversion. Maybe you can see the logic in that a model that requires many samples to cause an inversion is well trained and a model that requires a small amount of negatives samples to cause an inversion is NOT well trained.


Like I mentioned it originally came out with Weston et al. WSABIE paper for the task of image annotations,. However, it was first applied to recommenders in the 2013 k-order statistic loss paper in the form of the k-OS WARP loss.

Where $\textbf{Y} $ is the entire item set and $N$ is the number of samples to cause an inversion.


One important implementation detail of the WARP loss as is that it is known to only operate in a fully stochastic environment. This is highlighted in a short paper by Liu and Natarajan at the 2017 RecSys. 2 They formalize the findings as:

  1. High Variance
  2. Low updating Frequency
  3. Intrinsic Sequential Manner

My main issue, is number 3, which I came about when trying to implement WARP for my estimator in Keras. You quickly find that WARPs global sampler doesn’t work in the mini-batch environment.

Intrinsic sequential manner of WARP prevents full utilization of available parallel computation (e.g. GPUs), making it hard to train large or flexible models. 2


The proposed solution is to replace the sampling based procedure and offset for the batch size:

Where $Z$ is a subset of $Y$ randomly sampled (without replacement).

Compared to WARP. , WMRB replaces the sequential sampling procedure with batch computations. It results in a different rank approximation and loss function. Per user-item pair, WMRB involves more computation and is compensated with easy utilization of parallel computing. WMRB updates model parameters much more frequently than WARP – which only updates the parameters of one user-item after many sampling. 2

Adaptive Hinge Loss

I also saw this interesting approximation of WARP in the Spotlight3 Project

Where the hinge loss is:

Then adaptive hinge is:

Where $f(\textbf{j})$ is all the negative prediction scores. Then adaptive hinge is:


In most tasks, I see WARP outperforming BPR, so it is an easy drop-in replacement.

I was also suprised to see lots of references to BPR and classic ALS, but only 1 reference baseline to WARP in all of RecSys 2018.


  1. Weston, Jason, Samy Bengio, and Nicolas Usunier. “Wsabie:Scaling up to large vocabulary image annotation.” IJCAI. Vol. 11. 2011. 

  2. Liu, Kwan. WMRB: Learning to Rank in a Searchable Batch Training Approach  2 3

  3. Maciej. Spotlight. https://github.com/maciejkula/spotlight 

Permalink: weighted-approximate-rank-pairwise-loss-WARP


Last edited by Alex Egg, 2018-10-08 21:28:51
View Revision History