Weighted Approximate-Rank Pairwise Loss (WARP)
There is no formal paper for WARP in its popular use case of matrix factorization for recommender systems; it shows up instead as an implementation detail of a classifier model in the WSABIE paper [1] from Google.
WSABIE
The training routine learns from $(u, j)$ pairs, where $u$ is a positive example and $j$ is a random negative. You keep sampling negatives until there is a ranking inversion between the $(u, j)$ pair, then weight the loss based on how many samples it took to find one. The logic is that a model which requires many samples to produce an inversion is already ranking well, while a model where a violating negative turns up after only a few samples is NOT well trained.
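A minimal sketch of that sampling loop, in plain NumPy with a made-up `score(user, item)` function standing in for the model (so treat it as pseudocode-with-imports rather than the paper's actual algorithm):

```python
import numpy as np

def warp_loss_single(score, user, pos_item, n_items, margin=1.0, max_trials=100, rng=np.random):
    """One WARP update for a single (user, positive item) pair.

    `score(user, item)` is a stand-in for the model's predicted score.
    (A real implementation would also skip negatives that are actually positives.)
    """
    pos_score = score(user, pos_item)
    for trials in range(1, max_trials + 1):
        neg_item = rng.randint(n_items)           # sample a random negative
        neg_score = score(user, neg_item)
        violation = margin - pos_score + neg_score
        if violation > 0:                         # found a ranking inversion
            est_rank = (n_items - 1) // trials    # rank approximation: floor((|Y| - 1) / N)
            weight = np.sum(1.0 / np.arange(1, est_rank + 1))  # harmonic weighting L(k)
            return weight * violation             # weighted hinge loss
    return 0.0                                    # no inversion found within budget: no loss
```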
WARP
Like I mentioned, it originally came out of the Weston et al. WSABIE paper for the task of image annotation. However, it was first applied to recommenders in the 2013 k-order statistic loss paper, in the form of the k-OS WARP loss.
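Writing it in user–item terms (my notation, not the paper's: $f(u, i)$ is the model's score for user $u$ and item $i$, $i$ is the observed positive, $j$ is the sampled negative that finally violates the margin, and $\Phi$ is the rank-to-weight transform, the harmonic series in the paper), the per-pair WARP loss looks roughly like:

$$
L_{\text{WARP}} = \Phi\!\left( \left\lfloor \frac{|\textbf{Y}| - 1}{N} \right\rfloor \right) \, \big| 1 - f(u, i) + f(u, j) \big|_{+},
\qquad
\Phi(k) = \sum_{m=1}^{k} \frac{1}{m}
$$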
Where $\textbf{Y} $ is the entire item set and $N$ is the number of samples to cause an inversion.
One important implementation detail of the WARP loss is that it is known to only operate in a fully stochastic environment. This is highlighted in a short paper by Liu and Natarajan at RecSys 2017 [2]. They formalize the findings as:
- High Variance
- Low Updating Frequency
- Intrinsic Sequential Manner
My main issue is number 3, which I ran into when trying to implement WARP for my estimator in Keras. You quickly find that WARP's global sampler doesn't work in a mini-batch environment.
Intrinsic sequential manner of WARP prevents full utilization of available parallel computation (e.g. GPUs), making it hard to train large or flexible models. [2]
WMRB
The proposed solution is to replace the sequential sampling procedure with a batch computation and scale for the sample size:
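In the same notation as above (again my transcription rather than the paper's exact formula, so treat it as a sketch), the batch margin-rank estimate looks like:

$$
\overline{rank}(u, i) \;\approx\; \frac{|\textbf{Y}|}{|\textbf{Z}|} \sum_{j \in \textbf{Z}} \big| 1 - f(u, i) + f(u, j) \big|_{+}
$$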
Where $\textbf{Z}$ is a subset of $\textbf{Y}$ randomly sampled (without replacement).
Compared to WARP, WMRB replaces the sequential sampling procedure with batch computations. It results in a different rank approximation and loss function. Per user-item pair, WMRB involves more computation, which is compensated by easy utilization of parallel computing. WMRB updates model parameters much more frequently than WARP – which only updates the parameters of one user-item pair after many sampling steps. [2]
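A rough NumPy sketch of that batch computation (my own function and variable names, and the final log transform is from memory, so double-check against the paper):

```python
import numpy as np

def wmrb_rank_estimate(pos_scores, neg_scores, n_items, margin=1.0):
    """Batch margin-rank estimate from sampled negatives.

    pos_scores: shape (batch,)            score of each row's positive item
    neg_scores: shape (batch, n_sampled)  scores of the sampled negative set Z
    """
    n_sampled = neg_scores.shape[1]
    # hinge violations against every sampled negative, all at once
    violations = np.maximum(0.0, margin - pos_scores[:, None] + neg_scores)
    # scale the sampled sum back up to the full item set Y
    return (n_items / n_sampled) * violations.sum(axis=1)

def wmrb_loss(pos_scores, neg_scores, n_items):
    # the estimated rank is pushed through a concave transform
    # (log(1 + rank), as I recall) and summed over the batch
    return np.log1p(wmrb_rank_estimate(pos_scores, neg_scores, n_items)).sum()
```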
Adaptive Hinge Loss
I also saw this interesting approximation of WARP in the Spotlight project [3].
Where the hinge loss is the usual pairwise margin (same notation as before):

$$
\ell_{\text{hinge}}(u, i, j) = \big| 1 - f(u, i) + f(u, j) \big|_{+}
$$

Then the adaptive hinge samples several negatives at once and only keeps the hardest one:

$$
\ell_{\text{adaptive}}(u, i, \textbf{j}) = \Big| 1 - f(u, i) + \max_{j \in \textbf{j}} f(u, j) \Big|_{+}
$$

Where $f(\textbf{j})$ is all the negative prediction scores.
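Spotlight implements this in PyTorch; the same idea in NumPy (my own names, not Spotlight's API) is just a few lines:

```python
import numpy as np

def adaptive_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Adaptive hinge: only the highest-scoring sampled negative
    (the one most likely to violate the ranking) contributes.

    pos_scores: shape (batch,)
    neg_scores: shape (batch, n_sampled)
    """
    hardest_negative = neg_scores.max(axis=1)   # pick the hardest negative per example
    return np.maximum(0.0, margin - pos_scores + hardest_negative).mean()
```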
Conclusion
In most tasks, I see WARP outperforming BPR, so it is an easy drop-in replacement.
I was also surprised to see lots of references to BPR and classic ALS, but only one baseline reference to WARP in all of RecSys 2018.
References

1. Weston, J., Bengio, S., Usunier, N. "WSABIE: Scaling Up to Large Vocabulary Image Annotation." IJCAI 2011.
2. Liu, K., Natarajan, P. "WMRB: Learning to Rank in a Scalable Batch Training Approach." RecSys 2017.
3. Spotlight, by Maciej Kula: https://github.com/maciejkula/spotlight