Cascade Ranking Models

Given a query $q$ (context) and a set of documents $D$ (items), the goal is to order elements of $D$ such that the resulting ranked list maximizes a user satisfaction metric $Q$ (criteria).

Beyond the subjective criteria, we can take advantage of NLP to understand the query and documents, and apply big-data technology to process such large document collections.

A cascaded ranking architecture turns ranking into a pipeline of multiple stages, and has been shown to be a powerful approach to balancing efficiency and effectiveness trade-offs in large-scale search systems.

Our core idea is to consider the ranking problem as a "cascade", where ranking is broken into a finite number of distinct stages. Each stage considers successively richer and more complex features, but over successively smaller candidate document sets. The intuition is that although complex features are more time-consuming to compute, the additional overhead is offset by examining fewer documents. In other words, the cascade model views retrieval as a multi-stage progressive refinement problem, where each stage balances the cost of exploiting various features with the potential gain in terms of better results. We have explored this notion in the context of linear models and tree-based models.

This document set is often retrieved from the collection using a simple unsupervised bag-of-words method, e.g. BM25. This can potentially lead to learning a suboptimal ranking, since many relevant documents may be excluded from the initially retrieved set.

Matching and Retrieval

Our aim in the matching stage is to exclude documents irrelevant to the query $q$ from the candidate documents $D$.

From another perspective, the task is to choose $k$ documents from $n$ candidates, where $n$ is much larger than $k$.

Relevance Matching

A core problem of information retrieval (IR) is relevance matching (RM), where the goal is to rank documents by relevance to a user’s query.

Given a query and a set of candidate text documents, relevance ranking algorithms determine how relevant each text document is for the given query.

The Probabilistic Relevance Framework

The assumptions about relevance are as follows:

  1. Relevance is assumed to be a property of the document given information need only, assessable without reference to other documents; and
  2. The relevance property is assumed to be binary.

Either of these assumptions is at the least arguable.

Semantic Matching

There are fundamental differences between semantic matching and relevance matching:

Semantic matching emphasizes “meaning” correspondences by exploiting lexical information (e.g., words, phrases, entities) and compositional structures (e.g., dependency trees), while relevance matching focuses on keyword matching.

The semantic matching problem in product search seeks to retrieve all semantically relevant products given a user query. Recent studies have shown that extreme multi-label classification (XMC) models enjoy both low inference latency and high recall in real-world scenarios.

Extreme multi-label classification (XMC) is the problem of finding the relevant labels for an input, from a very large universe of possible labels.
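
To make the XMC formulation concrete, here is a minimal inference sketch, assuming a sparse linear one-vs-rest scorer over a large label universe; the dimensions, the random weight matrix, and `rank_labels` are illustrative assumptions, not from the source.

```python
import numpy as np
from scipy import sparse

# Illustrative sizes: a large label universe and sparse bag-of-words queries.
n_features, n_labels, top_k = 20_000, 100_000, 100

# Hypothetical linear one-vs-rest scorer: one sparse weight vector per label.
W = sparse.random(n_labels, n_features, density=1e-4, format="csr", random_state=0)

def rank_labels(query_vec, k=top_k):
    """Score every label for the query and return the indices of the top-k labels."""
    scores = (W @ query_vec.T).toarray().ravel()
    return np.argsort(-scores)[:k]

# Usage: a sparse query vector with two active terms.
q = sparse.csr_matrix(([1.0, 1.0], ([0, 0], [10, 4242])), shape=(1, n_features))
print(rank_labels(q)[:10])
```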

Lexical Matching

Classical information retrieval systems such as BM25 rely on exact lexical match and carry out search efficiently with an inverted index.
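
For reference, a minimal BM25 scoring sketch (not from the source); $k_1 = 1.2$ and $b = 0.75$ are the usual defaults, and the tiny corpus is purely illustrative.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with the classic BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed idf, kept non-negative
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["cheap", "flights", "london"], ["hotel", "london"], ["cheap", "hotel"]]
print(bm25_score(["cheap", "london"], corpus[0], corpus))
```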

Embedding-based Retrieval

DSSM stands for Deep Structured Semantic Model, or, more generally, Deep Semantic Similarity Model. DSSM can be used to develop latent semantic models that project entities of different types (e.g., queries and documents) into a common low-dimensional semantic space for a variety of machine learning tasks such as ranking and classification. For example, in web search ranking, the relevance of a document given a query can be readily computed as the distance between them in that space.

DSSM has been extended to the two-tower model, where the query and the document are represented as vectors produced by separate neural networks.
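
A minimal two-tower sketch, assuming PyTorch, small feed-forward encoders, and dot-product similarity; the feature and embedding sizes are illustrative assumptions, not from the source.

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """A small feed-forward encoder that maps raw features to an embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.normalize(self.net(x), dim=-1)

query_tower, doc_tower = Tower(in_dim=32), Tower(in_dim=48)   # separate networks per side

q = query_tower(torch.randn(8, 32))    # batch of 8 query feature vectors
d = doc_tower(torch.randn(8, 48))      # the 8 corresponding document feature vectors

# Relevance is the dot product (cosine similarity, since embeddings are normalized).
scores = (q * d).sum(dim=-1)
print(scores.shape)  # torch.Size([8])
```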

Embedding based retrieval (EBR; a.k.a. vector search) provides an efficient implementation of semantic search and has seen wide adoption in e-commerce.

Hybrid Retrieval

Hybrid retrieval combines the results of different retrieval methods (e.g., lexical and embedding-based), under the hypothesis that a document retrieved by several methods is more likely to be relevant. This hypothesis is plausible for methods using very different retrieval mechanisms because, for most queries and corpora, there are many more irrelevant than relevant documents. If the methods retrieved relevant and irrelevant documents independently and uniformly at random, this imbalance would make it much more probable for relevant documents to be retrieved by both methods than for irrelevant ones.
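
One common way to build such a hybrid is to fuse the ranked lists of the lexical and embedding-based retrievers, for example with reciprocal rank fusion (RRF); the sketch below is illustrative and not from the source, and $k = 60$ is the constant commonly used with RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k: int = 60):
    """Fuse several ranked lists of document ids into a single ranking."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_hits = ["d3", "d1", "d7", "d2"]     # lexical retriever's top results
dense_hits = ["d1", "d9", "d3", "d4"]    # embedding-based retriever's top results
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # documents found by both rise
```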

Rank Aggregation

When there is just a single criterion (or "judge") for ranking, the task is relatively easy, and the result is simply a reflection of that judge's opinions and biases. In contrast, rank aggregation addresses the problem of computing a "consensus" ranking of the alternatives, given the individual ranking preferences of several judges.
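
One simple consensus method is the Borda count, where each judge's ranking awards points by position; the sketch below is an illustrative example, not from the source.

```python
from collections import defaultdict

def borda_count(rankings):
    """Aggregate several judges' rankings: each position contributes points."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            points[item] += n - position        # higher rank -> more points
    return sorted(points, key=points.get, reverse=True)

judges = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
print(borda_count(judges))  # ['a', 'b', 'c']
```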

Pre-Ranking

Existing pre-ranking systems primarily adopt the two-tower model, since its "user-item decoupling architecture" paradigm is able to balance efficiency and effectiveness.

The pre-ranking stage is widely considered a mini-ranking module, as it needs to rank hundreds of times more items than the ranking stage under the same limited latency.

In the pre-ranking stage, vector-product based models with representation-focused architecture are commonly adopted to account for system efficiency.

The goal of pre-ranking is to select a subset of documents (items) efficiently.

Ranking

Given a query $q$ (context) and a set of documents $D$ (items), the goal is to order elements of $D$ such that the resulting ranked list maximizes a user satisfaction metric $Q$ (criteria).

In a cascaded ranking architecture, the set $D$ is generated by the matching stage (also called recall or retrieval). Here we focus on learning to rank.

Learning to Rank

Learning to rank is usually formalized as a supervised learning task, while unsupervised learning and semi-supervised learning formulations are also possible. In learning, training data consisting of sets of objects as well as the total or partial orders of the objects in each set is given, and a ranking model is learned using the data. In prediction, a new set of objects is given, and a ranking list of the objects is created using the ranking model.

Feature Selection for Learning to Rank

LambdaRank

LambdaRank introduces the quantities $\lambda_i$ that are used when updating the parameters $w$ of the model $f:\mathbb{R}^P\to\mathbb{R}$. The key observation of LambdaRank is that in order to train a model, we do not need the costs themselves: we only need the gradients (of the costs with respect to the model scores).

You can think of these gradients as little arrows attached to each document in the ranked list, indicating which direction we’d like those documents to move. LambdaRank simply took the RankNet gradients, which we knew worked well, and scaled them by the change in NDCG found by swapping each pair of documents. We found that this training generated models with significantly improved relevance (as measured by NDCG) and had an added bonus of uncovering a further trick that improved overall training speed (for both RankNet and LambdaRank). Furthermore, surprisingly, we found empirical evidence (see also this paper) that the training procedure does actually optimize NDCG, even though the method, although intuitive and sensible, has no such guarantee.

We compute the gradients of RankNet by: $$ \frac{\partial L}{\partial w} = \sum_{(i, j)}\frac{\partial L_{i, j}}{\partial w}=\sum_{(i, j)}\left[\frac{\partial L_{i, j}}{\partial s_i}\frac{\partial s_i}{\partial w}+\frac{\partial L_{i,j}}{\partial s_j}\frac{\partial s_j}{\partial w}\right]. $$

Observe that $$\frac{\partial L_{i, j}}{\partial s_i} = -\frac{\partial L_{i,j}}{\partial s_j}$$ and define

$$ \lambda_{i,j}=\frac{\partial L_{i, j}}{\partial s_i} = -\frac{\partial L_{i, j}}{\partial s_j} = -\frac{\sigma}{1 + \exp(\sigma(s_i - s_j))}. $$

What is more, we can extend it to

$$ \lambda_{i,j}= -\frac{\sigma}{1+\exp(\sigma( s_i -s_j))}\,|\Delta Z|, $$ where $|\Delta Z|$ is the size of the change in some information retrieval measure $Z$ (e.g., NDCG) caused by swapping documents $i$ and $j$ in the ranked list.

And the $\lambda_i$ for the $i$-th item is defined as $$\lambda_i = \sum_{j:(i,j)\in I}\lambda_{i,j}-\sum_{j:(j,i)\in I}\lambda_{j,i},$$ where $I$ is the set of pairs $(i, j)$ in which document $i$ is more relevant than document $j$.
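
A minimal sketch of computing these per-document $\lambda_i$ for a single query, using $|\Delta \mathrm{NDCG}|$ as $|\Delta Z|$; the scores, labels, and $\sigma = 1$ are illustrative assumptions, not from the source.

```python
import numpy as np

def dcg(labels_in_rank_order):
    gains = 2.0 ** np.asarray(labels_in_rank_order) - 1.0
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    return float(np.sum(gains * discounts))

def lambda_gradients(scores, labels, sigma=1.0):
    """Per-document lambdas for one query, scaled by |delta NDCG| of each pair swap."""
    order = np.argsort(-scores)                        # current ranking by model score
    ideal = dcg(sorted(labels, reverse=True)) or 1.0   # ideal DCG for normalization
    ranks = np.empty_like(order); ranks[order] = np.arange(len(scores))
    lambdas = np.zeros_like(scores, dtype=float)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] <= labels[j]:
                continue                               # only pairs where i is more relevant
            # |delta NDCG| if documents i and j swapped positions in the ranking
            di, dj = 1.0 / np.log2(ranks[i] + 2), 1.0 / np.log2(ranks[j] + 2)
            delta = abs((2 ** labels[i] - 2 ** labels[j]) * (di - dj)) / ideal
            lam = -sigma / (1.0 + np.exp(sigma * (scores[i] - scores[j]))) * delta
            lambdas[i] += lam
            lambdas[j] -= lam
    return lambdas

print(lambda_gradients(np.array([0.2, 1.5, 0.3]), np.array([2, 0, 1])))
```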


LambdaMART

LambdaMART is the boosted-tree version of LambdaRank, which is in turn based on RankNet. RankNet treats the ranking problem as a pairwise classification problem.

MART stands for Multiple Additive Regression Trees. LambdaRank gives us the gradients (the $\lambda$s), and we can use these gradients as the targets fitted by the gradient-boosted regression trees (GBRT).

LambdaMART had an added advantage: the training of tree ensemble models can be very significantly sped up over the neural net equivalent (this work, led by O. Dekel, is not yet published). This allows us to train with much larger data sets, which again gives improved ranking accuracy. From RankNet: A ranking retrospective.



To implement LambdaMART we just use MART, specifying appropriate gradients and the Newton step. The key point is the gradient of the $\lambda_i$: $$ w_i = \frac{\partial y_i}{\partial F_{k-1}(\vec{x}_i)}, $$ where $\lambda_i = y_i$ is defined as in LambdaRank. LambdaRank updates all the weights after each query is examined. The decisions (splits at the nodes) in LambdaMART, on the other hand, are computed using all the data that falls to that node, and so LambdaMART updates only a few parameters at a time (namely, the split values for the current leaf nodes), but using all the data (since every $x_i$ lands in some leaf). This means in particular that LambdaMART is able to choose splits and leaf values that may decrease the utility for some queries, as long as the overall utility increases.
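
In practice LambdaMART is available off the shelf in gradient-boosting libraries; here is a minimal sketch using LightGBM's `lambdarank` objective, with synthetic data and illustrative hyperparameters (not from the source).

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Synthetic learning-to-rank data: 100 queries with 20 candidate documents each.
n_queries, docs_per_query, n_features = 100, 20, 10
X = rng.normal(size=(n_queries * docs_per_query, n_features))
y = rng.integers(0, 4, size=n_queries * docs_per_query)   # graded relevance labels 0..3
group = [docs_per_query] * n_queries                       # documents per query

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100, learning_rate=0.1)
ranker.fit(X, y, group=group)

# Rank the candidate documents of one new query by predicted score.
X_new = rng.normal(size=(docs_per_query, n_features))
order = np.argsort(-ranker.predict(X_new))
print(order)
```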

GBRT + LR can also be used to predict the click-through rate (CTR). In short, it combines gradient-boosted regression trees (GBRT) with logistic regression (LR): GBRT is introduced in the Boosting section, and LR uses the same logistic (cross-entropy) cost as RankNet.
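
A minimal sketch of the GBRT + LR combination for CTR prediction, assuming scikit-learn and synthetic click data; the leaf-index feature construction follows the usual GBDT+LR recipe, and all names and sizes are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                                         # impression features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)   # synthetic clicks

# 1. Fit the gradient-boosted trees on the raw features.
gbrt = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)

# 2. Encode each sample by the leaf it falls into in every tree.
leaves = gbrt.apply(X)[:, :, 0]                     # shape: (n_samples, n_trees)
encoder = OneHotEncoder(handle_unknown="ignore").fit(leaves)

# 3. Fit logistic regression on the one-hot leaf features to predict CTR.
lr = LogisticRegression(max_iter=1000).fit(encoder.transform(leaves), y)

ctr = lr.predict_proba(encoder.transform(gbrt.apply(X[:5])[:, :, 0]))[:, 1]
print(ctr)  # predicted click probabilities for the first five impressions
```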

LambdaLoss

Click-Through Rate Prediction

The purpose of click-through rate (CTR) prediction is to anticipate how likely a person is to click on an advertisement or item.

Re-Ranking

Beyond the user satisfaction metric, the system can re-rank the candidates to account for additional criteria or constraints.

Diversity

Fairness

Adversarial Ranking

Other

Given a user query, the top matching layer is responsible for providing semantically relevant ad candidates to the next layer, while the ranking layer at the bottom is more concerned with business indicators (e.g., CPM, ROI, etc.) of those ads. The clear separation between the matching and ranking objectives results in a lower commercial return. The Mobius project has been established to address this serious issue.

Calibrated Ranking

Relevance Feedback

  • After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
  • Use this feedback to reformulate the query, as sketched below.
  • Produce new results based on the reformulated query.
  • This allows a more interactive, multi-pass process.
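
A classic way to implement the query-reformulation step is the Rocchio update; below is a minimal sketch (not from the source), with illustrative $\alpha$, $\beta$, $\gamma$ weights and a toy bag-of-words representation.

```python
import numpy as np

def rocchio(query_vec, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones."""
    q = alpha * query_vec
    if len(relevant_docs):
        q = q + beta * np.mean(relevant_docs, axis=0)
    if len(nonrelevant_docs):
        q = q - gamma * np.mean(nonrelevant_docs, axis=0)
    return np.clip(q, 0.0, None)   # keep term weights non-negative

query = np.array([1.0, 0.0, 1.0, 0.0])           # bag-of-words query vector
relevant = np.array([[1.0, 1.0, 0.0, 0.0]])      # documents judged relevant by the user
nonrelevant = np.array([[0.0, 0.0, 0.0, 1.0]])   # documents judged non-relevant
print(rocchio(query, relevant, nonrelevant))     # reformulated query for the next pass
```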