RankEMD

Expando-Mono-Duo Design for Text Ranking

Dec 5, 2021


The Expando-Mono-Duo Design Pattern for Text Ranking with Pre-trained Sequence-to-Sequence Models

CS293S: Final Project

Objective and Challenges

As BM25 is based on matching query keywords against document terms and their frequencies, we often end up missing semantically relevant documents that do not contain the query keywords. Neural models pre-trained on language modeling tasks, such as ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2019), have achieved impressive results on text ranking and can also retrieve semantically relevant documents. However, neural models have much higher latency, so scoring every query-document pair at query time is not viable for ad hoc retrieval. A multi-stage ranking approach improves both effectiveness and efficiency.

The basic idea behind multi-stage ranking architectures is to break ad hoc retrieval down into a series of pipeline stages. Following an initial retrieval stage (also called candidate generation or first-stage retrieval), where a bag-of-words query is typically issued against an inverted index, each subsequent stage re-ranks the list of candidates passed along from the previous stage until the final top k results are generated for consumption, e.g., returned to the user.

However, using bag-of-words retrieval such as BM25 for the first stage can miss relevant documents that do not contain the exact query keywords. Hence, the paper proposes document expansion with a sequence-to-sequence model to enrich the keyword representations of texts in the corpus prior to indexing. The augmented documents are then indexed exactly as before; the only impact of document expansion is to (hopefully) provide a richer set of candidate documents for subsequent re-rankers to process.

In a data-poor regime with only a modest amount of training data, BERT is not effective and performs rather poorly. We therefore use the T5 model, which pushes the limits of transfer learning and can learn far more effectively than BERT in such settings.

State-of-the-Art Techniques

We have used the techniques derived from the following papers:

  1. Multi-stage ranking with transformer models was first proposed in Multi-Stage Document Ranking with BERT. arXiv:1910.14424 (2019) by Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin.
  2. Document expansion (“Expando”) was first proposed in Document Expansion by Query Prediction. arXiv:1904.08375 (2019) by Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. It was further improved upon by applying the T5 model to doc2query in From doc2query to docTTTTTquery by Rodrigo Nogueira and Jimmy Lin.
  3. A two-stage reranking pipeline comprising a pointwise reranker (“Mono”) was first detailed in Document Ranking with a Pretrained Sequence-to-Sequence Model by Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020.
  4. And the pairwise reranker (“Duo”) was detailed in Multi-Stage Document Ranking with BERT. arXiv:1910.14424 (2019) by Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin.
  5. The BERT model was taken from the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.
  6. T5 sequence-to-sequence model was taken from the paper Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683 (2019) by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu.

Key algorithms and techniques

Figure 1: Multi-stage ranking architecture

The Expando-Mono-Duo model proposed in the paper comprises a multi-stage architecture (as shown above). It has a number of stages, denoted H0 to Hn. Other than stage H0 (which retrieves documents from the corpus by keyword matching against an inverted index), each stage Hn receives a ranked list Rn-1 consisting of kn-1 documents from Hn-1 (i.e., from the previous stage). Each stage passes a ranked list Rn consisting of kn documents to the next stage, such that kn <= kn-1. The ranked list generated by the last stage Hn in the pipeline is finally designated for human consumption. One of the state-of-the-art techniques used here is document expansion, which is performed before the H0 step to enrich the documents. This step (called H-1) generates an augmented corpus which is indexed as usual and searched with BM25. The advantage of this step is that it provides a richer set of documents for the subsequent re-rankers. The following are the key features of the algorithm:

Figure 2: Document Expansion

1. H-1 (Document Expansion with T5) - Document expansion enriches the candidate documents with additional text that is relevant to them, for the purpose of more effective retrieval. The figure above describes the mechanism of document expansion. Given a document, the Doc2query model predicts queries that the document can potentially answer. In the example in the figure, the trained Doc2query model generates the predicted query “does cinnamon lower blood sugar?”. Even though the term “lower” does not appear in the input document, the query is relevant to the document because the document contains the synonymous keyword “reduces”, and the query captures the semantics of the document. In the next step, the document and the predicted query are concatenated to form an expanded document. This is done for every document in the original corpus to generate an augmented corpus, which is then indexed. Later, as shown in the figure, when the user sends the target query “foods and supplements to lower blood sugar”, the search system returns better documents.

Although the term “lower” in the user's query is not present in the original document, we are still able to use the document to answer the query because the expanded document contains the term “lower”. The advantage of this step is that it broadens the range of queries for which the passage is relevant and also highlights the importance of terms in the document that get repeated in the predicted queries.
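Below is a minimal sketch of this Expando step, assuming the publicly released castorini/doc2query-t5-base-msmarco checkpoint on Hugging Face; the sampling settings and number of queries per passage are illustrative rather than the exact configuration used in the paper.

```python
# Hypothetical Expando (doc2query) step: sample queries for a passage and append them.
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "castorini/doc2query-t5-base-msmarco"  # assumed checkpoint
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).eval()

def predict_queries(doc_text: str, num_queries: int = 3) -> list[str]:
    """Sample queries that the passage could plausibly answer (top-k sampling)."""
    enc = tokenizer(doc_text, return_tensors="pt", truncation=True, max_length=512)
    out = model.generate(
        enc.input_ids,
        max_length=64,
        do_sample=True,
        top_k=10,
        num_return_sequences=num_queries,
    )
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in out]

doc = "Cinnamon reduces blood sugar in people with diabetes."
expanded_doc = doc + " " + " ".join(predict_queries(doc))  # indexed in place of doc
```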

2. H0 (Keyword Retrieval) - In multi-stage ranking, this step is called candidate generation. It receives the user query q along with the augmented corpus from the H-1 step (as shown in Fig 1) and generates the top k0 candidate documents R0. In this implementation, a standard inverted index with BM25 scoring is used, as it is inherently fast at retrieval. Combining document expansion with BM25 thus gives us fast retrieval of a richer set of relevant documents.
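A sketch of this candidate-generation stage using Pyserini's Lucene-backed BM25 searcher is shown below; the prebuilt index name and BM25 parameters are assumptions for illustration, and in the Expando setting the index would be built over the doc2query-augmented corpus.

```python
# Hypothetical H0 stage: BM25 retrieval over an (expanded) inverted index with Pyserini.
from pyserini.search.lucene import LuceneSearcher

# Assumed prebuilt index; an Expando pipeline would index the augmented corpus instead.
searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
searcher.set_bm25(k1=0.82, b=0.68)  # illustrative BM25 parameters

hits = searcher.search("does cinnamon lower blood sugar?", k=1000)
r0 = [(hit.docid, hit.score) for hit in hits]  # R0, handed to the monoT5 stage
```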

3. H1 (Pointwise Re-ranking with monoT5) - This approach is also known as relevance classification. The documents retrieved from H0 are fed into H1, a pointwise re-ranker called monoT5. The model estimates a score si that quantifies how relevant a candidate di is to the query q, i.e., si = P(Relevant = 1 | di, q). The model produces the token “true” or “false” depending on whether the document is relevant to the query or not. To compute the probability for each query-document pair, a softmax is applied over the logits of the “true” and “false” tokens. The candidate documents are then re-ranked by the probability assigned to the “true” token. The advantage of this step is that non-relevant documents get discarded.
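The sketch below shows one way to compute that “true” probability with Hugging Face Transformers, assuming the castorini/monot5-base-msmarco checkpoint and a “Query: … Document: … Relevant:” input template; treat the exact names and template as assumptions.

```python
# Hypothetical monoT5 scorer: probability of the "true" token for a query-document pair.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

MONO_NAME = "castorini/monot5-base-msmarco"  # assumed checkpoint
mono_tok = T5Tokenizer.from_pretrained(MONO_NAME)
mono_model = T5ForConditionalGeneration.from_pretrained(MONO_NAME).eval()
TRUE_ID, FALSE_ID = mono_tok.encode("true")[0], mono_tok.encode("false")[0]

def mono_score(query: str, doc: str) -> float:
    """Return P('true'), i.e., the estimated probability that doc is relevant to query."""
    text = f"Query: {query} Document: {doc} Relevant:"
    enc = mono_tok(text, return_tensors="pt", truncation=True, max_length=512)
    start = torch.tensor([[mono_model.config.decoder_start_token_id]])  # single decode step
    with torch.no_grad():
        logits = mono_model(**enc, decoder_input_ids=start).logits[0, 0]
    return torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)[0].item()

# Re-rank R0 by descending P(true) to obtain R1.
r1 = sorted(["Cinnamon reduces blood sugar.", "Today is sunny."],
            key=lambda d: mono_score("does cinnamon lower blood sugar?", d),
            reverse=True)
```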

Figure 3: Different Types of Aggregation Functions

4. H2 (Pairwise Re-ranking with duoT5) - The documents retrieved from H1 are used as input to the pairwise re-ranker called duoT5. In the pairwise approach, the re-ranker estimates a probability pi,j = P(di > dj | di, dj, q) that candidate di is more relevant than dj to the query q, where di > dj means di is more relevant than dj (with respect to query q). The model produces the token “true” if document di is more relevant than dj, and “false” otherwise. To compute the probability for each document pair, a softmax is applied over the “true” and “false” tokens. At inference time, the pairwise scores pi,j are aggregated using aggregation functions (examples in Fig 3) so that each document receives a single score. Based on these scores, we re-rank R1 to generate R2. DuoT5 is computationally intensive, since k1*(k1-1) inferences need to be performed. Because aggregation functions like SYM-SUM combine the scores for both (di, dj) and (dj, di), duoT5 produces more stable and accurate scoring.
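A sketch of the pairwise scoring and SYM-SUM aggregation follows, assuming the castorini/duot5-base-msmarco checkpoint, a “Query: … Document0: … Document1: … Relevant:” template, and SYM-SUM defined as si = Σj (pi,j + (1 − pj,i)); all three reflect our reading of the paper rather than verified implementation details.

```python
# Hypothetical duoT5 scorer plus SYM-SUM aggregation over all document pairs.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

DUO_NAME = "castorini/duot5-base-msmarco"  # assumed checkpoint
duo_tok = T5Tokenizer.from_pretrained(DUO_NAME)
duo_model = T5ForConditionalGeneration.from_pretrained(DUO_NAME).eval()
TRUE_ID, FALSE_ID = duo_tok.encode("true")[0], duo_tok.encode("false")[0]

def duo_prob(query: str, doc_i: str, doc_j: str) -> float:
    """P(d_i is more relevant than d_j), taken from the 'true' token probability."""
    text = f"Query: {query} Document0: {doc_i} Document1: {doc_j} Relevant:"
    enc = duo_tok(text, return_tensors="pt", truncation=True, max_length=512)
    start = torch.tensor([[duo_model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = duo_model(**enc, decoder_input_ids=start).logits[0, 0]
    return torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)[0].item()

def sym_sum_rerank(query: str, docs: list[str]) -> list[str]:
    """Aggregate the k1*(k1-1) pairwise probabilities into one score per document."""
    k = len(docs)
    p = [[duo_prob(query, docs[i], docs[j]) if i != j else 0.0
          for j in range(k)] for i in range(k)]
    scores = [sum(p[i][j] + (1.0 - p[j][i]) for j in range(k) if j != i) for i in range(k)]
    order = sorted(range(k), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order]
```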

Experiments

Analysis of Multi-stage re-ranking on MS-MARCO passage dataset with 10 sample passages

To understand the ranking done by each step of the multi-stage re-ranker in detail, we performed an experiment with 2 queries and 10 passages (Link - Implementation and Analysis).

Multi-stage re-ranking on TREC-COVID dataset

We used the open-source code for the paper (Github Link) and performed experiments on Google Colab.

Figure 4a: Multi-stage ranking with 10 sample passages (BM25-MonoT5)
Figure 4b: Multi-stage ranking with 10 sample passages (MonoT5-DuoT5)

1. Dataset - We used the TREC-COVID dataset for this experiment (Implementation). It was developed as part of the TREC-COVID challenge organized by the U.S. National Institute of Standards and Technology (NIST), with the goal of evaluating systems that help stakeholders access reliable scientific evidence about the virus.

2. Evaluation Metrics - We used MAP (Mean Average Precision), reciprocal rank, P@10 (precision among the top 10 retrieved documents), and NDCG@10 (Normalized Discounted Cumulative Gain over the top 10 retrieved documents) to evaluate the results.
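For reference, the per-query versions of these metrics can be computed in a few lines of Python; this is a simplified sketch (our experiments relied on standard trec_eval-style tooling), assuming binary judgments for reciprocal rank and P@10 and graded judgments for NDCG@10.

```python
# Simplified per-query metric computations for a ranked list of document ids.
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def ndcg_at_k(ranked_ids, gains, k=10):
    """gains maps doc_id -> graded relevance (missing ids count as 0)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```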

3. Environmental Settings - For TREC-COVID, we used the model trained on the MS MARCO passage ranking task. The document expansion was performed in a zero-shot manner, i.e., the models were not fine-tuned on data from the target corpus (CORD-19). Due to limited computational resources, expansions were generated only from article abstracts. A sliding-window approach was used for the cases where abstracts exceeded T5’s input length restriction of 512 tokens.
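The sliding-window handling can be sketched as follows, reusing the tokenizer and predict_queries function from the Expando sketch above; the 512-token window and 256-token stride are assumptions, not values taken from the paper.

```python
# Hypothetical sliding-window expansion for abstracts longer than T5's 512-token limit.
def sliding_windows(token_ids, window=512, stride=256):
    """Yield overlapping windows of token ids covering the whole abstract."""
    start = 0
    while True:
        yield token_ids[start:start + window]
        if start + window >= len(token_ids):
            break
        start += stride

def expand_long_document(doc_text: str) -> str:
    """Generate queries per window and append all of them to the original abstract."""
    token_ids = tokenizer.encode(doc_text)  # tokenizer from the Expando sketch
    queries = []
    for window in sliding_windows(token_ids):
        chunk = tokenizer.decode(window, skip_special_tokens=True)
        queries.extend(predict_queries(chunk))
    return doc_text + " " + " ".join(queries)
```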

4. Results

Figure 5: Results on TREC-COVID Dataset

5. Analysis and Findings - Using the zero-shot approach, we achieved results that are at or near the state of the art. Even though the model was not fine-tuned for the target task, we achieved an NDCG@20 of 0.677 (with BM25+MonoT5) and 0.648 (with BM25+MonoT5+DuoT5). We obtained slightly lower effectiveness with DuoT5 because pairwise re-ranking requires k1*(k1-1) inferences and is therefore very computationally intensive; due to our limited computational resources, we reduced the number of documents passed to DuoT5 from 100 to 10, which caused a small drop in the metrics compared to MonoT5. With the full candidate set, the paper reports that re-ranking with DuoT5 outperforms re-ranking with MonoT5 alone.

T5 vs BERT

Figure 6a: Comparison of T5 and BERT (Results from the Paper)
Figure 6b: Comparison of T5 and BERT (Results from implementation)

To understand the advantage of T5 over BERT, we performed an experiment with various models.

We performed multiple experiments on the msmarco_ans_small dataset with BERT-large, T5-base, T5-large, and T5-3B, all of which were fine-tuned on the MS MARCO dataset. We chose MRR@10 as the evaluation metric for comparing results. Fig 6b shows the results from our implementation. The MRR@10 for T5-base was marginally lower than that of BERT-large, which is in line with the results reported in the paper. Because T5-3B is so computationally intensive, it timed out on Google Colab (we also tried Google Colab Pro with higher RAM and GPU capacity). For T5-large we obtained a lower MRR@10 than reported in the paper because we re-ranked a smaller subset of MS MARCO, in contrast to the large dataset used for the results in the paper. We tried to re-rank the original MS MARCO dataset as used in the paper, but it frequently timed out on Google Colab Pro (re-ranking takes about 26 hours).

Efforts

1. To understand multi-stage ranking in greater detail, we explored the ranking architecture with a number of different queries and examined the results to understand what each of the components (BM25, MonoT5, and DuoT5) does and how different passages get ranked (experiment under section 4.1).

2. We varied the number of documents used as input to DuoT5 and checked the results, in order to obtain near state-of-the-art results while keeping that computationally intensive step tractable (experiment under section 4.2).

3. We also experimented with various Castorini models as part of the experiment under section 4.3. We studied the results closely by varying the models, the dataset, and the batch size (which needed to be adjusted to best fit the GPU at hand).

This project gave us hands-on experience with the subject matter taught in week 5 (Lecture - More on BERT-based ranking) and was supplementary to what we understood in class. After learning about the efficiency of multi-stage architectures theoretically, we became really interested in this topic. We read a number of papers by Rodrigo Nogueira, whose paper was presented to us in class. We chose this paper over the other papers we read (mentioned in the References section) because it was one of the latest and showed state-of-the-art results. It also gave us good background on the series of developments happening in the field of Information Retrieval. Through this project, we got good exposure to recent ranking models such as T5 (which performs better than BERT).

Conclusion

The paper proposed Expando-Mono-Duo T5 as a design pattern for multi-stage ranking architectures, which introduced us to state-of-the-art techniques such as document expansion, pointwise re-ranking, and pairwise re-ranking. Document expansion is cast as a sequence-to-sequence transformation, while re-ranking is performed by constructing appropriate input sequence templates and examining the model outputs to extract relevance probabilities. The results we achieved were close to the state of the art, consistent with the claims of the paper's authors. Overall, the project provided a solid introduction to transformer-based multi-stage ranking architectures.

References

  1. https://tgilbrough.github.io/cse481n-blog/paper/qa.pdf  
  2. https://arxiv.org/pdf/1910.14424.pdf  
  3. https://arxiv.org/pdf/1901.04085.pdf  
  4. https://arxiv.org/pdf/2003.06713.pdf  
  5. https://arxiv.org/pdf/1904.08375.pdf  
  6. https://arxiv.org/pdf/2010.06467.pdf
