Why do we need to check corpus_id != query_id in DenseRetrievalExactSearch.search()? #169

Open
mengyao00 opened this issue Apr 11, 2024 · 1 comment

Comments

@mengyao00

Why do we need the line that checks corpus_id != query_id?

For a query with ID id_q, a corpus document that happens to share the same ID id_q is not necessarily a positive (relevant) document for that query. So why do we need to skip results where corpus_id == query_id?

            for query_itr in range(len(query_embeddings)):
                query_id = query_ids[query_itr]                  
                for sub_corpus_id, score in zip(cos_scores_top_k_idx[query_itr], cos_scores_top_k_values[query_itr]):
                    corpus_id = corpus_ids[corpus_start_idx+sub_corpus_id]
                    if corpus_id != query_id:
                        if len(result_heaps[query_id]) < top_k:
                            # Push item on the heap
                            heapq.heappush(result_heaps[query_id], (score, corpus_id))
                        else:
                            # If item is larger than the smallest in the heap, push it on the heap then pop the smallest element
                            heapq.heappushpop(result_heaps[query_id], (score, corpus_id))

        for qid in result_heaps:
            for score, corpus_id in result_heaps[qid]:
                self.results[qid][corpus_id] = score
        
        return self.results 
@thakur-nandan
Member

Hi @mengyao00, thanks for asking the question.

We need this line for two datasets, ArguAna and Quora, where corpus_ids and query_ids overlap, i.e., the query is also present within the corpus.

The line avoids the self-retrieval edge case, in which the query retrieves itself at the top-1 position and thereby reduces the nDCG@10 score for ArguAna and Quora.
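To make the effect concrete, here is a minimal, self-contained sketch on toy data. It is not the BEIR implementation: the helper names `top_k` and `ndcg_at_k`, the `skip_self` flag, the IDs, and the scores are all hypothetical, and the nDCG here assumes binary relevance. It only illustrates how a self-retrieved query at rank 1 pushes the truly relevant document down a position.

```python
import heapq
import math


def top_k(query_id, scores, k, skip_self=True):
    """Return the k highest-scoring doc IDs for one query, best first.

    `scores` maps doc_id -> similarity score (hypothetical toy input).
    """
    heap = []  # min-heap holding the best k candidates seen so far
    for doc_id, score in scores.items():
        if skip_self and doc_id == query_id:
            continue  # the query itself is in the corpus; do not retrieve it
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        else:
            # Push the new item, then drop the smallest of the k + 1
            heapq.heappushpop(heap, (score, doc_id))
    return [doc_id for _, doc_id in sorted(heap, reverse=True)]


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary relevance (an assumption made for this sketch)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal = sum(
        1.0 / math.log2(rank + 2)
        for rank in range(min(k, len(relevant_ids)))
    )
    return dcg / ideal if ideal else 0.0


# Toy data: "q1" is both a query ID and a corpus ID, as in the answer above.
scores = {"q1": 1.00, "d7": 0.83, "d2": 0.41}
relevant = {"d7"}  # the query's own entry is not labeled relevant

with_self = top_k("q1", scores, k=2, skip_self=False)
without_self = top_k("q1", scores, k=2, skip_self=True)

print(with_self, ndcg_at_k(with_self, relevant, k=2))
print(without_self, ndcg_at_k(without_self, relevant, k=2))
```

In this toy case the relevant document sits at rank 2 when the query is allowed to retrieve itself and at rank 1 when it is skipped, so the second nDCG value is the higher one.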

Hope it helps!

Regards,
Nandan Thakur
