This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

mismatch between encoded results and wiki passages #250

Open
Hannibal046 opened this issue Oct 16, 2023 · 0 comments

Comments

@Hannibal046

Hi, thanks so much for the great work. I have a question about the sizes of the wiki passages and the encoded index. After downloading the data as instructed, I found that the number of embeddings in the index doesn't match the number of passages:

import csv
import pickle

# Count the embeddings across the 50 index shards.
n_embedding = 0
for idx in range(50):
    index_path = f"DPR/dpr/downloads/data/retriever_results/nq/single/wikipedia_passages_{idx}.pkl"
    with open(index_path, "rb") as f:
        data = pickle.load(f)
    n_embedding += len(data)

# Count the passages in the TSV, skipping the header row.
n_doc = 0
wikidata_path = "DPR/dpr/downloads/data/wikipedia_split/psgs_w100.tsv"
with open(wikidata_path) as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        if row[0] == "id":
            continue
        n_doc += 1

print("n_embedding=", n_embedding)
print("n_doc=", n_doc)

The results are:

n_embedding= 21015300
n_doc= 21015324
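
The two counts differ by 24. To see which passages have no embedding, the IDs on both sides could be compared directly. The following is only a rough sketch: it assumes each pickle shard is a list of `(passage_id, vector)` pairs and that the IDs in the shards use the same format as the `id` column of the TSV (if the shards prefix their IDs, the comparison would need to strip that prefix first). The helper names `embedded_ids`, `passage_ids`, and `missing_ids` are made up for illustration and are not part of DPR.

```python
import csv
import pickle


def embedded_ids(shard_paths):
    """Collect passage IDs from pickled index shards.

    Assumption: each shard is a list of (passage_id, vector) pairs.
    """
    ids = set()
    for path in shard_paths:
        with open(path, "rb") as f:
            for passage_id, _vector in pickle.load(f):
                ids.add(str(passage_id))
    return ids


def passage_ids(tsv_path):
    """Collect passage IDs from the first column of the TSV, skipping the header."""
    ids = set()
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row and row[0] != "id":
                ids.add(row[0])
    return ids


def missing_ids(shard_paths, tsv_path):
    """Return the sorted passage IDs that appear in the TSV but in no shard."""
    return sorted(passage_ids(tsv_path) - embedded_ids(shard_paths))
```

With the paths from the snippet above, this would be called as `missing_ids([f".../wikipedia_passages_{i}.pkl" for i in range(50)], wikidata_path)`, keeping in mind that holding about 21 million IDs in two sets needs a fair amount of memory.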