Why are some document titles missing? #26

mukhal · 2021-10-28T22:21:58Z

Thank you for the amazing repo.

I am curious why some titles are missing from the tfidf index. During evaluation we get multiple warnings like these:

Oranjegekte_0 is missing
James Gunn_0 is missing
..

I assume this means that some document titles are not found in the database. Is that normal? Could you explain?

Thanks!


AkariAsai · 2022-03-26T21:33:18Z

Hi, sorry for my late response! Could you share the command you are running and the dataset on which you see this issue?
I think I have seen the same issue when a Wikipedia title (id) cannot be matched with any of the ids in the database. In particular, this can happen when:

the code does not handle some Unicode characters well
the Wikipedia entity title has been changed or redirected to a new one
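
To illustrate the first cause, here is a minimal sketch of how a Unicode normalization mismatch can make an id look missing even though the database contains it. This is not the repository's code: `normalize_title`, `find_missing`, and the example ids are hypothetical, and the choice of NFD is an assumption made only for the demonstration.

```python
import unicodedata


def normalize_title(title: str) -> str:
    """Return the NFD form of a title (the normalization form is an assumption)."""
    return unicodedata.normalize("NFD", title)


def find_missing(retrieved_ids, db_ids):
    """Return retrieved ids with no match in db_ids after normalizing both sides."""
    db_norm = {normalize_title(i) for i in db_ids}
    return [i for i in retrieved_ids if normalize_title(i) not in db_norm]


# Hypothetical data: the same title stored decomposed (NFD) in the database
# but retrieved in precomposed (NFC) form.
db_ids = [unicodedata.normalize("NFD", "Beyoncé_0"), "James Gunn_0"]
retrieved = [unicodedata.normalize("NFC", "Beyoncé_0"), "Oranjegekte_0"]

# A plain string comparison reports both ids as missing.
naive_missing = [i for i in retrieved if i not in set(db_ids)]

# After normalization, only the id that is truly absent remains.
normalized_missing = find_missing(retrieved, db_ids)
```

Normalization of this kind would not help with the second cause: a title that was renamed or redirected needs a mapping from the old title to the new one.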

mukhal · 2022-03-27T01:56:43Z

Thanks for the response. This happens with HotpotQA when I run the following command or similar commands.

python run_graph_retriever.py \
        --task hotpot_open \
        --bert_model bert-base-uncased --do_lower_case \
        --dev_file_path path/to/hotpotqa/dev \
        --output_dir path/to/output \
        --model_suffix 3 \
        --max_para_num 10 \
        --tfidf_limit 50 \
        --beam 4 \
        --eval_chunk 200 \
        --eval_batch_size 64 \
        --split_chunk 1000 \
        --pruning_by_links \
        --example_limit 128

I think the main issue is that some titles are retrieved by the tfidf retriever, but when their content is loaded with tfidf_retriever.load_abstract_para_text(), this warning is printed for some documents. I am not sure whether I should worry about it, though, since I was able to reproduce your results even with the warning appearing many times.
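
One way to judge whether the warnings matter is to measure how often they occur. The sketch below is only an illustration: `missing_rate` is a hypothetical helper, and `lookup` is a stand-in callable (here a toy dictionary), not the actual `load_abstract_para_text()` interface, whose signature is not shown in this thread.

```python
from collections import Counter


def missing_rate(retrieved_ids, lookup):
    """Return (fraction of ids with no text, Counter of the missing ids).

    `lookup` is a hypothetical callable that returns the paragraph text
    for a document id, or None (or an empty string) when the id is absent.
    """
    missing = [doc_id for doc_id in retrieved_ids if not lookup(doc_id)]
    rate = len(missing) / len(retrieved_ids) if retrieved_ids else 0.0
    return rate, Counter(missing)


# Toy stand-in for the paragraph database (made-up ids and text).
toy_db = {"A_0": "text a", "B_0": "text b"}
rate, counts = missing_rate(["A_0", "B_0", "C_0", "C_0"], toy_db.get)
```

A low rate concentrated on a handful of ids would be consistent with a few renamed or oddly encoded titles, while a high rate would suggest a mismatch between the index and the database.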

mukhal changed the title ~~Why are some documents missing?~~ Why are some document titles missing? Oct 28, 2021
