Why are some document titles missing? #26

Open
mukhal opened this issue Oct 28, 2021 · 2 comments
Comments

@mukhal

mukhal commented Oct 28, 2021

Thank you for the amazing repo.

I am curious why some titles are missing from the TF-IDF index. During evaluation we get multiple warnings like these:

Oranjegekte_0 is missing
James Gunn_0 is missing
..

I assume this means that some document titles are not found in the database. Is that normal? Could you explain?

Thanks!

mukhal changed the title from "Why are some documents missing?" to "Why are some document titles missing?" on Oct 28, 2021
@AkariAsai
Owner

Hi, sorry for my late response! Could you share the command you are running and the dataset where you see this issue?
I think I have seen the same issue when a Wikipedia title (id) cannot be matched with any of the ids in the database. In particular,

  • the code does not handle some Unicode characters well
  • some Wikipedia entity titles have been changed or redirected to a new title
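
The two causes above can be illustrated with a minimal sketch. It is not the repository's code: `normalize`, `resolve_doc_id`, the `redirects` mapping, and the toy ids are all hypothetical, and the `<title>_<paragraph index>` id format is only an assumption based on the warnings quoted in this thread.

```python
import unicodedata


def normalize(text: str) -> str:
    # NFC merges a base character and its combining marks into one code point,
    # so titles that look identical also compare equal.
    return unicodedata.normalize("NFC", text)


def resolve_doc_id(title, db_ids, redirects=None, para_index=0):
    """Return a matching id from db_ids, or None if the title is missing.

    db_ids:    set of ids already passed through normalize() (hypothetical).
    redirects: optional dict mapping old titles to current ones (hypothetical).
    """
    redirects = redirects or {}
    title = normalize(title)
    candidates = [title]
    if title in redirects:
        candidates.append(normalize(redirects[title]))
    for candidate in candidates:
        doc_id = f"{candidate}_{para_index}"  # assumed id format
        if doc_id in db_ids:
            return doc_id
    return None


# Made-up toy database: one id with a precomposed accent, one renamed page.
db_ids = {normalize("Caf\u00e9_0"), normalize("New Title_0")}

decomposed = "Cafe\u0301"  # "e" followed by a combining acute accent
print(resolve_doc_id(decomposed, db_ids))
print(resolve_doc_id("Old Title", db_ids, {"Old Title": "New Title"}))
print(resolve_doc_id("Oranjegekte", db_ids))
```

Without the normalization step, the decomposed title would not match the stored id even though both render the same way; without the redirect table, the renamed title would be reported as missing.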

@mukhal
Author

mukhal commented Mar 27, 2022

Thanks for the response. This happens with HotpotQA when I run the following command or similar commands.

python run_graph_retriever.py \
        --task hotpot_open \
        --bert_model bert-base-uncased --do_lower_case \
        --dev_file_path path/to/hotpotqa/dev \
        --output_dir path/to/output \
        --model_suffix 3 \
        --max_para_num 10 \
        --tfidf_limit 50 \
        --beam 4 \
        --eval_chunk 200 \
        --eval_batch_size 64 \
        --split_chunk 1000 \
        --pruning_by_links \
        --example_limit 128

I think the main issue is that some titles are retrieved by the TF-IDF retriever, but when their content is loaded with tfidf_retriever.load_abstract_para_text(), the warning is printed for some documents. I am not sure whether I should worry about it, though, since I was able to reproduce your results even with the warning appearing many times.
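
One way to judge whether the warnings matter is to count how many retrieved ids fail to load. The sketch below is only an illustration: `load_with_stats` is a hypothetical helper, and `load_text` is a stand-in callable that returns `None` for a missing id. It does not reflect the actual signature or behaviour of `load_abstract_para_text()`.

```python
from collections import Counter


def load_with_stats(doc_ids, load_text):
    """Load paragraph text for each id and tally the ids that are missing.

    load_text: stand-in callable returning the text, or None if the id
               is not in the database (an assumption for this sketch).
    """
    found = {}
    missing = Counter()
    for doc_id in doc_ids:
        text = load_text(doc_id)
        if text is None:
            missing[doc_id] += 1
        else:
            found[doc_id] = text
    return found, missing


# Made-up toy database and retrieval results.
toy_db = {"A_0": "text a", "B_0": "text b"}
retrieved = ["A_0", "Oranjegekte_0", "B_0", "Oranjegekte_0"]

found, missing = load_with_stats(retrieved, toy_db.get)
missing_rate = sum(missing.values()) / len(retrieved)
print(f"missing rate: {missing_rate:.2%}")
print(missing.most_common(5))
```

A low missing rate concentrated on a few titles would be consistent with being able to reproduce the reported results despite the warnings.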
