Issue with loading Additional Entities #19
Open · seanaedmiston opened this issue Mar 6, 2023 · 6 comments

@seanaedmiston

I have tried to load additional entities as per the README by running preprocess_all. Everything appears to run fine; however, when I try to load the ReFinED model afterwards with something like:

refined = Refined(
    model_file_or_model=data_dir + "/wikipedia_model_with_numbers/model.pt",
    model_config_file_or_model_config=data_dir + "/wikipedia_model_with_numbers/config.json",
    entity_set="wikidata",
    data_dir=data_dir,
    use_precomputed_descriptions=False,
    download_files=False,
    preprocessor=preprocessor,
)

I get an error like:

Traceback (most recent call last):
  File "/home/azureuser/Hafnia/email_ee/email_refined.py", line 91, in <module>
    refined = Refined(
  File "/home/azureuser/ReFinED/src/refined/inference/processor.py", line 100, in __init__
    self.model = RefinedModel.from_pretrained(
  File "/home/azureuser/ReFinED/src/refined/model_components/refined_model.py", line 643, in from_pretrained
    model.load_state_dict(checkpoint, strict=False)
  File "/home/azureuser/.pyenv/versions/venv3108/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RefinedModel:
        size mismatch for entity_typing.linear.weight: copying a param with shape torch.Size([1369, 768]) from checkpoint, the shape in current model is torch.Size([1447, 768]).
        size mismatch for entity_typing.linear.bias: copying a param with shape torch.Size([1369]) from checkpoint, the shape in current model is torch.Size([1447]).
        size mismatch for entity_disambiguation.classifier.weight: copying a param with shape torch.Size([1, 1372]) from checkpoint, the shape in current model is torch.Size([1, 1450]).

To the best of my understanding, this is because the number of classes in the Wikidata dump has changed since the original model was trained. (class_to_label.json now has 1446 entries.) Is there any way to accommodate this without completely retraining the model?
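
For anyone debugging the same mismatch, one quick check is to compare the regenerated class metadata against the classifier shape stored in the checkpoint. A minimal sketch, assuming model.pt loads as a plain state dict (the key name is taken from the error above; the paths are illustrative):

import json
import torch

data_dir = "organised_data_dir"  # illustrative; use your actual data_dir

# Class count produced by the new preprocessing run.
with open(data_dir + "/wikipedia_data/class_to_label.json") as f:
    n_classes_new = len(json.load(f))

# Class count baked into the pretrained checkpoint.
state_dict = torch.load(data_dir + "/wikipedia_model_with_numbers/model.pt",
                        map_location="cpu")
n_classes_ckpt = state_dict["entity_typing.linear.weight"].shape[0]

print(f"new dump: {n_classes_new} classes, checkpoint: {n_classes_ckpt}")
# Any difference here reproduces the size-mismatch RuntimeError.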

@lucatorellimxm

I went through a very similar issue after updating the files with the latest wiki dumps. I believe it is indeed attributable to the different shape of the classes tensor.

To perform zero-shot inference without retraining your model, you may want to use a mixture of original files (the ones built with the old number of classes) and newly generated ones.

The combination that I found runs the model effectively is the following:

  • class_to_idx.json (original)
  • class_to_label.json (original)
  • descriptions_tns.pt (new)
  • human_qcodes.json (new)
  • nltk_sentence_splitter_english.pickle (new)
  • pem.lmdb (new)
  • qcode_to_class_tns_<number>.pt (original)
  • qcode_to_idx.lmdb (original)
  • qcode_to_wiki.lmdb (see note below)
  • subclasses.lmdb (new)

NOTE: qcode_to_wiki.lmdb is generated by translating qcode_to_idx.json into an lmdb dictionary, which means that instead of mapping qcodes to Wikipedia titles (as intended), it returns numerical indexes. This might be a bug worth a new issue. However, I worked around it by simply renaming the newly generated additional_data/qcode_to_label.lmdb to qcode_to_wiki.lmdb, and it works just fine.
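
Putting that together, one way to assemble the mixed directory is a small copy script. A rough sketch, where the directory names "original", "new", and "mixed" are illustrative, the lmdb stores are assumed to be directories, and the qcode_to_class_tns filename should match whatever your dump produced:

import shutil
from pathlib import Path

# "original" holds the downloaded files the model was trained with,
# "new" holds the preprocess_all output, "mixed" is what ReFinED reads.
original, new, mixed = Path("original"), Path("new"), Path("mixed")
mixed.mkdir(exist_ok=True)

def copy_entry(src_dir: Path, name: str, dest_name: str | None = None) -> None:
    src, dest = src_dir / name, mixed / (dest_name or name)
    if src.is_dir():  # lmdb stores may be directories
        shutil.copytree(src, dest, dirs_exist_ok=True)
    else:
        shutil.copy2(src, dest)

for name in ["class_to_idx.json", "class_to_label.json",
             "qcode_to_class_tns_6269457-138.np",  # exact name varies by dump
             "qcode_to_idx.lmdb"]:
    copy_entry(original, name)

for name in ["descriptions_tns.pt", "human_qcodes.json",
             "nltk_sentence_splitter_english.pickle", "pem.lmdb",
             "subclasses.lmdb"]:
    copy_entry(new, name)

# Workaround from the note above: reuse the regenerated qcode_to_label.lmdb
# under the name ReFinED expects for qcode_to_wiki.lmdb.
copy_entry(new / "additional_data", "qcode_to_label.lmdb",
           dest_name="qcode_to_wiki.lmdb")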

@seanaedmiston (Author)

Thanks heaps for replying, @lucatorellimxm. With your suggestions I was at least able to run the model... but for whatever reason the performance is way off. Some entities that it was previously disambiguating/linking are no longer linking correctly, and my 'additional entities' are also not linking.

@seanaedmiston (Author)

Just an update in case anyone ever looks here... I eventually got everything working well, but discovered two things:

  1. To use 'additional_entities' without retraining the model in full, the trick is to copy 'chosen_classes.txt' from the original data directory (see the sketch after this list). This means that when all of the indexes are rebuilt with the additional entities in them, they use exactly the same classes the original model was trained on. (This avoids the error I initially reported above.)
  2. Even having done that, linking performance was terrible. I eventually tracked it down to an issue processing Wikipedia redirects. (Redirects turn out to be one of the biggest sources of disambiguation data.) For 'new' Wikipedia dumps, the redirect handling was completely broken. Reworked in my fork here: https://github.com/Simbolo-io/ReFinED
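
For point 1, the fix is just a file copy before rerunning preprocess_all. A minimal sketch; the paths are illustrative, and exactly where preprocess_all picks the file up depends on how you invoke it:

import shutil
from pathlib import Path

# chosen_classes.txt must come from the data the original model shipped with.
src = Path("original_data/wikipedia_data/chosen_classes.txt")
dst_dir = Path("new_data/wikipedia_data")
dst_dir.mkdir(parents=True, exist_ok=True)

# Freezing the class list means the rebuilt indexes use exactly the classes
# the pretrained weights expect, so load_state_dict no longer hits the
# size-mismatch error reported at the top of this issue.
shutil.copy2(src, dst_dir / "chosen_classes.txt")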

@lucatorellimxm

Great advice, thank you @seanaedmiston.

Does point 2 still hold true in the case of full model training? I am experiencing linking issues with rather easy mentions even after training the model from scratch on new data, and that could be the cause.

@seanaedmiston (Author)

Yes, I saw poor linking performance (point 2) even with full model training. Applying the 'redirect' parsing fixes I found should resolve that; it made a huge difference for me. My fork is a bit of a mess, but the only changes you should need are in process_wiki.py: main...Simbolo-io:ReFinED:main#diff-7aac257f29f9e00bda22f968125b52fc5bc3ced71e9627c5bf51780c4a8230c3

One little wrinkle: in the latest Wikipedia dumps there is an article title that consists of just a backslash. If that causes you problems, you may need the additional fix to loaders.py here: main...Simbolo-io:ReFinED:main#diff-7fbb3c56891f6094624a3872d81cde9dab1d4585452975093f5fdd63dece42ea
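
For reference, the loaders.py change only needs to be defensive title filtering. A hypothetical sketch of the idea, not the fork's exact code:

def is_usable_title(title: str) -> bool:
    # Hypothetical guard: skip degenerate article titles, such as the lone
    # backslash seen in recent dumps, which break downstream lookups.
    title = title.strip()
    return bool(title) and title != "\\"

# e.g. while iterating parsed dump records:
# records = (r for r in records if is_usable_title(r["title"]))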

@yhifny commented Sep 6, 2023

I am trying to add additional entities without retraining, but I am not able to find the file "chosen_classes.txt" in the original folder:

additional_data:

datasets:

roberta-base:
config.json  merges.txt  pytorch_model.bin  vocab.json

wikipedia_data:
class_to_idx.json  class_to_label.json  descriptions_tns.pt  human_qcodes.json
nltk_sentence_splitter_english.pickle  pem.lmdb  qcode_to_class_tns_6269457-138.np
qcode_to_idx.lmdb  qcode_to_wiki.lmdb  subclasses.lmdb

wikipedia_model:
config.json  model.pt

wikipedia_model_with_numbers:
config.json  model.pt

How can I find it? Thanks in advance.
