Bug: API Error on • character #176

Open
lukehare opened this issue Feb 14, 2023 · 8 comments

@lukehare
Collaborator

The • character appears relatively frequently in our newspaper data, and the toponym resolution pipeline doesn't know how to handle it. This causes the API to return an error.

E.g.
Input:

{'sentence': ' • - ST G pOllO-P• FERRIS - • - , i '}

Output:

<Response [500]>
@kallewesterling
Collaborator

Couldn't there be a regex search-and-replace for something like this? I think that's what Defoe does with some of this stuff...

[a-zA-Z]+$

Obviously, you might want to include - in there still...
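
A minimal sketch of the kind of search-and-replace suggested here, assuming the goal is to replace every character outside a small allow-list with a space before the text reaches the pipeline. The allow-list and the function name `sanitise_sentence` are illustrative assumptions, not what Defoe or the pipeline actually uses.

```python
import re

# Characters to keep: letters, digits, whitespace, and common punctuation,
# including the hyphen. This allow-list is an assumption for illustration.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()&/-]")


def sanitise_sentence(sentence: str) -> str:
    """Replace each disallowed character with one space, preserving length."""
    return _DISALLOWED.sub(" ", sentence)


original = " • - ST G pOllO-P• FERRIS - • - , i "
cleaned = sanitise_sentence(original)
print(repr(cleaned))
```

Substituting a single space for each removed character, rather than deleting it, keeps the string the same length, so character offsets such as `pos` and `end_pos` would still line up with the original sentence.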

@fedenanni
Contributor

I'll look into it to find out exactly where in the pipeline this happens: it might be the NER, DeezyMatch, or REL that is crashing, and based on that we can decide how to handle it. But I agree with @kallewesterling that we can then quickly fix it with a regex.

@kasparvonbeelen
Collaborator

@fedenanni @lukehare @kallewesterling I'd guess this is caused by the tokenizer. In this case, it should be straightforward to add special tokens.

@fedenanni
Contributor

@lukehare I'm looking into it (see the work in progress PR: #177) but, from a first test, the bug does not seem to be in the pipeline. I have just added this test and the text goes through the entire pipeline without an issue.

@fedenanni
Contributor

Can you check if the issue is on the API side?

@lukehare
Collaborator Author

lukehare commented Mar 1, 2023

I am still seeing the error, unfortunately. From my logs, it looks like it is coming from DeezyMatch / the candidate_ranker. I have tried it both via the API and running locally, and I get the same result. Interestingly, though, it doesn't appear to be caused specifically by the • character: I have been able to get it to work by slightly changing the input text (deleting some characters) while leaving that character in.

See logs:

>>> resolved = geoparser.run_text(
...         " • - ST G pOllO-P• FERRIS - • - , i ",
...     )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lukehare/toponym-resolution/geoparser/pipeline.py", line 226, in run_text
    sentence_dataset = self.run_sentence(
  File "/home/lukehare/toponym-resolution/geoparser/pipeline.py", line 149, in run_sentence
    wk_cands, self.myranker.already_collected_cands = self.myranker.find_candidates(mentions)
  File "/home/lukehare/toponym-resolution/geoparser/ranking.py", line 372, in find_candidates
    cands, self.already_collected_cands = self.run(queries)
  File "/home/lukehare/toponym-resolution/geoparser/ranking.py", line 349, in run
    return self.deezy_on_the_fly(queries)
  File "/home/lukehare/toponym-resolution/geoparser/ranking.py", line 287, in deezy_on_the_fly
    candidates = candidate_ranker(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/candidateRanker.py", line 327, in candidate_ranker
    tmp_dirname = query_vector_gen(query, model, train_vocab, dl_inputs, verbose)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/utils_candidate_ranker.py", line 60, in query_vector_gen
    test_model(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/rnn_networks.py", line 594, in test_model
    pred = model(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/rnn_networks.py", line 878, in forward
    x1_embs_not_packed = self.emb(x1_seq)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Whereas this works...

>>> resolved = geoparser.run_text("• - ST G pOllO-P• FERR")
>>> resolved
[{'mention': 'G', 'candidates': {'Q133083': 0.985, None: 0.316}, 'ner_score': 0.579, 'pos': 7, 'sent_idx': 0, 'end_pos': 8, 'tag': 'LOC', 'sentence': '• - ST G pOllO-P• FERR', 'prediction': 'Q133083', 'ed_score': 0.985, 'latlon': [-26.0, 28.0], 'wkdt_class': 'Q191093'}]

Other examples that failed:

{'sentence': ' BY HER LETTERS W PATENT, corner of Deansgate, and B • RatNo 1, BLACKFRIARSeSTREET, Agents to the Corporation', 'place': 'Manchester, Greater Manchester, England'}
 output: <Response [500]>

{'sentence': ' - experience, Who wil! take every precaution tO promote theitealth tte View to tako plaee o •ednes•lay, July sth', 'place': 'Manchester, Greater Manchester, England'}
 output: <Response [500]>

{'sentence': " 5, N • ' Buildings, Market-street", 'place': 'Manchester, Greater Manchester, England'}
 output: <Response [500]>

{'sentence': ' built expressly for the Liverpo a • - York trade • al , is equ • JODWOODS T A K E S, JULY 26TH, 1848', 'place': 'Manchester, Greater Manchester, England'}
 output: <Response [500]>
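
Since trimming the input makes the error disappear, one way to narrow down which characters matter is to shrink a failing input automatically. The sketch below is a generic greedy shrinker: it drops single characters for as long as the input still fails. The `fails` callback is a placeholder; in practice it would wrap a call such as `geoparser.run_text` in a `try`/`except`, which is an assumption about how it could be hooked up rather than something tested against the pipeline. The `toy_fails` predicate is invented purely so the example runs on its own.

```python
from typing import Callable


def shrink_failing_input(text: str, fails: Callable[[str], bool]) -> str:
    """Greedily remove characters while `fails(text)` stays True."""
    if not fails(text):
        raise ValueError("the starting input does not fail")
    i = 0
    while i < len(text):
        candidate = text[:i] + text[i + 1:]
        if fails(candidate):
            text = candidate  # character i was not needed to trigger the failure
        else:
            i += 1  # character i is needed; keep it and move on
    return text


# Stand-in predicate for demonstration: "fails" when both "•" and "S" are present.
def toy_fails(s: str) -> bool:
    return "•" in s and "S" in s


shrunk = shrink_failing_input(" • - ST G pOllO-P• FERRIS - • - , i ", toy_fails)
print(repr(shrunk))
```

Running this over each failing sentence and comparing the shrunken results could show whether the failures share a common character or combination of characters.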

@lukehare
Collaborator Author

lukehare commented Mar 3, 2023

Update: We have identified that the bug occurs if the character is passed to the candidate ranker in DeezyMatch. We think this is caused by an incorrect OCR model (w2v_ocr) used in the API deployment. We're looking into where this model came from, and assuming it is out-of-date, we will redeploy the API with the correct model asap.
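
For context, the `IndexError: index out of range in self` at the bottom of the traceback is what PyTorch's embedding lookup raises when an input index falls outside the embedding table. The sketch below imitates that lookup in plain Python to show one way this could arise: a character vocabulary with more entries than the embedding table it is used with. The vocabulary, table size, and function names are made up for illustration; they are not DeezyMatch's actual data structures.

```python
# Illustrative only: a toy character vocabulary and embedding table.
# In this made-up setup the vocabulary has 5 entries but the table has 4 rows,
# mimicking a vocabulary and a model that do not belong together.
vocab = {"a": 0, "b": 1, "c": 2, "-": 3, "•": 4}
embedding_table = [[0.0, 0.0] for _ in range(4)]


def out_of_range_chars(text, vocab, num_embeddings):
    """Return characters whose index would fall outside the table."""
    return sorted(
        {ch for ch in text if ch in vocab and not 0 <= vocab[ch] < num_embeddings}
    )


def embed(text, vocab, table):
    """Look up one vector per known character; raise if an index is out of range."""
    bad = out_of_range_chars(text, vocab, len(table))
    if bad:
        raise IndexError(f"index out of range for characters: {bad}")
    return [table[vocab[ch]] for ch in text if ch in vocab]


print(out_of_range_chars("ab•c", vocab, len(embedding_table)))
```

A check like `out_of_range_chars` run against the deployed vocabulary and model size would be one way to confirm or rule out a mismatch before redeploying.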

@fedenanni
Contributor

Regarding this, @mcollardanuy suggests it might be because you created a "test" OCR model. Its name should be different from the one I have (it should be _test, see here), but maybe, due to a bug, this isn't the case.
