-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug: API Error on • character #176
Comments
Couldn't there be a regex search-and-replace for something like this? I think that's what Defoe does with some of this stuff...
Obviously, you might want to include |
I'll look into it to understand exactly at which point of the pipeline this happens, as it might be that it's either the |
@fedenanni @lukehare @kallewesterling I'd guess this is caused by the tokenizer. In this case, it should be straightforward to add special tokens. |
Can you check if the issue is on the API side? |
I am still seeing the error, unfortunately. It looks from my logs that it is coming from DeezyMatch / the candidate_ranker. I have tried it via the API and running locally and I get the same result. Interestingly though it doesn't appear to specifically be because of the • character, as I have been able to get it to work by slightly changing the input text (deleting some characters) but leaving that character in. See logs:
Whereas this works...
Other examples that failed:
|
Update: We have identified that the bug occurs if the |
Regarding this, @mcollardanuy suggests it might be due to the fact that you created a "test" OCR model. The name should be different from the one i have (I should be |
The • character appears relatively frequently in our newspaper data, and the toponym resolution pipeline doesn't no how to handle it. This causes the API to return an error.
E.g.
Input:
Output:
The text was updated successfully, but these errors were encountered: