[Discussion] What's the best way to match OCR text? #6

Open
hv0905 opened this issue Dec 26, 2023 · 0 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

hv0905 (Owner) commented Dec 26, 2023

Currently we use a BERT model (more precisely, bert-base-chinese) to vectorize the OCR text, then use cosine distance for indexing and searching.
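To make the search step concrete, here is a minimal sketch of ranking stored vectors by cosine distance. It is not the project's actual code: the `search` helper, the image IDs, and the 3-dimensional toy vectors are hypothetical stand-ins for real bert-base-chinese embeddings and the real index.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)

def search(query_vec, index, top_k=3):
    """Rank stored (image_id, vector) pairs by cosine distance to the query."""
    scored = [(cosine_distance(query_vec, vec), image_id) for image_id, vec in index]
    scored.sort()  # smallest distance first
    return [image_id for _, image_id in scored[:top_k]]

# Toy vectors standing in for sentence embeddings (hypothetical values).
index = [
    ("img_a", [1.0, 0.0, 0.0]),
    ("img_b", [0.0, 1.0, 0.0]),
    ("img_c", [1.0, 1.0, 0.0]),
]
results = search([1.0, 0.1, 0.0], index, top_k=2)
```

Because the whole OCR text is reduced to a single vector, the ranking depends only on the angle between the query vector and each stored vector, not on whether any particular word appears in the text.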

However, this method seems to give poor matches when the query is a partial keyword or a semantically similar sentence.

For instance,
image

FYI, the OCR text of the image:
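1. please
2. 你最 ("you most")
3. 叔 ("uncle")
4. 什么情况兄弟 ("what's going on, bro")
5. 爱 ("love")
6. 爱 ("love")
7. 害怕 ("scared")
8. 乳 ("milk; breast")
9. 嘿 ("hey")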

Only when I provide more detailed text does the server return more accurate results:
image

Is there a way to improve the OCR text matching?
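As a starting point for discussion, one direction might be to add a lexical score alongside the vector distance, so that a partial keyword can still hit the exact characters in the OCR output. The sketch below is only an illustration under that assumption, not something the project currently does; `char_ngrams` and `keyword_score` are hypothetical helper names, and how to combine this score with the cosine distance is left open.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams in text (the whole text if it is no longer than n)."""
    text = text.strip()
    if len(text) <= n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def keyword_score(query, ocr_lines, n=2):
    """Fraction of the query's character n-grams found anywhere in the OCR lines."""
    query_grams = char_ngrams(query, n)
    if not query_grams:
        return 0.0
    doc_grams = set()
    for line in ocr_lines:
        doc_grams |= char_ngrams(line, n)
    return len(query_grams & doc_grams) / len(query_grams)

# OCR lines from the example image above.
ocr_lines = ["please", "你最", "叔", "什么情况兄弟", "爱", "爱", "害怕", "乳", "嘿"]
partial = keyword_score("兄弟", ocr_lines)    # keyword contained in line 4
unrelated = keyword_score("天气", ocr_lines)  # keyword absent from the OCR text
```

Character n-grams are used here rather than word tokens because Chinese text has no whitespace word boundaries, and short OCR fragments are hard to segment reliably.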

Related code

https://github.com/hv0905/NekoImageGallery/blob/master/app/Services/transformers_service.py#L59

Related documentation

https://huggingface.co/tasks/sentence-similarity

hv0905 added the enhancement and help wanted labels Dec 26, 2023