Formulas in gif format #7

Open
OlgaGKononova opened this issue Jun 29, 2018 · 2 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

Collaborator

OlgaGKononova commented Jun 29, 2018

I found that some ECS papers use GIF images for formulas and numbers.
For example: http://jes.ecsdl.org/content/157/3/J69.full
<span class="inline-formula" id="inline-formula-38"><img class="math mml" alt="Formula" src="J69/embed/mml-math-38.gif"
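A minimal sketch of how such images could be detected in a paper's HTML, using only the Python standard library. The names `FormulaImageFinder` and `find_formula_images` are hypothetical, and the sample string closes the truncated tag from the snippet above with an assumed `>` and `</span>`; the real markup may carry more attributes.

```python
from html.parser import HTMLParser


class FormulaImageFinder(HTMLParser):
    """Collect the src of every <img> whose alt text is "Formula"."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Attribute order does not matter here, unlike a literal substring match.
        if attributes.get("alt") == "Formula" and attributes.get("src"):
            self.sources.append(attributes["src"])


def find_formula_images(html):
    """Return the src values of all formula images found in an HTML string."""
    finder = FormulaImageFinder()
    finder.feed(html)
    finder.close()
    return finder.sources


# Sample built from the snippet in this issue; the tag endings are assumed.
sample = (
    '<span class="inline-formula" id="inline-formula-38">'
    '<img class="math mml" alt="Formula" src="J69/embed/mml-math-38.gif">'
    "</span>"
)
print(find_formula_images(sample))  # ['J69/embed/mml-math-38.gif']
```

Counting `len(find_formula_images(html))` per paper would give a rough measure of how many formulas are affected in each document.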

Can we check how many such cases there are and do something about it?
Thank you.

Contributor

shaunrong commented Jul 6, 2018

Nice catch, @OlgaGKononova! Thank you.

This can be fixed quickly by writing an OCR ingredient that targets these GIFs and converts them to strings using pytesseract.
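A rough sketch of what such an OCR step might look like. It is not the project's actual ingredient interface: `replace_formula_images`, `tesseract_ocr`, and `fake_ocr` are hypothetical names. It assumes the GIFs have already been downloaded so that each `src` resolves to a local file, and that Pillow, pytesseract, and the tesseract binary are installed. General-purpose OCR may also struggle with mathematical notation, so the output would need checking.

```python
import re

# Matches a whole <img ...> tag whose alt attribute is "Formula".
FORMULA_IMG = re.compile(r'<img\b[^>]*\balt="Formula"[^>]*>', re.IGNORECASE)
SRC_ATTR = re.compile(r'\bsrc="([^"]+)"')


def tesseract_ocr(image_path):
    """OCR one image file with pytesseract (imports are deferred until needed)."""
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path)).strip()


def replace_formula_images(html, ocr=tesseract_ocr):
    """Replace each formula <img> tag with the text ocr() returns for its src."""

    def substitute(match):
        src = SRC_ATTR.search(match.group(0))
        if src is None:
            return match.group(0)  # no src to OCR, leave the tag untouched
        return ocr(src.group(1))

    return FORMULA_IMG.sub(substitute, html)


def fake_ocr(image_path):
    """Stand-in for tesseract_ocr so the demo runs without tesseract installed."""
    return "[formula from " + image_path + "]"


# Made-up surrounding text; only the <img> tag follows the snippet in this issue.
sample = 'E = <img class="math mml" alt="Formula" src="J69/embed/mml-math-38.gif"> V'
print(replace_formula_images(sample, ocr=fake_ocr))
```

Passing the OCR function as a parameter keeps the tag-replacement logic testable on its own, and would let a different recognizer be swapped in if pytesseract turns out to be unreliable on formulas.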

@tiagobotari let me know if you can take care of this issue. If not, I will push an ingredient component.

Contributor

hhaoyan commented May 15, 2019

This MongoDB query counts the affected papers:

db.getCollection('Paper_Raw_HTML').find({Publisher: 'ECS', Paper_Raw_HTML: /img alt="Formula"/}).count()

3019 papers have this issue:

broken_dois.txt
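For anyone who wants to reproduce the count from Python, here is a hedged sketch of the same filter in the form pymongo accepts. The helper names are hypothetical. One thing that may be worth checking: the literal pattern `img alt="Formula"` does not match the snippet quoted in the first comment, where `class="math mml"` sits between `img` and `alt`. The stored raw HTML may differ from that snippet, so this might not matter, but the looser pattern below (an assumption, not the query used above) is included for comparison.

```python
import re

# Filter mirroring the shell query above, in the form pymongo accepts.
ECS_FORMULA_FILTER = {
    "Publisher": "ECS",
    "Paper_Raw_HTML": {"$regex": 'img alt="Formula"'},
}

# Strict pattern: the same literal text the shell query searches for.
STRICT_FORMULA_IMG = re.compile(r'img alt="Formula"')

# Looser pattern (assumption): tolerates other attributes before alt="Formula".
LOOSE_FORMULA_IMG = re.compile(r'<img\b[^>]*\balt="Formula"')


def count_affected(collection):
    """Count matching papers; collection is expected to be a pymongo Collection."""
    return collection.count_documents(ECS_FORMULA_FILTER)


def matches_strict(html):
    """True if the HTML contains the literal text used by the shell query."""
    return STRICT_FORMULA_IMG.search(html) is not None


def matches_loose(html):
    """True if any <img> tag in the HTML carries alt="Formula"."""
    return LOOSE_FORMULA_IMG.search(html) is not None


# Tag as quoted in the first comment, with an assumed closing ">".
quoted = '<img class="math mml" alt="Formula" src="J69/embed/mml-math-38.gif">'
print(matches_strict(quoted), matches_loose(quoted))  # False True
```

If the two patterns give different counts on the real collection, the difference would indicate papers that the original query misses.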

hhaoyan added the enhancement and help wanted labels on May 15, 2019

3 participants