Documentation #8

Open
alejandrojapkin opened this issue Dec 17, 2018 · 3 comments
Comments

@alejandrojapkin

Would it be possible for the project to include a full end-to-end example: extraction from PNG or PDF, then training (or use of a pre-existing model), through to writing the output?
textract is pretty good, but it makes a couple of wrong assumptions, for instance that every PDF file can be consumed in the same way.

@kororo
Owner

kororo commented Jan 17, 2019

Adding more integration with textract sounds like a good idea. Let me put together a few more examples for that. Do you have any other feedback?

@alejandrojapkin
Author

Summary

  • textract assumes that a file's extension defines its format. That is often wrong in real life: people store both structured, TeX-compatible data and scanned documents (images) in the PDF format. An image-only PDF (very common) is sent to the PDF text extractor, which produces a wrong result.
  • It'd be good to have the full set of requirements for training (I had to deduce them from your example xls), plus a full example that runs from training through to producing an output, at least illustrative enough to work out how to produce an output yourself.
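
To make the first point concrete, here is a minimal sketch of choosing the extraction route from the file's content instead of its extension. It is not how textract or excelcy work internally. The names `sniff_format`, `extract_text`, `text_extractor`, `ocr_extractor` and the `min_chars` threshold are all hypothetical, and the extractors are passed in as plain callables so the sketch depends only on the standard library.

```python
from typing import Callable, Optional

# Well-known leading "magic" bytes. Assumption: the signature sits at offset 0,
# which is the common case (some PDFs carry a few junk bytes before "%PDF-").
MAGIC = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
)


def sniff_format(head: bytes) -> Optional[str]:
    """Guess the format from the first bytes of a file, ignoring its name."""
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    return None


def extract_text(
    path: str,
    text_extractor: Callable[[str], str],  # hypothetical: reads the PDF text layer
    ocr_extractor: Callable[[str], str],   # hypothetical: runs OCR on the pages
    min_chars: int = 20,                   # arbitrary cut-off for "no real text"
) -> str:
    """Route a file to text extraction or OCR based on what it contains."""
    with open(path, "rb") as fh:
        kind = sniff_format(fh.read(16))

    if kind == "pdf":
        # Try the text layer first; a scanned PDF yields (almost) nothing.
        text = text_extractor(path)
        if len(text.strip()) >= min_chars:
            return text
        return ocr_extractor(path)

    if kind in ("png", "jpg"):
        # Images always need OCR, whatever the file happens to be called.
        return ocr_extractor(path)

    raise ValueError(f"unrecognised file content: {path}")
```

With textract, the two callables could plausibly wrap `textract.process(path, method=...)` with a text-based method for the first and an OCR-based method such as `tesseract` for the second. Treat that as an assumption to check against the textract documentation for your installed version.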

@kororo
Owner

kororo commented Mar 13, 2019

hi @datascienceteam01,

Pretty good comments. I do a lot of ingestion myself, and you have a valid point there. However, if excelcy tries to tackle those problems, the project scope is going to be all over the place. That is why I rely on other packages for the data transformation (for example image -> text). I think these points are more relevant to the textract package.

I am adding the extra documentation as per your suggestion, and I am going to release a newer version.

Thanks

Repository owner deleted a comment Jan 5, 2024
Repository owner deleted a comment Feb 3, 2024