Documentation #8

alejandrojapkin · 2018-12-17T17:07:33Z

Would it be possible for the project to include a full end-to-end example: extracting text from a PNG or PDF, training a model (or using a pre-existing one), and writing the output?
textract is pretty good, but it makes a couple of wrong assumptions, for instance that every PDF file can be consumed in the same way.

kororo · 2019-01-17T13:01:56Z

Adding more integration with textract sounds like a good idea. Let me put together a few more examples for that. Do you have any other feedback?

alejandrojapkin · 2019-01-18T20:03:13Z

Summary

textract assumes that a file's extension defines its format. This is very wrong in real life: people store both structured/TeX-compatible data and scanned documents (images) in the PDF format. A fully image-based PDF (very common) will be run through the PDF text extractor, which produces a wrong result.
It'd be good to have a full set of requirements for training (I had to deduce them from your example xls) and also a full example that runs from training through to producing an output, or at least one illustrative enough to deduce how to produce an output.
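
To make the first point concrete, here is a minimal sketch of the kind of handling that avoids trusting the extension. It is not textract's or excelcy's actual behavior. The names `sniff_format`, `extract_with_fallback`, and the `min_chars` threshold are hypothetical, and the two extractors are passed in as plain callables so the sketch does not depend on any particular library.

```python
from typing import Callable

# Leading "magic" bytes of a few common formats (well-known file signatures).
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def sniff_format(data: bytes) -> str:
    """Guess the format from the file's first bytes instead of its extension."""
    for magic, name in MAGIC_SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


def extract_with_fallback(
    path: str,
    text_extractor: Callable[[str], str],  # hypothetical: reads the PDF text layer
    ocr_extractor: Callable[[str], str],   # hypothetical: runs OCR on the pages
    min_chars: int = 20,                   # assumed threshold, tune per corpus
) -> str:
    """Try the text-layer extractor first; fall back to OCR if it yields
    almost nothing, which is typical for a scanned (image-only) PDF."""
    text = text_extractor(path)
    if len(text.strip()) >= min_chars:
        return text
    return ocr_extractor(path)
```

With textract, the two callables could presumably wrap `textract.process(path)` and `textract.process(path, method="tesseract")`, decoding the returned bytes in each case, but I have not verified that combination here.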

kororo · 2019-03-13T06:42:25Z

hi @datascienceteam01,

Pretty good comments. I do a lot of ingestion and you have a valid point there. However, if excelcy tried to tackle those problems, the project scope would be all over the place. That is why I rely on other packages for data transformation (for example image -> text). I think these points are more relevant to the textract package.

I am adding the extra documentation as per your suggestion and will release a newer version.

Thanks

Repository owner deleted a comment Jan 5, 2024

Repository owner deleted a comment Feb 3, 2024
