Explain how to read text data from PDF and PowerPoint and use it with Texthero #24

Open
selimelawwa opened this issue Jun 2, 2020 · 6 comments · May be fixed by #112
Labels
documentation Improvements or additions to documentation

Comments

@selimelawwa
Contributor

PDFs, PowerPoint presentations, and other unstructured documents contain valuable text data that can be used for analysis.
Many tools provide this feature. It would be nice if we could provide a single method to read such files, so that the user doesn't have to bother with this.

There is a Python library, textract, that provides this functionality, but unfortunately it is not maintained.

We could provide a method, loadData or similar, that has a different implementation depending on the file type.

@jbesomi
Owner

jbesomi commented Jun 2, 2020

Very interesting comment. Completely agree that we should do something related.

Regarding textract: there are also other Python tools for PDF extraction, such as PyPDF2, PDFminer, etc.

Regarding dataLoader: since the use cases differ quite a lot from task to task, and since this feature is a bit too far from the core idea of Texthero, an alternative would be to add a detailed tutorial on the blog, with code snippets (which can also be added somewhere in the GitHub repo), that explains how to extract text data from different sources such as PDF and PowerPoint. What do you think about this? Also, building a universal dataLoader might be quite hard, which is why there is generally a dedicated Python package that does only that.

As a final comment, it's important to define precisely what the goals and objectives of Texthero are: better to do one thing great than five things average. We can also discuss that later.

@selimelawwa
Contributor Author

Completely agree with your final comment! Even though this is not one of the core goals of Texthero, I think it can be a cool feature to have. I just wanted to write it down so it can be done later, once the core is built and running. I think having ideas written down and shared is good for the project.

My idea for a universal data loader is that it appears "universal" to the user, while having multiple implementations and using different packages under the hood depending on the file type / data source.
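
One way to picture this "universal on the outside, specialized on the inside" loader is a small registry that maps file extensions to reader functions. The sketch below is only an illustration: `load_text`, `register_reader`, and `_READERS` are hypothetical names, not part of Texthero, and only a plain-text reader is registered. Readers for PDF or PowerPoint would wrap whichever third-party package is chosen.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: file extension -> function that returns the file's text.
_READERS: Dict[str, Callable[[Path], str]] = {}


def register_reader(*extensions: str):
    """Decorator that registers a reader function for one or more extensions."""
    def decorator(func: Callable[[Path], str]) -> Callable[[Path], str]:
        for ext in extensions:
            _READERS[ext.lower()] = func
        return func
    return decorator


@register_reader(".txt")
def _read_plain_text(path: Path) -> str:
    # The only reader implemented in this sketch.
    return path.read_text(encoding="utf-8")


def load_text(path) -> str:
    """Single entry point: pick the reader based on the file extension."""
    path = Path(path)
    reader = _READERS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"No reader registered for '{path.suffix}' files")
    return reader(path)
```

The user only ever calls `load_text("some/file.txt")`; supporting a new format means registering one more function, without touching the entry point.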

For now, yeah, we can just have a tutorial on the blog!

@igponce

igponce commented Jul 8, 2020

There's a good library, tika-python (https://github.com/chrismattmann/tika-python), that handles PDFs, emails, and other formats as well.
It is based on Apache Tika (http://tika.apache.org/), and the maintainer is on the Apache Tika board.

The only con I find is that it needs a JVM to run Tika behind the scenes, but it's very easy to start using it:

import tika
from tika import parser

tika.initVM()  # Gets the Apache Tika jar file (if not present) and launches Tika from the JVM

filename = 'path/to/your/file(ppt|doc|docx|pdf)'
thedoc = parser.from_file(filename)

print(thedoc['metadata'])  # dict with information about the file itself
print(thedoc['content'])   # UTF-8 text extracted from the file

# Dump attachments if the file has any (like .msg, .eml, etc).
if thedoc.get('attachments', False):
    print(thedoc['attachments'])
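
Since the result is a plain dictionary, it is easy to flatten into one record per document before handing it to Pandas. A minimal sketch, assuming the dictionary has the `content` and `metadata` keys shown above: `tika_result_to_row` is a hypothetical helper, and it assumes that `content` may be `None` when no text could be extracted and that a `Content-Type` entry appears in the metadata.

```python
from typing import Any, Dict, Optional


def tika_result_to_row(parsed: Dict[str, Any], source: Optional[str] = None) -> Dict[str, Any]:
    """Flatten a Tika-style result dict into a single flat record."""
    metadata = parsed.get("metadata") or {}
    content = parsed.get("content") or ""  # treat a missing or None body as empty text
    return {
        "source": source,
        "content_type": metadata.get("Content-Type"),  # assumed metadata key
        "text": content.strip(),
    }
```

A list of such rows can then be passed directly to `pd.DataFrame(rows)`.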

@jbesomi
Owner

jbesomi commented Jul 8, 2020

Hi @igponce! Thank you for your comment!

Adding native PDF support might be a bit out of Texthero's purposes.

What is definitely useful is a tutorial on Texthero's blog page that explains how to start hero-analyzing a collection of documents, starting from raw and other formats.

There are different solutions for doing that. Another valid alternative is, for instance, pdfminer.six, as it's very simple to use and is pure Python (no need for the JVM).

For example, to go from raw PDF data to a Pandas DataFrame, these few lines of code do the job:

import glob

import pandas as pd
from pdfminer.high_level import extract_text

# Collect every PDF in the folder and extract its text
all_pdf = glob.glob("filepath_to_pdf_collection/*.pdf")
text = [extract_text(p) for p in all_pdf]
df = pd.DataFrame(text, columns=['text'])

# ... do hero analysis
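
For a real collection it can help to keep the file name next to each text and to skip files that fail to parse. A sketch under those assumptions: `collect_texts` is a hypothetical helper, the extractor is passed in as a function (for instance `pdfminer.high_level.extract_text`), and any exception it raises is treated as "skip this file".

```python
import glob
import os
from typing import Callable, Dict, List


def collect_texts(pattern: str, extract: Callable[[str], str]) -> List[Dict[str, str]]:
    """Return one {'file', 'text'} record per file matching the glob pattern."""
    rows = []
    for path in sorted(glob.glob(pattern)):  # sorted for a reproducible row order
        try:
            text = extract(path)
        except Exception as err:  # broad on purpose: extraction errors vary by library
            print(f"Skipping {path}: {err}")
            continue
        rows.append({"file": os.path.basename(path), "text": text})
    return rows
```

With pdfminer.six installed, `pd.DataFrame(collect_texts("filepath_to_pdf_collection/*.pdf", extract_text))` would give a frame with `file` and `text` columns.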

Would you be interested in writing such a blog post? It would be great to show how to go from raw data to Pandas/Hero using different tools, including Apache Tika and Pdfminer, Textract, ...

regards,

jbesomi changed the title from "Support reading text data from PDF and PowerPoint" to "Explain how to reading text data from PDF and PowerPoint and use it with Texthero" on Jul 8, 2020
jbesomi changed the title from "Explain how to reading text data from PDF and PowerPoint and use it with Texthero" to "Explain how to read text data from PDF and PowerPoint and use it with Texthero" on Jul 8, 2020
jbesomi added the documentation label on Jul 8, 2020
@igponce

igponce commented Jul 8, 2020

Good point on keeping PDF etc. out of scope: it's very tempting to add stuff, but hard to leave it out.
I'll send you a draft right after I run some experiments myself. Maybe next week.

@jbesomi
Owner

jbesomi commented Jul 8, 2020

Sounds amazing! Looking forward to that!

selimelawwa linked a pull request on Jul 22, 2020 that will close this issue