grid-ripper

Extracts text from PDFs into a .csv file. The text content goes in one column; the other columns hold information about that text, such as its coordinates on the page.

Dependencies: PDFBox 2.x. If you are using the Eclipse IDE, download PDFBox and drag the jar file into the "lib" folder of your Eclipse project.

GridRipper uses PDFBox to get the text from one or more PDFs and writes this information to a .csv file. There is only a single class, which extends PDFBox's PDFTextStripper class.

This class was developed to process documents that litigants produce only in PDF format rather than in their native electronic format. They do this intentionally, to deny us the ability to put the data into a spreadsheet where we can analyze it programmatically (for example, calculating overtime and double time). With the .csv output from this program, you can attempt to rebuild a spreadsheet containing the information held in the PDF.

For example, you can sort by the y-coordinate of the text and then delete all text in the top 1 inch of the page (if every page has a header that you want to get rid of). You can use the x-coordinate to determine which column the data goes into. You can use the y-coordinate (along with the page number) to determine whether two words are on the same line. With that information, you can try to build a spreadsheet manually, or you could use a Java program to parse the .csv file.
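The post-processing described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not GridRipper's own code: the column order `page,x,y,text`, the class name `GridCsvSketch`, the `Word` type, and the `toRows` helper are all hypothetical, and the sketch assumes coordinates are in PDF points (72 per inch) with y measured from the top of the page. Check the actual .csv output and adjust the parsing before relying on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GridCsvSketch {

    /** One extracted piece of text. The field layout is an assumption, not GridRipper's actual format. */
    static final class Word {
        final int page;
        final double x;
        final double y;
        final String text;

        Word(int page, double x, double y, String text) {
            this.page = page;
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Parses a hypothetical "page,x,y,text" line; everything after the third comma is kept as text. */
    static Word parse(String line) {
        String[] parts = line.split(",", 4);
        return new Word(Integer.parseInt(parts[0].trim()),
                Double.parseDouble(parts[1].trim()),
                Double.parseDouble(parts[2].trim()),
                parts[3]);
    }

    /**
     * Drops words whose y is above headerHeight, groups the rest into lines
     * (same page, y within yTolerance of the line's first word), and orders
     * each line left to right by x.
     */
    static List<List<String>> toRows(List<String> csvLines, double headerHeight, double yTolerance) {
        List<Word> words = new ArrayList<>();
        for (String line : csvLines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            Word w = parse(line);
            if (w.y >= headerHeight) { // keep only text below the header band
                words.add(w);
            }
        }
        // Sort top to bottom within each page so that words on one line are adjacent.
        words.sort(Comparator.comparingInt((Word w) -> w.page)
                .thenComparingDouble(w -> w.y));

        List<List<Word>> grouped = new ArrayList<>();
        List<Word> current = null;
        Word first = null;
        for (Word w : words) {
            if (current == null || w.page != first.page || w.y - first.y > yTolerance) {
                current = new ArrayList<>();
                grouped.add(current);
                first = w;
            }
            current.add(w);
        }

        List<List<String>> rows = new ArrayList<>();
        for (List<Word> line : grouped) {
            line.sort(Comparator.comparingDouble((Word w) -> w.x)); // left to right
            List<String> cells = new ArrayList<>();
            for (Word w : line) {
                cells.add(w.text);
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "1,72.0,30.0,Confidential", // inside the top inch, so it is dropped
                "1,200.0,100.4,8.5",
                "1,72.0,100.0,Smith",
                "1,72.0,115.0,Jones",
                "2,72.0,100.0,Lee");
        // 72 points = 1 inch of header; words within 2 points vertically share a line.
        System.out.println(toRows(sample, 72.0, 2.0));
    }
}
```

Grouping by a y tolerance rather than exact equality is a design choice in this sketch, since text on one visual line may not share an identical y value. Assigning each word to a spreadsheet column from its x-coordinate would be a further step, for example by comparing x against column boundaries chosen for the document at hand.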

github@levycivilrights.com

