
OpenReviewKit (ORK): tools for extracting text from PDFs while maintaining a memory of where each bit of text came from (the file, page, and coordinates on the page)

michaelaaronlevy/ork


ork

OpenReviewKit (ORK): tools for extracting text from PDFs while maintaining a memory of where each bit of text came from (the file, page number, and coordinates on the page--referred to as the "contextual information").

ORK can extract text from a PDF and save it directly to a spreadsheet. Each row holds one bit of text, and additional columns hold the contextual information for that bit of text: which file it came from, which page, the coordinates on the page, the rotation, and the size of the text.

You can open the newly created spreadsheet with Excel and sort by the content of the text, or by any of the contextual information. For example, if the PDF has a header in the top two inches of every page, you can sort by y-coordinate, and every bit of text that is part of the header will sort together. You can then delete all of the header text, which is useful if you want to isolate the more interesting information in the body of the document.
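The header example above can also be done in code rather than in Excel. The following is a minimal sketch, not part of ORK: the `TextBit` class and its field names are hypothetical stand-ins for one spreadsheet row, and it assumes the y-coordinate is measured in points (72 per inch) downward from the top of the page. Check how the y-coordinate is oriented in your own spreadsheet before relying on the cutoff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderFilter {
    /** Hypothetical stand-in for one spreadsheet row; not an ORK class. */
    static final class TextBit {
        final int page;
        final double x;
        final double y; // assumed: points from the top of the page
        final String text;

        TextBit(int page, double x, double y, String text) {
            this.page = page;
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    static final double POINTS_PER_INCH = 72.0;

    /** Keeps only the bits of text located at or below the header band. */
    static List<TextBit> dropHeader(List<TextBit> bits, double headerInches) {
        double cutoff = headerInches * POINTS_PER_INCH;
        List<TextBit> body = new ArrayList<>();
        for (TextBit bit : bits) {
            if (bit.y >= cutoff) {
                body.add(bit);
            }
        }
        return body;
    }

    public static void main(String[] args) {
        List<TextBit> bits = Arrays.asList(
                new TextBit(1, 72, 36, "ACME Payroll Report"),
                new TextBit(1, 72, 200, "Jane Doe"),
                new TextBit(2, 72, 36, "ACME Payroll Report"),
                new TextBit(2, 72, 310, "John Roe"));
        // Prints "1: Jane Doe" and then "2: John Roe"
        for (TextBit bit : dropHeader(bits, 2.0)) {
            System.out.println(bit.page + ": " + bit.text);
        }
    }
}
```

If your coordinates are measured from the bottom of the page instead, the comparison flips: keep the bits whose y value is below the page height minus the header height.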

If the original PDF is just a big table of data, you can sort by x-coordinate to isolate all of the text that belongs to a specific column in the PDF. You can use the coordinates (and page numbers) to identify, for every bit of text, which row and which column of the original table it belongs to, and then use that information to re-create the table in Excel. You will then have all of the information from the original PDF loaded into an Excel spreadsheet, where you can use Excel's tools to analyze it. I have found this to be especially useful for analyzing payroll information and employee time clock records.
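As a rough illustration of the row-and-column idea above, here is a sketch that rebuilds a table from coordinates. It is not ORK code: `TextBit` is a hypothetical stand-in for one spreadsheet row, the column left edges are values you would read off the sorted x-coordinates yourself, and it assumes that bits of text on the same table row share (nearly) the same y-coordinate and that y increases down the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TableRebuilder {
    /** Hypothetical stand-in for one spreadsheet row; not an ORK class. */
    static final class TextBit {
        final int page;
        final double x;
        final double y;
        final String text;

        TextBit(int page, double x, double y, String text) {
            this.page = page;
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Index of the last column whose left edge is at or before x (0 if none). */
    static int columnOf(double x, double[] columnLeftEdges) {
        int column = 0;
        for (int i = 0; i < columnLeftEdges.length; i++) {
            if (x >= columnLeftEdges[i]) {
                column = i;
            }
        }
        return column;
    }

    /** Groups text by page, then by rounded y (the row), then by column. */
    static List<String[]> rebuild(List<TextBit> bits, double[] columnLeftEdges) {
        Map<Integer, Map<Long, String[]>> pages = new TreeMap<>();
        for (TextBit bit : bits) {
            Map<Long, String[]> rows =
                    pages.computeIfAbsent(bit.page, k -> new TreeMap<Long, String[]>());
            String[] cells = rows.computeIfAbsent(Math.round(bit.y), k -> {
                String[] empty = new String[columnLeftEdges.length];
                Arrays.fill(empty, "");
                return empty;
            });
            int column = columnOf(bit.x, columnLeftEdges);
            // Several bits can land in one cell; join them in input order.
            cells[column] = cells[column].isEmpty() ? bit.text : cells[column] + " " + bit.text;
        }
        List<String[]> table = new ArrayList<>();
        for (Map<Long, String[]> rows : pages.values()) {
            table.addAll(rows.values());
        }
        return table;
    }

    public static void main(String[] args) {
        List<TextBit> bits = Arrays.asList(
                new TextBit(1, 200, 100.2, "40.0"),
                new TextBit(1, 50, 100, "Doe"),
                new TextBit(1, 50, 115, "Roe"),
                new TextBit(1, 200, 115, "37.5"));
        // Prints "Doe,40.0" and then "Roe,37.5"
        for (String[] row : rebuild(bits, new double[] {50, 200})) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Rounding y to the nearest point is the simplest possible row test. A real table with uneven baselines may need a tolerance band instead.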

ORK can also be used to provide such information directly to another program in a convenient format (see the MemenText and MemenPage classes). The API is well-documented and intuitive.

As a proof of concept, and useful in its own right, ORK comes with a program to convert AT&T phone bills directly into a spreadsheet. Each row represents a single call, text message, or data transfer, and the columns contain information about the communication, such as the time/date and sender/recipient. Because AT&T Account Statements have a fairly rigid format, it was not difficult to create a class specifically for converting them into this more useful format. The AT&T Account Statement program is contained in a single Java file with only about 283 lines of code. Any kind of PDF that follows rigid rules (which is to say, most PDFs that are generated from information kept in a database, such as payroll reports, time punch information, credit card statements, etc.) may be a good target for this type of custom-written converter using the OpenReviewKit API.

ORK can also create a searchable word index for multiple PDFs (even thousands of PDFs), so you can conduct keyword searches (with nested boolean queries) and then browse the search results with a graphical tool that shows one page at a time. You can move forward or backward one page at a time, or one file at a time, or jump to any page, or jump between search results. So if, out of 10,000 pages, you identify seven pages that contain all three of the words "manager," "overtime," and "retaliation," you can view those seven pages one after another. It is as simple as creating the word index (drag and drop all of the files into a window and press "execute"), creating the search ("manager & overtime & retaliation"), viewing the search results ("view (manager & overtime & retaliation)"), and then pressing an arrow key six times to move between the seven pages.
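To show the general idea behind a word index and an "&" query, here is a small sketch. It is an illustration only and says nothing about how ORK implements its index: the class and method names are hypothetical, pages are identified by a plain integer, and only a flat "a & b & c" query is handled (no nesting, no "or", no "not").

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class WordIndexSketch {
    // Maps each lower-cased word to the IDs of the pages that contain it.
    private final Map<String, Set<Integer>> pagesByWord = new HashMap<>();

    /** Splits the page text into words and records the page under each word. */
    void addPage(int pageId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                pagesByWord.computeIfAbsent(word, k -> new TreeSet<Integer>()).add(pageId);
            }
        }
    }

    /** Returns the pages containing every term of a query like "a & b & c". */
    Set<Integer> searchAll(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("&")) {
            String word = term.trim();
            if (word.isEmpty()) {
                continue;
            }
            Set<Integer> pages = pagesByWord.getOrDefault(word, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(pages); // first term: start from its pages
            } else {
                result.retainAll(pages); // later terms: intersect
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        WordIndexSketch index = new WordIndexSketch();
        index.addPage(1, "The manager approved overtime.");
        index.addPage(2, "Manager denied overtime; retaliation followed.");
        index.addPage(3, "Retaliation claim filed.");
        // Prints [2]
        System.out.println(index.searchAll("manager & overtime & retaliation"));
    }
}
```

Because the index is built once and each query only intersects small sets of page IDs, the same search can be repeated over thousands of pages without re-reading the PDFs.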

What is ORK not good for? It only works with PDFs, and only with PDFs that contain text: any images or other non-text content will be ignored. ORK does not perform OCR and is generally not very useful for fixing OCR errors. If a PDF is password-protected, ORK may not be able to extract text from it.

To run ORK, if you have Java 8 installed, all you need to do is double-click on the runnable jar file (OpenReviewKit-v1.0.0.jar). You can download a Java 8 installer from adoptopenjdk.net. ORK is compatible with later versions of Java, but those versions may not let you run the jar file by double-clicking on it. In that case, create a batch file or use the command line, for example: java -jar OpenReviewKit-v1.0.0.jar

To open the ORK GUI programmatically, call the main method of the class OpenReviewKit. The GUI supports drag-and-drop, so it is easy to tell ORK which PDF(s) you want to extract from. You choose the mode of operation (e.g., whether it extracts text to a .csv file, extracts to an .ods file, or creates a searchable word index) by left-clicking or right-clicking on the mode button. When you have selected the PDFs and the mode, click the "execute" button.

If you want to make an application that uses PdfToTextGrid to extract text from one or more PDFs, and then processes that output in some way, and you do not want to use the ORK GUI, the easiest way would be to implement the PageConsumer interface. I prepared code that you can use as a starting point for such an application. It is at: com.github.michaelaaronlevy.ork.sorcerer.YourApplication. The source code for "YourApplication" is in the public domain.

If you want to make an application that does the same thing, except it uses a GUI with drag & drop to select the PDFs to act on, and uses a GUI to prompt the user to select a file for the application's text output, I prepared code that you can use as a starting point for such an application. It is at: com.github.michaelaaronlevy.ork.sorcerer.YourApplicationWithDragDrop. The source code for "YourApplicationWithDragDrop" is in the public domain.
