PDF file security and privacy auditing tool written in Python
The tool parses a pdf file looking for the following keys: (GoTo), GoToR and GoToE, Launch, URI, SumbitForm, JavaScript, OpenAction, and AA.
PDFaudit outputs the content of the corresponding values and object location to aid assessment of security and privacy risks.
The tool is currently in development, with approximately 90% of the required code implemented. The pdf standard has multiple options to store document information, which makes it a challenge to cover all possible scenario's. Moreover, the document specification is at some instances less concisely defined, making room for multiple interpretations. Combined with the current development phase of pdfaudit, a 100% success rate can not yet be guaranteed of the tool parsing each pdf correctly in the first place, and detecting and reporting the security and privacy threats. The end goal however is to be compatible with all features defined in ISO 32000-1:2008, which covers PDF versions up to and including PDF1.7.
PDFaudit is written in Python, and uses Python3 code. For windows and OSx, see https://www.python.org/downloads/. For Linux, Python3 may already have been installed or can be retrieved using your package manager and native repositories.
Simply download the github code to your computer, and running it.
Using the python command, explicitly using python3:
python3 pdfaudit.py inputfile.pdf
On recent versions of Windows, the following may work as well:
pdfaudit.py inputfile.pdf
To be able to do the same on Linux, the script has to be made executable first:
chmod u+x pdfaudit.py
before the script can be executed like:
pdfaudit.py inputfile.pdf
- DCTDecode filter
- Refactoring iterations in general
- Speed optimizations (like don't uncompress if not needed)
- Determine if a split between high and low-risk threats is useful
- Summarize exceptions that occurred using filters
- Linearized pdf's (still relevant? we are able to process objectstreams)
- Implement predictor filter in regular object streams, like we did for cross reference streams
- Check the content of streams themselves
- indirect references in dictionaries are now followed for OpenAction and AA as well.
- incorrect object numbering of object streams
- Progress indication also when retrieving pdf document structure
- Progress indicator displayed previous object number
- Incorrect escape of literal strings
- Print only printable characters when showing threats
- Scans pdf for objects independent of cross reference tables. Hardening / handling of malformed pdf's: ability to handle incorrect location of xref, reference to incorrect location of objects, of non-existent objects.
- Filter function is now byte-wise (resulting in incorrect object numbers, locations, etc)
- Output number of scanned objects
- Checks pdf header
- Handling of escaped backlashes in literal strings, translation of hexadecimal strings, and hex characters in names (obfuscation)
- Printing non-printable characters (verbosity >1)
- Ability to find cross reference tables despite incorrect position references to them
- Option to print pdf document structure to screen with object locations
- Crossreference lists not used as global variables in functions
- Comment handling for some EOF situations
- /OpenAction and /AA
- Fixed incorrect escape handling in strings
- Ability to scan object streams
- Some speed optimizations
- Only read DecodeParms if they exist
- EOF was not handled correctly in readword function
- Ability to read cross reference streams
First rudimentary version
pdfaudit Copyright (C) 2020 Joseph Heller
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
- Didier Stevens for sharing his knowledge on malicious pdf files
- Yusuke Shinyama, for some of the decoding filters from pdfminer