Improve PDF support, cover all versions, better hash format #3976

Open
JDLH opened this issue Mar 27, 2024 · 0 comments

Describe the feature
Make one hashcat invocation able to find passwords for hashes of any version of PDF file, including owner as well as user password. Also, make an improved hash format which a) can describe hashes for any version of PDF file, b) can include a filename and path of the PDF document described by each hash, which may need to be c) a structured format like JSON or bencode rather than the current asterisk-separated flat file format, with d) clear documentation of the hash format, and e) an official, reliable pdf-to-hash tool.

Current behavior
Currently there are seven different hash types for PDF: -m 10400, 10410, 10420, 10500, 25400, 10600, 10700. They each address one of four different Acrobat versions and PDF encryption specs. The hash file formats for each type are similar enough to be confusable, but different enough that each hash type requires the exact hash file format it supports, and only that. The hash file formats are not, as far as I can see, documented outside the module code. PDF has two passwords, whereas hashcat and the hash file formats are really only set up to handle one password. One hash type, 25400, can find the owner password as well as the user password, but it does not support the most recent versions of PDF.

There is no official pdf-to-hash tool. The one pointed to in the FAQ, stricture's pdf2hashcat.py, generates invalid output. The tool appears to be a one-off, just-good-enough parser of PDF, rather than reusing the well-exercised code of a general PDF library like pypdf. Its usage is poorly documented, both on its own and in combination with hashcat.

There are several versions of PDF language, and several versions of encryption specification. This number will continue to grow as PDF evolves.

A collection of PDF files will probably include multiple PDF language and encryption versions. Right now, hashcat requires the user to identify the encryption version for each PDF file, identify which hashcat hash type corresponds with each encryption version, and run hashcat separately for each encryption version. Then hashcat --show returns a list of hashes, with no clue about which PDF file each hash corresponds to. The user must do separate tracking of path and PDF filename for each hash.
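
To illustrate the manual sorting step described above, here is a minimal Python sketch that reads the first two asterisk-separated fields of each `$pdf$` hash line and groups the lines by a candidate -m value. The mapping table is my assumption, inferred from the example hashes I have seen for these modes, and should be checked against the hashcat module source. The function names `guess_mode` and `group_by_mode` are hypothetical, and the sample lines use placeholder payloads rather than real hash data. The sketch does not try to choose among modes that cover the same encryption version (for example 10400 versus 10410 and 10420, or 10500 versus 25400).

```python
from collections import defaultdict

# ASSUMPTION: (first field, second field) -> hashcat mode. Inferred from
# example hashes, not from any official format documentation.
MODE_BY_FIELDS = {
    ("1", "2"): 10400,
    ("2", "3"): 10500,
    ("4", "4"): 10500,
    ("5", "5"): 10600,
    ("5", "6"): 10700,
}

PREFIX = "$pdf$"


def guess_mode(hash_line):
    """Return a candidate -m value for one hash line, or None if unrecognised."""
    line = hash_line.strip()
    if not line.startswith(PREFIX):
        return None
    fields = line[len(PREFIX):].split("*")
    if len(fields) < 2:
        return None
    return MODE_BY_FIELDS.get((fields[0], fields[1]))


def group_by_mode(hash_lines):
    """Group non-empty hash lines by candidate mode (None = unrecognised)."""
    groups = defaultdict(list)
    for raw in hash_lines:
        line = raw.strip()
        if line:
            groups[guess_mode(line)].append(line)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder payloads ("aa", "bb", "cc") stand in for real hash data.
    sample = [
        "$pdf$1*2*40*-1*0*16*aa*32*bb*32*cc",
        "$pdf$5*6*256*-1028*1*16*aa*127*bb*127*cc",
        "not a pdf hash",
    ]
    for mode, lines in group_by_mode(sample).items():
        print(mode, len(lines))
```

Each group would still need to be written to its own hash file and passed to a separate hashcat run, which is the inconvenience this request is about.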

The hash file format is a flat, text, asterisk-separated values (ASV) format. It does not easily label the differing fields required by differing versions of the PDF encryption specifications. It does not easily accommodate two passwords, or a path and filename of the source PDF file.

Overall, using hashcat to recover passwords for a collection of PDF files is confusing for a user.

Expected behavior
A user should be able to recover both user and owner passwords for a collection of PDF files of differing versions as one procedure. The hashcat documentation should point the user to a single, reliable, hashcat-endorsed utility for creating hashes from multiple PDF files. The user should not need to concern themselves with the versions of the PDF files. The user invokes hashcat once, with a single -m option that covers all PDF versions, and passes hashcat a single hash file with entries for all the PDF files regardless of version. hashcat finds both owner and user passwords, as appropriate, for each file. hashcat --show displays a list of PDF files by filename and path, with the user and owner passwords for each file, and as much of the hash detail as is helpful.

Over time, as future PDF format and encryption versions appear, hashcat can add support for those versions, and users can continue to use the same hashcat tools and invocation for PDF files of those newer versions.

The hash file format is documented. It is practical for developers to make other tools to generate hashes in that format. The file format gracefully accommodates new PDF versions. I suspect that a flat file format will be inadequate for these requirements, so the hash will need a structured format with mapping structures (associative arrays), such as JSON or bencode. There are interesting design decisions to make about whether the hash format needs to remain plain-text and line-oriented; maybe it need not.
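
As one possible shape for such a format, here is a minimal Python sketch that stores one JSON object per line, so the file stays plain-text and line-oriented while each field is labelled. Every key name (`format`, `source`, `encryption`, `fields`) and every function name is hypothetical; nothing here is an existing hashcat or John the Ripper format, and the field values are placeholders.

```python
import json


def make_record(path, version, revision, key_bits, fields):
    """Build one hash record. All key names are hypothetical."""
    return {
        "format": "pdf",
        "source": path,  # path and filename of the PDF document
        "encryption": {"V": version, "R": revision, "bits": key_bits},
        "fields": fields,  # labelled values, which may differ per version
    }


def dump_records(records):
    """Serialise records as JSON Lines: one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def load_records(text):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    records = [
        # Placeholder values only; "aa"/"bb"/"cc" are not real hash data.
        make_record("docs/old.pdf", 1, 2, 40, {"id": "aa", "u": "bb", "o": "cc"}),
        make_record("docs/new.pdf", 5, 6, 256, {"id": "aa", "u": "bb", "o": "cc"}),
    ]
    text = dump_records(records)
    print(text)
    assert load_records(text) == records
```

Because fields are named rather than positional, a newer encryption version could add keys without changing the meaning of existing ones, and the `source` key carries the path that --show currently cannot report.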

The same hash file format works with a future version of John the Ripper and other cooperating password-recovery tools.

Hashcat version

  • OS: macOS Monterey 12.7.4
  • Distribution: MacPorts
  • Version: 6.2.6

However, the issues in this feature request are relevant to basically all OSs and all recent hashcat versions.

Additional context
This request has a large scope. It will probably be necessary to break it down into sub-projects, and design discussions will probably be needed.

This request is motivated by my experience as a new user of hashcat, with a collection of PDF files spanning decades, wanting to use hashcat conveniently. I am naive about the constraints of hashcat's architecture and history. I welcome comments.
