Review CertificateParser to support new tika "x-x509-cert" contentType. #1978

Open
patrickdalla opened this issue Nov 14, 2023 · 1 comment · May be fixed by #1981

Comments

@patrickdalla
Collaborator

patrickdalla commented Nov 14, 2023

I have just reviewed the Pkcs7Parser code from Tika.

PKCS7 is a container spec that holds content and its signature info in the same file/stream. Tika's Pkcs7Parser only strips/ignores the signature and delegates the content parsing to the corresponding parser. It doesn't parse the signature or the respective certificate information.

PKCS7 is mostly used to store certificate revocation lists and certificate files themselves (when they include all the certificates of the certification path). The CertificateParser uses java.security.cert.CertificateFactory, which can extract the certificates contained in these PKCS7-formatted files.
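
As a rough illustration of the CertificateFactory behaviour described above, here is a minimal sketch using only the JDK. The class and method names (`Pkcs7CertExtractor`, `extractCertificates`) are hypothetical and are not taken from CertificateParser. It relies on the documented behaviour of `generateCertificates` for the "X.509" factory, which accepts a PKCS#7 certificate chain as well as plain DER or PEM certificates.

```java
import java.io.ByteArrayInputStream;
import java.security.cert.Certificate;
import java.security.cert.CertificateException;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.List;

public class Pkcs7CertExtractor {

    // Hypothetical helper: returns every X.509 certificate the factory finds
    // in the given bytes (PKCS#7 chain, or DER/PEM encoded certificates).
    public static List<X509Certificate> extractCertificates(byte[] data)
            throws CertificateException {
        CertificateFactory factory = CertificateFactory.getInstance("X.509");
        List<X509Certificate> result = new ArrayList<>();
        for (Certificate cert : factory.generateCertificates(new ByteArrayInputStream(data))) {
            // The X.509 factory yields X509Certificate instances; the check is defensive.
            if (cert instanceof X509Certificate) {
                result.add((X509Certificate) cert);
            }
        }
        return result;
    }
}
```

Each returned certificate could then be exposed as a subitem of the PKCS7 file; how that is wired into the parser is outside the scope of this sketch.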

PKCS7 is not the format of the certificate used to sign the APK.

It seems from https://issues.apache.org/jira/browse/TIKA-3205, whose code was written after CertificateParser was implemented, that Tika previously didn't classify PEM and DER files as "x-x509-ca-cert". But now it does.

In CertificateParser, I created the "application/x-pem-file" and "application/pkix-cert" mime-types to identify this kind of content, but it seems it can now use the new "application/x-x509-ca-cert" type identified by Tika.
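
To make the idea concrete, a simplified sketch of accepting the new type alongside the two older ones could look like the following. It uses plain strings instead of Tika's `MediaType` so that it runs without dependencies. The class name `CertificateMimeTypes` and the method `isSupported` are hypothetical, and stripping parameters such as `; format=pem` is an assumption about how detected types might be reported, not something confirmed here.

```java
import java.util.Locale;
import java.util.Set;

public class CertificateMimeTypes {

    // The two types created for CertificateParser plus the one now reported by Tika.
    public static final Set<String> SUPPORTED = Set.of(
            "application/x-pem-file",
            "application/pkix-cert",
            "application/x-x509-ca-cert");

    // Returns true if the content type (ignoring any parameters) is a certificate type.
    public static boolean isSupported(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return SUPPORTED.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```

In a real Tika parser the same set would be declared as `MediaType` values returned from `getSupportedTypes`.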

patrickdalla added a commit that referenced this issue Nov 14, 2023
classify DER and PEM encoded certificates as this kind of mimetype.
Other minor formatting changes were included.
@lfcnassif
Member

lfcnassif commented Nov 14, 2023

Thanks @patrickdalla. I'll try to crawl certificate samples to test CertificateParser, so we can enable it by default if everything seems good.

patrickdalla added a commit that referenced this issue Nov 14, 2023
to be expanded and have its certificates extracted as subitems.
patrickdalla added a commit that referenced this issue Nov 14, 2023
patrickdalla added a commit that referenced this issue Nov 14, 2023
as subitems if in format PKCS7 and to be used in conjunction with tika
PKCS7Parser.
patrickdalla linked a pull request Nov 14, 2023 that will close this issue
patrickdalla added a commit that referenced this issue Nov 14, 2023