PDF detected as application/octet-stream #285

peric · 2022-05-10T09:43:45Z

Attach the file for which the detection is inaccurate

Unfortunately, I am not able to share the original file, but I've tried to fake it and create a new one.

fake-pdf.pdf

Expected MIME type

application/pdf

Returned MIME type

application/octet-stream

Version of the library you are using

1.4.0

Output of go version

go version go1.18.1 darwin/arm6

Additional context

As mentioned above, I am not able to share the real PDF file, but I've tried to fake it.

If you open the file above, you'll see just a blank document. But if you open its source, you'll see that %PDF-1.4 is on the 2nd line and not on the 1st. This is exactly the problem I found in my real PDF file. The only difference is that my real PDF opens locally (e.g. with Preview on macOS) without any issues, whereas the faked one went blank after I moved %PDF-1.4 to the 2nd line.

So I understand that my file is kinda corrupted (although the customer got it from some accounting system), but it still opens locally without any issues. However, when I pass it to the Detect function, it returns application/octet-stream.

Not sure if this is something that can or should be fixed, but let's see. Thanks in advance.

gabriel-vasile · 2022-05-16T20:19:40Z

Hi, @peric

Thank you for reporting this issue. It should be fixed now.
If you don't want to wait for the next release, you can upgrade to the latest commit to test:

go get -u github.com/gabriel-vasile/mimetype@v1.4.1-rc1.0.20220516201151-14adc293c750
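For readers following along, here is a minimal sketch of what tolerating newline- and BOM-prefixed signatures could look like. It is not the code from the commit: `hasPDFSignature` is a hypothetical helper built only on the standard library, and it assumes the only tolerated prefixes are a single line feed or a UTF-8 byte order mark.

```go
package main

import (
	"bytes"
	"fmt"
)

// pdfMagic is the usual PDF signature.
var pdfMagic = []byte("%PDF-")

// tolerated lists the prefixes this sketch accepts before the signature.
// Assumption: a single "\n" or a UTF-8 BOM; the real library may differ.
var tolerated = [][]byte{
	nil,                // signature at offset 0
	[]byte("\n"),       // leading line feed
	{0xEF, 0xBB, 0xBF}, // UTF-8 byte order mark
}

// hasPDFSignature (hypothetical helper) reports whether raw starts with
// "%PDF-", optionally preceded by one of the tolerated prefixes.
func hasPDFSignature(raw []byte) bool {
	for _, p := range tolerated {
		if bytes.HasPrefix(raw, p) && bytes.HasPrefix(raw[len(p):], pdfMagic) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasPDFSignature([]byte("%PDF-1.4\n")))            // true
	fmt.Println(hasPDFSignature([]byte("\n%PDF-1.4\n")))          // true
	fmt.Println(hasPDFSignature([]byte("-----28944242429299\n"))) // false
}
```

The last call shows the limit of this approach: anything other than the listed prefixes in front of the signature still fails the check.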

peric · 2023-04-17T11:59:07Z

@gabriel-vasile

Hey there, it's me again 🙃

I stumbled upon a similar example, so I'll mention it here instead of opening another issue (at least for now).

Basically, the beginning of the source for the file provided above looks like this:


%PDF-1.4
%�쏢
5 0 obj

In the example I currently have, the source starts like this:

-------------------------------28944242429299
Content-Disposition: form-data; name="example.pdf"; filename="example.pdf"
Content-Type: application/x-gzip

%PDF-1.4
%����

The file opens normally in a PDF reader, but mimetype.Detect returns application/octet-stream.

Is this also something that can be covered with a similar solution? Thanks in advance.
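The header shown above looks like a multipart/form-data body that was saved to disk as-is, so the PDF bytes only begin after the part headers. A minimal caller-side sketch of unwrapping such an envelope before checking the signature follows. `firstPartBody` is a hypothetical helper, not part of mimetype and not necessarily the workaround used in this thread. It assumes the first line is the opening delimiter (`--` plus the boundary) and that the closing delimiter is present.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"strings"
)

// firstPartBody (hypothetical helper) treats raw as a multipart body,
// derives the boundary from the first line, and returns the content of
// the first part with its headers stripped.
func firstPartBody(raw []byte) ([]byte, error) {
	line, err := bufio.NewReader(bytes.NewReader(raw)).ReadString('\n')
	if err != nil {
		return nil, err
	}
	line = strings.TrimRight(line, "\r\n")
	// The opening delimiter is "--" followed by the boundary string.
	if !strings.HasPrefix(line, "--") || len(line) == 2 {
		return nil, fmt.Errorf("no multipart delimiter on the first line")
	}
	boundary := strings.TrimPrefix(line, "--")

	part, err := multipart.NewReader(bytes.NewReader(raw), boundary).NextPart()
	if err != nil {
		return nil, err
	}
	return io.ReadAll(part)
}

func main() {
	// Made-up sample that mimics the structure quoted in the comment.
	sample := "--XYZ\r\n" +
		"Content-Disposition: form-data; name=\"example.pdf\"; filename=\"example.pdf\"\r\n" +
		"Content-Type: application/x-gzip\r\n" +
		"\r\n" +
		"%PDF-1.4\r\n%%EOF\r\n" +
		"--XYZ--\r\n"

	body, err := firstPartBody([]byte(sample))
	if err != nil {
		fmt.Println("unwrap failed:", err)
		return
	}
	fmt.Println(bytes.HasPrefix(body, []byte("%PDF-"))) // true
}
```

The unwrapped bytes could then be handed to whatever detection routine is in use, since the signature is back at offset 0.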

gabriel-vasile · 2023-04-20T02:34:35Z

Hi @peric,
Please help me debug this issue. The problem PDF should have been detected by the regular signature:

mimetype/internal/magic/document.go

Lines 9 to 10 in 6f575b3

// usual pdf signature
[]byte("%PDF-"),

Please show the output of running xxd the_file.pdf | head -2 on the command line.

Also, it would help to know how this PDF was created. If this is not against your privacy concerns, please show the output of
strings the_file.pdf | grep "Creator\|Producer".

peric · 2023-05-11T18:24:12Z

Hey @gabriel-vasile, sorry for the late reply. The notification got lost somewhere and I forgot to get back to you. We also found a workaround in the meantime, which is one more reason I forgot to answer.

The output of xxd the_file.pdf | head -2:

00000000: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d  ----------------
00000010: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d32  ---------------2

And the output of strings the_file.pdf | grep "Creator\|Producer" (I deleted the last line which refers to the vendor):

<pdf:Producer>Antenna House PDF Output Library 7.1.1639</pdf:Producer>
<xmp:CreatorTool>AH CSS Formatter V7.1 MR2 for Linux64 : 7.1.3.50324 (2021-04-26T09:47+09)</xmp:CreatorTool>
/Creator (AH CSS Formatter V7.1 MR2 for Linux64 : 7.1.3.50324 \(2021-04-26T09:47+09\))
/Producer (Antenna House PDF Output Library 7.1.1639)
  <xmp:CreatorTool>AH CSS Formatter V7.1 MR2 for Linux64 : 7.1.3.50324 (2021-04-26T09:47+09)</xmp:CreatorTool>

Hope that helps.

gabriel-vasile · 2023-05-17T03:21:26Z

Thank you, @peric!

I generated some PDFs using Antenna House but couldn't reproduce the issue.

I have one more question for you: what's the output of

file --mime the_file.pdf

If it is application/pdf then I'll look more into the issue and how file does detection compared to mimetype.
If it is not application/pdf then I'm sorry, I don't think it's ok to add logic to detect any kind of corrupted files.

gabriel-vasile added a commit that referenced this issue May 16, 2022

Add missing PDF signatures

62fd23c

For newline and BOM prefixed signatures. For #285

gabriel-vasile mentioned this issue May 16, 2022

Add missing PDF signatures #289

Merged

ysyrota mentioned this issue May 20, 2022

Added support for PDF files started with 0xA #292

Closed

peric closed this as completed Sep 4, 2022

gabriel-vasile reopened this Apr 20, 2023
