Docx file created by LibreOffice returns Zip mime type #38

Open
RangerMak opened this issue Nov 23, 2017 · 2 comments

Comments

@RangerMak

I think it's because LibreOffice (OpenOffice) puts the files into the docx archive in a different order than MS Word.
So the file has a different signature and is detected as a plain zip archive.

Example file is attached.
DocxByLibreOffice.docx

@j256
Owner

j256 commented Dec 6, 2017

That's correct because it is a zip file. For me to detect it as something else means that I have to open it up and process the file contents which I'm less interested in doing. I'm specifically worried about processing a large zip file just to see if it is a doc file. But maybe I can read in the first X bytes of the zip file and look for key files....
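
A rough sketch of the "read the first X bytes and look for key files" idea, for illustration only. It is not code from this library. `ZipPrefixSniffer` and its method names are hypothetical, and it assumes that a Word document contains an entry named `word/document.xml`, which is typical but not guaranteed. It scans a bounded prefix for ZIP local file header signatures (`PK\3\4`) and reads the entry name stored after each one, so entry order does not matter as long as the key entry's header falls inside the prefix. Nothing is decompressed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: sniffs entry names from the start of a zip file. */
public class ZipPrefixSniffer {

    /** Returns the entry names whose local file headers lie fully within the first 'limit' bytes. */
    public static List<String> entryNames(byte[] data, int limit) {
        int end = Math.min(limit, data.length);
        List<String> names = new ArrayList<>();
        for (int i = 0; i + 30 <= end; i++) {
            // Local file header signature: 'P' 'K' 0x03 0x04
            if (data[i] != 'P' || data[i + 1] != 'K' || data[i + 2] != 3 || data[i + 3] != 4) {
                continue;
            }
            // The file name length is a little-endian 16-bit value at offset 26.
            int nameLen = (data[i + 26] & 0xFF) | ((data[i + 27] & 0xFF) << 8);
            if (i + 30 + nameLen > end) {
                continue; // name runs past the inspected prefix (or a false signature match)
            }
            // The name starts right after the fixed 30-byte header.
            names.add(new String(data, i + 30, nameLen, StandardCharsets.UTF_8));
            i += 30 + nameLen - 1; // resume scanning after the name
        }
        return names;
    }

    /** Assumption: a docx is recognised by the presence of "word/document.xml". */
    public static boolean looksLikeDocx(byte[] data, int limit) {
        return entryNames(data, limit).contains("word/document.xml");
    }

    /** Builds a small in-memory zip with the given entry names, in the given order (demo only). */
    public static byte[] zipOf(String... names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

For example, `ZipPrefixSniffer.looksLikeDocx(ZipPrefixSniffer.zipOf("_rels/.rels", "word/document.xml"), 4096)` inspects at most 4096 bytes. The scan is a heuristic: the signature bytes could in principle also occur inside compressed data, and a large entry placed before `word/document.xml` can push that header beyond the prefix, in which case the file would still be reported as a plain zip.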

@fjtorres

@j256 Did you add something to detect this case as a DOCX file? I ran into a similar problem and tried several ways to detect it as a DOCX file, but only the "Tika core" library detected this case correctly.

Tika tika = new Tika();
String mimeType = tika.detect(filePath);
// output mime type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
