Pass MIME type and file extension to Tika #7

Open
lukasgraf opened this issue Nov 11, 2013 · 2 comments

@lukasgraf
Member

Currently, the original document passed to Tika is stored in a temporary file without a file extension.

We should probably:

  • store the document in a temporary file using the same extension as the original file
  • pass the MIME type the transform gets from the TransformEngine's convertTo method on to Tika
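
A minimal sketch of the first point, assuming the transform has the document's bytes and its original filename at hand. The helper name `write_temp_with_extension` is hypothetical and not part of the existing code:

```python
import os
import tempfile


def write_temp_with_extension(data, original_filename):
    """Write `data` to a temp file that keeps the original file's extension.

    Hypothetical helper for illustration, not part of the existing transform.
    """
    # splitext('report.docx') gives ('report', '.docx'); the extension is ''
    # when the filename has none (or when no filename is known at all).
    _, ext = os.path.splitext(original_filename or '')
    fd, path = tempfile.mkstemp(suffix=ext)
    with os.fdopen(fd, 'wb') as tmp:
        tmp.write(data)
    return path
```

The caller is responsible for removing the file again (for example with `os.remove(path)`) once Tika has processed it.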
@ghost ghost assigned lukasgraf Nov 11, 2013
@lukasgraf
Member Author

It seems the Apache Tika command line interface doesn't support passing in the MIME type of the document (or any other metadata, for that matter).
Tika's Detector interface would take such metadata into account, but the metadata argument seems to be exposed only in the Tika API, not in the command line interface.

@lukasgraf
Member Author

So this leaves us with one option: set the file extension of the temporary file and let Tika's MIME type detection do its work.
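
Roughly, the call could then look like the sketch below. It assumes Tika is run as the `tika-app` JAR via `java -jar` and asked for plain-text output with `--text`. The function names and the JAR location are placeholders, not what the transform currently does:

```python
import subprocess


def build_tika_command(tika_jar, document_path):
    """Return the argument list for a plain-text extraction run.

    `tika_jar` is the path to the tika-app JAR (an assumption about the
    deployment); `document_path` is the temp file with the right extension.
    """
    # No MIME type is passed: the CLI has no option for it, so detection
    # relies on the file's content and its extension.
    return ['java', '-jar', tika_jar, '--text', document_path]


def extract_text(tika_jar, document_path):
    """Run Tika on the document and return its stdout as bytes."""
    return subprocess.check_output(build_tika_command(tika_jar, document_path))
```

Keeping the command construction in its own function makes it easy to check the arguments without having Java or Tika installed.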

The Tika Content Detection docs say that Tika:

  • First uses magic bytes in the input stream
  • Then refines this detection result using the file extension (if available)
  • And then refines it again using the content type from the supplied metadata (which we can't set)

The command line interface help describes a switch

-d  or --detect        Detect document type

This seems to be enabled by default (otherwise, converting a temporary file with no extension wouldn't have worked). Still, we should probably pass this switch explicitly to be sure content type detection is always performed.
