Support for custom extensions #14

Open
skin27 opened this issue Mar 7, 2017 · 2 comments

Comments

@skin27

skin27 commented Mar 7, 2017

I would like to scan files with custom extensions as well. I work a lot with structured documents such as .csv, .xml, and .json. These could be scanned like normal text files.

@mirkosertic
Owner

Ah, a good requirement! But what about document metadata? I don't think authors can be extracted from these files; the only viable information would be the last-modified date and the detected content language. Maybe the new NLP features could find some named entities, but I don't think there are more options here. What do you think?

@mlt
Contributor

mlt commented May 29, 2019

One can extend Tika to extract metadata if those XML, JSON, etc. files have a known structure and contain the necessary information.
Since there will always be someone who says "I miss extension X", I wonder whether it would make sense to use patterns to select which files get scanned?
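
To make the pattern idea a bit more concrete, here is a minimal sketch of how user-configured glob patterns could decide which files get scanned. It only uses the JDK's `PathMatcher`. The `ScanPatterns` class, its `shouldScan` method, and the example patterns are hypothetical names for illustration and are not part of this project's code.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the project): decides whether a file
// should be scanned, based on a list of user-configured glob patterns.
class ScanPatterns {

    private final List<PathMatcher> matchers = new ArrayList<>();

    ScanPatterns(List<String> globs) {
        for (String glob : globs) {
            // "glob:" selects the JDK's built-in glob syntax.
            matchers.add(FileSystems.getDefault().getPathMatcher("glob:" + glob));
        }
    }

    // Matches against the file name only, so "*.csv" applies in any directory.
    boolean shouldScan(Path file) {
        Path name = file.getFileName();
        if (name == null) {
            return false;
        }
        for (PathMatcher matcher : matchers) {
            if (matcher.matches(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Example patterns taken from the extensions mentioned in this issue.
        ScanPatterns patterns = new ScanPatterns(Arrays.asList("*.csv", "*.xml", "*.json"));
        System.out.println(patterns.shouldScan(Paths.get("reports", "data.csv")));
        System.out.println(patterns.shouldScan(Paths.get("photo.png")));
    }
}
```

With something like this, adding "extension X" becomes a configuration change rather than a code change. Whether matching is case-sensitive depends on the platform's default file system.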
