Index all files recognized by Tika, not by extension #10

Open
mlt opened this issue Apr 14, 2016 · 2 comments

Comments

@mlt
Contributor

mlt commented Apr 14, 2016

Here is the use case. I'd like to index custom file types (by extending Tika). Those files can have arbitrary extensions, so the extension shouldn't matter. Extensions for the same type can also vary: htm and html, for example, are currently listed separately. I'd also like to index metadata from photos, so jpg vs. jpeg is the next pair that comes to mind.

I suggest checking the content type (or something similar) detected by Tika instead of calling supportsFile, and probably skipping the file if the detected type is application/octet-stream.
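A minimal sketch of what that check could look like, not the project's actual code. `ContentTypeDetector` and `ContentTypeFilter` are hypothetical names introduced for illustration; the detector interface is an assumption that stands in for Tika, so the snippet runs without Tika on the classpath. With Tika available, its `Tika.detect(...)` facade methods should fit the same shape.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical abstraction over content-type detection (an assumption, not
// part of the project). A Tika-backed implementation could delegate to the
// Tika facade's detect method.
interface ContentTypeDetector {
    String detect(Path file) throws IOException;
}

class ContentTypeFilter {
    // Generic fallback type, meaning the detector did not recognize the file.
    static final String UNKNOWN = "application/octet-stream";

    // Decide by detected content type instead of by file extension.
    static boolean shouldIndex(Path file, ContentTypeDetector detector) {
        try {
            String type = detector.detect(file);
            return type != null && !UNKNOWN.equals(type);
        } catch (IOException e) {
            // Unreadable files are skipped rather than aborting the crawl.
            return false;
        }
    }

    public static void main(String[] args) {
        // Stub detector used only for this demo; a real one would inspect
        // the file content.
        ContentTypeDetector stub = f ->
            f.toString().endsWith(".bin") ? UNKNOWN : "image/jpeg";
        System.out.println(shouldIndex(Paths.get("photo.jpeg"), stub));
        System.out.println(shouldIndex(Paths.get("blob.bin"), stub));
    }
}
```

With this shape, jpg and jpeg (or htm and html) need no separate entries, since both resolve to the same detected type.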

@mirkosertic
Owner

This would put a heavy load on the IO system. We would also have to inspect compressed files, which might themselves contain compressed files, and so on. This introduces another problem: opening a file that lives inside a compressed file from the search results. There is no cross-platform solution available to implement this.

@mlt
Contributor Author

mlt commented May 29, 2019

This could be an option. I'm not sure whether enabling it per folder is overkill, but it is certainly doable as a global option for the case where I want to index everything in a certain folder and don't need anything else.

I would just skip compressed files, at least for now, and report them as such. I'm not 100% sure, but off the top of my head, Tika can handle single-level containers, i.e. not the contents of a zip inside another zip.
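A rough sketch of the "skip and report" idea, under stated assumptions. `ArchiveSkipPolicy` and its `Decision` enum are hypothetical names, and the list of archive media types is illustrative and incomplete. The input is a content type string that has already been detected elsewhere (for example by Tika).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ArchiveSkipPolicy {
    // Outcome for a single file, so skipped archives can be reported as such.
    enum Decision { INDEX, SKIP_ARCHIVE, SKIP_UNKNOWN }

    // Illustrative list of container types (an assumption, not exhaustive).
    static final Set<String> ARCHIVE_TYPES = new HashSet<>(Arrays.asList(
        "application/zip",
        "application/gzip",
        "application/x-tar",
        "application/x-bzip2",
        "application/x-7z-compressed"));

    // Map an already-detected content type to an indexing decision.
    static Decision decide(String contentType) {
        if (contentType == null
                || "application/octet-stream".equals(contentType)) {
            return Decision.SKIP_UNKNOWN;
        }
        if (ARCHIVE_TYPES.contains(contentType)) {
            // Never descend into archives; only record that one was seen.
            return Decision.SKIP_ARCHIVE;
        }
        return Decision.INDEX;
    }

    public static void main(String[] args) {
        for (String type : new String[] {
                "image/jpeg", "application/zip", "application/octet-stream"}) {
            System.out.println(type + " -> " + decide(type));
        }
    }
}
```

Because archives are never opened, the nested-archive and "open from search results" problems do not arise in this variant; the cost is that archive contents stay unsearchable.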
