FEATURE: More flexible ingestion command #3561

Open
PlainSite opened this issue Jan 8, 2024 · 2 comments
Labels
feature-request Requests for new features or enhancements of existing features Moderate Issue that may require attention

Comments

@PlainSite

The alephclient ingestion tool has a "crawldir" option to ingest every file in a folder, but in many cases I just want to ingest one file at a time. This is because I often add PDFs to a folder where some of the PDFs (the old ones) have already been ingested/OCRed but the new ones have not. It's a waste of CPU resources to repeatedly re-ingest documents that are already in Aleph.

I see two possible fixes. By default, the alephclient command could skip any file whose SHA-1 checksum matches a checksum already stored in Elasticsearch (isn't this the point of storing the checksum?), with something like a --force option for cases where it's really necessary to re-run the ingestion process for those files. Alternatively, or in addition, alephclient could allow more precise targeting of specific files to avoid the re-ingestion problem.
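To illustrate the checksum-skipping idea, here is a minimal Python sketch. It is not alephclient code: `sha1_of_file`, `files_to_ingest`, and `known_checksums` are hypothetical names, and how the set of already-ingested checksums would be fetched from Aleph is left out entirely. The `force` argument only mirrors the --force option proposed above.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_ingest(folder, known_checksums, force=False):
    """List files under `folder` whose SHA-1 is not in `known_checksums`.

    `known_checksums` is a hypothetical set of hex digests for documents
    that were already ingested; obtaining it from Aleph is out of scope.
    With force=True every file is returned, checksum match or not.
    """
    selected = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue  # skip directories and other non-file entries
        if force or sha1_of_file(path) not in known_checksums:
            selected.append(path)
    return selected
```

Only the paths returned by `files_to_ingest` would then be handed to the ingestion step, in place, so the original folder structure is preserved.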

The alternative of moving files to a random folder in /tmp to ingest them there isn't ideal because alephclient absorbs some of that file structure metadata when it's figuring out what it's ingesting.

It would also be nice to be able to define metadata to go along with an ingestion batch by passing some information from the command line. Right now I'm not sure how to do that, and such metadata might be useful later to filter ingested data in a certain way.
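As a rough sketch of what passing batch metadata on the command line could look like, the Python below parses repeated KEY=VALUE pairs into a dictionary. The `--meta` flag, the `ingest-sketch` program name, and the `parse_meta` helper are all hypothetical and are not part of alephclient's actual interface.

```python
import argparse


def parse_meta(pairs):
    """Turn ["key=value", ...] into a dict, rejecting malformed items."""
    meta = {}
    for item in pairs:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        meta[key] = value
    return meta


def build_parser():
    # Hypothetical CLI for illustration only, not alephclient's real options.
    parser = argparse.ArgumentParser(prog="ingest-sketch")
    parser.add_argument("path", help="file or folder to ingest")
    parser.add_argument(
        "--meta",
        action="append",
        default=[],
        metavar="KEY=VALUE",
        help="metadata to attach to this ingestion batch (repeatable)",
    )
    return parser
```

For example, `build_parser().parse_args(["doc.pdf", "--meta", "source=court"])` followed by `parse_meta(args.meta)` would yield a dictionary that could be stored alongside the batch and used for filtering later.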

@PlainSite PlainSite added feature-request Requests for new features or enhancements of existing features triage These issues need to be reviewed by the Aleph team labels Jan 8, 2024
@Rosencrantz Rosencrantz added Major issue that requires attention Moderate Issue that may require attention and removed triage These issues need to be reviewed by the Aleph team Major issue that requires attention labels Jan 16, 2024
@Rosencrantz
Contributor

Hey @PlainSite

Thanks for raising this feature request. This is something that we are aware of, and that we'd like to address with time. Right now the team are focused on a significant backlog of other work, so it may be a while before we can get to this. If you felt like putting together a pull request then we'd love to review it.

Thanks


3 participants