Datawave Ingest: Add configurable term position limit for tokenized fields. #2106

Open
drewfarris opened this issue Sep 29, 2023 · 0 comments
In certain pathological cases, large documents tokenized during ingest yield a tremendous number of term positions.

The goal of this ticket is to implement limits on the number of term positions we create per document. This is possibly similar to what we have in place for the multi-value field limits, which (1) limit the number of values we will support for a single field and (2) define an action to take when that limit is hit for a single document.

Notes on the multi-value threshold implementation:

The configuration options for multi-value field limits are defined in the file

. The implementation that uses these are in and https://github.com/NationalSecurityAgency/datawave/blob/integration/warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/tokenize/ExtendedContentDataTypeHelper.java#L361

In the multi-value case, in addition to specifying a numeric limit on the number of values we submit, we also define an action to perform should that threshold be met. Valid actions for the multi-value logic are: DROP, TRUNCATE, REPLACE or FAIL.
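To make the limit-plus-action idea concrete, here is a minimal sketch in Java. It is not the DataWave implementation: the class name, the `apply` method, and the `REPLACEMENT` marker are hypothetical, and the semantics given to each action (DROP discards all values, TRUNCATE keeps the first N, REPLACE substitutes a single marker value, FAIL throws) are assumptions that should be checked against the real multi-value code. The sketch also assumes the action fires only when the limit is exceeded.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; names and semantics are assumptions, not DataWave's API.
final class MultiValueLimitSketch {

    /** The four action names come from the issue; their behavior here is assumed. */
    enum ThresholdAction { DROP, TRUNCATE, REPLACE, FAIL }

    /** Hypothetical marker value emitted by REPLACE. */
    static final String REPLACEMENT = "VALUE_LIMIT_EXCEEDED";

    /** Returns the values to emit for one field of one document. */
    static List<String> apply(List<String> values, int limit, ThresholdAction action) {
        if (values.size() <= limit) {
            return new ArrayList<>(values); // under the limit: emit everything
        }
        switch (action) {
            case DROP:
                return new ArrayList<>(); // emit nothing for this field
            case TRUNCATE:
                return new ArrayList<>(values.subList(0, limit)); // keep the first `limit` values
            case REPLACE:
                return new ArrayList<>(Collections.singletonList(REPLACEMENT));
            case FAIL:
            default:
                throw new IllegalStateException(
                        "Field has " + values.size() + " values, limit is " + limit);
        }
    }
}
```

Keeping the action as an enum means a new limit type can reuse the same configuration vocabulary while supporting only a subset of the actions.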

Thoughts on the term position limit threshold implementation:

For the limits we implement on term positions, we need at minimum TRUNCATE (insert terms up to the limit and do not insert any terms beyond it) and FAIL (error the entire document that had too many terms). Check how the actions are implemented for the multi-value limits to see whether the remaining ones (REPLACE, DROP) would also make sense here.

However, the multi-value limits may be implemented in a way that is specific to the CSV file type. The term position limit should be implemented for anything that gets tokenized. This may mean making changes in multiple places (e.g., the CSVIngestHelper and ExtendedContentDataTypeHelper).
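One way to keep the limit independent of any single data type is a small per-document counter that every tokenizing code path consults before emitting a term position. The sketch below is only an illustration under that assumption: `TermPositionLimiter`, its `Action` enum, and `TermLimitExceededException` are hypothetical names, and how the limit and action would be read from configuration is left out.

```java
// Hypothetical sketch of a shared, per-document term position limit.
final class TermPositionLimiter {

    /** Only the two actions the issue calls out as the minimum. */
    enum Action { TRUNCATE, FAIL }

    /** Thrown under FAIL so the caller can error the whole document. */
    static final class TermLimitExceededException extends RuntimeException {
        TermLimitExceededException(String message) {
            super(message);
        }
    }

    private final int maxPositions;
    private final Action action;
    private int accepted = 0;

    TermPositionLimiter(int maxPositions, Action action) {
        this.maxPositions = maxPositions;
        this.action = action;
    }

    /**
     * Call once per candidate term position. Returns true if the position
     * may be emitted. Once the limit is reached, TRUNCATE returns false
     * for every further call and FAIL throws.
     */
    boolean accept() {
        if (accepted < maxPositions) {
            accepted++;
            return true;
        }
        if (action == Action.FAIL) {
            throw new TermLimitExceededException(
                    "Document exceeded term position limit of " + maxPositions);
        }
        return false; // TRUNCATE: silently skip positions beyond the limit
    }

    /** Number of term positions emitted so far for this document. */
    int acceptedCount() {
        return accepted;
    }
}
```

A tokenizing helper would create one limiter per document and wrap its emit step in `if (limiter.accept()) { ... }`, which lets the CSV and extended-content paths share the same behavior.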
