In certain pathological cases, large documents tokenized during ingest yield a tremendous number of term positions.
The goal of this ticket is to implement limits on the term positions we create per document. This is possibly similar to what we have in place for the multi-value field limits, which (1) limit the number of values we will support for a single field and (2) define an action to take when that limit is hit for a single document.
Notes on the multi-value threshold implementation:
The configuration options for multi-value field limits are defined in the file datawave/warehouse/ingest-configuration/src/main/resources/config/mycsv-ingest-config.xml (line 163 in 8a97485).
See also datawave/warehouse/ingest-core/src/main/java/datawave/ingest/data/config/ingest/CSVIngestHelper.java (line 167 in 8a97485).
In the multi-value case, in addition to specifying a numeric limit on the number of values we submit, we also define an action to perform should that threshold be met. Valid actions for the multi-value logic are: DROP, TRUNCATE, REPLACE or FAIL.
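To make the threshold-plus-action idea concrete, here is a minimal sketch in Java. It is not the DataWave implementation: `ThresholdAction`, `MultiValueLimiter`, and the `replacement` parameter are hypothetical names, and the exact semantics of DROP and REPLACE (drop all values, or substitute a single marker value) are assumptions that should be checked against CSVIngestHelper. It also assumes the action fires only when the value count exceeds the limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical action names mirroring the four actions listed in this ticket.
enum ThresholdAction { DROP, TRUNCATE, REPLACE, FAIL }

// Hypothetical helper; not part of the DataWave code base.
final class MultiValueLimiter {

    // Applies the configured action when a field has more than `limit` values.
    static List<String> apply(List<String> values, int limit,
                              ThresholdAction action, String replacement) {
        if (values.size() <= limit) {
            return new ArrayList<>(values); // under the threshold: keep everything
        }
        switch (action) {
            case DROP:
                // Assumption: DROP discards every value of the field.
                return new ArrayList<>();
            case TRUNCATE:
                // Keep only the first `limit` values.
                return new ArrayList<>(values.subList(0, limit));
            case REPLACE: {
                // Assumption: REPLACE substitutes one marker value for the field.
                List<String> replaced = new ArrayList<>();
                replaced.add(replacement);
                return replaced;
            }
            case FAIL:
                // Reject the whole record.
                throw new IllegalStateException(
                    "field has " + values.size() + " values, limit is " + limit);
            default:
                throw new AssertionError("unknown action: " + action);
        }
    }
}
```

Keeping the action in an enum means a term position limit could reuse the same configuration shape even if it only supports a subset of the actions.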
Thoughts on the term position limit threshold implementation:
For the limits we implement on term positions, at a minimum TRUNCATE (insert terms up to the limit and do not insert any terms beyond it) and FAIL (error the entire document that had too many terms) are needed, but you should check the way the actions are implemented for the multi-value limits to see whether the others, such as REPLACE, also make sense here.
However, the multi-value limits may be implemented in a way that is specific to the CSV file type. The term position limit should be implemented for anything that gets tokenized. This may mean making changes in multiple places (e.g., the CSVIngestHelper and ExtendedContentDataTypeHelper).
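One way to keep the limit independent of any single data type is a small per-document counter that every tokenizing helper consults before emitting a term. The sketch below is only an illustration under that assumption: `TermPositionLimiter`, `TermLimitAction`, and `TermPositionLimitExceededException` are hypothetical names, and it covers only the TRUNCATE and FAIL actions discussed above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical: the two actions this ticket treats as the minimum.
enum TermLimitAction { TRUNCATE, FAIL }

// Hypothetical exception used to fail the whole document.
final class TermPositionLimitExceededException extends RuntimeException {
    TermPositionLimitExceededException(String message) {
        super(message);
    }
}

// Hypothetical per-document limiter; create one instance per document.
final class TermPositionLimiter {
    private final int maxPositions;
    private final TermLimitAction action;
    private int emitted = 0;

    TermPositionLimiter(int maxPositions, TermLimitAction action) {
        this.maxPositions = maxPositions;
        this.action = action;
    }

    // Returns true if the next term position may be indexed.
    // Under FAIL, throws once the limit would be exceeded.
    boolean accept() {
        if (emitted < maxPositions) {
            emitted++;
            return true;
        }
        if (action == TermLimitAction.FAIL) {
            throw new TermPositionLimitExceededException(
                "document exceeded " + maxPositions + " term positions");
        }
        return false; // TRUNCATE: silently stop emitting terms
    }

    // Convenience wrapper: applies the limit to any token stream.
    static List<String> limit(Iterator<String> tokens, int maxPositions,
                              TermLimitAction action) {
        TermPositionLimiter limiter = new TermPositionLimiter(maxPositions, action);
        List<String> kept = new ArrayList<>();
        while (tokens.hasNext()) {
            String token = tokens.next();
            if (!limiter.accept()) {
                break; // limit reached under TRUNCATE
            }
            kept.add(token);
        }
        return kept;
    }
}
```

Because the limiter only counts positions and knows nothing about CSV or any other format, the same instance could be shared across all tokenized fields of a document, whichever helper performs the tokenization.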