[Feature Request] Implements fingerprint ingest processor #13612

Open
gaobinlong opened this issue May 9, 2024 · 0 comments
Labels
enhancement Enhancement or improvement to existing feature or request ingest-pipeline v2.15.0 Issues and PRs related to version 2.15.0 v3.0.0 Issues and PRs related to version 3.0.0


gaobinlong commented May 9, 2024

Is your feature request related to a problem? Please describe

Currently we have the community_id ingest processor, which generates a Community ID flow hash for network flow tuples based on the Community ID hash algorithm. For general data such as application logs or e-commerce data, we could introduce a new type of ingest processor that generates a hash value from some or all of the fields in a document, similar to a content hash. The fingerprint of each document can then be used to deduplicate documents and collapse search results.

The usage of the new fingerprint ingest processor could be:

"processors": [
  {
    "fingerprint": {
      "fields": ["foo", "bar"],
      "target_field": "fingerprint"
    }
  }
]

or

"processors": [
  {
    "fingerprint": {
      "include_all": true,
      "target_field": "fingerprint"
    }
  }
]

After the processor runs, a new field named fingerprint is added to each document. Users can then use the value of that field to deduplicate documents:

1. Check whether there are duplicate documents based on the fingerprint of each document:
GET test1/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "field": "fingerprint",
        "min_doc_count": 2
      }
    }
  }
}
The result is:
...
"aggregations": {
    "test": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "MspgpPqOACPsB5VvjDbn1PdaClo=",
          "doc_count": 2
        }
      ]
    }
  }
The user then knows that 2 documents share the same fingerprint and may decide to delete one of them if the duplication is unexpected.
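
The duplicate check can also be consumed programmatically. Below is a minimal sketch in Python, assuming the search response has already been parsed into a dict with the same shape as the result above. The helper name `duplicated_fingerprints` is hypothetical and not part of any client library; the aggregation name `test` is taken from the request above.

```python
def duplicated_fingerprints(response, agg_name="test"):
    """Map each fingerprint that occurs at least twice to its document count.

    `response` is assumed to be the parsed body of the terms aggregation
    request shown above; `agg_name` is the name given to that aggregation.
    """
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets if b["doc_count"] >= 2}


# Sample response copied from the aggregation result shown above.
sample_response = {
    "aggregations": {
        "test": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "MspgpPqOACPsB5VvjDbn1PdaClo=", "doc_count": 2}
            ],
        }
    }
}

duplicates = duplicated_fingerprints(sample_response)
```

Each key in `duplicates` can then be used in a term query on the fingerprint field to fetch the affected documents before deciding which one to delete.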

Another use case is collapsing search results:

GET test1/_search
{
  "collapse": {
    "field": "fingerprint"                
  }
}
The search hits will contain only one document per fingerprint, even if more than one matching document has the same fingerprint.

Describe the solution you'd like

Add a new ingest processor that generates a fingerprint for each incoming document.
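
To make the idea concrete, here is a rough Python sketch of what such a fingerprint could look like. It is not the proposed implementation: the hash function (SHA-1), the JSON-based serialization, and the function name `fingerprint` are all assumptions made for illustration. SHA-1 is chosen only because the sample key in this issue is 28 base64 characters, which is consistent with a 20-byte digest. The `fields` and `include_all` parameters mirror the options in the usage examples above, and only top-level fields are handled.

```python
import base64
import hashlib
import json


def fingerprint(doc, fields=None, include_all=False):
    """Return a base64-encoded SHA-1 digest over selected top-level fields.

    Assumption: fields are sorted by name and serialized as compact JSON so
    that the order of keys in the document does not change the result.
    """
    if include_all:
        names = sorted(doc)
    else:
        # Fields missing from the document are skipped in this sketch.
        names = sorted(f for f in (fields or []) if f in doc)
    payload = json.dumps(
        [[name, doc[name]] for name in names],
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha1(payload.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


# Two documents that agree on "foo" and "bar" get the same fingerprint,
# regardless of key order or of other fields.
doc_a = {"foo": 1, "bar": "x", "timestamp": 1}
doc_b = {"bar": "x", "foo": 1, "timestamp": 2}
same = fingerprint(doc_a, fields=["foo", "bar"]) == fingerprint(doc_b, fields=["bar", "foo"])
```

Sorting the field names before hashing is a design choice in this sketch so that equivalent documents hash identically; the real processor may define its own canonical form.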

Related component

Indexing

Describe alternatives you've considered

Generate the fingerprint on the client side, which is not user-friendly.

Additional context

No response

@gaobinlong gaobinlong added enhancement Enhancement or improvement to existing feature or request untriaged labels May 9, 2024
@github-actions github-actions bot added the Indexing Indexing, Bulk Indexing and anything related to indexing label May 9, 2024
@dhwanilpatel dhwanilpatel added ingest-pipeline and removed Indexing Indexing, Bulk Indexing and anything related to indexing labels May 15, 2024
@reta reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.15.0 Issues and PRs related to version 2.15.0 and removed untriaged labels May 17, 2024