[Feature Request]: Avoid duplicating every field in PubsubToElasticsearch #1367

Open
henrikno opened this issue Mar 14, 2024 · 0 comments
Labels: addition (New feature or request), needs triage, p2


Related Template(s)

PubsubToElasticsearch

What feature(s) are you requesting?

PubsubToElasticsearch duplicates every field of the incoming document inside the message field, which increases the cost of storage, indexing, and data transfer.

A vpcflow document that comes in can look like this:

{
  "insertId": "z94saofd6xcee",
  "jsonPayload": {
    "bytes_sent": "370688",
    "connection": {
      "dest_ip": "1.2.3.4",
      "dest_port": 443,
      "protocol": 6,
      "src_ip": "4.3.2.1",
      "src_port": 32848
    },
    "dest_instance": {
      "project_id": "test-project",
      "region": "us-central1",
      "vm_name": "vm-001",
      "zone": "us-central1-c"
    },
    "dest_vpc": {
      "project_id": "test-project",
      "subnetwork_name": "vpc-default-us-central1",
      "vpc_name": "vpc-test-us-central1"
    },
    "end_time": "2024-02-14T20:25:32.344962097Z",
    "packets_sent": "384",
    "reporter": "DEST",
    "src_location": {
      "asn": 14061,
      "continent": "Europe",
      "country": "deu"
    },
    "start_time": "2024-02-14T20:25:02.094812111Z"
  },
  "logName": "projects/cloud-production-168820/logs/compute.googleapis.com%2Fvpc_flows",
  "receiveTimestamp": "2024-02-14T20:26:10.756443191Z",
  "resource": {
    "labels": {
      "location": "us-central1",
      "project_id": "test-project",
      "subnetwork_id": "351176161561165321",
      "subnetwork_name": "vpc-default-us-central1"
    },
    "type": "gce_subnetwork"
  },
  "timestamp": "2024-02-14T20:26:10.756443191Z"
}

After it goes through the transformation and is written to Elasticsearch, it is expanded and looks something like this:

{
  "insertId": "z94saofd6xcee",
  "jsonPayload": {
    "bytes_sent": "370688",
    "connection": {
      "dest_ip": "1.2.3.4",
      "dest_port": 443,
      "protocol": 6,
      "src_ip": "4.3.2.1",
      "src_port": 32848
    },
    "dest_instance": {
      "project_id": "test-project",
      "region": "us-central1",
      "vm_name": "vm-001",
      "zone": "us-central1-c"
    },
    "dest_vpc": {
      "project_id": "test-project",
      "subnetwork_name": "vpc-default-us-central1",
      "vpc_name": "vpc-test-us-central1"
    },
    "end_time": "2024-02-14T20:25:32.344962097Z",
    "packets_sent": "384",
    "reporter": "DEST",
    "src_location": {
      "asn": 14061,
      "continent": "Europe",
      "country": "deu"
    },
    "start_time": "2024-02-14T20:25:02.094812111Z"
  },
  "logName": "projects/cloud-production-168820/logs/compute.googleapis.com%2Fvpc_flows",
  "receiveTimestamp": "2024-02-14T20:26:10.756443191Z",
  "resource": {
    "labels": {
      "location": "us-central1",
      "project_id": "test-project",
      "subnetwork_id": "351176161561165321",
      "subnetwork_name": "vpc-default-us-central1"
    },
    "type": "gce_subnetwork"
  },
  "@timestamp": "2024-02-14T20:26:10.756443191Z",
  "agent": {
    "type": "dataflow",
    "name": "",
    "version": "999.999.999",
    "id": ""
  },
  "data_stream": {
    "type": "logs",
    "dataset": "gcp.vpcflow",
    "namespace": "test-namespace"
  },
  "ecs": {
    "version": "1.10.0"
  },
  "message": "{  \"insertId\": \"z94saofd6xcee\",  \"jsonPayload\": {    \"bytes_sent\": \"370688\",    \"connection\": {      \"dest_ip\": \"1.2.3.4\",      \"dest_port\": 443,      \"protocol\": 6,      \"src_ip\": \"4.3.2.1\",      \"src_port\": 32848    },    \"dest_instance\": {      \"project_id\": \"test-project\",      \"region\": \"us-central1\",      \"vm_name\": \"vm-001\",      \"zone\": \"us-central1-c\"    },    \"dest_vpc\": {      \"project_id\": \"test-project\",      \"subnetwork_name\": \"vpc-default-us-central1\",      \"vpc_name\": \"vpc-test-us-central1\"    },    \"end_time\": \"2024-02-14T20:25:32.344962097Z\",    \"packets_sent\": \"384\",    \"reporter\": \"DEST\",    \"src_location\": {      \"asn\": 14061,      \"continent\": \"Europe\",      \"country\": \"deu\"    },    \"start_time\": \"2024-02-14T20:25:02.094812111Z\"  },  \"logName\": \"projects/cloud-production-168820/logs/compute.googleapis.com%2Fvpc_flows\",  \"receiveTimestamp\": \"2024-02-14T20:26:10.756443191Z\",  \"resource\": {    \"labels\": {      \"location\": \"us-central1\",      \"project_id\": \"test-project\",      \"subnetwork_id\": \"351176161561165321\",      \"subnetwork_name\": \"vpc-default-us-central1\"    },    \"type\": \"gce_subnetwork\"  },  \"timestamp\": \"2024-02-14T20:26:10.756443191Z\"}",
  "service": {
    "type": "gcp.vpcflow"
  },
  "event": {
    "module": "gcp",
    "dataset": "gcp.vpcflow"
  }
}

Note that every field is also included in message.
The GCP integration in Elasticsearch parses message into json.jsonPayload, then extracts and converts the fields into the ECS mapping. It does drop this json field afterwards, so everything is stored only twice.
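
To get a rough sense of the overhead, the sketch below rebuilds the shape of the enriched document from a trimmed-down version of the example and compares serialized sizes. The `enrich` helper is hypothetical and only mimics the observed output (original fields plus a `message` string); it is not the template's actual code.

```python
import json

# Trimmed-down version of the vpcflow example from this issue.
source = {
    "insertId": "z94saofd6xcee",
    "jsonPayload": {
        "bytes_sent": "370688",
        "connection": {"dest_ip": "1.2.3.4", "dest_port": 443, "protocol": 6},
        "reporter": "DEST",
    },
    "timestamp": "2024-02-14T20:26:10.756443191Z",
}


def enrich(doc: dict) -> dict:
    """Hypothetical stand-in for the template: mimics the observed output shape only."""
    out = {k: v for k, v in doc.items() if k != "timestamp"}
    out["@timestamp"] = doc["timestamp"]
    # The whole original document is serialized again into "message".
    out["message"] = json.dumps(doc)
    return out


def serialized_size(doc: dict) -> int:
    """Size in bytes of the document once serialized as UTF-8 JSON."""
    return len(json.dumps(doc).encode("utf-8"))


enriched = enrich(source)
duplicated = json.loads(enriched["message"])

# Every structured field is also present inside the message string.
assert duplicated["jsonPayload"] == enriched["jsonPayload"]

print("source bytes:  ", serialized_size(source))
print("enriched bytes:", serialized_size(enriched))
```

Because the embedded copy also needs its quotes escaped, the enriched document comes out at more than twice the size of the source in this sketch.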

I've worked around it by dropping the jsonPayload field in an ingest pipeline, which helps with storage, but we're still paying for the extra data transfer and processing.
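
For reference, a minimal sketch of that kind of workaround follows. The pipeline body uses the Elasticsearch `remove` ingest processor; the pipeline name `drop-jsonpayload` and the small `apply_remove` simulator are hypothetical and exist only to illustrate the effect locally.

```python
import json

# Body for PUT _ingest/pipeline/<name>; the name below is a hypothetical placeholder.
PIPELINE_NAME = "drop-jsonpayload"
pipeline_body = {
    "description": "Drop the structured copy; the integration re-parses message",
    "processors": [
        {"remove": {"field": "jsonPayload", "ignore_missing": True}},
    ],
}


def apply_remove(doc: dict, pipeline: dict) -> dict:
    """Local illustration of the remove processors above (top-level fields only)."""
    result = dict(doc)
    for processor in pipeline["processors"]:
        spec = processor.get("remove")
        if spec is None:
            continue  # this sketch only understands the remove processor
        if spec["field"] not in result and not spec.get("ignore_missing", False):
            raise KeyError(spec["field"])
        result.pop(spec["field"], None)
    return result


doc = {
    "jsonPayload": {"reporter": "DEST"},
    "message": '{"jsonPayload": {"reporter": "DEST"}}',
}
stored = apply_remove(doc, pipeline_body)

print(f"PUT _ingest/pipeline/{PIPELINE_NAME}")
print(json.dumps(pipeline_body, indent=2))
print("fields left after the pipeline:", sorted(stored))
```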

Ideally the document would contain only the json fields and no message, and the integration would then use the json fields instead. The ingest pipeline might need some modification to detect the presence of jsonPayload and ignore message.
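
One possible shape for that detection logic, sketched in Python rather than as an actual ingest pipeline: prefer the structured fields when `jsonPayload` is present and fall back to parsing `message` otherwise. The function name `resolve_payload` is hypothetical; the real change would live in the template and in the integration's pipeline.

```python
import json


def resolve_payload(doc: dict) -> dict:
    """Return the log entry to map to ECS, preferring structured fields.

    Hypothetical helper: sketches the proposed behaviour, not existing code.
    """
    if "jsonPayload" in doc:
        # Structured fields are already present, so message can be ignored.
        return {k: v for k, v in doc.items() if k != "message"}
    if "message" in doc:
        # Documents without structured fields: fall back to parsing message.
        return json.loads(doc["message"])
    raise ValueError("document has neither jsonPayload nor message")


structured_only = {"insertId": "z94saofd6xcee", "jsonPayload": {"reporter": "DEST"}}
message_only = {"message": json.dumps(structured_only)}

# Both document shapes resolve to the same log entry.
assert resolve_payload(structured_only) == resolve_payload(message_only)
```

Keeping the fallback branch means documents that still carry only message would continue to work while the structured form is rolled out.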

henrikno added the addition, needs triage, and p2 labels on Mar 14, 2024