Loki v3.0.0 quietly gives incomplete results for some queries including a line_format #12798

Open
kilo-gi opened this issue Apr 26, 2024 · 1 comment
Labels
type/bug Something is not working as expected

kilo-gi commented Apr 26, 2024

Describe the bug
In v3.0.0, Loki will quietly drop some log lines in some cases when the query includes a line_format stage.

To Reproduce
Steps to reproduce the behavior:

  1. Start Loki v3.0.0 (and v2.7.5 for comparison)
  2. Start Promtail v2.7.5
  3. Ingest a significant number of logs
  4. Query a specified time range with a line_format stage, e.g. '{image="standalone/llms:c38c6aa8f223b1aa86e1f0726362b15432c8697f", job="container_logs"} |= `saved checkpoint at batch 75` | json | line_format `.`'
  5. Query the same time range without the line_format stage, e.g. '{image="standalone/llms:c38c6aa8f223b1aa86e1f0726362b15432c8697f", job="container_logs"} |= `saved checkpoint at batch 75` | json'
  6. Observe that, for the exact same time range and the same ingested logs (see the curl sketch after this list):
    v2.7.5 with and without the line_format gives 1800 logs
    v3.0.0 without the line_format gives 1800 logs
    v3.0.0 with the line_format gives only 1793 logs
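
A rough sketch of how the comparison can be scripted against the query_range API is below. The hostnames, time range, and limit are placeholders for our setup (with the TLS config further down you would also need curl's --cert/--key flags), but the queries mirror the ones above:

for addr in http://loki-v2:3100 http://loki-v3:3100; do   # placeholder addresses
  for q in \
    '{image="standalone/llms:c38c6aa8f223b1aa86e1f0726362b15432c8697f", job="container_logs"} |= `saved checkpoint at batch 75` | json' \
    '{image="standalone/llms:c38c6aa8f223b1aa86e1f0726362b15432c8697f", job="container_logs"} |= `saved checkpoint at batch 75` | json | line_format `.`'; do
    curl -sG "$addr/loki/api/v1/query_range" \
      --data-urlencode "query=$q" \
      --data-urlencode "start=2024-04-22T00:00:00Z" \
      --data-urlencode "end=2024-04-23T00:00:00Z" \
      --data-urlencode "limit=5000" |
      jq '[.data.result[].values[]] | length'   # total log lines returned
  done
done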

Expected behavior
Neither the Loki version nor the line_format pipeline stage should change the number of log lines returned.

Environment:

  • Infrastructure: bare metal
  • Deployment tool: Running the images directly in Docker, started via Ansible.

It's a single-replica deployment.

Screenshots, Promtail config, or terminal output
I've posted screenshots and log dumps in the community Slack: https://grafana.slack.com/archives/C06RUQJHTHQ/p1713891805440339

Here's the Loki config we're using for v3.0.0:

schema_config:
  configs:
  - from: 2023-08-01
    store: boltdb-shipper
    object_store: s3
    schema: v11
    index:
      prefix: loki_
      period: 24h
  - from: 2024-04-23
    store: tsdb
    object_store: s3
    schema: v13
    index:
      prefix: index_
      period: 24h

common:
  path_prefix: "/loki"
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: "inmemory"
  instance_addr: 127.0.0.1

storage_config:
  tsdb_shipper:
    active_index_directory: "/loki/tsdb-index"
    cache_location: "/loki/tsdb-cache"
    cache_ttl: 168h
  boltdb_shipper:
    active_index_directory: "/loki/index"
    cache_location: "/loki/boltdb-cache"
    cache_ttl: 168h
  aws:
    s3: "s3://us-west-2/test-int8-logs"
    # TODO: we will want to replace these with a dedicated set at some point
    access_key_id: {{ lookup('ansible.builtin.env', 'AWS_ACCESS_KEY_ID', default=Undefined) }}
    secret_access_key: {{ lookup('ansible.builtin.env', 'AWS_SECRET_ACCESS_KEY', default=Undefined) }}
  index_queries_cache_config: &general_cache_config
    default_validity: 168h
    embedded_cache: &general_embedded_cache_config
      enabled: true
      max_size_mb: 4096 # 4GB should be plenty with our ~2.5GB of logs per day, and still manageable even duplicated 5 places.
      ttl: 168h


# Here, Auth refers to specifically the X-Scope-OrgID header, which we don't use.
auth_enabled: false

server:
    http_tls_config:
        cert_file: "/loki/certs/server.crt"
        key_file: "/loki/certs/server.key"
        client_ca_file: "/loki/certs/root.crt"
        # For debugging, we can set this to "VerifyClientCertIfGiven" temporarily to get at the metrics and api in your browser.
        client_auth_type: "VerifyClientCertIfGiven"
    grpc_server_max_recv_msg_size: 67108864 # 64mb, default 4mb
    grpc_server_max_send_msg_size: 67108864
    http_server_read_timeout: 2m
    http_server_write_timeout: 2m
    log_level: debug

memberlist:
  node_name: singleton

limits_config:
  # We're actually pretty good about being <1kb per log line, so this corresponds to 100mb, which is large but not massive.
  # We specifically want a higher value than the default because grafana doesn't really do pagination, so this is the max that can be displayed there. Most queries should probably be limited to ~10k lines
  max_entries_limit_per_query: 100000
  # The default of 32 is quite low - Grafana serves queries for everyone
  max_query_parallelism: 512
  # Default of 1h shards things far too thin (4k requests for a single month long query). 6h is much more reasonable
  split_queries_by_interval: 6h
  # We hit the default of 10k streams per user :(
  max_streams_per_user: 0
  max_global_streams_per_user: 0
  allow_structured_metadata: false
  ingestion_rate_mb: 12
  ingestion_burst_size_mb: 48
  max_line_size: 16KB
  max_line_size_truncate: true

frontend:
  # Default of 100 is far too low
  max_outstanding_per_tenant: 2048
  log_queries_longer_than: 20s


frontend_worker:
    grpc_client_config: &grpc_client_config
      max_send_msg_size: 67108864 # 64mb, default 16mb
      max_recv_msg_size: 67108864

querier:
  # Default of 10 is quite low. We've got beefy machines
  max_concurrent: 20

query_scheduler:
  max_outstanding_requests_per_tenant: 10000
  grpc_client_config: *grpc_client_config

query_range:
  # https://github.com/grafana/loki/issues/4613#issuecomment-1021421653
  parallelise_shardable_queries: false
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache: *general_cache_config
  cache_instant_metric_results: true
  instant_metric_query_split_align: true
  instant_metric_results_cache:
    cache: *general_cache_config

chunk_store_config:
  chunk_cache_config:
    default_validity: 168h
    embedded_cache:
      <<: *general_embedded_cache_config
      max_size_mb: 40960 # pretty big, 40GB, so we can just keep a whole week's worth of logs in memory. Why not?
  write_dedupe_cache_config: *general_cache_config
  # It's really unclear from docs what this number is actually supposed to be. Just the point at which we're confident logs have been ingested?
  cache_lookups_older_than: 1h

Querying v3.0.0 without the line format gives 1800 lines:
https://files.slack.com/files-pri/T05675Y01-F06V8NGCVB9/download/screenshot_2024-04-23_at_12.03.56___pm.png?origin_team=T05675Y01
https://files.slack.com/files-pri/T05675Y01-F0706C4AS85/download/explore-v3.0.0-no-line-format.json?origin_team=T05675Y01

v3.0.0 with the line format gives 1793 lines:
https://grafana.slack.com/files/U06V4UJKA7N/F0706CB2A8M/screenshot_2024-04-23_at_12.05.42___pm.png?origin_team=T05675Y01&origin_channel=C05675Y4F
https://files.slack.com/files-pri/T05675Y01-F070Y3GHSDN/download/explore-v3.0.0-with-line-format.json?origin_team=T05675Y01

v2.7.5 without the line format is 1800 lines:
https://grafana.slack.com/files/U06V4UJKA7N/F0709A4M34J/screenshot_2024-04-23_at_12.06.52___pm.png?origin_team=T05675Y01&origin_channel=C05675Y4F
https://files.slack.com/files-pri/T05675Y01-F070Y3NP4U8/download/explore-v2.7.5-no-line-format.json?origin_team=T05675Y01

And v2.7.5 with the line format is 1800 lines:
https://grafana.slack.com/files/U06V4UJKA7N/F0706CTN38V/screenshot_2024-04-23_at_12.08.55___pm.png?origin_team=T05675Y01&origin_channel=C05675Y4F
https://files.slack.com/files-pri/T05675Y01-F070Y3ZBERW/download/explore-v2.7.5-with-line-format.json?origin_team=T05675Y01

To get useful diffs you have to remove all the "service_name" fields and pretty-print the JSON first (a sketch of that preprocessing follows).

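One way to do the cleanup (not necessarily exactly what I ran) is with jq, which drops the service_name fields wherever they appear and pretty-prints by default; file names match the Slack exports above:

for f in Explore-v*.json; do
  # drop every service_name field and pretty-print, writing back in place
  jq 'del(.. | .service_name?)' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

With that done:
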
⟐:Downloads ▶ wc -l Explore-v*
   72362 Explore-v2.7.5-no-line-format.json
   72362 Explore-v2.7.5-with-line-format.json
   72362 Explore-v3.0.0-no-line-format.json
   72075 Explore-v3.0.0-with-line-format.json
  289161 total
⟐:Downloads ▶

kilo-gi commented Apr 26, 2024

This has been observed by others on the community forums:

https://community.grafana.com/t/line-format-reduces-log-count-to-only-a-few/119730

JStickler added the type/bug (Something is not working as expected) label on Apr 29, 2024