A subset of metrics disappear from m3db after some rounds of scale up/down #4103

abdulmi opened this issue Apr 14, 2022 · 0 comments

abdulmi commented Apr 14, 2022

Filing M3 Issues

General Issues

We recently had an m3db outage where a subset of metrics "disappeared" for a period of time and could not be queried. Here is the series of events that we think caused this to happen (a sketch of how we applied each scaling step follows the list):

  1. We scaled the m3db cluster up from 1 replica per isolation group to 2 replicas per isolation group. We have 3 isolation groups.
  2. We scaled the cluster back down from 2 replicas per isolation group to 1 replica per isolation group.
  3. We scaled the cluster up again from 1 replica per isolation group to 2 replicas per isolation group.
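
For reference, each scaling step was just an edit to numInstances in the operator spec included below. A minimal sketch of step 1, assuming the M3DBCluster object is named m3db-cluster in Kubernetes namespace m3 (both names are placeholders):

# Sketch of step 1: bump each isolation group from 1 to 2 instances.
# Cluster name "m3db-cluster" and k8s namespace "m3" are placeholders.
kubectl -n m3 patch m3dbcluster m3db-cluster --type json -p '[
  {"op": "replace", "path": "/spec/isolationGroups/0/numInstances", "value": 2},
  {"op": "replace", "path": "/spec/isolationGroups/1/numInstances", "value": 2},
  {"op": "replace", "path": "/spec/isolationGroups/2/numInstances", "value": 2}
]'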

After step 3, some metrics disappeared from the cluster and could no longer be queried (there was a gap in those metrics starting at step 3). All writes and reads to the cluster were succeeding with no failures. One thing worth mentioning: when the new replicas came up in the second scale-up (step 3), they reused the same disks that had been provisioned for the replicas brought up in step 1. On a quick look those disks contained some index data, but I don't think they had any actual metrics data.
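
For anyone checking the same thing, this is roughly how we inspected the reused disks. The paths assume the default filePathPrefix of /var/lib/m3db and our single "default" namespace, so adjust for your layout:

# Rough sketch of the disk inspection; paths assume filePathPrefix
# /var/lib/m3db and namespace "default".
ls /var/lib/m3db/index/data/default/    # index filesets were present here
ls /var/lib/m3db/data/default/          # per-shard data dirs looked essentially empty
du -sh /var/lib/m3db/data/default/* 2>/dev/null | sort -h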

To mitigate this, we scaled the cluster down by editing the placement, deleted the old disks, and then scaled the cluster back up with freshly provisioned disks. Metrics started working normally again, and they reappeared even for the incident window, so there was no longer a gap.
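
The placement edit in that mitigation went through the coordinator placement API, roughly as follows; the coordinator host and instance ID are placeholders:

# Remove an instance from the placement before deleting its disk
# (host "m3coordinator" and instance ID "m3db-group1-1" are placeholders)
curl -X DELETE http://m3coordinator:7201/api/v1/services/m3db/placement/m3db-group1-1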

What service is experiencing the issue? (M3Coordinator, M3DB, M3Aggregator, etc)

M3DB v1.3.0

What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).

Here is the m3db configuration YAML:

spec:
  annotations:
    ...
  configMapName: m3db-config-map
  containerResources:
    limits:
      cpu: 8.5
      memory: 88Gi
    requests:
      cpu: 8.5
      memory: 88Gi
  dataDirVolumeClaimTemplate:
    metadata:
      creationTimestamp: null
      name: m3db-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 2Ti
      storageClassName: m3db-storage-class
    status: {}
  dnsPolicy: ClusterFirstWithHostNet
  etcdEndpoints:
  - http://etcd-0.etcd.m3.svc.cluster.local:2379
  - http://etcd-1.etcd.m3.svc.cluster.local:2379
  - http://etcd-2.etcd.m3.svc.cluster.local:2379
  hostNetwork: true
  image: ...
  isolationGroups:
  - name: group1
    nodeAffinityTerms:
    - key: pool
      values:
      - m3
    numInstances: 2
  - name: group2
    nodeAffinityTerms:
    - key: pool
      values:
      - m3
    numInstances: 2
  - name: group3
    nodeAffinityTerms:
    - key: pool
      values:
      - m3
    numInstances: 2
  namespaces:
  - name: default
    options:
      bootstrapEnabled: true
      cleanupEnabled: true
      flushEnabled: true
      indexOptions:
        blockSize: 8h
        enabled: true
      repairEnabled: true
      retentionOptions:
        blockDataExpiry: true
        blockDataExpiryAfterNotAccessPeriod: 5m
        blockSize: 8h
        bufferFuture: 10m
        bufferPast: 15m
        retentionPeriod: 2160h
      snapshotEnabled: true
      writesToCommitLog: true
  numberOfShards: 64
  parallelPodManagement: true
  podIdentityConfig:
    sources: []
  podMetadata:
    creationTimestamp: null
  priorityClassName: m3db
  replicationFactor: 3
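
In case it helps with triage, the live placement and namespace registry can be dumped from any coordinator for comparison against the spec above (the coordinator host is a placeholder):

# Dump the placement and namespace options as the cluster currently sees them
curl http://m3coordinator:7201/api/v1/services/m3db/placement
curl http://m3coordinator:7201/api/v1/services/m3db/namespace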

How are you using the service? For example, are you performing read/writes to the service via Prometheus, or are you using a custom script?

We're performing reads and writes through the m3coordinator instances.
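
Concretely, Prometheus talks to the coordinators via its standard remote read/write support, along these lines (the coordinator URL is anonymized):

# prometheus.yml excerpt; the coordinator host is a placeholder
remote_write:
  - url: http://m3coordinator:7201/api/v1/prom/remote/write
remote_read:
  - url: http://m3coordinator:7201/api/v1/prom/remote/read
    read_recent: true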

Is there a reliable way to reproduce the behavior? If so, please provide detailed instructions.

We haven't attempted to reproduce this issue yet, but we wanted to check whether the series of events described above is something that is expected to work, or whether it should be avoided in general.

Please let me know if you need any more details/configs. Is the series of events described above not meant to work?
