
getCompactionTaskCapacity is not worker category aware. #15847

Open
m-ghazanfar opened this issue Feb 7, 2024 · 2 comments · May be fixed by #16391

@m-ghazanfar
Contributor

Description

The getCompactionTaskCapacity function, which is used to check that the Druid cluster has enough task slots before the coordinator schedules additional compaction tasks, does not take the overlord dynamic config into consideration. The overlord dynamic config can prevent compaction tasks from running on specific categories of workers, so the compaction task capacity is overestimated.

For example, I have two worker categories: compaction-category with a total of 600 task slots, and ingestion-category with 2000 slots (a high number because of multiple ingestion task replicas).

Using the overlord dynamic config,

  • compaction-category is configured to run the following task types,

    • kill
    • compact
    • single_phase_sub_task
    • partial_dimension_cardinality
    • partial_index_generate
    • partial_index_generic_merge
  • ingestion-category is configured to run,

    • index_kafka

Now, getCompactionTaskCapacity returns 2600 as the total capacity, which is inaccurate since only 600 slots are actually available for compaction tasks. While this might not pose a problem in a healthy cluster, it becomes critical during compaction task failures: the overestimate leads the coordinator to create excessive compaction tasks, causing contention on compaction slots and slowing down all compaction tasks. This creates a feedback loop in which the growing number of compaction tasks exacerbates the contention, ultimately overwhelming the overlord with too many tasks to handle.
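
To make the overestimation concrete, here is a minimal, self-contained sketch of the arithmetic. This is not the actual Druid code; the category names, slot counts, and the set of "categories allowed for compaction" are just the assumptions taken from the example above.

import java.util.Map;
import java.util.Set;

public class CompactionCapacityExample
{
  // Total task slots per worker category (numbers from the example above).
  static final Map<String, Integer> SLOTS_PER_CATEGORY = Map.of(
      "compaction-category", 600,
      "ingestion-category", 2000
  );

  // Categories on which the overlord dynamic config allows compaction (and its sub-tasks) to run.
  static final Set<String> CATEGORIES_ALLOWED_FOR_COMPACTION = Set.of("compaction-category");

  // Category-unaware estimate: sums slots across every worker category (the behaviour reported here).
  static int naiveCapacity()
  {
    return SLOTS_PER_CATEGORY.values().stream().mapToInt(Integer::intValue).sum();
  }

  // Category-aware estimate: counts only slots in categories where compaction tasks may actually run.
  static int categoryAwareCapacity()
  {
    return SLOTS_PER_CATEGORY.entrySet().stream()
        .filter(e -> CATEGORIES_ALLOWED_FOR_COMPACTION.contains(e.getKey()))
        .mapToInt(Map.Entry::getValue)
        .sum();
  }

  public static void main(String[] args)
  {
    System.out.println("category-unaware capacity = " + naiveCapacity());         // 2600
    System.out.println("category-aware capacity   = " + categoryAwareCapacity()); // 600
  }
}

With these numbers, the coordinator budgets for 2600 compaction slots when only 600 exist, which is what drives the pile-up described above.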

Affected Version

Saw this on Druid 25. It is also present on master.

@m-ghazanfar
Contributor Author

My overlord dynamic config looks like this,

{
  "type": "equalDistributionWithCategorySpec",
  "workerCategorySpec": {
    "categoryMap": {
      "kill": {
        "defaultCategory": "abc_category",
        "categoryAffinity": {}
      },
      "compact": {
        "defaultCategory": "abc_category",
        "categoryAffinity": {}
      },
      "single_phase_sub_task": {
        "defaultCategory": "abc_category",
        "categoryAffinity": {}
      },
      "partial_dimension_cardinality": {
        "defaultCategory": "abc_category",
        "categoryAffinity": {}
      },
      "partial_index_generate": {
        "defaultCategory": "abc_category",
        "categoryAffinity": {}
      },
      "partial_index_generic_merge": {
        "defaultCategory": "abc_category",
        "categoryAffinity": {}
      },
      "partial_range_index_generate": {
        "defaultCategory": "abc_category",
        "categoryAffinity": {}
      },
      "partial_dimension_distribution": {
        "defaultCategory": "abc_category",
        "categoryAffinity": {}
      },
      "index_kafka": {
        "defaultCategory": "xxx_category",
        "categoryAffinity": {
          "datasource_2": "yyy_category",
          "datasource_3": "zzz_category"
        }
      }
    },
    "strong": true
  }
}
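
For anyone trying to reproduce this: if I'm reading it right, the spec above is the selectStrategy portion of the overlord dynamic config, so (assuming the standard overlord dynamic config endpoint) it would be submitted roughly like this, with the host, port, and file name as placeholders:

# worker-config.json contains {"selectStrategy": <the JSON above>}
curl --request POST "http://OVERLORD_IP:OVERLORD_PORT/druid/indexer/v1/worker" \
  --header 'Content-Type: application/json' \
  --data @worker-config.json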

@m-ghazanfar
Contributor Author

We worked around this by explicitly setting maxCompactionTaskSlots to the total available worker slots on the compaction tier. (Actually, we set it a little higher than that to improve middle manager utilisation.)

See Update capacity for compaction tasks

Example,

curl --request POST "http://ROUTER_IP:ROUTER_PORT/druid/coordinator/v1/config/compaction/taskslots?max=600"
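
To double-check that the override took effect, the cluster compaction config (which includes maxCompactionTaskSlots) can be read back; this assumes the standard coordinator compaction config endpoint:

curl --request GET "http://ROUTER_IP:ROUTER_PORT/druid/coordinator/v1/config/compaction"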

@AlbericByte linked a pull request (#16391) on May 5, 2024 that may close this issue.