Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix grace period after kill for health check failure #7221

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

komuta
Copy link
Contributor

@komuta komuta commented Sep 22, 2020

Following the change of task id / instance id format, task health status
was not properly tracked because bound to the same instance id after
successive kills for health check failure. This fix propose to address
the issue by tracking health status by task id to ensure it to cleaned
each time a task is terminated, whatever the reason.

JIRA Issues: MARATHON-8745

Following the change of task id / instance id format, task health status
was not properly tracked because bound to the same instance id after
successive kills for health check failure. This fix propose to address
the issue by tracking health status by task id to ensure it to cleaned
each time a task is terminated, whatever the reason.

JIRA Issues: MARATHON-8745
@@ -40,7 +41,7 @@ private[health] class HealthCheckActor(
implicit val mat = ActorMaterializer()
import context.dispatcher

val healthByInstanceId = TrieMap.empty[Instance.Id, Health]
val healthByInstanceId = TrieMap.empty[Task.Id, Health]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we oughta rename this map?

inactiveInstanceIds.foreach { inactiveId =>
healthByInstanceId.remove(inactiveId)
}
val activeTaskIds: Set[Task.Id] = instances.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val activeTaskIds: Set[Task.Id] = instances.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set)
val activeTaskIds: Set[Task.Id] = instances.iterator.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set)

healthByInstanceId.remove(inactiveId)
}
val activeTaskIds: Set[Task.Id] = instances.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set)
healthByInstanceId.retain((taskId, health) => activeTaskIds(taskId))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

@@ -192,7 +191,9 @@ private[health] class HealthCheckActor(
}

def receive: Receive = {
case GetInstanceHealth(instanceId) => sender() ! healthByInstanceId.getOrElse(instanceId, Health(instanceId))
case GetInstanceHealth(instanceId) =>
sender() ! healthByInstanceId.find(_._1.instanceId == instanceId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... so we're going to scan a map? This seems like this could potentially perform poorly with lots of health checks. Maybe we should keep it indexed by instance id and evict the health status if the task id changes?

@timcharper
Copy link
Contributor

Thanks for the PR! I've reviewed it and left some feedback, I'm most concerned about changing the index and introducing a scan for what was previously a map lookup. Thank you!

Copy link

@rohitjain25 rohitjain25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def update(result: HealthResult): Health =
  result match {
    case Healthy(_, _, time, _) =>
      copy(
        firstSuccess = firstSuccess.orElse(Some(time)),
        lastSuccess = Some(time),
        consecutiveFailures = 0
      )
    case Unhealthy(_, _, cause, time, _) =>
      copy(
        lastFailure = Some(time),
        lastFailureCause = Some(cause)
      )
  }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants