Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WM-2570] Get task cost #7415

Merged
merged 86 commits into from
Jun 7, 2024
Merged

[WM-2570] Get task cost #7415

merged 86 commits into from
Jun 7, 2024

Conversation

kpierre13
Copy link
Contributor

No description provided.

@kpierre13 kpierre13 marked this pull request as ready for review May 10, 2024 14:10
@kpierre13 kpierre13 requested a review from a team as a code owner May 10, 2024 14:10
Copy link
Contributor

@THWiseman THWiseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff! Great job figuring out how to efficiently ask for some extra data from TES when we need it. Very useful feature.

)(implicit ec: ExecutionContext): Future[String] =
for {
logs <- getTaskLogsFn(handle)
taskEndTime = logs.end_time.get
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably a good idea to avoid the .get here since it will cause a crash if the end_time isn't present. Perhaps its better for this function to return a Future[Option[String]] since tes might successfully respond, but not include the end_time in its response?

@@ -401,9 +431,41 @@ class TesAsyncBackendJobExecutionActor(override val standardParams: StandardAsyn

override def requestsAbortAndDiesImmediately: Boolean = false

override def onTaskComplete(runStatus: TesRunStatus, handle: StandardAsyncPendingExecutionHandle): Unit = {
val taskEndTime = getTaskEndTime(handle, getTaskLogs)
if (runStatus == Error() | runStatus == Failed()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might not do exactly what we want, since it's possible for two Error or Failed case classes to not == one another.

For example: Error(Seq("an error message")) != Error(Seq.empty)
I think it might be better to match on the type, like

    runStatus match {
      case Error(_) => // fetch and tell errors
      case Failed(_) => // fetch and tell errors
      case _ => Unit
    }

import scala.util.{Failure, Try}

class TesAsyncBackendJobExecutionActorSpec
extends AnyFlatSpec
extends TestKitSuite
with AnyFlatSpecLike
with Matchers
with MockSugar
with TableDrivenPropertyChecks {
behavior of "TesAsyncBackendJobExecutionActor"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would really benefit from some tests of the polling behavior, since that seems to be the bulk of the new logic in this branch. I think it's important to know what web requests are made and what status is returned when calling pollStatusAsync in different conditions. We could try to make that function pure and test it directly in unit tests, or do some work to spin up our own mock TesAsyncBackendJobExecutionActor to use for unit tests.

I think it would also be great to have a test that makes sure we're emitting the expected metadata when calling onTaskComplete in different conditions.

Happy to help!

Copy link
Contributor

@jgainerdewar jgainerdewar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's awesome to see this coming to fruition!

Comment on lines 435 to 437
val taskEndTime = getTaskEndTime(handle, getTaskLogs)
if (runStatus == Error() | runStatus == Failed()) {
val errors = getErrorSeq(runStatus, handle, getErrorLogs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this will be making two identical requests to TES, can we fetch the full task view once?

makeRequest[MinimalTaskView](HttpRequest(uri = s"$tesEndpoint/${handle.pendingJob.jobId}?view=MINIMAL")) map {
response =>
val state = response.state
getTesStatus(Option(state), Option.empty, handle.pendingJob.jobId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than passing through empty cost data, should this be passing the cost data we already loaded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this specific line, I'm passing in Option.empty because the tesVmCostData is not in scope; however for the lines similar above this one, I'm passing in that vm cost object rather than Option.empty

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I'm wondering if that's going to lead to the behavior we want. I might be missing something, but it looks like right now, if we already have all the cost data, when we poll we'll end up with a status without any cost data attached - which means we'll fetch it again next time we poll.

Failed()
case s if s.contains("EXECUTOR_ERROR") =>
jobLogger.info(s"TES reported a failure for Job ${jobId}: '$s'")
Failed()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to track cost data for failed tasks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think so, but is it necessary here for any of the states? I think it makes sense to get the cost data and put it into the metadata, but I'm not sure if it's necessary to add it here when we get the status. I'm going to remove it from all occurrences for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think I get it - these are terminal states, so they don't need to carry cost data with them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wouldn't think the terminal states would care about whether or not we fetched the cost data since we'll stop polling anyway.

def isTerminal = true
}

object TesAsyncBackendJobExecutionActor {
val JobIdKey = "tes_job_id"
private type StandardAsyncRunInfo = Any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I overheard some discussion of how we ended up with Any here but I didn't get the details - what prevents us from using a more narrowly-defined type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Any conversation was surrounding a previous function that is now split into getTaskEndTime and getErrorSeq which no long have Any as its return type.

with Matchers
with MockSugar
with TableDrivenPropertyChecks {
behavior of "TesAsyncBackendJobExecutionActor"

type StandardAsyncRunInfo = Any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reference the types in TesAsyncBackendJobExecutionActor here rather than defining these again?

@@ -70,6 +82,15 @@ class TesAsyncBackendJobExecutionActorSpec
content = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are great - nice mocking! I would love to see more, though, particularly around the polling (and then not polling) behavior, and how we respond to various forms of the FULL_VIEW from TES. You can set up Actor-level testing that inspects the metadata stored by actions, which would be really useful here.

@broadinstitute broadinstitute deleted a comment from jgainerdewar May 13, 2024
@kpierre13
Copy link
Contributor Author

@jgainerdewar Accidentally deleted your comment in the new IJ UI. Reposting here:

What about tasks that error out? Do we need to store cost data there the same way we do complete and cancelled tasks?

@kpierre13
Copy link
Contributor Author

@jgainerdewar Accidentally deleted your comment in the new IJ UI. Reposting here:

What about tasks that error out? Do we need to store cost data there the same way we do complete and cancelled tasks?

I think we should include cost data for errored/failed tasks so the information is available and can be stored. From my understanding, task that doesn't immediately fail/error can still incur cost that should still be calculated.

Copy link
Contributor

@aednichols aednichols left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a change based on new information that makes our lives easier.

The Funnel source code search was much less esoteric than one might guess, I searched for "tes" "size_bytes" and it was the third Google result.

kpierre13 and others added 5 commits June 6, 2024 09:50
…/TesTask.scala

Co-authored-by: Adam Nichols <anichols@broadinstitute.org>
…/TesResponseJsonFormatter.scala

Co-authored-by: Adam Nichols <anichols@broadinstitute.org>
…/TesResponseJsonFormatter.scala

Co-authored-by: Adam Nichols <anichols@broadinstitute.org>
@kpierre13 kpierre13 merged commit f2ade5b into develop Jun 7, 2024
37 checks passed
@kpierre13 kpierre13 deleted the wm-2570-task-per-column-kp branch June 7, 2024 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants