time accumulation from profile is inaccurate #165
in this example, processes 22, 23, and 24 were run simultaneously, and likewise 26, 27, and 28. they were piped streams. these two commands are getting triple counted, hence the almost tripled report of time spent.
good catch. Yeah, it looks like the subcommands in the piped command are timed with an offset equal to the current runtime for the entire pipe, which is wrong.
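To make the triple counting concrete, here is a minimal sketch. The elapsed values are hypothetical (not taken from the profile in this issue), and it assumes each process in a pipe reports roughly the runtime of the whole pipe:

```python
from datetime import timedelta

# Hypothetical elapsed times reported for three processes that ran
# simultaneously in one pipe. Each one reports the whole pipe's runtime.
pipe_elapsed = [
    timedelta(minutes=10),
    timedelta(minutes=10),
    timedelta(minutes=10),
]

# Summing every row counts the same wall-clock interval three times.
naive_total = sum(pipe_elapsed, timedelta())

# Counting the pipe once gives the time that actually passed.
wall_clock = max(pipe_elapsed)

print(naive_total, wall_clock)
```

Any fix has to collapse the rows of one pipe into a single value before accumulating.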
good idea. Should be really easy, as well
what about hashing the actual pipeline run as well? you could hash the that way, you'd know if two commands were run in the same run or in a different run.
then how do you calculate the time, by taking the maximum value for the hash? that assumes the maximum runtime across the piped commands is the time for all the piped commands that were running simultaneously. is that true though?
the reported time is not the max but the last one. However, the actual pipe order is not preserved in the profile, so in this case:
the returned time for the pipe is 2:15:56.990000. Whereas the correct answer is 2:16:26.790000. So what the current approach is missing (I just realized) is ordering the rows that are assigned the same hash by the command ID before the last one is returned.
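A rough sketch of that ordering fix, assuming each profile row can be read as a `(command_id, pipe_hash, elapsed)` tuple. That layout, the function name `pipe_runtimes`, the command IDs, the hash value, and the first elapsed time are all made up for illustration; only the two times quoted above come from this thread:

```python
from datetime import timedelta

def pipe_runtimes(rows):
    """Return one elapsed time per pipe hash: the row with the highest command ID."""
    latest = {}
    # Sort by command ID first, so the last value written for each hash
    # belongs to the last command in the pipe, whatever the file order was.
    for command_id, pipe_hash, elapsed in sorted(rows, key=lambda r: r[0]):
        latest[pipe_hash] = elapsed
    return latest

# Rows deliberately out of order, as they may be in the profile.
rows = [
    (24, "abc", timedelta(hours=2, minutes=16, seconds=26, milliseconds=790)),
    (22, "abc", timedelta(hours=2, minutes=15, seconds=30)),  # hypothetical value
    (23, "abc", timedelta(hours=2, minutes=15, seconds=56, milliseconds=990)),
]

per_pipe = pipe_runtimes(rows)
total = sum(per_pipe.values(), timedelta())
print(per_pipe["abc"], total)
```

Without the sort, the last row in file order (command 23 here) would win and the pipe would be reported as 2:15:56.990000.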
that's a good idea
another problem with accumulating time in this way is that it doesn't account for any time that is not spent in the 'run' command, right? I think we discussed this... but what about time spent in, for example, result reporting functions or follow functions? These can actually be substantial.
in a recent run I saw this:
This was a single run. so why the huge discrepancy between 'time' and 'total elapsed time', since it's not coming from a previous run?
I think it's because we're double-counting piped processes. So the bowtie2 command is piped to samtools. these are different hashes, but they have the same time:
it must be duplicating that.
perhaps hashes should be done on commands rather than on individual steps in a pipeline?
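A sketch of the difference between the two hashing schemes. The helper names, the use of SHA-256, and the command arguments are assumptions for illustration, not how the project computes its hashes:

```python
import hashlib

def command_key(full_command):
    """Hash a whole command string (all piped steps together)."""
    # Collapse whitespace so cosmetic differences do not change the key.
    normalized = " ".join(full_command.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def step_keys(full_command):
    """Hash each piped step separately.

    Naive split on '|': it does not handle pipes inside quotes.
    """
    return [command_key(step) for step in full_command.split("|")]

# Illustrative command; the file names are made up.
cmd = "bowtie2 -x genome reads.fq | samtools view -b - > out.bam"

# Per-step hashing yields two keys that carry the same elapsed time,
# so summing over keys counts the pipe twice.
print(len(set(step_keys(cmd))))

# Per-command hashing yields one key, so the pipe is counted once.
print(command_key(cmd))
```

With one key per command, the bowtie2 and samtools rows would collapse into a single entry before times are accumulated.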