time accumulation from profile is inaccurate #165

nsheff opened this issue Aug 12, 2019 · 6 comments

nsheff commented Aug 12, 2019

In a recent run I saw this:

> `Time`	4:14:23	PEPPRO	_RES_
> `Success`	08-09-14:18:00	PEPPRO	_RES_

##### [Epilogue:]
*   Total elapsed time:  11:18:17.100000

This was a single run, so why the huge discrepancy between 'time' and 'total elapsed time', since it's not coming from a previous run?

I think it's because we're double-counting piped processes. The bowtie2 command is piped to samtools; these are different hashes, but they have the same time:

413534	13e1541610	24	2:16:26.790000	0.8886	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413533	ea03362230	23	2:16:17.120000	0.0046	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413532	603f76945d	22	2:15:56.990000	3.5303	bowtie2	lock.aligned_hg38__GSM3618131_sort.bam
440203	cfc45bc51c	28	1:15:05.910000	0.888	samtools	lock.aligned_hg38__GSM3618131_sort_dups.bam
440201	c1225bc513	26	1:14:14.460000	3.5274	bowtie2	lock.aligned_hg38__GSM3618131_sort_dups.bam
440202	ea03362230	27	1:14:14.380000	0.0046	samtools	lock.aligned_hg38__GSM3618131_sort_dups.bam
411725	159cf5da58	18	0:07:52.500000	1.6758	fastq_pair	lock.fastq__GSM3618131_R2_trimmed_dups.fastq.paired.fq
409959	a492c7578f	14	0:06:47.670000	1.6388	fastq_pair	lock.fastq__GSM3618131_R2_trimmed.fastq.paired.fq
448047	04cc4259c8	55	0:01:45.390000	0.0143	preseq	lock.QC_hg38__GSM3618131_preseq_yield.txt


It must be duplicating that.
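Summing just the three rows of the first pipe gives roughly 6:48:41, about three times the ~2:16:27 wall-clock time of that pipe, which is consistent with each piped subprocess being counted separately.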

Perhaps hashes should be done on commands rather than on individual steps in a pipeline?
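A minimal sketch of what that could look like, assuming the profile hash is currently computed per spawned subprocess; the command string and arguments below are made up for illustration:

```python
import hashlib

def command_hash(cmd):
    """Hash the full command string, pipes included, so every subprocess
    spawned for one piped command shares a single profile hash."""
    return hashlib.md5(cmd.encode("utf-8")).hexdigest()[:10]

# bowtie2 and both samtools stages would land under the same hash, so the
# accumulator could collapse them into one entry instead of summing them.
cmd = "bowtie2 -x hg38 -U reads.fq | samtools view -b - | samtools sort -o aligned_hg38_sort.bam"
print(command_hash(cmd))
```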


nsheff commented Aug 12, 2019

In this example, processes 22, 23, and 24 were run simultaneously.

26, 27, and 28 likewise.

They were piped streams. These two piped commands are getting triple-counted, hence the almost tripled report of time spent.

nsheff added the bug label Aug 12, 2019
stolarczyk commented

Good catch. Yeah, it looks like the subcommands in the piped command are timed with an offset equal to the current runtime for the entire pipe, which is wrong.

> Perhaps hashes should be done on commands rather than on individual steps in a pipeline?

Good idea. It should be really easy as well.


nsheff commented Aug 13, 2019

What about hashing the actual pipeline run as well? You could hash the time value in seconds since the unix epoch...

That way, you'd know whether two commands were run in the same run or in different runs.
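A rough sketch of that idea (a hypothetical helper, not existing pypiper code):

```python
import hashlib
import time

def run_identifier(pipeline_name):
    """Hypothetical run ID: hash the pipeline name together with the start
    time in seconds since the unix epoch, so profile rows from different
    runs of the same command can be told apart."""
    start_epoch = str(int(time.time()))
    return hashlib.md5((pipeline_name + start_epoch).encode("utf-8")).hexdigest()[:10]
```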


nsheff commented Aug 13, 2019

> Good idea. It should be really easy as well.

Then how do you calculate the time: by taking the maximum value for the hash? Assuming that the maximum runtime across the piped commands is the time for all of those piped commands, since they were running simultaneously? Is that true, though?

stolarczyk commented

The reported time is not the max but the last one. However, the actual pipe order is not preserved in the profile, so in this case:

413534	aaaaaaaa	24	2:16:26.790000	0.8886	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413533	aaaaaaaa	23	2:16:17.120000	0.0046	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413532	aaaaaaaa	22	2:15:56.990000	3.5303	bowtie2	lock.aligned_hg38__GSM3618131_sort.bam

the returned time for the pipe is 2:15:56.990000, whereas the correct answer is 2:16:26.790000.

So what the current approach is missing (I just realized) is ordering the rows that are assigned the same hash by the command ID before the last one is returned.
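Something like this, as a minimal sketch rather than the actual implementation; the row layout and seconds units are assumptions:

```python
from collections import defaultdict

def elapsed_per_pipe(rows):
    """Collapse profile rows that share a hash into a single elapsed time by
    ordering on the command ID and keeping the last (highest-ID) row.
    Each row is assumed to be (pid, cmd_hash, cmd_id, elapsed_seconds)."""
    by_hash = defaultdict(list)
    for pid, cmd_hash, cmd_id, elapsed in rows:
        by_hash[cmd_hash].append((cmd_id, elapsed))
    # max over (cmd_id, elapsed) tuples picks the entry with the highest command ID
    return {h: max(entries)[1] for h, entries in by_hash.items()}

# The three rows of the example pipe above, with times converted to seconds:
rows = [
    (413534, "aaaaaaaa", 24, 8186.79),  # samtools, 2:16:26.790000
    (413533, "aaaaaaaa", 23, 8177.12),  # samtools, 2:16:17.120000
    (413532, "aaaaaaaa", 22, 8156.99),  # bowtie2,  2:15:56.990000
]
print(elapsed_per_pipe(rows))  # {'aaaaaaaa': 8186.79} -> 2:16:26.790000
```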

> What about hashing the actual pipeline run as well? You could hash the time value in seconds since the unix epoch...

That's a good idea.


nsheff commented Aug 13, 2019

Another problem with accumulating time in this way is that it doesn't account for any time that is not spent in the 'run' command, right? I think we discussed this...

But what about time spent in, for example, result reporting functions or follow functions? These can actually be substantial.
