time accumulation from profile is inaccurate #165

nsheff opened this issue Aug 12, 2019 · 6 comments

nsheff commented Aug 12, 2019

In a recent run I saw this:

> `Time`	4:14:23	PEPPRO	_RES_
> `Success`	08-09-14:18:00	PEPPRO	_RES_

##### [Epilogue:]
*   Total elapsed time:  11:18:17.100000

This was a single run, so why the huge discrepancy between 'time' and 'total elapsed time', since it's not coming from a previous run?

I think it's because we're double-counting piped processes. The bowtie2 command is piped to samtools; these are different hashes, but they have the same time:

413534	13e1541610	24	2:16:26.790000	0.8886	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413533	ea03362230	23	2:16:17.120000	0.0046	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413532	603f76945d	22	2:15:56.990000	3.5303	bowtie2	lock.aligned_hg38__GSM3618131_sort.bam
440203	cfc45bc51c	28	1:15:05.910000	0.888	samtools	lock.aligned_hg38__GSM3618131_sort_dups.bam
440201	c1225bc513	26	1:14:14.460000	3.5274	bowtie2	lock.aligned_hg38__GSM3618131_sort_dups.bam
440202	ea03362230	27	1:14:14.380000	0.0046	samtools	lock.aligned_hg38__GSM3618131_sort_dups.bam
411725	159cf5da58	18	0:07:52.500000	1.6758	fastq_pair	lock.fastq__GSM3618131_R2_trimmed_dups.fastq.paired.fq
409959	a492c7578f	14	0:06:47.670000	1.6388	fastq_pair	lock.fastq__GSM3618131_R2_trimmed.fastq.paired.fq
448047	04cc4259c8	55	0:01:45.390000	0.0143	preseq	lock.QC_hg38__GSM3618131_preseq_yield.txt


It must be duplicating that.
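Summing just the three rows of the first pipe gives roughly 6:48:41, about three times the ~2:16:27 wall-clock time of that pipe, which is consistent with each piped subprocess being counted separately.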

Perhaps hashes should be done on commands rather than on individual steps in a pipeline?
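A minimal sketch of what that could look like, assuming the profile hash is currently computed per spawned subprocess; the command string and arguments below are made up for illustration:

```python
import hashlib

def command_hash(cmd):
    """Hash the full command string, pipes included, so every subprocess
    spawned for one piped command shares a single profile hash."""
    return hashlib.md5(cmd.encode("utf-8")).hexdigest()[:10]

# bowtie2 and both samtools stages would land under the same hash, so the
# accumulator could collapse them into one entry instead of summing them.
cmd = "bowtie2 -x hg38 -U reads.fq | samtools view -b - | samtools sort -o aligned_hg38_sort.bam"
print(command_hash(cmd))
```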


nsheff commented Aug 12, 2019

In this example, processes 22, 23, and 24 were run simultaneously.

26, 27, and 28 likewise.

They were piped streams. These two piped commands are getting triple-counted, hence the almost tripled report of time spent.

nsheff added the bug label Aug 12, 2019
stolarczyk commented

Good catch. Yeah, it looks like the subcommands in the piped command are timed with an offset equal to the current runtime for the entire pipe, which is wrong.

> Perhaps hashes should be done on commands rather than on individual steps in a pipeline?

Good idea. It should be really easy as well.


nsheff commented Aug 13, 2019

What about hashing the actual pipeline run as well? You could hash the time value in seconds since the unix epoch...

That way, you'd know whether two commands were run in the same run or in different runs.
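A rough sketch of that idea (a hypothetical helper, not existing pypiper code):

```python
import hashlib
import time

def run_identifier(pipeline_name):
    """Hypothetical run ID: hash the pipeline name together with the start
    time in seconds since the unix epoch, so profile rows from different
    runs of the same command can be told apart."""
    start_epoch = str(int(time.time()))
    return hashlib.md5((pipeline_name + start_epoch).encode("utf-8")).hexdigest()[:10]
```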


nsheff commented Aug 13, 2019

> Good idea. It should be really easy as well.

Then how do you calculate the time: by taking the maximum value for the hash? Assuming that the maximum runtime across the piped commands is the time for all of those piped commands, since they were running simultaneously? Is that true, though?

stolarczyk commented

The reported time is not the max but the last one. However, the actual pipe order is not preserved in the profile, so in this case:

413534	aaaaaaaa	24	2:16:26.790000	0.8886	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413533	aaaaaaaa	23	2:16:17.120000	0.0046	samtools	lock.aligned_hg38__GSM3618131_sort.bam
413532	aaaaaaaa	22	2:15:56.990000	3.5303	bowtie2	lock.aligned_hg38__GSM3618131_sort.bam

the returned time for the pipe is 2:15:56.990000, whereas the correct answer is 2:16:26.790000.

So what the current approach is missing (I just realized) is ordering the rows that are assigned the same hash by the command ID before the last one is returned.
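Something like this, as a minimal sketch rather than the actual implementation; the row layout and seconds units are assumptions:

```python
from collections import defaultdict

def elapsed_per_pipe(rows):
    """Collapse profile rows that share a hash into a single elapsed time by
    ordering on the command ID and keeping the last (highest-ID) row.
    Each row is assumed to be (pid, cmd_hash, cmd_id, elapsed_seconds)."""
    by_hash = defaultdict(list)
    for pid, cmd_hash, cmd_id, elapsed in rows:
        by_hash[cmd_hash].append((cmd_id, elapsed))
    # max over (cmd_id, elapsed) tuples picks the entry with the highest command ID
    return {h: max(entries)[1] for h, entries in by_hash.items()}

# The three rows of the example pipe above, with times converted to seconds:
rows = [
    (413534, "aaaaaaaa", 24, 8186.79),  # samtools, 2:16:26.790000
    (413533, "aaaaaaaa", 23, 8177.12),  # samtools, 2:16:17.120000
    (413532, "aaaaaaaa", 22, 8156.99),  # bowtie2,  2:15:56.990000
]
print(elapsed_per_pipe(rows))  # {'aaaaaaaa': 8186.79} -> 2:16:26.790000
```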

> What about hashing the actual pipeline run as well? You could hash the time value in seconds since the unix epoch...

That's a good idea.


nsheff commented Aug 13, 2019

Another problem with accumulating time in this way is that it doesn't account for any time that is not spent in the 'run' command, right? I think we discussed this...

But what about time spent in, for example, result reporting functions or follow functions? These can actually be substantial.
