Summary file version inconsistencies #240

Open
victorlin opened this issue Jan 1, 2021 · 3 comments

Comments

@victorlin
Collaborator

Examples for ERR2756788.

Original summary header line [S3 link]:

SUMZER_COMMENT=sra=ERR2756788,genome=cov3ma,date=200607-01:47;

New summary header line [S3 link]:

readlength=150;SUMZER_COMMENT=sra=ERR2756788,genome=cov3ma,version=200818,date=200817-21:05;

New psummary contents [S3 link]:

sra=ERR2756788;SUMZER_COMMENT=sra=ERR2756788,genome=protref5,date=200831-02:23,type=protein;totalalns=77449;readlength=141;truncated=no;
sra=ERR2756788;famcvg=AAUWAWAAUAAAWWAAOAAWAAAAO;fam=Coronaviridae;score=100;pctid=71;alns=16477;avgcols=47;
sra=ERR2756788;famcvg=auwa_aoa_awwwu_aoowmmamo_;fam=Dicistroviridae;score=100;pctid=68;alns=663;avgcols=47;
...
sra=ERR2756788;gencvg=_.___.wwoomUooUUWmwwaaoa:;gen=Coronaviridae.S;score=100;pctid=66;alns=1607;avgcols=45;
sra=ERR2756788;gencvg=AWmUAWWAmAWAUUWWAUAWAAAAU;gen=Coronaviridae._prot1;score=100;pctid=73;alns=11375;avgcols=48;

2 questions:

  1. Can the new summary header line be arranged to start with SUMZER_COMMENT= as it was originally?
  2. For the new psummary, can the sra=ERR2756788; be removed from the beginning of every line?

I know these files have already been uploaded, so this is more of a note for any future reprocessing.
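To make the two requests concrete, here is a minimal Python sketch. It assumes only what the examples above show: each line is a run of `key=value;` fields, and the `SUMZER_COMMENT` value itself contains `=` and `,` but no `;`. The helper names `parse_fields`, `comment_first`, and `strip_sra_prefix` are hypothetical and are not part of the actual pipeline.

```python
def parse_fields(line):
    """Split a 'key=value;key=value;' line into ordered (key, value) pairs."""
    pairs = []
    for field in line.strip().split(";"):
        if not field:
            continue  # skip the empty piece after the trailing ';'
        # Split on the first '=' only, so SUMZER_COMMENT keeps its nested sra=...,genome=... value.
        key, _, value = field.partition("=")
        pairs.append((key, value))
    return pairs


def join_fields(pairs):
    """Rebuild a line from (key, value) pairs, keeping the trailing ';'."""
    return "".join(f"{key}={value};" for key, value in pairs)


def comment_first(line):
    """Question 1: move the SUMZER_COMMENT field to the front of a header line."""
    pairs = parse_fields(line)
    head = [p for p in pairs if p[0] == "SUMZER_COMMENT"]
    rest = [p for p in pairs if p[0] != "SUMZER_COMMENT"]
    return join_fields(head + rest)


def strip_sra_prefix(line):
    """Question 2: drop a leading sra= field, if one is present."""
    pairs = parse_fields(line)
    if pairs and pairs[0][0] == "sra":
        pairs = pairs[1:]
    return join_fields(pairs)


new_header = ("readlength=150;SUMZER_COMMENT=sra=ERR2756788,genome=cov3ma,"
              "version=200818,date=200817-21:05;")
print(comment_first(new_header))
```

Applied to the new summary header above, `comment_first` puts `SUMZER_COMMENT=` first and moves `readlength=150;` to the end of the line.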

@ababaian
Owner

ababaian commented Jan 1, 2021

  1. I think this is just a straight bug. The first line should start with SUMZER_COMMENT=; I totally agree.

  2. This is something @rcedgar and I have argued about. I originally disliked having sra=XXXX on every line quite a bit because it looks ugly, but in practice it's incredibly pragmatic, since we grep these files very often for spot checking and development. The same holds for anyone working with the summary files in bulk: having sra= on each line is very useful, and it solves some ugly problems with handling millions of files on a Linux file system. I'd opt to retain it.
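As a rough illustration of the grep argument, the Python sketch below filters a merged stream of psummary lines by family. Because every line carries its own `sra=` field, each hit still identifies its run after many files have been concatenated. The function names are hypothetical, and the `SRR_EXAMPLE` line is invented purely for the example.

```python
def fields(line):
    """Parse a 'key=value;' line into a dict, splitting each field on its first '='."""
    return dict(f.partition("=")[::2] for f in line.strip().split(";") if f)


def hits_for_family(lines, family):
    """Return (sra, score, alns) for every line whose fam= field equals `family`."""
    hits = []
    for line in lines:
        rec = fields(line)
        if rec.get("fam") == family:
            hits.append((rec.get("sra"), int(rec["score"]), int(rec["alns"])))
    return hits


# Lines as they might look after concatenating psummary files from several runs.
merged = [
    # Two lines copied from the ERR2756788 psummary quoted in this issue.
    "sra=ERR2756788;famcvg=AAUWAWAAUAAAWWAAOAAWAAAAO;fam=Coronaviridae;score=100;pctid=71;alns=16477;avgcols=47;",
    "sra=ERR2756788;famcvg=auwa_aoa_awwwu_aoowmmamo_;fam=Dicistroviridae;score=100;pctid=68;alns=663;avgcols=47;",
    # Hypothetical line from a second run, made up for illustration only.
    "sra=SRR_EXAMPLE;famcvg=_________________________;fam=Coronaviridae;score=12;pctid=55;alns=3;avgcols=40;",
]

for sra, score, alns in hits_for_family(merged, "Coronaviridae"):
    print(sra, score, alns)
```

Without the per-line `sra=` field, the same filter would have to track which file each line came from, which is the bookkeeping the prefix avoids.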

@victorlin
Collaborator Author

Good point about the grep. Would it be equally beneficial to have sra=XXXX on every line of the nucleotide summary files as well? That would make the two formats more consistent.

@rcedgar
Collaborator

rcedgar commented Jan 1, 2021

Yes, equally beneficial.
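For any future reprocessing, one way to add the prefix to a nucleotide summary is sketched below in Python. It assumes the first line is the header, that the run accession can be read from the `sra=` entry inside `SUMZER_COMMENT`, and that the header itself stays unprefixed so it can still start with `SUMZER_COMMENT=`. The function names are hypothetical, and the body line in the usage example is a placeholder, since no nucleotide body lines are quoted in this issue.

```python
def sra_from_header(header):
    """Return the accession from the sra= entry inside SUMZER_COMMENT, or None."""
    for field in header.strip().split(";"):
        key, _, value = field.partition("=")
        if key == "SUMZER_COMMENT":
            # The comment value is a comma-separated list of key=value entries.
            for entry in value.split(","):
                sub_key, _, sub_value = entry.partition("=")
                if sub_key == "sra":
                    return sub_value
    return None


def prefix_with_sra(lines):
    """Prefix every body line with 'sra=<accession>;', leaving the header untouched."""
    lines = [ln.rstrip("\n") for ln in lines]
    if not lines:
        return lines
    sra = sra_from_header(lines[0])
    if sra is None:
        return lines  # no accession found, so leave the file unchanged
    tag = f"sra={sra};"
    # Skip lines that already carry the prefix so the function is safe to re-run.
    body = [ln if ln.startswith(tag) else tag + ln for ln in lines[1:]]
    return [lines[0]] + body


summary = [
    "SUMZER_COMMENT=sra=ERR2756788,genome=cov3ma,date=200607-01:47;",
    "fam=Coronaviridae;score=100;",  # placeholder body line, not real output
]
print("\n".join(prefix_with_sra(summary)))
```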
