Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Do we still need the snp_clusters param for diagnostics rule? #1017

Open
joverlee521 opened this issue Oct 17, 2022 · 3 comments
Labels
source: discussion forum Issue mentioned on Nextstrain Discussions

Comments

@joverlee521
Copy link
Contributor

Context

@huddlej's response to a discussion question prompted me to take a closer look at the diagnostics rule/script.
I wanted to understand how the hard-coded params were used in the script:

params:
clock_filter = 20,
snp_clusters = 1,
contamination = 5,

I saw that the snp_clusters param is only relevant when there is a snp_clusters column within the input metadata file.

if "snp_clusters" in metadata.columns:
snp_clusters = np.array([float(x) if isfloat(x) else np.nan for x in metadata.snp_clusters])
else:
snp_clusters = np.zeros(len(metadata), dtype=bool)

This input metadata file is generated by the join-metadata-and-clades script, which does not add a snp_clusters column. Only the following columns from Nextclade are included in the metadata file:

column_map = {
"clade": "Nextstrain_clade",
"Nextclade_pango": "Nextclade_pango",
"totalMissing": "missing_data",
"totalSubstitutions": "divergence",
"totalNonACGTNs": "nonACGTN",
"privateNucMutations.totalUnlabeledSubstitutions": "rare_mutations",
"privateNucMutations.totalReversionSubstitutions": "reversion_mutations",
"privateNucMutations.totalLabeledSubstitutions": "potential_contaminants",
"qc.overallScore": "QC_overall_score",
"qc.overallStatus": "QC_overall_status",
"qc.missingData.status": "QC_missing_data",
"qc.mixedSites.status": "QC_mixed_sites",
"qc.privateMutations.status": "QC_rare_mutations",
"qc.snpClusters.status": "QC_snp_clusters",
"qc.frameShifts.status": "QC_frame_shifts",
"qc.stopCodons.status": "QC_stop_codons",
"frameShifts": "frame_shifts",
"deletions": "deletions",
"insertions": "insertions",
"substitutions": "substitutions",
"aaSubstitutions": "aaSubstitutions"
}

It seems like the snp_clusters param is unused unless users add the column to their own metadata outside the workflow.
snp_clusters used to be included as a column in the ncov-ingest produced metadata file, but it has been removed. I think the presence of an unused param here can cause confusion for users (it definitely confused me!).

Possible solution

I'm not familiar with the diagnosis rule so I wanted to ask what would be an appropriate action here.

  1. Add a comment to the diagnostics rule that the snp_clusters is only kept for backwards compatibility, but it is not used in the latest version of the workflow.
  2. Just remove the snp_clusters param from the diagnostics rule.
  3. Update the diagnostics script to check QC_snp_clusters != "bad" instead of the number of SNP clusters.
@joverlee521 joverlee521 added the source: discussion forum Issue mentioned on Nextstrain Discussions label Oct 17, 2022
@j23414
Copy link

j23414 commented Oct 17, 2022

I lean toward possible solution 2, but I'm probably missing the historical context.

@emmahodcroft
Copy link
Member

My guess is that this was removed when we tried to pare down what we were bringing in from Nextclade to keep it to only things that were of most use. I imagine it won't come back and it's hard for me to imagine that users are supplying their own value here. However @rneher & @corneliusroemer might be able to provide a bit more context just in case we're missing anything.

@corneliusroemer
Copy link
Member

This predates me, I can't really comment here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
source: discussion forum Issue mentioned on Nextstrain Discussions
Projects
No open projects
Development

No branches or pull requests

4 participants