nychealth/COVID-consensus-genomes-pangolin-analysis

Lineage assignments using phylogenetic placement/UShER is superior to machine learning methods

This repository contains the New York City Public Health Laboratory local datasets for the paper "Lineage assignments using phylogenetic placement/UShER is superior to machine learning methods." The COVID-19 samples were collected between August 1, 2021 and November 30, 2021.

Files:

  • pipeline.txt - Overview of pipeline and associated scripts

Data files:

The FASTA files can be used directly as input to any software that accepts multi-FASTA format, such as pangolin or Nextclade. A multi-FASTA file simply lists unaligned sequences one after another; it should not be confused with a multiple sequence alignment (MSA), in which the sequences are aligned against each other.

  • nyc_failed_aug-nov2021.fasta - Multi-FASTA file containing 469 SARS-CoV-2 consensus genome sequences with more than 10% N (ambiguous) bases.
  • nyc_passed_aug-nov2021.fasta.xz - Compressed multi-FASTA file containing SARS-CoV-2 consensus genome sequences with less than 10% N (ambiguous) bases.
    • To decompress with the xz-utils package, run: unxz nyc_passed_aug-nov2021.fasta.xz
  • ca_nyc_mafft.alignment.fasta.xz - Compressed MSA created by MAFFT
    • To decompress with the xz-utils package, run: unxz ca_nyc_mafft.alignment.fasta.xz
  • pango_consensus_ca_nyc.aligned_sept8_2023_masked_maple034inferenceJC_noAncertaintyAssignment_reRooted_nexusTree.tree - MAPLE tree used for lineage assignment validation
  • 60k_public_meta.tsv - NCBI metadata for 2021 global dataset
  • 2022-global-episet.pdf - GISAID supplemental table to access 2022 global dataset
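
The pass/fail split described above can be illustrated with a short Python sketch. It is not part of the repository's pipeline; it assumes that "N" content means the count of N characters divided by the sequence length, and that a sequence fails when this fraction is strictly greater than 0.10 (how a sequence at exactly 10% is handled is not stated here). The function names `read_fasta`, `open_fasta`, `n_fraction`, and `exceeds_threshold` are hypothetical.

```python
import io
import lzma
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a multi-FASTA text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def open_fasta(path: str) -> TextIO:
    """Open a plain or xz-compressed FASTA file in text mode."""
    if path.endswith(".xz"):
        return lzma.open(path, "rt")
    return open(path, "r")


def n_fraction(sequence: str) -> float:
    """Fraction of characters that are N (assumed definition of N content)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


def exceeds_threshold(handle: TextIO, threshold: float = 0.10) -> Dict[str, bool]:
    """Map each header to True when its N fraction is above the threshold."""
    return {header: n_fraction(seq) > threshold for header, seq in read_fasta(handle)}


# Toy input (not real data): sampleA has 2 N in 10 bases, sampleB has 1 N in 20.
demo = io.StringIO(">sampleA\nACGTNNACGT\n>sampleB\nACGTACGTAC\nGTACGTACGN\n")
flags = exceeds_threshold(demo)
```

To run the same check on a repository file, something like `exceeds_threshold(open_fasta("nyc_passed_aug-nov2021.fasta.xz"))` should work without decompressing the file first, since Python's standard `lzma` module reads .xz archives directly.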

Script files:

  • compareLineages.py - Python script to compare pangolin lineages to MAPLE tree
  • comparison_script_w_ami.py - Python script to calculate Adjusted Mutual Information
  • snp_scorpio-comparisons.sh - Bash script to preprocess SNP distance matrix
  • snp_scorpio-comparisons.Rmd - R Markdown document to analyze SNP distance matrix data and scorpio
  • tables_and_violin_plots.R - R script that produces the tables and the violin plots comparing genome coverage with lineage reassignment
  • sankey_plots.R - R script that produces Sankey plots of lineage stability
  • Files cataloguing the new lineages introduced between pangolin versions during the study period, which we treated as permitted changes
    • expected.13.14.tsv
    • expected.14.15.tsv
    • expected.15.16.tsv
    • expected.2021-11-09_v1.2.133.tsv
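
The idea of "permitted changes" can be sketched as follows. This is an illustration only, not the logic of compareLineages.py or the other scripts listed above. It assumes that each permitted change can be represented as an (old lineage, new lineage) pair; the actual column layout of the expected.*.tsv files is not described here, so loading them is left out. The function name `classify_changes` and the toy assignments are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def classify_changes(
    old: Dict[str, str],
    new: Dict[str, str],
    permitted: Set[Tuple[str, str]],
) -> Counter:
    """Count how each sample's lineage call changed between two versions.

    old, new  -- sample ID -> lineage assigned by the older / newer version
    permitted -- (old lineage, new lineage) pairs treated as expected changes
    """
    counts = Counter()
    for sample, old_lineage in old.items():
        if sample not in new:
            # Sample has no call in the newer version.
            counts["missing"] += 1
            continue
        new_lineage = new[sample]
        if new_lineage == old_lineage:
            counts["stable"] += 1
        elif (old_lineage, new_lineage) in permitted:
            counts["permitted"] += 1
        else:
            counts["unexpected"] += 1
    return counts


# Toy assignments for illustration only (not taken from the study data).
old_calls = {"s1": "B.1.617.2", "s2": "AY.4", "s3": "B.1.1.7"}
new_calls = {"s1": "AY.25", "s2": "AY.4", "s3": "B.1"}
permitted_pairs = {("B.1.617.2", "AY.25")}

result = classify_changes(old_calls, new_calls, permitted_pairs)
```

Separating permitted from unexpected changes keeps reassignments caused by newly designated lineages from being counted as instability of the assignment method itself.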

Supplemental files:

  • Supplementary_table_1.csv
  • Supplementary_table_2.csv
  • Supplementary_table_3.csv
  • Supplementary_table_4.csv
  • Supplementary_file_public_60k_pusher_nohash.html - Interactive HTML showing the lineage reassignments across different versions of pUSHER
  • Supplementary_file_public_60k_plearn_nohash.html - Interactive HTML showing the lineage reassignments across different versions of pangoLEARN

Authors

  • Adriano de Bernardi Schneider
  • Michelle Su
  • Angie S. Hinrichs
  • Jade Wang
  • Helly Amin
  • John Bell
  • Debra A. Wadford
  • Áine O'Toole
  • Emily Scher
  • Marc D. Perry
  • Yatish Turakhia
  • Nicola De Maio
  • Andrew Rambaut
  • Scott Hughes
  • Russ Corbett-Detig
