Low-coverage whole genome sequencing for a highly selective cohort of severe COVID-19 patients

This repository contains the scripts used in Santos et al., "Low-coverage whole genome sequencing for a highly selective cohort of severe COVID-19 patients". Supporting data, including validation files and patient clinical histories, are archived in our Figshare collection. Genetic data for the patient cohort are available in our European Genome-phenome Archive study EGAS00001007573. This repository will remain open for reproducibility support and issue reports.

Abstract

Background

Despite advances in the identification of genetic markers associated with severe COVID-19 symptoms, the full genetic characterisation of the disease remains elusive. Imputation of low-coverage whole genome sequencing has emerged as a competitive method to study such disease-related genetic markers, as it enables genotyping of most common genetic variants used for genome-wide association studies. This study aims to explore the potential use of imputation in low-coverage whole genome sequencing for a highly selected severe COVID-19 patient cohort.

Findings

We generated an imputed dataset of 79 variant call format (VCF) patient files using the GLIMPSE1 tool, each containing, on average, 9.5 million single nucleotide variants. The validation assessment of imputation accuracy yielded a squared Pearson correlation of approximately 0.97 across sequencing platforms, showing that GLIMPSE1 can be used to confidently impute variants with minor allele frequency up to approximately 2% in Spanish ancestry individuals. We conducted a comprehensive analysis of the patient cohort, examining hospitalisation and intensive care utilisation, sex- and age-based differences, and clinical phenotypes using a standardised set of medical terms specifically developed to characterise severe COVID-19 symptoms for this cohort.

Conclusion

This dataset highlights the utility and accuracy of low-coverage whole genome sequencing imputation in the study of COVID-19 severity, setting a precedent for other applications in resource-constrained environments linked to comprehensive analyses of genetic components for various complex diseases. The methods and findings presented here may be leveraged in future genomic projects, providing vital insights for health challenges like COVID-19.

Software implementation

All the source code used to generate the results and figures in the paper is in the scripts folder. See the README.md files in each directory for a full description of each figure.

Setup

Getting the code

You can download a copy of all the files in this repository by cloning this git repository.

git clone https://github.com/renatosantos98/GLIMPSE-low-coverage-WGS-imputation.git

A copy of the repository is also archived at doi.org/10.25452/figshare.plus.21679799.

Dependencies

You'll need a working Python environment to run the code. We recommend you set up your environment through Anaconda, which provides the conda package manager.

Run the following command in the main repository folder (where environment.yml is located) to create a conda environment and install all required dependencies in it.

conda env create -f environment.yml
conda activate glimpse
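
Before running the pipeline, it can help to confirm that the environment is active and that the expected command-line tools are on the PATH. The following is a minimal sketch, not part of the repository: the helper names `env_is_active` and `missing_tools` are hypothetical, and the tool names passed to `missing_tools` in the usage note are assumptions about what environment.yml installs. `CONDA_DEFAULT_ENV` is the variable conda sets to the name of the active environment.

```shell
#!/usr/bin/env bash

# Succeed only if the named conda environment is the active one.
# conda exports CONDA_DEFAULT_ENV when an environment is activated.
env_is_active() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

# Print the name of every listed tool that is not found on the PATH.
# Prints nothing when all tools are available.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, `env_is_active glimpse || echo "Run: conda activate glimpse"` reports an inactive environment, and `missing_tools bcftools samtools` lists whichever of those (assumed) tools cannot be found.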

Input data requirements

The data required as input for this pipeline consists of .cram case and validation files, stored inside a directory named bam.

The 1000 Genomes reference panel will be retrieved and set up by the 1_setup.sh script.
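
The input layout described above can be checked before starting a long run. This is a minimal sketch under stated assumptions: the function names `count_crams` and `check_inputs` are hypothetical and not part of the repository scripts, and the check only looks for files ending in .cram directly inside the directory. It does not validate the files themselves.

```shell
#!/usr/bin/env bash

# Count the .cram files directly inside a directory (default: bam).
count_crams() {
    local dir="${1:-bam}"
    find "$dir" -maxdepth 1 -type f -name '*.cram' | wc -l | tr -d ' '
}

# Succeed only if the directory exists and holds at least one .cram file.
check_inputs() {
    local dir="${1:-bam}"
    if [ ! -d "$dir" ]; then
        echo "Missing input directory: $dir" >&2
        return 1
    fi
    local n
    n="$(count_crams "$dir")"
    if [ "$n" -eq 0 ]; then
        echo "No .cram files found in $dir" >&2
        return 1
    fi
    echo "Found $n .cram file(s) in $dir"
}
```

Calling `check_inputs` with no argument from the main repository folder checks the bam directory.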

Running the code

All scripts were designed to be run from the main repository folder. To reproduce the data generated in the paper, run the scripts in the following order, using the syntax shown:

bash scripts/1_setup.sh
bash scripts/2_gl_calling.sh
bash scripts/3_glimpse_impute_parallel.sh
bash scripts/4_vcf_filtering.sh
bash scripts/5_glimpse_concordance.sh
bash scripts/6_pca.sh

See the README.md files in the scripts directory for a full description of each script and required files.
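
Because each step depends on the output of the previous one, it may be convenient to run the scripts through a small wrapper that stops at the first failure. The sketch below is an illustration only: `run_pipeline` is a hypothetical helper, not something the repository provides, and it assumes each script signals failure through a non-zero exit status.

```shell
#!/usr/bin/env bash

# Run each script in the order given; stop at the first one that fails.
run_pipeline() {
    local script
    for script in "$@"; do
        echo "Running $script"
        if ! bash "$script"; then
            echo "Failed: $script" >&2
            return 1
        fi
    done
}
```

From the main repository folder, this could be invoked as `run_pipeline scripts/1_setup.sh scripts/2_gl_calling.sh scripts/3_glimpse_impute_parallel.sh scripts/4_vcf_filtering.sh scripts/5_glimpse_concordance.sh scripts/6_pca.sh`.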

License

All source code is made available under an MIT license. You can freely use and modify the code, without warranty. See LICENSE.md for the full license text. The authors reserve the rights to the article content, which is currently submitted for publication.
