databricks-industry-solutions/db-omics

DBR Life Sciences GLOW

Use Case

This solution accelerator is aimed at computational biologists working with genomic data. It integrates genomic data with other relevant datasets and presents the findings through interactive dashboards for clinical scientists, geneticists, and other practitioners. The focus is on analyzing population-level trends and identifying samples associated with specific causal variants previously discovered through genome-wide association studies (GWAS). The GWAS Catalog is the source of those associations in this accelerator.
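As a rough illustration of what identifying samples associated with a specific variant can look like, the sketch below counts risk-allele copies per sample at a single site. It is plain Python, not the accelerator's actual code. The function name `find_carriers`, the sample IDs, and the genotypes are all hypothetical.

```python
def find_carriers(genotypes, risk_allele):
    """Map each carrier sample to its number of risk-allele copies (1 or 2).

    genotypes: dict of sample ID -> tuple of two allele strings (a diploid call).
    Samples with zero copies of the risk allele are left out of the result.
    """
    return {
        sample: alleles.count(risk_allele)
        for sample, alleles in genotypes.items()
        if risk_allele in alleles
    }


# Hypothetical genotypes at one variant site (illustration only, not real data).
toy_genotypes = {
    "sample_1": ("A", "G"),
    "sample_2": ("G", "G"),
    "sample_3": ("A", "A"),
}

carriers = find_carriers(toy_genotypes, "A")
print(carriers)
```

In the accelerator itself this kind of lookup would presumably run as a join between the variant table and the GWAS Catalog table rather than over an in-memory dictionary.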

We use Project Glow to ingest 1000 Genomes Project data from public cloud storage, compute sample-level and variant-level summary statistics, and build a database of human genetic variation alongside GWAS Catalog data. We then build an interactive dashboard for exploring genetic variation across populations and for identifying samples associated with specific risk alleles for particular traits or diseases.
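To make "sample-level and variant-level summary statistics" concrete, here is a minimal sketch of two such statistics computed over a toy genotype matrix. It deliberately uses plain Python instead of Glow or Spark, so it does not reflect Glow's API. The helper names (`classify_call`, `variant_stats`, `sample_stats`) and the matrix values are assumptions made for illustration, and biallelic diploid calls are assumed throughout.

```python
from collections import Counter


def classify_call(call):
    """Classify one diploid, biallelic call given as a pair of allele indices."""
    if None in call:
        return "missing"
    n_alt = sum(1 for allele in call if allele != 0)
    return ("hom_ref", "het", "hom_alt")[n_alt]


def variant_stats(calls):
    """Variant-level statistics: call rate and alternate-allele frequency."""
    called = [c for c in calls if None not in c]
    n_alleles = 2 * len(called)
    n_alt = sum(1 for c in called for allele in c if allele != 0)
    return {
        "call_rate": len(called) / len(calls),
        "alt_allele_freq": n_alt / n_alleles if n_alleles else None,
    }


def sample_stats(matrix, sample_index):
    """Sample-level statistics: genotype class counts across all variants."""
    return Counter(classify_call(row[sample_index]) for row in matrix)


# Hypothetical matrix: one row per variant, one call per sample.
# 0 = reference allele, 1 = alternate allele, None = missing.
toy_matrix = [
    [(0, 1), (1, 1), (None, None), (0, 0)],
    [(0, 0), (0, 1), (0, 0), (0, 0)],
]

first_variant = variant_stats(toy_matrix[0])
second_sample = sample_stats(toy_matrix, 1)
print(first_variant)
print(dict(second_sample))
```

At the scale of the 1000 Genomes data, the same aggregations would be expressed as distributed DataFrame operations. The per-variant and per-sample split shown here is the part that carries over.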

To create the dashboard, import the Lakeview dashboard JSON file, ./resources/1000 Genome Samples Dashboard.lvdash.json, into Lakeview.

Datasets

1000 Genomes Project Variant Data:

The 1000 Genomes Project began in 2008 with the aim of mapping human genetic variation. It entailed sequencing the genomes of more than 2,500 individuals from 26 diverse populations worldwide. The project sought to construct a detailed map of genetic distinctions within human DNA. By analyzing genomes from a broad and varied sample, it identified millions of genetic variants, including single nucleotide polymorphisms (SNPs) and structural variations like insertions, deletions, and copy number variations.

Data from the 1000 Genomes Project is a vital resource for researchers investigating human genetics. It has been pivotal in numerous studies exploring the genetic underpinnings of complex diseases and has contributed to our understanding of human evolution and population history.

GWAS Catalog

The GWAS Catalog, a collaboration between NHGRI and EMBL-EBI, is a curated repository of published genome-wide association studies (GWAS). It houses information on genetic variants, traits, and p-values from these studies, serving as a community resource for data access and analysis tools. The Catalog adheres to FAIR principles, encouraging study identifier citation and offering APIs for data access. Future plans include incorporating unpublished GWAS data. Overall, it is an invaluable asset for researchers exploring the genetic underpinnings of human traits and diseases.
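The sketch below shows one way the variant, trait, and p-value fields of a catalog download could be turned into a list of risk alleles for a trait. It assumes the tab-separated download uses the column names `DISEASE/TRAIT`, `STRONGEST SNP-RISK ALLELE`, and `P-VALUE`, and that risk alleles are written as `rsID-allele` with `?` for an unreported allele. Verify these against the file you actually download. The rows, trait names, rsIDs, and the 5e-8 threshold are illustrative assumptions, not catalog content.

```python
import csv
import io

# Conventional genome-wide significance cutoff, used here as an assumption.
GENOME_WIDE_SIGNIFICANCE = 5e-8


def parse_risk_allele(value):
    """Split a value such as 'rs0000001-A' into (rsid, allele).

    The allele is None when it is absent or reported as '?'.
    """
    rsid, sep, allele = value.strip().rpartition("-")
    if not sep:
        return value.strip(), None
    return rsid, (allele if allele and allele != "?" else None)


def significant_risk_alleles(tsv_text, trait, threshold=GENOME_WIDE_SIGNIFICANCE):
    """Return (rsid, allele) pairs for one trait at or below the p-value threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        if row["DISEASE/TRAIT"] != trait:
            continue
        if float(row["P-VALUE"]) > threshold:
            continue
        rsid, allele = parse_risk_allele(row["STRONGEST SNP-RISK ALLELE"])
        if allele is not None:
            hits.append((rsid, allele))
    return hits


# Hypothetical rows shaped like the catalog download (not real associations).
toy_tsv = (
    "DISEASE/TRAIT\tSTRONGEST SNP-RISK ALLELE\tP-VALUE\n"
    "example trait\trs0000001-A\t2E-10\n"
    "example trait\trs0000002-?\t1E-12\n"
    "example trait\trs0000003-G\t3E-5\n"
    "other trait\trs0000004-T\t1E-9\n"
)

hits = significant_risk_alleles(toy_tsv, "example trait")
print(hits)
```

Rows with an unreported risk allele are dropped in this sketch because they cannot be matched against genotypes. Whether to keep them is a design choice that depends on the downstream analysis.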

Reference Architecture

graph LR
subgraph SG1 [EBI]
B[TSV: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/\ntechnical/working/20130606_sample_info/\n20130606_sample_info.txt]
C[TSV: www.ebi.ac.uk/gwas/api/search/downloads/full]
end
subgraph SG2 [s3://1000genomes/phase1/]
A[VCF: chr22.SHAPEIT2_integrated_phase1_v3]
end

subgraph SG3 [Databricks]
subgraph SG4 [Unity Catalog]
A -- 🧬 Glow \nVCF parser --> AA[glow_chr22_vars]
B --> BB[sample_information]
C --> CC[gwas_catalog_full]
AA --🧬 Glow --> DD[glow_chr22_sample_qc]
AA --🧬 Glow --> EE[glow_chr22_vars]
end
subgraph SG5 [Lakeview]
AA -.-> D(Variant Explorer \n Dashboard)
BB -.-> D
CC -.-> D
DD -.-> D
EE -.-> D
end
end 

style SG1 fill:#DCE0E2
style SG2 fill:#DCE0E2
style SG3 fill:#98102A,color:#F9F7F4 
style SG4 fill:#EEEDE9
style SG5 fill:#EEEDE9 

Authors

amir.kermany@databricks.com

Project support

Please note that the code in this project is provided for your exploration only and is not formally supported by Databricks with Service Level Agreements (SLAs). It is provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of this project. The source in this project is provided subject to the Databricks License. All included or referenced third party libraries are subject to the licenses set forth below.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.

License

© 2024 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third party libraries are subject to the licenses set forth below.

| library | description | license | source |
|---|---|---|---|
| Project Glow | An open-source toolkit to enable bioinformatics at biobank-scale and beyond. | Apache 2.0 | https://github.com/projectglow/glow |
