ClinCluster: A Package for Aggregating Disease Terms in ClinVar

List of participants and affiliations:

Melissa Landrum, NIH/NCBI (Co-Team Leader)
Guangfeng Song, NIH/NCBI (Co-Team Leader)
Lauren Edgar, NIH/NHGRI
Benjamin Kesler, Vanderbilt University
Nicholas Minor, University of Wisconsin
Michael Muchow, Unaffiliated
Rebecca Orris, NIH/NCBI
Wengang Zhang, NIH/NCI

Project Goals

Naming human genetic diseases is complex. A disease may be named for its phenotype, and the name may also allude to other information, such as the relevant gene, the mode of inheritance, or the mechanism of disease. Diseases may be described at a high level with a generic name or at a lower level with a more specific name; in the context of a variant in a specific gene, however, these differences may not be important. ClinVar data would be easier to ingest in bulk and to read in web displays if there were a meaningful way to aggregate disease terms that effectively mean the same thing in the context of a gene.

Problem Statement: Disease terms in ClinVar are very granular, which results in many variant-disease records.
Can we use an ML/AI approach to aggregate disease terms in ClinVar to reduce the number of variant-disease records?

Example of granularity for Familial Hypercholesterolemia

Approach

Develop an ML/AI approach to aggregating diseases for five genes with variants in ClinVar: LDLR, KCNQ1, USH2A, SCN5A, and TSC1.

Use the gene symbol to decide whether different terms are meaningfully different for that gene, e.g. "Familial hypercholesterolemia" vs. "Hypercholesterolemia, familial, 1"
Use other information, such as mode of inheritance, clinical features, and mechanism of disease, to decide whether terms should be aggregated
Demonstrate how RCV records for variant-condition pairs in ClinVar would differ for one or more genes when aggregated diseases are used
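To make the last point concrete, the following is a minimal sketch of how the effect of aggregation on variant-condition pairs could be counted. The variant labels, the `umbrella_of` mapping, and the function name are hypothetical and chosen for illustration only; they are not real ClinVar records or part of the ClinCluster code.

```python
# Hypothetical variant-condition pairs for one gene (not real ClinVar data).
PAIRS = [
    ("variant_1", "Familial hypercholesterolemia"),
    ("variant_1", "Hypercholesterolemia, familial, 1"),
    ("variant_2", "Familial hypercholesterolemia"),
    ("variant_2", "Cardiovascular phenotype"),
]

# Assumed mapping from a granular disease term to its umbrella term.
UMBRELLA_OF = {
    "Hypercholesterolemia, familial, 1": "Familial hypercholesterolemia",
}


def count_variant_condition_pairs(pairs, umbrella_of=None):
    """Count distinct (variant, condition) pairs, optionally after aggregation.

    Terms missing from umbrella_of are kept unchanged.
    """
    umbrella_of = umbrella_of or {}
    distinct = {
        (variant, umbrella_of.get(disease, disease))
        for variant, disease in pairs
    }
    return len(distinct)


before = count_variant_condition_pairs(PAIRS)
after = count_variant_condition_pairs(PAIRS, UMBRELLA_OF)
print(f"variant-condition pairs: {before} before, {after} after aggregation")
```

In this toy input, the two terms reported for `variant_1` collapse into one pair, while `variant_2` keeps two pairs because "Cardiovascular phenotype" is not mapped to an umbrella term.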

Aggregating the disease terms

Extract disease names from all RCV records associated with a single gene
Extract all unique MedGen IDs and their disease names
Cluster the names into their umbrella disease categories
Program an LLM to cluster the disease terms
Fork https://github.com/simonw/llm-cluster/blob/main/llm_cluster.py
Identify variant records with similar disease names that belong to the same umbrella disease
Assign the corrected (umbrella) name to those RCV records
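The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: word-overlap (Jaccard) similarity stands in for the LLM-based clustering so that the example runs with the standard library alone, the accessions are placeholders, and every function name and the `threshold` value are hypothetical rather than taken from ClinCluster or llm-cluster.

```python
from collections import defaultdict

# Hypothetical toy records: (accession placeholder, gene symbol, disease name).
RECORDS = [
    ("RCV_A", "LDLR", "Familial hypercholesterolemia"),
    ("RCV_B", "LDLR", "Hypercholesterolemia, familial, 1"),
    ("RCV_C", "LDLR", "Cardiovascular phenotype"),
    ("RCV_D", "KCNQ1", "Long QT syndrome"),
    ("RCV_E", "KCNQ1", "Long QT syndrome 1"),
]


def tokens(name):
    """Lowercase a disease name and return its set of alphabetic words."""
    cleaned = "".join(c if c.isalpha() else " " for c in name.lower())
    return frozenset(cleaned.split())


def similarity(a, b):
    """Jaccard overlap of word sets (assumed stand-in for LLM similarity)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def cluster_names(names, threshold=0.6):
    """Greedy clustering: add a name to the first cluster with a similar member."""
    clusters = []
    for name in sorted(names):
        for cluster in clusters:
            if any(similarity(name, member) >= threshold for member in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def aggregate(records, threshold=0.6):
    """Cluster disease names within each gene and relabel every record."""
    names_by_gene = defaultdict(set)
    for _, gene, name in records:
        names_by_gene[gene].add(name)

    umbrella = {}
    for gene, names in names_by_gene.items():
        for cluster in cluster_names(names, threshold):
            # Assumption: the shortest name serves as the umbrella label.
            label = min(cluster, key=lambda n: (len(n), n))
            for name in cluster:
                umbrella[(gene, name)] = label

    return [(acc, gene, umbrella[(gene, name)]) for acc, gene, name in records]


for row in aggregate(RECORDS):
    print(row)
```

Clustering is done per gene, so two similar names are only merged when they occur for the same gene symbol. Choosing the shortest name as the umbrella label is a placeholder rule; a curated or MedGen-derived label would likely be a better choice in practice.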

Results

Demo

Future Work

Evaluate and test other LLMs with more training in biomedical knowledge to reduce the amount of curation needed
Generate a test set of ClinVar data using aggregated disease terms
Review test data set with ClinVar users to determine the value added and areas to improve
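One possible way to compare candidate models is to score their clusters against a curated grouping. The sketch below computes pairwise precision and recall between a predicted and a curated assignment; the metric choice, the function names, and the toy labels are assumptions made for illustration, not part of the project.

```python
from itertools import combinations


def same_cluster_pairs(assignment):
    """Return the set of unordered term pairs that share a cluster label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(assignment), 2)
        if assignment[a] == assignment[b]
    }


def pairwise_precision_recall(predicted, curated):
    """Score predicted clusters against curated ones on co-clustered pairs."""
    pred = same_cluster_pairs(predicted)
    gold = same_cluster_pairs(curated)
    agreed = len(pred & gold)
    precision = agreed / len(pred) if pred else 1.0
    recall = agreed / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical curated grouping and model output for four disease terms.
CURATED = {
    "Familial hypercholesterolemia": "FH",
    "Hypercholesterolemia, familial, 1": "FH",
    "Homozygous familial hypercholesterolemia": "FH",
    "Cardiovascular phenotype": "other",
}
PREDICTED = {
    "Familial hypercholesterolemia": 1,
    "Hypercholesterolemia, familial, 1": 1,
    "Homozygous familial hypercholesterolemia": 2,
    "Cardiovascular phenotype": 2,
}

precision, recall = pairwise_precision_recall(PREDICTED, CURATED)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision would indicate that the model merges terms curators keep apart, and low recall that it leaves terms separate that curators would merge; either case points to terms that still need manual review.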

Common Acronyms

Abbreviation	Definition
ACMG	American College of Medical Genetics and Genomics
CGV	Comparative Genome Viewer
DBSCAN	Density-Based Spatial Clustering of Applications with Noise
GPT	Generative Pre-trained Transformer
HGNC	HUGO Gene Nomenclature Committee
HGVS	Human Genome Variation Society
HI	Haploinsufficiency
HUGO	Human Genome Organisation
LLM	Large Language Model
MeSH	Medical Subject Headings
NGS	Next-Generation Sequencing
NLP	Natural Language Processing
OMIM	Online Mendelian Inheritance in Man
PMC	PubMed Central
RCV	Reference (variant-condition) ClinVar Variant Accession Identifier
SCV	Submitted ClinVar Variant Accession Identifier
SNP	Single-Nucleotide Polymorphism
SRA	Sequence Read Archive
TS	Triplosensitivity
VUS	Variant of Uncertain Significance
VCV	Variant ClinVar Variant Accession Identifier

NCBI Codeathon Disclaimer

This software was created as part of an NCBI codeathon, a hackathon-style event focused on rapid innovation. While we encourage you to explore and adapt this code, please be aware that NCBI does not provide ongoing support for it.

For general questions about NCBI software and tools, please visit: NCBI Contact Page

NCBI-Codeathons/mlxai-2024-team-landrum-song
