NCBI-Codeathons/mlxai-2024-team-smith

A Gentle Introduction to ML/AI as Applied to Antibody Engineering

Team Smith Roster

| Role | Participant | Affiliation |
|------|-------------|-------------|
| Team Lead | Todd Smith, PhD | Digital World Biology, LLC |
| Tech Lead | Herminio Vazquez | Copado Inc. |
| Flex | Zainab Adenaike | NIH/NLM/NCBI |
| Flex | Jake Lance | student, University of Toronto |
| Flex | Mohsen Sharifi Renani | Spotify AB |
| Writer | Stephen Panossian | Unaffiliated |

Project Goals

The project focused on developing resources and documentation for teaching data science and machine learning / artificial intelligence (ML/AI) concepts related to antibody engineering. Immune profiling (immunoprofiling) datasets were used as a source of antibody sequences for both data science and ML. The team developed Jupyter notebooks to carry out comparative analyses of iReceptor datasets, and then incorporated the AbLang2 antibody-specific language model to characterize data from CoV-AbDab. A dictionary and glossary defining essential computing and biology terms related to the computations in the Jupyter notebooks were also developed.

Methods

Datasets

  • CoV-AbDab database in CSV format. CoV-AbDab is a public database documenting all published or patented antibodies and nanobodies able to bind to coronaviruses, including SARS-CoV2, SARS-CoV1, and MERS-CoV. The codeathon used the Feb 8, 2024 release containing 12,916 entries. Entries are highly annotated and indicate, among other things, neutralizing ability, kind of receptor (antibody or nanobody), whether the sequence data are paired (heavy and light chains) or heavy chain only, the epitope bound, whether a structure exists, and virus reactivity.
  • iReceptor (free account required) lymphoma dataset obtained with the following filters: Study ID: PRJEB1289; Study type Case Control (Ontology ID): NCIT:C15197; Filter by Sample > PCR target: IGH or IGK or IGL
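
As a minimal sketch of how the CoV-AbDab annotations described above might be tallied, the snippet below parses a few CSV rows and counts receptor kinds and paired versus heavy-only entries. The column names (`Name`, `Ab or Nb`, `VHorVHH`, `VL`), the `ND` placeholder for a missing chain, the helper `summarize`, and the three sample rows are all assumptions made for illustration; check them against the header of the release you download. The sketch uses only the Python standard library, although the notebooks themselves rely on Pandas.

```python
import csv
import io
from collections import Counter

# Hypothetical rows standing in for the CoV-AbDab CSV.
# Column names and the "ND" marker are assumptions; sequences are placeholders.
SAMPLE_CSV = """Name,Ab or Nb,VHorVHH,VL
example-1,Ab,EVQLVESGG,DIQMTQSPS
example-2,Ab,QVQLQESGP,ND
example-3,Nb,QVQLVESGG,ND
"""

def summarize(handle):
    """Count receptor kinds and paired vs heavy-only entries."""
    kinds = Counter()
    pairing = Counter()
    for row in csv.DictReader(handle):
        kinds[row["Ab or Nb"]] += 1
        # An entry is treated as paired when a light-chain sequence is present.
        has_light = row["VL"].strip() not in ("", "ND")
        pairing["paired" if has_light else "heavy only"] += 1
    return kinds, pairing

kinds, pairing = summarize(io.StringIO(SAMPLE_CSV))
print(dict(kinds))
print(dict(pairing))
```

To run the same tally on a real release, pass an open file handle for the downloaded CSV to `summarize` in place of the in-memory sample.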

Software

  • Immune Profiling: See notebooks for details. Key Python libraries include Pandas for structuring and manipulating data, json for reading metadata, Matplotlib for graphing, and Seaborn for exploring correlations between data in columns.
  • Machine learning: AbLang2

The following diagrams represent the high-level methods employed in data science and bioinformatics.

Antibody (Immune Profiling) Sequencing

Antibody sequence data commonly come from immune profiling experiments and assays.

    flowchart TD
    A[Collect Samples] --> B[Isolate DNA / RNA->cDNA] --> C[PCR] -- V-gene, C-gene primers --> D[Sequence DNA] -- NGS - massively parallel --> E[IgBLAST] -- Vh Dh Jh, Vl Jl, Vk Jk references --> F[Immune Profile Dataset];
    F -- repeat --> A
    F --> G[Explore data, analyze];
    F --> H[Machine learning]; 
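
The last steps of the diagram, going from IgBLAST output to data exploration, can be sketched as follows. This assumes the immune profile dataset is a tab-separated file in AIRR rearrangement format with `locus`, `v_call`, `j_call`, and `productive` columns, and that `productive` is encoded as `T` or `F`. The rows below are made up for illustration, and the helper `gene_usage` is hypothetical rather than taken from the team's notebooks.

```python
import csv
import io
from collections import Counter

# Made-up AIRR-style rows (tab separated); real files have many more columns.
SAMPLE_TSV = (
    "sequence_id\tlocus\tv_call\tj_call\tproductive\n"
    "s1\tIGH\tIGHV3-23*01\tIGHJ4*02\tT\n"
    "s2\tIGH\tIGHV3-23*04,IGHV3-23*01\tIGHJ6*02\tT\n"
    "s3\tIGH\tIGHV1-69*01\tIGHJ4*02\tT\n"
    "s4\tIGK\tIGKV1-39*01\tIGKJ1*01\tT\n"
    "s5\tIGH\tIGHV4-34*01\tIGHJ4*02\tF\n"
)

def gene_usage(handle, locus, column="v_call"):
    """Return the fraction of productive rearrangements per gene for one locus."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["locus"] != locus or row["productive"] != "T":
            continue
        # Keep the first call when several are listed, and drop the allele suffix.
        gene = row[column].split(",")[0].split("*")[0]
        counts[gene] += 1
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()} if total else {}

usage = gene_usage(io.StringIO(SAMPLE_TSV), locus="IGH")
print(usage)
```

Collapsing alleles to the gene level and keeping only the first of several tied calls are simplifications chosen for this sketch; a real analysis may want to handle ambiguous calls differently.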

Example Data Method

High-level data science workflow.

    flowchart LR
    
    A[Collect] --> B[Profile]
    B -->C{complete?}
    C-->|Yes|D[Exploration]
    C-->|No|A
    D --> E[Charts]
    D --> F[Impute]
    E --> G[Aggregate]
    F --> G
    G --> H[Model]
    H --> I[Feature Engineering]
    I --> J[Train/Test]
    I --> K[Tune]
    K --> J
    J --> L[Predict]
    L --> M[Operationalize]
    M --> N[Monitor]
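
The early part of this workflow (Profile, the "complete?" decision, and Impute) can be illustrated with a small sketch. The records, the column names, the 0.7 completeness threshold, and the choice of median imputation are all assumptions made for this example; they are not taken from the team's notebooks.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
records = [
    {"cdr3_length": 12, "v_gene": "IGHV3-23"},
    {"cdr3_length": None, "v_gene": "IGHV1-69"},
    {"cdr3_length": 15, "v_gene": None},
    {"cdr3_length": 18, "v_gene": "IGHV3-23"},
]

def profile(rows):
    """Profile step: fraction of non-missing values per column."""
    columns = rows[0].keys()
    return {c: sum(r[c] is not None for r in rows) / len(rows) for c in columns}

def is_complete(completeness, threshold=0.7):
    """Decision node: every column must meet the (assumed) threshold."""
    return all(v >= threshold for v in completeness.values())

def impute_median(rows, column):
    """Impute step: fill missing numeric values with the column median."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

completeness = profile(records)
if is_complete(completeness):
    cleaned = impute_median(records, "cdr3_length")
else:
    cleaned = None  # in the diagram, this branch returns to Collect
print(completeness)
print(cleaned)
```

Median imputation is only one option; for categorical columns such as `v_gene`, dropping the row or adding an explicit "unknown" category may be more appropriate.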

See Mermaid to learn how these figures are made. Mermaid.org and flow charts provide complete documentation.

Approach

The team used software tools including Amazon Web Services (AWS) cloud computing accounts, Jupyter notebooks, and datasets from both iReceptor and SAbDab (The Structural Antibody Database) from the Oxford Protein Informatics Group (OPIG). The general workflow is: 1) create an AWS instance, 2) step through the enclosed Jupyter notebook, and 3) analyze the antibody results. Minor experimentation was done with Docker containers.

Prior work illustrates this approach:

Example: Covid not Covid

Example: Immune Profiling

2024 ML/AI Codeathon Log

| Date | Issues |
|------|--------|
| 26-Feb-2024 | Several issues utilizing Docker |
| 27-Feb-2024 | Team accessed AWS account and Jupyter notebook; runtime challenges |
| 28-Feb-2024 | None reported |
| 29-Feb-2024 | None reported |
| 01-Mar-2024 | Final Presentation |

Results

Many Jupyter notebooks and notebook fragments were created. All are in the notebooks folder. The most instructive notebooks are:

Machine Learning

Immune Profiling

Future Work

Project materials will form a resource with instruction and hands-on examples that can demystify ML/AI for the many scientists and students who need greater awareness of the data, steps, and practicalities. The focus on antibodies supports work in basic research and biotechnology. Digital World Biology's Antibody Engineering Hackathons are creating materials for course-based undergraduate research in community colleges (https://antibody-engineers.org/).

The resulting work will be used in Digital World Biology's National Science Foundation-funded summer hackathon (August 2024) on antibody engineering. In the project we will consider ML applications for predicting antibody-antigen recognition, genetic contributions to antibody expression, and de novo antibody design. Work will identify one or two examples that include specific datasets, workflows, an appropriate ML method, and tests. The examples will then be used to create instructions and explanations that can be used in classroom settings, as starting points for undergraduate research, and by scientists who want to better understand ML.

NCBI Codeathon Disclaimer

This software was created as part of an NCBI codeathon, a hackathon-style event focused on rapid innovation. While we encourage you to explore and adapt this code, please be aware that NCBI does not provide ongoing support for it.

For general questions about NCBI software and tools, please visit: NCBI Contact Page
