Releases: DeepGraphLearning/torchdrug

0.2.1 Release

16 Jul 22:37

The new 0.2.1 release supports PyTorch 2.0 and Python 3.10. We have also fixed quite a few bugs based on suggestions from the community. Thanks to everyone who contributes to this library.

Compatibility Changes

  • TorchDrug now supports Python versions from 3.7 to 3.10, and PyTorch versions from 1.8 to 2.0. There is no change in the minimum required versions, so you can easily update your existing environment to TorchDrug 0.2.1.
  • For PropertyPrediction, the predict function now directly outputs values on the original scale rather than standardized values, which is more intuitive for new users. Note that this change is not backward compatible; see the sketch below. (#109, thanks to @kanojikajino)
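
A minimal sketch of the behavioral change; the model, dataset and batch here are placeholders, in the style of the examples below:

from torchdrug import tasks

model = ...
dataset = ...
batch = ...

task = tasks.PropertyPrediction(model, task=dataset.tasks,
                                criterion="mse", metric=("mae", "rmse"),
                                normalization=True, num_mlp_layer=2)
# ... train the task ...
pred = task.predict(batch)
# 0.2.0 and earlier: pred is standardized, i.e. (label - mean) / std over the training set
# 0.2.1: pred is on the original label scale, so no manual un-standardization is needed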

Improvements

  • Add batch normalization and dropout in PropertyPrediction
  • Support custom edge feature function in GraphConstruction
  • Support ESM-2 models in EvolutionaryScaleModeling
  • Add full batch evaluation for KnowledgeGraphCompletion
  • Support dict, list and tuple in config dict
  • Add instructions for installation on Apple silicon (#176, thanks to @migalkin)
  • Relax the matplotlib dependency to matplotlib-base (#141, thanks to @jamesmyatt)

Bug Fixes

  • Fix variable names in NeuralLogicProgramming (#126)
  • Fix interface for the new esm library (#133)
  • Fix the version of AlphaFoldDB to v2 (#137, thanks to @ShoufaChen)
  • Fix inconsistent output when using edge features in convolution layers (#53, #140)
  • Handle side cases in property optimization (#125, thanks to @jannisborn)
  • Fix a bug when using LR schedulers (#148, #152)
  • Fix a bug in graph construction (#158)
  • Fix a bug in layers.Set2Set (#185)
  • Avoid in-place RDKit operations in data.Molecule.from_molecule (#142)
  • Fix num_class in PropertyPrediction (#142)
  • Fix docker file for new CUDA image (#207, thanks to @cscandore)
  • Fix chain ID in data.Protein.from_molecule

Deprecations

  • Deprecate functional._size_to_index. Use torch.repeat_interleave instead; see the sketch below.
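
For reference, _size_to_index(size) mapped each element of a variadic batch to the index of the set it belongs to. The single-argument form of torch.repeat_interleave produces exactly the same result:

import torch

size = torch.tensor([2, 3, 1])
# functional._size_to_index(size) returned tensor([0, 0, 1, 1, 1, 2])
index = torch.repeat_interleave(size)
print(index)  # tensor([0, 0, 1, 1, 1, 2])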

0.2.0 Release

19 Sep 05:23

V0.2.0 is a major release that introduces a new family member, TorchProtein, a library for machine-learning-guided protein science. Aiming to simplify the development of protein methods, TorchProtein encapsulates many complicated yet repetitive subroutines into functional modules, including widely used datasets, flexible data processing operations, advanced encoding models, and diverse protein tasks.

Such comprehensive encapsulation enables users to develop protein machine learning solutions with one easy-to-use library, avoiding the hassle of gluing multiple libraries into a single pipeline.

With TorchProtein, we can rapidly prototype machine learning solutions to various protein applications within 20 lines of code, and conduct ablation studies by substituting different parts of a solution with off-the-shelf modules. Furthermore, we can easily adapt these modules to our own needs, and make systematic analyses by comparing the new results to benchmarks provided in the library.

Additionally, TorchProtein is designed to be accessible to everyone. For inexperienced users, like beginners or biological researchers, TorchProtein provides user-friendly APIs to simplify the development of protein machine learning solutions. Meanwhile, for professional users, TorchProtein also preserves enough flexibility to satisfy their demands, supported by features like modular design of the library and on-the-fly graph construction.

Main Features

Simplify Data Processing

  • It is challenging to transform raw bioinformatic protein datasets into tensor formats for machine learning. To reduce tedious operations, TorchProtein provides us with a data structure data.Protein and its batched extension data.PackedProtein to automate the data processing step.

    • data.Protein and data.PackedProtein automatically gather protein data from various bio-sources and seamlessly switch between data formats like pdb files, RDKit objects and sequences. Please see the section data structures and operations for conversions to and from sequences and RDKit objects.

      # construct a data.Protein instance from a pdb file
      pdb_file = ...
      protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
      print(protein)
      
      # write a data.Protein instance back to a pdb file
      new_pdb_file = ...
      protein.to_pdb(new_pdb_file)
      Protein(num_atom=445, num_bond=916, num_residue=57)
    • data.Protein and data.PackedProtein automatically pre-process all kinds of atom, bond and residue features, controlled by a few arguments.

      pdb_file = ...
      protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
      
      # feature
      print(protein.residue_feature.shape)
      print(protein.atom_feature.shape)
      print(protein.bond_feature.shape)
      torch.Size([57, 21])
      torch.Size([445, 3])
      torch.Size([916, 1])
    • data.Protein and data.PackedProtein automatically keep track of numerous attributes associated with atoms, bonds, residues and the whole protein.

      • For example, reference offers a way to register new attributes as node-, edge- or graph-level properties; in this way, the new attributes automatically go along with the nodes, edges or graphs themselves. More built-in attributes are listed in the section data structures and operations.
      protein = ...
      
      with protein.node():
          protein.node_id = torch.arange(protein.num_node)
      with protein.edge():
          protein.edge_cost = torch.rand(protein.num_edge)
      with protein.graph():
          protein.graph_feature = torch.randn(128)
      • Moreover, reference can be utilized to maintain the correspondence between two closely related objects. For example, the mapping atom2residue maintains the relationship between atoms and residues, and enables indexing on either of them.
      protein = ...
      
      # create mask indices for atoms in glutamine (GLN) residues
      is_glutamine = protein.residue_type[protein.atom2residue] == protein.residue2id["GLN"]
      mask_indices = is_glutamine.nonzero().squeeze(-1)
      print(mask_indices)
      
      # map the masked atoms back to the glutamine residue
      residue_type = protein.residue_type[protein.atom2residue[mask_indices]]
      print([protein.id2residue[r] for r in residue_type.tolist()])
      tensor([ 26,  27,  28,  29,  30,  31,  32,  33,  34, 307, 308, 309, 310, 311,
              312, 313, 314, 315, 384, 385, 386, 387, 388, 389, 390, 391, 392])
      ['GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN']
  • It is useful to augment protein data by modifying protein graphs or constructing new ones. With the protein operations and the graph construction layers provided in TorchProtein,

    • we can easily modify proteins on the fly by batching, slicing sequences, masking out side chains, etc. Please see the tutorials for more details on masking.

      pdb_file = ...
      protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
      
      # batch
      proteins = data.Protein.pack([protein, protein, protein])
      
      # slice sequences
      # use indexing to extract a subset of residues from a particular protein
      two_residues = protein[[0, 2]]
      two_residues.visualize()

      [Figure: visualization of the two extracted residues]

    • we can construct protein graphs on the fly with GPU acceleration, which offers users flexible choices rather than fixed pre-processed graphs. Below is an example that builds a graph with only alpha carbon atoms; please check the tutorials for more cases, such as adding spatial / KNN / sequential edges.

      protein = ...
      # transfer from CPU to GPU
      protein = protein.cuda()
      print(protein)
      
      # build a graph with only alpha carbon (CA) atoms
      node_layers = [geometry.AlphaCarbonNode()]
      graph_construction_model = layers.GraphConstruction(node_layers=node_layers)
      
      original_protein = data.Protein.pack([protein])
      CA_protein = graph_construction_model(original_protein)
      print("Graph before:", original_protein)
      print("Graph after:", CA_protein)
      Protein(num_atom=445, num_bond=916, num_residue=57, device='cuda:0')
      Graph before: PackedProtein(batch_size=1, num_atoms=[2639], num_bonds=[5368], num_residues=[350])
      Graph after: PackedProtein(batch_size=1, num_atoms=[350], num_bonds=[0], num_residues=[350])

Easy to Prototype Solutions

With TorchProtein, common protein tasks, such as sequence-based protein property prediction, can be finished within 20 lines of code. Below is an example; more examples of popular protein tasks and models can be found in Protein Tasks, Models and Tutorials.

import torch
from torchdrug import datasets, transforms, models, tasks, core

truncate_transform = transforms.TruncateProtein(max_length=200, random=False)
protein_view_transform = transforms.ProteinView(view="residue")
transform = transforms.Compose([truncate_transform, protein_view_transform])

dataset = datasets.BetaLactamase("~/protein-datasets/", residue_only=True, transform=transform)
train_set, valid_set, test_set = dataset.split()

model = models.ProteinCNN(input_dim=21,
                          hidden_dims=[1024, 1024],
                          kernel_size=5, padding=2, readout="max")

task = tasks.PropertyPrediction(model, task=dataset.tasks,
                                criterion="mse", metric=("mae", "rmse", "spearmanr"),
                                normalization=False, num_mlp_layer=2)

optimizer = torch.optim.Adam(task.parameters(), lr=1e-4)
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, 
                     gpus=[0], batch_size=64)
solver.train(num_epoch=10)
solver.evaluate("valid")
mean absolute error [scaled_effect1]: 0.249482
root mean squared error [scaled_effect1]: 0.304326
spearmanr [scaled_effect1]: 0.44572

Compatible with Existing Molecular Models in TorchDrug

  • TorchProtein follows the scientific fact that proteins are macromolecules. The core data structures data.Protein and data.PackedProtein inherit from data.Molecule and data.PackedMolecule respectively. Therefore, we can apply any existing molecule model in TorchDrug to proteins.

    import torch
    from torchdrug import layers, datasets, transforms, models, tasks, core
    from torchdrug.layers import geometry
    
    truncate_transform = transforms.TruncateProtein(max_length=200, random=False)
    protein_view_transform = transforms.ProteinView(view="residue")
    transform = transforms.Compose([truncate_transform, protein_view_transform])
    
    dataset = datasets.EnzymeCommission("~/protein-datasets/", transform=transform)
    train_set, valid_set, test_set = dataset.split()
    
    model = models.GIN(input_dim=21,
                        hidden_dims=[256, 256, 256, 256],
           ...

0.1.3 Release

04 Jun 04:28

The TorchDrug 0.1.3 release introduces new features like W&B integration and index reference. It also provides new functions and metrics for common development needs. Note that 0.1.3 includes some compatibility changes, so be careful when updating TorchDrug from an older version.

  • W&B Integration
  • Index Reference
  • New Functions
  • New Metrics
  • Improvements
  • Bug Fixes
  • Compatibility Changes

W&B Integration

Tracking experiment progress is one of the most important demands from ML researchers and developers. For TorchDrug users, we provide a native integration with the W&B platform. By adding only one argument to core.Engine, TorchDrug will automatically copy every hyperparameter and training log to your W&B database (thanks to @manangoel99).

solver = core.Engine(task, train_set, valid_set, test_set, optimizer, logger="wandb")

Now you can track your training and validation performance in your browser, and compare them across different experiments.

[Figure: W&B dashboard demo]

Index Reference

Maintaining node and edge attributes can be painful when one applies many transformations to a graph. TorchDrug aims to eliminate such tedious steps through registered custom attributes. This update extends custom attributes with index reference: attributes may refer to indices of nodes, edges or graphs, and the indices are automatically maintained in any graph operation.

To use index reference, simply add a context manager when defining the attributes.

with graph.edge(), graph.edge_reference():
    graph.inv_edge_index = torch.tensor(inv_edge_index)

For more details on index reference, please take a look at our notes. Typical use cases include the following; the first is sketched after the list.

  • A pointer to the inverse edge of each edge.
  • A pointer to the parent node of each node in a tree.
  • A pointer to the incoming tree edge of each node in a DFS.
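
As a minimal sketch of the first use case, suppose an undirected graph is stored as pairs of directed edges, where edges 2k and 2k + 1 are inverses of each other (this layout is an assumption of the example):

import torch
from torchdrug import data

edge_list = [[0, 1], [1, 0], [1, 2], [2, 1]]
graph = data.Graph(edge_list, num_node=3)

# flip the last bit to swap 0<->1, 2<->3, ...
inv_edge_index = torch.arange(graph.num_edge) ^ 1

with graph.edge(), graph.edge_reference():
    graph.inv_edge_index = inv_edge_index
# the pointer is now remapped automatically by graph operations such as edge masking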

Let us know if you find more interesting usage of index reference!

New Functions

Message passing over line graphs has become increasingly popular in recent years. This version provides data.Graph.line_graph to efficiently construct line graphs on GPUs. It supports both a single graph and a batch of graphs.
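
A minimal sketch; we assume the default arguments of line_graph construct the standard line graph:

from torchdrug import data

# a directed triangle: 3 nodes, 3 edges
graph = data.Graph([[0, 1], [1, 2], [2, 0]], num_node=3)
line_graph = graph.line_graph()
# each node of the line graph corresponds to an edge of the original graph
print(line_graph)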

We constantly focus on better batching of irregular structures, and the variadic functions in TorchDrug are an efficient way to process batches of variadic-sized tensors without padding. This update introduces three new variadic functions; a conversion round trip is sketched after the list.

  • variadic_meshgrid generates a meshgrid from two variadic tensors. Useful for implementing pairwise operations.
  • variadic_to_padded converts a variadic tensor to a padded tensor.
  • padded_to_variadic converts a padded tensor to a variadic tensor.
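
A minimal sketch of the padded round trip, assuming variadic_to_padded returns the padded tensor together with a validity mask:

import torch
from torchdrug.layers import functional

flat = torch.tensor([1, 2, 3, 4, 5, 6])  # two sets flattened: {1, 2} and {3, 4, 5, 6}
size = torch.tensor([2, 4])              # size of each set

padded, mask = functional.variadic_to_padded(flat, size, value=0)
# padded: [[1, 2, 0, 0], [3, 4, 5, 6]]; mask marks the valid entries

recovered = functional.padded_to_variadic(padded, size)
assert torch.equal(recovered, flat)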

New Metrics

New metrics include accuracy, matthews_corrcoef, pearsonr and spearmanr. All the metrics match their counterparts in SciPy and scikit-learn, but they are implemented in PyTorch and support auto differentiation.
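
A minimal sketch of the new metrics; pearsonr is shown with a backward pass, since its computation is smooth end to end:

import torch
from torchdrug import metrics

pred = torch.randn(100, requires_grad=True)
target = torch.randn(100)

r = metrics.pearsonr(pred, target)
r.backward()  # gradients flow back into pred
print(metrics.spearmanr(pred.detach(), target))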

Improvements

  • Add data.Graph.to (#70, thanks to @cthoyt)
  • Extend tasks.SynthonCompletion for arbitrary atom features (#62)
  • Speed up lazy data loading (#58, thanks to @wconnell)
  • Speed up rspmm cuda kernels
  • Add docker support
  • Add more documentation for data.Graph and data.Molecule

Bug Fixes

  • Fix computation of output dimension in several GNNs (#92, thanks to @kanojikajino)
  • Fix data.PackedGraph.__getitem__ when the batch is empty (#83, thanks to @jannisborn)
  • Fix patched modules for PyTorch>=1.6.0 (#77)
  • Fix make_configurable for torch.utils.data (#85)
  • Fix multi_slice_mask, variadic_max for multi-dimensional input
  • Fix variadic_topk for input containing infinite values

Compatibility Changes

TorchDrug now supports Python 3.7/3.8/3.9. Starting from this version, TorchDrug requires a minimum PyTorch version of 1.8.0 and a minimum RDKit version of 2020.09.

The arguments node_feature and edge_feature are renamed to atom_feature and bond_feature in data.Molecule.from_smiles and data.Molecule.from_molecule. The old interface is still supported with deprecation warnings, as sketched below.
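
A minimal sketch of the renaming:

from torchdrug import data

# new interface
mol = data.Molecule.from_smiles("C1=CC=CC=C1", atom_feature="default", bond_feature="default")
# old interface: still works, but emits a deprecation warning
mol = data.Molecule.from_smiles("C1=CC=CC=C1", node_feature="default", edge_feature="default")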

0.1.2 Release

23 Oct 04:08


The recent 0.1.2 release of TorchDrug is an update to Colab tutorials, data structures, functions, datasets and bug fixes. We are grateful to see growing interest and involvement from the community, especially on the retrosynthesis task. We welcome more in the future!

  • Colab Tutorials
  • New Data Structures
  • New Functions
  • New Datasets
  • Bug Fixes

Colab Tutorials

To familiarize users with the logic and capacity of TorchDrug, we compiled a full set of Colab tutorials, covering everything from basic usage to various drug discovery tasks. All the tutorials are fully interactive and may serve as boilerplate code for your own applications.

  • Basic Usage and Pipeline shows the manipulation of data structures like data.Graph and data.Molecule, as well as the training and evaluation pipelines for property prediction models.
  • Pretrained Molecular Representations demonstrates the steps for self-supervised pretraining of a molecular representation model and finetuning it on downstream tasks.
  • De novo Molecule Design illustrates the routine of training generative models for molecule generation and finetuning them with reinforcement learning for property optimization. Two popular models, GCPN and GraphAF, are covered in the tutorial.
  • Retrosynthesis shows how to use the state-of-the-art model, G2Gs, to predict a set of reactants for synthesizing a target molecule.
  • Knowledge Graph Reasoning goes through the steps of training and evaluating models for knowledge graph completion, including both knowledge graph embeddings and neural inductive logic programming.

New Data Structures

  • A new data structure data.Dictionary that stores key-value mappings of PyTorch tensors on either CPUs or GPUs. It enjoys O(n) memory consumption and O(1) query time, and supports parallelism over batches of queries. This API provides a great opportunity for implementing sparse lookup tables or set operations in a PyTorchic style.
  • A new method data.Graph.match to efficiently retrieve all edges of specific patterns on either CPUs or GPUs. It scales linearly w.r.t. the number of patterns plus the number of retrieved edges, regardless of the size of the graph. Typical usage of this method includes querying the existence of edges, generating random walks or even extracting ego graphs; see the sketch below.
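
A minimal sketch of data.Graph.match; the wildcard convention (-1 for "any") and the (index, num_match) return signature are assumptions based on the description above:

import torch
from torchdrug import data

graph = data.Graph([[0, 1], [1, 2], [2, 0], [0, 2]], num_node=3)

# retrieve all edges that start at node 0
pattern = torch.tensor([[0, -1]])
index, num_match = graph.match(pattern)
print(graph.edge_list[index])  # edges (0, 1) and (0, 2)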

New Functions

Batching irregular structures, such as graphs, sets or sequences with different sizes, is a common demand in drug discovery. Instead of clumsy padding-based implementations, TorchDrug provides a family of functions that efficiently manipulate batches of variadic-sized tensors without padding. This update contains the following new variadic functions; two of them are sketched after the list.

  • variadic_arange returns a 1-D tensor that contains integer intervals of variadic sizes.
  • variadic_softmax computes softmax over categories with variadic sizes.
  • variadic_sort sorts elements in sets with variadic sizes.
  • variadic_randperm returns random permutations for sets with variadic sizes, where the i-th permutation contains integers from 0 to size[i] - 1.
  • variadic_sample draws samples with replacement from sets with variadic sizes.
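
A minimal sketch of two of these functions, with signatures assumed from the descriptions above:

import torch
from torchdrug.layers import functional

size = torch.tensor([2, 3])

# variadic_arange: concatenated ranges, one per set
print(functional.variadic_arange(size))    # tensor([0, 1, 0, 1, 2])

# variadic_randperm: an independent random permutation per set
print(functional.variadic_randperm(size))  # e.g. tensor([1, 0, 2, 0, 1])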

New Datasets

  • PCQM4M: A large-scale molecule property prediction dataset, originally used in OGB-LSC (thanks to @OPAYA)

Bug Fixes

  • Fix import of sascorer in plogp evaluation (#18, #31)
  • Fix atoms with stereo bonds in retrosynthesis (#42, #43)
  • Fix lazy construction for molecule datasets (#30, thanks to @DaShenZi721)
  • Fix ChEMBLFiltered dataset (#36)
  • Fix ZINC2m dataset (#33)
  • Fix USPTO50k dataset (#32)
  • Fix bugs in core.Configurable (#26)
  • Fix/improve documentation (#16, #28, #41)
  • Fix installation on macOS (#29)