
MoleculeGen-ML Wiki

MoleculeGen-ML is a Python package to perform de novo drug design using generative language models.

Installation

The installation process is described on the main page. The package uses the MXNet backend and its Gluon API to build neural networks.

Introduction

This wiki describes how to preprocess molecules, train molecular language models, and generate novel molecule libraries. The objective is to reproduce the framework introduced by Segler et al. in Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks (https://arxiv.org/pdf/1701.01329.pdf).

In de novo drug design, we seek to generate novel focused molecule libraries, i.e. sets of compounds active toward a particular set of biological targets. We aim to build generative statistical models of molecular data that capture the distribution of known compounds and generate new, valid compounds expected to be active toward the targets. One way to accomplish this is transfer learning:

  1. Train a generative model on a large set of diverse molecules so that it learns to produce valid molecules.
  2. Perform fine-tuning, i.e. re-train the model on a smaller data set of molecules active toward the targets.

Methods

Here we describe the machine learning methods adapted for use in drug design. For demonstration purposes, we use a small data set of SMILES strings of compounds (for bigger data sets, please refer to our scripts and queries or search for alternatives) and show how to implement the first stage of the workflow above.

Data

The reference paper introduces a statistical language model of molecular data in which each molecule is represented as a SMILES string. First, let's create a data set of SMILES strings:

>>> smiles_strings = (
    'N#N\n'
    'CN=C=O\n'
    '[Cu+2].[O-]S(=O)(=O)[O-]\n'
    'CN1CCC[C@H]1c2cccnc2\n'
    'O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5\n'
    'OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1\n'
    'OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H]2[C@@H]1c3c(O)c(OC)c(O)cc3C(=O)O2\n'
    'CC[C@H](O1)CC[C@@]12CCCO2\n'
    'CC(C)[C@@]12C[C@@H]1[C@@H](C)C(=O)C2\n'
    'OCCc1c(C)[n+](cs1)Cc2cnc(C)nc2N\n'
    'CC(=O)NCCC1=CNc2c1cc(OC)cc2\n'
    'CCc1c[n+]2ccc3c4ccccc4[nH]c3c2cc1'
)
>>> file_name = 'test_data.txt'
>>> with open(file_name, mode='w', encoding='ascii') as file_handler:
...     file_handler.write(smiles_strings)

Next, we create a data set with moleculegen.data.SMILESDataset, which stores the list of SMILES strings, each wrapped in the beginning-of-SMILES token '{' and the end-of-SMILES token '}':

>>> import moleculegen as mg
>>> dataset = mg.data.SMILESDataset(file_name)
>>> print('\n'.join(dataset[0:3]))
{N#N}
{CN=C=O}
{[Cu+2].[O-]S(=O)(=O)[O-]}

SMILESDataset instances are subscriptable and iterable, and every item in the data set is a raw SMILES string. Every SMILES string consists of tokens. A token represents an atom ('Br', 'I', etc.), a non-atomic symbol ('(', '+', etc.), a subcompound ('[nH]', etc.), or a special character ('{', '}', '_'). The moleculegen.Token class describes all the available tokens and provides methods to manipulate them and to tokenize SMILES strings.

>>> mg.Token.tokenize('{CN=C=O}')  # Return the list of tokens.
['{', 'C', 'N', '=', 'C', '=', 'O', '}']

Now we know how to load and tokenize data sets for language models. But our models cannot process raw text directly, so we need a numerical representation of the data. For this purpose, we create a moleculegen.data.SMILESVocabulary instance: it tokenizes the data and then builds token-to-ID and ID-to-token mappings along with token frequency statistics.

>>> vocabulary = mg.data.SMILESVocabulary(dataset=dataset, need_corpus=True)
>>> vocabulary
SMILESVocabulary{ '_', '{', '}', ')', '[', 'C', '3', '=', 'c', 'H', 's', '2', '#', ... }

The corpus attribute refers to the IDs of all the tokens in the data. We passed need_corpus=True to build and store this corpus; it will be passed into batch samplers to train our models.

>>> vocabulary.corpus[0]
[1, 25, 12, 25, 2]
>>> ''.join(vocabulary.get_tokens(vocabulary.corpus[0])) == dataset[0]
True
>>> vocabulary.idx_to_token[:-15]
['_', '{', '}', ')', '[', 'C', '3', '=', 'c', 'H', 's', '2', '#']
>>> vocabulary.token_to_idx['_'] == 0
True

During fine-tuning, we will also have a separate data set, so we need to transform it into a corpus too (for labeled data, see moleculegen.data.SMILESTargetDataset). We can use the get_token_id_corpus method of the vocabulary. Also note that four attributes of SMILESVocabulary instances, token_freqs, token_to_idx, idx_to_token, and corpus, can be serialized with the to_pickle method; to load the pickle next time, specify the load_from_pickle parameter, as in the sketch below.
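For illustration, here is a rough sketch of preparing a fine-tuning corpus and reusing a saved vocabulary. The file names are placeholders, and the exact signatures of get_token_id_corpus, to_pickle, and load_from_pickle may differ from this sketch, so check the docstrings:

>>> fine_tune_dataset = mg.data.SMILESDataset('fine_tune_data.txt')  # hypothetical file of active compounds
>>> fine_tune_corpus = vocabulary.get_token_id_corpus(fine_tune_dataset)  # assumed to accept an iterable of SMILES strings
>>> vocabulary.to_pickle('vocabulary.pkl')  # serialize token_freqs, token_to_idx, idx_to_token, and corpus
>>> vocabulary = mg.data.SMILESVocabulary(load_from_pickle='vocabulary.pkl')  # assumed constructor parameter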

Next, we should create a data loader that samples mini-batches to pass into the model. We have at least two options: moleculegen.data.SMILESBatchSampler and moleculegen.data.SMILESBatchColumnSampler; we will use the first one. Its API is very similar to Gluon's batch sampler API, so we pass a sequence sampler and a batch size as formal parameters. Our sequence sampler implementation is moleculegen.data.SMILESConsecutiveSampler (check out also SMILESRandomSampler). It divides token IDs into subsequences of length n_steps and pads with the padding token '_' if the last subsequence of a SMILES string is shorter than n_steps:

>>> sequence_sampler = mg.data.SMILESConsecutiveSampler(
...     corpus=vocabulary.corpus, n_steps=16, shuffle=False)
>>> sample = next(iter(sequence_sampler))
>>> sample.inputs, sample.outputs
([1, 25, 12, 25, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [25, 12, 25, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> sample.valid_length == 4
True

As you can see, the sampler generates samples of type SMILESConsecutiveSampler.Sample, which have inputs, outputs, and valid_length attributes.

Now, let's create a batch sampler:

>>> batch_sampler = mg.data.SMILESBatchSampler(
...     sampler=sequence_sampler, batch_size=4, last_batch='rollover')
>>> len(batch_sampler)
8

So, this batch sampler will generate 8 mini-batches of 4 samples each, every sample similar to the one demonstrated above.

Now we are ready to introduce our language model.

Model

Let S be a molecule represented as a SMILES string containing T symbols (tokens) s_1, …, s_T from a SMILES vocabulary. Then, according to a parametric language model, the probability of S is

P(S) = P(s_1) ∏_{t=2}^{T} P(s_t | s_{t-1}, …, s_1)
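Training maximizes the log-likelihood of the training data, i.e. minimizes the per-token negative log-likelihood (cross-entropy),

-log P(S) = -log P(s_1) - ∑_{t=2}^{T} log P(s_t | s_{t-1}, …, s_1),

which is why SoftmaxCELoss appears in the training example below.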

Autoregressive models, particularly recurrent neural networks, are a good choice for this setup. In moleculegen, we introduce a SMILESLM neural network architecture comprising three blocks: embedding, encoder, and output (decoder). moleculegen.estimation.SMILESRNN is an RNN-based language model. It has a number of formal parameters (see docs), which can also be loaded from a configuration file using the from_config class method, but the main parameter is vocab_size:

>>> model = mg.estimation.SMILESRNN(len(vocabulary))
>>> model
SMILESRNN(
  (_embedding): HybridSequential(
    (0): Embedding(28 -> 32, float32)
    (1): Dropout(p = 0.4, axes=1)
  )
  (_encoder): LSTM(None -> 256, TNC, num_layers=2, dropout=0.6)
  (_decoder): Dense(None -> 28, linear)
)
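If you prefer to keep hyperparameters in a configuration file, the from_config class method mentioned above builds the model from it. A hypothetical sketch (the file name and the assumption that the method takes a path are illustrative only; see the docstring for the supported format):

>>> # model = mg.estimation.SMILESRNN.from_config('smiles_rnn.json')  # hypothetical config file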

To run model training, we need an optimization method and a loss function. Optionally, we can pass a list of callbacks from the moleculegen.callback subpackage:

>>> import mxnet as mx
>>> mx.npx.set_np()  # We use deepnumpy instead of mxnet nd.
>>> optimizer = mx.optimizer.Adam(learning_rate=0.001)
>>> loss_fn = mx.gluon.loss.SoftmaxCELoss()
>>> model.fit(batch_sampler, optimizer, loss_fn, n_epochs=10,
              callbacks=[mg.callback.ProgressBar()], verbose=True)
Epoch  1 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.516 (+/-0.499), 0.019 sec/batch
Epoch  2 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.349 (+/-0.419), 0.019 sec/batch
Epoch  3 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.168 (+/-0.381), 0.023 sec/batch
Epoch  4 [✓✓✓✓✓✓✓✗] Batch 8/8, Loss 2.178 (+/-0.356), 0.023 sec/batch
Epoch  5 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.092 (+/-0.354), 0.018 sec/batch
Epoch  6 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.125 (+/-0.332), 0.019 sec/batch
Epoch  7 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.010 (+/-0.367), 0.023 sec/batch
Epoch  8 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.061 (+/-0.313), 0.021 sec/batch
Epoch  9 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 1.928 (+/-0.334), 0.020 sec/batch
Epoch 10 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 1.923 (+/-0.264), 0.024 sec/batch
Time: 0:00:02.

Since SMILESRNN inherits from Gluon's Block, you are free to use any of its methods, including parameter saving and loading.
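For example, a minimal checkpointing sketch with the standard Gluon Block API (the file name is arbitrary):

>>> model.save_parameters('smiles_rnn.params')  # persist the trained weights
>>> model.load_parameters('smiles_rnn.params')  # restore them later, e.g. before fine-tuning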

Finally, we can sample new SMILES strings with the functors from moleculegen.generation, including SoftmaxSearch:

>>> predictor = mg.generation.SoftmaxSearch(model, vocabulary, temperature=0.7)
>>> predictor()
ON=O=O
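On such a tiny training set the sampled strings are often not chemically valid, as above. To assemble a small candidate library, one can simply call the functor repeatedly and keep the unique outputs (a sketch; validity filtering with an external tool such as RDKit is omitted):

>>> candidates = {predictor() for _ in range(100)}  # sample 100 strings, keep the unique ones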

Further Information

For more information about moleculegen, refer to the source code of the package, in which every API is documented. In particular, try evaluating models with moleculegen.evaluation and monitoring training progress with moleculegen.callback.
