Home
MoleculeGen-ML is a Python package to perform de novo drug design using generative language models.
The installation process is described on the main page. The package uses the MXNet backend and its Gluon API to build neural networks.
This wiki describes how to preprocess molecules, train molecular language models, and generate novel molecule libraries. The objective is to replicate the framework introduced by Segler et al. in the paper "Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks" (https://arxiv.org/pdf/1701.01329.pdf).
In de novo drug design, we seek to generate novel focused molecule libraries, which are active toward a particular set of biological targets. We aim to create generative statistical models for molecular data that can capture the distributions of molecular compounds and generate a new set of valid compounds expected to be active toward the targets. One way to accomplish this is to use transfer learning as follows:
- Fit a generative model to a large set of diverse molecules so that it can generate new molecules.
- Perform fine-tuning, i.e. re-train the model on a smaller data set of molecules active toward the targets.
Here we describe the machine learning methods adapted for use in drug design. For demonstration purposes, we use a small data set of SMILES strings of compounds (for bigger data sets, please refer to our scripts and queries or search for alternatives). We demonstrate how to implement the first stage.
The main paper introduces a statistical language model for a molecular data set. Accordingly, a molecule is represented as a SMILES string. First, let's create a data set of SMILES strings:
>>> smiles_strings = (
...     'N#N\n'
...     'CN=C=O\n'
...     '[Cu+2].[O-]S(=O)(=O)[O-]\n'
...     'CN1CCC[C@H]1c2cccnc2\n'
...     'O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5\n'
...     'OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1\n'
...     'OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H]2[C@@H]1c3c(O)c(OC)c(O)cc3C(=O)O2\n'
...     'CC[C@H](O1)CC[C@@]12CCCO2\n'
...     'CC(C)[C@@]12C[C@@H]1[C@@H](C)C(=O)C2\n'
...     'OCCc1c(C)[n+](cs1)Cc2cnc(C)nc2N\n'
...     'CC(=O)NCCC1=CNc2c1cc(OC)cc2\n'
...     'CCc1c[n+]2ccc3c4ccccc4[nH]c3c2cc1'
... )
>>> file_name = 'test_data.txt'
>>> with open(file_name, mode='w', encoding='ascii') as file_handler:
...     file_handler.write(smiles_strings)
Next, we create a data set with moleculegen.data.SMILESDataset, which stores the list of SMILES strings, each enclosed between the beginning-of-SMILES token '{' and the end-of-SMILES token '}':
>>> import moleculegen as mg
>>> dataset = mg.data.SMILESDataset(file_name)
>>> print('\n'.join(dataset[0:3]))
{N#N}
{CN=C=O}
{[Cu+2].[O-]S(=O)(=O)[O-]}
SMILESDataset instances are subscriptable and iterable, and every item in the dataset is a raw SMILES string. Every SMILES string consists of tokens. A token represents an atom ('Br', 'I', etc.), a non-atom symbol ('(', '+', etc.), a subcompound ('[nH]', etc.), or a special character ('{', '}', '_'). The moleculegen.Token class describes all the available tokens and provides methods to manipulate them and to tokenize SMILES strings.
>>> mg.Token.tokenize('{CN=C=O}') # Return the list of tokens.
['{', 'C', 'N', '=', 'C', '=', 'O', '}']
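To make the idea of tokenization concrete, here is a minimal pure-Python sketch. It is not the moleculegen.Token implementation: the regular expression below is a hypothetical pattern that recognizes only a tiny illustrative subset of multi-character tokens ('[nH]', 'Br', 'Cl') and treats every other character as its own token.

```python
import re

# Hypothetical pattern (an assumption, not moleculegen's own rules):
# try the multi-character tokens first, then fall back to one character.
TOKEN_PATTERN = re.compile(r"\[nH\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


simple_tokens = tokenize('{CN=C=O}')
mixed_tokens = tokenize('{c1cc[nH]c1Br}')
```

Ordering the alternatives from longest to shortest matters: if the single-character fallback came first, 'Br' would be split into 'B' and 'r'.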
Now we know how to load and tokenize data sets for language models. However, our models cannot process text directly, so we need a numerical representation of the data. For this purpose, we create a moleculegen.data.SMILESVocabulary instance. It first tokenizes our data, then builds token-to-ID and ID-to-token mappings along with token frequency statistics.
>>> vocabulary = mg.data.SMILESVocabulary(dataset=dataset, need_corpus=True)
>>> vocabulary
SMILESVocabulary{ '_', '{', '}', ')', '[', 'C', '3', '=', 'c', 'H', 's', '2', '#', ... }
The corpus attribute holds the token IDs of every SMILES string in the data. To build and store the corpus, we passed need_corpus=True. This corpus will be passed to batch samplers to train our models.
>>> vocabulary.corpus[0]
[1, 25, 12, 25, 2]
>>> ''.join(vocabulary.get_tokens(vocabulary.corpus[0])) == dataset[0]
True
>>> vocabulary.idx_to_token[:-15]
['_', '{', '}', ')', '[', 'C', '3', '=', 'c', 'H', 's', '2', '#']
>>> vocabulary.token_to_idx['_'] == 0
True
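The mappings and the corpus can be pictured with a few lines of plain Python. This is only a sketch of the general technique, not the SMILESVocabulary implementation: the toy data, the variable names, and the frequency-based ID order are assumptions (the IDs moleculegen assigns differ, as the output above shows).

```python
from collections import Counter

# Hypothetical toy data: SMILES strings already split into tokens.
tokenized = [
    ['{', 'N', '#', 'N', '}'],
    ['{', 'C', 'N', '=', 'C', '=', 'O', '}'],
]

# Count how often each token occurs across the data set.
token_freqs = Counter(token for tokens in tokenized for token in tokens)

# Reserve ID 0 for the padding token '_', then add the remaining tokens
# ordered by descending frequency (an assumed ordering).
idx_to_token = ['_'] + [token for token, _ in token_freqs.most_common()]
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}

# The corpus is the data set with every token replaced by its ID.
corpus = [[token_to_idx[token] for token in tokens] for tokens in tokenized]

# Mapping the IDs back recovers the original tokens.
restored = [idx_to_token[idx] for idx in corpus[0]]
```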
During fine-tuning, we will work with a separate data set, which must also be transformed into a corpus (for labeled data, see moleculegen.data.SMILESTargetDataset). We can use the get_token_id_corpus method of vocabulary for this. Also, note that four attributes of SMILESVocabulary instances, token_freqs, token_to_idx, idx_to_token, and corpus, can be serialized with the to_pickle method. To load a pickle next time, specify the load_from_pickle parameter.
Next, we should create a data loader that samples mini-batches to pass into the model. We have at least two options: moleculegen.data.SMILESBatchSampler and moleculegen.data.SMILESBatchColumnSampler. We will use the first one. Its API is very similar to Gluon's batch sampler API, so we need to pass a sequence sampler and a batch size. Our sequence sampler implementation is moleculegen.data.SMILESConsecutiveSampler (see also SMILESRandomSampler). It divides token IDs into subsequences of length n_steps and appends the padding token '_' if the last subsequence of a SMILES string is shorter than n_steps:
>>> sequence_sampler = mg.data.SMILESConsecutiveSampler(
...     corpus=vocabulary.corpus, n_steps=16, shuffle=False)
>>> sample = next(iter(sequence_sampler))
>>> sample.inputs, sample.outputs
([1, 25, 12, 25, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[25, 12, 25, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> sample.valid_length == 4
True
As you can see, the sampler generates samples of type SMILESConsecutiveSampler.Sample, which have inputs, outputs, and valid_length attributes.
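The windowing and padding logic can be sketched in plain Python. This is not the SMILESConsecutiveSampler code; it is a minimal illustration under two assumptions inferred from the sample shown above: outputs are the inputs shifted one token to the left, and valid_length counts the non-padding tokens in outputs. The names consecutive_samples and PAD_ID are hypothetical.

```python
from collections import namedtuple

Sample = namedtuple('Sample', ['inputs', 'outputs', 'valid_length'])

PAD_ID = 0  # ID of the padding token '_'


def consecutive_samples(token_ids, n_steps):
    """Yield fixed-length (inputs, outputs) windows over one SMILES string.

    Assumption: `outputs` is `inputs` shifted one token to the left, and
    `valid_length` counts the non-padding tokens in `outputs`.
    """
    for start in range(0, len(token_ids) - 1, n_steps):
        inputs = token_ids[start:start + n_steps]
        outputs = token_ids[start + 1:start + n_steps + 1]
        valid_length = len(outputs)
        # Pad both windows on the right up to n_steps.
        inputs = inputs + [PAD_ID] * (n_steps - len(inputs))
        outputs = outputs + [PAD_ID] * (n_steps - len(outputs))
        yield Sample(inputs, outputs, valid_length)


# The token IDs of '{N#N}' from the vocabulary above.
sketch_sample = next(consecutive_samples([1, 25, 12, 25, 2], n_steps=16))
```

Under these assumptions, the sketch yields the same inputs, outputs, and valid_length for '{N#N}' as the library sample shown above.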
Now, let's create a batch sampler:
>>> batch_sampler = mg.data.SMILESBatchSampler(
...     sampler=sequence_sampler, batch_size=4, last_batch='rollover')
>>> len(batch_sampler)
8
So, this batch sampler will generate 8 mini-batches, each containing 4 samples similar to the one demonstrated above.
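The last_batch argument decides what happens when the number of samples is not a multiple of the batch size. The sketch below counts the mini-batches per epoch under the three modes used by Gluon's batch sampler ('keep', 'discard', 'rollover'). We assume here that SMILESBatchSampler follows the same semantics; the function name count_batches and the sample count of 34 are hypothetical.

```python
def count_batches(n_samples, batch_size, last_batch='keep', carried_over=0):
    """Count mini-batches in one epoch under Gluon-style `last_batch` modes."""
    if last_batch == 'keep':
        # Keep the final batch even if it is smaller than batch_size.
        return (n_samples + batch_size - 1) // batch_size
    if last_batch == 'discard':
        # Drop the final incomplete batch.
        return n_samples // batch_size
    if last_batch == 'rollover':
        # Leftover samples are carried over into the next epoch.
        return (carried_over + n_samples) // batch_size
    raise ValueError(f"unknown last_batch mode: {last_batch!r}")


# Hypothetical epoch with 34 samples and a batch size of 4.
kept = count_batches(34, 4, 'keep')
discarded = count_batches(34, 4, 'discard')
first_epoch = count_batches(34, 4, 'rollover')
second_epoch = count_batches(34, 4, 'rollover', carried_over=2)
```

With 'rollover', the two leftover samples of the first epoch are not lost; they complete an extra batch in the second epoch.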
Now we are ready to introduce our language model.
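Let S be a molecule represented as a SMILES string that contains T symbols (tokens) s_1, ..., s_T from a SMILES vocabulary. Then, according to a parametric language model with parameters θ, the probability of S is

$$P_\theta(S) = P_\theta(s_1) \prod_{t=2}^{T} P_\theta(s_t \mid s_1, \ldots, s_{t-1}).$$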
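The factorization of a sequence probability into per-token conditional probabilities can be illustrated with a short sketch. The function names are hypothetical, and uniform_model is a deliberately trivial stand-in for a trained network: it assigns the same probability to every token in a vocabulary of 28 tokens.

```python
import math


def sequence_log_prob(tokens, cond_prob):
    """Sum log P(s_t | s_1, ..., s_{t-1}) over all positions t."""
    return sum(
        math.log(cond_prob(tokens[:t], tokens[t]))
        for t in range(len(tokens))
    )


def uniform_model(prefix, token, vocab_size=28):
    """Hypothetical stand-in model: every token is equally likely."""
    return 1.0 / vocab_size


tokens = ['{', 'C', 'N', '}']
log_prob = sequence_log_prob(tokens, uniform_model)
prob = math.exp(log_prob)  # equals (1 / 28) ** 4 for this toy model
```

Summing log-probabilities instead of multiplying probabilities avoids numerical underflow for long SMILES strings.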
Autoregressive models, particularly recurrent neural networks, are a good choice for this setup. In moleculegen, we introduce a SMILESLM neural network architecture comprising three blocks: embedding, encoder, and output (decoder). moleculegen.estimation.SMILESRNN is an RNN-based language model. It has a number of formal parameters (see docs), which can also be loaded from a configuration file using the from_config class method, but the main parameter is vocab_size:
>>> model = mg.estimation.SMILESRNN(len(vocabulary))
>>> model
SMILESRNN(
(_embedding): HybridSequential(
(0): Embedding(28 -> 32, float32)
(1): Dropout(p = 0.4, axes=1)
)
(_encoder): LSTM(None -> 256, TNC, num_layers=2, dropout=0.6)
(_decoder): Dense(None -> 28, linear)
)
To run model training, we need an optimization method and a loss function. Optionally, we can pass a list of callbacks from the moleculegen.callback subpackage:
>>> import mxnet as mx
>>> mx.npx.set_np()  # Use the NumPy-compatible (DeepNumPy) interface instead of mxnet.nd.
>>> optimizer = mx.optimizer.Adam(learning_rate=0.001)
>>> loss_fn = mx.gluon.loss.SoftmaxCELoss()
>>> model.fit(batch_sampler, optimizer, loss_fn, n_epochs=10,
...           callbacks=[mg.callback.ProgressBar()], verbose=True)
Epoch 1 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.516 (+/-0.499), 0.019 sec/batch
Epoch 2 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.349 (+/-0.419), 0.019 sec/batch
Epoch 3 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.168 (+/-0.381), 0.023 sec/batch
Epoch 4 [✓✓✓✓✓✓✓✗] Batch 8/8, Loss 2.178 (+/-0.356), 0.023 sec/batch
Epoch 5 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.092 (+/-0.354), 0.018 sec/batch
Epoch 6 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.125 (+/-0.332), 0.019 sec/batch
Epoch 7 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.010 (+/-0.367), 0.023 sec/batch
Epoch 8 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 2.061 (+/-0.313), 0.021 sec/batch
Epoch 9 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 1.928 (+/-0.334), 0.020 sec/batch
Epoch 10 [✓✓✓✓✓✓✓✓] Batch 8/8, Loss 1.923 (+/-0.264), 0.024 sec/batch
Time: 0:00:02.
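To give a feeling for the loss values reported above, here is a pure-Python sketch of softmax cross-entropy for a single token, together with one common way to ignore padded positions by averaging only over the first valid_length steps. Whether the fit method masks padding in exactly this way is an assumption, and both function names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    shift = max(logits)  # subtract the maximum for numerical stability
    log_norm = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_norm - logits[target]


def masked_mean_loss(step_logits, targets, valid_length):
    """Average the per-token loss over the first `valid_length` steps only."""
    losses = [
        softmax_cross_entropy(logits, target)
        for logits, target in zip(step_logits, targets)
    ]
    return sum(losses[:valid_length]) / valid_length


# A model that scores all 28 tokens equally has a loss of ln(28).
uniform_logits = [0.0] * 28
baseline = softmax_cross_entropy(uniform_logits, target=5)

# The padded outputs of '{N#N}' with n_steps=16 and valid_length=4.
targets = [25, 12, 25, 2] + [0] * 12
masked = masked_mean_loss([uniform_logits] * 16, targets, valid_length=4)
```

The baseline of ln(28), about 3.33, is the loss of a model that knows nothing about the data, so values below it indicate that the model has started to learn the token distribution.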
Since SMILESRNN inherits from Gluon's Block, you are free to use any of its methods, including parameter saving and loading.
Finally, we can sample new SMILES strings with the moleculegen.generation functors, including SoftmaxSearch:
>>> predictor = mg.generation.SoftmaxSearch(model, vocabulary, temperature=0.7)
>>> predictor()
ON=O=O
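The temperature parameter controls how adventurous the sampling is. The sketch below shows temperature-scaled softmax sampling in plain Python. It illustrates the general technique that the SoftmaxSearch name and its temperature argument suggest, not the library's actual code; the function names and the example logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; low temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token_id(logits, temperature=1.0, rng=random):
    """Draw one token ID from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharp = softmax_with_temperature(logits, temperature=0.7)
token_id = sample_token_id(logits, temperature=0.7, rng=random.Random(0))
```

A temperature below 1 concentrates probability on the highest-scoring tokens, which tends to produce more conservative strings; a temperature above 1 flattens the distribution and increases diversity. In a full generator, sampling would repeat token by token, starting from '{', until '}' is drawn.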
For more information about moleculegen, refer to the source code of the package, in which every API is documented. Specifically, try evaluating models with moleculegen.evaluation and monitoring training progress with moleculegen.callback.