custom molecules embeddings #634

Rainsmumu · 2024-01-24T03:07:46Z

Hi!
As mentioned in #570, I want to modify the multi-molecule embedding process to meet the specific requirements of my project. In short, my data looks like this:

primary SMILES	secondary SMILES-1	secondary SMILES-2
mol-1-1	mol-1-2	mol-1-3
mol-2-1	mol-2-2	mol-2-3
mol-3-1	mol-3-2	mol-3-3

The predicted property is determined jointly by multiple molecules, with varying degrees of importance. The first column represents the primary molecule, while the remaining columns correspond to secondary molecules. What I want to do is use Chemprop's multi-molecule models to embed these molecules, but with a slight variation in the embedding approach. It's something like this:

The main modification is to use a separate MPNN encoder for the primary molecule, while the remaining secondary molecules all use another, shared encoder (i.e., as with mpn_shared = True). The encodings that the secondary molecules obtain from the MPNN encoder are summed and then concatenated with the encoding of the primary molecule to produce the final embedding for subsequent predictions.

Below are three modifications I made to the Chemprop source code:
mpn.py

if args.mpn_shared:
    self.encoder = nn.ModuleList([MPNEncoder(args, self.atom_fdim, self.bond_fdim)] * args.number_of_molecules)
elif args.use_custom_embeddings:
    # The primary molecule uses an independent encoder,
    # while the remaining secondary molecules share a second one.
    self.encoder = nn.ModuleList(
        [MPNEncoder(args, self.atom_fdim, self.bond_fdim)]
        + [MPNEncoder(args, self.atom_fdim, self.bond_fdim)] * (args.number_of_molecules - 1)
    )
else:
    self.encoder = nn.ModuleList([MPNEncoder(args, self.atom_fdim, self.bond_fdim)
                                  for _ in range(args.number_of_molecules)])
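
One detail of the list construction above is worth spelling out: multiplying a one-element list repeats a reference to the same object, whereas a list comprehension builds a new object on every iteration. The sketch below illustrates this with a hypothetical stand-in class (`DummyEncoder` is not part of Chemprop), so it runs without any dependencies.

```python
class DummyEncoder:
    """Hypothetical stand-in for an encoder module; not a Chemprop class."""


def build_custom(n_molecules):
    # One encoder for the primary molecule, plus a single encoder object
    # repeated by reference for every secondary molecule.
    return [DummyEncoder()] + [DummyEncoder()] * (n_molecules - 1)


def build_independent(n_molecules):
    # A fresh encoder object for each molecule.
    return [DummyEncoder() for _ in range(n_molecules)]


custom = build_custom(3)
independent = build_independent(3)

# Count distinct objects in each list.
n_unique_custom = len({id(e) for e in custom})
n_unique_independent = len({id(e) for e in independent})
secondary_shared = custom[1] is custom[2]
```

When the list holds `nn.Module` instances, repeated references to one module mean that the same parameters are used at each of those positions, which is what makes the secondary encoder shared.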

# The encodings of secondary molecules are summed,
# and then concatenated with the encoding of the primary molecule.
if self.use_custom_embeddings:
    main = encodings[0]
    side = encodings[1:]
    side_encodings = torch.sum(torch.stack(side), dim=0)
    output = torch.cat([main, side_encodings], dim=1)
else:
    output = encodings[0] if len(encodings) == 1 else torch.cat(encodings, dim=1)
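
To make the shapes in the custom branch concrete, here is a minimal, self-contained sketch that uses random tensors in place of real encoder outputs. The batch size, hidden size, and the helper name `combine_encodings` are illustrative assumptions, not part of Chemprop.

```python
import torch


def combine_encodings(encodings, use_custom_embeddings):
    """Combine per-molecule encodings, each of shape (batch, hidden)."""
    if use_custom_embeddings and len(encodings) > 1:
        main = encodings[0]
        # Sum the secondary encodings element-wise: (batch, hidden).
        side = torch.stack(encodings[1:]).sum(dim=0)
        # Concatenate primary and summed secondary: (batch, 2 * hidden).
        return torch.cat([main, side], dim=1)
    # Default behaviour: concatenate everything: (batch, n * hidden).
    return encodings[0] if len(encodings) == 1 else torch.cat(encodings, dim=1)


batch_size, hidden_size, n_molecules = 4, 8, 3  # illustrative values
encodings = [torch.randn(batch_size, hidden_size) for _ in range(n_molecules)]

custom = combine_encodings(encodings, use_custom_embeddings=True)
default = combine_encodings(encodings, use_custom_embeddings=False)
```

The `len(encodings) > 1` guard is an addition in this sketch: `torch.stack` raises an error on an empty list, so the custom branch needs at least one secondary molecule.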

model.py

if args.reaction_solvent:
    first_linear_dim = args.hidden_size + args.hidden_size_solvent
else:
    # Change the size of first_linear_dim to 2 * hidden_size
    # 2 represents the combination of the primary molecule and the secondary molecules.
    if args.use_custom_embeddings:
        first_linear_dim = args.hidden_size * 2
    else:
        first_linear_dim = args.hidden_size * args.number_of_molecules
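
The size bookkeeping above can be checked with a small stand-alone function. This is only a sketch: `first_linear_dim` here is a hypothetical helper that mirrors the branch logic, and the default for `hidden_size_solvent` and the example values are assumptions.

```python
def first_linear_dim(hidden_size, number_of_molecules,
                     use_custom_embeddings=False,
                     reaction_solvent=False, hidden_size_solvent=0):
    """Input width of the first feed-forward layer, mirroring the branches above."""
    if reaction_solvent:
        return hidden_size + hidden_size_solvent
    if use_custom_embeddings:
        # Primary encoding plus the summed secondary encodings.
        return hidden_size * 2
    return hidden_size * number_of_molecules


custom_dim = first_linear_dim(300, 3, use_custom_embeddings=True)
default_dim = first_linear_dim(300, 3)
```

With a hidden size of 300 and three molecules, the custom layout needs an input width of 600 instead of 900; with exactly two molecules, both layouts give the same width.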

I have tested the modified code, and it appears to work as intended, but I am unsure whether my modifications are complete and correct. Before proceeding further, I would like to know whether I have overlooked anything or whether the code could be improved. Thank you in advance!

Due to English not being my native language, there might be some grammar or semantic errors in my expression. I appreciate your understanding.

davidegraff · 2024-01-24T16:01:50Z

Collaborator

I've been thinking about the same concept as it relates to v2. Currently, it wouldn't be very hard to hack this together in v2 by subclassing MulticomponentMPNN and overriding the fingerprint method. Of course, you'll also need to keep track of the corresponding predictor input size somehow.

More broadly, it would be nice (in v2.1 or 2.2) if we could programmatically encode this behavior somehow. That is, given a k-component input, (1) encode all k components, (2) aggregate components 0 and 1 together, (3) aggregate components 2, ..., k-1 together, (4) concatenate the resulting latent codes. Currently, we can't perform steps (2) and (3) because all encoded inputs get concatenated together, so we would need some sort of "subaggregation" routine prior to the concatenation in step (4).
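
As a rough illustration of the subaggregation idea in steps (2) to (4), the sketch below sums encoded components within user-specified groups and then concatenates the group results. It is not Chemprop v2 code: `subaggregate`, the `groups` argument, and the choice of summation as the aggregation are all assumptions made for illustration.

```python
import torch


def subaggregate(encodings, groups):
    """Sum encodings within each group, then concatenate the group sums.

    encodings: list of k tensors, each of shape (batch, hidden)
    groups: list of lists of component indices, e.g. [[0, 1], [2, 3]]
    """
    pooled = [torch.stack([encodings[i] for i in g]).sum(dim=0) for g in groups]
    return torch.cat(pooled, dim=1)  # (batch, len(groups) * hidden)


k, batch_size, hidden_size = 4, 2, 5  # illustrative values
encodings = [torch.randn(batch_size, hidden_size) for _ in range(k)]

# Components 0 and 1 form one group; components 2, ..., k-1 form another.
latent = subaggregate(encodings, groups=[[0, 1], list(range(2, k))])
```

In this framing, the layout from the original post (one primary molecule, summed secondary molecules) would correspond to `groups=[[0], list(range(1, k))]`.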

On a more general note, given a k-component input, the current object model in v2 suggests that multicomponent encoding is "all or none" with regard to independent message passing blocks. That is, we either have an independent block for each component or share one block among all components, so it would seem that we can't accommodate a middle ground, i.e., j < k blocks, where, say, components 0 and 2 get sent to block 0 and component 1 gets sent to block 1. In fact, we can do this during initialization as long as we know the mapping. In the prior example of 2 blocks for 3 components, {0, 2} -> 0 and {1} -> 1, we would just need to build our MulticomponentMessagePassing like so:

blocks: list[MessagePassing]
input_i_to_block_j = [0, 1, 0]  # a map from input index to block index
mp = MulticomponentMessagePassing([blocks[i] for i in input_i_to_block_j])

But we will first need to delete these lines:

elif not shared and len(blocks) != n_components:
    raise ValueError(
        "arg 'n_components' must be equal to `len(blocks)` if 'shared' is False! "
        f"got: {n_components} and {len(blocks)}, respectively."
    )
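
The input-to-block mapping described above can be illustrated without Chemprop. In this sketch, `DummyBlock` and `expand_blocks` are hypothetical names standing in for real message passing blocks and for the list comprehension shown earlier; the range check is an extra added for illustration.

```python
class DummyBlock:
    """Hypothetical stand-in for a message passing block; not a Chemprop class."""

    def __init__(self, name):
        self.name = name


def expand_blocks(blocks, input_to_block):
    """Return one block reference per input component, following the mapping."""
    for j in input_to_block:
        if not 0 <= j < len(blocks):
            raise ValueError(f"block index {j} out of range for {len(blocks)} blocks")
    return [blocks[j] for j in input_to_block]


blocks = [DummyBlock("block-0"), DummyBlock("block-1")]
input_to_block = [0, 1, 0]  # components 0 and 2 -> block 0, component 1 -> block 1

per_component = expand_blocks(blocks, input_to_block)
names = [b.name for b in per_component]
```

The expanded list has one entry per component, but entries 0 and 2 refer to the same block object, so those two components would be encoded with the same weights.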


Rainsmumu · 2024-01-26T01:24:53Z

Author

@davidegraff
Using a mapping is definitely a more elegant way to do this, but since I haven't tried v2 yet, I will probably keep using the modified v1 code for my task.
By the way, when will the stable v2 be released? You mentioned in the docs that it would be in early 2024, so I'm wondering whether I can expect to see it in the next few months. I'm eager to explore the new features.


kevingreenman · 2024-02-02T17:47:02Z

Maintainer

As mentioned on your other issue, I've converted this to a discussion, which is the forum we're planning to use going forward for these types of questions that are more general requests for advice/help or that require more extended discussion.

To answer your question about the stable v2 release, it should be released by the end of February.

