Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add use_custom_templates option #408

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

rostro36
Copy link

Introduction & motiviation

Dear OpenFold team,

Thank you very much for creating such a useful tool and give the community the opportunity to use, ablate and extend the implementation!

In my research, I look at the impact of templates and try to find new protocols centered around them. For some aspects of this, I found that localcolabfold was too restrictive and I would like to use some features, which OpenFold can offer. That is why I implemented the use_custom_templates option, which should be similar to the --custom-template-path argument of localcolabfold. I think this should be enough to have an implementation of AF2Rank (see tests below), touted by some their "favorite in the post-AF2 realm".

Changes

Basically, I fitted the template pipeline from thread_sequence to use it with run_pretrained_model.py

  • openfold/data/templates.py
    • extend get_custom_templates to use multiple files
    • create CustomHitFeaturizer to use this method
  • openfold/data/data_pipeline.py
    • make sure that we always read the templates if CustomHitFeaturizer is used, no matter what alignment we found
  • run_pretrained_openfold.py
    • implement use_custom_templates option
  • README.md
    • update docs about option

Tests

With the same pre-computed one row MSA, I saw that the inclusion of the template increased the pLDDT from 40ish to 80ish. I have not tested multimers or ESM single sequence mode, but it should not interfere with those. Don't know how well it plays along though.

I did a deeper dive and I think I found that using the very simple 1:1 mapping without alignment allows to use the template without any sequence information, except for the size of the sequence. As an empty use_precomputed_alignments also does not pass any information, this setup can be used for AF2Rank. The only other necessary condition is to correctly form the template to all GLY and only backbone + C-Beta. I can donate a script for the creation of this template in about one month after a deadline if you want.

The only test I did is with the randomly chosen, but short 3NIR against the AF2Rank implementation in the ColabFold notebook.

Do you think the differences is within margin of error or should I have another look?
https://gist.github.com/rostro36/42113291b644388730b17d2aac1186be

End

Please let me know what tests and adjustments I can fulfill to add this feature or if I should merge it to another branch first.
Thank you for work!

@jnwei
Copy link
Collaborator

jnwei commented Feb 29, 2024

Hi @rostro36 , thank you very much for this contribution. This is a really nice option to have for running with templates, and we would like to work with you to add this to OpenFold.

We're preparing some testcases based on previous work that we have done with templates, especially on the single sequence model. Hopefully these testcases will be helpful for adjusting the code changes to work for all inference modes.

I did run some quick integration tests with your PR. It looks like monomer inference mode using pre-computed alignments without templates has a small issue? Here's a sample command:

python $OPENFOLD_DIR/run_pretrained_openfold.py \
            $INPUT_DIR/fasta_dir \
                $MMCIF_DIR \
            --use_precomputed_alignments $PRECOMPUTED_ALIGNMENTS \
            --model_device "cuda:0" \
            --output_dir $OUTPUT_DIR \
            --config_preset "model_1_ptm" \
            --jax_param_path $CHECKPOINT_PATH

And the corresponding error:

Traceback (most recent call last):
File "/eagle/projects/MLProtein/jnwei/openfold/run_pretrained_openfold.py", line 476, in <module>
  main(args)
File "/eagle/projects/MLProtein/jnwei/openfold/run_pretrained_openfold.py", line 283, in main
  feature_dict = generate_feature_dict(
File "/eagle/projects/MLProtein/jnwei/openfold/run_pretrained_openfold.py", line 141, in generate_feature_dict
  feature_dict = data_processor.process_fasta(
File "/lus/eagle/projects/MLProtein/jnwei/openfold/openfold/data/data_pipeline.py", line 889, in process_fasta
  template_features = make_template_features(
File "/lus/eagle/projects/MLProtein/jnwei/openfold/openfold/data/data_pipeline.py", line 57, in make_template_features
  templates_result = template_featurizer.get_templates(
File "/lus/eagle/projects/MLProtein/jnwei/openfold/openfold/data/templates.py", line 1211, in get_templates
  return get_custom_template_features(
File "/lus/eagle/projects/MLProtein/jnwei/openfold/openfold/data/templates.py", line 980, in get_custom_template_features
  curr_features, curr_warnings = _extract_template_features(
File "/lus/eagle/projects/MLProtein/jnwei/openfold/openfold/data/templates.py", line 672, in _extract_template_features
  templates_all_atom_positions[k] = all_atom_positions[template_index][0]

I haven't had a chance to look into this deeply, but maybe have another look at the logic in get_custom_templates for parsing multiple mmcif files?

@rostro36
Copy link
Author

rostro36 commented Mar 2, 2024

Hey @jnwei ,
Thanks for running some tests and highlighting even more test cases.

  • Commit 10b6838 fixes always taking use_custom_templates.
  • Commit e9bacd8 makes the processing less dependent on the input sequence and instead take length of template
  • Commit 50f8617 fixes a mistake for taking many custom templates.

On my machine, the code now works for one or two custom templates with and without the use_custom_templates argument.

Let me know if/how I can help you with the further

README.md Outdated
@@ -174,7 +174,10 @@ where `data` is the same directory as in the previous step. If `jackhmmer`,
`/usr/bin`, their `binary_path` command-line arguments can be dropped.
If you've already computed alignments for the query, you have the option to
skip the expensive alignment computation here with
`--use_precomputed_alignments`.
`--use_precomputed_alignments`. If you wish to use a specific template as input,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this information to a different subheading of it's own? Maybe called "Custom Templates"?
Do mention that it'll be the same template(s) that will be used if multiple separate sequences are passed for inference.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments. The others are fulfilled without any question.
I am not sure if I understood you correctly about this one here though.

Now, I changed the README, so custom templates stands a bit more on it's own, but still in the monomer inference section.

Should I make a new section on the same level as monomer inference, multimer inference or soloseq?

openfold/data/templates.py Outdated Show resolved Hide resolved
openfold/data/templates.py Outdated Show resolved Hide resolved
@jnwei
Copy link
Collaborator

jnwei commented Mar 20, 2024

Hi @rostro36 thanks for the fixes.

I finally had a chance to take a closer look at this PR and do some more testing. I had a couple more thoughts/requests:

  1. Sequence alignments: As stated in your OP, this PR currently supports templates that are the same length as the query sequence for a 1:1 mapping, as per the AF2Rank experiment. However, it would be more generally useful if there was an option to support templates of other lengths/chains by first performing an alignment. I can give some more pointers and share my test setup if you're interested in this direction.

  2. Since Soloseq is designed to accept templates to make structure predictions form a single sequence, it would be awesome if this PR also supports that mode. More instructions on how to run Soloseq can be found here.

Let me know what you think, and if you'd be interested in contributing further on either of these directions. Thanks!

@rostro36
Copy link
Author

Hey @jnwei , @sachinkadyan7
Thanks for the review. I hope the changes are to your liking.

I should have a look to get some models running again soon and there is another OpenFold contribution I want to look into, but if it is nothing too big, then I can try to help with 1. the sequence alignments.

@jnwei
Copy link
Collaborator

jnwei commented Mar 21, 2024

Leaving my notes here regarding the sequence alignments in case they are helpful later.

  • A good test would be to compare structure prediction for a query using the default route (using msa to find template matches), and the custom template search route. To this end, I wanted to find which templates were being used by normal inference.

You can find the template matches for a query sequence by adding a breakpoint after the template features dict are generated, and looking at the values for the key template_domain_names. For my sample query: PDB:1LCI (luciferase), I found that the top four templates that matched to this structure in our databases were 6hiv_BC, 6mfw_A, 5wmm_A, 6mfy_A, where the underscores represent the chains.

  • To create a sequence alignment between the query and the custom template, it might be possible to adapt this function / these lines to the custom_templates route.
    parsed_a3m = parsers.parse_a3m(
    aligner.align([old_template_sequence, new_template_sequence])
    )

@rostro36
Copy link
Author

I added support for sequence alignments. 6mfw, 5wmm and 6mfy that you suggested did not have a good sequence similarity to 1lci and Kalign raised an error.

When supplying 3iep_A, 4g37_A and 5gz2_A (found with FoldSeek) with high sequence similarity, but not 100, everything worked and also did get high pLDDT scores in single sequence mode.

Removing the templates still worked, but gave a bad pLDDT in single sequence.

These checks look good to me so far. Let me know if there is something else to test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants