recreating homo_search.py output -- minimal version #135

Open
avilella opened this issue Oct 5, 2023 · 3 comments

@avilella
Contributor

avilella commented Oct 5, 2023

Hi,

I am running Uni-Fold on antibody-antigen pairs, where the antigen (chain A) is always the same, and the antibody sequences (chain B in each prediction) are very similar to each other (same species).

The homo_search.py part of run_unifold.sh multimer takes a long time but produces very similar hits for each prediction. I would like to recreate its output in a new folder for new predictions, so that I only need to run the second part of run_unifold.sh, inference.py, on it.

My plan is to aggregate the .sto files from a batch of predictions and produce a combined version in the new input folder structure for inference.py. The .sto format is a bit cumbersome to recreate. If inference.py does not read the alignment structure from these files, but only the fasta entries, would it be possible to provide the "combined inputs" as multi-fasta files rather than .sto files?

Thanks in advance.

[       4096 Oct  3 15:59]  ./B
[  231414423 Oct  3 15:59]  ./B/uniprot_hits.sto
[   62590981 Oct  3 15:59]  ./B/pdb_hits.sto
[     516587 Oct  3 15:59]  ./B/mgnify_hits.sto
[     462122 Oct  3 15:59]  ./B/bfd_uniclust_hits.a3m
[  184892903 Oct  3 15:59]  ./B/uniref90_hits.sto
[    1788444 Oct  3 15:59]  ./B.uniprot.pkl.gz
[         81 Oct  3 15:59]  ./B.timings.json
[          3 Oct  3 15:59]  ./chains.txt
[        833 Oct  3 15:59]  ./chain_id_map.json
[     811365 Oct  3 15:59]  ./B.feature.pkl.gz
[      31503 Oct  3 15:59]  ./A.uniprot.pkl.gz
[         80 Oct  3 15:59]  ./A.timings.json
[     282367 Oct  3 15:59]  ./A.feature.pkl.gz
[       4096 Oct  3 15:59]  ./A
[    1139693 Oct  3 15:59]  ./A/uniref90_hits.sto
[     966652 Oct  3 15:59]  ./A/uniprot_hits.sto
[   40494816 Oct  3 15:59]  ./A/pdb_hits.sto
[       3189 Oct  3 15:59]  ./A/mgnify_hits.sto
[     233861 Oct  3 15:59]  ./A/bfd_uniclust_hits.a3m
[        255 Oct  3 15:59]  ./1b634d49dfcce4784af7c9bbb7d53496.TRI002.mmer_B.fasta
[        123 Oct  3 15:59]  ./1b634d49dfcce4784af7c9bbb7d53496.TRI002.mmer_A.fasta
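
The aggregation plan described above could be sketched roughly as follows. This is a minimal illustration, not part of Uni-Fold: `read_stockholm` and `merge_to_fasta` are hypothetical helper names, and the sketch assumes plain Stockholm text in which markup lines start with `#` and each alignment ends with `//`. It drops gap characters, so the alignment itself is discarded; whether the downstream feature pipeline accepts such input is exactly the open question in this issue.

```python
def read_stockholm(text):
    """Return {name: aligned_sequence}, joining interleaved Stockholm blocks."""
    seqs = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, markup lines (#=GF, #=GS, ...) and the terminator.
        if not line or line.startswith("#") or line == "//":
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, chunk = parts
        seqs[name] = seqs.get(name, "") + chunk.strip()
    return seqs


def merge_to_fasta(sto_texts):
    """Merge several alignments into one unaligned, de-duplicated FASTA string."""
    seen = set()
    records = []
    for text in sto_texts:
        for name, aligned in read_stockholm(text).items():
            # Remove gap characters; the alignment columns are lost here.
            residues = aligned.replace("-", "").replace(".", "").upper()
            if residues and residues not in seen:
                seen.add(residues)
                records.append(f">{name}\n{residues}")
    return "\n".join(records) + "\n"


# Toy inputs (made up for illustration, not real search hits).
demo_a = (
    "# STOCKHOLM 1.0\n#=GS seq1 DE example\n"
    "query MKT-A\nseq1  MRTLA\n\nquery GG\nseq1  G-\n//\n"
)
demo_b = "# STOCKHOLM 1.0\nquery MKTAGG\nseq2  MKTAGA\n//\n"
merged = merge_to_fasta([demo_a, demo_b])
print(merged)
```

De-duplicating on the ungapped residues keeps the combined file small when the per-antibody searches return largely the same hits.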
@ZiyaoLi
Member

ZiyaoLi commented Oct 8, 2023

I recommend referring to the mmseqs processing code here and here. It has a lighter processing pipeline.

@avilella
Contributor Author

avilella commented Oct 9, 2023

If I am reading the code in inference.py correctly, in the multimer case it reads from uniprot_msa_dir?

def load_feature_for_one_target(
    config, data_folder, seed=0, is_multimer=False, use_uniprot=False
):
    if not is_multimer:
        uniprot_msa_dir = None
        sequence_ids = ["A"]
        if use_uniprot:
            uniprot_msa_dir = data_folder

    else:
        uniprot_msa_dir = data_folder
        sequence_ids = open(os.path.join(data_folder, "chains.txt")).readline().split()
    batch, _ = load_and_process(
        config=config.data,
        mode="predict",
        seed=seed,
        batch_idx=None,
        data_idx=0,
        is_distillation=False,
        sequence_ids=sequence_ids,
        monomer_feature_dir=data_folder,
        uniprot_msa_dir=uniprot_msa_dir,
        is_monomer=(not is_multimer),
    )
    batch = UnifoldDataset.collater([batch])
    return batch


def main(args):
    config = model_config(args.model_name)
    config.data.common.max_recycling_iters = args.max_recycling_iters
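
Before running inference on a hand-assembled folder, a quick sanity check along these lines may help. It is only a sketch: `missing_inputs` is a hypothetical helper, the file names are taken from the directory listing earlier in this thread, and which of those files are actually loaded for each chain is an assumption based on the excerpt above (the data folder serves as both `monomer_feature_dir` and `uniprot_msa_dir` in the multimer case).

```python
import os
import tempfile


def missing_inputs(data_folder, is_multimer=True):
    """List per-chain files that look required but are absent (assumed layout)."""
    if is_multimer:
        # Same parsing as the excerpt: first line of chains.txt, whitespace-split.
        with open(os.path.join(data_folder, "chains.txt")) as handle:
            sequence_ids = handle.readline().split()
    else:
        sequence_ids = ["A"]
    suffixes = [".feature.pkl.gz"]
    if is_multimer:
        suffixes.append(".uniprot.pkl.gz")  # assumed to back uniprot_msa_dir
    expected = [sid + suffix for sid in sequence_ids for suffix in suffixes]
    return [
        name for name in expected
        if not os.path.exists(os.path.join(data_folder, name))
    ]


# Demo on a throwaway folder with one file deliberately left out.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "chains.txt"), "w") as handle:
        handle.write("A B")
    for name in ("A.feature.pkl.gz", "A.uniprot.pkl.gz", "B.feature.pkl.gz"):
        open(os.path.join(folder, name), "wb").close()
    missing = missing_inputs(folder)
    print(missing)
```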

@ZiyaoLi
Member

ZiyaoLi commented Oct 12, 2023

Yes. The UniProt MSAs are used for MSA pairing because they contain species information.
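
To illustrate why species information matters for pairing, here is a deliberately simplified sketch. It is not the Uni-Fold implementation: `species_of` and `pair_by_species` are hypothetical names, the identifiers are made up, and the sketch assumes UniProt-style hit names whose entry name ends in a species mnemonic (for example `..._HUMAN`). A real pairing procedure is more involved than taking the first hit per species.

```python
import re

# Assumption: identifiers look like "tr|A0A001|A0A001_HUMAN/1-50".
_SPECIES = re.compile(r"^(?:sp|tr)\|[^|]+\|[A-Z0-9]+_([A-Z0-9]+)")


def species_of(identifier):
    """Return the species mnemonic of a UniProt-style identifier, or None."""
    match = _SPECIES.match(identifier)
    return match.group(1) if match else None


def pair_by_species(ids_a, ids_b):
    """Pair the first hit of each shared species across two chains."""
    first_b = {}
    for ident in ids_b:
        species = species_of(ident)
        if species is not None:
            first_b.setdefault(species, ident)
    pairs, used = [], set()
    for ident in ids_a:
        species = species_of(ident)
        if species in first_b and species not in used:
            used.add(species)
            pairs.append((ident, first_b[species]))
    return pairs


# Made-up hit names for two chains.
chain_a = [
    "tr|A0A001|A0A001_HUMAN/1-50",
    "sp|P12345|ABC_MOUSE/1-50",
    "tr|A0A002|A0A002_HUMAN/1-50",
]
chain_b = [
    "sp|Q99999|XYZ_MOUSE/3-80",
    "tr|A0A003|A0A003_RAT/1-80",
    "sp|P00001|DEF_HUMAN/1-80",
]
pairs = pair_by_species(chain_a, chain_b)
print(pairs)
```

Hits without a recognisable species tag cannot be matched across chains this way, which is why a plain multi-fasta file with arbitrary headers would not be enough for this step.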
