Duplicate expansion support #419

Merged
jnwei merged 9 commits into setup-improvements from setup-improvements_additional-scripts on May 13, 2024

ljarosch
Collaborator

Adds support for expanding the downloaded and flattened RODA alignments to explicit duplicates for both the standard alignment dir and the alignment DB formats.

This makes it more straightforward to create an alignment database directly from the flattened RODA downloads.
This adds support for duplicate chain expansion for the alignment dir format. This script can be run on the flattened non-redundant RODA alignments to add explicit directories for all of the duplicate chains in the duplicate_chains file, symlinked to their representative chain alignment directory.
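
For illustration, the symlink-based expansion described above could look roughly like the sketch below. This is not the script from this PR. It assumes that each line of the duplicate_chains file is a whitespace-separated group of identical chains, and that the representative is the group member that already has a real alignment directory. The function names `parse_duplicate_file` and `expand_duplicates` are hypothetical.

```python
import os
from pathlib import Path


def parse_duplicate_file(text: str) -> list[list[str]]:
    """Parse one whitespace-separated duplicate group per line (assumed format)."""
    return [line.split() for line in text.splitlines() if line.strip()]


def expand_duplicates(alignment_dir: Path, groups: list[list[str]]) -> int:
    """Symlink each duplicate chain to its representative's directory.

    Returns the number of symlinks created.
    """
    created = 0
    for group in groups:
        # Assumption: the representative is the member that already has a
        # real (non-symlink) alignment directory.
        reps = [
            chain
            for chain in group
            if (alignment_dir / chain).is_dir()
            and not (alignment_dir / chain).is_symlink()
        ]
        if not reps:
            continue  # no alignments were generated for this group
        rep = reps[0]
        for chain in group:
            link = alignment_dir / chain
            if chain == rep or link.exists() or link.is_symlink():
                continue  # skip the representative and existing entries
            # A relative target keeps the directory tree relocatable.
            os.symlink(rep, link, target_is_directory=True)
            created += 1
    return created
```

Because existing entries are skipped, running the expansion a second time on the same directory creates no further links.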
Collaborator

jnwei left a comment

Overall looks good. Two minor suggestions about types for you to consider inline below.

     """Iterate over a list in chunks of size chunk_size."""
     for i in range(0, len(lst), chunk_size):
         yield lst[i : i + chunk_size]


-def read_chain_dir(chain_dir) -> dict:
+def read_chain_dir(chain_dir: Path) -> dict:
Collaborator

If you want, you can also add type parameters to the dict; for this specific example, you could write dict[str, Tuple[str, bytes]].

Such annotations can sometimes help the reader know what to expect. For this specific function, I don't think they are as necessary or helpful.

@@ -83,7 +84,7 @@ def create_index_default_dict() -> dict:


 def create_shard(
-    shard_files: List[Path], output_dir: Path, output_name: str, shard_num: int
+    shard_files: list[Path], output_dir: Path, output_name: str, shard_num: int
 ) -> dict:
Collaborator

Perhaps having a type alias for the index dict could be helpful.

If the index file structure is also used in the main library code, then I would consider adding the alias to the main openfold library. But this can also be done in a later PR.
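
As a rough sketch of what such an alias might look like: the index layout below (a shard name plus a list of file name, start offset, and size per chain) is an assumption for illustration, not a layout confirmed by this PR, and the names `ChainIndexEntry`, `AlignmentIndex`, and `index_chain` are hypothetical.

```python
from typing import TypedDict


class ChainIndexEntry(TypedDict):
    """Assumed per-chain index record; the real layout may differ."""

    db: str  # name of the shard file that holds this chain's data
    files: list[tuple[str, int, int]]  # (file name, start offset, size in bytes)


# Hypothetical alias: chain ID -> index record.
AlignmentIndex = dict[str, ChainIndexEntry]


def index_chain(
    db_name: str, files: list[tuple[str, bytes]], start: int = 0
) -> ChainIndexEntry:
    """Build the index record for one chain written at byte offset `start`."""
    entries: list[tuple[str, int, int]] = []
    offset = start
    for name, data in files:
        entries.append((name, offset, len(data)))
        offset += len(data)  # the next file starts where this one ends
    return {"db": db_name, "files": entries}
```

A function such as create_shard could then declare `-> AlignmentIndex` instead of a bare `dict`, so the expected structure is visible at the signature.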

The previous data_dir_to_fasta.py script is very slow and requires fully reparsing the mmCIF files. This new script is much faster and uses the sequence information from the alignment data instead. Note that this will not include chains for which alignments could not be generated, but we can't use those during training anyway.
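
To illustrate the idea of taking sequences from alignment data rather than from mmCIF files, here is a minimal sketch. It assumes the alignment directory format with one subdirectory per chain, and it assumes that the first record of an A3M file is the query sequence. The function names and the `a3m_name` parameter are hypothetical; this is not the script added in this PR.

```python
from pathlib import Path


def query_sequence_from_a3m(a3m_text: str) -> str:
    """Return the first sequence record of an A3M string (assumed to be the query)."""
    seq_lines: list[str] = []
    seen_header = False
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if seen_header:
                break  # second header: the first record is complete
            seen_header = True
            continue
        if seen_header:
            seq_lines.append(line)
    return "".join(seq_lines)


def alignment_dir_to_fasta(alignment_dir: Path, a3m_name: str) -> str:
    """Build a FASTA string with one record per chain directory.

    Chains without the requested alignment file are skipped.
    """
    records: list[str] = []
    chain_dirs = sorted(p for p in alignment_dir.iterdir() if p.is_dir())
    for chain_dir in chain_dirs:
        a3m_path = chain_dir / a3m_name
        if not a3m_path.is_file():
            continue
        seq = query_sequence_from_a3m(a3m_path.read_text())
        if seq:
            records.append(f">{chain_dir.name}\n{seq}")
    return "\n".join(records) + ("\n" if records else "")
```

Only the first record of each file is read, so the cost per chain is small compared with parsing a full structure file.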
@@ -0,0 +1,79 @@
+"""
Collaborator

Nit: For module-level comments, it can be helpful to include a one-sentence summary of the module's purpose so the reader can quickly understand what the module is for.

Not sure how this will play with the __doc__ call you use in some of these scripts though. Feel free to keep the single paragraph version if you think that is nicer for the __doc__ call.
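
One possible way to combine a one-sentence summary with a `__doc__`-based help text is sketched below. It assumes the scripts pass `__doc__` to argparse, which the PR does not confirm. `EXAMPLE_DOC` is a stand-in for a module docstring and `build_parser` is a hypothetical helper.

```python
import argparse

# Stand-in for a module docstring; a real script would pass __doc__ instead.
EXAMPLE_DOC = """Generate a FASTA file from an alignment directory.

The query sequence of each chain is read from its alignment file, so no
structure files need to be parsed.
"""


def build_parser(doc: str) -> argparse.ArgumentParser:
    """Use the first paragraph of `doc` as the description, the rest as epilog."""
    summary, _, details = doc.strip().partition("\n\n")
    return argparse.ArgumentParser(
        description=summary,
        epilog=details or None,
        # Preserve the line breaks of the docstring in --help output.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
```

With this split, the summary line appears at the top of `--help` and the longer paragraph follows the argument list, so both the one-sentence and the single-paragraph styles remain usable.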

+    Generates a FASTA string from a chain directory.
+    """
+    # take some alignment file
+    for alignment_file_type in [
Collaborator

Nit: consider making the expected alignment_file_types a global variable, especially since they are used in multiple functions.

Does this mean that this code would not support .sto alignment files? I don't think we need that support now, but it would be good to mention the limitation somewhere.
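
A sketch of the module-level constant suggested here follows. The actual list of alignment file types is truncated in the hunk above, so the value `(".a3m",)` is only a placeholder, and the names `ALIGNMENT_SUFFIXES`, `pick_alignment_file`, and `unsupported_files` are hypothetical.

```python
from typing import Optional

# Placeholder value: the real list of supported alignment file types is not
# visible in the diff. Stockholm (.sto) files are deliberately not listed.
ALIGNMENT_SUFFIXES: tuple[str, ...] = (".a3m",)


def pick_alignment_file(file_names: list[str]) -> Optional[str]:
    """Return the first supported alignment file, or None if there is none."""
    for suffix in ALIGNMENT_SUFFIXES:
        for name in sorted(file_names):
            if name.endswith(suffix):
                return name
    return None


def unsupported_files(file_names: list[str]) -> list[str]:
    """List files whose suffix is not covered, e.g. to log a warning."""
    # str.endswith accepts a tuple of suffixes.
    return [name for name in file_names if not name.endswith(ALIGNMENT_SUFFIXES)]
```

Keeping the supported types in one place also gives a natural spot for a comment documenting that .sto files are not handled.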

jnwei merged commit 29b5823 into setup-improvements on May 13, 2024
2 checks passed
jnwei deleted the setup-improvements_additional-scripts branch on May 13, 2024 07:57
2 participants