Duplicate expansion support #419
This makes it more straightforward to create an alignment database directly from the flattened RODA downloads
This adds support for duplicate chain expansion for the alignment dir format. This script can be run on the flattened non-redundant RODA alignments to add explicit directories for all of the duplicate chains in the duplicate_chains file, symlinked to their representative chain alignment directory.
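The expansion step described above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function name `expand_duplicates` is hypothetical, and it assumes each line of the duplicate_chains file lists a representative chain ID followed by its duplicates, separated by whitespace.

```python
import os
from pathlib import Path


def expand_duplicates(alignment_dir: Path, duplicate_chains_file: Path) -> int:
    """Symlink each duplicate chain to its representative's alignment directory.

    Assumption (not confirmed by the PR): every line of the file reads
    "<representative> <duplicate> <duplicate> ...".
    Returns the number of symlinks created.
    """
    created = 0
    for line in duplicate_chains_file.read_text().splitlines():
        chains = line.split()
        if len(chains) < 2:
            continue  # blank line, or a representative without duplicates
        representative, *duplicates = chains
        if not (alignment_dir / representative).is_dir():
            continue  # no alignments were generated for this representative
        for duplicate in duplicates:
            link = alignment_dir / duplicate
            if link.exists() or link.is_symlink():
                continue  # already expanded, leave it alone
            # A relative target keeps the directory tree relocatable.
            os.symlink(representative, link, target_is_directory=True)
            created += 1
    return created
```

Because existing entries are skipped, running the sketch a second time on the same directory creates no new links.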
Overall looks good. Two minor suggestions about types for you to consider inline below.
     """Iterate over a list in chunks of size chunk_size."""
     for i in range(0, len(lst), chunk_size):
         yield lst[i : i + chunk_size]


-def read_chain_dir(chain_dir) -> dict:
+def read_chain_dir(chain_dir: Path) -> dict:
If you want, you can also add type parameters to the dict; for this specific example, you could write dict[str, tuple[str, bytes]].
Such annotations can help the reader know what to expect, though for this specific function I don't think they are as necessary.
@@ -83,7 +84,7 @@ def create_index_default_dict() -> dict:


 def create_shard(
-    shard_files: List[Path], output_dir: Path, output_name: str, shard_num: int
+    shard_files: list[Path], output_dir: Path, output_name: str, shard_num: int
 ) -> dict:
Perhaps having a type alias for the index dict could be helpful.
If the index file structure is also used in the main library code, then I would consider adding the alias to the main openfold library. But this can also be done in a later PR.
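As a rough illustration of the alias idea, a sketch might look like the following. The names `ChainIndexEntry` and `AlignmentIndex`, and the assumed layout (a database file name plus a list of `(file name, byte offset, byte size)` tuples per chain), are assumptions made for this example and may not match the actual index structure.

```python
from typing import TypedDict


class ChainIndexEntry(TypedDict):
    """One chain's entry in the index (assumed layout, for illustration only)."""

    db: str  # name of the shard file holding this chain's alignments
    files: list[tuple[str, int, int]]  # (file name, byte offset, byte size)


# Hypothetical alias: chain ID -> index entry.
AlignmentIndex = dict[str, ChainIndexEntry]


def total_alignment_bytes(index: AlignmentIndex) -> int:
    """Sum the sizes of all alignment files referenced by the index."""
    return sum(size for entry in index.values() for _, _, size in entry["files"])
```

With an alias like this, a signature such as `-> dict` could become `-> AlignmentIndex`, and the same name could be imported wherever the index is read.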
The previous data_dir_to_fasta.py script is very slow because it fully reparses the mmCIF files. This new script is much faster and uses the sequence information from the alignment data instead. Note that the output will not include chains for which alignments could not be generated, but we can't use those during training anyway.
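The general idea of recovering a sequence from alignment data can be sketched as below. This is not the code from this PR: the helper names are hypothetical, and the sketch assumes that the first record of an A3M file in the chain directory is the query sequence and that the directory name is the chain ID.

```python
from pathlib import Path


def first_a3m_sequence(a3m_text: str) -> str:
    """Return the sequence of the first record in A3M text (assumed to be the query)."""
    sequence_lines = []
    seen_header = False
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if seen_header:
                break  # second record starts, the query is complete
            seen_header = True
            continue
        if seen_header:
            sequence_lines.append(line)
    return "".join(sequence_lines)


def chain_dir_to_fasta(chain_dir: Path) -> str:
    """Build a single FASTA record from any A3M file found in chain_dir."""
    for a3m_file in sorted(chain_dir.glob("*.a3m")):
        sequence = first_a3m_sequence(a3m_file.read_text())
        if sequence:
            return f">{chain_dir.name}\n{sequence}\n"
    raise ValueError(f"No usable .a3m alignment found in {chain_dir}")
```

Since only the first record is read, no structure file needs to be parsed. The trade-off is the one noted above: a chain without alignments yields no FASTA entry.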
@@ -0,0 +1,79 @@
+"""
Nit: For module-level docstrings, it can be helpful to include a one-sentence summary of the module's purpose so the reader can quickly understand what the module is for.
Not sure how this will play with the __doc__ call you use in some of these scripts, though. Feel free to keep the single-paragraph version if you think that works better for the __doc__ call.
+    Generates a FASTA string from a chain directory.
+    """
+    # take some alignment file
+    for alignment_file_type in [
Nit: consider making the expected alignment_file_types a global variable, especially since they are used in multiple functions.
Does this mean that this code would not support .sto alignment files? I don't think we need that support now, but it would be good to mention it somewhere.
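One way to act on both points is sketched below: a single module-level constant shared by every function, plus a helper that makes unsupported files such as .sto visible instead of silently ignoring them. The constant name, its contents, and the helper are hypothetical, and the sketch keys on file suffixes, which may differ from how the script identifies alignment files.

```python
from pathlib import Path

# Hypothetical module-level constant: the one place that defines which
# alignment formats the script can read. Extend it if .sto support is added.
SUPPORTED_ALIGNMENT_SUFFIXES: tuple[str, ...] = (".a3m",)


def partition_alignment_files(chain_dir: Path) -> tuple[list[Path], list[Path]]:
    """Split the files in chain_dir into (supported, skipped) lists."""
    supported: list[Path] = []
    skipped: list[Path] = []
    for path in sorted(chain_dir.iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories
        if path.suffix in SUPPORTED_ALIGNMENT_SUFFIXES:
            supported.append(path)
        else:
            skipped.append(path)  # e.g. .sto files, not handled in this sketch
    return supported, skipped
```

Returning the skipped files lets the caller log them, which documents the .sto limitation at run time as well as in the docstring.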
Adds support for expanding the downloaded and flattened RODA alignments to explicit duplicates for both the standard alignment dir and the alignment DB formats.