Duplicate expansion support #419

ljarosch · 2024-03-20T04:09:20Z

Adds support for expanding the downloaded and flattened RODA alignments to explicit duplicates for both the standard alignment dir and the alignment DB formats.

This makes it more straightforward to create an alignment database directly from the flattened RODA downloads.

This also adds duplicate chain expansion for the alignment dir format. The new script can be run on the flattened, non-redundant RODA alignments to add an explicit directory for every duplicate chain listed in the duplicate_chains file, each symlinked to the alignment directory of its representative chain.
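The symlink expansion described above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function name expand_duplicates is hypothetical, and the layout of the duplicate_chains file (one whitespace-separated group of chain IDs per line, representative first) is an assumption.

```python
from pathlib import Path


def expand_duplicates(alignment_dir: Path, duplicate_chains_file: Path) -> int:
    """Symlink a directory for every duplicate chain to its representative.

    Assumed file layout (not confirmed by this PR): each line holds
    whitespace-separated chain IDs, with the representative chain first.
    Returns the number of symlinks created.
    """
    n_created = 0
    for line in duplicate_chains_file.read_text().splitlines():
        chain_ids = line.split()
        if len(chain_ids) < 2:
            continue  # blank line, or a chain without duplicates
        representative, *duplicates = chain_ids
        if not (alignment_dir / representative).is_dir():
            continue  # no alignments exist for this representative
        for duplicate in duplicates:
            link = alignment_dir / duplicate
            if link.exists() or link.is_symlink():
                continue  # already present, so reruns are harmless
            # A relative target keeps the tree valid if it is moved as a whole.
            link.symlink_to(representative, target_is_directory=True)
            n_created += 1
    return n_created
```

Using relative symlink targets is a design choice of this sketch: the expanded directory can then be copied or mounted elsewhere without breaking the links.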

jnwei

Overall looks good. Two minor suggestions about types for you to consider inline below.

jnwei · 2024-03-21T09:00:50Z

scripts/alignment_db_scripts/create_alignment_db_sharded.py

    """Iterate over a list in chunks of size chunk_size."""
    for i in range(0, len(lst), chunk_size):
        yield lst[i : i + chunk_size]


-def read_chain_dir(chain_dir) -> dict:
+def read_chain_dir(chain_dir: Path) -> dict:


If you want, you can also parameterize the dict return type; for this specific example, you could write dict[str, Tuple[str, bytes]].

Such annotations can help the reader know what to expect. For this specific function, though, I don't think they are as necessary or helpful.

jnwei · 2024-03-21T09:12:41Z

scripts/alignment_db_scripts/create_alignment_db_sharded.py

@@ -83,7 +84,7 @@ def create_index_default_dict() -> dict:


 def create_shard(
-    shard_files: List[Path], output_dir: Path, output_name: str, shard_num: int
+    shard_files: list[Path], output_dir: Path, output_name: str, shard_num: int
 ) -> dict:


Perhaps having a type alias for the index dict could be helpful.

If the index file structure is also used in the main library code, I would consider adding the alias to the main openfold library. This can also be done in a later PR.
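A type alias along these lines might look like the sketch below. The names ChainIndexEntry, AlignmentIndex and merge_indices are hypothetical, and the assumed index structure (a "db" shard name plus a list of (file name, start offset, size) entries per chain) should be checked against the actual index files before adopting it.

```python
from typing import TypedDict


class ChainIndexEntry(TypedDict):
    """Index entry for one chain (assumed structure, see the note above)."""

    db: str  # name of the shard file that holds this chain's alignments
    files: list[tuple[str, int, int]]  # (file name, start offset, size in bytes)


# Alias for the full index: chain name -> entry.
AlignmentIndex = dict[str, ChainIndexEntry]


def merge_indices(shard_indices: list[AlignmentIndex]) -> AlignmentIndex:
    """Combine per-shard indices into one, rejecting duplicate chain names."""
    merged: AlignmentIndex = {}
    for shard_index in shard_indices:
        overlap = merged.keys() & shard_index.keys()
        if overlap:
            raise ValueError(f"Chains present in multiple shards: {sorted(overlap)}")
        merged.update(shard_index)
    return merged
```

With the alias in place, signatures such as create_shard(...) -> AlignmentIndex would document the return value more precisely than a bare dict.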

The previous data_dir_to_fasta.py script is very slow because it fully reparses the mmCIF files. The new script is much faster because it takes the sequence information from the alignment data instead. Note that the output will not include chains for which alignments could not be generated, but those cannot be used during training anyway.
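To illustrate why reading sequences from the alignment data is cheap, here is a minimal sketch that relies on the first record of an A3M file being the query sequence. The function names are hypothetical and this is not the code from alignment_data_to_fasta.py; it also assumes that the chain directory name is the chain ID.

```python
from pathlib import Path
from typing import Optional


def query_sequence_from_a3m(a3m_text: str) -> Optional[str]:
    """Return the sequence of the first record in an A3M string, if any."""
    seq_lines: list[str] = []
    seen_header = False
    for raw_line in a3m_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if seen_header:
                break  # second header reached: the query record is complete
            seen_header = True
        elif seen_header:
            seq_lines.append(line)
    return "".join(seq_lines) or None


def chain_dir_to_fasta(chain_dir: Path) -> Optional[str]:
    """Build a FASTA record for a chain from any A3M file in its directory."""
    for a3m_file in sorted(chain_dir.glob("*.a3m")):
        sequence = query_sequence_from_a3m(a3m_file.read_text())
        if sequence:
            # Assumption: the directory name doubles as the chain ID.
            return f">{chain_dir.name}\n{sequence}\n"
    return None  # no usable alignment, so the chain is skipped
```

Returning None for directories without a usable alignment mirrors the caveat above: such chains simply do not appear in the output.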

jnwei · 2024-05-09T10:08:50Z

scripts/expand_alignment_duplicates.py

@@ -0,0 +1,79 @@
+"""


Nit: For module-level docstrings, it can be helpful to start with a one-sentence summary of the module's purpose so the reader can quickly understand what the module is for.

Not sure how this will play with the __doc__ call you use in some of these scripts though. Feel free to keep the single paragraph version if you think that is nicer for the __doc__ call.
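One way the two goals could coexist is sketched below: the docstring opens with a one-sentence summary, and argparse receives the full text with a formatter that preserves its line breaks. The docstring wording and the helper names summary_line and build_parser are illustrative assumptions, not taken from the scripts in this PR.

```python
"""Expand representative alignments to all of their duplicate chains.

Further paragraphs can describe inputs, outputs and caveats in more
detail without getting in the way of the summary line above.
"""
import argparse
from typing import Optional


def summary_line(doc: Optional[str]) -> str:
    """Return the first non-empty line of a docstring ('' if there is none)."""
    for line in (doc or "").splitlines():
        if line.strip():
            return line.strip()
    return ""


def build_parser() -> argparse.ArgumentParser:
    """Create a CLI parser whose help text reuses the module docstring."""
    return argparse.ArgumentParser(
        description=__doc__,
        # Keep the docstring's paragraph breaks instead of rewrapping them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
```

If only the short summary should appear in the help output, summary_line(__doc__) could be passed as the description instead.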

jnwei · 2024-05-09T10:23:03Z

scripts/alignment_data_to_fasta.py

+    Generates a FASTA string from a chain directory.
+    """
+    # take some alignment file
+    for alignment_file_type in [


Nit: consider making the expected alignment_file_types a module-level constant, especially since they are used in multiple functions.

Does this mean that this code would not support .sto alignment files? I don't think we need support for them now, but it would be good to mention this somewhere.
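A shared constant could look roughly like this. The file names listed are an assumption about what the chain directories contain, and find_alignment_file and has_supported_alignment are hypothetical helpers used only to show two functions sharing the constant.

```python
from pathlib import Path
from typing import Optional

# Assumed alignment file names, checked in this order. Only A3M files are
# listed, so Stockholm (.sto) alignments would not be picked up.
ALIGNMENT_FILE_TYPES: tuple[str, ...] = (
    "mgnify_hits.a3m",
    "uniref90_hits.a3m",
    "bfd_uniclust_hits.a3m",
)


def find_alignment_file(chain_dir: Path) -> Optional[Path]:
    """Return the first supported alignment file in chain_dir, if present."""
    for file_name in ALIGNMENT_FILE_TYPES:
        candidate = chain_dir / file_name
        if candidate.is_file():
            return candidate
    return None


def has_supported_alignment(chain_dir: Path) -> bool:
    """Tell whether chain_dir holds any supported alignment file."""
    return find_alignment_file(chain_dir) is not None
```

Keeping the list in one place also gives a natural spot for the comment about unsupported .sto files.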

ljarosch added 4 commits March 19, 2024 15:07

Improve type hints and formatting

77860bb

Add default shard number

e678050

Add duplicate chain file support to alignment DB script

ee0c5db

This makes it more straightforward to create an alignment database directly from the flattened RODA downloads

ljarosch requested review from christinaflo and jnwei March 20, 2024 04:09

jnwei reviewed Mar 21, 2024

ljarosch added 5 commits May 5, 2024 23:48

Give script more descriptive name

0b5c949

Slightly improve comment

244970b

Set CLI description to more informative module docstring

78b9706

Improve import formatting

04410d5

jnwei approved these changes May 9, 2024

jnwei merged commit 29b5823 into setup-improvements May 13, 2024
2 checks passed

jnwei deleted the setup-improvements_additional-scripts branch May 13, 2024 07:57
