merge_cutup_clustering.py truncates contig name when three zeros follow period #311

Open
ccgallen opened this issue Apr 12, 2022 · 2 comments

ccgallen commented Apr 12, 2022

Hello, I am using CONCOCT 1.0.0 and recently discovered odd behavior when the file clustering_gt1000.csv is processed with merge_cutup_clustering.py to generate clustering_gt1000_merged.csv.

Examples of lines in clustering_gt1000.csv are as follows:
<contig_name>,<cluster_id>

NODE_1_length_2595161_cov_8.709327.37,104
NODE_1_length_2595161_cov_8.709327.38,65
NODE_1_length_2595161_cov_8.709327.39,104
NODE_114750_length_1831_cov_0.514671,38
NODE_231037_length_1147_cov_1.000980,144

Longer contigs have been split into fragments, and each fragment is assigned to a cluster (the value after the comma). For those, a ".\d+" suffix is added to the original contig name to identify the fragment (first three lines). The last two contigs were not split into fragments and do not have the extra ".\d+". So far so good.

After processing with merge_cutup_clustering.py, each contig is assigned a single cluster. The odd part appears when the name has a period followed by three or more 0s. In that case, the name is truncated at the period, while the other names remain as they should. Here are the results from my example:

NODE_1_length_2595161_cov_8.709327,104
NODE_114750_length_1831_cov_0.514671,38
NODE_231037_length_1147_cov_1,144 (and not NODE_231037_length_1147_cov_1.000980,144)

This is messing up downstream analysis because the contig names in the .fasta file no longer match my cluster assignments. Any idea why this might be happening? I have attached the merge_cutup_clustering.py code that I have installed below.

Thanks!!

#!/data/ccallen/miniconda/envs/metawrap-env/bin/python
"""
With contigs cutup with cut_up_fasta.py as input, sees to that the consequtive
parts of the original contigs are merged.

prints result to stdout.

@author: alneberg
"""
from __future__ import print_function
import sys
import os
import argparse
from collections import defaultdict, Counter

def original_contig_name_special(s):
    n = s.split(".")[-1]
    try:
        int(n)
    except:
        return s, 0
    # Only small integers are likely to be 
    # indicating a cutup part.
    if int(n) < 1000:

        return ".".join(s.split(".")[:-1]), int(n)
    else:
        # A large n indicates that the integer
        # was part of the original contig
        return s, 0

def main(args):
    all_seqs = {}
    all_originals = defaultdict(dict)
    first = True
    with open(args.cutup_clustering_result, 'r') as ifh:
        for line in ifh:
            if first:
                first=False
                continue
            line = line.strip()
            contig_id, cluster_id = line.split(',')
            original_contig_name, part_id = original_contig_name_special(contig_id)
        
            all_originals[original_contig_name][part_id] = cluster_id

    merged_contigs_stack = []
    
    sys.stdout.write("contig_id,cluster_id\n")
    for original_contig_id, part_ids_d in all_originals.items():
        if len(part_ids_d) > 1:
            c = Counter(part_ids_d.values())
            cluster_id = c.most_common(1)[0][0]
            c_string = [(a,b) for a, b in c.items()]
            if len(c.values()) > 1:
                sys.stderr.write("{}\t{}, chosen: {}\n".format(original_contig_id, c_string, cluster_id))
            else:
                sys.stderr.write("{}\t{}\n".format(original_contig_id, c_string))
        else:
            cluster_id = list(part_ids_d.values())[0]

        sys.stdout.write("{},{}\n".format(original_contig_id, cluster_id))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("cutup_clustering_result", help=("Input cutup clustering result."))
    args = parser.parse_args()

    main(args)
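
Reading only the script quoted above, a likely explanation is that original_contig_name_special takes the text after the last period and treats it as a cut-up part index whenever it parses as an integer below 1000. For NODE_231037_length_1147_cov_1.000980 that text is "000980", which parses to 980, so the coverage decimals are stripped as if they were a part number. The sketch below reproduces this with a copy of the quoted logic and shows one possible stricter check. The name original_contig_name_strict and the rule "a part index has no leading zeros" are assumptions for illustration, based on the unpadded suffixes (.37, .38, .39) in the example; this is not the maintainers' fix.

```python
import re


def original_contig_name_quoted(s):
    """Same logic as the function quoted above (bare except narrowed)."""
    n = s.split(".")[-1]
    try:
        part = int(n)
    except ValueError:
        return s, 0
    if part < 1000:
        # "000980" parses to 980, so it is mistaken for a part index.
        return ".".join(s.split(".")[:-1]), part
    return s, 0


# Assumption: part indices are written without zero padding, so a
# suffix counts as a part index only if it is "0" or 1-3 digits with
# no leading zero.
_PART_SUFFIX = re.compile(r"^(.*)\.(0|[1-9][0-9]{0,2})$")


def original_contig_name_strict(s):
    """Hypothetical variant that rejects zero-padded suffixes."""
    m = _PART_SUFFIX.match(s)
    if m:
        return m.group(1), int(m.group(2))
    return s, 0


if __name__ == "__main__":
    names = [
        "NODE_1_length_2595161_cov_8.709327.37",
        "NODE_114750_length_1831_cov_0.514671",
        "NODE_231037_length_1147_cov_1.000980",
    ]
    for name in names:
        print(name)
        print("  quoted:", original_contig_name_quoted(name))
        print("  strict:", original_contig_name_strict(name))
```

With the strict variant, the zero-padded coverage value is left in place, while the ".37" fragment suffix is still removed. It would still misread an uncut contig whose original name ends in a period followed by a short integer without leading zeros, so it narrows the problem rather than eliminating it.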
@INFINITY1993

This may give you a clue: #247

ccgallen commented May 1, 2022

Thank you @INFINITY1993 for the tip!
