merge_cutup_clustering.py truncates contig name when three zeros follow period #311

ccgallen · 2022-04-12T18:37:52Z

Hello, I am using CONCOCT 1.0.0 and recently noticed odd behavior when the file clustering_gt1000.csv is processed with merge_cutup_clustering.py to generate clustering_gt1000_merged.csv.

Examples of lines in clustering_gt1000.csv are as follows:
<contig_name>,<cluster_id>

NODE_1_length_2595161_cov_8.709327.37,104
NODE_1_length_2595161_cov_8.709327.38,65
NODE_1_length_2595161_cov_8.709327.39,104
NODE_114750_length_1831_cov_0.514671,38
NODE_231037_length_1147_cov_1.000980,144

Longer contigs have been split into fragments, and each fragment is assigned to a cluster (the value after the comma). For those, a ".\d+" suffix is appended to the original contig name to identify the fragment (first three lines). The last two contigs were not split into fragments and have no extra ".\d+" suffix. So far so good.

After processing with merge_cutup_clustering.py, each contig is assigned a single cluster. The odd part is what happens when the name contains a period followed by three or more 0s: the name is truncated at the period, while the other names are left as they should be. Here are the results from my example:

NODE_1_length_2595161_cov_8.709327,104
NODE_114750_length_1831_cov_0.514671,38
NODE_231037_length_1147_cov_1,144 (and not NODE_231037_length_1147_cov_1.000980,144)

This is messing up downstream analysis because the contig names in the .fasta file no longer match my cluster assignments. Any idea why this might be happening? I have included the merge_cutup_clustering.py code that I have installed below.

Thanks!!

#!/data/ccallen/miniconda/envs/metawrap-env/bin/python
"""
With contigs cutup with cut_up_fasta.py as input, sees to that the consequtive
parts of the original contigs are merged.

prints result to stdout.

@author: alneberg
"""
from __future__ import print_function
import sys
import os
import argparse
from collections import defaultdict, Counter

def original_contig_name_special(s):
    n = s.split(".")[-1]
    try:
        int(n)
    except:
        return s, 0
    # Only small integers are likely to be 
    # indicating a cutup part.
    if int(n) < 1000:

        return ".".join(s.split(".")[:-1]), int(n)
    else:
        # A large n indicates that the integer
        # was part of the original contig
        return s, 0

def main(args):
    all_seqs = {}
    all_originals = defaultdict(dict)
    first = True
    with open(args.cutup_clustering_result, 'r') as ifh:
        for line in ifh:
            if first:
                first=False
                continue
            line = line.strip()
            contig_id, cluster_id = line.split(',')
            original_contig_name, part_id = original_contig_name_special(contig_id)
        
            all_originals[original_contig_name][part_id] = cluster_id

    merged_contigs_stack = []
    
    sys.stdout.write("contig_id,cluster_id\n")
    for original_contig_id, part_ids_d in all_originals.items():
        if len(part_ids_d) > 1:
            c = Counter(part_ids_d.values())
            cluster_id = c.most_common(1)[0][0]
            c_string = [(a,b) for a, b in c.items()]
            if len(c.values()) > 1:
                sys.stderr.write("{}\t{}, chosen: {}\n".format(original_contig_id, c_string, cluster_id))
            else:
                sys.stderr.write("{}\t{}\n".format(original_contig_id, c_string))
        else:
            cluster_id = list(part_ids_d.values())[0]

        sys.stdout.write("{},{}\n".format(original_contig_id, cluster_id))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("cutup_clustering_result", help=("Input cutup clustering result."))
    args = parser.parse_args()

    main(args)
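
A minimal reproduction of the behavior described above, using a condensed copy of original_contig_name_special from the script (the only change is that the bare except is narrowed to ValueError, which does not affect the result). It suggests the truncation is not specific to zeros as such: any final period-separated field that parses as an integer below 1000 is treated as a cutup part number, and "000980" parses as 980.

```python
def original_contig_name_special(s):
    # Condensed copy of the function in the script above; logic unchanged.
    n = s.split(".")[-1]
    try:
        int(n)
    except ValueError:
        return s, 0
    # Suffixes with an integer value below 1000 are assumed to be part ids.
    if int(n) < 1000:
        return ".".join(s.split(".")[:-1]), int(n)
    return s, 0


examples = [
    "NODE_1_length_2595161_cov_8.709327.37",   # real cutup part
    "NODE_114750_length_1831_cov_0.514671",    # uncut, 514671 >= 1000
    "NODE_231037_length_1147_cov_1.000980",    # uncut, but int("000980") == 980
]

for name in examples:
    print(name, "->", original_contig_name_special(name))
```

The third name comes back as ("NODE_231037_length_1147_cov_1", 980), so the coverage decimals are mistaken for a part id, while the second name is returned unchanged because 514671 is not below 1000.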


INFINITY1993 · 2022-05-01T12:29:20Z

This may give you a clue: #247

ccgallen · 2022-05-01T13:59:47Z

Thank you @INFINITY1993 for the tip!
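
For anyone hitting the same problem, here is a rough sketch of one possible workaround. It is an assumption-laden illustration, not CONCOCT code, and whether it matches the approach discussed in #247 has not been checked. The helper name split_cutup_suffix and the original_names set are hypothetical; the idea is to collect the contig names from the original (uncut) .fasta file and strip a numeric suffix only when what remains is a known original name.

```python
def split_cutup_suffix(contig_id, original_names):
    """Return (original_name, part_id) for one clustering row.

    Hypothetical helper: original_names is assumed to be a set of contig
    names read from the uncut assembly FASTA.
    """
    # An id that is itself an original contig name was never cut up.
    if contig_id in original_names:
        return contig_id, 0
    prefix, sep, suffix = contig_id.rpartition(".")
    # Strip the suffix only if it is numeric and the prefix is a known contig.
    if sep and suffix.isascii() and suffix.isdigit() and prefix in original_names:
        return prefix, int(suffix)
    # Anything else is left untouched.
    return contig_id, 0


original_names = {
    "NODE_1_length_2595161_cov_8.709327",
    "NODE_114750_length_1831_cov_0.514671",
    "NODE_231037_length_1147_cov_1.000980",
}

for cid in [
    "NODE_1_length_2595161_cov_8.709327.37",
    "NODE_231037_length_1147_cov_1.000980",
]:
    print(cid, "->", split_cutup_suffix(cid, original_names))
```

A known limitation of this sketch: if the assembly contains both a contig named "X" and one named "X.5", a row "X.5" is always read as the uncut contig, so part 5 of "X" would be misattributed.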
