
add check to ignore genome(s) that cannot be downloaded #277

Open · bcpd opened this issue Apr 16, 2023 · 4 comments
bcpd commented Apr 16, 2023

Re-opening the issue below as a new issue. I am having the same issue. Help would be greatly appreciated.

I am still having issues with this error. I think that there are still some rules that need to incorporate a check to ignore genomes that cannot be downloaded.

These rules correctly ignore the missing genome specified in the YAML:

download_matching_genome
make_genbank_info_csv
bam_to_depth_wc
minimap_wc
samtools_mpileup_wc
samtools_count_wc
bam_to_fastq_wc

The first rule that creates an error is extract_leftover_reads_wc. I checked its code, and it seems that it uses the gather CSV file as input but does not check for the flagged genomes in the Python script subtract_gather.py:

    input:
        csv = f'{outdir}/gather/{{sample}}.gather.csv.gz',
        mapped = Checkpoint_GatherResults(f"{outdir}/mapping/{{sample}}.x.{{ident}}.mapped.fq.gz"),
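For illustration, a minimal sketch of the kind of filter that would be needed wherever idents are enumerated from the gather CSV (the gather_idents helper is hypothetical, not genome-grist's actual API):

    import csv
    import gzip

    def gather_idents(gather_csv, ignore_idents=()):
        """Yield accession idents from a gather CSV, skipping flagged genomes."""
        with gzip.open(gather_csv, "rt") as fp:
            for row in csv.DictReader(fp):
                acc = row["name"].split()[0]   # accession is the first token of 'name'
                if acc in ignore_idents:
                    continue
                yield acc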

These other rules also use that CSV as input:

make_gather_notebook_wc -> uses papermill and report-gather.ipynb
make_mapping_notebook_wc -> uses papermill and report-mapping.ipynb

A possible solution would be to pass the list of flagged genomes (IGNORE_IDENTS) to the Python script as an argument and skip them when it loads the list of genomes from the CSV (a sketch of the wiring follows the code below).

Line 29:

    with gzip.open(args.gather_csv, "rt") as fp:
        r = csv.DictReader(fp)
        for row in r:
            rows.append(row)
    print(f"...loaded {len(rows)} results total.")

    print("checking input/output pairs:")
    pairs = []
    fail = False
    for row in rows:
        acc = row["name"].split()[0]
        # proposed addition: skip genomes flagged in IGNORE_IDENTS
        if acc in IGNORE_IDENTS:
            print(f"ignoring {acc}")
            continue
        filename = f"{outdir}/mapping/{sample_id}.x.{acc}.mapped.fq.gz"
        overlapping = f"{outdir}/mapping/{sample_id}.x.{acc}.overlap.fq.gz"
        leftover = f"{outdir}/mapping/{sample_id}.x.{acc}.leftover.fq.gz"
        if not os.path.exists(filename):
            print(f"ERROR: input filename {filename} does not exist. Will exit.")
            fail = True
        pairs.append((acc, filename, overlapping, leftover))
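For the wiring, a hedged sketch of how the flagged genomes could be passed in on the command line (the --ignore-idents option is hypothetical; subtract_gather.py's actual CLI may differ):

    import argparse

    p = argparse.ArgumentParser()
    p.add_argument("gather_csv")
    # hypothetical flag carrying the config's IGNORE_IDENTS list
    p.add_argument("--ignore-idents", nargs="*", default=[],
                   help="accessions to skip when checking input/output pairs")
    args = p.parse_args()
    IGNORE_IDENTS = set(args.ignore_idents)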

I don't know enough about python notebooks to suggest a solution there.

Originally posted by @carden24 in #241 (comment)

ctb commented Apr 20, 2023

see also the same problem in two other pipelines 😭:

dib-lab/charcoal#235
dib-lab/2022-dominating-set-differential-abundance-example#8 (comment)

carden24 commented Apr 21, 2023

Apparently the error is solved if you delete the row with the problematic genome from the {output_folder}/gather/{sample}.gather.csv.gz. Just need to gunzip it, remove the row, and gzip it again.
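For reference, a minimal sketch of that edit in Python rather than by hand (the path and accession below are placeholders for your own output folder, sample, and problematic genome):

    import csv
    import gzip

    path = "outputs/gather/SAMPLE.gather.csv.gz"   # {output_folder}/gather/{sample}.gather.csv.gz
    bad_acc = "GCA_000000000.1"                    # placeholder problematic accession

    with gzip.open(path, "rt") as fp:
        r = csv.DictReader(fp)
        fieldnames = r.fieldnames
        rows = [row for row in r if not row["name"].startswith(bad_acc)]

    with gzip.open(path, "wt", newline="") as fp:
        w = csv.DictWriter(fp, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)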

ctb commented Apr 22, 2023

yep. however, that then has the effect of ignoring any other genome (or genomes) that would have been chosen in the absence of the problematic genome. e.g. if there's a specific E. coli genome that is no longer available, by removing it from the gather output, you are probably eliminating all the E. coli genomes.

the fix I have in mind (but need to find time to implement robustly, and test) would exclude the specific problematic genome from the search, while allowing other related genomes that are NOT problematic to be included.

I believe you could mimic that here by removing the problematic genome from the prefetch file, rather than the gather file.

carden24 commented May 3, 2023

I have come up with a solution to ignore the problematic genomes.

  1. I run grist asking for an intermediate target instead of the full pipeline. This creates the signatures and a list of prefetched genomes.

     genome-grist run {grist_config.yaml} gather_reads

  2. I run a script that parses the prefetch files, looks at the accession IDs listed there, and checks whether each genome is available at the FTP (see the sketch after this list). If a genome is not available, it removes the corresponding row from the prefetch files. repair_grist_gather_files.zip

     Usage:
     python repair_grist_gather_files --grist_output_folder <grist_folder>

  3. I run grist again with the full targets:

     genome-grist run grist_config.yaml summarize_gather summarize_mapping
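The zip itself isn't inlined above, but the per-accession availability check could look roughly like this: a sketch assuming the script probes the NCBI genomes FTP directory for each accession (function names are illustrative, not necessarily what the zipped script does):

    import urllib.error
    import urllib.request

    def genome_dir_url(acc):
        # e.g. GCA_000005845.2 -> https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/
        db, number = acc.split("_")
        digits = number.split(".")[0]
        return "https://ftp.ncbi.nlm.nih.gov/genomes/all/{}/{}/{}/{}/".format(
            db, digits[0:3], digits[3:6], digits[6:9])

    def is_available(acc, timeout=30):
        try:
            with urllib.request.urlopen(genome_dir_url(acc), timeout=timeout) as resp:
                listing = resp.read().decode("utf-8", errors="replace")
        except urllib.error.URLError:  # HTTPError (e.g. 404) is a subclass
            return False
        # the listing contains assembly dirs like GCA_000005845.2_ASM584v2/
        return acc in listing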

An advantage is that we do not need to specify the ignored genomes in the configuration; the pipeline will never run into this problem when downloading them. The process takes a few minutes, as it needs to check each prefetched genome separately.

My assumption is that if a genome is present in the prefetch list, there is likely another closely related genome also in the list, so even if we don't get the best match, we will have a decent match. This assumption probably holds better when using the full database rather than a dereplicated one.

I have used the same Python libraries that the grist scripts use, so there should not be any major compatibility issues.
