
add check to ignore genome(s) that cannot be downloaded #277

Open · bcpd opened this issue Apr 16, 2023 · 4 comments
bcpd commented Apr 16, 2023

Re-opening the issue below as a new issue. I am having the same issue. Help would be greatly appreciated.

I am still having issues with this error. I think that there are still some rules that need to incorporate a check to ignore genomes that cannot be downloaded.

These rules correctly ignore the missing genome specified in the YAML:

download_matching_genome
make_genbank_info_csv
bam_to_depth_wc
minimap_wc
samtools_mpileup_wc
samtools_count_wc
bam_to_fastq_wc

The first rule that creates an error is extract_leftover_reads_wc. I checked its code, and it seems that it uses the gather CSV file as input but does not check for the flagged genomes in the Python script subtract_gather.py:

    input:
        csv = f'{outdir}/gather/{{sample}}.gather.csv.gz',
        mapped = Checkpoint_GatherResults(f"{outdir}/mapping/{{sample}}.x.{{ident}}.mapped.fq.gz"),
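For illustration, a minimal sketch of the kind of filter that would be needed wherever idents are enumerated from the gather CSV (the gather_idents helper is hypothetical, not genome-grist's actual API):

    import csv
    import gzip

    def gather_idents(gather_csv, ignore_idents=()):
        """Yield accession idents from a gather CSV, skipping flagged genomes."""
        with gzip.open(gather_csv, "rt") as fp:
            for row in csv.DictReader(fp):
                acc = row["name"].split()[0]   # accession is the first token of 'name'
                if acc in ignore_idents:
                    continue
                yield acc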

These other rules also use that CSV as input:

make_gather_notebook_wc -> uses papermill and report-gather.ipynb
make_mapping_notebook_wc -> uses papermill and report-mapping.ipynb

A possible solution would be to pass the list of flagged genomes (IGNORE_IDENTS) to the Python script as an argument and skip them when it loads the list of genomes from the CSV (a sketch of the wiring follows the code below).

Line 29:

    with gzip.open(args.gather_csv, "rt") as fp:
        r = csv.DictReader(fp)
        for row in r:
            rows.append(row)
    print(f"...loaded {len(rows)} results total.")

    print("checking input/output pairs:")
    pairs = []
    fail = False
    for row in rows:
        acc = row["name"].split()[0]
        # proposed addition: skip genomes flagged in IGNORE_IDENTS
        if acc in IGNORE_IDENTS:
            print(f"ignoring {acc}")
            continue
        filename = f"{outdir}/mapping/{sample_id}.x.{acc}.mapped.fq.gz"
        overlapping = f"{outdir}/mapping/{sample_id}.x.{acc}.overlap.fq.gz"
        leftover = f"{outdir}/mapping/{sample_id}.x.{acc}.leftover.fq.gz"
        if not os.path.exists(filename):
            print(f"ERROR: input filename {filename} does not exist. Will exit.")
            fail = True
        pairs.append((acc, filename, overlapping, leftover))
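For the wiring, a hedged sketch of how the flagged genomes could be passed in on the command line (the --ignore-idents option is hypothetical; subtract_gather.py's actual CLI may differ):

    import argparse

    p = argparse.ArgumentParser()
    p.add_argument("gather_csv")
    # hypothetical flag carrying the config's IGNORE_IDENTS list
    p.add_argument("--ignore-idents", nargs="*", default=[],
                   help="accessions to skip when checking input/output pairs")
    args = p.parse_args()
    IGNORE_IDENTS = set(args.ignore_idents)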

I don't know enough about python notebooks to suggest a solution there.

Originally posted by @carden24 in #241 (comment)

ctb commented Apr 20, 2023

see also the same problem in two other pipelines 😭:

dib-lab/charcoal#235
dib-lab/2022-dominating-set-differential-abundance-example#8 (comment)

carden24 commented Apr 21, 2023

Apparently the error is solved if you delete the row with the problematic genome from the {output_folder}/gather/{sample}.gather.csv.gz. Just need to gunzip it, remove the row, and gzip it again.
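For reference, a minimal sketch of that edit in Python rather than by hand (the path and accession below are placeholders for your own output folder, sample, and problematic genome):

    import csv
    import gzip

    path = "outputs/gather/SAMPLE.gather.csv.gz"   # {output_folder}/gather/{sample}.gather.csv.gz
    bad_acc = "GCA_000000000.1"                    # placeholder problematic accession

    with gzip.open(path, "rt") as fp:
        r = csv.DictReader(fp)
        fieldnames = r.fieldnames
        rows = [row for row in r if not row["name"].startswith(bad_acc)]

    with gzip.open(path, "wt", newline="") as fp:
        w = csv.DictWriter(fp, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)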

ctb commented Apr 22, 2023

yep. however, that then has the effect of ignoring any other genome (or genomes) that would have been chosen in the absence of the problematic genome. e.g. if there's a specific E. coli genome that is no longer available, by removing it from the gather output, you are probably eliminating all the E. coli genomes.

the fix I have in mind (but need to find time to implement robustly, and test) would exclude the specific problematic genome from the search, while allowing other related genomes that are NOT problematic to be included.

I believe you could mimic that here by removing the problematic genome from the prefetch file, rather than the gather file.

carden24 commented May 3, 2023

I have come up with a solution to ignore the problematic genomes.

  1. I run grist asking for an intermediate target instead of the full pipeline. This creates the signatures and a list of prefetched genomes.

     genome-grist run {grist_config.yaml} gather_reads

  2. I run a script that parses the prefetch files, looks at the accession IDs listed there, and checks whether each genome is available at the FTP (see the sketch after this list). If a genome is not available, it removes the corresponding row from the prefetch files. repair_grist_gather_files.zip

     Usage:
     python repair_grist_gather_files --grist_output_folder <grist_folder>

  3. I run grist again with the full targets:

     genome-grist run grist_config.yaml summarize_gather summarize_mapping
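The zip itself isn't inlined above, but the per-accession availability check could look roughly like this: a sketch assuming the script probes the NCBI genomes FTP directory for each accession (function names are illustrative, not necessarily what the zipped script does):

    import urllib.error
    import urllib.request

    def genome_dir_url(acc):
        # e.g. GCA_000005845.2 -> https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/
        db, number = acc.split("_")
        digits = number.split(".")[0]
        return "https://ftp.ncbi.nlm.nih.gov/genomes/all/{}/{}/{}/{}/".format(
            db, digits[0:3], digits[3:6], digits[6:9])

    def is_available(acc, timeout=30):
        try:
            with urllib.request.urlopen(genome_dir_url(acc), timeout=timeout) as resp:
                listing = resp.read().decode("utf-8", errors="replace")
        except urllib.error.URLError:  # HTTPError (e.g. 404) is a subclass
            return False
        # the listing contains assembly dirs like GCA_000005845.2_ASM584v2/
        return acc in listing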

An advantage is that we do not need to specify the ignored genomes in the configuration; the pipeline will never run into this problem when downloading them. The process takes a few minutes, as it needs to check each prefetched genome separately.

My assumption is that if a genome is present in the prefetch list, there is likely another closely related genome also in the list, so even if we don't get the best match, we will have a decent match. This assumption probably holds better when using the full database rather than a dereplicated one.

I have used the same Python libraries that the grist scripts use, so there should not be any major compatibility issues.
