Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running with tutorial dataset giving all empty file #119

Open
wasimbt opened this issue Jan 21, 2022 · 6 comments
Open

Running with tutorial dataset giving all empty file #119

wasimbt opened this issue Jan 21, 2022 · 6 comments

Comments

@wasimbt
Copy link

wasimbt commented Jan 21, 2022

Hi,

I have tried to run the tutorial dataset (SARS-CoV-2) with the command mentioned on the webpage,
https://cbg-ethz.github.io/V-pipe/tutorial/sars-cov2/

However, after running successfully the whole script, the generated output files are all empty. Could anyone help me find out where is the problem?

Is it happeining in prinseq? Below is the output of prinseq;


The length cutoff is: 200
[prinseq-lite-0.20.4] [01/21/2022 23:24:18] Executing PRINSEQ with command: "perl prinseq-lite.pl -fastq samples/SRR10903401/20200102/extracted_data/R1.fastq -fastq2 samples/SRR10903401/20200102/extracted_data/R2.fastq -out_format 3 -trim_qual_right 30 -trim_qual_left 30 -trim_qual_type min -trim_qual_rule lt -trim_qual_window 10 -trim_qual_step 1 -log samples/SRR10903401/20200102/preprocessed_data/prinseq.out.log -min_len 200 -min_qual_mean 30 -ns_max_n 4 -out_good samples/SRR10903401/20200102/preprocessed_data/R -out_bad null"
[prinseq-lite-0.20.4] [01/21/2022 23:24:18] Parsing and processing input data: "samples/SRR10903401/20200102/extracted_data/R1.fastq" and "samples/SRR10903401/20200102/extracted_data/R2.fastq"
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Done parsing and processing input data
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input sequences (file 1): 476,632
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input bases (file 1): 71,758,102
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 1): 150.55
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input sequences (file 2): 476,632
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input bases (file 2): 71,807,572
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 2): 150.66
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (pairs): 0
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (singletons file 1): 0 (0.00%)
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Good sequences (singletons file 2): 0 (0.00%)
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad sequences (file 1): 476,632 (100.00%)
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad bases (file 1): 71,758,102
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad mean length (file 1): 150.55
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Bad sequences (file 2): 0 (0.00%)
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Sequences filtered by specified parameters:
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] trim_qual_left: 180134
[prinseq-lite-0.20.4] [01/21/2022 23:25:05] min_len: 773130

Please let me know,

Best regards,
Wasim

@DrYak
Copy link
Member

DrYak commented Jan 21, 2022

Could anyone help me find out where is the problem?

You skipped this part of the the tutorial:

As it is your first run of V-pipe, this will also generate the sample collection table. Check samples.tsv in your editor.

Note that the demo files you downloaded have reads of length 150 only. V-pipe’s default parameters are optimized for reads of > length 250 ; add the third column in the tab-separated file:

SRR10903401	20200102	150
SRR10903402	20200102	150

For more information, in the latest master branch of V-pipe, you can look in file V-pipe/config/config.html, see section "input", parameter "read_length".

Is it happeining in prinseq? Below is the output of prinseq;

Yes.
The relevant bits are:

The length cutoff is: 200

[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 1): 150.55

[prinseq-lite-0.20.4] [01/21/2022 23:25:05] Input mean length (file 2): 150.66

Because there's no length specified (e.g.: in the third column as suggested in the tutorial), the cut-off value is computed from the default length (based on 250bp) and set to ≥200bp.

The read's length (150bp) is shorter than this cut-off, every single read is considered "too short" and filtered out. There are no remaining reads after quality filtering.

Future version of V-pipe will be able to illustrate the stage at which reads were rejected (we currently have custom scripts that haven't been incorporated into V-pipe's master yet).

@wasimbt
Copy link
Author

wasimbt commented Jan 25, 2022

Hi,

Thanks a lot for pointing it out! I somehow missed the information in the tutorial.

Now I have run the tutorial dataset with success. However, when I tried to run my sewage samples dataset (with just two samples), it keep running without an error message and I had to stop it after a few hours. It seems like there is a problem during alignment. Do you think that the pipeline will work for read length of 101 bp ? and where could be the problem according to you?

Best regards,
Wasim

@DrYak
Copy link
Member

DrYak commented Jan 25, 2022

regarding alignments

Check if the previous step (running prinseq in rule preprocessing) has successfully completed and you successfully got the /preprocessed_data/R1.fastq.gz and R2.fastq.gz files.

Which aligner are you using?

Also if you have extremely deep coverage as happens e.g. with Illumina NovaSeq, due to the sheer amount of reads the alignment can take some time, and the SNV calling with ShoRAH can be extremely slow (we are aware of these shortcomings, but it will take quite some time to fix. In the meantime you can fall back to LoFreq as an SNV caller).

regarding the read-length

Also, although we have run V-pipe successfully even on 50bp or 75bp read-length (some of the labs that sequence clinical swab samples do shotgun sequencing), you do need to keep in mind with sewage sequencing is that read-length matters: source: doi:10.1101/2021.05.26.21257861

Other groups have indeed shown that the optimal amplicon length for wastewater seem to be around ~400bp as done by, e.g., the Artic protocols (our monitoring used V3 in the past and has now switched to V4 which has better performance on Omicron and plans to switch to V4.1 as soon as it is available in commercial ready-made kits).

With the much shorter amplicons (e.g. like Nimagen), you might have fewer opportunities for cooccurrences of mutations. (and with much longer amplicons than Artic (e.g. long reads) there might be fewer fragments as large that manage to survive through the wastewater all the way to the sequencing).

This is important to keep in mind if you would want to later run our tool cojac on the output to look for mutation cooccurrences. (note: as of the time being, check its dev branch as we haven't released version 0.2 yet).

Looking for individual mutations:

  • simply counting per position-base counts (either using samtools, or the output of V-pipe in alignment/basecnt.tsv.gz)
  • or analysing the output of a full-fledged SNV caller (the VCF produces by ShoRAH, lofreq, etc.)
    is less affected by smaller fragment (though internally ShoRAH also uses cooccurrences to rise the confidence in its per-window local haplotypes)

For all our wastewater sequencing, we have used Artic protocols that produce ~400bp amplicon and sequenced those with paired-end 250bp sequencing in order to cover well each whole amplicon.

@DrYak
Copy link
Member

DrYak commented Jan 26, 2022

For more information, in the latest master branch of V-pipe, you can look in file V-pipe/config/config.html, see section "input", parameter "read_length".

Because there's no length specified (e.g.: in the third column as suggested in the tutorial), the cut-off value is computed from the default length (based on 250bp) and set to ≥200bp.

The read's length (150bp) is shorter than this cut-off, every single read is considered "too short" and filtered out. There are no remaining reads after quality filtering.

Hi, we have updated the config's readme section about the samples TSV to make this more clear.

@wasimbt
Copy link
Author

wasimbt commented Jan 27, 2022

Hi,

Many thanks for your answers and suggestions!
I am using bwa following the SARS-CoV-2 tutorial. Now I could run the whole snakemake script on a sewage sample. The prinseq runs fine (please see the output file attached) however, there is an error during alignment (please see the attached shorah output file). Coverage.TSV file showed more zeros than other digits at positions.
Any suggestion/s on how I can improve the coverage/alignment?

Best regards,
Wasim

prinseq.out.log
shorah.out.log
stdout.log

@DrYak
Copy link
Member

DrYak commented Jan 27, 2022

First, making sure that the alignment step (bwa) went well.

In the output results/{sample}/{date}/alignments (or if you have set your output directory to emulate old V-pipe 1.x/2.x/sars-cov2 behaviour samples/{sample}/{date}/alignments):

  • check the content of REF_aln_stats.yaml: it contains useful statistics collected after alignement:
    • the mean and median coverage
    • the number of reads aligned
    • the size of the region covered by read pairs (insert_mean, corresponds to amplicons or to shotgun fragment depending on the wetlab library preparation method)
  • check the content of coverage.tsv.gz
    (warning, this is 0-based like all python software and the old central coverage.tsv file of V-pipe, it is not 1-based like all genomic standards and the output of samtools)
    • this will give an idea of the per-position coverage in your sample.
    • ideally there should be coverage (>100 reads) at most positions even if it is uneven.

If you only have one region that is highly covered (i.e.: only a few fragment of the virus reached all the way to sequencing) there might by some wet-lab problems (in our experience, the RNA extraction and concentration is a very difficult task. Your lab might want to discuss with our coauthors from Eawag (Swiss Federal Institute of Aquatic Science and Technology) on our paper. Some labs we work with have reported success using the RNA extraction kits produced by Promega).


As soon as you have some good aligner reads, you might want to give a try to our tool cojac.

Use the dev branch as we haven't released a package 0.2 on bioconda yet.


Now regarding SNV calling:
what is the exact wet-lab protocol used?

  • shotgun? (there are a lot of fragment with a random distribution of their position along the genome)
  • tiled-amplicon? (all the reads begin at fixed position that corresponds to the amplicon from the multiplex-PCR).

If it's the later case: there is a corner case in ShoRAH that handles poorly when the local-windows for the local-haplotypes aren't aligned to the boundaries of amplicons, leading to a lot of the reads being lost.
The shorter the amplicon, the more this problem is exacerbated.
We are aware of that limitation and there's undergoing work by a master student to address it.
In the meantime I would suggest using lofreq instead (in your configuration file, section general -> property snv_caller) and disabling local haplotypes (in your configuration file, section output -> property local). You can find out more about these options in the file config/config.html in your local installation of V-pipe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants