samtools index: failed to create index #1
Comments
I'll take a look. Can you send me a piece of (or the entire) contigs
fasta? That part of the log looks like the bam handling, specifically the
samtools sort command failing. But if there is no index, that would all fail
as well.
Logan
|
Thank you for looking into this, I am eager to see if your scripts can help our project!
I guess I should mention the following files get produced:
run1.sample_100.aligned.bam
run1.allelotype.stats
contig.str.probes.fasta
and a folder "contig.str.lobSTRindex" with BWA/lobSTR index files
here's a chunk of the contigs fasta:
Block2 TC:2:209:213
ACTAAAATGTAGCAACTGTGAGCCTTTTCCAGATGGAGCCACAAACAACCTCTCTAGATCTAATTCAGATGAGAGTTATTTCTCTGAAAAACGGAGAGTGTCCACTATGTACCATCCCGAAGGAGAATCCAGCACAGCCCCCTTTTTTTCTACTGATTCATCTCTGAATTTGCCTGTCCTAGAAGTAGGCAAAACTGAAAACCCTACATTCTCTTCAACTACACTTCCCAGACCTGGGGACCCTGGGGCTCCTCCTTTGCCCCCGGACTTGCAGCTAGACGAAGAAACTTGTGGA
Block3 GA:2:55:59
GGACAACTACTTGGCCTTCTTCAACTGGAGCAGCCTGACCCTCCTGCCCCGGCTGGAGAGCCTGGACCTGGCGGGGAACCAGCTGAAGGCCCTGACCAACGGCAGCCTCCCCGCGGACAGCAGGCTCCAGAGGCTG
Block4 GA:5:31:41
TATTGACTTCAGAGCAGAAGGGAGAGGGAGAGAGAGAGAGAAACATCAGTGCTGAGAGAGAATCACGGATCAGCTGCCTCCTGCACACCCTTTACTGGGGATGTGCCCGCAACCAAGG
Block5 TA:2:146:150
CCCGTGAGCTGCCGGTCTCCTCCACCTTCTGCTTGCAAATAGGCAAGTTCAGGATGCACCAAAAGTCCGGGATTATAGCCCCTAATGCATGCATCTGTCATTCGTGCTTCCAGTGTTTCAAATACTTTTTTCTTTGTCACATGAAGTATACCTAGGTTTGCAAAGCTGGATAAATCAAAAACAACCAAAGGTTAGGCAGTCAATGACTGGAATATATGGTTTCACTTGAGCACAGAGAATTAAAACACACACACACAGGTCCTCAATCACTGGGGCCCACACCATAGTTAAACACATTAGTATTTCTGCAGAAATA
.....
Block15748 AT:3:108:114
GGGCAAAAGTAGGTTTACAGTTGTAATACAAATAATACAATAATACATAATAATATAATAATAAGATGTGCTGCACATGCTCACAACTGTAAACCTACATTTACTCACATATATTTAAAGAAGTACAGAAATAATACACCAAATGATATTGATTATTAGTATTCCTAGGTTTTTTCATGTTTTTCTTTTTTAATTGCCTGAATCACTTATAATAAACACATTGTTTTTATAATTAAGATCGGAAGAGCACACGTCTGAACTCAGTCACAGCGATAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAATAACAAAAGAAAATAAAAAATAACAGATAAAAAAAAACACAAC
Block15749 GC:2:65:69
GCCGGCTGTGTGAGGGGCAGTGGGCACACATGCAACTCGCTCACTCTCTCCCTGGAAAATCCAGGGCGCACCTCGGCTCCTGCAACATGCAGAATTAGAAATCGAACCGTTTATTGTCTCCTAACTTTTCTCT
Block15750 CCG:2:86:92
CAGCACCGCCGCTGACAAGTGGGGCTTCGGTGCCACCCTCCTGGAGATCTGCTTCGATGGGGAGGCCCCCCTGCAGGACCGCAGCCCCGCCGAGGTACATGTGGGTGACCCGTGGGCCTTCTCACAAAGGGGCCAGCACCTCCGAGGGGTCGAGCGGTCTGGGTCCGGAGCTGCCCCCTCTGGCCTGTGGCCTCACATATGTCGCCTCA
|
Great, thanks. What are file sizes in the contig.str.lobSTRindex directory?
I see right offhand that the fasta headers are missing ">", which is
probably a problem, although I don't have the code in front of me. I'll
check into it now.
|
BWA_ref.fasta 1.0 MB
BWA_ref.fasta.amb 12.7 kB
BWA_ref.fasta.ann 16.3 kB
BWA_ref.fasta.bwt 1.0 MB
BWA_ref.fasta.pac 258.3 kB
BWA_ref.fasta.sa 516.6 kB
lobSTR_chromsizes.tab 7.0 kB
lobSTR_mergedref.bed 16.2 kB
lobSTR_mergedref.targets.bed 10.0 kB
lobSTR_ref.fasta 1.0 MB
lobSTR_ref_map.tab 14.3 kB
strinfo.tab 28.8 kB
|
Thanks, and your bam files have significant numbers of reads mapped? Are
you running this pipeline on shotgun data with medium coverage?
Yes, I'm not totally sure how you got index files formed if your fastas are
missing the ">" character. Were they not included in the output from
BaitSTR? I would add those back and start from the beginning:
awk '{if ($0 ~ /Block/) {$0 = ">" $0} print $0}' oldfile.fa > newfile.fa
But also, there is an issue with samtools sort: the syntax in the script
appears to be deprecated. I've fixed it in the attached, so that should
solve that problem. Let me know what happens running it with this version.
Also, something you should know about using BaitSTR that we have learned
through experience since writing the paper: we suggested PCR-free libraries
could help with STR stutter, but it turns out you SHOULD NOT do a capture
of any kind on PCR-free libraries; it really hurts the capture efficiency.
Better to use amplified libraries and correct the stutter with our
follow-up approach: https://www.ncbi.nlm.nih.gov/pubmed/29931218
Logan
|
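Before adding ">" characters and rebuilding the index, it may be worth checking whether the contigs file actually lacks them. The sketch below is one hedged way to do that. The sample file (`sample.fa`), the output name (`fixed.fa`), and the assumption that every record name starts with `Block` are illustrative, taken only from the excerpt pasted in this thread.

```shell
#!/bin/sh
set -eu

# Hypothetical two-record sample in the same shape as the pasted excerpt
# (record names without the leading ">").
workdir=$(mktemp -d)
printf 'Block2 TC:2:209:213\nACTAAAATGTAG\nBlock3 GA:2:55:59\nGGACAACTACTT\n' > "$workdir/sample.fa"

# Count header lines; zero means the ">" characters are missing.
# grep -c exits non-zero when nothing matches, hence "|| true".
headers_before=$(grep -c '^>' "$workdir/sample.fa" || true)

# Add ">" only to lines that begin with "Block". Lines that already start
# with ">" do not match ^Block, so an intact file passes through unchanged.
awk '/^Block/ { $0 = ">" $0 } { print }' "$workdir/sample.fa" > "$workdir/fixed.fa"

headers_after=$(grep -c '^>' "$workdir/fixed.fa" || true)
echo "headers before: $headers_before, after: $headers_after"
```

The pattern is anchored with `^` here, unlike the one-liner in the reply, so that running the fix on a file that already has headers does not produce ">>Block..." lines.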
Hi Logan,
I just noticed the fasta file does have those headers. Something went wrong when I pasted them into this forum, sorry.
Yes, there are significant numbers of reads mapped. I'm actually running this pipeline on capture-data. We had previously designed probes to target microsatellite regions (and other specific genes of interest) and I have been having trouble creating genotypes from capture data (amplicon sequencing is rather straight-forward when you have known primers and filler-regions). However, in this case we used probes from a species with a genome to target regions in a species without a target genome. So I thought I'd give your pipeline a go. Our capture libraries followed the Roche EZ-developer protocols so the libraries were sheared, indexed, pooled, captured, and amplified, if I recall correctly.
I don't see the attached script?
Thank you!
|
Try this:
https://www.dropbox.com/s/zl6er7f68r26uer/BaitSTR_type.14042020.pl?dl=0
Ah makes sense! In that case your coverage might be quite high? Did you
adjust the kmer coverage parameters in BaitSTR accordingly? Also if you
actually have all your starting sequences and don't need to assemble them,
you can also bypass BaitSTR and align directly to those, then go into
lobSTR allelotype directly, just a thought.
Logan
|
Hi Logan,
Thanks for the advice, I am looking into those possibilities as well. Sorry to be a bother, but here's the tail end of the output that resulted in another error using the script you provided. Any advice would be appreciated!
[main] CMD: bwa mem -aM -R @RG\tID:lobSTR;sample_S100;lib_S100\tLB:lib_S100\tSM:sample_S100 contig.str.lobSTRindex/BWA_ref.fasta /media/user/fastq/100_R1.fq /media/user/fastq/100_R2.fq
CMD: /home/user/workspace/SNP_caller/tools/lobSTR-bin-Linux-x86_64-4.0.6/bin/allelotype --command classify --strinfo contig.str.lobSTRindex/strinfo.tab --out run1 --index-prefix contig.str.lobSTRindex/lobSTR_ --regions contig.str.lobSTRindex/lobSTR_mergedref.targets.bed --realign --filter-clipped --min-read-end-match 10 --filter-mapq0 --max-repeats-in-ends 3 --no-rmdup --noise_model run1.noisetmp --bam run1.sample_S100.aligned.bam
[allelotype-4.0.6] 2020-04-15.16:57:31 ProgressMeter: Getting run info
|
Hello,
I have followed the BaitSTR workflow to create 'contigs.str.fa' for one of my datasets. I am now trying to use BaitSTR_type and it looks to be getting hung up on creating an index. I think that it is failing to sort the bam file and then the index fails.
Here is the command I used:
perl ../baitSTR_type/BaitSTR_type.pl --index --index_prefix contig.str --stem run1 --mem --full --target ./contigs.str.fa --path_to_lobSTR ~/workspace/SNP_caller/tools/lobSTR-bin-Linux-x86_64-4.0.6/bin --r1 /media/user/fastq/reads.r1.100.fastq,100 --r2 /media/user/fastq/reads.r2.100.fastq,100
Any thoughts would be appreciated! Here's the last of the log information:
.....
[bam_rmdup_core] processing reference Block204...
[bam_rmdup_core] 7 / 58501 = 0.0001 in library 'lib_100'
[bam_rmdupse_core] 13 / 5566 = 0.0023 in library 'lib_100'
[W::bam_merge_core2] No @HD tag found.
[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-n Sort by read name
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE Write final output to FILE rather than standard output
-T PREFIX Write temporary files to PREFIX.nnnn.bam
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-n Sort by read name
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE Write final output to FILE rather than standard output
-T PREFIX Write temporary files to PREFIX.nnnn.bam
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
[E::hts_idx_push] Chromosome blocks not continuous
samtools index: failed to create index for "run1.sample_100.aligned.bam"
CMD: /home/user/workspace/SNP_caller/tools/lobSTR-bin-Linux-x86_64-4.0.6/bin/allelotype --command classify --strinfo contig.str.lobSTRindex/strinfo.tab --out run1 --index-prefix contig.str.lobSTRindex/lobSTR_ --regions contig.str.lobSTRindex/lobSTR_mergedref.targets.bed --realign --filter-clipped --min-read-end-match 10 --filter-mapq0 --max-repeats-in-ends 3 --no-rmdup --noise_model run1.noisetmp --bam run1.sample_100.aligned.bam
[allelotype-4.0.6] 2020-04-14.12:34:25 ProgressMeter: Getting run info
[allelotype-4.0.6] 2020-04-14.12:34:25 ERROR: Could not open index files
[allelotype-4.0.6] 2020-04-14.12:34:25 ProgressMeter: Outputting run statistics
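The repeated usage text in the log above is what newer samtools releases (1.3 onward, as far as I know) print when `sort` is called in the older `samtools sort in.bam out.prefix` form. No sorted file gets written, so `samtools index` then sees an unsorted BAM and stops with "Chromosome blocks not continuous". A hedged sketch of the current form follows. The input name `run1.sample_100.merged.bam` is hypothetical, since the log does not show the intermediate file name, and this is not necessarily how BaitSTR_type.pl builds its command.

```shell
#!/bin/sh
set -eu

# Hypothetical names: the unsorted input is an assumption; the output matches
# the BAM that allelotype is asked to open in the log.
in_bam="run1.sample_100.merged.bam"
out_bam="run1.sample_100.aligned.bam"

# Prefix for temporary files, derived from the output name.
tmp_prefix="${out_bam%.bam}.tmp"

# Print the commands first so they can be inspected before anything runs.
# Current syntax: -o names the final file, -T the prefix for temporary files.
echo "samtools sort -T $tmp_prefix -o $out_bam $in_bam"
echo "samtools index $out_bam"

# Run only when samtools and the input file are actually present.
if command -v samtools >/dev/null 2>&1 && [ -f "$in_bam" ]; then
    samtools sort -T "$tmp_prefix" -o "$out_bam" "$in_bam"
    samtools index "$out_bam"
fi
```

Once the sorted BAM and its index exist, the allelotype step should at least get past "Could not open index files".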