Skip to content

Commit 0873233

Browse files
janetxinlilcoombe
andauthored
Prepare for Release Version 1.2.2 (#43)
* Allow genome size to be given in scientific notation * Update genome size parameter info * Bump up version numbers * Don't require genome_size argument * Calculate dist using only read lengths > min_size * Use 2 * cut length as lower limit for dist * Add time commands to long-to-linked and minimap2 steps for tigmint-long * use 1000 as lower bound for dist * create fastq file for cut reads * simplify get_genome_size * create argument for dist read length percentile * v1.2.2 * update pipeline flow chart * change string formatting * change expected long-to-linked outputs to fastq * new expected output for long-to-linked * Use p50 for dist calculation * Update tests to use lower-bounded read length p50 as auto dist * Don't upgrade pip installations * Use python3.8 and add upgrade command back * Fix python paths * Update docstring Co-authored-by: Lauren Coombe <lauren.e.coombe@gmail.com>
1 parent ab2a39b commit 0873233

14 files changed

+136
-119
lines changed

README.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -118,10 +118,10 @@ tigmint-make metrics draft=myassembly reads=myreads ref=GRCh38 G=3088269832
118118
```
119119
***
120120

121-
To run Tigmint with long reads in fasta or fastq format (`myreads.fa.gz` or `myreads.fq.gz`) on the draft assembly `myassembly.fa`:
121+
To run Tigmint with long reads in fasta or fastq format (`myreads.fa.gz` or `myreads.fq.gz`) on the draft assembly `myassembly.fa` for an organism with a genome size of gsize:
122122

123123
```sh
124-
tigmint-make tigmint-long draft=myassembly reads=myreads span=auto G=genomesize dist=auto
124+
tigmint-make tigmint-long draft=myassembly reads=myreads span=auto G=gsize dist=auto
125125
```
126126

127127
- `minimap2 map-ont` is used to align long reads from the Oxford Nanopore Technologies (ONT) platform, which is the default input for Tigmint. To use PacBio long reads specify the parameter `longmap=pb`
@@ -145,7 +145,7 @@ tigmint-make tigmint-long draft=myassembly reads=myreads span=auto G=genomesize
145145

146146
+ `draft`: Name of the draft assembly, `myassembly.fa`
147147
+ `reads`: Name of the reads, `myreads.fq.gz`
148-
+ `G`: Haploid genome size of the draft assembly organism. Used to calculate `span` parameter automatically
148+
+ `G`: Haploid genome size of the draft assembly organism. Used to calculate `span` parameter automatically. Can be given as an integer or in scientific notation (e.g. '3e9' for human)
149149
+ `span=20`: Number of spanning molecules threshold. Set `span=auto` to automatically select span parameter (currently only recommended for `tigmint-long`)
150150
+ `cut=500`: Cut length for long reads (`tigmint-long` only)
151151
+ `longmap=ont`: Long read platform; `ont` for Oxford Nanopore Technologies (ONT) long reads, `pb` for PacBio long reads (`tigmint-long` only)

azure-pipelines.yml

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -10,8 +10,8 @@ jobs:
1010
versionSpec: "3.8"
1111
architecture: x64
1212
- script: |
13-
sudo HOMEBREW_NO_AUTO_UPDATE=1 /home/linuxbrew/.linuxbrew/bin/brew install python
14-
sudo /home/linuxbrew/.linuxbrew/bin/pip3 install --upgrade setuptools \
13+
sudo HOMEBREW_NO_AUTO_UPDATE=1 /home/linuxbrew/.linuxbrew/bin/brew install python@3.8
14+
sudo /home/linuxbrew/.linuxbrew/opt/python@3.8/bin/pip3 install --upgrade setuptools \
1515
-U pip --no-cache-dir \
1616
pylint .
1717
displayName: Install Python packages
@@ -23,14 +23,14 @@ jobs:
2323
cd ../
2424
displayName: Run pylint
2525
- script: |
26-
export PATH=/home/linuxbrew/.linuxbrew/bin:$PATH
27-
echo path:
28-
echo $PATH
29-
sudo /home/linuxbrew/.linuxbrew/bin/pip3 install pytest
26+
sudo /home/linuxbrew/.linuxbrew/opt/python@3.8/bin/pip3 install pytest
3027
sudo HOMEBREW_NO_AUTOUPDATE=1 /home/linuxbrew/.linuxbrew/bin/brew tap brewsci/bio
3128
sudo HOMEBREW_NO_AUTOUPDATE=1 /home/linuxbrew/.linuxbrew/bin/brew install bedtools bwa samtools gnu-time minimap2
32-
cd tests/
29+
export PATH=/home/linuxbrew/.linuxbrew/bin:$PATH
30+
export PATH=/home/linuxbrew/.linuxbrew/opt/python@3.8/bin:$PATH
31+
echo $PATH
32+
cd tests
3333
pytest -vs tigmint_test.py
3434
env:
3535
HOMEBREW_NO_AUTO_UPDATE: 1
36-
displayName: Run pytests
36+
displayName: Run pytests

bin/long-to-linked

Lines changed: 38 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
#!/usr/bin/env python3
22
"""
3-
Cut long reads and assign barcodes to the subreads. Optionally calculate span parameter for tigmint-cut.
3+
Cuts long reads and assigns barcodes to the subreads, and optionally calculates tigmint-long span and dist parameters.
44
Usage: gunzip -c reads.fa.gz (or reads.fq.gz) | python3 long-to-linked -l length -m min_size -s -g genome_size | gzip > reads.cutlength.fa.gz
55
@author: Janet Li
66
"""
@@ -22,44 +22,58 @@ def cut_reads(length, min_size, auto_span, genome_size, param_outfile, auto_dist
2222
"""
2323
# Span automatically calculated as 1/4 of sequence coverage
2424
cov_to_span = 0.25
25-
# Dist automatically calculated as p5 of read length
26-
dist_read_perc = 5
25+
# Lower bound for dist
26+
dist_lower_bound = 1000
27+
# Dist parameter read percentile
28+
dist_perc = 50
29+
2730
total_bases = 0
2831
if auto_dist:
2932
read_lengths = []
3033
with open("/dev/stdin", 'rt') as reads:
3134
barcode = 0
3235
if not reads.isatty(): # Check that reads are being piped in from stdin
33-
for header, seq, _, _ in read_fasta(reads):
36+
for header, seq, _, qual in read_fasta(reads):
3437
read_length = len(seq)
3538
total_bases += read_length
36-
if auto_dist:
39+
if auto_dist and read_length >= dist_lower_bound:
3740
read_lengths.append(read_length)
3841
if read_length >= min_size:
3942
split_reads = split_long_read(seq, length)
43+
if qual:
44+
split_quals = split_long_read(qual, length)
45+
else:
46+
split_quals = split_long_read(read_length * "#", length)
4047
for i, read in enumerate(split_reads):
41-
print(">" + header + "_" + str(i) + " BX:Z:" + str(barcode), file=sys.stdout)
42-
print(read, file=sys.stdout)
48+
outheader = "@{0}_{1} BX:Z:{2}".format(header, i, barcode)
49+
print(outheader, read, "+", split_quals[i], sep="\n", file=sys.stdout)
4350
barcode += 1
4451
if auto_span or auto_dist:
4552
with open(param_outfile, "w") as param_file:
4653
if auto_span:
4754
span = int(total_bases / genome_size * cov_to_span)
48-
print("span\t%s" % str(span), file=param_file)
55+
print("span\t{}".format(span), file=param_file)
4956
if auto_dist:
50-
dist = int(np.percentile(read_lengths, dist_read_perc) // 1)
51-
print("read_p%i\t%i" % (dist_read_perc, dist), file=param_file)
52-
57+
dist = int(np.percentile(read_lengths, dist_perc))
58+
print("read_p%i\t%i" % (dist_perc, dist), file=param_file)
5359
else:
5460
print("long-to-linked: error: long reads must be piped from stdin", file=sys.stderr)
5561
sys.exit(1)
5662

63+
def get_genome_size(size_string):
64+
"""Read genome size from a string."""
65+
try:
66+
return float(size_string)
67+
except ValueError:
68+
print("long-to-linked: error: improper format for genome size", file=sys.stderr)
69+
sys.exit(1)
70+
5771
def main():
5872
"""Parse arguments and cut long reads."""
5973
parser = argparse.ArgumentParser(description="Segment long reads into short reads and calculate tigmint-long parameters.")
6074
parser.add_argument("-v", "--version",
6175
action="version",
62-
version="long-to-linked 1.2.1")
76+
version="long-to-linked 1.2.2")
6377
parser.add_argument("-l", "--length",
6478
type=int,
6579
default=500,
@@ -75,12 +89,12 @@ def main():
7589
parser.add_argument("-d", "--auto_dist",
7690
default=False,
7791
action="store_true",
78-
help="Option to calculate read length p5 for dist parameter")
92+
help="Option to calculate read length p5 for dist parameter")
7993
parser.add_argument("-g", "--genome_size",
80-
type=int,
81-
default=-1,
82-
help="Genome size (bp) for calculating sequence coverage and \
83-
minimum spanning molecules parameter automatically")
94+
type=str,
95+
default=None,
96+
help="Genome size for calculating sequence coverage and \
97+
minimum spanning molecules parameter (e.g. 3e9)")
8498
parser.add_argument("-o", "--param_outfile",
8599
type=str,
86100
default="tigmint-long.params.tsv",
@@ -89,11 +103,14 @@ def main():
89103
args = parser.parse_args()
90104

91105
if args.auto_span:
92-
if args.genome_size == -1:
93-
raise ValueError("Genome size (bp) must be provided to calculate span parameter.")
106+
if args.genome_size is None:
107+
print("long-to-linked: error: genome size must be given to calculate span",
108+
file=sys.stderr)
109+
sys.exit(1)
110+
args.genome_size = get_genome_size(args.genome_size)
94111

95-
cut_reads(args.length, args.min_size, args.auto_span,
96-
args.genome_size, args.param_outfile, args.auto_dist)
112+
cut_reads(args.length, args.min_size, args.auto_span, args.genome_size,
113+
args.param_outfile, args.auto_dist)
97114

98115
if __name__ == "__main__":
99116
main()

bin/tigmint-cut

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -241,7 +241,7 @@ def get_span(filename):
241241

242242
def main():
243243
parser = argparse.ArgumentParser(description="Find misassembled regions in assembly using Chromium molecule extents")
244-
parser.add_argument("--version", action="version", version="tigmint-cut 1.2.1")
244+
parser.add_argument("--version", action="version", version="tigmint-cut 1.2.2")
245245
parser.add_argument("fasta", type=str, help="Reference genome fasta file (must have FAI index generated)")
246246
parser.add_argument("bed", type=str, help="Sorted bed file of molecule extents")
247247
parser.add_argument("-o", "--fastaout", type=str, help="The output FASTA file.", required=True)

bin/tigmint-make

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -123,7 +123,7 @@ help:
123123
@echo 'For more information see https://bcgsc.github.io/tigmint/'
124124

125125
version:
126-
@echo "Tigmint 1.2.1"
126+
@echo "Tigmint 1.2.2"
127127
@echo "Written by Shaun Jackman @sjackman."
128128

129129
all: tigmint arcs
@@ -194,21 +194,21 @@ $(draft).%.sortbx.bam: %.fq.gz $(draft).fa.bwt
194194

195195
# Align cut long reads to the draft genome and output primary alignments sorted by BX tag.
196196
$(draft).%.cut$(cut).sortbx.bam: %.cut$(cut).fa.gz $(draft).fa
197-
minimap2 -y -t$t -ax map-$(longmap) --secondary=no $(draft).fa $< | samtools view -b -u -F4 | samtools sort -@$t -tBX -T$$(mktemp -u -t $@.XXXXXX) -o $@
197+
$(gtime) minimap2 -y -t$t -ax map-$(longmap) --secondary=no $(draft).fa $< | samtools view -b -u -F4 | samtools sort -@$t -tBX -T$$(mktemp -u -t $@.XXXXXX) -o $@
198198

199199
# Segment long reads from gzipped fasta file, optionally calculating tigmint-long parameters.
200-
$(reads).cut$(cut).fa.gz: $(longreads)
200+
$(reads).cut$(cut).fq.gz: $(longreads)
201201
ifeq ($(span), auto)
202202
ifeq ($(dist), auto)
203-
$(gzip) -dc $< | $(bin)/long-to-linked -l$(cut) -m$(minsize) -g$(G) -s -d -o $(reads).tigmint-long.params.tsv | $(gzip) > $@
203+
$(gtime) $(gzip) -dc $< | $(bin)/long-to-linked -l$(cut) -m$(minsize) -g$(G) -s -d -o $(reads).tigmint-long.params.tsv | $(gzip) > $@
204204
else
205-
$(gzip) -dc $< | $(bin)/long-to-linked -l$(cut) -m$(minsize) -g$(G) -s -o $(reads).tigmint-long.params.tsv | $(gzip) > $@
205+
$(gtime) $(gzip) -dc $< | $(bin)/long-to-linked -l$(cut) -m$(minsize) -g$(G) -s -o $(reads).tigmint-long.params.tsv | $(gzip) > $@
206206
endif
207207
else
208208
ifeq ($(dist), auto)
209-
$(gzip) -dc $< | $(bin)/long-to-linked -l$(cut) -m$(minsize) -d -o $(reads).tigmint-long.params.tsv | $(gzip) > $@
209+
$(gtime) $(gzip) -dc $< | $(bin)/long-to-linked -l$(cut) -m$(minsize) -d -o $(reads).tigmint-long.params.tsv | $(gzip) > $@
210210
else
211-
$(gzip) -dc $< | $(bin)/long-to-linked -l$(cut) -m$(minsize) | $(gzip) > $@
211+
$(gtime) $(gzip) -dc $< | $(bin)/long-to-linked -l$(cut) -m$(minsize) | $(gzip) > $@
212212
endif
213213
endif
214214

@@ -223,7 +223,7 @@ endif
223223
$(gtime) $(bin)/tigmint-molecule -a$(as) -n$(nm) -q$(mapq) -d$(dist) -s$(minsize) $< | sort -k1,1 -k2,2n -k3,3n >$@
224224

225225
# Create molecule extents BED using cut long reads
226-
$(draft).$(reads).cut$(cut).as$(as).nm$(cut).molecule.size$(minsize).bed: $(reads).cut$(cut).fa.gz $(draft).fa
226+
$(draft).$(reads).cut$(cut).as$(as).nm$(cut).molecule.size$(minsize).bed: $(reads).cut$(cut).fq.gz $(draft).fa
227227
ifeq ($(dist), auto)
228228
$(gtime) minimap2 -y -t$t -as map-$(longmap) --secondary=no $(draft).fa $< | samtools view -b -u -F4 | samtools sort -@$t -tBX -T$$(mktemp -u -t $@.XXXXXX) | \
229229
$(bin)/tigmint-molecule -a$(as) -n$(cut) -q$(mapq) -s$(minsize) -p $(reads).tigmint-long.params.tsv - | sort -k1,1 -k2,2n -k3,3n > $@

bin/tigmint-molecule

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -243,7 +243,7 @@ class MolecIdentifier:
243243
"Read a SAM/BAM file and output a TSV file. "
244244
"The SAM/BAM file must be sorted by BX tag and then by position.")
245245
parser.add_argument(
246-
'--version', action='version', version='tigmint-molecule 1.2.1')
246+
'--version', action='version', version='tigmint-molecule 1.2.2')
247247
parser.add_argument(
248248
metavar="BAM", dest="in_bam_filename",
249249
help="Input BAM file sorted by BX tag then position, - for stdin")

pipeline.gv

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ digraph {
1616
subgraph {
1717
node [width=5]
1818

19-
cut [label="long-to-linked\nSegment long reads (FASTA)"]
19+
cut [label="long-to-linked\nSegment long reads (FASTQ)"]
2020
minimap2 [label="Minimap2\nAlign segmented reads to the draft assembly (BAM)"]
2121

2222
}

pipeline.gv.png

337 KB
Loading

0 commit comments

Comments
 (0)