Skip to content

Commit

Permalink
V0.6.3 (#19)
Browse files Browse the repository at this point in the history
* modify scripts to include log file option

* fix parsing and output

* fix issues with logging

* fixed output

* remove -c from scripts, update README

* little fixes

* upversion pip, show compatibility with python 2.7
  • Loading branch information
kescobo committed Mar 16, 2017
1 parent 079be8c commit ed76758
Show file tree
Hide file tree
Showing 14 changed files with 545 additions and 64 deletions.
14 changes: 7 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
[![DOI](https://zenodo.org/badge/22309/kescobo/kvasir.svg)](https://zenodo.org/badge/latestdoi/22309/kescobo/kvasir) [![Join the chat at https://gitter.im/kescobo/kvasir](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/kescobo/kvasir?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

Dependencies:
* Python 3+
* Python 2.7 or 3.6
* MongoDB
* pymongo
* BioPython
Expand Down Expand Up @@ -89,15 +89,15 @@ To make the next part work you need one more thing - the [BLAST+](https://www.nc
Assuming this is working (you've still got `mongod` running right?), it's time to start blasting. There are a couple of commands in the `blast.py` script. First, you've gotta make a blast database with all of the the genes in your database using `makedb`. This command will make 3 files, in the current directory by default, or in the directory you specify with `-b`

```
# python3 bin/kv_blast.py cheese1 -c makedb -b ~/blastdbs
# kv_blast.py cheese1 makedb -b ~/blastdbs
INFO:root:making BLAST database with protein coding sequences from cheese1 at /Users/ksb/blastdbs
INFO:root:BLAST db created at /Users/ksb/blastdbs
```

Next, blastall will go through each species in your database and blast it against that database. Each hit will then be stored in your MongoDB with some metadata.

```sh
$ python3 bin/kv_blast.py cheese1 -c blastall -b ~/blastdbs
$ kv_blast.py cheese1 blastall -b ~/blastdbs
INFO:root:blasting Awesomeus speciesus strain A
INFO:root:Blasting all by all
INFO:root:Getting Blast Records
Expand Down Expand Up @@ -131,7 +131,7 @@ There are a few ways to approach this, and in kvasir they're unified under the i
If the genbank files you imported have a `SOURCE` line for the name of your organism, and it's formatted as `<genus> <species> <strain>`, you can use the script to calculate the distance based on Average Nucleotide Identity (ANI). This is calculated using a ruby script from [enveomics](https://github.com/lmrodriguezr/enveomics/blob/master/Scripts/ani.rb). Measuring ANI using this method is only appropriate for closely related genomes, and as it stands, the script grabs all species in your MongoDB and calls the ANI script on pairs that have the same genus.

```sh
$ kv_distance.py cheese1 -c ani
$ kv_distance.py cheese1 ani
```

This script pulls species from the MongoDB and then looks for all pairs of names to see if they're the same genus. It does that a bit naively, using `split()[0].split("_")[0]`, which will split the species name at spaces or `_`, and then take the first thing. So if you're looking at "Awesomeus speciesus strain A", this command will check if you have anything else with "Awesomeus" as the genus, like "Awesomeus_speciesus_strain_B", and will calculate the ANI between them and add them to a "species_distance" database in the MongoDB.
Expand All @@ -154,13 +154,13 @@ If ANI is not appropriate, or you have a different way to measure species distan
Assuming this is saved in a file `~/Desktop/ssu_dm.csv`, you can do:

```sh
$ kv_distance.py cheese1 -c distance_matrix -i ~/Desktop/ssu_dm.csv -t ssu
$ kv_distance.py cheese1 distance_matrix -i ~/Desktop/ssu_dm.csv -t ssu
```

The `-t` parameter is the "distance type", which can be anything you want. The ANI script above uses `-t ani` by default. If you want to get your ssu distance matrix out at a different time, use:

```sh
$ kv_distance.py cheese1 -c distance_matrix -o ~/Desktop/returned_ssu_dm.csv -t ssu
$ kv_distance.py cheese1 distance_matrix -o ~/Desktop/returned_ssu_dm.csv -t ssu
```

Be sure your distance is actually a distance, and is between 0:1. So if two species have a 16S gene that is 85% identical, the distance should be 0.15.
Expand All @@ -171,7 +171,7 @@ In the next section, the default is to not use the species distance parameter or
Now the fun stuff!

```sh
$ kv_analysis.py cheese1 -c groups -o ~/Desktop/groups.csv
$ kv_analysis.py cheese1 groups -o ~/Desktop/groups.csv
```

This could take some time, depending on the number of species and how much HGT there is.
Expand Down
39 changes: 27 additions & 12 deletions bin/kv_analysis.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,9 +10,9 @@

parser.add_argument("mongodb", help="The name of MongoDB database")

parser.add_argument("-c", "--command",
parser.add_argument("command",
help="which analysis command to run (groups, species_hits)",
choices=["groups"], required=True)
choices=["groups"])

parser.add_argument("-v", "--verbose", help="Display debug status messages", action="store_true")
parser.add_argument("-q", "--quiet", help="Suppress most output", action="store_true")
Expand All @@ -21,13 +21,13 @@
help="minimum identity for BLAST hits",
default="0.99", required=True)

parser.add_argument("-s", "spacer",
parser.add_argument("-s", "--spacer",
help="Maximum distance between genes to be considered in the same group",
default="5000")
parser.add_argument("-l", "length",
parser.add_argument("--length",
help="Minimum length (nt) for each protein coding gene",
default="500")
parser.add_argument("-d", "species-distance",
parser.add_argument("-d", "--species-distance",
help="Minimum distance between species (0-1)",
default=0)
parser.add_argument("-g", "--group-size",
Expand All @@ -38,21 +38,36 @@
default="ani")
parser.add_argument("-o", "--output",
help="File path for output (usable with distance_matrix)", default="./")
parser.add_argument("-l", "--log",
help="File path for log file")

parser.add_argument("--debug",
help="set logging to debug", action="store_true")

args = parser.parse_args()

if args.verbose:
logging.basicConfig(level=logging.DEBUG)
logpath = None
if args.log:
logpath = os.path.abspath(args.log)
if os.path.isdir(logpath):
logpath = os.path.join(logpath, "kvasir.log")

if args.debug:
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
elif args.verbose:
logging.basicConfig(level=logging.INFO, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
elif args.quiet:
logging.basicConfig(level=logging.WARNING)
logging.basicConfig(level=logging.ERROR, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
else:
logging.basicConfig(level=logging.INFO)
logging.basicConfig(level=logging.WARNING, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)



DB = pymongo.MongoClient()[args.mongodb]

if args.command == "groups":
g = hgt_groups(float(args.identity), DB, args.length, args.spacer,
args.species_distance, args.distance_type)
g = hgt_groups(float(args.identity), DB, int(args.length), int(args.spacer),
float(args.species_distance), args.distance_type)

if os.path.isdir(args.output):
out = os.path.join(args.output, "{}-{}-{}_groups.csv".format(
Expand All @@ -63,4 +78,4 @@
else:
out = os.path.abspath(args.output)

output_groups(g, out, db, args.group_size)
output_groups(g, out, DB, args.group_size)
26 changes: 19 additions & 7 deletions bin/kv_blast.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,27 +9,39 @@
from kvasir.run_blast import blast_all, parse_blast_results_xml
from tempfile import NamedTemporaryFile

logging.basicConfig(level=logging.INFO)

parser = argparse.ArgumentParser(description='Kvasir BLAST commands')

parser.add_argument("mongodb", help="The name of MongoDB database")
parser.add_argument("-b", "--blastpath", help="path to BLAST database", default="./")
parser.add_argument("-c", "--command", help="which blast command to run (makedb, blastall, blastone)",
choices=["makedb", "blastall", "blastone"], required=True)
parser.add_argument("command", help="which blast command to run (makedb, blastall, blastone)",
choices=["makedb", "blastall", "blastone"])
parser.add_argument("-f", "--force", help="Override errors (eg re-importing blast for same species)", action="store_true")

parser.add_argument("-v", "--verbose", help="Display debug status messages", action="store_true")
parser.add_argument("-q", "--quiet", help="Suppress most output", action="store_true")
parser.add_argument("--debug", help="set logging to debug", action="store_true")

parser.add_argument("-l", "--log",
help="File path for log file")

args = parser.parse_args()

if args.verbose:
logging.basicConfig(level=logging.DEBUG)
logpath = None
if args.log:
logpath = os.path.abspath(args.log)
if os.path.isdir(logpath):
logpath = os.path.join(logpath, "kvasir.log")

if args.debug:
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
elif args.verbose:
logging.basicConfig(level=logging.INFO, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
elif args.quiet:
logging.basicConfig(level=logging.WARNING)
logging.basicConfig(level=logging.ERROR, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
else:
logging.basicConfig(level=logging.INFO)
logging.basicConfig(level=logging.WARNING, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)


DB = pymongo.MongoClient()[args.mongodb]
BLASTPATH = os.path.abspath(args.blastpath)
Expand Down
28 changes: 20 additions & 8 deletions bin/kv_distance.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,13 +9,11 @@
from kvasir.mongo_import import mongo_import_distance, mongo_import_distance_matrix
from kvasir.distance import get_ani, get_distance_matrix

logging.basicConfig(level=logging.DEBUG)

parser = argparse.ArgumentParser(description='Kvasir Analysis commands')

parser.add_argument("mongodb", help="The name of MongoDB database")
parser.add_argument("-c", "--command", help="which analysis command to run (ani, distance_matrix)",
choices=["ani", "distance_matrix"], required=True)
parser.add_argument("command", help="which analysis command to run (ani, distance_matrix)",
choices=["ani", "distance_matrix"])

parser.add_argument("-o", "--output", help="File path for output (usable with distance_matrix)", default="./")
parser.add_argument("-i", "--input", help="File path for input (usable with distance_matrix)")
Expand All @@ -25,15 +23,29 @@

parser.add_argument("-v", "--verbose", help="Display debug status messages", action="store_true")
parser.add_argument("-q", "--quiet", help="Suppress most output", action="store_true")
parser.add_argument("--debug", help="set logging to debug", action="store_true")

parser.add_argument("-l", "--log",
help="File path for log file")

args = parser.parse_args()

if args.verbose:
logging.basicConfig(level=logging.DEBUG)

logpath = None
if args.log:
logpath = os.path.abspath(args.log)
if os.path.isdir(logpath):
logpath = os.path.join(logpath, "kvasir.log")

if args.debug:
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
elif args.verbose:
logging.basicConfig(level=logging.INFO, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
elif args.quiet:
logging.basicConfig(level=logging.WARNING)
logging.basicConfig(level=logging.ERROR, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
else:
logging.basicConfig(level=logging.INFO)
logging.basicConfig(level=logging.WARNING, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)


DB = pymongo.MongoClient()[args.mongodb]

Expand Down
27 changes: 19 additions & 8 deletions bin/kv_import.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,30 +6,41 @@
import logging
from kvasir.mongo_import import mongo_import_genbank

logging.basicConfig(level=logging.INFO)

parser = argparse.ArgumentParser(description='Import genbank files')

parser.add_argument("mongodb", help="The name of MongoDB database")
parser.add_argument("-i", "--input", help="File or directory to import", required=True)

parser.add_argument("-v", "--verbose", help="Display debug status messages", action="store_true")
parser.add_argument("-q", "--quiet", help="Suppress most output", action="store_true")
parser.add_argument("--debug", help="set logging to debug", action="store_true")

parser.add_argument("-l", "--log",
help="File path for log file")

args = parser.parse_args()

if args.verbose:
logging.basicConfig(level=logging.DEBUG)
logpath = None
if args.log:
logpath = os.path.abspath(args.log)
if os.path.isdir(logpath):
logpath = os.path.join(logpath, "kvasir.log")

if args.debug:
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
elif args.verbose:
logging.basicConfig(level=logging.INFO, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
elif args.quiet:
logging.basicConfig(level=logging.WARNING)
logging.basicConfig(level=logging.ERROR, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
else:
logging.basicConfig(level=logging.INFO)
logging.basicConfig(level=logging.WARNING, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)

INPUT = args.input
DB = pymongo.MongoClient()[args.mongodb]

success = 0

INPUT = os.path.abspath(args.input)
DB = pymongo.MongoClient()[args.mongodb]

if os.path.isdir(INPUT):
for f in os.listdir(INPUT):
# TODO: figure out way to throw an error if it looks like the same gene/genome is being imported
Expand Down
43 changes: 43 additions & 0 deletions kvasir/Analysis/ani.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
import os
import pandas as pd
from settings import MONGODB as db
import re
from subprocess import Popen, PIPE

species = db["genes"].distinct("species")
name_dict = {sub(r"[. -\(\)]", "_", x):x for x in species}

l = len(species)
data = {s:[0 for x in range(l)] for s in species}
dm = pd.DataFrame(data, index=species)
dm = pd.read_csv("/Users/KBLaptop/Desktop/ani_matrix.csv", index_col=0)

d = "/Users/KBLaptop/Desktop/ani"
for f1 in os.listdir(d):
for f2 in os.listdir(d):
if f1.endswith(".fna") and f2.endswith(".fna"):
s1 = name_dict[f1[:-4]]
s2 = name_dict[f2[:-4]]
if s1 == "Glutamicibacter arilaitensis Re117" or s2 == "Glutamicibacter arilaitensis Re117":
if s1.split()[0] == s2.split()[0]:
print(s1)
print(s2)
if s1 == s2:
dm.loc[s1, s2] = 100.0
pass
else:
dist = Popen(
['ruby', '/Users/KBLaptop/computation/Science/enveomics/Scripts/ani.rb',
'--seq1', os.path.join(d, f1),
'--seq2', os.path.join(d, f2),
'--auto', '--quiet'],
stdout=PIPE # xml output
).communicate()[0]

try:
print(float(dist.strip()))
dm.loc[s1, s2] = float(dist.strip())
except ValueError:
pass

dm.to_csv("/Users/KBLaptop/Desktop/ani_matrix.csv")

0 comments on commit ed76758

Please sign in to comment.