V0.6.3 (#19)

* modify scripts to include log file option * fix parsing and output * fix issues with logging * fixed output * remove -c from scripts, update README * little fixes * upversion pip, show compatibility with python 2.7
DuttonLab · Mar 16, 2017 · ed76758 · ed76758
1 parent 079be8c
commit ed76758
Show file tree

Hide file tree

Showing 14 changed files with 545 additions and 64 deletions.
diff --git a/README.md b/README.md
@@ -3,7 +3,7 @@
 [![DOI](https://zenodo.org/badge/22309/kescobo/kvasir.svg)](https://zenodo.org/badge/latestdoi/22309/kescobo/kvasir) [![Join the chat at https://gitter.im/kescobo/kvasir](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/kescobo/kvasir?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
 
 Dependencies:
-* Python 3+
+* Python 2.7 or 3.6
 * MongoDB
   * pymongo
 * BioPython
@@ -89,15 +89,15 @@ To make the next part work you need one more thing - the [BLAST+](https://www.nc
 Assuming this is working (you've still got `mongod` running right?), it's time to start blasting. There are a couple of commands in the `blast.py` script. First, you've gotta make a blast database with all of the the genes in your database using `makedb`. This command will make 3 files, in the current directory by default, or in the directory you specify with `-b`
 
 ```
-# python3 bin/kv_blast.py cheese1 -c makedb -b ~/blastdbs
+# kv_blast.py cheese1 makedb -b ~/blastdbs
 INFO:root:making BLAST database with protein coding sequences from cheese1 at /Users/ksb/blastdbs
 INFO:root:BLAST db created at /Users/ksb/blastdbs
 ```
 
 Next, blastall will go through each species in your database and blast it against that database. Each hit will then be stored in your MongoDB with some metadata.
 
 ```sh
-$ python3 bin/kv_blast.py cheese1 -c blastall -b ~/blastdbs
+$ kv_blast.py cheese1 blastall -b ~/blastdbs
 INFO:root:blasting Awesomeus speciesus strain A
 INFO:root:Blasting all by all
 INFO:root:Getting Blast Records
@@ -131,7 +131,7 @@ There are a few ways to approach this, and in kvasir they're unified under the i
 If the genbank files you imported have a `SOURCE` line for the name of your organism, and it's formatted as `<genus> <species> <strain>`, you can use the script to calculate the distance based on Average Nucleotide Identity (ANI). This is calculated using a ruby script from [enveomics](https://github.com/lmrodriguezr/enveomics/blob/master/Scripts/ani.rb). Measuring ANI using this method is only appropriate for closely related genomes, and as it stands, the script grabs all species in your MongoDB and calls the ANI script on pairs that have the same genus.
 
 ```sh
-$ kv_distance.py cheese1 -c ani
+$ kv_distance.py cheese1 ani
 ```
 
 This script pulls species from the MongoDB and then looks for all pairs of names to see if they're the same genus. It does that a bit naively, using `split()[0].split("_")[0]`, which will split the species name at spaces or `_`, and then take the first thing. So if you're looking at "Awesomeus speciesus strain A", this command will check if you have anything else with "Awesomeus" as the genus, like "Awesomeus_speciesus_strain_B", and will calculate the ANI between them and add them to a "species_distance" database in the MongoDB.
@@ -154,13 +154,13 @@ If ANI is not appropriate, or you have a different way to measure species distan
 Assuming this is saved in a file `~/Desktop/ssu_dm.csv`, you can do:
 
 ```sh
-$ kv_distance.py cheese1 -c distance_matrix -i ~/Desktop/ssu_dm.csv -t ssu
+$ kv_distance.py cheese1 distance_matrix -i ~/Desktop/ssu_dm.csv -t ssu
 ```
 
 The `-t` parameter is the "distance type", which can be anything you want. The ANI script above uses `-t ani` by default. If you want to get your ssu distance matrix out at a different time, use:
 
 ```sh
-$ kv_distance.py cheese1 -c distance_matrix -o ~/Desktop/returned_ssu_dm.csv -t ssu
+$ kv_distance.py cheese1 distance_matrix -o ~/Desktop/returned_ssu_dm.csv -t ssu
 ```
 
 Be sure your distance is actually a distance, and is between 0:1. So if two species have a 16S gene that is 85% identical, the distance should be 0.15.
@@ -171,7 +171,7 @@ In the next section, the default is to not use the species distance parameter or
 Now the fun stuff!
 
 ```sh
-$ kv_analysis.py cheese1 -c groups -o ~/Desktop/groups.csv
+$ kv_analysis.py cheese1 groups -o ~/Desktop/groups.csv
 ```
 
 This could take some time, depending on the number of species and how much HGT there is.

diff --git a/bin/kv_analysis.py b/bin/kv_analysis.py
@@ -10,9 +10,9 @@
 
 parser.add_argument("mongodb", help="The name of MongoDB database")
 
-parser.add_argument("-c", "--command",
+parser.add_argument("command",
     help="which analysis command to run (groups, species_hits)",
-    choices=["groups"], required=True)
+    choices=["groups"])
 
 parser.add_argument("-v", "--verbose", help="Display debug status messages", action="store_true")
 parser.add_argument("-q", "--quiet", help="Suppress most output", action="store_true")
@@ -21,13 +21,13 @@
     help="minimum identity for BLAST hits",
     default="0.99", required=True)
 
-parser.add_argument("-s", "spacer",
+parser.add_argument("-s", "--spacer",
     help="Maximum distance between genes to be considered in the same group",
     default="5000")
-parser.add_argument("-l", "length",
+parser.add_argument("--length",
     help="Minimum length (nt) for each protein coding gene",
     default="500")
-parser.add_argument("-d", "species-distance",
+parser.add_argument("-d", "--species-distance",
     help="Minimum distance between species (0-1)",
     default=0)
 parser.add_argument("-g", "--group-size",
@@ -38,21 +38,36 @@
     default="ani")
 parser.add_argument("-o", "--output",
     help="File path for output (usable with distance_matrix)", default="./")
+parser.add_argument("-l", "--log",
+    help="File path for log file")
+
+parser.add_argument("--debug",
+    help="set logging to debug", action="store_true")
 
 args = parser.parse_args()
 
-if args.verbose:
-    logging.basicConfig(level=logging.DEBUG)
+logpath = None
+if args.log:
+    logpath = os.path.abspath(args.log)
+    if os.path.isdir(logpath):
+        logpath = os.path.join(logpath, "kvasir.log")
+
+if args.debug:
+    logging.basicConfig(level=logging.DEBUG, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
+elif args.verbose:
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
 elif args.quiet:
-    logging.basicConfig(level=logging.WARNING)
+    logging.basicConfig(level=logging.ERROR, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
 else:
-    logging.basicConfig(level=logging.INFO)
+    logging.basicConfig(level=logging.WARNING, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
+
+
 
 DB = pymongo.MongoClient()[args.mongodb]
 
 if args.command == "groups":
-    g = hgt_groups(float(args.identity), DB, args.length, args.spacer,
-        args.species_distance, args.distance_type)
+    g = hgt_groups(float(args.identity), DB, int(args.length), int(args.spacer),
+        float(args.species_distance), args.distance_type)
 
     if os.path.isdir(args.output):
         out = os.path.join(args.output, "{}-{}-{}_groups.csv".format(
@@ -63,4 +78,4 @@
     else:
         out = os.path.abspath(args.output)
 
-    output_groups(g, out, db, args.group_size)
+    output_groups(g, out, DB, args.group_size)
diff --git a/bin/kv_blast.py b/bin/kv_blast.py
@@ -9,27 +9,39 @@
 from kvasir.run_blast import blast_all, parse_blast_results_xml
 from tempfile import NamedTemporaryFile
 
-logging.basicConfig(level=logging.INFO)
 
 parser = argparse.ArgumentParser(description='Kvasir BLAST commands')
 
 parser.add_argument("mongodb", help="The name of MongoDB database")
 parser.add_argument("-b", "--blastpath", help="path to BLAST database", default="./")
-parser.add_argument("-c", "--command", help="which blast command to run (makedb, blastall, blastone)",
-    choices=["makedb", "blastall", "blastone"], required=True)
+parser.add_argument("command", help="which blast command to run (makedb, blastall, blastone)",
+    choices=["makedb", "blastall", "blastone"])
 parser.add_argument("-f", "--force", help="Override errors (eg re-importing blast for same species)", action="store_true")
 
 parser.add_argument("-v", "--verbose", help="Display debug status messages", action="store_true")
 parser.add_argument("-q", "--quiet", help="Suppress most output", action="store_true")
+parser.add_argument("--debug", help="set logging to debug", action="store_true")
+
+parser.add_argument("-l", "--log",
+    help="File path for log file")
 
 args = parser.parse_args()
 
-if args.verbose:
-    logging.basicConfig(level=logging.DEBUG)
+logpath = None
+if args.log:
+    logpath = os.path.abspath(args.log)
+    if os.path.isdir(logpath):
+        logpath = os.path.join(logpath, "kvasir.log")
+
+if args.debug:
+    logging.basicConfig(level=logging.DEBUG, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
+elif args.verbose:
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
 elif args.quiet:
-    logging.basicConfig(level=logging.WARNING)
+    logging.basicConfig(level=logging.ERROR, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
 else:
-    logging.basicConfig(level=logging.INFO)
+    logging.basicConfig(level=logging.WARNING, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
+
 
 DB = pymongo.MongoClient()[args.mongodb]
 BLASTPATH = os.path.abspath(args.blastpath)

diff --git a/bin/kv_distance.py b/bin/kv_distance.py
@@ -9,13 +9,11 @@
 from kvasir.mongo_import import mongo_import_distance, mongo_import_distance_matrix
 from kvasir.distance import get_ani, get_distance_matrix
 
-logging.basicConfig(level=logging.DEBUG)
-
 parser = argparse.ArgumentParser(description='Kvasir Analysis commands')
 
 parser.add_argument("mongodb", help="The name of MongoDB database")
-parser.add_argument("-c", "--command", help="which analysis command to run (ani, distance_matrix)",
-    choices=["ani", "distance_matrix"], required=True)
+parser.add_argument("command", help="which analysis command to run (ani, distance_matrix)",
+    choices=["ani", "distance_matrix"])
 
 parser.add_argument("-o", "--output", help="File path for output (usable with distance_matrix)", default="./")
 parser.add_argument("-i", "--input", help="File path for input (usable with distance_matrix)")
@@ -25,15 +23,29 @@
 
 parser.add_argument("-v", "--verbose", help="Display debug status messages", action="store_true")
 parser.add_argument("-q", "--quiet", help="Suppress most output", action="store_true")
+parser.add_argument("--debug", help="set logging to debug", action="store_true")
+
+parser.add_argument("-l", "--log",
+    help="File path for log file")
 
 args = parser.parse_args()
 
-if args.verbose:
-    logging.basicConfig(level=logging.DEBUG)
+
+logpath = None
+if args.log:
+    logpath = os.path.abspath(args.log)
+    if os.path.isdir(logpath):
+        logpath = os.path.join(logpath, "kvasir.log")
+
+if args.debug:
+    logging.basicConfig(level=logging.DEBUG, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
+elif args.verbose:
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
 elif args.quiet:
-    logging.basicConfig(level=logging.WARNING)
+    logging.basicConfig(level=logging.ERROR, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
 else:
-    logging.basicConfig(level=logging.INFO)
+    logging.basicConfig(level=logging.WARNING, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
+
 
 DB = pymongo.MongoClient()[args.mongodb]
 

diff --git a/bin/kv_import.py b/bin/kv_import.py
@@ -6,30 +6,41 @@
 import logging
 from kvasir.mongo_import import mongo_import_genbank
 
-logging.basicConfig(level=logging.INFO)
-
 parser = argparse.ArgumentParser(description='Import genbank files')
 
 parser.add_argument("mongodb", help="The name of MongoDB database")
 parser.add_argument("-i", "--input", help="File or directory to import", required=True)
 
 parser.add_argument("-v", "--verbose", help="Display debug status messages", action="store_true")
 parser.add_argument("-q", "--quiet", help="Suppress most output", action="store_true")
+parser.add_argument("--debug", help="set logging to debug", action="store_true")
+
+parser.add_argument("-l", "--log",
+    help="File path for log file")
 
 args = parser.parse_args()
 
-if args.verbose:
-    logging.basicConfig(level=logging.DEBUG)
+logpath = None
+if args.log:
+    logpath = os.path.abspath(args.log)
+    if os.path.isdir(logpath):
+        logpath = os.path.join(logpath, "kvasir.log")
+
+if args.debug:
+    logging.basicConfig(level=logging.DEBUG, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
+elif args.verbose:
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
 elif args.quiet:
-    logging.basicConfig(level=logging.WARNING)
+    logging.basicConfig(level=logging.ERROR, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
 else:
-    logging.basicConfig(level=logging.INFO)
+    logging.basicConfig(level=logging.WARNING, format="%(asctime)s || %(levelname)s: %(message)s", filename=logpath)
 
-INPUT = args.input
-DB = pymongo.MongoClient()[args.mongodb]
 
 success = 0
 
+INPUT = os.path.abspath(args.input)
+DB = pymongo.MongoClient()[args.mongodb]
+
 if os.path.isdir(INPUT):
     for f in os.listdir(INPUT):
         # TODO: figure out way to throw an error if it looks like the same gene/genome is being imported

diff --git a/kvasir/Analysis/ani.py b/kvasir/Analysis/ani.py
@@ -0,0 +1,43 @@
+import os
+import pandas as pd
+from settings import MONGODB as db
+import re
+from subprocess import Popen, PIPE
+
+species = db["genes"].distinct("species")
+name_dict = {sub(r"[. -\(\)]", "_", x):x for x in species}
+
+l = len(species)
+data = {s:[0 for x in range(l)] for s in species}
+dm = pd.DataFrame(data, index=species)
+dm = pd.read_csv("/Users/KBLaptop/Desktop/ani_matrix.csv", index_col=0)
+
+d = "/Users/KBLaptop/Desktop/ani"
+for f1 in os.listdir(d):
+    for f2 in os.listdir(d):
+        if f1.endswith(".fna") and f2.endswith(".fna"):
+            s1 = name_dict[f1[:-4]]
+            s2 = name_dict[f2[:-4]]
+            if s1 == "Glutamicibacter arilaitensis Re117" or s2 == "Glutamicibacter arilaitensis Re117":
+                if s1.split()[0] == s2.split()[0]:
+                    print(s1)
+                    print(s2)
+                    if s1 == s2:
+                        dm.loc[s1, s2] = 100.0
+                        pass
+                    else:
+                        dist = Popen(
+                            ['ruby', '/Users/KBLaptop/computation/Science/enveomics/Scripts/ani.rb',
+                            '--seq1', os.path.join(d, f1),
+                            '--seq2', os.path.join(d, f2),
+                            '--auto', '--quiet'],
+                             stdout=PIPE  # xml output
+                        ).communicate()[0]
+
+                        try:
+                            print(float(dist.strip()))
+                            dm.loc[s1, s2] = float(dist.strip())
+                        except ValueError:
+                            pass
+
+dm.to_csv("/Users/KBLaptop/Desktop/ani_matrix.csv")