Clean-up (#14)
* Clean-up

* fixed tab in output.py

* removed old code/ folder

* more cleanup
kescobo committed May 27, 2016
1 parent be74847 commit 47c75d8
Showing 19 changed files with 44 additions and 1,995 deletions.
4 changes: 0 additions & 4 deletions .gitignore
@@ -1,7 +1,3 @@
.DS_store
*ipynb*
*.sublime-project
*.sublime-workspace
*.pyc
.idea
user_settings.py
43 changes: 18 additions & 25 deletions README.md
@@ -7,36 +7,29 @@ Dependencies:
* MongoDB
* pymongo
* BioPython
* ~~Scikit Bio~~
* ~~Pandas~~
* BLAST+ CLI

**Identification of horizontal gene transfer between sequenced microbial genomes**

##The following is out of date... Will get back to this soon
Kvasir takes as input a folder containing genomes in GenBank format. The protein-coding genes from these genomes are loaded into a database and BLASTed against each other.

##~~Running Kvasir~~
~~With dependencies installed, fire up a Mongod instance. In the terminal:~~
### Usage

~~`mongod --dbpath path/to/db`~~
Change the values in `settings.py` to point at your input folder, output folder, and the name you want for your database.

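As a rough illustration only, `settings.py` might look something like the sketch below; the variable names here are assumptions, so check the file shipped with the repository for the real ones.

```python
# settings.py -- illustrative values only; the actual variable names in the
# project's settings.py may differ.
INPUT_FOLDER = "/path/to/genbank_files"   # folder of annotated genomes to import
OUTPUT_FOLDER = "/path/to/output"         # where results will be written
MONGODB_NAME = "kvasir_db"                # name of the Mongo database to use
```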
~~Run Kvasir by invoking run_kvasir.py in your working directory:~~
Launch a local `mongod` instance:
```
$ mongod --dbpath path/to/db
```

~~`python run_kvasir.py /path/to/gb_files name_of_mongoDB`~~
Run the functions in `run.py`. Eventually this will be more streamlined; for now, call them from an interactive Python session as shown below.

####~~DataImport~~:
* ~~Imports GenBank-formatted annotated genomes into the Mongo database.~~
* ~~.gb files require a "locus_tag" feature. If your genomes don't have it, FixGbk.py should take care of it for you.~~
* ~~The Mongo database has "collections" and "documents": a separate collection is generated for each species (each separate GenBank file), with documents representing each CDS.~~
* ~~CDS documents are like Python dictionaries, and contain entries for species, DNA and amino acid sequences, contig and location info, and annotation information.~~
* ~~Each document is assigned a unique `_id` attribute within the species, so every gene is uniquely identified by a `(species, _id)` tuple (a rough sketch of this import step follows below).~~

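For orientation, here is a minimal sketch of the import step described above, using BioPython and pymongo. The per-species collection layout and field names are assumptions based on the description, not the project's actual code.

```python
from Bio import SeqIO
from pymongo import MongoClient

def import_genbank(gb_path, species_name, db_name="kvasir_db"):
    """Load every CDS from one GenBank file into a per-species collection."""
    collection = MongoClient()[db_name][species_name]
    for record in SeqIO.parse(gb_path, "genbank"):         # one record per contig
        for feature in record.features:
            if feature.type != "CDS":
                continue
            quals = feature.qualifiers
            collection.insert_one({                        # MongoDB assigns a unique _id
                "species": species_name,
                "locus_tag": quals.get("locus_tag", [None])[0],
                "annotation": quals.get("product", [None])[0],
                "contig": record.id,
                "location": str(feature.location),
                "dna_seq": str(feature.extract(record.seq)),
                "aa_seq": quals.get("translation", [None])[0],
            })
```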
####~~MakeBlastDb~~
~~Generates a multi-fasta file containing every gene in the Mongo database, builds a BLASTable database from it, and then deletes the temporary file (sketched below).~~

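A sketch of that dump-and-index step under the same assumptions, relying on the `makeblastdb` tool from the BLAST+ CLI; the `species|_id` fasta header format is an illustrative choice.

```python
import os
import subprocess
from tempfile import NamedTemporaryFile

def make_blast_db(db, out_path="kvasir_blastdb"):
    """Write every gene to a temporary multi-fasta, index it, then delete the fasta."""
    with NamedTemporaryFile("w", suffix=".faa", delete=False) as tmp:
        for species in db.list_collection_names():
            for gene in db[species].find():
                tmp.write(">{}|{}\n{}\n".format(species, gene["_id"], gene["aa_seq"]))
        fasta_path = tmp.name
    subprocess.check_call([
        "makeblastdb",
        "-in", fasta_path,
        "-dbtype", "prot",        # protein database, to be searched with blastp
        "-out", out_path,
    ])
    os.remove(fasta_path)         # clean up the temporary multi-fasta
```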
####~~KvasirBlast~~
* ~~For each species, generates a temporary fasta file and BLASTs it against every other gene in the database.~~
* ~~BLAST produces an XML document, which is parsed for unique hits.~~
* ~~A new "hits" entry is added to each gene document in MongoDB, containing a list of `(species, _id)` tuples for each hit (these are used in the next script to gather info about hits; see the sketch below).~~

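And a sketch of the all-vs-all search and hit recording, again only an approximation of what the description says: `blastp` writes XML (`-outfmt 5`), BioPython's `NCBIXML` parses it, and hits from other species are written back to each gene document.

```python
import subprocess
from tempfile import NamedTemporaryFile

from Bio.Blast import NCBIXML
from bson import ObjectId

def blast_species(db, species, blast_db="kvasir_blastdb"):
    """BLAST one species' proteins against the shared database and record the hits."""
    with NamedTemporaryFile("w", suffix=".faa", delete=False) as query:
        for gene in db[species].find():
            query.write(">{}|{}\n{}\n".format(species, gene["_id"], gene["aa_seq"]))
        query_path = query.name

    xml_path = query_path + ".xml"
    subprocess.check_call([
        "blastp", "-query", query_path, "-db", blast_db,
        "-outfmt", "5", "-out", xml_path,          # outfmt 5 = XML
    ])

    with open(xml_path) as handle:
        for blast_record in NCBIXML.parse(handle):
            _, query_id = blast_record.query.split("|", 1)
            hits = []
            for alignment in blast_record.alignments:
                hit_species, hit_id = alignment.hit_def.split("|", 1)
                if hit_species != species:           # ignore within-species self-hits
                    hits.append((hit_species, hit_id))
            db[species].update_one({"_id": ObjectId(query_id)},
                                   {"$set": {"hits": hits}})
```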
####~~Outputs~~
~~Still a work in progress. So far, a bunch of output formats are working... will detail later.~~
```
Python 2.7.11 (default, Dec 14 2015, 10:44:13)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import run
>>> run.import_data()
>>> run.blast_db()
>>> run.analyze(0.99)
```
130 changes: 0 additions & 130 deletions code/DataImport.py

This file was deleted.
