BioPyParse

Università La Sapienza Roma, Dipartimento di Informatica

Credits

federico-rosatelli Mat Loriv3 Samsey Calli

MicroAlgae DB

  • This repository is part of the MicroAlgae DB project, which is made up of several modules. Visit the homepage at https://github.com/BITSapienza for a general guide and automated installation through docker-compose.

BioPyParse Module Description

BioPyParse is a simple module used to retrieve data from NCBI and save it to a MongoDB database. It lets you manage a large dataset that remains accessible from other technologies and libraries. For faster data retrieval, we recommend adding your NCBI login email together with your API key.
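
As a minimal sketch of the underlying mechanism, assuming the module's NCBI requests go through Biopython's Bio.Entrez client (BioPyParse's own configuration hook is not shown here, and both values below are placeholders):

from Bio import Entrez

# NCBI asks for a contact email; an API key raises the request-rate limit.
# Both values are placeholders.
Entrez.email = "your.name@example.com"
Entrez.api_key = "your-ncbi-api-key"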

Database Description

Data Collections (fundamental data):

  • nucleotide_data contains all microalgae data that can be found on NCBI under Nucleotide;
  • taxonomy_data contains all microalgae data that can be found on NCBI under Taxonomy;
  • protein_data contains all microalgae data that can be found on NCBI under Protein;

Auxiliary collections (assembled data):

  • table_basic contains lightweight data, kept small for performance reasons, used when running search queries;
  • table_complete contains the heavier data, such as the feature tables of nucleotides and proteins;
  • taxonomy_tree contains the back-lineage data of every species in a tree-shaped object; it is generated from taxonomy_data.

Structures examples

nucleotide_data

{
	"GBSeqAccessionVersion": "string",
	"GBSeqComment"         : "string",
	"GBSeqCreateDate"      : "string",
	"GBSeqDefinition"      : "string",
	"GBSeqDivision"        : "string",
	"GBSeqFeatureTable"    : [
        {
            "GBFeatureIntervals": {
                "GBIntervalAccession": "string",
                "GBIntervalFrom"     : "string",
                "GBIntervalTo"       : "string"
            },
            "GBFeatureKey"     : "string",
            "GBFeatureLocation": "string",
            "GBFeatureQuals": [
                {
                    "GBQualifierName" : "string",
                    "GBQualifierValue": "string"
                }
            ],
            "GBFeaturePartial3": "string",
            "GBFeaturePartial5": "string"
        }
    ],
	"GBSeqLength"          : "string",
	"GBSeqLocus"           : "string",
	"GBSeqMoltype"         : "string",
	"GBSeqOrganism"        : "string",
	"GBSeqOtherSeqids"     : ["string"],
	"GBSeqPrimaryAccession": "string",
	"GBSeqReferences"      : [
        {
            "GBReferenceAuthors"  : ["string"],
            "GBReferenceJournal"  : "string",
            "GBReferencePosition" : "string",
            "GBReferenceReference": "string",
            "GBReferenceTitle"    : "string"
        }
    ],
	"GBSeqSource"      : "string",
	"GBSeqStrandedness": "string",
	"GBSeqTaxonomy"    : "string",
	"GBSeqTopology"    : "string",
	"GBSeqUpdateDate"  : "string",
}

table_complete

{
	"ScientificName": "string",
	"TaxId"         : "string",
	"Nucleotides"   : [ nucleotide_data_struct ],
	"Proteins"      : [ protein_data_struct ],
	"Products"      : [
        {
            "ProductName" "string",
            "QtyProduct"  "string"
        }
    ],
	"Country": [
        {
            "CountryName" "string"
        }
    ]
}
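
As a rough sketch of how these collections can be queried from any pymongo client, assuming a database named "BiologyTest" and Chlorella vulgaris as an example organism (both are placeholders):

from pymongo import MongoClient

# Connect to the MongoDB instance used by BioPyParse (default host/port assumed).
db = MongoClient("localhost", 27017)["BiologyTest"]

# Fetch the assembled record of one species from table_complete.
record = db["table_complete"].find_one({"ScientificName": "Chlorella vulgaris"})

# Count the raw Nucleotide entries stored for the same organism.
n_seqs = db["nucleotide_data"].count_documents({"GBSeqOrganism": "Chlorella vulgaris"})

print(record["TaxId"] if record else "not found", n_seqs)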

Module Structure

(Module structure diagram)

  • The input CSV file is optional and is parsed by the findSpeciesFromFile function (a rough sketch of this flow follows the list).

  • The module is bound to a MongoDB database through the pymongo library. The user can execute various commands to import data into, or retrieve data from, the database.

  • Entrez is used to retrieve data from NCBI. Other auxiliary functions parse the input data.

  • The output files (CSV/JSON) were implemented for development of the Vulgaris Platform. Methods for creating them may be added in the future, building on the current module structure.
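
The expected CSV layout is not documented here; as a minimal sketch, assuming a one-name-per-line file of species names (findSpeciesFromFile may expect a different format), the same flow can be reproduced with the standard csv module and the importer shown in the code example further below:

import csv
from Bioparse import BioPyParse  # same import path as the code example below

# Hypothetical input: a one-column CSV with one species name per line.
with open("species.csv", newline="") as fh:
    organismList = [row[0] for row in csv.reader(fh) if row]

bio = BioPyParse(verbose=True)
bio.newDatabase("BiologyTest")

# Import the taxonomy records of every species read from the file.
bio.importTaxonFromList(organismList, collectionName="taxonomy_data")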

Tester & Code Example

There is a testing file that verifies that all methods of the module work correctly. The tester can be found at ./tests/tester.py.
To execute it:

python3 tests/tester.py

Methods examples

The module consists of a class, BioPyParse, and a function. The class is initialized without access to a database, which must then be set with the method:

def newDatabase(self, databaseName: str, clientIp: str = "localhost", clientPort: int = 27017)

Assuming you have downloaded the desired taxonomic data, the method:

def generateTaxonomyTree(self, collectionName: str | None = "taxonomy_tree", taxonomyCollection: str | None = "taxonomy_data")

will create a taxonomic tree structured like this:

[
    {
        "ScientificName": "ExampleScientificName",
        "TaxId": "ExampleTaxId",
        "Rank": "ExampleRank",
        "SubClasses": [
            {
                "ScientificName": "SubExampleScientificName",
                "TaxId": "SubExampleTaxId"
            },
            ...
        ]
    },
    ...
]

and import it into the database under the name given by collectionName.

Code Example

The following code snippet shows an example run:

from Bioparse import BioPyParse

# Create the parser and point it at a local MongoDB database.
bio = BioPyParse(verbose=True)
bio.newDatabase("BiologyTest")

# Download and store the taxonomy records of the listed organisms.
organismList = ["Chlorella", "Scenedesmus"]
taxIds = bio.importTaxonFromList(organismList, collectionName="taxonomy")

# Build the lineage tree from the freshly imported taxonomy collection.
bio.generateTaxonomyTree(collectionName="taxon_tree", taxonomyCollection="taxonomy")
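
After the import, the generated tree can be inspected directly with pymongo; a minimal sketch, reusing the names from the snippet above and assuming each tree node is stored as its own document (the actual storage layout may differ):

from pymongo import MongoClient

# Open the database created by newDatabase("BiologyTest") above.
db = MongoClient("localhost", 27017)["BiologyTest"]

# Print the top-level nodes of the generated taxonomy tree.
for node in db["taxon_tree"].find({}, {"ScientificName": 1, "TaxId": 1, "Rank": 1}):
    print(node.get("ScientificName"), node.get("TaxId"), node.get("Rank"))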
