BioPyParse

Università La Sapienza Roma, Dipartimento di Informatica

Credits

federico-rosatelli Mat Loriv3 Samsey Calli

MicroAlgae DB

  • This repository is part of the MicroAlgae DB project, which is made up of several modules. Visit the homepage at https://github.com/BITSapienza for a general guide and automated installation through docker-compose.

BioPyParse Module Description

BioPyParse is a simple module used to retrieve data from NCBI and save it to a MongoDB database. It lets you manage a large dataset that remains accessible from other technologies and libraries. For faster data retrieval, we recommend adding your NCBI login email together with your API key.
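
As a minimal sketch of the underlying mechanism, assuming the module's NCBI requests go through Biopython's Bio.Entrez client (BioPyParse's own configuration hook is not shown here, and both values below are placeholders):

from Bio import Entrez

# NCBI asks for a contact email; an API key raises the request-rate limit.
# Both values are placeholders.
Entrez.email = "your.name@example.com"
Entrez.api_key = "your-ncbi-api-key"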

Database Description

Data Collections (fundamental data):

  • nucleotide_data contains all microalgae data that can be found on NCBI under Nucleotide;
  • taxonomy_data contains all microalgae data that can be found on NCBI under Taxonomy;
  • protein_data contains all microalgae data that can be found on NCBI under Protein;

Auxiliary collections (assembled data):

  • table_basic contains lightweight data, kept small for performance reasons, used when running search queries;
  • table_complete contains the heavier data, such as the feature tables of nucleotides and proteins;
  • taxonomy_tree contains the back-lineage data of every species in a tree-shaped object; it is generated from taxonomy_data.

Structures examples

nucleotide_data

{
	"GBSeqAccessionVersion": "string",
	"GBSeqComment"         : "string",
	"GBSeqCreateDate"      : "string",
	"GBSeqDefinition"      : "string",
	"GBSeqDivision"        : "string",
	"GBSeqFeatureTable"    : [
        {
            "GBFeatureIntervals": {
                "GBIntervalAccession": "string",
                "GBIntervalFrom"     : "string",
                "GBIntervalTo"       : "string"
            },
            "GBFeatureKey"     : "string",
            "GBFeatureLocation": "string",
            "GBFeatureQuals": [
                {
                    "GBQualifierName" : "string",
                    "GBQualifierValue": "string"
                }
            ],
            "GBFeaturePartial3": "string",
            "GBFeaturePartial5": "string"
        }
    ],
	"GBSeqLength"          : "string",
	"GBSeqLocus"           : "string",
	"GBSeqMoltype"         : "string",
	"GBSeqOrganism"        : "string",
	"GBSeqOtherSeqids"     : ["string"],
	"GBSeqPrimaryAccession": "string",
	"GBSeqReferences"      : [
        {
            "GBReferenceAuthors"  : ["string"],
            "GBReferenceJournal"  : "string",
            "GBReferencePosition" : "string",
            "GBReferenceReference": "string",
            "GBReferenceTitle"    : "string"
        }
    ],
	"GBSeqSource"      : "string",
	"GBSeqStrandedness": "string",
	"GBSeqTaxonomy"    : "string",
	"GBSeqTopology"    : "string",
	"GBSeqUpdateDate"  : "string",
}

table_complete

{
	"ScientificName": "string",
	"TaxId"         : "string",
	"Nucleotides"   : [ nucleotide_data_struct ],
	"Proteins"      : [ protein_data_struct ],
	"Products"      : [
        {
            "ProductName" "string",
            "QtyProduct"  "string"
        }
    ],
	"Country": [
        {
            "CountryName" "string"
        }
    ]
}
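
As a rough sketch of how these collections can be queried from any pymongo client, assuming a database named "BiologyTest" and Chlorella vulgaris as an example organism (both are placeholders):

from pymongo import MongoClient

# Connect to the MongoDB instance used by BioPyParse (default host/port assumed).
db = MongoClient("localhost", 27017)["BiologyTest"]

# Fetch the assembled record of one species from table_complete.
record = db["table_complete"].find_one({"ScientificName": "Chlorella vulgaris"})

# Count the raw Nucleotide entries stored for the same organism.
n_seqs = db["nucleotide_data"].count_documents({"GBSeqOrganism": "Chlorella vulgaris"})

print(record["TaxId"] if record else "not found", n_seqs)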

Module Structure

(Module structure diagram)

  • The input CSV file is optional and is parsed by the findSpeciesFromFile function (a rough sketch of this flow follows the list).

  • The module is bound to a MongoDB database through the pymongo library. The user can execute various commands to import data into, or retrieve data from, the database.

  • Entrez is used to retrieve data from NCBI. Other auxiliary functions parse the input data.

  • The output files (CSV/JSON) were implemented for development of the Vulgaris Platform. Methods for creating them may be added in the future, building on the current module structure.
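
The expected CSV layout is not documented here; as a minimal sketch, assuming a one-name-per-line file of species names (findSpeciesFromFile may expect a different format), the same flow can be reproduced with the standard csv module and the importer shown in the code example further below:

import csv
from Bioparse import BioPyParse  # same import path as the code example below

# Hypothetical input: a one-column CSV with one species name per line.
with open("species.csv", newline="") as fh:
    organismList = [row[0] for row in csv.reader(fh) if row]

bio = BioPyParse(verbose=True)
bio.newDatabase("BiologyTest")

# Import the taxonomy records of every species read from the file.
bio.importTaxonFromList(organismList, collectionName="taxonomy_data")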

Tester & Code Example

There is a testing file that verifies that all methods of the module work correctly. The tester can be found at ./tests/tester.py.
To execute it:

python3 tests/tester.py

Methods examples

The module consists of a class, BioPyParse, and a function. The class is initialized without access to a database, which must then be set with the method:

def newDatabase(self, databaseName: str, clientIp: str = "localhost", clientPort: int = 27017)

Assuming you have downloaded the desired taxonomic data, the method:

def generateTaxonomyTree(self, collectionName: str | None = "taxonomy_tree", taxonomyCollection: str | None = "taxonomy_data")

will create a taxonomic tree structured like this:

[
    {
        "ScientificName": "ExampleScientificName",
        "TaxId": "ExampleTaxId",
        "Rank": "ExampleRank",
        "SubClasses": [
            {
                "ScientificName": "SubExampleScientificName",
                "TaxId": "SubExampleTaxId"
            },
            ...
        ]
    },
    ...
]

and import it into the database under the name given by collectionName.

Code Example

The following code snippet shows an example run:

from Bioparse import BioPyParse

# Create the parser and point it at a local MongoDB database.
bio = BioPyParse(verbose=True)
bio.newDatabase("BiologyTest")

# Download and store the taxonomy records of the listed organisms.
organismList = ["Chlorella", "Scenedesmus"]
taxIds = bio.importTaxonFromList(organismList, collectionName="taxonomy")

# Build the lineage tree from the freshly imported taxonomy collection.
bio.generateTaxonomyTree(collectionName="taxon_tree", taxonomyCollection="taxonomy")
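
After the import, the generated tree can be inspected directly with pymongo; a minimal sketch, reusing the names from the snippet above and assuming each tree node is stored as its own document (the actual storage layout may differ):

from pymongo import MongoClient

# Open the database created by newDatabase("BiologyTest") above.
db = MongoClient("localhost", 27017)["BiologyTest"]

# Print the top-level nodes of the generated taxonomy tree.
for node in db["taxon_tree"].find({}, {"ScientificName": 1, "TaxId": 1, "Rank": 1}):
    print(node.get("ScientificName"), node.get("TaxId"), node.get("Rank"))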
