Skip to content

hectorloma/GNFish

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GNFish

============================================================================================================

When using GNFish or part of its code please cite the following paper:

Lorente-MartĂ­nez, H., Agorreta, A., & San Mauro, D. (2022). Genomic Fishing and Data Processing for Molecular Evolution Research. Methods and Protocols, 5(2), 26. https://doi.org/10.3390/mps5020026

============================================================================================================

Suite of small Python programs following the pipeline detailed in (DOI of article yet to be published) for:
(1) “fishing” specific gene sequences of interest from genomic data available in NCBI databases
(2) processing and depuration of retrieved sequences
(3) production of a multiple sequence alignment
(4) selection of best-fit model of evolution
(5) solid reconstruction of a phylogenetic tree.
(6) There are 12 scripts:

align_sequences.py -> Align sequences using MAFFT software
blast.py -> Perfoms BLAST searches against the download genomes
class_list_files.py -> Class for accessing directories. Not need to be run, but the other scripts need it.
decompress_genomes.py -> Decompress genome files.
get_combined_seqs.py -> Merge sequences for creating a whole alingment.
get_genomes.py -> Download genomes from NCBI databases
get_query_sequences.py -> Download a dataset of protein sequences that act as query for BLAST searches
get_RAW_sequences.py -> Extracts the sequences mapped after BLAST searches
get_unique_hits.py -> Gets unique hits from Blast output files based on genomes IDs
phylogenetic_inference.py -> Runs IQ-Tree program for phylogentic inference
translate_sequences.py -> Translate nucleotide sequences to protein.
alignment_trimming.py -> Runs trimAl program for trimming alignments.

This software is released under the license GNU GPLv3.

This software is provided as is without warranty of any kind.

People wishing to contribute to the software, report issues or seek support can contact Hector Lorente Martinez at hlorente@ucm.es

Download

Download scripts from https://github.com/hectorloma/GNFish
In linux terminal

git clone https://github.com/hectorloma/GNFish

If git is not installed visit https://github.com/git-guides/install-git which is available for any OS

Make scripts executable if you want to run it from the very same terminal

cd GNFish
chmod +x *.py

Dependencies

This software uses Biopython module in python 3 (tested version 1.78-2). It can be installed in your system using pip:

pip install biopython 

get_genomes.py

DESCRIPTION:

Program for downloading genomes from NCBI Databases through Entrez. Requires internet conection.
It needs your e-mail and a file with your queries.
Queries are usallly taxa names (see query_genomes examples files at ./Example)
By default it will download genomic data, but you can also add protein or RNA, using --protein or --rna respectively.
Retmax set a maximum number of downloaded genomes for query.
It creates a directory named Data and three subdirectories named Genomic, Rna and Protein where it downloads the genomes.
Genomes file are compressed you must descompress for working with them (see decompress genome file).
Genome file name will be "Genus_species_assemblyID_datatype.f[n,a]a".

Type on terminal get_genomes.py -h for further information.

USAGE:

**get_genomes.py 'e-mail' 'query.txt' **

PARAMETERS:

--email -> mandatory e-mail for NCBI searches
--query -> file with the queries. Usually simple taxa names (species, group). Field tags or filters can be added to each query. See examples below or look at Example directory for examples of query files

Optional parameters:
--genomic -> downloads whole genomic data
--rna -> downloads protein annotation data
--protein -> downloads protein annotation data
--exclusive -> downloads just protein or rna annotation data if available. No genomic backup
--retmax -> number of NCBI records reported for every query. Default value equal to 200
--refine -> adds filter or field information to all queries. Constant value AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])

EXAMPLES:

Examples files are availabe at Example directory.

1.- Simple search

Query txt with simple searches, just taxa nor filters or field tags. No refine argument.
See ./Example/query_genome_1.txt for examples of queries.
Creates ./Data directory and ./Data/Genomic, ./Data/Rna and ./Data/Protein subdirectories.
Nor filters or field tags applied. Not curated and redundant genomes (one genom for more than one species is used).
Use this when you do not care very much about filtering.

Genomic

./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt

Protein. Look for genomic data as backup (For transcrits use --rna)

./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --protein

Exclusive protein. No backup. (For transcrits use --rna)

./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --protein --exlusive

Protein and Rna. Genomic search as backup

./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --protein --rna

Rna donwload changing number of records downloaded. Will donwload until 1000 genomes if available. By default 200 are donwloaded.

./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --rna --retmax 1000

2.- Refine search using --refine argument. Recomended

Default. Applies Representative (just one genome for species), Latest, Not Anomalous.
Just protein example. For rna, genomic, exclusive or retmax argument see 1 above and them to this command line.

./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --protein --refine

Own filters or field tags. Just genomes annotated at chromosome level (Example)

./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --protein --refine 'AND "chromosome level"[filter]'

More info about filters and field tags at https://www.ncbi.nlm.nih.gov/assembly/help/ and reading https://www.ncbi.nlm.nih.gov/books/NBK3837/ - Entre-zHelp.Entrez_Searching_Options.

3.- Refine applied to each query

You must add a the field tags or filters after your query.
See ./Example/query_genome_2.txt for examples of queries.
As each contains is own filters or field tags this will be applied exclusively.
For gobioidei it will download just genomes annotated at chromosome level.
For Homo sapiens all the available genomes will be downloaded.
Just protein example. For rna, genomic, exclusive or retmax argument see 1 above and them to this command line.

./Code/get_genomes.py hlorente@ucm.es Example/query_genome_2.txt --protein

4.- Refine to each query and general refine

Default. Applies Representative (just one genome for species), Latest, Not Anomalous.
You must add a the field tags or filters after your query.
See ./Example/query_genome_2.txt for examples of queries.
As each contains is own filters or field tags this will be applied exclusively.
For gobioidei it will download just genomes annotated at chromosome level.
For Homo sapiens just the representative one.
Just protein example. For rna, genomic, exclusive or retmax argument see 1 above and them to this command line.

./Code/get_genomes.py hlorente@ucm.es Example/query_genome_2.txt --protein --refine

get_query_seqs.py

DESCRIPTION:

Program for downloading protein or nucleotide from NCBI Databases through Entrez. Requires internet conection.
It needs your e-mail and a file with your queries.
Queries must follow "Gene name (filters). If no filters type () after the gene name.
By default it will download protein data, but you can change using --nucleotide parameter.
Retmax set a maximum number of downloaded genomes for query.
Curated argument refine the search for protein search, just those entries which include the name of the protein in Protein Feature.
Stores sequences at ./Data/Query_seqs.
Output file will be named "genename_query_seqs.fas.

Type on terminal get_query_seqs.py -h for further information.

USAGE:

**get_query_seqs.py 'e-mail' 'query.txt' **

PARAMETERS:

--email -> mandatory e-mail for NCBI searches
--query -> file with the queries. Usually simple taxa names (species, group). Field tags or filters can be added to each query. See examples below or look at Examples directory for examples of query files

Optional parameters:
--nucleotide -> for nucleotide downloading
--curated -> downloads just protein or rna annotation data if available. No genomic backup
--retmax -> number of NCBI records reported for every query. Default value equal to 200
--refine -> adds filter or field tags to the query. Constant value refseq[filter]. Follow constant value structure for your custom filters, begin with "AND"

EXAMPLES:

Examples files are availabe at Example directory.

1.- Simple search

Query txt with simple searches, just gene name and filters equal to ().
Creates ./Data/Query_seqs directory.
Peforms well for protein, for nucleotide it is better to see 2.

Protein

./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt

Protein. Downloading restricting maximum number to 50

./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt --retmax 50

Protein. Downloading restricting maximum number to 50 and curating

./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt --retmax 50 --curated

Nucleotide. Downloading restricting maximum number to 50. Better see 2.

./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt --retmax 50 --nucleotide

2.- Refine search using --refine argument. Recommended

Default. Applies refseq[filter].
Protein download, restricting maximum number to 50, curating, refine default just sequences from RefSeq.

./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt --retmax 50 --curated --refine

Nucleotide recommended use. Downloading restricting maximum number to 50 and just transcripts (refine argument).

./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt --retmax 50 --curated --refine 'biomol_mrna[PROP]'

3.- Refine applied to each query

You must add a the field tags or filters after your query.
As each query contains is own filters or field tags these will be applied exclusively to every query. See ./Example/query_seqs_2.txt and quer_seqs_3.txt for a scheme.
Protein. For aquaporin just vertebrates sequences. For hemoglobin just sequences between 100 and 200 positions.

./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_2.txt

Nucleotide. For aquaporin sequences just download sequences from RefSeq. For hemoglobin just from Plasmids.

./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_3.txt

4.- Refine to each query and general refine

Protein. Same filters as in 3 plus sequences just from 1999.

./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_2.txt --refine 'AND ("1999/01/01"[PDAT] : "3000/12/31"[PDAT])'

Nucleotide. Same filters as in 3 plus sequences just from 1999.

./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_3.txt --refine 'AND ("1999/01/01"[PDAT] : "3000/12/31"[PDAT])'

More info about filters and field tags at x and reading y.

decompress_genomes.py

DESCRIPTION:

Program for decompress downloaded genomes.
It searchs for '.gz' files. Can be applied to any other '.gz' file.
With data type (genomic, rna, protein) argument the program will look at ./Data/Datat_type.
With directory argument the program will iterativily within the detail path.

Type on terminal decompress_genomes.py -h for further information.

USAGE:

**decompress_genomes.py --data_type or --directory 'path' **

PARAMETERS:

Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein

blast.py

DESCRIPTION:

BLAST against genome .fna and .faa files.
It requires BLAST download at https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.

Type on terminal blast.py -h for further information.

USAGE:

**blast.py --data_type or --directory 'path' **

PARAMETERS:

blast_path -> path to bin ncbi-blast directory
query_file -> path to your query file with your reference sequences (Fasta file with prot or nucl seqs)
query_type -> data type of query sequences. prot or nucl

Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
--evalue -> E-value threshold for search. Default value equal to 1e-10
--outfmt -> Output alignment options. Default value 6
--out_exten -> Extension of the BLAST output file. Default "_out.tsv"

EXAMPLES:

Examples files are availabe at Example directory.

1.- Query and database (genome) are proteins

Protein directory

./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'prot' --protein

Custom directory

./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'prot' --directory 'path_to_custom_directory' --protein

2.- Query and database (genome) are nucleotides

Rna directory

./Code/blast.py 'path_to_bin_directory' ./Example/nucleotide_query.fas 'nucl' --rna

Custom directory

./Code/blast.py 'path_to_bin_directory' ./Example/nucleotide_query.fas 'nucl' --directory 'path_to_custom_directory' --rna

3.- Query is protein and database (genome) is nucleotide

Rna directory

./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'prot' --rna

Custom directory

./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'prot' --directory 'path_to_custom_directory' --rna

4.- Query is nucleotide and database (genome) is protein

Protein directory

./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'nucl' --protein

Custom directory

./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'nucl' --directory 'path_to_custom_directory' --protein

get_unique_hits.py

DESCRIPTION:

Gets unique hits from BLAST output files based on genomes IDs.
The program search for '.tsv' files. Change this with --pattern argument.
Generates output with 'unique.tsv' extension.
Outputs will be stored at Genomic, Rna, Protein or custom directory.

Type on terminal get_unique_hits.py -h for further information.

USAGE:

**get_unique_hits.py --data_type or --directory 'path' **

PARAMETERS:

Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
--pattern -> custom pattern to find Blast output files. Default ".tsv"

EXAMPLES:

Examples files are availabe at Example directory.

1.- Default

Protein directory. Output will be stored at ./Data/Protein/Species_directory

./Code/get_unique_hits.py --protein

Custom directory. Output will be stored at custom directory (where the file is located)

./Code/get_unique_hits.py --directory 'path_to_custom_directory'

2.- New pattern

Using '.txt'. Program will serach for '.txt' files.

./Code/get_unique_hits.py --protein --pattern '.txt'

get_RAW_sequences.py

DESCRIPTION:

Gets RAW sequences from genomes based on 'unique.tsv' files.
The program search for 'unique.tsv' and '.f[a,n]a files. Change this with --blast_pattern an--genome_pattern arguments.
Genome and Blast unique output file must have a "Genus_species_assemblyID" structure or the name provided by NCBI when downloaded.

Type on terminal get_unique_hits.py -h for further information.

USAGE:

**get_RAW_sequences.py --data_type or --directory 'path' **

PARAMETERS:

Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
--blast_pattern -> custom pattern to find BLAST output files. Default ".tsv"
--genome_pattern -> pattern to find genome files. Default ".f[a,n]a"
--in_len -> Number of sites extracted upstream and downstream from the blast hit. Default 10000. A whole sequence of at least 20000 sites if exists
--query_seqs -> Use it when you want to attach some query sequences according to BLAST results for future alignments
--query_seqs_num -> Maximum number of query sequences extracted. Use it when you want to attach some sequences to genomic sequences for future alignments. 5 by default
--email -> Mandatory when you want to download some sequences to complete the RAW files for future alignments. As in get_query_seqs

EXAMPLES:

Examples files are availabe at Example directory.

1.- Default

Protein directory. Output will be stored at ./Data/Protein/Species_directory

./Code/get_RAW_sequences.py --protein

Genomic directory, sequences of 20,000 positions if possible. Output will be stored at ./Data/Genomic/Species_directory

./Code/get_RAW_sequences.py --genomic

Custom directory. Output will be stored at custom directory (where the file is located)

./Code/get_RAW_sequences.py --directory 'path_to_custom_directory'

2.- Adding seqs for future alignment. Recommended for Genomic sequences

Genomic directory, sequences of 20,000 positions if possible. Output will be stored at ./Data/Genomic/Species_directory
20 sequences from the dataset of nucleotide_query_seqs.fas.

./Code/get_RAW_sequences.py --genomic --in_len --query_seqs ./Example/nucleotide_query_seqs.fas --query_seqs_num 20 --email hlorente@ucm.es

3.- Changing genome and BLAST pattern

Genomic directory, sequences of 20,000 positions if possible. Output will be stored at ./Data/Genomic/Species_directory
20 sequences from the dataset of nucleotide_query_seqs.fas. Program will look for 'unique.txt' and for '.fas'.

./Code/get_RAW_sequences.py --genomic --in_len --blast_pattern 'unique.txt' --genome_pattern '.fas' --query_seqs ./Example/nucleotide_query_seqs.fas --query_seqs_num 20 --email hlorente@ucm.es

4.- Changing length of genomic extracted sequence

Genomic directory, sequences of 30,000 positions if possible. Output will be stored at ./Data/Genomic/Species_directory

./Code/get_RAW_sequences.py --genomic --in_len 15000

align_sequences.py

DESCRIPTION:

Aligns files enclosed at Data folder or your custom file.
You must have used --query_seqs argument in get_RAW_sequences.py or have manually added some sequences to raw files.
See https://mafft.cbrc.jp/ for information about algorithms.
Type on terminal get_unique_hits.py -h for further information.

USAGE:

**align_sequences.py --data_type or --directory 'path' **

PARAMETERS:

Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
--pattern -> sets custom pattern to find files for alignment. Default 'RAW.fas'. Backup search RAW.fas
--algorithm -> sets MAFFT algorithm. Default mafft (automatic)

EXAMPLES:

Examples files are availabe at Example directory.

1.- Automatic

Genomic directory. Output will be stored at ./Data/Genomic/Species_directory

./Code/align_sequences.py --genomic

Custom directory. Output will be stored at custom directory (where the file is located)

./Code/align_sequences.py --directory 'path_to_custom_directory'

2.- Custom directory, e-insi algorithma and custom pattern

Custom directory. Output will be stored at custom directory (where the file is located)

./Code/align_sequences.py --directory 'path_to_custom_directory' --algorithm einsi --pattern 'RAW_mod.fas'

translate_sequences.py

DESCRIPTION:

Translates files.
Open code file of the program for checking for different genetic codes.

USAGE:

**translate_sequences.py --data_type or --directory 'path' **

PARAMETERS:

Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--pattern -> custom pattern to find files for translation. Defaul '_final.fas' and 'RAW.fas' as backup
--genetic_code -> Default 1 -Standard. Look at code file for more genetics codes
--out_exten -> extension of the output file. Default '_transtaled.fas'.

EXAMPLES:

1.- Default

Rna directory. Output will be stored at ./Data/Rna/Species_directory

./Code/translate_sequences.py --rna

Custom directory. Output will be stored at custom directory (where the file is located)

./Code/translate_sequences.py --directory 'path_to_custom_directory'

2.- Changing genetic code

Custom directory. Output will be stored at custom directory (where the file is located)

./Code/get_RAW_sequences.py --directory 'path_to_custom_directory' --genetic_code 4

get_combined_seqs.py

DESCRIPTION:

Combines FASTA files in just one matrix.
You can combine genomic, RNA and protein data. But be careful with duplications. Besides genomic and RNA should have been translated (see above) in the correct reading framework before translation (see above).
Note that the program concatenate sequences to same file if you do not change file_name argument.
After the pipeline (see article) you may have:
1.-Untouched sequences. Typically protein or RNA sequence (valid for future nucleotide matrix if you do not care about UTRs; not valid for future protein matrix, because of lost of reading framework). Extension 'RAW.fas'.
2.- Aligned and trimmed (mismatched positions) sequences. Mandatory for genomic sequences and in some cases for RNA sequences (if you are going to uses for protein matrix after translation, but in this case they will follow point 3). Extension '_final.fas'. Add the extension to your file after visualizing the alingment even when you do not trimmed any position of the alingment.
3.- Translated sequences. Mandatory for genomic sequences and RNA sequences
4.- Raw translated sequences. If the RNA dataset you are working with is just the coding sequence and you do not need to trimm it. Extension '_RAW_translated.fas

USAGE:

**get_combined_seqs.py --data_type or --directory 'path' **

PARAMETERS:

output_name -> name for the file of the combined matrix
output_path -> path were to store the combined matrix

Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
--pattern -> custom pattern to find files for translation. Defaul '_translated.fas' and 'RAW.fas' as backup'
--out_exten -> extension of the output file. Default "_all_combined.fas"

EXAMPLES:

1.- Default

Protein and Genomic directory. Output will be stored at ./Data/Rna/Species_directory

./Code/get_combined_seqs.py --protein --genomic

Custom directory. Output will be stored at custom directory (where the file is located)

./Code/get_combined_seqs.py --directory 'path_to_custom_directory'

2.- Changing pattern

Protein and Genomic directory. Output will be stored at ./Data/Rna/Species_directory

./Code/get_combined_seqs.py --protein --genomic --pattern 'modified.fas'

alignment_trimming.py

DESCRIPTION:

Trims alignment.
See http://trimal.cgenomics.org/use_of_the_command_line_trimal_v1.2 for more information about trimAL and running algorithms.

USAGE:

**alignment_trimming.py trimal_path 'path' input_file 'path' output_file 'path' **

PARAMETERS:

trimal_path -> path to trimAl directory
input_file -> alignment (FASTA file with prot or nucl aligned seqs)
output_file -> output file name

Optional parameters:
--trimal_command -> command to run trimal options. Default -gt 0.1

EXAMPLES:

Examples files are availabe at Example directory.

1.- Default

Remove all positions in the alignment with gaps in 90%

./Code/alignment_trimming.py trimal_path 'path_to_trimAl_source_directory' --input_file 'path_to_alignment' --output_file 'path_to_ouput_file'

2.- Changing algorithm

Remove all positions in the alignment with gaps in 50%

./Code/alignment_trimming.py trimal_path 'path_to_source_directory' --input_file 'path_to_alignment' --output_file 'path_to_ouput_file' --trimal_command -gt 0.5

phylogenetic_inference.py

DESCRIPTION:

Phylogenetic analysis using IQ-TREE.
See http://www.iqtree.org/doc/ for more information about IQ-TREE and running algorithms.

USAGE:

**phylogenetic_inference.py --data_type or --directory 'path' **

PARAMETERS:

iqtree_path -> path to IQ-TREE bin folder
input_file -> alignment (FASTA file with prot or nucl aligned seqs)

Optional parameters:
--trimal_command -> command to run IQ-Tree, without output declaration, see next parameter. Default TEST, ULTRAFAST BOOSTRAP 1000, ALRT 1000
--output_suffix -> suffixes added to input file. Recommended, model and type of support. Default _TEST_UFBS_alrt

EXAMPLES:

1.- Default

Phylogenetic inference running TEST for calculating model of evolutoon, 1000 replicates of Ultrafast Boostrap and 1000 replicates of alrt as support.

./Code/alignment_trimming.py iqtree_path 'path_to_IQTREE_bin_directory' --input_file 'path_to_alignment'

2.- Changing algorithm

Phylogenetic inference running LG as model of evolution, 1000 replicates of Boostrap and 1000 replicates of alrt as support.

./Code/alignment_trimming.py iqtree_path 'path_to_IQTREE_bin_directory' --input_file 'path_to_alignment' --trimal_command '-m LG -b 1000 -alrt 1000

About

Suite of Pyhton programs for Gene extraction and Phylogentic Inference

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages