A collection of assorted Python scripts for bioinformatics purposes
So you want to automatically retrieve a bunch of nucleotide sequences from NCBI. Of course there's Bio.Entrez, but NCBI kindly asks users to minimize the number of requests, and I happen to find myself in a situation where I have a bunch of text files, each of which contains the accession numbers for orthologous genes across several species.
Which brings us to Batch Entrez. This portal lets you upload exactly such text files (no coincidence) and will select the corresponding records for you. Rather than doing this manually for the 80+ files I have, I came up with this script, which takes all the files in a folder, submits them one by one to Batch Entrez, and saves the query results as FASTA.
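To give an idea of the "take all files from a folder" step, here is a minimal sketch. It is not the code from the script itself: the function name read_accession_files, the *.txt pattern, and the one-accession-per-line layout are all assumptions made for illustration.

```python
from pathlib import Path


def read_accession_files(folder, pattern="*.txt"):
    """Map each file name in `folder` to the accession numbers it lists.

    Assumes (hypothetically) one accession number per line; blank lines
    and surrounding whitespace are ignored.
    """
    accessions = {}
    # Sort so the files are processed in a predictable order.
    for path in sorted(Path(folder).glob(pattern)):
        lines = path.read_text().splitlines()
        accessions[path.name] = [line.strip() for line in lines if line.strip()]
    return accessions
```

Each entry of the returned dictionary then corresponds to one upload to Batch Entrez.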
The script contains quite a few time.sleep() calls sprinkled throughout the code to give the server the time it needs to respond (the sleep durations have been tuned by trial and error, up to a point). However, some of the requests will still fail, so I've wrapped the entire procedure in a loop that keeps retrying the failed requests (you can set a maximum number of tries, currently maxiter = 5).
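The retry idea can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: retry_failed, fetch, and pause are hypothetical names, and fetch stands in for whatever submits one file to Batch Entrez and raises an exception when the request fails. Only maxiter and the use of time.sleep() come from the script.

```python
import time


def retry_failed(items, fetch, maxiter=5, pause=1.0):
    """Run fetch(item) for every item, retrying failures for up to maxiter passes.

    `fetch` is a hypothetical callable that returns a result or raises on
    failure. Returns (results, still_failed).
    """
    pending = list(items)
    results = {}
    for _ in range(maxiter):
        if not pending:
            break  # everything succeeded, no need for another pass
        still_failing = []
        for item in pending:
            try:
                results[item] = fetch(item)
            except Exception:
                # Remember the failure and try again on the next pass.
                still_failing.append(item)
            time.sleep(pause)  # give the server some breathing room
        pending = still_failing
    return results, pending
```

Anything still listed in the second return value after maxiter passes has to be handled by hand.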
There's probably a better way to do this, but I'm not (yet) the Python wizard I'd like to be. And probably, since you're reading this, neither are you.