Taxonomy identifiers for Cov reference database #45
Comments
I also downloaded the original
Desired Fields (when avail)
RCE: deleted requested field "FASTA sequence length" because it duplicates information in the sequence itself.
I'm looking at this now. The most recent R package (https://bioconductor.org/packages/release/bioc/html/genbankr.html) does not allow only some of the entries to have chromosome names:
I guess I will keep looking. I don't know whether they want the files to just be consistent, or whether we actually need the info. Otherwise, we could delete the chromosome name beforehand. Edit: it seems only 21 entries have a chromosome name, so I will delete it.
OPEN questions:
I'm taking the string between the 1st and 2nd comma, but it's not 100% consistent, more like 90%.
This issue is solved by having the GenBank records; if we have those, there is no need to copy fields into FASTA headers.
@rcedgar
None that I know of right now. Sorry you wasted time on that. Things are a bit disorganized because we're moving quickly; not sure how to avoid this problem in future, perhaps wait until an issue is assigned to you.
makes sense, thanks for the quick response
I set up @Bdegraaf1234 on this. I'd like to have that GenBank metadata in a flat TSV file and in an R data.frame for easier downstream analysis. There are a lot of these small formatting inconsistencies (i.e. data tidying), and fixing them will be incredibly helpful moving forward.
We can swing back to these once the main data is cleaned up and well annotated. These values are something we are generating as we process these bulk sequences into what we are calling a 'pan-genome'. You can see this processing procedure in this notebook entry
The thing with "duplicates" is a formatting problem caused by copy-pasting the list. See the first time it appears above; I edited the list to remove "sequence length" from @ababaian's original wish-list on the grounds that it was redundant (which it would be with FASTA, but not with TSV).
If we're going to re-format the gb records so that they are more easily parsed, then I would suggest using FASTA per my original "ABC" proposal. This is more flexible and forwards-compatible than TSV, and avoids introducing a new file format -- everything can read fasta, and parsing a defline with fields is just as easy as parsing a tsv file, e.g. in Python it is trivial to do |
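For illustration only, here is a minimal sketch of defline parsing. The layout `>ACCESSION;key=value;key=value` and the field names `taxid` and `host` are assumptions made for this example; the thread does not spell out the "ABC" format, so treat the function name and delimiters as hypothetical.

```python
def parse_defline(defline):
    """Split a FASTA defline into (accession, fields).

    Assumed (hypothetical) layout: >ACCESSION;key=value;key=value
    """
    parts = defline.lstrip(">").strip().split(";")
    accession = parts[0]
    fields = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing ';'
        key, _, value = part.partition("=")
        fields[key] = value
    return accession, fields


acc, fields = parse_defline(">MH726362.1;taxid=28295;host=Sus scrofa;")
print(acc, fields)
```

Values come back as strings, so a caller that needs an integer taxonomy id would convert it explicitly.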
What's the current status on this issue?
Closing this and moving discussion to #101
It would be useful to include taxonomy information in the pan-genome reference to enable analyses such as taxonomy-aware summaries of bowtie2 hits. Taxonomy descriptions are included in the full FASTA deflines, but not the taxonomy id (i.e., the integer accession in the NCBI Taxonomy database). The integer accession is better because it places the sequence in a tree structure. For example, MH726362.1 is "Porcine epidemic diarrhea virus isolate GDS05, complete genome" and the taxonomy id in the GenBank record is 28295. To implement this, we need a script which can take a long list (~33k) of sequence accessions (e.g. MH726362.1) and fetch their taxonomy identifiers. I'm guessing this can be done by a simple query using Entrez or something like that.
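One possible shape for such a script, as a sketch rather than a tested pipeline: it posts batches of accessions to the NCBI E-utilities ESummary endpoint for `nuccore` and reads the taxonomy id out of the XML. It assumes the version 1.0 DocSum carries `AccessionVersion` and `TaxId` items, which should be checked against a real response; the batch size of 200 and the helper names are choices made for this example.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_taxids(xml_text):
    """Map AccessionVersion -> TaxId from an ESummary XML document.

    Assumes each DocSum has Item elements named AccessionVersion and TaxId.
    """
    result = {}
    for docsum in ET.fromstring(xml_text).iter("DocSum"):
        items = {item.get("Name"): item.text for item in docsum.iter("Item")}
        if items.get("AccessionVersion") and items.get("TaxId"):
            result[items["AccessionVersion"]] = int(items["TaxId"])
    return result


def fetch_taxids(accessions, batch_size=200, pause=0.4):
    """Query ESummary in batches (needs network access)."""
    taxids = {}
    for batch in chunks(accessions, batch_size):
        data = urllib.parse.urlencode(
            {"db": "nuccore", "id": ",".join(batch)}
        ).encode()
        # Passing `data` makes this an HTTP POST, which keeps long id lists
        # out of the URL.
        with urllib.request.urlopen(EUTILS, data=data) as response:
            taxids.update(parse_taxids(response.read().decode("utf-8")))
        time.sleep(pause)  # stay under the rate limit for requests without an API key
    return taxids
```

Calling `fetch_taxids(["MH726362.1"])` should return a one-entry dictionary if the assumptions above hold. Accessions missing from the result would need a follow-up check, since a withdrawn or mistyped accession produces no DocSum.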
Even better would be to have the taxonomy identifier (ideal) or species name (good) of the host as well. This can be done by downloading the GenBank record and extracting the "/host" annotation, which gives the species name, e.g. for MH726362.1 you would find /host="Sus scrofa". The taxonomy identifier of the host is not given, so an additional lookup (Entrez?) would be required to retrieve it.
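A minimal sketch of the extraction step, assuming the record has already been downloaded as GenBank flat-file text (for example via EFetch with `rettype=gb`). It uses a regular expression rather than a full GenBank parser, and it assumes the first `taxon:` cross-reference in the record belongs to the sequence's own source feature. The function name and the hand-written excerpt are illustrative.

```python
import re

# Matches qualifiers such as /host="Sus scrofa"; values may wrap across lines.
QUALIFIER = re.compile(r'/(host|db_xref)="([^"]*)"')


def host_and_taxid(genbank_text):
    """Return (host, taxid) from the flat-file text of one GenBank record.

    Either element is None when the corresponding qualifier is absent.
    """
    host, taxid = None, None
    for name, value in QUALIFIER.findall(genbank_text):
        value = " ".join(value.split())  # undo line wrapping inside the value
        if name == "host" and host is None:
            host = value
        elif name == "db_xref" and value.startswith("taxon:") and taxid is None:
            taxid = int(value[len("taxon:"):])
    return host, taxid


# Hand-written excerpt using the values quoted above for MH726362.1.
excerpt = '''
                     /host="Sus scrofa"
                     /db_xref="taxon:28295"
'''
print(host_and_taxid(excerpt))  # ('Sus scrofa', 28295)
```

The host comes back as free text, so mapping it to a taxonomy id would still need a separate name lookup, and inconsistent spellings of host names would have to be handled there.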