[ENH] Include data processing steps, reference to which the reads were aligned or if possible lab protocol into the main table #188

Open
ajandria opened this issue Apr 11, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@ajandria

Is your feature request related to a problem? Please describe.

I was wondering whether it is possible to also retrieve the data processing description that is present in a sample's record in GEO. See here for an example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6005004. The record contains a lot of information that we would like to see in the table that pysradb generates:

Status
Title
Sample type
Source name
Organism
Characteristics
Treatment protocol
Growth protocol
Extracted molecule
Extraction protocol
Library strategy
Library source
Library selection
Instrument model
Description
Data processing

Describe the solution you'd like

I like the table that is currently generated using the following:
df = db.sra_metadata(df["study_accession"], detailed = True, expand_sample_attributes = True, output_read_lengths = True)
although I feel it sometimes lacks crucial information that is only included in GEO under the individual sample records. For example, in the record of the sample linked above you can find the following:

Sequenced reads were trimmed for adaptor sequence and low-quality sequence (bbduk; minlength=30, qtrim=rl, trimq=15)
Reads were then mapped to the reference genome of Mus musculus (GRCm38) using STAR aligner version 2.5.3a with parameters --quantMode GeneCounts --runThreadN 4
Assembly: GRCm38

It would be nice to get that into the sra_metadata table too, if possible. For now I could use GEOquery for that and then merge the two tables by GSM sample ID, although I would need to test that. In that case the hassle of including it here would probably be redundant. Still, it seems like a nice direction to take to expand this :)
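
As a rough sketch of that workaround, assuming the GEO fields have already been exported to a table keyed by GSM ID (for example from GEOquery), the merge could look something like this in pandas. The `gsm` and `data_processing` column names and the placeholder accessions are made up for illustration, and it assumes that `experiment_alias` in the pysradb table holds the GSM ID:

```python
import pandas as pd

# Stand-in for the table returned by db.sra_metadata(..., detailed=True).
# The accessions marked PLACEHOLDER are not real records.
sra_df = pd.DataFrame(
    {
        "run_accession": ["SRR_PLACEHOLDER_1", "SRR_PLACEHOLDER_2"],
        "experiment_alias": ["GSM6005004", "GSM_PLACEHOLDER"],
    }
)

# Hypothetical GEO table (e.g. exported from GEOquery); column names are assumptions.
geo_df = pd.DataFrame(
    {
        "gsm": ["GSM6005004"],
        "data_processing": ["Assembly: GRCm38"],
    }
)

# Left join keeps every SRA row; indicator=True adds a "_merge" column
# showing which rows found a matching GEO record.
merged = sra_df.merge(
    geo_df,
    how="left",
    left_on="experiment_alias",
    right_on="gsm",
    indicator=True,
)
print(merged[["run_accession", "experiment_alias", "data_processing", "_merge"]])
```

A left join seems the safer choice here, since samples without a GEO record stay in the table with empty GEO columns instead of being dropped.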

Thank you for your work so far!

@ajandria ajandria added the enhancement New feature or request label Apr 11, 2023
@ajandria ajandria changed the title [ENH] [ENH] Include data processing steps, reference to which the reads were aligned or if possible lab protocol into the main table Apr 11, 2023
@saketkc
Owner

saketkc commented Apr 11, 2023

Thanks, this is a great suggestion! It is doable: once the experiment_alias is fetched, pysradb would need to make another request for the corresponding detailed GEO metadata. I currently do not have the bandwidth to do this, but pull requests are always welcome!
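
For anyone picking this up, here is a minimal sketch of what that extra request might look like. It is not part of pysradb. It assumes the GSM ID is available (e.g. from experiment_alias) and uses GEO's plain-text SOFT view of a sample record, where fields appear as `!Sample_<key> = <value>` lines. The function names are hypothetical, and the embedded example text is adapted from the sample record quoted in the issue.

```python
from collections import defaultdict
from urllib.parse import urlencode
from urllib.request import urlopen

GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def geo_sample_url(gsm):
    """Build the URL for the plain-text (SOFT) view of one GEO sample."""
    query = urlencode({"acc": gsm, "targ": "self", "form": "text", "view": "quick"})
    return f"{GEO_ACC_URL}?{query}"


def parse_soft_sample(text):
    """Collect '!Sample_<key> = <value>' lines into {key: [values]}."""
    fields = defaultdict(list)
    prefix = "!Sample_"
    for line in text.splitlines():
        if not line.startswith(prefix) or "=" not in line:
            continue
        # Split on the first "=" only, since values may contain "=" themselves.
        key, _, value = line.partition("=")
        fields[key[len(prefix):].strip()].append(value.strip())
    return dict(fields)


def fetch_geo_sample(gsm, timeout=30):
    """Download one sample record and parse it (needs network access)."""
    with urlopen(geo_sample_url(gsm), timeout=timeout) as response:
        return parse_soft_sample(response.read().decode("utf-8", errors="replace"))


# Offline check of the parser on text shaped like a SOFT sample record.
example = """^SAMPLE = GSM6005004
!Sample_data_processing = Sequenced reads were trimmed for adaptor sequence and low-quality sequence (bbduk; minlength=30, qtrim=rl, trimq=15)
!Sample_data_processing = Assembly: GRCm38
"""
parsed = parse_soft_sample(example)
print(parsed["data_processing"])
```

Repeated keys such as data_processing are kept as lists, so they could be joined into a single string before being added as a column to the sra_metadata table.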
