New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible to extract a m/z - rt slice of data? #3
Comments
This is still in the list of TODOs. For now we only expose single spectrum or spectra iterator interfaces.
Yes spectra are indexed by their primary key (database ID) and two other indexes (https://github.com/mzdb/mzdb-specs/blob/master/version_0.7.0/mzDB_0.7.0_schema_script.sql):
Peaks ([m/z - intensity] pairs) are also indexed by m/z thanks to the Bounding Box representation. |
thanks for the clarification |
You are welcome. I reopen to have it on the list of TODOs. |
Any progress on this issue? Thanks! |
No progress for now. The student who was working on this has finished his internship. |
That would be tremendously helpful for my little script. Thanks! |
@heejongkim: to extract multiple XIC from the same file I suggest that you pass the definition for the mz and rt range as |
Here are some good news. Here are some information regarding the dataset:
Total XICs duration: 10 seconds I need to perform other benchmarks and to test on DIA files, but these are interesting preliminary results I wanted to share. |
@jotsetung Thanks for the suggestion. By doing a minor change to accommodate the matrix-based extraction, I was able to bring it down to 17 min. @david-bouyssie Awesome, I can't wait to give it a shot on my DIA datasets. Thanks for your great work. |
@heejongkim great. how many XICs do you perform? |
@david-bouyssie Roughly 20,000 ~ 30,000 for now |
@david-bouyssie That number range is the current rough estimation from my peptide table without counting M+1, M+2 or SILAC pairs. Also, this needs to be done on multiple files (from 10 to 100 raw files). So, I wonder if mzDB interface will behave well under foreach or mclapply parallelization. |
@heejongkim: ok, thank you for the update.
Thus, in this context it appears that R is 2 to 4 times slower than bare metal C++. If I goes down to 27451 XICs the total duration is about 8.11 seconds when calling from R. Note that I have not checked to quality of the retrieved data. I hope everything went fine and that I made no mistake that could explain the so fast observed results. The next step is to test reading DIA files with rmzdb. Stay tuned ;-) |
About the parallelization, I would suggest to not perform the XICs in parallel inside a given mzDB for multiple reasons that I could discuss here if you want. You could effectively open multiple mzDB files in parallel and you might observe some slight improvements but I guess only for a limited number of concurrent file operations. However, my advice would be to avoid parallelization for IO operations and to apply it to your post-processing pipeline where CPU usage might be important. |
@david-bouyssie So, it's still awesomely fast so I don't think I really need to parallelize the XIC. I can't wait to try it out on my DIA datasets. Thank you so much for the update. Out of curiosity, "XICs duration when performed from C++ and returning R data structure" <- this will be some sort of compiled Rcpp interface? I'm very much interested in performance improvement when appropriate and applicable. Thanks! |
@heejongkim: effectively when I don't send the results back to R then the observed timings are better. It happens when everything is done in C++ only, without constructing the R data matrix and without returning it to R via Rcpp. And if you really want high performance data processing, it might be worth to implement some parts of your post-processing pipeline in C++ using Rcpp. |
@david-bouyssie I totally agree. Eventually, I need to learn Rcpp for the post-processing for the alignment, co-elution evaluation, and scoring functions. |
@heejongkim: good news. I just released a preview version of rmzdb including support for DIA XICs. In the meantime you can try this pre-release: I hadn't the time to make QC against the retrieved XICs. So please double check that you obtain reliable results compared to what you observe with other file formats. Side note: I saw that you work in a core facility. You might be interested to check the tools we recently developed for DDA datasets: http://www.profiproteomics.fr/proline/ |
@david-bouyssie Thank you so much for the alpha version. I'm so excited but embarrassingly having a little trouble to install the alpha version here. install.packages("https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip", repos = NULL, type="source", destdir = 'C:/temp/source', lib = 'C:/temp/libs')trying URL 'https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip' Warning in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) : I thought that it's some sort of a permission problem so I started the Rstudio with admin permission and add destdir and lib locations but none of them was helpful. Have you ever encountered something like this before? Thanks! |
It might be my mistake because I created the zip file manually. The zip file contains an rmzdb folder. Maybe the R installation system assumes that the package fIles are located at the root level of the archive. |
@heejongkim: I found a problem in the zip file. The DESCRIPTION file contained a blank line that was not respecting the R DESCRIPTION file writing convention. After removing this blank line, install.packages is effectively working. I just updated the released zip file, so if you download the fixed archive it should also work for you. |
@david-bouyssie Unfortunately, I still get the same error message.
Warning in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) : Also, I updated the RStudio as of now. I believe the issue is still coming from DESCRIPTION. Will try R CMD INSTALL ... |
@heejongkim: as I said previously, the provided zip file need to be first unzipped because it's a manually created archive. However I see that it's simpler to perform the install.packages directly on the archive so I replaced the manually created zip file by a tar.gz one created using R CMD BUILD. I tested the following command on a clean computer and it works: I hope it will also work for you! |
@heejongkim Ooops. It got installed beautifully. Let me give a shot. |
@heejongkim: great :D If you want some code examples, have look at the script named "test_dia.R" (should be inside the tar.gz archive). If something is not clear just let me know. I hope it will perform well with your datasets. |
@heejongkim So, are you satisfied with the current release? |
@david-bouyssie Sorry. I had a grant due. I'm back and will check it out tonight (hopefully). |
It seems it doesn't work on my hands and probably it's due to my dumb setting or conversion. Here's what I did. ************ Summary of the conversion ************ Created file: C:\Users\Administrator\Desktop\jcl\HPV_BPV_runs\20181013_jcl_McBride_293TT_HPVctrls_histones_1.mzDB I1027 13:15:47.896770 42164 raw2mzDB.cpp:635] Checking run slices numbers after that, I followed your test script and, from dia window extraction, I get nothing:
Thank you so much! best, p.s. I opened the mzDB directly from SQLite Browser but I realized that I need to study your paper more. I'm heading back to mzDB paper now. |
It seems that your DIA windows are not well defined. It may be an issue relative to the raw file conversion. |
Assuming I have an retention time range and an m/z range, what would be the fastest/easiest way to extract the intensity values within these ranges from an mzDB file?
Looking through the exported methods I didn't spot a function that would enable such a call.
Maybe a stupid question, but I assume mzDB files index the spectrum data by retention time/scan index (same as in mzML). Are intensity values also indexed by m/z (to support faster extraction in the above case)?
The text was updated successfully, but these errors were encountered: