Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible to extract a m/z - rt slice of data? #3

Open
jorainer opened this issue Aug 8, 2018 · 29 comments
Open

Possible to extract a m/z - rt slice of data? #3

jorainer opened this issue Aug 8, 2018 · 29 comments

Comments

@jorainer
Copy link

jorainer commented Aug 8, 2018

Assuming I have an retention time range and an m/z range, what would be the fastest/easiest way to extract the intensity values within these ranges from an mzDB file?

Looking through the exported methods I didn't spot a function that would enable such a call.

Maybe a stupid question, but I assume mzDB files index the spectrum data by retention time/scan index (same as in mzML). Are intensity values also indexed by m/z (to support faster extraction in the above case)?

@david-bouyssie
Copy link
Contributor

Assuming I have an retention time range and an m/z range, what would be the fastest/easiest way to >extract the intensity values within these ranges from an mzDB file?
Looking through the exported methods I didn't spot a function that would enable such a call.

This is still in the list of TODOs. For now we only expose single spectrum or spectra iterator interfaces.
The XICs functions need to be ported from the Java implementation.

Maybe a stupid question, but I assume mzDB files index the spectrum data by retention time/scan index (same as in mzML). Are intensity values also indexed by m/z (to support faster extraction in the above case)?

Yes spectra are indexed by their primary key (database ID) and two other indexes (https://github.com/mzdb/mzdb-specs/blob/master/version_0.7.0/mzDB_0.7.0_schema_script.sql):

CREATE UNIQUE INDEX spectrum_initial_id_idx ON spectrum (initial_id ASC,run_id ASC);
CREATE INDEX spectrum_ms_level_idx ON spectrum (ms_level ASC,run_id ASC);

Peaks ([m/z - intensity] pairs) are also indexed by m/z thanks to the Bounding Box representation.
The default size of each MS1 level BB is [5 Da ; 15s].

@jorainer
Copy link
Author

jorainer commented Aug 9, 2018

thanks for the clarification

@jorainer jorainer closed this as completed Aug 9, 2018
@david-bouyssie
Copy link
Contributor

You are welcome.

I reopen to have it on the list of TODOs.

@david-bouyssie david-bouyssie reopened this Aug 9, 2018
@heejongkim
Copy link

Any progress on this issue?
It would be awesome to have XIC across RT function for DIA (given precursor m/z bin).

Thanks!

@david-bouyssie
Copy link
Contributor

No progress for now. The student who was working on this has finished his internship.
I have to update the current code base to enable this functionality.
The easiest possibility for me would be to create a RCpp C++ Class using the "BBOX ITERATOR" functions defined in libmzdb. It should not be very complicated.
Let see if I can work on this next week.

@heejongkim
Copy link

That would be tremendously helpful for my little script.
I used MSnbase to extract several thousands of XIC across entire RT (both ms1 and ms2). Even though I did nested parallelization, I can't go below 30min processing time per file. Directly utilizing the vendor's binary access is sweet but I'd like to stick to the standard format to support cross-platform. I can't wait your next update. I'm so excited!!

Thanks!

@jorainer
Copy link
Author

@heejongkim: to extract multiple XIC from the same file I suggest that you pass the definition for the mz and rt range as matrix to the chromatogram function (one row for e.g. lower and upper rt, or mz). The chromatogram for OnDiskMSnExp has been optimized to load the data required for each XIC only once. Alternatively, you could try to load the data into memory (i.e. as an MSnExp) - could also be a little faster.

@david-bouyssie
Copy link
Contributor

Here are some good news.
I have an experimental version which works on my laptop (SSD drive).

Here are some information regarding the dataset:

  • mzDB file size: 1.4 GB
  • acquisition type: DDA HCD (QEx HFX)
  • 23096 MS spectra ; 129471 MS/MS spectra
  • number of performed XICs: 10000

Total XICs duration: 10 seconds

I need to perform other benchmarks and to test on DIA files, but these are interesting preliminary results I wanted to share.

@heejongkim
Copy link

@jotsetung Thanks for the suggestion. By doing a minor change to accommodate the matrix-based extraction, I was able to bring it down to 17 min.

@david-bouyssie Awesome, I can't wait to give it a shot on my DIA datasets. Thanks for your great work.

@david-bouyssie
Copy link
Contributor

@heejongkim great. how many XICs do you perform?

@heejongkim
Copy link

@david-bouyssie Roughly 20,000 ~ 30,000 for now

@heejongkim
Copy link

@david-bouyssie That number range is the current rough estimation from my peptide table without counting M+1, M+2 or SILAC pairs. Also, this needs to be done on multiple files (from 10 to 100 raw files). So, I wonder if mzDB interface will behave well under foreach or mclapply parallelization.
Many Thanks!!

@david-bouyssie
Copy link
Contributor

@heejongkim: ok, thank you for the update.
I made some other tests on the same data file.
Here are the results:

  • 113920 XICs based on MS2 events
  • XICs duration when performed from C++ and returning C++ data structure =>13.7 secs
  • XICs duration when performed from C++ and returning R data structure =>24.2 secs
  • XICs duration when performed from R => 57.2 secs

Thus, in this context it appears that R is 2 to 4 times slower than bare metal C++.

If I goes down to 27451 XICs the total duration is about 8.11 seconds when calling from R.

Note that I have not checked to quality of the retrieved data. I hope everything went fine and that I made no mistake that could explain the so fast observed results.

The next step is to test reading DIA files with rmzdb. Stay tuned ;-)

@david-bouyssie
Copy link
Contributor

About the parallelization, I would suggest to not perform the XICs in parallel inside a given mzDB for multiple reasons that I could discuss here if you want.

You could effectively open multiple mzDB files in parallel and you might observe some slight improvements but I guess only for a limited number of concurrent file operations.

However, my advice would be to avoid parallelization for IO operations and to apply it to your post-processing pipeline where CPU usage might be important.

@heejongkim
Copy link

@david-bouyssie So, it's still awesomely fast so I don't think I really need to parallelize the XIC. I can't wait to try it out on my DIA datasets. Thank you so much for the update.

Out of curiosity, "XICs duration when performed from C++ and returning R data structure" <- this will be some sort of compiled Rcpp interface? I'm very much interested in performance improvement when appropriate and applicable.

Thanks!

@david-bouyssie
Copy link
Contributor

@heejongkim: effectively when I don't send the results back to R then the observed timings are better. It happens when everything is done in C++ only, without constructing the R data matrix and without returning it to R via Rcpp.
However, IMO the observed will probably be negligible compared to the time required to post-process the XICs.

And if you really want high performance data processing, it might be worth to implement some parts of your post-processing pipeline in C++ using Rcpp.

@heejongkim
Copy link

@david-bouyssie I totally agree. Eventually, I need to learn Rcpp for the post-processing for the alignment, co-elution evaluation, and scoring functions.
Whenever the test version is available, let me know.
Again, thank you so much!

@david-bouyssie
Copy link
Contributor

@heejongkim: good news. I just released a preview version of rmzdb including support for DIA XICs.
The code is not committed yet because I need to clean it a little bit and this will take time, and as you know this the most wanted resource...
The main issue I had is relative the XML parsing library I was using. I had some weird bugs when using it from RCpp. I spent some time trying to fix it, but in the end I decided to switch to a different library which faster and more stable.
I thus need to update the libmzdb project to finish this dependency switch.

In the meantime you can try this pre-release:
https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip

I hadn't the time to make QC against the retrieved XICs. So please double check that you obtain reliable results compared to what you observe with other file formats.
If you want more details we can plan a call.

Side note: I saw that you work in a core facility. You might be interested to check the tools we recently developed for DDA datasets: http://www.profiproteomics.fr/proline/
It is of course based on the mzDB file format and thus the quantification phase is very fast.
We will soon submit a manuscript to present some benchmarks regarding the quality of the obtained results.

@heejongkim
Copy link

@david-bouyssie Thank you so much for the alpha version. I'm so excited but embarrassingly having a little trouble to install the alpha version here.
Here's what I used to try it:

install.packages("https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip", repos = NULL, type="source", destdir = 'C:/temp/source', lib = 'C:/temp/libs')

trying URL 'https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip'
Content type 'application/octet-stream' length 1893409 bytes (1.8 MB)
downloaded 1.8 MB

Warning in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) :
cannot open compressed file 'rmzdb_0.1_dia_preview/DESCRIPTION', probable reason 'No such file or directory'
Error in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) :
cannot open the connection
In R CMD INSTALL
Warning in install.packages :
installation of package ‘C:/temp/source/rmzdb_0.1_dia_preview.zip’ had non-zero exit status

I thought that it's some sort of a permission problem so I started the Rstudio with admin permission and add destdir and lib locations but none of them was helpful.

Have you ever encountered something like this before?

Thanks!

@david-bouyssie
Copy link
Contributor

david-bouyssie commented Oct 12, 2018

It might be my mistake because I created the zip file manually. The zip file contains an rmzdb folder. Maybe the R installation system assumes that the package fIles are located at the root level of the archive.
What I could suggest is to unzip the archive and then call "R CMD build rmzdb" from the unzipped folder (the one containing the rmzdb folder).
Then from Rstudio change the workspace location to the same directory anf execute:
install.packages("rmzdb", repos = NULL, type="source", INSTALL_opts=c("--no-multiarch"))

@david-bouyssie
Copy link
Contributor

@heejongkim: I found a problem in the zip file. The DESCRIPTION file contained a blank line that was not respecting the R DESCRIPTION file writing convention. After removing this blank line, install.packages is effectively working.
Sorry for this issue, I edited the file just before uploading...

I just updated the released zip file, so if you download the fixed archive it should also work for you.
Note that to have a proper auto-completion of available methods (on each specific class/object) you need a recent version of R Studio. I discovered that old versions were not supporting this feature quite well.
See: rstudio/rstudio#1654

@heejongkim
Copy link

@david-bouyssie Unfortunately, I still get the same error message.

install.packages("https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip", type="source", repo=NULL)
trying URL 'https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip'
Content type 'application/octet-stream' length 1898519 bytes (1.8 MB)
downloaded 1.8 MB

Warning in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) :
cannot open compressed file 'rmzdb_0.1_dia_preview/DESCRIPTION', probable reason 'No such file or directory'
Error in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) :
cannot open the connection
In R CMD INSTALL
Warning in install.packages :
installation of package ‘C:/Users/ADMINI~1/AppData/Local/Temp/2/Rtmpa6Fv08/downloaded_packages/rmzdb_0.1_dia_preview.zip’ had non-zero exit status

Also, I updated the RStudio as of now. I believe the issue is still coming from DESCRIPTION.

Will try R CMD INSTALL ...

@david-bouyssie
Copy link
Contributor

@heejongkim: as I said previously, the provided zip file need to be first unzipped because it's a manually created archive. However I see that it's simpler to perform the install.packages directly on the archive so I replaced the manually created zip file by a tar.gz one created using R CMD BUILD.

I tested the following command on a clean computer and it works:
install.packages("https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.tar.gz", type="source", repo=NULL, INSTALL_opts=c("--no-multiarch"))

I hope it will also work for you!

@heejongkim
Copy link

@heejongkim Ooops. It got installed beautifully. Let me give a shot.

@david-bouyssie
Copy link
Contributor

@heejongkim: great :D If you want some code examples, have look at the script named "test_dia.R" (should be inside the tar.gz archive). If something is not clear just let me know. I hope it will perform well with your datasets.
Side note: if you want to increase performance, it might be possible to change the settings you use to create the mzDB file.s If you define smaller m/z windows (let say 1 Da instead of 5 Da) you may observe a performance increase. However the file size should be a little bit higher. I never made this comparison on DIA data so I may be wrong...

@david-bouyssie
Copy link
Contributor

@heejongkim So, are you satisfied with the current release?

@heejongkim
Copy link

@david-bouyssie Sorry. I had a grant due. I'm back and will check it out tonight (hopefully).
Thanks!

@heejongkim
Copy link

@david-bouyssie

It seems it doesn't work on my hands and probably it's due to my dumb setting or conversion.
If you can enlighten me, it would be great.

Here's what I did.
First, I made the mzDB by following command:
raw2mzDB_0.9.10_build20170802\raw2mzDB.exe -c 1-2 -a dia -s -i 1.raw -o 1.mzDB
---- last few lines of outputs ----
W1027 13:15:45.087618 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode
W1027 13:15:45.090617 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode
W1027 13:15:45.093627 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode
W1027 13:15:45.097627 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode
W1027 13:15:45.100641 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode
W1027 13:15:45.103631 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode
W1027 13:15:45.106627 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode
[======> ] 12% I1027 13:15:47.431748 42164 mzdb_writer.hpp:536] Creating permanent spectrum table
I1027 13:15:47.519750 42164 mzdb_writer.cpp:368] Creates indexes...
I1027 13:15:47.892771 42164 spectrum_inserter.h:599]

************ Summary of the conversion ************

Created file: C:\Users\Administrator\Desktop\jcl\HPV_BPV_runs\20181013_jcl_McBride_293TT_HPVctrls_histones_1.mzDB
MS Input Output Nb spectra
MS1 PROFILE CENTROID 11541

I1027 13:15:47.896770 42164 raw2mzDB.cpp:635] Checking run slices numbers
I1027 13:15:48.995817 42164 raw2mzDB.cpp:652] Elapsed Time: 292.547 sec
---------------- End of the output from raw2mzDB

after that, I followed your test script and, from dia window extraction, I get nothing:

mzdb_obj <- new(MzDb, "1.mzDB")
Opening mzDB file...
bpc_mat <- mzdb_obj$get_bpc_matrix(1)
print(nrow(bpc_mat))
[1] 11541
dia_windows <- mzdb_obj$get_dia_isolation_windows()
print(nrow(dia_windows))
[1] 0
head(dia_windows)
<0 x 0 matrix>

Thank you so much!

best,
hee jong

p.s. I opened the mzDB directly from SQLite Browser but I realized that I need to study your paper more. I'm heading back to mzDB paper now.

@david-bouyssie
Copy link
Contributor

It seems that your DIA windows are not well defined. It may be an issue relative to the raw file conversion.
Could you try to convert with the option -f 1-2 instead of -c 1-2 ?
It should not have a direct link with this issue bit I would to check if it changes something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants