L1000 peak deconvolution based on Bayesian analysis

Overview

This project is intended to generate high quality perturbagen signatures from LINCS L1000 assay. We build a pipeline, in parallel with L1000 group, to process raw fluorescent intensity data into z-scores as perturbagen signatures. Pre-computed datasets covering a majority of LINCS L1000 Phase I and Phase II is available in Downloads and Zenodo.

Our pipeline is different from the L1000 pipeline mostly in the peak deconvolution algorithm. We implement our algorithm in both C++ and CUDA, which can be used with various languages. We give two examples for how to use these functions with C++ natively and how to be called in Wolfram Mathematica.

Also, we have prepared a small batch of real data and relavant code for you to test our pipeline at a very small scale. You may follow the instructions, run the pipeline, and check the results.

Datasets

Summary

LINCS L1000 Phase I (GSE92742) & Phase II (GSE70138) datasets generated by our pipeline are currently available. The datasets cover three levels: Our Level 4 and Level 5 data are equivalent to Level 4 and Level 5 data provided by L1000; the marginal distributions data of peak locations (GSE92742 small molecule treatments only and GSE70138) are similiar to L1000 Level 2 data, except that they are probability distributions instead of precise numbers of peak locations.

Unless you are interested in managing z-score inference and combination, we encourage you to use combined z-scores by bio-replicates (Level 5 data).

Downloads

Description	Download
Marginal distributions of peak locations	Bayesian_GSE70138_Level2_DPEAK.zip Bayesian_GSE92742_Level2_DPEAK.zip
Plate control z-scores	Bayesian_GSE70138_Level4_ZSPC_n335465x978.h5 Bayesian_GSE92742_Level4_ZSPC_n1093191x978.h5
Combined z-scores by bio-replicates	Bayesian_GSE70138_Level5_COMPZ_n116218x978.h5 Bayesian_GSE92742_Level5_COMPZ_n361481x978.h5
Checksum	Bayesian_L1000_sha512sum.txt

The meta data are available from the publication by L1000 group: GSE70138 and GSE92742. They include perturbagen and cell line information associated with signature and instance IDs in the datasets.

Data stuctures

The z-score results (as HDF5) are compatible with those published by L1000 group. Each of them contains three datasets as follows:

/colid are the signature IDs (Level 5) or instance IDs (Level 4);
/rowid are the names of landmark genes;
/data are the z-scores as a matrix.

Each marginal distribution file contain the information of peak locations on one plate. It contains four datasets as follows:

/colid are the instance IDs;
/rowid are the names of landmark genes;
/peakloc are the locations of the peaks for calculating likelihood function;
/data are encoded log-likelihoods as a rank-3 array of 16-bit unsigned integers. To retrieve the log-likelihoods, the values should be multiplied by a factor of -0.001. Note that they are not normalized.

Citation

Qiu, Yue, et al., 2020, Bioinformatics, 36(9), 2787, https://doi.org/10.1093/bioinformatics/btaa064

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

L1000 peak deconvolution based on Bayesian analysis

Overview

Datasets

Summary

Downloads

Data stuctures

Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

L1000 peak deconvolution based on Bayesian analysis

Overview

Datasets

Summary

Downloads

Data stuctures

Citation