Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Theta doesn't converge #235

Open
binyaminZ opened this issue Mar 25, 2022 · 10 comments
Open

Theta doesn't converge #235

binyaminZ opened this issue Mar 25, 2022 · 10 comments

Comments

@binyaminZ
Copy link

Hi!
I'm running BASiCS with a large dataset, with 20-30K cells per sample. For noise quantification, I down-sampled it to ~5K cells. This data has no spike-ins. If I understand correctly, batch information is somehow used instead of spike-ins. The problem is that in my MCMC chains all parameters converge nicely (with N = 20000, Thin = 20, Burn = 10000) except for Theta, which obviously doesn't and is approaching Zero. I don't care so much about this, it seems like my replicates are very similar and basically there is no batch contribution to mean or noise. May this have some negative impact on the estimates of the other parameters? Or can I ignore this?

Finally, I would like to run BASiCS on the complete data set. Problem is that the down-sampled dataset runs for ~70 hours, and doing the same for the complete dataset is very long. What may help? Decreasing the number of genes quantified? Splitting the two batches into smaller groups (probably not optimal for many genes with sparse expression)?

Thanks!

@alanocallaghan
Copy link
Collaborator

I'm not sure what's going on with theta in your example. I'll try to get back to you on that, but in the meantime here's an example of some example code that will run MCMC across different subsets of genes and combine the results - it should be a good deal faster

library("BASiCS")
library("BiocParallel")
## register the number of cores here as normal with biocparallel
## you cant use multicoreparam currently with BASiCS
register(SnowParam(workers = 2))

## make some mock data
sce <- BASiCS_MockSCE(NGenes = 4000, NCells = 100)

## fit a model normally
fit_joint <- BASiCS_MCMC(sce,
    ## N,Thin,Burn here are far too low, as this is just for a demonstration
    N = 2000, Thin = 10, Burn = 1000, Regression = TRUE, WithSpikes = TRUE
)

## fit a model across 4 cores, each with a subset of the genes
fit_split <- BASiCS_MCMC(sce,
    ## N,Thin,Burn here are far too low, as this is just for a demonstration
    N = 2000, Thin = 10, Burn = 1000, Regression = TRUE, WithSpikes = TRUE,
    ## split the data into batches with 1000 genes each
    SubsetBy = "gene", NSubsets = 4
)
## compare
BASiCS_ShowFit(fit_joint)
BASiCS_ShowFit(fit_split)

@binyaminZ
Copy link
Author

See here convergence plots:
Mu-
image
Delta-
image
s-
image
nu-
image
Theta-
image

@catavallejos
Copy link
Owner

Hi @binyaminZ. You mention that you have 20-30K cells per sample. Is there a strong sub-structure in the data (i.e. do you have very distinct sub-types within those cells). BASiCS is designed for supervised analysis in which you have sorted distinct populations a priori and therefore strong substructure will break some of the model assumptions. If that's the case, perhaps you can run BASiCS separately for each distinct cell type? That will also speed up things a lot as you can run them in parallel.

@binyaminZ
Copy link
Author

Hi @catavallejos,
Thanks for your response! No, this is a very uniform population of Jurkat cells, and we are studying the effect of a certain treatment on noise. Dimensionality reductions (both UMAP and tSNE) show very nicely that there is no structure, just one population after some very basic filtering.

@binyaminZ
Copy link
Author

Dear @alanocallaghan,
trying to apply the parallelization code you suggested, I have a few questions:
a. if you register 2 workers but have 4 subsets, it means each core will run 2 consecutive jobs. does this still provide an advantage over subsetting into 2 subsets in terms of time?
b. don't I need to supply Threads = ncors parameter for BASiCS_MCMC? how does it evaluate the registered cores (by using detectCores() or availableCores() or something else directly seeing what I registered)?
c. what would be the minimal size you suggest for each subset? can I go down to 500, or 1000 is better? I probably get worse per-cell parameters as I reduce the gene batch size, not sure to what extent BASiCS_MCMC can integrate all the subsets after running in parallel.

Thanks!
Binyamin

@alanocallaghan
Copy link
Collaborator

alanocallaghan commented Mar 28, 2022

a. if you register 2 workers but have 4 subsets, it means each core will run 2 consecutive jobs. does this still provide an advantage over subsetting into 2 subsets in terms of time?

That's right, and no it probably won't provide any benefit over just doing 2 subsets. That's because you'll still just have two cores running sequentially through basically the same amount of work.

b. don't I need to supply Threads = ncors parameter for BASiCS_MCMC? how does it evaluate the registered cores (by using detectCores() or availableCores() or something else directly seeing what I registered)?

I don't know how well Threads will work with the divide and conquer approach above, honestly to be safe I'd just set Threads = 1 and use the workers for parallelisation.

c. what would be the minimal size you suggest for each subset? can I go down to 500, or 1000 is better? I probably get worse per-cell parameters as I reduce the gene batch size, not sure to what extent BASiCS_MCMC can integrate all the subsets after running in parallel.

The main issue I saw when prototyping and testing this stuff was that estimating the mean vs overdispersion trend becomes very unstable with a small amount of genes. The code tries to divide into batches semi-intelligently, but even so if you have only a few hundred genes the trend is very noisy. Based on what I saw in testing, 500 genes per subset is about the minimum, but if you can run with 1000, then that should be a bit more reliable.

@binyaminZ
Copy link
Author

Thanks!! I get it better now. (I have no experience with parallel processing in R, sorry for the confusion)
another thing that seems like a bug - when I specify the subsets for parallelization, I get the following error message:

basics_results = BASiCS_MCMC(data, N = 200, Thin = 20, Burn = 100, WithSpikes = FALSE, 
                             Regression = F, PrintProgress = FALSE,
                             StoreChains = TRUE, RunName = samp,
                             SubsetBy = "gene", NSubsets = ncors)
Error: BiocParallel errors
  8 remote errors, element index: 1, 2, 3, 4, 5, 6, ...
  0 unevaluated and other errors
  first remote error: object 'samp' not found

samp is a variable with my sample name, and it's present in my environment, but somehow it's not passed to the workers. when I use a quoted expression (like "my_sample"), the process is successful with no errors.

@alanocallaghan
Copy link
Collaborator

That's interesting! I don't know for sure what's going on there as different BiocParallel backends can be a bit odd. Can you post the rest of your code (just the stuff setting up BiocParallel and BASiCS)?

@binyaminZ
Copy link
Author

binyaminZ commented Mar 28, 2022

library(BASiCS)
library(Seurat)
library(data.table)
library("BiocParallel")
library(parallelly)

ncors = availableCores()
register(SnowParam(workers = ncors))

my_strsplit = function(x, sep, n)
{
  x = unlist(lapply(strsplit(x, sep), "[", n))
  return(x)
}

samp = commandArgs(trailingOnly=TRUE)

data = fread(paste0(samp, "_filtered_counts2.csv"),data.table = F, select = 1:150)
rownames(data) = data$V1
data = data[,-1]
data = BASiCS_Filter(as.matrix(data),MinAvCountsPerCellsWithExpression = 1.5)$Counts

repN = sub("rep", "", my_strsplit(colnames(data),"_",3))
data = SingleCellExperiment(assays = list(counts = data), 
                            colData = data.frame(BatchInfo = repN))
samp = paste0(samp, "_test")
basics_results = BASiCS_MCMC(data, N = 200, Thin = 20, Burn = 100, WithSpikes = FALSE, 
                             Regression = F, PrintProgress = FALSE,
                             StoreChains = TRUE, RunName = samp,
                             SubsetBy = "gene", NSubsets = ncors)
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] parallelly_1.26.1           BiocParallel_1.28.3         data.table_1.14.0          
 [4] SeuratObject_4.0.2          Seurat_4.0.3                BASiCS_2.7.5               
 [7] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0 Biobase_2.54.0             
[10] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0             
[13] S4Vectors_0.32.3            BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
[16] matrixStats_0.61.0         

loaded via a namespace (and not attached):
  [1] snow_0.4-4                plyr_1.8.6                igraph_1.2.11            
  [4] lazyeval_0.2.2            splines_4.1.0             listenv_0.8.0            
  [7] scattermore_0.7           ggplot2_3.3.5             digest_0.6.29            
 [10] htmltools_0.5.2           viridis_0.6.2             fansi_1.0.2              
 [13] magrittr_2.0.1            ScaledMatrix_1.2.0        tensor_1.5               
 [16] cluster_2.1.2             ROCR_1.0-11               limma_3.50.1             
 [19] globals_0.14.0            spatstat.sparse_2.0-0     colorspace_2.0-3         
 [22] ggrepel_0.9.1             dplyr_1.0.7               crayon_1.5.0             
 [25] RCurl_1.98-1.6            jsonlite_1.8.0            hexbin_1.28.2            
 [28] spatstat.data_2.1-0       survival_3.2-11           zoo_1.8-9                
 [31] glue_1.4.2                polyclip_1.10-0           gtable_0.3.0             
 [34] zlibbioc_1.40.0           XVector_0.34.0            leiden_0.3.8             
 [37] DelayedArray_0.20.0       BiocSingular_1.10.0       future.apply_1.7.0       
 [40] abind_1.4-5               scales_1.1.1              DBI_1.1.1                
 [43] edgeR_3.36.0              miniUI_0.1.1.1            Rcpp_1.0.8               
 [46] viridisLite_0.4.0         xtable_1.8-4              reticulate_1.20          
 [49] spatstat.core_2.2-0       dqrng_0.3.0               rsvd_1.0.5               
 [52] metapod_1.2.0             htmlwidgets_1.5.4         httr_1.4.2               
 [55] RColorBrewer_1.1-2        ellipsis_0.3.2            ica_1.0-2                
 [58] pkgconfig_2.0.3           scuttle_1.4.0             uwot_0.1.10              
 [61] deldir_0.2-10             locfit_1.5-9.4            utf8_1.2.2               
 [64] tidyselect_1.1.1          rlang_0.4.11              reshape2_1.4.4           
 [67] later_1.3.0               munsell_0.5.0             tools_4.1.0              
 [70] generics_0.1.0            ggridges_0.5.3            stringr_1.4.0            
 [73] fastmap_1.1.0             goftest_1.2-2             fitdistrplus_1.1-5       
 [76] purrr_0.3.4               RANN_2.6.1                pbapply_1.4-3            
 [79] future_1.21.0             nlme_3.1-152              sparseMatrixStats_1.6.0  
 [82] mime_0.12                 scran_1.22.1              ggExtra_0.9              
 [85] compiler_4.1.0            plotly_4.9.4.1            png_0.1-7                
 [88] spatstat.utils_2.2-0      tibble_3.1.6              statmod_1.4.36           
 [91] stringi_1.7.6             lattice_0.20-44           bluster_1.4.0            
 [94] Matrix_1.3-3              vctrs_0.3.8               pillar_1.7.0             
 [97] lifecycle_1.0.1           spatstat.geom_2.2-0       lmtest_0.9-38            
[100] RcppAnnoy_0.0.18          BiocNeighbors_1.12.0      cowplot_1.1.1            
[103] bitops_1.0-7              irlba_2.3.5               httpuv_1.6.5             
[106] patchwork_1.1.1           R6_2.5.1                  promises_1.2.0.1         
[109] KernSmooth_2.23-20        gridExtra_2.3             codetools_0.2-18         
[112] MASS_7.3-54               assertthat_0.2.1          sctransform_0.3.2        
[115] GenomeInfoDbData_1.2.7    mgcv_1.8-36               parallel_4.1.0           
[118] grid_4.1.0                rpart_4.1-15              beachmat_2.10.0          
[121] tidyr_1.1.3               coda_0.19-4               DelayedMatrixStats_1.16.0
[124] Rtsne_0.15                shiny_1.7.1              

@alanocallaghan
Copy link
Collaborator

Sorry for the delay getting back to this - for future reference it's easier to debug stuff like this if you give a reproducible example (ie something I can run locally without modifying, and not something that references files you have on your computer that I don't have access to). I'll patch out the code that causes this error in the next Bioc version

As a quick fix, you can leave out the StoreChains and RunName arguments, and save the object to disk yourself like this:

library("BASiCS")
library("Seurat")
library("data.table")
library("BiocParallel")
library("parallelly")

ncors <- availableCores()
register(SnowParam(workers = ncors))

data <- BASiCS_MockSCE()

samp <- "Run1"

basics_results <- BASiCS_MCMC(
    data, N = 200, Thin = 20, Burn = 100, WithSpikes = FALSE, 
    Regression = FALSE, PrintProgress = FALSE,
    SubsetBy = "gene", NSubsets = ncors
)
saveRDS(basics_results, paste0(samp, ".rds"))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants