Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convergence issues due to all cells in comparison expressed certain genes? #151

Open
hoholee opened this issue Mar 4, 2021 · 2 comments

Comments

@hoholee
Copy link

hoholee commented Mar 4, 2021

Hi MAST developers,

I ran into some convergence issues when I tried to fit a model with a random effect term. See the attached minimum reproducible rds file and the following code block.
(This may relate to these issues: #147 #148 #142)

If I'm understanding this correctly, the reason that genes MALAT1 and GPM6A failed to converge on the discrete part of the model is that all the cells are expressing these genes. So when the model is trying to predict a constant value (1 for expressed in this case) from the covariants it raised the error "Error: Response is constant". But this is also true when I ran it with a fixed model (excluding the subject term). How come in that model the discrete part for these two genes fit well? Is it because of the eBayes steps used in the fixed model (which is turned off in the glmer model used with the random effect)?

Another issue I noticed is for some genes like CADM2 and MAGI, MAST reported extremely huge FC and confidence intervals. I'm not sure whether this is also because of the imbalance of the number of cells expressing or not expressing these genes, but there are only < 5 cells that do not express these genes (see the code block below). Is it why the coefficients of the discrete part are so huge and hence the extreme FC and CI?

Genes are expressed in all cells in comparison can still be potentially interesting, especially in scenarios that comparing disease vs. control samples. So what would you recommend to do if I want to recover these genes given the mixed model? Can I compute the logFC and p-value only using the continuous part of the model? If so can you point me to the code block that I need to modify to achieve that? (I don't quite understand how you compute the logFC and p-value in MAST).

Any help would be appreciated. Thanks in advance!

Best,
Junhao.

test_sca.rds.zip

library(tidyverse)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
library(NMF)
#> Loading required package: pkgmaker
#> Loading required package: registry
#> Loading required package: rngtools
#> Loading required package: cluster
#> NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 31/32
#>   To enable shared memory capabilities, try: install.extras('
#> NMF
#> ')
library(rsvd)
library(lme4)
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
library(MAST)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'matrixStats'
#> The following objects are masked from 'package:Biobase':
#> 
#>     anyMissing, rowMedians
#> The following object is masked from 'package:dplyr':
#> 
#>     count
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> The following object is masked from 'package:Biobase':
#> 
#>     rowMedians
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:Matrix':
#> 
#>     expand
#> The following object is masked from 'package:NMF':
#> 
#>     nrun
#> The following object is masked from 'package:pkgmaker':
#> 
#>     new2
#> The following objects are masked from 'package:data.table':
#> 
#>     first, second
#> The following objects are masked from 'package:dplyr':
#> 
#>     first, rename
#> The following object is masked from 'package:tidyr':
#> 
#>     expand
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> Loading required package: IRanges
#> 
#> Attaching package: 'IRanges'
#> The following object is masked from 'package:data.table':
#> 
#>     shift
#> The following objects are masked from 'package:dplyr':
#> 
#>     collapse, desc, slice
#> The following object is masked from 'package:purrr':
#> 
#>     reduce
#> Loading required package: GenomeInfoDb

sca <- readRDS("test_sca.rds") # expression values are log2(CPM)

# fitting with a fixed model works fine
zlmCond <- zlm(~disease + cn_genes_on + cn_nCount_SCT + cn_percent_mt + cn_age + sex + seq_batch,
               sca)
#> Warning in .nextMethod(object = object, value = value): Coefficients
#> seq_batchbatch_6 are never estimible and will be dropped.
#> 
#> Done!

lrt_term <- "diseaseALS" 

summary_cond <- summary(zlmCond, doLRT = lrt_term)
#> Combining coefficients and standard errors
#> Calculating log-fold changes
#> Calculating likelihood ratio tests
#> Refitting on reduced model...
#> 
#> Done!

summary_Dt <- summary_cond$datatable

fcHurdle <- merge(summary_Dt[contrast==lrt_term & component=='H',.(primerid, `Pr(>Chisq)`)],
                  summary_Dt[contrast==lrt_term & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid')
fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]

fcHurdle_df <- fcHurdle %>%
  as_tibble() %>%
  arrange(fdr) %>% 
  dplyr::rename(gene = primerid,
                p_value = `Pr(>Chisq)`,
                log2FC = coef)

fcHurdle_df
#> # A tibble: 6 x 6
#>   gene     p_value  log2FC   ci.hi   ci.lo       fdr
#>   <chr>      <dbl>   <dbl>   <dbl>   <dbl>     <dbl>
#> 1 GFAP   2.02e-187  3.20    3.47    2.93   1.21e-186
#> 2 GPM6A  1.84e-105 -0.558  -0.510  -0.607  5.53e-105
#> 3 CADM2  1.07e- 42 -0.368  -0.313  -0.422  2.14e- 42
#> 4 ERBB4  1.58e- 25 -0.302  -0.247  -0.358  2.36e- 25
#> 5 MAGI2  1.26e- 18 -0.244  -0.190  -0.299  1.51e- 18
#> 6 MALAT1 8.08e-  1  0.0108  0.0434 -0.0218 8.08e-  1

# fitting with a mixed model with a random effect term raise issues...
zlmCond_mixed <- zlm(~disease + cn_genes_on + cn_nCount_SCT + cn_percent_mt + cn_age + sex + seq_batch + (1|subject),
               sca,
               method = 'glmer',
               ebayes = FALSE,
               strictConvergence = FALSE,
               fitArgsD = list(nAGQ = 0)
)
#> Warning in .nextMethod(object = object, value = value): Coefficients
#> seq_batchbatch_6 are never estimible and will be dropped.
#> 
#> Done!

lrt_term <- "diseaseALS" 

summary_cond_mixed <- summary(zlmCond_mixed, doLRT = lrt_term, fitArgsD = list(nAGQ = 0))
#> Combining coefficients and standard errors
#> Calculating log-fold changes
#> Calculating likelihood ratio tests
#> Refitting on reduced model...
#> 
#> Done!

summary_Dt_mixed <- summary_cond_mixed$datatable

fcHurdle_mixed <- merge(summary_Dt_mixed[contrast==lrt_term & component=='H',.(primerid, `Pr(>Chisq)`)],
                  summary_Dt_mixed[contrast==lrt_term & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid')
fcHurdle_mixed[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]

fcHurdle_df_mixed <- fcHurdle_mixed %>%
  as_tibble() %>%
  arrange(fdr) %>% 
  dplyr::rename(gene = primerid,
                p_value = `Pr(>Chisq)`,
                log2FC = coef)

fcHurdle_df_mixed # genes failed to converge (like GPM6A and MALAT1) and genes with extremely huge FC and CI (like CADM2 and MAGI2)
#> # A tibble: 6 x 6
#>   gene      p_value  log2FC      ci.hi      ci.lo       fdr
#>   <chr>       <dbl>   <dbl>      <dbl>      <dbl>     <dbl>
#> 1 GFAP    0.0000863   3.36      5.80        0.923  0.000345
#> 2 CADM2   0.00189   -10.1   10167.     -10187.     0.00324 
#> 3 ERBB4   0.00243    -0.240    -0.0962     -0.384  0.00324 
#> 4 MAGI2   0.210      -5.46   9654.      -9665.     0.210   
#> 5 GPM6A  NA         NaN       NaN         NaN     NA       
#> 6 MALAT1 NA         NaN       NaN         NaN     NA

# try to manually fit the model as recommended in other Github issues
y <- assay(sca)["MALAT1",]
dat <- colData(sca)
dat <- data.frame(y,dat)
dat$y_disc <- dat$y>0

dat %>% as_tibble() %>% dplyr::count(disease, subject, y_disc)
#> # A tibble: 12 x 4
#>    disease subject y_disc     n
#>    <fct>   <chr>   <lgl>  <int>
#>  1 Control 1069    TRUE     480
#>  2 Control 902     TRUE     871
#>  3 Control 904     TRUE     216
#>  4 Control 906     TRUE     416
#>  5 Control 91      TRUE     286
#>  6 Control 945     TRUE     319
#>  7 ALS     110     TRUE     693
#>  8 ALS     111     TRUE     302
#>  9 ALS     113     TRUE     338
#> 10 ALS     332     TRUE      15
#> 11 ALS     388     TRUE     189
#> 12 ALS     52      TRUE     312

disc_fit <-
  glmer(
    y_disc ~ disease + cn_genes_on + cn_nCount_SCT + cn_percent_mt + cn_age + sex + seq_batch + (1 | subject),
    data = dat,
    family = "binomial",
    control = glmerControl(optimizer = "bobyqa"),
    nAGQ = 0,
  )
#> fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
#> Error: Response is constant

y <- assay(sca)["CADM2",]
dat <- colData(sca)
dat <- data.frame(y,dat)
dat$y_disc <- dat$y>0

dat %>% as_tibble() %>% dplyr::count(disease, subject, y_disc)
#> # A tibble: 14 x 4
#>    disease subject y_disc     n
#>    <fct>   <chr>   <lgl>  <int>
#>  1 Control 1069    TRUE     480
#>  2 Control 902     FALSE      1
#>  3 Control 902     TRUE     870
#>  4 Control 904     TRUE     216
#>  5 Control 906     TRUE     416
#>  6 Control 91      TRUE     286
#>  7 Control 945     TRUE     319
#>  8 ALS     110     FALSE      2
#>  9 ALS     110     TRUE     691
#> 10 ALS     111     TRUE     302
#> 11 ALS     113     TRUE     338
#> 12 ALS     332     TRUE      15
#> 13 ALS     388     TRUE     189
#> 14 ALS     52      TRUE     312

disc_fit <-
  glmer(
    y_disc ~ disease + cn_genes_on + cn_nCount_SCT + cn_percent_mt + cn_age + sex + seq_batch + (1 | subject),
    data = dat,
    family = "binomial",
    control = glmerControl(optimizer = "bobyqa"),
    nAGQ = 0,
  )
#> fixed-effect model matrix is rank deficient so dropping 1 column / coefficient

zlmCond@coefC
#>        (Intercept)  diseaseALS cn_genes_on cn_nCount_SCT cn_percent_mt
#> GFAP       8.29401  1.12023728 -0.07655072    0.02442343  -0.020249861
#> MALAT1    15.59505  0.01082459  0.12577780   -0.24172040  -0.090441340
#> GPM6A     12.28415 -0.55844186 -0.07206625    0.08239564  -0.009892255
#> ERBB4     11.59645 -0.30211774 -0.10247324    0.21863447  -0.042176694
#> CADM2     12.29975 -0.36309474 -0.07710509    0.11810448   0.005804970
#> MAGI2     11.43671 -0.24065765 -0.13788845    0.11733356  -0.032837481
#>             cn_age        sexM seq_batchbatch_2 seq_batchbatch_3
#> GFAP   -0.20639925 -0.09135056       0.49418686       0.44759747
#> MALAT1 -0.18206980 -0.20952663      -0.18163052       0.03419533
#> GPM6A   0.01503511  0.29918196       0.07135877       0.34234493
#> ERBB4   0.10387150 -0.14493914      -0.04503845       0.26233164
#> CADM2   0.01522204 -0.53816474      -0.55727286       0.30972378
#> MAGI2   0.06218008  0.18165004      -0.03257498       0.08604887
#>        seq_batchbatch_4 seq_batchbatch_5
#> GFAP         0.84527533        1.7331418
#> MALAT1      -0.36238682        0.3129492
#> GPM6A        0.04144394       -0.5939891
#> ERBB4       -0.08014849       -0.4053022
#> CADM2       -0.48380033       -0.6011401
#> MAGI2        0.13870640       -0.6291547
zlmCond@coefD
#>        (Intercept)    diseaseALS  cn_genes_on cn_nCount_SCT cn_percent_mt
#> GFAP      0.880975  2.362414e+00 3.703621e-01  2.676908e-01 -7.586761e-02
#> MALAT1   10.638409 -5.465235e-16 1.998504e-16  1.187756e-16 -2.199635e-16
#> GPM6A    10.638409 -5.465235e-16 1.998504e-16  1.187756e-16 -2.199635e-16
#> ERBB4    10.226903 -6.264171e-01 1.674600e+00 -3.311432e-01 -5.580800e-01
#> CADM2     8.050730 -8.027358e-01 1.219512e+00 -2.684928e-01  3.959452e-01
#> MAGI2     7.565095 -5.186919e-01 8.960942e-01  2.237338e-01  1.178158e-01
#>               cn_age          sexM seq_batchbatch_2 seq_batchbatch_3
#> GFAP   -4.452868e-01 -4.945825e-01     1.093009e-01    -1.494435e-02
#> MALAT1  1.520002e-16  2.020446e-16     7.612317e-17     7.561554e-17
#> GPM6A   1.520002e-16  2.020446e-16     7.612317e-17     7.561554e-17
#> ERBB4   3.420845e-01 -1.659656e+00     6.848692e-01    -1.987180e-02
#> CADM2  -4.373242e-01  5.716655e-01     1.263799e+00     3.827985e-01
#> MAGI2  -8.245549e-01 -5.977621e-01     1.830101e+00     9.999911e-01
#>        seq_batchbatch_4 seq_batchbatch_5
#> GFAP       1.363749e+00     5.789519e-01
#> MALAT1     4.987686e-17     9.059252e-17
#> GPM6A      4.987686e-17     9.059252e-17
#> ERBB4      2.389449e-01    -2.009024e-01
#> CADM2      1.101994e+00    -5.283315e-01
#> MAGI2     -1.031814e-01     1.404526e+00


sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-conda-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.7 LTS
#> 
#> Matrix products: default
#> BLAS/LAPACK: /cndd/junhao/anaconda3/envs/r_env_v4/lib/libopenblasp-r0.3.12.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] MAST_1.17.3                 SingleCellExperiment_1.12.0
#>  [3] SummarizedExperiment_1.20.0 GenomicRanges_1.42.0       
#>  [5] GenomeInfoDb_1.26.1         IRanges_2.24.0             
#>  [7] S4Vectors_0.28.1            MatrixGenerics_1.2.0       
#>  [9] matrixStats_0.58.0          lme4_1.1-25                
#> [11] Matrix_1.2-18               rsvd_1.0.3                 
#> [13] NMF_0.23.0                  Biobase_2.50.0             
#> [15] BiocGenerics_0.36.0         cluster_2.1.0              
#> [17] rngtools_1.5                pkgmaker_0.32.2            
#> [19] registry_0.5-1              data.table_1.13.6          
#> [21] forcats_0.5.0               stringr_1.4.0              
#> [23] dplyr_1.0.2                 purrr_0.3.4                
#> [25] readr_1.4.0                 tidyr_1.1.2                
#> [27] tibble_3.0.4                ggplot2_3.3.2              
#> [29] tidyverse_1.3.0            
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-150           bitops_1.0-6           fs_1.5.0              
#>  [4] lubridate_1.7.9.2      progress_1.2.2         doParallel_1.0.16     
#>  [7] RColorBrewer_1.1-2     httr_1.4.2             tools_4.0.3           
#> [10] backports_1.2.0        utf8_1.1.4             R6_2.5.0              
#> [13] DBI_1.1.1              colorspace_2.0-0       withr_2.4.1           
#> [16] prettyunits_1.1.1      tidyselect_1.1.0       compiler_4.0.3        
#> [19] cli_2.3.0              rvest_0.3.6            xml2_1.3.2            
#> [22] DelayedArray_0.16.0    scales_1.1.1           digest_0.6.27         
#> [25] minqa_1.2.4            rmarkdown_2.5          XVector_0.30.0        
#> [28] pkgconfig_2.0.3        htmltools_0.5.0        dbplyr_2.0.0          
#> [31] highr_0.8              rlang_0.4.10           readxl_1.3.1          
#> [34] generics_0.1.0         jsonlite_1.7.2         RCurl_1.98-1.2        
#> [37] magrittr_2.0.1         GenomeInfoDbData_1.2.4 fansi_0.4.2           
#> [40] Rcpp_1.0.6             munsell_0.5.0          abind_1.4-5           
#> [43] lifecycle_0.2.0        stringi_1.5.3          yaml_2.2.1            
#> [46] zlibbioc_1.36.0        MASS_7.3-53            plyr_1.8.6            
#> [49] grid_4.0.3             crayon_1.4.1           lattice_0.20-41       
#> [52] haven_2.3.1            splines_4.0.3          hms_0.5.3             
#> [55] knitr_1.30             pillar_1.4.7           boot_1.3-25           
#> [58] reshape2_1.4.4         codetools_0.2-18       reprex_0.3.0          
#> [61] glue_1.4.2             evaluate_0.14          modelr_0.1.8          
#> [64] vctrs_0.3.5            nloptr_1.2.2.2         foreach_1.5.1         
#> [67] cellranger_1.1.0       gtable_0.3.0           assertthat_0.2.1      
#> [70] xfun_0.19              gridBase_0.4-7         xtable_1.8-4          
#> [73] broom_0.7.2            iterators_1.0.13       statmod_1.4.35        
#> [76] ellipsis_0.3.1
@hoholee hoholee changed the title Convergene issues due to all cells in comparison expressed certain genes? Convergence issues due to all cells in comparison expressed certain genes? Mar 4, 2021
@KeatsShwab
Copy link

I'm having the same problem. Any guidance would be appreciated.

@amcdavid
Copy link
Member

amcdavid commented Nov 11, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants