Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in handle_design_parameter(design, data, col_data, reference_level) : Too few replicates / too many coefficients to fit model. #42

Open
sopenaml opened this issue Dec 13, 2022 · 1 comment

Comments

@sopenaml
Copy link

sopenaml commented Dec 13, 2022

Hi Constantin,

Thank you very much for glmGamPoi! I've been trying to get it work on a dataset I have, where I have 9 clusters for 2 genotypes, 3 replicate for each genotype. I have tried to run:

fit <- glm_gp( bm.mat_subset,
               design = ~ genotype + seurat_clusters + genotype:seurat_clusters - 1,
               col_data = metadat,
               subsample = TRUE,
               on_disk = FALSE,
               reference_level = "WT" )

c0 <- test_de(fit,  
                  contrast = genotypeKO - genotypeWT,
                  pseudobulk_by = mouse , 
         sort_by = pval, decreasing = FALSE)

Error in handle_design_parameter(design, data, col_data, reference_level) : 
  The model_matrix has more columns (18) than the there are samples in the data matrix (6 columns).
Too few replicates / too many coefficients to fit model.
The head of the design matrix: 
     genotypeWT genotypeKO seurat_clusters1 seurat_clusters2 seurat_clusters3 seurat_clusters4 seurat_clusters5 seurat_clusters6 seurat_clusters7 seurat_clusters8 

I've looked at your example and I get the same error if I don't pre-filter the data to few clusters (NK cells, B cells and T cells), as it's done in the example. The resulting fit has 16 coeficients and the data has 16 samples ( ind + stim) so it produces the same error too few replicates/too many coeficients to fit the model. Is this a bug of do we always have to prefilter the data so the number of coeficients is less than the number of samples you "pseudobulk_by"?

would you say that doing the following would be a good way to overcome the issues above?

de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, 
                  pseudobulk_by = paste(stim,  ind,  cell, sep="_" )) 

Thank you very much for your help

Miriam


sce_subset <- sce[rowSums(counts(sce)) > 100, 
                  sample(which(! is.na(sce$cell)), 1000)]
 counts(sce_subset) <- as.matrix(counts(sce_subset))
 sce_subset$cell <- droplevels(sce_subset$cell)
fit <- glm_gp(sce_subset, design = ~ cell + stim +  stim:cell - 1,
              reference_level = "NK cells")
fit
glmGamPoiFit object:
The data had 9727 rows and 1000 columns.
A model with 16 coefficient was fitted.
> de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, 
                  pseudobulk_by = paste0(stim, "-", ind)) 

Error in handle_design_parameter(design, data, col_data, reference_level) : 
  The model_matrix has more columns (16) than the there are samples in the data matrix (16 columns).
Too few replicates / too many coefficients to fit model.
The head of the design matrix: 
          cellNK cells cellB cells cellCD14+ Monocytes cellCD4 T cells cellCD8 T cells cellDendritic cells cellFCGR3A+ Monocytes cellMegakaryocytes stimstim cellB cells:stimstim cellCD14+ Monocytes:stimstim cellCD4 T cells:stimstim cellCD8 T cells:stimstim cellDendritic cells:stimstim cellFCGR3A+ Monocytes:stimstim cellMegakaryocytes:stimstim
 ctrl-101   0.05714286  0.11428571           0.4000000       0.2285714      0.14285714          0.00000000            0.02857143         0.02857143        0                    0                            0                        0                        0                            0                              0                           0
ctrl-1015   0.04761905  0.20000000           0.2761905       0.3333333      0.03809524          0.01904762      

@const-ae
Copy link
Owner

const-ae commented Jan 3, 2023

Hi Miriam,

Sorry for the late reply.

You are right to form pseudobulk sample to assess the effect of the genotype on the expression in your cell types.

Your idea pseudobulking over the simulation, the individual, and the cell type is exactly what I would recommend to do:

de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, 
                  pseudobulk_by = paste(stim,  ind,  cell, sep="_" )) 

Your question actually prompted me to take another look at documentation and implementation regarding pseudobulks in glmGamPoi. I decided to deprecate the pseudobulk_by argument in test_de in the upcoming release. Not because it produces wrong results, but simply because it is wasteful to first fit a model on thousands of cells and then throw the model away and fit a new model inside test_de on the aggregated data.

In the next release, the first step is to form a SummarizedExperiment object with the pseudobulk samples and then call glm_gp and test_de on the aggregated data.

For your use case the code would be something like this:

bm.mat_pseudobulk <- pseudobulk(bm.mat_subset, col_data = metadat, group_by = vars(mouse, genotype, seurat_clusters))

fit <- glm_gp( bm.mat_pseudobulk,
               design = ~ genotype + seurat_clusters + genotype:seurat_clusters - 1,
               reference_level = "WT" )

c0 <- test_de(fit,  
                  contrast = genotypeKO - genotypeWT,
         sort_by = pval, decreasing = FALSE)

If you are curious and want to try out the new approach (and maybe provide some feedback), you can install it from Github by calling

devtools::install_github("const-ae/glmGamPoi")

The upcoming release comes with a new vignette that also describes the process that you can find at https://github.com/const-ae/glmGamPoi/blob/master/vignettes/pseudobulk.Rmd and a rendered version will hopefully appear in the next few days at https://bioconductor.org/packages/devel/bioc/vignettes/glmGamPoi/inst/doc/pseudobulk.html

Best,
Constantin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants