Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

differential expression analysis and GSEA #119

Open
Sophon-0 opened this issue Mar 19, 2024 · 5 comments
Open

differential expression analysis and GSEA #119

Sophon-0 opened this issue Mar 19, 2024 · 5 comments
Assignees
Labels
question Further information is requested

Comments

@Sophon-0
Copy link

Sophon-0 commented Mar 19, 2024

Hello,

I was using decoupleR for pathway analysis in single cell.

I tried a few lines:

msigdb <- decoupleR::get_resource("MSigDB")
msigdb = msigdb[msigdb$collection =='hallmark', ]


# Enrichment with Over Representation Analysis (ORA)
mat <- as.matrix(seurat_object@assays$RNA@data)
result <- run_ora(
  mat=mat,
  network=msigdb,
  .source='geneset',
  .target='genesymbol'
)

Where should I specify the case group and the control group ? Where do we specify the statistics chosen ?
Looking at the output, it seems to be considering every single cell as a condition.

@PauBadiaM PauBadiaM self-assigned this Mar 19, 2024
@PauBadiaM
Copy link
Collaborator

Hi @AgentScientist,

Pathway analysis can be performed at two levels: at the observation (cell) or contrast. In your case, you seem interested in the difference between conditions so first you will have to generate contrast level gene statistics using your favorite differential expression framework (limma, deseq2, edger, etc.). Because you are working with single cell data, it is a good practice to perform DEA at the pseudobulk level to reduce the overinflation of p-values (check this ref if you want to read more about it). Once you have performed DEA on your pseudobulk profiles you can perform any enrichment analysis that you want. In this vignette, we show how to perform pseudobulking, DEA with DESeq2 and different enrichment analyses you can perform. Unfortunately, it is only available in the python version of decoupler, but should be easy to follow/adapt in R too. Hope this is helpful!

@Sophon-0
Copy link
Author

Sophon-0 commented Mar 20, 2024

Hello,
So I have in total 6 samples from 3 donors. 2 conditions: normal versus treatment.
So for each donor, there are 2 samples, one normal, one treated.
I want to compare cell type X in treatment versus control.

"We know that single cells within a sample are not independent of each other, since they were isolated from the same environment. If we treat cells as samples, we are not testing the variation across a population of samples, rather the variation inside an individual one. "

Would this really be a problem ? I mean, that cell type X in treated group is in a totally different state than in control group, and it's a 3 vs 3. I got biological more sound result using Seurat::FindMarkers + PIANO than using the pseudobulk method. Maybe I'm missing something.

At the step of pseudo bulk generation. At the sample column, is it ok to choose sample, knowing that some samples come from the same donor?

pdata = dc.get_pseudobulk( adata, sample_col='sample', groups_col='cell_type', layer='counts', mode='sum', min_cells=10, min_counts=1000 )

PROGENy works perfectly fine. I'm just having some issues with the MsigDB analysis, which missed some key pathways that I got using Seurat::FindMarkers + PIANO

@PauBadiaM
Copy link
Collaborator

Hi @AgentScientist,

I would strongly advice against performing DEA at the single-cell level followed by enrichment analysis. Even if you have a balanced experimental design, still samples might contribute different number of cells which will bias the test, using single-cells as observations breaks the assumption of any DEA test that samples are independent from each other, and also it overinflates the obtained p-values (so you will get many false positives).

Regarding your pseudobulk results, one thing you might try is to be less strict with the gene filtering with dc.plot_filter_by_expr. Are you keeping enough genes after filtering for DEA testing?

Regarding your patient vs sample metadata, you can do two things: i) use the patient id as your sample col (you summarize per patient), ii) you include the patient id as a covariate to the DESeq2 model to correct for that.

Hope this is helpful!

@PauBadiaM PauBadiaM added the question Further information is requested label Mar 22, 2024
@Sophon-0
Copy link
Author

Sophon-0 commented Mar 22, 2024

Hello, it was a geneset issue ! I used the standard MSigDB 1329 pathways, and I got same result as before.

deeenes added a commit to saezlab/pypath-common that referenced this issue Mar 22, 2024
@Sophon-0
Copy link
Author

Sophon-0 commented Mar 22, 2024

"Regarding your patient vs sample metadata, you can do two things: i) use the patient id as your sample col (you summarize per patient), ii) you include the patient id as a covariate to the DESeq2 model to correct for that."

  1. So I have these 2 columns in my adata: "Donor" and "sample" (for each donor, there are 2 samples, one normal, one treated)

pdata = dc.get_pseudobulk(
adata,
sample_col='Donor',
groups_col='cell_type',
layer='counts',
mode='sum',
min_cells=10,
min_counts=1000
)

If I do this, in the resulting pdata, for some reason I lose a few columns: the conditions (treated/normal) columns for ex. I keep all the columns only when I do sample_col='sample'

I actually think using "sample" is fine, since we don't want to mix the cell type within donors.

  1. Do you also mean to include "Donor" in the DeseqDataSet function ? I don't see any options for this.

I did this:

dds = DeseqDataSet(
adata=DCs,
design_factors=['conditions',"Donor"],
ref_level=['conditions','Normal'],
refit_cooks=True,
inference=inference
)

  1. For the ANOVA: "In order to have a better overview of the association of PCs with sample metadata, let’s perform ANOVA on each PC and see whether they are significantly associated with any technical or biological annotations of our samples"

I see in the example that there is no association between any PC and the condition column ("disease"). Even in the PCA plot, we don't distinguish between COVID-19 and Normal. Shouldn't we see some differences ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants