Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug in DOSE:::gseaScores - Error in if (abs(max.ES) > abs(min.ES)) #46

Open
therealgenna opened this issue Feb 2, 2021 · 9 comments
Open

Comments

@therealgenna
Copy link

therealgenna commented Feb 2, 2021

Hi,

I am using R v4 and the latest clusterProfiler and DOSE package (DOSE version 3.16.0) and I get the following error:

Error in if (abs(max.ES) > abs(min.ES)) { : 
  missing value where TRUE/FALSE needed

As far as I can tell this comes from line 23 of DOSE:::gseaScores. The error happens when scores of all the genes in the considered geneSet are 0. In this case NR=0 and division by NR in Phit <- cumsum(Phit/NR) leads to all Phit values set to NaN. This eventually leads to the above-mentioned error.

It might be unlikely to expect zero scores for some genes, but I do get them, depending on how I get these scores. One dirty way to fix it would be to simulate tiny random deviations from zero for zero scores, but I wonder if there is a cleaner way to do it.

Thank you

@huerqiang
Copy link
Contributor

@therealgenna Thank you for your suggestion. Could please provide repeatable code and data so that we can modify this problem more conveniently.

@therealgenna
Copy link
Author

@huerqiang I'll try to send you a code with an example, but I've already pointed to the problem, which is easy to see.
From the code of gseaScores (https://github.com/YuLab-SMU/DOSE/blob/master/R/gsea.R), lines 470-475:

Phit <- Pmiss <- numeric(N)
    hits <- names(geneList) %in% geneSet ## logical

    Phit[hits] <- abs(geneList[hits])^exponent
    NR <- sum(Phit)
    Phit <- cumsum(Phit/NR)

I have a case that all geneList[hits] equal 0. Then clearly NR=0 and in the last line above Phit becomes NaNs.

I think maybe a reasonable fix would be to do the following - add the indicated line:

Phit <- Pmiss <- numeric(N)
    hits <- names(geneList) %in% geneSet ## logical

    Phit[hits] <- abs(geneList[hits])^exponent
    NR <- sum(Phit)

    # possible fix - add this line here (maybe also issue some warning):
    if(NR==0) {Phit[hits] <- 1; NR <- sum(Phit)} # NR should then equal Nh

    Phit <- cumsum(Phit/NR)

I can create an example simply by picking one of the gene sets and setting scores for those genes to 0 (so that geneList[hits] will all be zero. I will send this later if you think it's useful.

@therealgenna
Copy link
Author

Hi @huerqiang , here is the example and the data is attached in error_exampleData.RData.zip (I had to zip it because I cannot upload RData files)
error_exampleData.RData.zip
:

> gsea = GSEA(geneList, TERM2GENE=wpid2gene, TERM2NAME=wpid2name,pAdjustMethod="BH", pvalueCutoff=1)
preparing geneSet collections...
GSEA analysis...
Error in if (abs(max.ES) > abs(min.ES)) { : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (0.51% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

I've noticed that I also apparently need pvalueCutoff=1 to get the error, but this is a legitimate value, if I want to see everything.

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] org.Dr.eg.db_3.12.0    AnnotationDbi_1.52.0   IRanges_2.24.1        
[4] S4Vectors_0.28.1       Biobase_2.50.0         BiocGenerics_0.36.0   
[7] clusterProfiler_3.18.0 ggplot2_3.3.3          tidyr_1.1.2           

loaded via a namespace (and not attached):
 [1] viridis_0.5.1       tidygraph_1.2.0     bit64_4.0.5         viridisLite_0.3.0  
 [5] splines_4.0.3       ggraph_2.0.4        assertthat_0.2.1    shadowtext_0.0.7   
 [9] DO.db_2.9           rvcheck_0.1.8       BiocManager_1.30.10 blob_1.2.1         
[13] ggrepel_0.9.1       pillar_1.4.7        RSQLite_2.2.3       lattice_0.20-41    
[17] glue_1.4.2          downloader_0.4      digest_0.6.27       RColorBrewer_1.1-2 
[21] polyclip_1.10-0     qvalue_2.22.0       colorspace_2.0-0    cowplot_1.1.1      
[25] Matrix_1.3-2        plyr_1.8.6          pkgconfig_2.0.3     purrr_0.3.4        
[29] GO.db_3.12.1        scales_1.1.1        tweenr_1.0.1        enrichplot_1.10.2  
[33] BiocParallel_1.24.1 ggforce_0.3.2       tibble_3.0.6        generics_0.1.0     
[37] farver_2.0.3        ellipsis_0.3.1      cachem_1.0.1        withr_2.4.1        
[41] cli_2.3.0           magrittr_2.0.1      crayon_1.4.0        memoise_2.0.0      
[45] DOSE_3.16.0         fansi_0.4.2         MASS_7.3-53         tools_4.0.3        
[49] data.table_1.13.6   lifecycle_0.2.0     stringr_1.4.0       munsell_0.5.0      
[53] compiler_4.0.3      rlang_0.4.10        grid_4.0.3          rstudioapi_0.13    
[57] igraph_1.2.6        labeling_0.4.2      gtable_0.3.0        DBI_1.1.1          
[61] reshape2_1.4.4      graphlayouts_0.7.1  R6_2.5.0            gridExtra_2.3      
[65] dplyr_1.0.3         fastmap_1.1.0       bit_4.0.4           utf8_1.1.4         
[69] fastmatch_1.1-0     fgsea_1.16.0        GOSemSim_2.16.1     stringi_1.5.3      
[73] Rcpp_1.0.6          vctrs_0.3.6         tidyselect_1.1.0    scatterpie_0.1.5   

Thank you.

@huerqiang
Copy link
Contributor

@therealgenna I looked at your data and found that there are 23119 genes in your geneList, of which 17456 have a value of 0. If you got these values through foldchange, it is likely that your expression profile has not been preprocessed, and the expression values of many genes in a certain group are all 0..

@therealgenna
Copy link
Author

@huerqiang Yes, I have many zeros, but this is my scoring, which is based on multiple factors, including fold changes and p.adjust values in multiple comparisons. I'd assume the function would/should still work as I don't have to have ranks based on a single set of fold changes alone. Am I wrong?

@GuangchuangYu
Copy link
Member

You should persuade us that your data is truly valid. Otherwise the only reason I can think of is that more than half of your data are missing and GSEA is not designed for it.

@therealgenna
Copy link
Author

therealgenna commented Feb 6, 2021

@GuangchuangYu As I wrote above, I derived my scoring based on multiple pairwise comparisons, each one having fold changes and (adjusted) p-values. Either when p.adjust is 1 or reaches some high threshold (which I can choose) in multiple comparisons for a given gene, its score will be zero - this is how I decided to do that and there is nothing wrong about it and I will know how to interpret the result.

I have opened this issue simply because your function does not behave gracefully - it just throws an error, fails and no result is produced. This should not happen in a well-polished code. That's my point basically. I am just saying that if it comes to the situation when NR is zero the function should know how not to crash. I am not claiming that my suggested fix is good, but something should be done in that case, like skipping the gene set for which NR is zero (it does not happen for all gene sets in the list of sets), returning zero enrichment score etc., but going through all gene sets provided and returning the result to the user. Wouldn't you agree that a hard crash of the gseaScores function is not the best way to handle it?

@GuangchuangYu
Copy link
Member

Returning NAN instead of throw error is not always 'well-polished'. In case the NAN was generated due to a wrong input, it would be better to throw error.

How do we rank a gene list for half of them are zero?

pls answer my question directly instead of teach me how to polish my code.

If you can't persuade us, we can't help.

@therealgenna
Copy link
Author

@GuangchuangYu In case of the NAN was generated it would be better to give some informative warning (and proceed with other gene sets.)

If your question is "How do we rank a gene list for half of them are zero?" - it's the user who provides you with the ranked list, so it's user's responsibility. I still might have thousands of genes with positive and negative score and the idea of GSEA to try and pick weak signal is still valid.

One can replace zeros by tiny randomly generated values around zero; for example I can modify my scoring to give some values like that (e.g., not return 0 when padj=1; either do it randomly or deterministically) but really there would be no good way to say that some of these "zero" genes are higher ranked than the others.

If it is important for the program to know the "true" order of "zero" genes and the results would be substantially different if I permute this "zero" subset, then, first, it is not clear to me that this is necessarily the case [is it?], and if it is then it might make sense to do some permutation reorderings of this subset and get some average.

I am not sure why you are so reluctant to make some fixes to avoid the hard crash. We are talking about two separate issues here: (i) can I use a gene list with many genes having score zero, and (ii) code crashing. My focus is on (ii). You are worried about (i) but as I said I think the idea of GSEA is still valid. You seem to say no and I don't know why. But even if you stick to your opinion, for example because the actual order of the "zero" genes affects results a lot for many/all gene sets, at least giving an indication to the user of why the function fails would make sense (still, making it not fail would be preferable.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants