Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getXrefList(...) method is broken #30

Open
OliveIsLazy opened this issue Mar 8, 2023 · 14 comments
Open

getXrefList(...) method is broken #30

OliveIsLazy opened this issue Mar 8, 2023 · 14 comments
Assignees
Labels
bug Something isn't working

Comments

@OliveIsLazy
Copy link

OliveIsLazy commented Mar 8, 2023

For a few months, I've struggled with getting this method to work (first issues relating to getting geneset names that later resolved itself, and now issues with getting the actual genes in the genesets), which is strange because the issues started after the last commit.

The code I use is:

# Get metadata.
pw <- listPathways("Mus musculus")

# Compile pathways.
gs <- list()
for (i in pw$id)
{
  print(i)
  gs[[length(gs)+1]] <- getXrefList(i, "L")
  print(gs[[length(gs)]])
}

This code used to work for me, but now it produces incomplete returns like this (I only copied the first few outputs):

[1] "WP1"
[1] ""
[1] "WP10"
 [1] "11651"  "12702"  "14784"  "16186"  "16198"  "16199"  "16367"  "16451"  "16453"  "18708"  "19247"  "20416"  "20846" 
[14] "20848"  "20850"  "20851"  "26395"  "26396"  "26413"  "26417"  "269523" "384783" "54721"  "81601" 
[1] "WP103"
[1] ""
[1] "WP108"
[1] ""
[1] "WP113"
[1] ""
[1] "WP116"
[1] ""
[1] "WP1241"
[1] ""
[1] "WP1242"
[1] ""
[1] "WP1243"
[1] ""
[1] "WP1244"
[1] ""

I was wondering if it had to do with calling bridgedb's datasources repo, as their datasources.tsv file has been moved from the path described in location in getXrefList(...)'s help page.

Please tell me if you need more information!

@egonw egonw self-assigned this Mar 8, 2023
@egonw egonw added the bug Something isn't working label Mar 8, 2023
@egonw
Copy link
Member

egonw commented Mar 8, 2023

Hi @OliveIsLazy, we're running into a few issues ourselves. Thank you for our report. I hope we can fix things as soon as possible.

@mkutmon
Copy link
Member

mkutmon commented Mar 8, 2023

@OliveIsLazy as an alternative, you can download the mouse pathway collection in GMT format which has all the entrez gene ids for every approved mouse pathway. should also be a lot faster.

library(rWikiPathways)

wp.mm <- rWikiPathways::downloadPathwayArchive(date="20230210", organism="Mus musculus", format="gmt")
wp2gene <- readPathwayGMT(wp.mm)

@OliveIsLazy
Copy link
Author

OliveIsLazy commented Mar 8, 2023

@mkutmon This method has helped! The dataframe had a lot of the information I needed. None of the pathways have no genes in them, though pathway WP4627 (" Lipids measured in liver metastasis from breast cancer") has only one gene it in. Here is the code block I used in case anyone else is interested.

#' Generate a list of lists, where each sublist is a WikiPathway genesets.
#' The sublist names correspond to the genesets' name.
#' The sublist values correspond to the genes in the geneset.
#' bug report: https://github.com/wikipathways/rWikiPathways/issues/30#issuecomment-1460013132
#' #' DATE: 08-03-2023
get_wikipathways_two <- function() {
  
  # Get metadata.
  wp.mm <- rWikiPathways::downloadPathwayArchive(date="20230210", organism="Mus musculus", format="gmt")
  wp2gene <- readPathwayGMT(wp.mm)
  
  # Set up key variables.
  pwid <- sort(unique(wp2gene$wpid))
  gs <- list()
  
  # For each pathway id, find the rows with a matching ID and save the associated genes in a list
  for (i in pwid)
  {
    gs[[length(gs)+1]] <- wp2gene[(wp2gene$wpid==i), "gene"]
    
    # Acts as a fail-safe in case there's an issue with the id-pathwayname mapping.
    names(gs)[length(gs)] <- unique(wp2gene[(wp2gene$wpid==i), "name"])
  }
 
  return(gs)
}

By the way, I checked the number of separate genesets in the function you sent me and it totals to 199, which is less than the amount I recorded from December 14th 2022, 238. And indeed, the function I posted initially still generate 238 genesets. I checked and the 199 current pathways appear to all be a subset of the previous 238. Do you know what happened to these missing pathways or how I could get access to them?

On that note, I'd also like to know if any of the pathway definitions (aka the genes in them) has changed since this date, if that's something you're able to tell me.

Thank you very much for your help and your quick response, it is much appreciated.

@mkutmon
Copy link
Member

mkutmon commented Mar 8, 2023

Ah, the GMT file only contains approved pathways. Pathways under development and not finished pathways are not included in our monthly releases. I will double-check if that is indeed the case for those 39 mouse pathways.
Regarding the changes of the pathways > the GMT files are created monthly always on the 10th. So you simply change the date in the downloadPathwayArchive function and go back to see if anything changed.

Hope that helps!

@OliveIsLazy
Copy link
Author

I tried messing around a little and I haven't found any changes Dec 2022 and Feb 2023, but I really appreciate being able to confirm that!

This next part doesn't really relate to the coding backend, so I apologise if it's a bit rude, but I went and checked for which pathways I was missing for my analysis I'm in need of these five pathways:

  • "SRF and miRs in Smooth Muscle Differentiation and Proliferation"
  • "Toll-like receptor signaling pathway"
  • "Adar1 editing defficiency immune response"
  • "Elongation of (very) long chain fatty acids"
  • "GPCRs, non-odorant"

I hope to hear back soon on whether or not their gene lists can be made available again.

@DeniseSl22
Copy link

  • "SRF and miRs in Smooth Muscle Differentiation and Proliferation"

WP2081 --> not Approved; Homology Converted

  • "Toll-like receptor signaling pathway"

WP1271 --> Not Approved; Homology Converted

  • "Adar1 editing defficiency immune response"

WP3415 --> Not Approved

  • "Elongation of (very) long chain fatty acids"

WP4491 --> Not Approved

  • "GPCRs, non-odorant"

WP1396 --> Approved, so this one should be available in the GMT download @mkutmon ...? I checked, and the GMT release of Feb. 2023 includes this PW ID, so not sure why it's not available in R?

And, how can someone make a GMT file for unapproved PWs? @mkutmon @egonw ? Maybe we should make that option available in the future?

@mkutmon
Copy link
Member

mkutmon commented Mar 9, 2023

WP1396 is also in the R table.
Creating a GMT file for all pathways might be possible but it's likely not something we want to distribute since the pathways might not be ready for analysis yet. I restarted our indexer but the xreflist function still shows a bit of inconsistent behavior. I'll need to do more testing.

@AlexanderPico
Copy link
Member

Agreed. GMTs serve a specific purpose for enrichment analysis. We really don't want to make or share GMTs with unapproved content.

@OliveIsLazy
Copy link
Author

OliveIsLazy commented Mar 10, 2023

I can confirm that WP1396 is in the R table. I checked using the pathway names where the December 2022 version is called "GPCRs, non-odorant" and the February 2023 version is of the same name except with a space added at the end, hence the mismatch.

My apologies for the confusion.

@egonw
Copy link
Member

egonw commented Mar 12, 2023

Okay, we worked hard to solve a number of issues, including fixes in the BridgeDb Webservice, the ID mappings databases, uptime of the data.wikipathways.org server, and the indexer. I guess one of the last two was underlying your observation. @OliveIsLazy, can you confirm your original code is working again? I get:

> getXrefList("WP3415", "L")
 [1] "100038882" "100039796" "100702"    "106759"    "110454"    "12675"    
 [7] "142980"    "15945"     "15957"     "15959"     "15962"     "15975"    
[13] "15977"     "15979"     "16150"     "16151"     "17067"     "17858"    
[19] "18033"     "18035"     "19765"     "20304"     "20558"     "20684"    
[25] "21353"     "21822"     "22031"     "22034"     "223672"    "228607"   
[31] "230073"    "231655"    "23960"     "23961"     "23962"     "240327"   
[37] "240328"    "24110"     "246727"    "246728"    "246729"    "246730"   
[43] "26409"     "270151"    "320377"    "327959"    "434341"    "54123"    
[49] "54131"     "55932"     "56066"     "56417"     "56480"     "56489"    
[55] "58185"     "58203"     "60440"     "620913"    "623121"    "626578"   
[61] "66513"     "66724"     "667370"    "667373"    "67775"     "68652"    
[67] "71586"     "71898"     "74498"     "80861"     "99899"    

But I cannot say the results are very stable. It seems that ID mapping does not take place:

> getXrefList("WP2338", "En")
[1] "ENSG00000100697" "ENSG00000113360" "ENSG00000124571" "ENSG00000128191"
> getXrefList("WP2338", "L")
[1] "5901"

The returned Ensembl and NCBI identifiers here are only those that are explicitly in the pathways and none of the mapped identifiers. We see this problem also on BioConductor: https://bioconductor.org/checkResults/release/bioc-LATEST/rWikiPathways/lconway-checksrc.html

Basically, it seem the reboot of the Indexer solved some problems, but not all. The Indexer is not using the latest ID mapping file, which we hope to get from @tabbassidaloii soon.

@mkutmon
Copy link
Member

mkutmon commented Mar 13, 2023

I created an issue in the indexer repo: wikipathways/lucene-indexer#4. It has nothing to do with the new mapping databases so I need more time to actually look at the code and check what is happening and why the ID mapping doesn't work anymore.

@OliveIsLazy
Copy link
Author

I ran the code I originally posted, and all the genesets that I needed to match did based on the name (except the "GPCRs, non-odorant" but only due to the aforementioned naming difference). It generated 238 pathways, as it did in December. I tried checking the code again against another older file I had that saved 195 of the genesets' WikiPathway IDs, and all 195 genesets were accounted for.

As @egonw pointed out, there are issues with the ID mapping. 71 of the genesets have either no or one EntrezIDs in them. Here's two examples:

$`Amino acid conjugation of benzoic acid`
[1] "60525"

$Apoptosis
[1] ""

Thank you all very much for helping me, I really appreciate the hard work.

@DeniseSl22
Copy link

@OliveIsLazy ; I'm assuming you are running your code for the mouse species, right?

  • Amino acid conjugation of benzoic acid (Mm) WP1252 :

    • Website (showing 2 IDs):
      image
    • Webservice (showing 1 ID, filtering for NCBI gene/ Entrez ID (System code L):)
      image
    • Webservice (showing the other ID, filtering for Ensembl ID (System code En):)
      image
  • Apoptosis (Mm) WP1254 is showing the same behaviour (Filtering for Ensembl gives me all annotated DataNodes, filtering for Entrez Gene gives me no results at all); all DataNodes are annotated with an Ensembl ID, and the webservice XRef mapping doesn't seem to work...

@egonw and @mkutmon : seems like the mapping to other databases is (still) not happening.
When I execute the regular xRef mapping API call in the BridgeDb webservice, I do get a repsonse:

curl -X GET "https://webservice.bridgedb.org/Mus%20musculus/xrefs/En/ENSMUSG00000000817?dataSource=L" -H "accept: */*"

14103	Entrez Gene

When I try out the batch mapping, I get no matches:

curl -X POST "https://webservice.bridgedb.org/Mouse/xrefsBatch/En?dataSource=L" -H "accept: */*" -H "Content-Type: text/html" -d "ENSMUSG00000000817ENSMUSG00000001729ENSMUSG00000003184ENSMUSG00000003873"

TypeError: Failed to fetch

The same happens when I use the batch mapping, but just input 1 ID.

@OliveIsLazy
Copy link
Author

Yes, all the code I've ran has been only for the mice pathways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants