Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genome reference assembly extraction from variants/reads objects? #13

Open
ttriche opened this issue Oct 30, 2014 · 4 comments
Open

genome reference assembly extraction from variants/reads objects? #13

ttriche opened this issue Oct 30, 2014 · 4 comments

Comments

@ttriche
Copy link
Contributor

ttriche commented Oct 30, 2014

One of the handy defaults in Bioconductor GRanges/VRanges objects is a slot for the genome (reference assembly) of each chromosome (should all be the same for any sane object, of course). This helps prevent comparisons of (e.g.) hg18 and hg19, or hg19 and hg38, data as if on the same coordinate system (a practice which is such a terrible idea that it's essentially never worth allowing).

Unfortunately, this safeguard can't be enforced when no genome is specified.

In the course of adding a default seqlevelsStyle for *Ranges, I realized it would be nice to have the genome reference automatically specified when retrieving variants or aligned reads. This doesn't seem to be possible from the current data structure. What am I overlooking here?

@pgrosu
Copy link

pgrosu commented Oct 30, 2014

Hi Tim,

Actually I was able to find it using the java api client using this command:

java -jar target/genomics-tools-client-java-v1beta.jar custom --custom_endpoint "variantsets/search" --custom_method POST --custom_body '{"datasetIds":["10473108253681171589"]}' --pretty_print | less

Then in the the metadata section you will see listed as GRCh37 like this:

...
 "metadata" : [ {
      "key" : "FORMAT",
      "id" : "DS",
      "type" : "float",
      "number" : "1",
      "description" : "Genotype dosage from MaCH/Thunder"
    }, {
      "key" : "reference",
      "value" : "GRCh37",
      "type" : "unknownType"
    }, {
...

Hope it helps,
~p

@ttriche
Copy link
Contributor Author

ttriche commented Oct 30, 2014

That is terrific; it's in there, at least.

This sort of thing may become more of an issue (generally) as "variant" calls move towards graph traversals from assembly A to variant assembly representation B. But at least there's a way to extract it and prevent some issues when using the data from R, which avoids certain types of derp.

Thank you!

--t

On Oct 29, 2014, at 7:58 PM, Paul Grosu notifications@github.com wrote:

Hi Tim,

Actually I was able to find it using the java api client using this command:

java -jar target/genomics-tools-client-java-v1beta.jar custom --custom_endpoint "variantsets/search" --custom_method POST --custom_body '{"datasetIds":["10473108253681171589"]}' --pretty_print | less
Then in the the metadata section you will see listed as GRCh37 like this:

...
"metadata" : [ {
"key" : "FORMAT",
"id" : "DS",
"type" : "float",
"number" : "1",
"description" : "Genotype dosage from MaCH/Thunder"
}, {
"key" : "reference",
"value" : "GRCh37",
"type" : "unknownType"
}, {
...
Hope it helps,
~p


Reply to this email directly or view it on GitHub.

@pgrosu
Copy link

pgrosu commented Oct 30, 2014

Yeah, using a graph representation of the variants will definitely be a different approach, and that's the way GA4GH is also going. It will take a bit of work to do it efficiently, but that's the fun part :)

Also below are the steps if you want to do what I did above in Java through R:

  1. First paste the following after you perform the authenticate("client_secrets.json"):
library(httr)

getMetaData <- function(datasetIds="10473108253681171589", endpoint="https://www.googleapis.com/genomics/v1beta/",
  pageToken=NULL) {

  # Fetch data from the Genomics API
  body <- list(datasetIds=list(datasetIds), pageToken=pageToken)

  message("Fetching read data page")

  #queryParams <- list(fields="nextPageToken,reads(name,cigar,position,originalBases,flags)")
  queryParams <- list(fields="nextPageToken,variantSets(metadata)")
  queryConfig <- config()
  if (GoogleGenomics:::.authStore$use_api_key) {
    queryParams <- c(queryParams, key=GoogleGenomics:::.authStore$api_key)
  } else {
    queryConfig <- config(token=GoogleGenomics:::.authStore$google_token)
  }
  res <- POST(paste(endpoint, "variantsets/search", sep=""),
    query=queryParams,
    body=rjson::toJSON(body), queryConfig,
    add_headers("Content-Type"="application/json"))
  if("error" %in% names(httr::content(res))) {
    print(paste("ERROR:", httr::content(res)$error$message))
  }
  stop_for_status(res)

  message("Parsing read data page")
  res_content <- httr::content(res)
  res_content
}

dataset_metadata_info <- getMetaData()
dataset_metadata_info$variantSets[[1]]$metadata[[2]]$value
  1. Once you run the last two commmands you should see something like this:
> dataset_metadata_info <- getMetaData()
dataset_metadata_info$variantSets[[1]]$metadata[[2]]$value
Fetching read data page
Parsing read data page
> [1] "GRCh37"

~p

@deflaux
Copy link
Collaborator

deflaux commented Jan 15, 2015

This is a great feature request and is something we should do soon.

For reads, its the referenceSetId which can then be used to look up the reference set

java -jar target/genomics-tools-client-java-v1beta2.jar getreadgroupset --id CMvnhpKTFhDnk4_9zcKO3_YB --pretty_print
...
    "id" : "ChhDTXZuaHBLVEZoRG5rNF85emNLTzNfWUIQAQ",
    "name" : "ERR016214",
    "predictedInsertSize" : 465,
    "referenceSetId" : "EOSt9JOVhp3jkwE",
    "sampleId" : "HG00261"
...

For variants, it comes from the VCF the header and we can get it from the variant set metadata.

java -jar target/genomics-tools-client-java-v1beta2.jar getvariantset --id 3049512673186936334 --pretty_print
...
 }, {
    "key" : "reference",
    "type" : "UNKNOWN_TYPE",
    "value" : "file:///illumina/development/Isis/Genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFASTA/genome.fa"
  }, {
...

Per @calbach down the road we should have https://github.com/googlegenomics/api-client-java/issues/66

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants