Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example code motivating seqlevelsStyle() patch #21

Open
ttriche opened this issue Nov 5, 2014 · 1 comment
Open

Example code motivating seqlevelsStyle() patch #21

ttriche opened this issue Nov 5, 2014 · 1 comment

Comments

@ttriche
Copy link
Contributor

ttriche commented Nov 5, 2014

Per #15 and ensuing blahblah, here's the unvarnished code.
If it breaks, you get to keep the pieces. :-)

library(GoogleGenomics)
## 
## may need to do this interactively:
## 
authenticate('client_secrets.json')
## 
## the rest should not require user input 
## 
variant <- getVariants(datasetId = "10473108253681171589", chromosome = "22",
                       start = 16051400, end = 16051500, oneBasedCoord = FALSE,
                       fields = NULL, converter = c)
variantVR <- variantsToVRanges(variant, slStyle = 'UCSC')
## 
## Replace the middle base of a DNAStringSet with the alternate SNP allele
## (only tested for SNPs where both reference and alternate are 1bp wide!)
## There are so very many ways in which this is horrible...
##
replaceWithAlt <- function(stringSet, position, alt) {
  strings <- as.character(stringSet)
  substr(strings, position, position) <- as.character(alt)
  return(DNAStringSet(strings))
}
## 
## Use the BSgenome representation of hg19 to play with the ref/alt sequences:
## One base on each side, what happens when we go from reference to alternate?
## Major issue: how to know what reference assembly the variants come from?
##
library(BSgenome.Hsapiens.UCSC.hg19)
##
## Absent a reference assembly, the above (a guess!) could be very wrong...
## Hence issue #13 in the github repository 
##
variantVR$context <- getSeq(Hsapiens, resize(variantVR, 3, fix='center')) 
variantVR$altcontext <- replaceWithAlt(variantVR$context, 2, alt(variantVR))
show(variantVR)
## 
## what is the consequence (what has typically been "lost" to the reference?)
## just in terms of simple doublets, nothing fancy here 
## 
refCpG <- length( grep('CG', variantVR$context) ) / length( variantVR ) 
altCpG <- length( grep('CG', variantVR$altcontext) ) / length( variantVR )
##
## So half of a randomly selected set of 4 variants are "losses" of CpG sites.
## Given the sample size we obviously can't conclude anything, but purging of 
## CpG sites from the genome (via spontaneous 5mC -> T deamination) is a steady
## process that has heavily depleted the "average" genome of said doublets. 
##
print(paste0('In this sample, ', 
             (altCpG - refCpG)*100, '% of ref-to-alt changes create a CpG.'))
##
## It is interesting (NOT rigorous evidence of any sort) that a randomly picked
## set of SNPs from the 1KGP participants would happen to illustrate the theme.
@pgrosu
Copy link

pgrosu commented Nov 5, 2014

This is great - thanks Tim! For me it's still team effort :) Sharing is caring, right? :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants