Skip to content
This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

Lift symmetric CpG methcounts from hg19 to hg38

Benjamin Decato edited this page Mar 17, 2016 · 2 revisions

Lift a hg19 methcounts file to hg38 using a cpgmap

cpgmap=~jqu/cmb-01/cpgmaps/human/hg19_cpg.over_hg38_cpg.map
~jqu/panfs/tools/git/jqujqu_python_scripts/netmap.py -n $cpgmap -o Sample_hg38.meth.unsorted Sample_hg19.meth
LC_ALL=C sort -k1,1 -k2,2g Sample_hg38.meth.unsorted -o Sample_hg38.meth && rm Sample_hg38.meth.unsorted

A $cpgmap file looks like this

chr1    895356  895357  +       chr1    959976  959977  +
chr1    895358  895359  +       chr1    959978  959979  +
chr1    895365  895366  +       chr1    959985  959986  +

The first 4 columns are the cytosine locations in the old assembly, and the last 4 columns are corresponding cytosine locations in the new assembly.

The python script takes both formats for input methcounts file. The output will be in the new methcounts format.

Preparation of the cpg map between assemblies

Here goes how the cpg map was generated.

  • Get CpG (2bp) locations in the old assembly and in the new assembly. We need FASTA files for both old and new assemblies. You can also construct TMP_OLD_CPG.BED and TMP_NEW_CPG.BED from methcounts files if available.
old_assembly=~/panfs/cache/human/hg19_chroms
new_assembly=~/panfs/cache/human/hg38_chroms
~jqu/cmb-01/tools/py_scripts/fa_motif.py -d ${old_assembly} -m CG -o  TMP_OLD_CPG.BED
~jqu/cmb-01/tools/py_scripts/fa_motif.py -d ${new_assembly} -m CG -o  TMP_NEW_CPG.BED
  • Name old CpG sites with their old location
awk '{OFS="\t"; print $1, $2, $3, $1":"$2":"$3":"$4":"$6, 0, $6}' < TMP_OLD_CPG.BED > TMP_OLD_CPG_NAMED.BED
  • Download proper chain file from UCSC Genome Browser. Use liftOver to map CpG sites to new assembly and sort by new locations.
chain=~/panfs/cache/liftOverchains/hg19ToHg38.over.chain.gz
liftOver TMP_OLD_CPG_NAMED.BED $chain TMP_OLD_CPG_NAMED_LIFTED.BED UNMAPPED
LC_ALL=C sort -k1,1 -k2,2g -k3,3g TMP_OLD_CPG_NAMED_LIFTED.BED -o TMP_OLD_CPG_NAMED_LIFTED.BED.SORTED 
  • Match lifted position with true CpG location with bedtools. Only keep mapping locations without indels, and output single base-pair location for cytosines in CpG dinucleotides on the "+" strand.
bedtools closest -a TMP_OLD_CPG_NAMED_LIFTED.BED.SORTED -b TMP_NEW_CPG.BED  | \
awk '{OFS="\t"; 
if ($8==$2 && $9==$3) {split($4, a, ":"); print a[1],a[2],a[2]+1,"+", $7,$8,$8+1,"+"} 
}' > TMP_CPG_MAP 
  • Keep one entry from duplicated sites and remove temporary files
export LC_ALL=C;
sort -k5,5 -k6,6g -u TMP_CPG_MAP  | sort -k1,1 -k2,2g -u  /dev/stdin > old_cpg.over_new_cpg.map
rm TMP* UNMAPPED 
cpgmap=old_cpg.over_new_cpg.map

Your CpG map file should be named with the old and new assemblies.