
ManavalanG/UniProt-genome-annotations-hg19

Read this first:

This repository was last updated for the March 2017 UniProtKB release. I have since moved to using the BigBed files provided by UniProt instead of the Bed files used here. That more recent work is available in the uniprot_genomic repository.

UniProt in hg19 coordinates

UniProt provides human genome annotation data that map amino acid annotations directly to reference genome coordinates, but these data are available only in hg38 coordinates. See this publication for more information:

Nucleic Acids Res. 2017 Jan 4;45(D1):D158-D169. doi: 10.1093/nar/gkw1099. UniProt: the universal protein knowledgebase. The UniProt Consortium.

This repository converts and makes this data available in hg19 coordinates.
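As described under 'Processing pipeline' below, the conversion relies on the UCSC liftOver tool. The following is a minimal sketch of how such a call could be wrapped in Python, not the repository's actual script. The input and output file names are hypothetical, the chain file name (hg38ToHg19.over.chain.gz, as distributed by UCSC) is an assumption about which chain file is meant, and the -bedPlus=12 and -tab options are assumptions about how the extra annotation columns might be preserved.

```python
import subprocess
from pathlib import Path


def build_liftover_cmd(bed_in, chain_file, bed_out, unmapped_out):
    """Build the argument list for UCSC liftOver.

    liftOver's positional arguments are: oldFile map.chain newFile unMapped.
    """
    return [
        "liftOver",
        "-bedPlus=12",  # assumption: first 12 columns are standard BED, the rest are extras
        "-tab",         # assumption: split on tabs only, since annotation text contains spaces
        str(bed_in),
        str(chain_file),
        str(bed_out),
        str(unmapped_out),
    ]


def run_liftover(bed_in, chain_file, bed_out, unmapped_out):
    """Run liftOver; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(
        build_liftover_cmd(bed_in, chain_file, bed_out, unmapped_out),
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical file names; only the settings_files directory comes from this README.
    chain = Path("settings_files") / "hg38ToHg19.over.chain.gz"
    cmd = build_liftover_cmd("uniprot_hg38.bed", chain, "uniprot_hg19.bed", "unmapped.bed")
    print(" ".join(cmd))
```

Records that liftOver cannot map are written to the unmapped file, so it is worth checking that file after each run.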

Files for download:

Besides the conversion to hg19 coordinates, a few changes were made to suit our purpose, which is to identify whether query amino acids have any UniProt annotation. See the 'Processing pipeline' section for details.

  • Restructured, hg19-converted Bed files. These are probably what you are interested in.

  • Two merged files, each containing selected sequence annotations of interest, as listed below.

    a. Merged file - Type 1 has the following annotation types merged into a single file.

      1	Active site
      2	Binding site for any chemical group
      3	Calcium binding region
      4	Cross-link between proteins
      5	Disulfide bond
      6	Glycosylation-PTM
      7	Interesting site
      8	Lipidation-PTM
      9	Metal binding site
      10	Motif
      11	Nucleotide binding region
      12	Other PTM
      13	Signal peptide
      14	Transit peptide
      15	Zinc finger region
    

    b. Merged file - Type 0 has the following annotation types merged into a single file.

      1	Active peptide
      2	Chain
      3	Coiled coil
      4	DNA binding domain
      5	Domain
      6	Intramembrane
      7	Natural variant
      8	Region of interest
      9	Repeated motifs or domains
      10	Topological domain
      11	Transmembrane region
    

Processing pipeline:

  1. Use the liftOver tool to convert hg38 coordinates to hg19. Note: If you are interested in executing the script, download the chain file and store it in the settings_files directory. It is not provided here due to license concerns.

  2. Fix formatting issues in resulting Bed files.

  3. Reformat Bed files as follows:

    a. Replace the score column (5th column), which is zero by default in the data provided by UniProt, with the corresponding sequence annotation type, as shown below.

    Original format by UniProt:
    >chr1	7970956	7970959	Q99497	0	+	7970956	7970959	255,102,102	1	3	0	.	Nucleophile. Pubmed:20304780, Pubmed:25416785
    
    Format we used here:
    >chr1	8031016	8031019	Q99497	Active site	+	8031016	8031019	255,102,102	1	3	0	.	Nucleophile. Pubmed:20304780, Pubmed:25416785
    

    b. Restructure rows in the Bed files that contain non-contiguous amino acids, as in the example below.

    Original format by UniProt (this line has coordinates for three non-contiguous amino acids):
    >chr1	1633782	1633815	O75900	0	+	1633782	1633815	0,153,0	3	3,3,3	0,12,30	.	Zinc; catalytic.
    
    Format we used here (one amino acid per row, if non-contiguous):
    >chr1	1569161	1569164	O75900	Metal binding site	+	1569161	1569164	0,153,0	1	3	0	.	Zinc; catalytic.
    >chr1	1569173	1569176	O75900	Metal binding site	+	1569173	1569176	0,153,0	1	3	0	.	Zinc; catalytic.
    >chr1	1569191	1569194	O75900	Metal binding site	+	1569191	1569194	0,153,0	1	3	0	.	Zinc; catalytic.
    

The resulting Bed files are probably what you need if you are looking for an hg19 replacement for the hg38 genome coordinates provided by UniProt.
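Steps 3a and 3b above can be sketched roughly as follows. This is an illustrative sketch, not the repository's own code: the function name is hypothetical, and it assumes tab-separated BED12+ input in which every block is written out as its own row. The input line in the example is a hypothetical lifted-over version of the O75900 record, reconstructed from the hg19 rows shown above.

```python
def restructure_bed_line(line, annotation_type):
    """Return one BED row per block, with the score column set to annotation_type.

    Assumes a tab-separated BED12+ line: 12 standard BED columns followed by
    any extra columns, which are copied through unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    chrom_start = int(fields[1])
    # blockSizes / blockStarts may carry a trailing comma, hence the `if x` filter.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]

    rows = []
    for size, offset in zip(sizes, offsets):
        start = chrom_start + offset
        end = start + size
        row = list(fields)
        row[1], row[2] = str(start), str(end)            # chromStart, chromEnd
        row[4] = annotation_type                         # score column -> annotation type
        row[6], row[7] = str(start), str(end)            # thickStart, thickEnd
        row[9], row[10], row[11] = "1", str(size), "0"   # each row is a single block
        rows.append("\t".join(row))
    return rows


if __name__ == "__main__":
    lifted = (
        "chr1\t1569161\t1569194\tO75900\t0\t+\t1569161\t1569194\t0,153,0"
        "\t3\t3,3,3\t0,12,30\t.\tZinc; catalytic."
    )
    for row in restructure_bed_line(lifted, "Metal binding site"):
        print(row)
```

A single-block line passes through with only its score column changed, which corresponds to step 3a on its own.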

Further Restructuring:

We further merge the sequence annotation types of interest to us into two Bed files.

  1. For the annotation type 'natural variant', replace disease acronyms with their complete names.
  2. Merge the Bed files of interest into two sets of merged files according to sequence annotation type. Each annotation type is assigned the value 0 or 1 in the settings file, which determines the merged file it goes into.
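The merge step could look something like the sketch below. The settings format shown here (a dictionary mapping annotation type to 0 or 1) is a hypothetical stand-in for the repository's settings file, and the function name is illustrative. The sketch assumes the restructured rows carry the annotation type in the 5th column, as described in the processing pipeline.

```python
def merge_by_setting(rows, type_settings):
    """Group restructured BED rows into two merged lists keyed by 0 and 1.

    rows: iterable of tab-separated BED lines whose 5th column holds the
        annotation type.
    type_settings: dict mapping annotation type -> 0 or 1 (hypothetical format).
    Rows whose annotation type is missing from the settings are skipped.
    """
    merged = {0: [], 1: []}
    for row in rows:
        row = row.rstrip("\n")
        annotation_type = row.split("\t")[4]
        flag = type_settings.get(annotation_type)
        if flag in merged:
            merged[flag].append(row)
    return merged


if __name__ == "__main__":
    # A few annotation types taken from the Type 1 and Type 0 lists in this README.
    settings = {"Active site": 1, "Metal binding site": 1, "Domain": 0, "Chain": 0}
    example_rows = [
        "chr1\t8031016\t8031019\tQ99497\tActive site\t+",
        "chr1\t1569161\t1569164\tO75900\tMetal binding site\t+",
    ]
    merged = merge_by_setting(example_rows, settings)
    print(len(merged[1]), "rows for Type 1;", len(merged[0]), "rows for Type 0")
```

Replacing disease acronyms in 'natural variant' rows is not covered by this sketch, since the acronym-to-name mapping is not given here.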

Download the resulting merged bed files:

a. Merged Bed file - Type 1

b. Merged Bed file - Type 0
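Since the stated purpose of these files is to check whether a query amino acid has any UniProt annotation, a simple lookup might be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the merged files are assumed to keep the column layout of the restructured Bed rows shown earlier, and the query is assumed to be a 1-based genomic position.

```python
def annotations_at(bed_rows, chrom, position):
    """Return (accession, annotation type, description) for rows covering a position.

    position is 1-based. BED intervals are 0-based and half-open, so a row
    covers the position when chromStart < position <= chromEnd.
    """
    hits = []
    for row in bed_rows:
        fields = row.rstrip("\n").split("\t")
        if fields[0] == chrom and int(fields[1]) < position <= int(fields[2]):
            # The free-text description is the 14th column in the rows shown above.
            description = fields[13] if len(fields) > 13 else ""
            hits.append((fields[3], fields[4], description))
    return hits


if __name__ == "__main__":
    rows = [
        "chr1\t8031016\t8031019\tQ99497\tActive site\t+\t8031016\t8031019"
        "\t255,102,102\t1\t3\t0\t.\tNucleophile. Pubmed:20304780, Pubmed:25416785",
    ]
    print(annotations_at(rows, "chr1", 8031017))
```

A linear scan like this is fine for a handful of queries. For many queries, an interval index or a tool such as bedtools would be a more practical choice.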

Disclaimer

UniProt's license applies to the genome coordinate data available in this repository. Thanks to UniProt for permitting us to distribute this data in hg19 format. The data is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Code in this repository is distributed under the MIT license.

About

UniProt genome annotation data in hg19/GRCh37 coordinates
