RenneLab/hkref

hkref (Hybkit-Reference)

This repository is a part of the hybkit project.
Full hybkit project documentation is available at hybkit's ReadTheDocs.

Description:

This repository provides an up-to-date human genomic sequence reference designed to be compatible with the Hyb program for calling chimeric (hybrid) reads in ribonomics experiments.
The method for reference library construction is based on the protocol provided in the supplemental methods of:
Helwak, Aleksandra, et al. 'Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding.' Cell 153.3 (2013): 654-665. http://dx.doi.org/10.1016/j.cell.2013.03.043
The reference library is primarily based on sequences downloaded from Ensembl via the Biomart API, with the use of miRBase for mature miRNA sequences and a few other sequence sources.
Biomart queries include:
  • mRNA : transcript_biotype=protein_coding; cdna; (limited to where a RefSeq Protein Identifier Exists)
  • lncRNA : transcript_biotype=lncrna; cdna
  • rRNA : transcript_biotype=[Mt_rRNA, rRNA, rRNA_pseudogene]; transcript_exon_intron
  • tRNA : transcript_biotype=Mt_tRNA; transcript_exon_intron
  • other : transcript_biotype=[all remaining]; transcript_exon_intron
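As a rough illustration of the queries listed above, the sketch below builds BioMart XML query documents from a biotype list and a sequence attribute. It does not reproduce this repository's code. The dataset name `hsapiens_gene_ensembl`, the header attributes requested, and the helper name `build_query_xml` are assumptions made for the example. The RefSeq restriction on the mRNA query and the "other" query are left out because the list above does not spell out their exact filters.

```python
from xml.etree import ElementTree as ET

# Query definitions transcribed from the list above:
# label -> (transcript_biotype values, sequence attribute)
QUERIES = {
    "mRNA": (["protein_coding"], "cdna"),
    "lncRNA": (["lncrna"], "cdna"),
    "rRNA": (["Mt_rRNA", "rRNA", "rRNA_pseudogene"], "transcript_exon_intron"),
    "tRNA": (["Mt_tRNA"], "transcript_exon_intron"),
}

# Assumed header attributes; the real pipeline may request different ones.
HEADER_ATTRIBUTES = (
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "external_gene_name",
    "transcript_biotype",
)


def build_query_xml(biotypes, seq_attribute, dataset="hsapiens_gene_ensembl"):
    """Return a BioMart XML query string requesting FASTA output."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "FASTA",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Multiple filter values are passed as one comma-separated string.
    ET.SubElement(ds, "Filter", {
        "name": "transcript_biotype",
        "value": ",".join(biotypes),
    })
    for attr in HEADER_ATTRIBUTES + (seq_attribute,):
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    for label, (biotypes, seq_attribute) in QUERIES.items():
        print(label)
        print(build_query_xml(biotypes, seq_attribute))
```

The script only prints the XML and makes no network request. Submitting the documents to a BioMart service is left to whichever client is used.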

For a detailed description of the current sequences and queries utilized, see "Current Reference Details" below.

Run Reference Creation Pipeline:

The reference pipeline is designed using Nextflow, and has been tested on Nextflow/23.04.1. Dependency handling is performed with conda modules (containerized implementation in development).

Required program dependencies are:
Required Python Packages:

The scripts can be run by executing the first script, "00_run_all.sh", either with the supplied conda configuration or after making all required resources (seqkit, python3) available on the system path.

Hyb Reference Specification:

The Hyb program has requirements about the formatting of the FASTA file used for the reference.

Currently identified requirements include:
  • No description in the FASTA sequence header (no whitespace characters)
  • Sequence identifier of the form "{1}_{2}_{seqid}_{biotype}"
    {1}: Arbitrary identifier (ENSG... for Ensembl sequences)
    {2}: Arbitrary identifier (ENST... for Ensembl sequences)
    {seqid}: Name of gene/miRNA
    {biotype}: Ensembl-style transcript_biotype
        (Note: "microRNA" must be used in place of "miRNA" for recognition by Hyb)
  • {1}, {2}, {seqid}, and {biotype} should contain only [a-z], [A-Z], [0-9], "-", and "|" characters.
  • "_", ".", and "," characters are specifically excluded from identifiers.

Examples:

>ENSG00000003137_ENST00000001146_CYP26B1_mRNA
.....
>MIMAT0000062_MirBase_let-7a_microRNA
TGAGGTAGTAGGTTGTATAGTT
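
The rules above can be expressed as a small header builder and checker. This is a minimal sketch of the stated requirements, not the pipeline's own code: the function names `clean_field`, `make_header`, and `is_valid_header` are hypothetical, and headers are handled without the leading ">".

```python
import re

# A field may contain only letters, digits, "-" and "|".
FIELD_RE = re.compile(r"[A-Za-z0-9|-]+")


def clean_field(text):
    """Drop every character outside the allowed set ('_', '.', ',', spaces, ...)."""
    return re.sub(r"[^A-Za-z0-9|-]", "", text)


def make_header(id1, id2, name, biotype):
    """Build a '{1}_{2}_{seqid}_{biotype}' identifier from raw field values."""
    if biotype == "miRNA":
        biotype = "microRNA"  # Hyb recognizes "microRNA", not "miRNA"
    fields = [clean_field(f) for f in (id1, id2, name, biotype)]
    if not all(fields):
        raise ValueError("a field is empty after cleaning")
    return "_".join(fields)


def is_valid_header(header):
    """Check that a header has exactly four allowed-character fields."""
    fields = header.split("_")
    return len(fields) == 4 and all(FIELD_RE.fullmatch(f) for f in fields)


if __name__ == "__main__":
    print(make_header("ENSG00000003137", "ENST00000001146", "CYP26B1", "mRNA"))
    print(make_header("MIMAT0000062", "MirBase", "let-7a", "miRNA"))
    print(is_valid_header("ENSG00000003137_ENST00000001146_CYP26B1_mRNA"))
```

Because "_" is the field separator, splitting on it and requiring exactly four fields also catches stray underscores inside a gene name or biotype.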

Thanks to Grzegorz Kudla ( https://github.com/gkudla ) for providing information on Hyb reference creation.

Current Reference Details:

Text of: ./01_notes.sh

Download a reference sequence library for the Hyb program from Ensembl
using the Biomart python module.

Library construction is based on the protocol provided in the supplemental methods of:
Helwak, Aleksandra, et al. 'Mapping the human miRNA interactome by CLASH reveals
frequent noncanonical binding.' Cell 153.3 (2013): 654-665.
http://dx.doi.org/10.1016/j.cell.2013.03.043
( Supplemental methods section found only in PDF-fulltext )

Biomart queries include:
  protein_coding (as cDNA)
  lncRNA (as cDNA)
  All remaining gene_biotypes
      as unspliced transcripts ('transcript_exon_intron')

tRNAs:  Genomic tRNA Database (http://gtrnadb.ucsc.edu/)
rRNAs:  NCBI GenBank database, rRNA sequences (NR_003287.4, NR_003286.4)
miRNAs: miRBase release 22.1 (http://www.mirbase.org): mature human miRNAs.

These sequences are then formatted in the required {}_{}_{name}_{biotype} header
format for Hyb, and all extra '.' and '_' symbols are removed.

Original biotypes from the hOH7 Hyb database are:
Ig, lincRNA, microRNA, miscRNA, mRNA, mtrRNA, pr-tr, pseudo, rRNA, snoRNA, snRNA, Trec, tRNA
In this version, biotypes are passed through as given in the Ensembl 'transcript_biotype' field.

In order to facilitate unambiguous miRNA alignment, mature miRNA sequences are aligned to the
reference transcriptome, and any alignments within transcripts are masked. This is done both to
ensure that each miRNA sequence has only a single reference alignment and to allow miRNA
precursor transcripts to be identified as hybrid targets.

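
The masking step described in the notes can be illustrated with a simplified sketch. It substitutes exact substring search for a real sequence aligner, and it assumes that masked positions are overwritten with "N"; the notes do not state which aligner or masking character the pipeline uses. The function name `mask_mirna_sites` is hypothetical.

```python
def mask_mirna_sites(transcript, mirna_seqs, mask_char="N"):
    """Replace every exact occurrence of a mature miRNA in a transcript.

    Exact matching stands in for alignment here; matches are located on
    the original transcript so that overlapping sites are all masked.
    """
    masked = list(transcript)
    for mirna in mirna_seqs:
        if not mirna:
            continue  # skip empty sequences
        start = transcript.find(mirna)
        while start != -1:
            masked[start:start + len(mirna)] = mask_char * len(mirna)
            start = transcript.find(mirna, start + 1)
    return "".join(masked)


if __name__ == "__main__":
    let7a = "TGAGGTAGTAGGTTGTATAGTT"
    precursor = "AAAC" + let7a + "GGG"
    print(mask_mirna_sites(precursor, [let7a]))
```

After masking, the mature miRNA sequence matches only its own reference entry, while the rest of the precursor transcript remains available as a potential hybrid target.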