Bioinformatics @ Sydney Informatics Hub

This page includes bioinformatics pipelines, software, and training material developed by the Sydney Informatics Hub, which is a Core Research Facility of the University of Sydney. The Sydney Informatics Hub is an official node of the Australian BioCommons, and has worked in partnership with National Computational Infrastructure, Pawsey Supercomputing Research Centre, and QCIF to create command-line resources that make bioinformatics more accessible for life scientists.

Many of the resources available here focus on making large-scale data processing more accessible. To achieve this, we have developed optimised pipelines for national HPC infrastructures, along with resources for workflow development.

💻 Reproducible pipelines
📓 Reproducible notebooks
✨ Supporting Nextflow
💾 Software and helper scripts
🎓 Self-directed training materials
💁 Cite us to support us!

💻 Reproducible pipelines

Our pipelines have been optimised for compute platforms including the University of Sydney's HPC Artemis, the National Computational Infrastructure (NCI), Pawsey Supercomputing Research Centre's HPC Setonix and Nimbus cloud, the University of Queensland's (UQ's) HPC Flashlite, and AWS Cloud. You can find DOIs for all our pipelines at the Sydney Informatics Hub's WorkflowHub.

We also support the use of nf-core workflows. Check out the institutional configs we've built for Australian HPC and cloud infrastructures.

Category	Pipeline	Infrastructure	Description	Software
Quality control	fastqc-nf	Nextflow - NCI Gadi	QC of raw Illumina sequence reads	fastQC, multiqc
Quality control	BamQC-nf	Nextflow - NCI Gadi, Pawsey Setonix, Pawsey Nimbus	Short read alignment file QC stats	samtools, mosdepth, qualimap, multiqc
Genomics	Parabricks-Genomics	Nextflow - NCI Gadi	GPU-enabled, rapid whole genome sequence alignment and short variant calling against a reference genome	Parabricks, BWA-MEM, DeepVariant, Glnexus, VEP
Genomics	Fastq-to-BAM	Optimised - NCI Gadi	Whole genome sequence alignment to a reference genome following pre-processing recommendations by the Broad Institute	bwa-kit, fastp, BWA-MEM, SAMbamba, SAMblaster, SAMtools, GATK4
Genomics	Germline-ShortV	Optimised - NCI Gadi	Germline short variant calling (joint calling) following the Germline short variant discovery (SNPs + Indels) Best Practices Workflow by the Broad Institute	GATK4
Genomics	Bootstrapping-for-BQSR	Optimised - NCI Gadi	Bootstrapping a variant resource to enable GATK base quality score recalibration (BQSR) for non-model organisms that lack a publicly available variant resource	GATK4
Genomics	Somatic-ShortV	Optimised - NCI Gadi	Somatic short variant calling (joint calling) following the Somatic short variant discovery (SNPs + Indels) Best Practices Workflow by the Broad Institute for tumour-normal pairs	GATK4
Genomics	Somatic-ShortV-nf	Nextflow - NCI Gadi, Pawsey Setonix, Pawsey Nimbus	Currently under development	GATK4
Genomics	GermlineStructuralV-nf	Nextflow - NCI Gadi, Pawsey Setonix, Pawsey Nimbus	Germline structural variant calling with short read BAM files	manta, smoove, tiddit, survivor, annotSV
Genomics	BioCommons-Canu-Metrics	Optimised - NCI Gadi	Collect compute resource usage metrics (CPU, memory, /scratch disk, /jobfs disk, iNode) after running Canu optimised for NCI Gadi by the Australian BioCommons
Genomics	Flashlite-Juicer	Optimised - Flashlite [decommissioned]	PBS version of Juicer that generates Hi-C maps from raw fastq files	Juicer
RNAseq	RNASeq-DE	Optimised - NCI Gadi	Process RNA sequencing data for differential expression, including fastQC, trimming, mapping with STAR and obtaining a raw count matrix	fastQC, multiQC, bbduk, STAR, RSeQC, HTSeq
Metagenomics	Shotgun-Metagenomics-Analysis	Optimised - NCI Gadi	Analysis of metagenomic shotgun sequences including assembly, speciation, abundance, ARG discovery, functional profiling, gene prediction, insertion sequence annotation and estimation of the resistome	abricate, bbtools, bracken, bwa, diamond, fastqc, gatk, humann2, kraken2, kronatools, megahit, metaphlan2, multiqc, nci-parallel, openmpi, prodigal, prokka, python3, sambamba, samtools, seqtk
Transcriptomics	Gadi-Trinity	Optimised - NCI Gadi	Perform de novo transcriptome assembly with Trinity	Trinity
Data preparation	IndexReferenceFasta-nf	Nextflow - NCI Gadi, Pawsey Setonix, Pawsey Nimbus	Create fasta file indexes	samtools, bwa, gatk

📓 Reproducible notebooks

Notebook	Description
RNAseq: differential expression	An R Markdown notebook to convert raw gene counts to functional enrichments
Proteomics: differential abundance	Currently under development
Metagenomics: taxonomic profiling	Currently under development

✨ Supporting Nextflow

We have created resources to support Nextflow workflow development and deployment on HPC infrastructures.

Tool	Description
Nextflow DSL2 template	A straightforward Nextflow workflow template generator.
Nextflow ConfigBuilder	A simple custom config file generator. Under development.
Institutional nf-core configs	Public config files for running nf-core pipelines at NCI and Pawsey infrastructures.

💾 Software and helper scripts

We have created resources to support workflow development and deployment on HPCs, resource benchmarking, and flexible data visualisation.

Tool	Description
HPC usage reports	Pull resource usage data from HPC job logs into reports.
NCI Gadi benchmarking template	Automated submission of identical benchmark tasks with increasing compute resources.
IGVreport-nf	Generate IGV report for a set of variants.
split-GeneWiz-fastq	Split GeneWiz 'combined' (concatenated) fastq files into correct flowcell-lane pairs.
Fix-BAM-read-groups	Change the read group metadata within a BAM file. Operates on the header as well as the individual SAM output lines.
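
To illustrate what changing read group metadata in both the header and the alignment lines involves, here is a minimal Python sketch. It is not the Fix-BAM-read-groups implementation: it assumes the BAM has already been decoded to SAM text, and the function name `rename_read_group` and its parameters are hypothetical. It relies only on the SAM conventions that read groups are declared on `@RG` header lines with an `ID:` field and referenced from each alignment by an optional `RG:Z:` tag after the 11 mandatory fields.

```python
from typing import Iterable, Iterator

# A SAM alignment line has 11 mandatory fields (QNAME through QUAL);
# optional TAG:TYPE:VALUE fields follow them.
N_MANDATORY_SAM_FIELDS = 11


def rename_read_group(
    sam_lines: Iterable[str], old_id: str, new_id: str
) -> Iterator[str]:
    """Yield SAM lines with read group `old_id` renamed to `new_id`.

    Hypothetical helper for illustration only.
    """
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@RG":
            # Header: update the ID field of the matching @RG line.
            fields = [
                f"ID:{new_id}" if f == f"ID:{old_id}" else f for f in fields
            ]
        elif not fields[0].startswith("@"):
            # Alignment record: update the optional RG:Z: tag and leave
            # the mandatory fields untouched.
            mandatory = fields[:N_MANDATORY_SAM_FIELDS]
            optional = [
                f"RG:Z:{new_id}" if f == f"RG:Z:{old_id}" else f
                for f in fields[N_MANDATORY_SAM_FIELDS:]
            ]
            fields = mandatory + optional
        # Other header lines (@HD, @SQ, @PG, @CO) pass through unchanged.
        yield "\t".join(fields)
```

In practice the SAM text could come from `samtools view -h`, which prints the header followed by the alignment lines. Updating both places matters because an alignment whose `RG:Z:` tag names a read group missing from the header is inconsistent, and downstream tools may reject it.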

🎓 Self-directed training materials

We deliver national training events focused on the accessibility of command-line bioinformatics as a part of the Australian BioCommons training cooperative. Visit their events page for upcoming events. You can find recordings of past events on the Australian BioCommons YouTube channel and the Sydney Informatics Hub YouTube channel. Materials for all Australian BioCommons events are published on Zenodo.

Event	Description	Prerequisites
Unlocking nf-core workshop	Foundational skills for running and customising nf-core workflows reproducibly.	Experience with Unix CLI, familiarity with Nextflow and nf-core
Introduction to RNAseq workshop	RNAseq data analysis for differential expression on the CLI.	Experience with Unix CLI, familiarity with R/RStudio
Artemis HPC training series	A comprehensive introduction to USyd's HPC.	Competency with Unix CLI
Introduction to NCI Gadi workshop	A quickstart guide for experienced supercomputer users.	Experience with Unix CLI, HPC infrastructures

💁 Cite us to support us!

Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities. Please cite the repository of each pipeline that you have used. You can also find DOIs for all our pipelines at the Sydney Informatics Hub's WorkflowHub.

Suggested acknowledgements:

Sydney Informatics Hub

The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via ARDC and Bioplatforms Australia.

NCI Gadi

The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia. The authors acknowledge the use of the National Computational Infrastructure (NCI) supported by the Australian Government and the Sydney Informatics Hub HPC Allocation Scheme, supported by the Deputy Vice-Chancellor (Research), University of Sydney and the ARC LIEF, 2019: Smith, Muller, Thornber et al., Sustaining and strengthening merit-based access to National Computational Infrastructure (LE190100021).

USyd Artemis

The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia. This research utilised the high performance computing service, Artemis, provided by the Sydney Informatics Hub, Core Research Facility, University of Sydney.

UQ's Flashlite

The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia. The authors acknowledge the use of the high performance computing services provided by the University of Queensland's Research Computing Centre (RCC).
