microsoft/genomicsnotebook

Genomics Data Analysis with Jupyter Notebooks on Azure

Jupyter notebooks are a useful tool for data scientists working on genomics data analysis. This repo demonstrates the use of Azure Notebooks for genomics data analysis with GATK, Picard, Bioconductor, and Python libraries.

For more information about Codespaces, please visit the product page.

Here is the list of sample notebooks in this repo:

  1. genomics.ipynb: Analysis from 'uBAM' to 'structured data table'.
  2. genomicsML.ipynb: Train Machine Learning models with Genomics + Clinical Data
  3. genomics-platinum-genomes.ipynb: Accessing Illumina Platinum Genomes data from Azure Open Datasets* and performing initial data analysis.
  4. genomics-reference-genomes.ipynb: Accessing reference genomes from Azure Open Datasets*
  5. genomics-clinvar.ipynb: Accessing ClinVar data from Azure Open Datasets*
  6. genomics-giab.ipynb: Accessing Genome in a Bottle data from Azure Open Datasets*
  7. SnpEff.ipynb: Accessing SnpEff databases from Azure Open Datasets*
  8. 1000 Genomes.ipynb: Accessing 1000 Genomes dataset from Azure Open Datasets*
  9. GATKResourceBundle.ipynb: Accessing GATK resource bundle from Azure Open Datasets*
  10. ENCODE.ipynb: Accessing ENCODE dataset from Azure Open Datasets*
  11. genomics-OpenCRAVAT.ipynb: Accessing OpenCRAVAT dataset from Azure Open Datasets* and deploying the built-in Azure Data Science VM for OpenCRAVAT
  12. Bioconductor.ipynb: Pulling Bioconductor Docker image from Microsoft Container Registry
  13. simtotable.ipynb: Simulate NGS data, use Cromwell on Azure or Microsoft Genomics service for secondary analysis, and convert the gVCF data to a structured data table.
  14. igv_jupyter_extension_sample.ipynb: Download a sample VCF file from Azure Open Datasets and use the igv-jupyter extension in the JupyterLab environment.
  15. radiogenomics.ipynb: Combine DICOM, VCF and gene expression data for patient segmentation analysis.
  16. fhir+PacBio.ipynb: Convert Synthetic FHIR and PacBio VCF Data to parquet and Explore with Azure Synapse Analytics
  17. fhir-vcf-clustering.ipynb: Convert Synthetic FHIR and PacBio VCF Data to parquet and Explore with Azure Synapse Analytics

*Technical note: Explore Azure Genomics Data Lake with Azure Storage Explorer
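
Several of the notebooks above end by turning variant data (VCF or gVCF) into a structured data table. As a rough illustration of that idea only, and not the code used in the notebooks, the sketch below parses a small VCF text into a list of row dictionaries using just the Python standard library. The sample records and the helper names `parse_info` and `vcf_to_rows` are hypothetical and exist only for this example.

```python
import io

# Hypothetical two-record VCF snippet, made up for illustration only.
SAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t10177\t.\tA\tAC\t50\tPASS\tDP=14;AF=0.5
chr1\t10352\trs555500075\tT\tTA\t99\tPASS\tDP=20;AF=0.25
"""


def parse_info(info):
    """Split a VCF INFO string such as 'DP=14;AF=0.5' into a dict."""
    fields = {}
    if info == ".":
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flag entries (no '=') carry no value, so store True for them.
        fields[key] = value if sep else True
    return fields


def vcf_to_rows(text):
    """Return one dict per variant record, keyed by the VCF column names."""
    header = None
    rows = []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#"):
            header = line.lstrip("#").split("\t")  # column header line
            continue
        if header is None:
            raise ValueError("variant record found before the #CHROM header")
        record = dict(zip(header, line.split("\t")))
        record["POS"] = int(record["POS"])
        record["INFO"] = parse_info(record["INFO"])
        rows.append(record)
    return rows


rows = vcf_to_rows(SAMPLE_VCF)
for row in rows:
    print(row["CHROM"], row["POS"], row["REF"], row["ALT"], row["INFO"])
```

A list of dictionaries like `rows` can be handed to a table library such as pandas if one is available in the workspace. For real gVCF files, a dedicated VCF parser is likely a better choice than this hand-rolled sketch.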

1. Prerequisites

Create and manage Azure Machine Learning workspaces in the Azure portal

For further details on creating an Azure ML workspace, please visit this page.

Run the notebook in your workspace

This section uses the cloud notebook server in your workspace for an install-free, pre-configured experience. Use your own environment if you prefer to control the environment, packages, and dependencies yourself.

Follow along with this video or use the detailed steps below to clone and run the tutorial from your workspace.

Watch the video

2. Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

3. References

  1. Jupyter Notebook on Azure
  2. Introduction to Azure Notebooks
  3. GATK
  4. Picard
  5. Azure Machine Learning
  6. Azure Open Datasets
  7. Cromwell on Azure
  8. Bioconductor