sof202/ChromOptimise

ChromOptimise is a pipeline that identifies the optimum number of states that should be used with ChromHMM's LearnModel command for a particular genomic dataset.

For more specific information, please head over to the wiki.

About

When using ChromHMM to learn hidden Markov models for genomic data, it is often difficult to determine how many states to include:

  • Including too many states will overfit your data and introduce redundant states
  • Including too few states will underfit your data, resulting in lower model accuracy

This pipeline identifies the optimal number of states by finding a model that avoids both of these problems.

After using this pipeline, the user will have a greater understanding of their dataset in the context of ChromHMM, allowing them to make more informed decisions in further downstream analysis.

Getting started

  1. Clone this repository
  2. Ensure all required software is installed
  3. If using LDSC, download 1000 genomes files (or similar) from this repository
  4. Create the configuration files using the templates provided and place them in a memorable location
  5. Run the setup executable
    • You may need to run chmod +x setup first
    • You will be prompted to choose whether to remove lines beginning with module (an artefact of the HPC system used at UoE)
    • You will also be prompted to choose whether to remove SLURM directives that are specific to the UoE HPC

Usage

After completing 'getting started', run the master script (ChromOptimise.sh) in the command line with:

bash ChromOptimise.sh path/to/configuration/directory

Alternatively, you can run each of the shell scripts in JobSubmission sequentially for each epigenetic mark. For further information and example usage please consult the pipeline explanation.

Depending on your chosen dataset, you may not need to run all scripts. For example:

  • If you are not downloading data from EGA, the first two scripts are not necessary
    • Just ensure that .bam files are organised into directories named [[epigenetic mark name]] within the raw files directory
  • If your data has already been processed (quality controlled), start from the subsampling script
    • Again, ensure that .bam files are organised into directories named [[epigenetic mark name]] within the processed files directory
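As a rough illustration of the expected layout, the snippet below creates one directory per epigenetic mark and moves the corresponding .bam files into it. The directory name RawFiles and the mark names are assumptions for this sketch; substitute the raw (or processed) files directory and marks from your own configuration.

```shell
# Sketch only: RawFiles and the mark names below are illustrative assumptions.
raw_dir="RawFiles"

for mark in H3K27ac H3K4me3; do
  # One directory per epigenetic mark, as the pipeline expects
  mkdir -p "${raw_dir}/${mark}"

  # Gather any matching .bam files into that mark's directory (if present)
  for bam in *"${mark}"*.bam; do
    [ -e "${bam}" ] && mv "${bam}" "${raw_dir}/${mark}/"
  done
done

ls "${raw_dir}"
```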

There are also supplementary scripts that provide further information on your chosen dataset. Most importantly, the thresholds used in redundancy analysis can be inferred from the results of Redundancy_Threshold_Optimisation. Further details on these scripts can be found in the wiki.

Software requirements

This pipeline requires a unix-flavoured OS with the following software installed:

Further information

This study makes use of data generated by the Blueprint Consortium. A full list of the investigators who contributed to the generation of the data is available from www.blueprint-epigenome.eu. Funding for the project was provided by the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 282510 – BLUEPRINT.

For any further enquiries, please open an issue or contact Sam Fletcher:
s.o.fletcher@exeter.ac.uk