sof202/ChromOptimise

ChromOptimise is a pipeline that identifies the optimum number of states that should be used with ChromHMM's LearnModel command for a particular genomic dataset.

For more specific information, please head over to the wiki.

About

When using ChromHMM to learn hidden Markov models for genomic data, it is often difficult to determine how many states to include:

  • Including too many states will overfit your data and introduce redundant states
  • Including too few states will underfit your data, resulting in lower model accuracy

This pipeline identifies the optimal number of states by finding a model that avoids both of these problems.

After using this pipeline, the user will have a greater understanding of their dataset in the context of ChromHMM, allowing them to make more informed decisions in further downstream analysis.

Getting started

  1. Clone this repository
  2. Ensure all required software is installed
  3. If using LDSC, download 1000 genomes files (or similar) from this repository
  4. Create the configuration files using the templates provided and place them in a memorable location
  5. Run the setup executable
    • You may need to run chmod +x setup first
    • You will be prompted to choose whether to remove lines beginning with module (an artefact of the HPC system used at UoE)
    • You will also be prompted to choose whether to remove SLURM directives that are specific to the UoE HPC

Usage

After completing 'getting started', run the master script (ChromOptimise.sh) in the command line with:

bash ChromOptimise.sh path/to/configuration/directory

Alternatively, you can run each of the shell scripts in JobSubmission sequentially for each epigenetic mark. For further information and example usage please consult the pipeline explanation.

Depending on your chosen dataset, you may not need to run all scripts. For example:

  • If you are not downloading data from EGA, the first two scripts are not necessary
    • Just ensure that .bam files are organised into directories named [[epigenetic mark name]] within the raw files directory
  • If your data has already been processed (quality controlled), start from the subsampling script
    • Again, ensure that .bam files are organised into directories named [[epigenetic mark name]] within the processed files directory
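As a rough illustration of the expected layout, the snippet below creates one directory per epigenetic mark and moves the corresponding .bam files into it. The directory name RawFiles and the mark names are assumptions for this sketch; substitute the raw (or processed) files directory and marks from your own configuration.

```shell
# Sketch only: RawFiles and the mark names below are illustrative assumptions.
raw_dir="RawFiles"

for mark in H3K27ac H3K4me3; do
  # One directory per epigenetic mark, as the pipeline expects
  mkdir -p "${raw_dir}/${mark}"

  # Gather any matching .bam files into that mark's directory (if present)
  for bam in *"${mark}"*.bam; do
    [ -e "${bam}" ] && mv "${bam}" "${raw_dir}/${mark}/"
  done
done

ls "${raw_dir}"
```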

There are also supplementary scripts that provide further information on your chosen dataset. Most importantly, the thresholds used in redundancy analysis can be inferred from the results of Redundancy_Threshold_Optimisation. Further details on these scripts can be found in the wiki.

Software requirements

This pipeline requires a unix-flavoured OS with the following software installed:

Further information

This study makes use of data generated by the Blueprint Consortium. A full list of the investigators who contributed to the generation of the data is available from www.blueprint-epigenome.eu. Funding for the project was provided by the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 282510 – BLUEPRINT.

For any further enquiries, please open an issue or contact Sam Fletcher:
s.o.fletcher@exeter.ac.uk