IanSudbery/alvarez_et_al

Pipeline and code for Alvarez-Benayas, Katsarou and Trasanidis et al

This pipeline is built on the ruffus (http://www.ruffus.org.uk/) and cgatcore (https://cgat-core.readthedocs.io/en/latest/) frameworks. See their documentation to understand how the pipeline script files work.

The script pipeline_atacseq.py contains the pipeline logic.

The module pipelineAtacseq.py contains various work functions that are executed by the pipeline.

The directory scripts contains several ancillary scripts (mostly in R).

The directory Notebooks contains Rmarkdown and Jupyter notebooks with the interpretive analysis and plotting code.

Dependency installation

Dependencies are listed in environment.yml. This can be used to build a conda environment. We recommend using mamba to do this, as it is significantly faster than conda.

$ mamba env create [-p PATH_TO_STORE_ENV|-n NAME_FOR_ENV] -f environment.yml

Next you will need to install cgat-flow, which is not yet available on conda. This can be achieved by cloning the GitHub repository and installing it into your environment.

$ conda activate ENV_NAME_OR_PATH
$ git clone https://github.com/cgat-developers/cgat-flow.git
$ cd cgat-flow
$ python setup.py develop
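
Once the environment is active and cgat-flow is installed, a quick sanity check (using the same PATH/TO/REPO placeholder as in the steps below) is to ask the pipeline script for its help text; if the dependencies are importable this should print usage information rather than an ImportError:

$ python PATH/TO/REPO/pipeline_atacseq.py --help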

Configuring the pipeline

  1. To set up the pipeline, create a folder and run python PATH/TO/REPO/pipeline_atacseq.py config.
  2. Edit the configuration file. In particular, set the locations of the geneset, a samtools-indexed copy of version 38 of the human genome, and the ENCODE 3 blacklist and low-mappability regions BED files.
  3. Copy the sample annotation file samples.tsv from this repository into the directory.
  4. Put all the mapped ATAC-seq BAM files (mapped with bowtie2) in the root of the directory.
  5. Put the RNA-seq fastq files and BAM files (mapped with HISAT) into a folder called input_rna.dir (a full set-up sketch is given below).
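
Put together, the set-up might look like the following sketch. The working-directory name, the repository path, and the source locations of the BAM and fastq files are all illustrative, and pipeline.yml is assumed to be the name of the configuration file written by the config command:

$ mkdir atac_run.dir && cd atac_run.dir
$ python PATH/TO/REPO/pipeline_atacseq.py config
$ # edit pipeline.yml: geneset, indexed GRCh38 genome, ENCODE blacklist/low-mappability BEDs
$ cp PATH/TO/REPO/samples.tsv .
$ cp /path/to/atacseq_bowtie2_bams/*.bam .
$ mkdir input_rna.dir
$ cp /path/to/rnaseq/*.fastq.gz /path/to/rnaseq_hisat_bams/*.bam input_rna.dir/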

Running the pipeline

You will need to configure the cgatcore framework to work with your cluster. Do this by creating a .cgat.yml file in your home directory. An example is provided below:

jobs_limit_db: 1
shared_tmpdir: /fastdata/mb1ims/tmp
tmpdir: /scratch

cluster:
    queue_manager: sge
    queue: NONE
    parallel_environment: smp
    memory_resource: rmem
    options: -P gen2reg
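
The example above is for an SGE cluster. cgatcore also supports other schedulers; as a minimal, unverified sketch for a SLURM cluster (the partition name is illustrative, and the SGE-specific parallel_environment and memory_resource keys are dropped), the cluster section might instead look like:

cluster:
    queue_manager: slurm
    queue: general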

You can now show the tasks that will be run with python PATH/TO/REPO/pipeline_atacseq.py show full, or execute the pipeline with python PATH/TO/REPO/pipeline_atacseq.py make full.
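
For example, from the configured working directory (the directory name is illustrative, and -v 5 is the standard cgatcore verbosity option rather than anything specific to this pipeline):

$ cd atac_run.dir
$ python PATH/TO/REPO/pipeline_atacseq.py show full -v 5
$ python PATH/TO/REPO/pipeline_atacseq.py make full -v 5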

On our cluster, a complete run takes around 2.5 days (when node availability is not limiting).
