
Data set #126

Open
rehamFatima opened this issue Jun 26, 2018 · 24 comments

Comments

@rehamFatima

Hi,

I have been trying to run your pipeline on the mm10 data set (as set up by running install_genome_data.sh). As there were no fastq files, I made a BAM file from the BED files and am running the pipeline with the following command:

$ bds atac.bds -species mm10 -gensz mm -bam1 ../genomes/mm10_enh_dhs.bam -chrsz ../genomes/mm10/mm10.chrom.sizes

I am not getting the output directory structure described on your main page, nor a Report.html file. Any help with this would be highly appreciated.

Many thanks.
Kind Regards,
Reham

@leepc12
Collaborator

leepc12 commented Jun 26, 2018

If you don't explicitly specify an output directory with -out_dir [SOMEWHERE], then out/ will be generated in the working directory where you ran the pipeline command. Also, skip -gensz mm; it's redundant.

@rehamFatima
Author

out/ is generated, but it does not have all the files listed on your main page. Could you please tell me what data you used to generate all those files? I am not sure whether the discrepancy is in the data set I am using or whether something is missing from the pipeline installation.

Thank you

@rehamFatima
Author

rehamFatima commented Jun 29, 2018

Could you also please tell me which fastqs you used when running the pipeline? The data set coming from install_genome_data.sh only contains fasta and bed files.

Many thanks

@rehamFatima
Author

I have tried running it with different fastqs from various sources, but that crashes with Error (modules/align_bowtie2.bds, line 44, pos 3): Bowtie2 index (-bwt2_idx) doesn't exists! (file: .1.bt2 or .1.bt2l)

@leepc12
Collaborator

leepc12 commented Jun 29, 2018

Did you specify species (hg38, mm10, ...) by -species [SPECIES]? Did you install genome data correctly? Did you see a "success" message at the end of the genome data installation shell script?
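A quick way to verify the index files exist (a sketch, not part of the pipeline itself; the path below is a placeholder you would replace with your own genome data location):

```shell
# Hedged sanity check: the pipeline expects Bowtie2 index files ending in
# .1.bt2 (or .1.bt2l for large genomes) under the genome data directory.
# The path here is a placeholder, not an actual install location.
idx=/path/to/genome/mm10/bowtie2_index
ls "$idx"/*.1.bt2 "$idx"/*.1.bt2l 2>/dev/null \
  || echo "no .1.bt2/.1.bt2l files found: -bwt2_idx will fail"
```

If the ls finds nothing, the pipeline will raise the same -bwt2_idx error shown above.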

@rehamFatima
Author

Thanks Jin, yes, I have specified -species mm10. I have reinstalled the genome data and did see the "success" message. It still does not contain any fastq files. Is it supposed to, or are the fastq files to be retrieved from elsewhere?

@leepc12
Collaborator

leepc12 commented Jul 2, 2018

Please post a full log (including Bowtie2 index (-bwt2_idx) doesn't exists!) for debugging. Also, what is the full command line that you used to run the pipeline?

@rehamFatima
Author

The command is, e.g.:

bds atac.bds -species mm10 -se -gensz mm -fastq1 ../genomes/mm10_no_alt_analysis_set_ENCODE.fastq.gz -chrsz ../genomes/mm10/mm10.chrom.sizes

and the output is :

Picked up _JAVA_OPTIONS: -Xms8G -Xmx8G -XX:ParallelGCThreads=1


== git info
Latest git commit		: d0babc6d2cc0aa8038c5ad35a41ed9876f49619d (Tue Jun 5 18:40:20 2018)
	Reading parameters from section (default) in file(/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env)...


== configuration file info
Hostname			: ebi5-164.ebi.ac.uk
Configuration file		: 
Environment file		: /homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env


== parallelization info
No parallel jobs		: false
Maximum # threads 		: 8


== cluster/system info
Walltime (general)		: 5h50m
Max. memory (general)		: 7G
Force to use a system		: local
Process priority (niceness)	: 0
Retiral for failed tasks	: 0
Submit tasks to a cluster queue	: 
Unlimited cluster mem./walltime	: false
Use --acount instead of SLURM partition		: false
Java temporary directory		: ${TMPDIR}


== shell environment info
Conda env. 			: bds_atac
Conda env. for python3		: bds_atac_py3
Conda bin. directory		: 

Shell cmd. for init.		: if [[ -f $(which conda) && $(conda env list | grep bds_atac | wc -l) != "0" ]]; then source activate bds_atac; sleep 5; fi;  export PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

Shell cmd. for init.(py3)	: if [[ -f $(which conda) && $(conda env list | grep bds_atac_py3 | wc -l) != "0" ]]; then source activate bds_atac_py3; sleep 5; fi;  export PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

Shell cmd. for fin.		: TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0

Cluster task min. len.		: 60

Cluster task delay			: 0


== output directory/title info
Output dir.			: /homes/reham/ATAC-Seq/atac_dnase_pipelines/out
Title (prefix)			: atac_dnase_pipelines
	Reading parameters from section (default) in file(/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env)...


== species settings
Species				: mm10
Species file			: 

Species name (WashU browser)	: mm10
Ref. genome seq. fasta		: 
Chr. sizes file			: ../genomes/mm10/mm10.chrom.sizes
Black list bed			: 
Ref. genome seq. dir.		: 


== ENCODE accession settings
ENCODE experiment accession	: 
ENCODE award RFA			: 
ENCODE assay category		: 
ENCODE assay title		: 
ENCODE award				: 
ENCODE lab				: 
ENCODE assembly genome		: 
ENCODE alias prefix		: KLAB_PIPELINE
ENCODE alias suffix		: 


== report settings
URL root for output directory	: 
Genome coord. for browser tracks	: 


== align multimapping settings
# alignments reported for multimapping	: 0


== align bowtie2 settings
Bowtie2 index			: 
Replacement --score-min for bowtie2	: 
Walltime (bowtie2)		: 47h
Max. memory (bowtie2)		: 12G
Extra param. (bowtie2)		: 
Disable index on memory (bowtie2)	: false


== adapter trimmer settings
Maximum allowed error rate for cutadapt	: 0.10
Minimum trim. length for cutadapt -m	: 5
Walltime (adapter trimming)		: 23h
Max. memory (adapter trimming)		: 12G


== postalign bam settings
MAPQ reads rm thresh.		: 30
Rm. tag reads with str.		: chrM
No dupe removal in filtering raw bam	: false
Walltime (bam filter)		: 23h
Max. memory (bam filter)	: 12G
Dup marker				: picard
Use sambamba markdup (instead of picard)	: false


== postalign bed/tagalign settings
Max. memory for UNIX shuf			: 12G
No --random-source for UNIX shuf		: false


== postalign cross-corr. analysis settings
Max. memory for UNIX shuf			: 12G
User-defined cross-corr. peak strandshift	: -1
Extra parameters for cross-corr. analysis	: 
Max. memory for cross-corr. analysis		: 15G
Stack size for cross-corr. analysis		:


== callpeak macs2 settings
Genome size (hs,mm)		: mm
Walltime (macs2)		: 23h
Max. memory (macs2)		: 15G
Cap number of peaks (macs2)	: 300K
Extra parameters for macs2 callpeak		: 


== callpeak naiver overlap settings
Bedtools intersect -nonamecheck	: false


== IDR settings
Append IDR threshold to IDR out_dir	: false


== ATAQC settings
TSS enrichment bed		: 
DNase bed for ataqc		: 
Promoter bed for ataqc		: 
Enhancer bed for ataqc		: 
Reg2map for ataqc			: 
Reg2map_bed for ataqc		: 
Roadmap metadata for ataqc	: 
Max. memory for ATAQC			: 20G
Walltime for ATAQC			: 47h


== atac pipeline settings
Type of pipeline			: atac-seq
Align only				: false
# reads to subsample replicates (0 if no subsampling)	: 0
# reads to subsample for cross-corr. analysis 	: 25000000
No pseudo replicates			: false
No ATAQC (advanced QC report)		: false
No Cross-corr. analysis			: false
Use CSEM for alignment			: false
Smoothing window for MACS2		: 150
DNase Seq				: false
IDR threshold				: 0.1
Force to use ENCODE3 parameter set	: false
Force to use ENCODE parameter set	: false
Disable genome browser tracks	: false
p-val thresh. for overlapped peaks	: 0.01
MACS2 p-val thresh. for peaks	: 0.01
MACS2 p-val thresh. for BIGWIGs		: 0.01
Enable IDR on called peaks	: false
Automatically find/trim adapters	: false

== checking atac parameters ...
00:00:00.925	Error (modules/align_bowtie2.bds, line 44, pos 3): Bowtie2 index (-bwt2_idx) doesn't exists! (file: .1.bt2 or .1.bt2l)

@leepc12
Collaborator

leepc12 commented Jul 2, 2018 via email

@rehamFatima
Author

rehamFatima commented Jul 2, 2018 via email

@leepc12
Collaborator

leepc12 commented Jul 2, 2018 via email

@rehamFatima
Author

rehamFatima commented Jul 4, 2018



## Get hostname with the following command: 
## $ hostname -f
##
## Configure environment per hostname:
## [hostname1]
## ...
##
## Use the same environment for multiple hostnames:
## [hostname2, hostname3, ...]
## ...
##
## Using group
## [hostname1, hostname2, ... : group]
## [group]
## ...
##
## Using an asterisk in hostnames (IMPORTANT: only one * is allowed in hostnames)
##
## [host*name1]
##
## [*hostname2, hostname3*]

# Stanford Kundaje group clusters (out of SGE)
[vayu, mitra, durga]
conda_env	= bds_atac
conda_env_py3	= bds_atac_py3
conda_bin_dir	= /software/miniconda3/bin
species_file	= $script_dir/species/kundaje.conf
unlimited_mem_wt= true 		# unlimited max. memory and walltime on Kundaje clusters
nice 		= 20
nth 		= 3

# Stanford Kundaje group clusters (controlled with SGE)
[nandi, kali, amold, wotan, kadru, surya, indra]
conda_env	= bds_atac
conda_env_py3	= bds_atac_py3
conda_bin_dir	= /software/miniconda3/bin
species_file	= $script_dir/species/kundaje.conf
unlimited_mem_wt= true 		# unlimited max. memory and walltime on Kundaje clusters
system 		= sge
nice 		= 20
nth 		= 3

# Stanford NEW SCG
[*.scg.stanford.edu, dper730xd*, hppsl230s*, dper910*, sgiuv*, sgisummit*, smsx10srw*]
conda_env	= bds_atac
conda_env_py3	= bds_atac_py3
species_file	= $script_dir/species/scg.conf
nth		= 4		# number of threads for each pipeline
system		= slurm		# force to use SLURM SCG
q_for_slurm_account = true	# use --account instead of -p (partition)
cluster_task_delay = 10 	# for NFS delayed write

# Stanford OLD SCG4
[scg*.stanford.edu, scg*.local, carmack.stanford.edu, crick.stanford.edu]
conda_env	= bds_atac
conda_env_py3	= bds_atac_py3
species_file	= $script_dir/species/scg.conf
nth		= 4		# number of threads for each pipeline
system		= sge		# force to use SGE (Sun Grid Engine) on SCG3/4 even though a user doesn't explicitly specify SGE on command line with 'bds -s sge atac.bds ...'
cluster_task_delay = 10 	# for NFS delayed write

# Stanford Sherlock clusters
[sherlock*.stanford.edu, sh-*.local, sh-*.int, sh-ln*.stanford.edu]
conda_env	= bds_atac
conda_env_py3	= bds_atac_py3
species_file	= $script_dir/species/sherlock.conf
nth		= 4		# number of threads for each pipeline
system		= slurm
cluster_task_delay = 30		# for NFS delayed write


# default
[default]
conda_env	= bds_atac
conda_env_py3	= bds_atac_py3
species_file	= /nfs/nobackup/ensembl/reham_ens/genome/bds_atac_species.conf	# DEF_SPECIES_FILE: DO NOT MODIFY THIS COMMENT (install_genome_data.sh WILL NOT WORK).

@rehamFatima
Author

rehamFatima commented Jul 4, 2018

$ ls -l /nfs/nobackup/ensembl/reham_ens/genome/mm10/

total 4327335
drwxr-sr-x 2 reham ensembl        347 Jun 29 17:32 ataqc
drwxr-sr-x 2 reham ensembl        429 Jul  2 10:01 bowtie2_index
-rw-r--r-- 1 reham ensembl       1263 Oct 22  2016 mm10.blacklist.bed.gz
-rw-r--r-- 1 reham ensembl       1405 Jul  2 10:01 mm10.chrom.sizes
-rw-r--r-- 1 reham ensembl   26078281 Jul  2 14:22 mm10_enh_dhs.bam
-rw-r--r-- 1 reham ensembl 2785490220 Jul  2 09:46 mm10_no_alt_analysis_set_ENCODE.fasta
-rw-r--r-- 1 reham ensembl       2511 Jul  2 10:01 mm10_no_alt_analysis_set_ENCODE.fasta.fai
-rw-r--r-- 1 reham ensembl  870379095 Jan 23  2016 mm10_no_alt_analysis_set_ENCODE.fasta.gz
drwxr-sr-x 2 reham ensembl       2544 Jul  2 09:46 seq

(I have had to reinstall the genome data in another location.)

@rehamFatima
Author

$ ls -l /nfs/nobackup/ensembl/reham_ens/genome
total 62103
-rw-r--r-- 1 reham ensembl     1124 Jul  2 10:01 bds_atac_species.conf
-rwxr--r-- 1 reham ensembl 26286699 Jun 29 17:45 ERR1659027_1.fastq
-rwxr--r-- 1 reham ensembl 26286699 Jun 29 17:45 ERR1659027_2.fastq
drwxr-sr-x 5 reham ensembl      410 Jul  2 14:22 mm10

and

ls -l /homes/reham/ATAC-Seq/genomes/
total 1901192
-rwxr--r-- 1 reham ensembl 1884648321 Jun 14 14:52 ENCFF124LBK.fastq.gz
drwxr-xr-x 3 reham ensembl       4096 Jun 29 17:32 mm10
-rw-r--r-- 1 reham ensembl   26079643 Jun 14 10:59 mm10_enh_dhs.bam
-rw-r--r-- 1 reham ensembl    2342848 Jun 29 16:30 mm10_enh_dhs.bam.bai
-rw-r--r-- 1 reham ensembl   26079835 Jun 29 16:29 mm10_enh_dhs.bam.sorted

I hope these will help.
Many thanks

@leepc12
Collaborator

leepc12 commented Jul 4, 2018

Please add -species_file /nfs/nobackup/ensembl/reham_ens/genome/bds_atac_species.conf to the command line and try again.
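For reference, a species file entry generated by install_genome_data.sh looks roughly like the sketch below. Key names and exact paths are illustrative only; check the actual bds_atac_species.conf produced by the installer for the real values:

```
[mm10]
chrsz     = /nfs/nobackup/ensembl/reham_ens/genome/mm10/mm10.chrom.sizes
seq       = /nfs/nobackup/ensembl/reham_ens/genome/mm10/seq
bwt2_idx  = /nfs/nobackup/ensembl/reham_ens/genome/mm10/bowtie2_index/mm10_no_alt_analysis_set_ENCODE.fasta
blacklist = /nfs/nobackup/ensembl/reham_ens/genome/mm10/mm10.blacklist.bed.gz
```

Passing -species_file lets the pipeline resolve per-species paths (including the Bowtie2 index) that are otherwise blank, as seen in the "== species settings" section of the log above.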

@rehamFatima
Author

rehamFatima commented Jul 6, 2018

Command :

bds atac.bds -species mm10 -species_file /nfs/../genome/bds_atac_species.conf -gensz mm -nth 1 -bam1 /nfs/../genome/mm10/mm10_enh_dhs.bam -system local -no_par

Output :

== checking input files ...

Rep1 bam (PE) :
	/nfs/../genome/mm10/mm10_enh_dhs.bam
Distributing 1 to ... 
{1=1}
awk: cmd. line:1: fatal: division by zero attempted
Task failed:
	Program & line     : 'modules/postalign_bam.bds', line 341
	Task Name          : 'dedup_bam_PE_2 rep1'
	Task ID            : 'atac.bds.20180706_093826_930/task.postalign_bam.dedup_bam_PE_2_rep1.line_341.id_10'
	Task PID           : '1925'
	Task hint          : 'samtools view -F 1804 -f 2 -b /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam > /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam; sambamba index -t 1 /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam; s'
	Task resources     : 'cpus: -1	mem: -1.0 B	wall-timeout: 8640000'
	State              : 'ERROR'
	Dependency state   : 'ERROR'
	Retries available  : '1'
	Input files        : '[/homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam]'
	Output files       : '[/homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam, /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/qc/rep1/mm10_enh_dhs.nodup.flagstat.qc, /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/qc/rep1/mm10_enh_dhs.nodup.pbc.qc]'
	Script file        : '/homes/reham/ATAC-Seq/atac_dnase_pipelines/atac.bds.20180706_093826_930/task.postalign_bam.dedup_bam_PE_2_rep1.line_341.id_10.sh'
	Exit status        : '1'
	Program            : 
		
		# SYS command. line 343
		
		 if [[ -f $(which conda) && $(conda env list | grep bds_atac | wc -l) != "0" ]]; then source activate bds_atac; sleep 5; fi;  export PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)
		
		# SYS command. line 350
		
		 samtools view -F 1804 -f 2 -b /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam > /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam
		
		# SYS command. line 352
		
		 sambamba index -t 1 /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam
		
		# SYS command. line 354
		
		 sambamba flagstat -t 1 /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam > /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/qc/rep1/mm10_enh_dhs.nodup.flagstat.qc
		
		# SYS command. line 365
		
		 sambamba sort -t 1 -n /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam -o /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam.tmp.bam
		
		# SYS command. line 367
		
		 bedtools bamtobed -bedpe -i /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam.tmp.bam | \
							awk 'BEGIN{OFS="\t"}{print $1,$2,$4,$6,$9,$10}' | \
							grep -v 'chrM' | sort | uniq -c | \
							awk 'BEGIN{mt=0;m0=0;m1=0;m2=0} ($1==1){m1=m1+1} ($1==2){m2=m2+1} {m0=m0+1} {mt=mt+$1} END{m1_m2=-1.0; if(m2>0) m1_m2=m1/m2; printf "%d\t%d\t%d\t%d\t%f\t%f\t%f\n",mt,m0,m1,m2,m0/mt,m1/m0,m1_m2}' > /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/qc/rep1/mm10_enh_dhs.nodup.pbc.qc
		
		# SYS command. line 371
		
		 rm -f /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam.tmp.bam
		
		# SYS command. line 373
		
		 TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0
	StdErr (100000000 lines)  :
		awk: cmd. line:1: fatal: division by zero attempted

Fatal error: modules/postalign_bam.bds, line 378, pos 4. Task/s failed.

Creating checkpoint file: Config or command line option disabled checkpoint file creation, nothing done.
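The awk failure above can be reproduced in isolation. A minimal sketch (my simplification, not the pipeline's exact script): the PBC QC step divides by the total pair count mt, so if no properly-paired reads survive filtering, the stream feeding awk is empty, mt stays 0, and the division aborts:

```shell
# Minimal sketch of the failure mode: an empty input stream leaves mt at 0,
# and computing m0/mt then triggers "division by zero attempted".
printf '' | awk 'BEGIN{mt=0;m0=0} {m0+=1;mt+=$1} END{
  if (mt==0) { print "empty input: m0/mt would divide by zero"; exit 0 };
  printf "%f\n", m0/mt }'
```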

@leepc12
Collaborator

leepc12 commented Jul 9, 2018

Can you share your BAM files for debugging?

original:
/nfs/nobackup/ensembl/reham_ens/genome/mm10/mm10_enh_dhs.bam
processed:
/homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam.tmp.bam

Is your input BAM file filtered/deduped? Any quality problems with the input BAM file?

@rehamFatima
Author

Please find attached:

BamFiles.zip

I just used the BED files within the installed mm10 genome data and generated the BAM file using bedtools bedtobam. Could you tell me what I should be looking for in terms of quality problems with the BAM file?
Thank you

@leepc12
Collaborator

leepc12 commented Jul 10, 2018

That BED file in the mm10 genome data is not for testing purposes.

If you just want to test our pipeline, I recommend using the new pipeline: https://github.com/ENCODE-DCC/atac-seq-pipeline.
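One likely concrete reason the converted BAM fails (my reading of the logs above, not the maintainer's statement): bedtools bedtobam emits unpaired records, while the pipeline's paired-end path filters with samtools view -F 1804 -f 2, i.e. it keeps only properly-paired reads. An unpaired record never has the 0x2 bit set, so every read is discarded and the downstream PBC awk script receives an empty stream. The flag logic, sketched:

```shell
# Sketch of the SAM flag check behind "samtools view -f 2": a record converted
# from a plain BED line has FLAG 0, so the 0x2 "properly paired" bit is unset
# and the read is filtered out.
flag=0
if [ $(( flag & 2 )) -eq 0 ]; then
  echo "FLAG $flag lacks the 0x2 (properly paired) bit: dropped by -f 2"
fi
```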

@rehamFatima
Author

Thanks Jin. Could you please tell me what data set you used to generate the files on your main page? I have even tried ATAC-seq data sets from ENA.

@leepc12
Collaborator

leepc12 commented Jul 11, 2018

We recommend using our new pipeline at https://github.com/ENCODE-DCC/atac-seq-pipeline/. The new pipeline supports Docker, DNAnexus, and Google Cloud Platform, so you can avoid dependency problems.

Please read through the documentation, then find an example at /examples/scg, download data from the ENCODE portal, and try running the pipeline.

@rehamFatima
Author

rehamFatima commented Jul 16, 2018

Thank you Jin. May I ask how the ATAC-seq pipeline uses phantompeakqualtools? It seems to be failing there.
Thanks

@leepc12
Collaborator

leepc12 commented Jul 16, 2018

@rehamFatima: Sure, run_spp.R is used for cross-correlation analysis. The actual shell command lines are here: https://github.com/kundajelab/atac_dnase_pipelines/blob/master/modules/postalign_xcor.bds#L113

What kind of error did you get?

@leepc12
Collaborator

leepc12 commented Aug 16, 2018

@rehamFatima I actually don't understand why you want to test our pipeline with a BED file (or its BAM conversion), which is not designed to be a valid input to the pipeline.

The error you got in issue #128 is an input data quality problem. Please try with some samples from the ENCODE portal.

Instead of using this old pipeline, can you work with our new pipeline?

A very detailed tutorial with sample data is available at https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/hotfix_PIP-357/docs/tutorial_google.md

You may need to sign up for Google Cloud Platform.
