Skip to content

Latest commit

 

History

History
275 lines (180 loc) · 8.5 KB

README.md

File metadata and controls

275 lines (180 loc) · 8.5 KB

Methods for methylation and accessible chromatin region analyses:

Steps

  1. Download the genome fasta file and annotation files and creates the directory structure described in the script, using the file.

  2. Process the NAM genomes

    • alters the chromosome names in the NAM genome fasta files
    • creates a bedtools genome file for use in bedtools
    • creates bed files of 100 bp windows, sliding 20 bp, genome-wide
    • identifies all contiguous sequences (contigs) across the chromosomes
    • this script works in conjunction with the directory structure created by download_genomes.sh

Scripts:

Additional files:

Additional software:

  • bioawk 1.0
  • Biopython 1.68
  • BEDTools 2.28.0
  1. Process the NAM gene annotations

    • converts gff gene annotation to bed file
    • extracts gene information
    • change chromosome names
    • identify the chromosomes and contigs that have annotated genes and those that do not have annotated genes

Scripts:

  • process_gene_annotations.sh (provided here on github)

Additional files:

  • NAM_founder_genome_list.txt (provided here on github)

Additional software:

  • BEDTools 2.28.0
  1. Create Bowtie2 index files for all the NAM founders

    • creates a directory for each NAM founder which contains the index files for aligning reads with Bowtie2
    • Creates a separate script for each NAM founder and submits it separately

Scripts:

  • create_bowtie2_indices.NAM.v1.sh (provided here on github)

Additional files:

  • NAM_founder_genome_list.txt (provided here on github)

Additional software:

  • Bowtie2 2.3.5.1
  1. Create files of all cytosine coordinates in the NAM genomes

    • A separate script is created for each NAM founder and submitted separately
    • Split chromosomes into separate files for running fuznuc
    • Extract CG, CHG, and CHH coordinates from the genome fasta file
    • concatenate the split chromosomes back together to create single genome-wide file for each sequence context and NAM founder
    • this concatenated file is sorted and ready to be joined to the methylome data (which is done in a separate script)

Scripts:

  • Extract_all_Cs_NAM_genomes.v4.sh (provided here on github)

Additional files:

  • NAM_founder_genome_list.txt

Additional software:

  • pyfaidx 0.5.5.1
  • EMBOSS 6.6.0
  1. Processing the methylome CGmap files

    • A separate script is created for each tissue and submitted separately.
    • the CGmap files are listed in field 5 of the NAM_methylome_list_merge.v2.txt file.
    • These CGmap files are located in the CGmap_home directory
    • Convert CGmap files to allc bed files
    • correct scaffold names to match the reference genome
    • Merge the allc file biological replicates into single files
    • convert the allc file to a bed file
    • split the bed file by sequence contexts
    • Create the files containing coverage for all Cs in the genome, split into the three contexts, and also CHGCGN
    • Create coverage files for the three contexts

Scripts:

  • convert_CGmap_to_allc_then_merge_then_convert_bed.v2.sh (provided here on github)

Additional files:

  • NAM_methylome_list_merge.v2.txt (provided here on github)
  • CGmap files (generated from earlier steps in the methods)

Additional software:

  • methylpy 1.3.2
  • BEDTools 2.28.0
  • kenttools 1.04.00
  1. Call the unmethylated regions (UMRs)
    • Each NAM founder, mapped to both B73 and to its cognate reference genome, are submitted as separate scripts
    • submit_UMR_script.v2.sh is used for submitting the separate scripts
    • three variables are assigned in submit_UMR_script.v2.sh: $GENOME, $out_dir, and $prefix
    • the UMR_algorithm_all_contexts_v1.3.NAM_founders.sh scripts calls the UMRs in each sample
    • UMRs are called in both the CHG and CHG/CGN contexts, then merged together.
    • UMRs are split into those smaller than and larger than 150 bp
    • output file: ${out_dir}/${prefix}.final_UMRs.chg.chgcgn.genic-prox-dist.${size}.bed
    • output file description:
      field1: chrom
      field2: UMR start
      field3: UMR end
      field4: methylation count
      field5: read coverage count
      field6: informative cytosine count
      field7: percent methylation
      field8: distance of UMR to nearest annotated gene
      field9: "genic", "proximal", or "distal" designation

Scripts:

  • submit_UMR_script.v2.sh (provided here on github)
  • UMR_algorithm_all_contexts_v1.3.NAM_founders.sh (provided here on github)

Additional files:

  • NAM_methylome_list_merge.v2.txt (provided here on github)
  • allc.bed files generated in step 6
  • files of total cytosine coverage, generated in step 6
  • 100 bp sliding window bed files generated in step 2
  • bedtools genome files generated in step 2
  • list of contigs lacking genes, generated in step 2
  • bed files of genome N-gaps, generated in step 2
  • bed files of gene annotations, generated in step 3

Additional software:

  • BEDTools 2.28.0
  1. Clean up the UMR files
    • Remove the blacklisted regions from the previously called UMRs
    • Rename the UMRs to the standardized naming scheme
    • Convert the files to the formal BED format

Scripts:

  • remove_blacklist.change_filenames.convert_bed.NAM_UMRs.v1.sh (provided here on github)

Additional files:

  • methylomes_table_cross_ref.txt
  • UMR files generated in step 7
  • Blacklist files

Additional software:

  • BEDTools 2.28.0
  1. Calling differentially methylated regions (DMRs) and conserved UMRs

    • A separate script is generated and submitted for each NAM line
    • The coordinates of UMRs in B73 are split into quarters
    • The mCHG data from allc files in each NAM line is mapped to the B73 UMR quarters
    • Quarters meeting the mapping criteria are kept
    • B73 UMRs with all four quartiles below 20% mCHG or above 50% mCHG are kept, and designated conserved hypomethylated, or DMR, respectively

Scripts:

  • extract_NAM_DMRs.v1.sh (provided here on github)

Additional files:

  • NAM_mapping_list.txt (provided here on github)
  • 7-column UMR bed file generated in step 7
  • meth_*.ref_B73.allc.bed.chg file generated in step 6

Additional software:

  • BEDTools 2.29.2
  1. The initial ATAC-seq read processing
    • Run each ATAC rep (fastq file) through fastp in paired-end mode

Scripts:

  • ATAC_initial_read_process.NAM.v1.sh (provided here on github)

Additional files:

  • genome_mapping_combinations.txt (provided here on github)

Additional software:

  • fastp 0.20.0
  1. Map the ATAC-seq reads with bowtie2
    • Align each ATAC-seq replicate to the B73 reference genome and also the cognate reference genome
    • Each replicate is submitted as a separate script

Scripts:

  • ATAC_map_reads.NAM.v1.sh (provided here on github)

Additional files:

  • genome_mapping_combinations.txt (provided here on github)

Additional software:

  • Bowtie2 2.3.5.1
  1. Filter the aligned ATAC-seq reads
    • This script creates and submits a separate script for each NAM line, mapped to B73 reference and its cognate reference
    • Convert sam file to bam file
    • remove duplicates
    • filter by MAPQ >= 30
    • Create 1 bp resolution bigwig file

Scripts:

  • ATAC_process_bams_v3.sh (provided here on github)

Additional files:

  • genome_mapping_combinations.txt (provided here on github)

Additional software:

  • deepTools 3.3.1
  • SAMtools 1.10
  • sambamba 0.7.1
  1. Call accessible chromatin regions (ACRs) with MACS2 and quantify their overlap with UMRs
    • Call peaks on ATAC data with MACS2 using paired-end mode, q-value cutoff of 0.005, and effective genome size of 1.8e+9 bp
    • remove ACRs overlapping blacklisted N-gap regions
    • designate ACR as genic, proximal, or distal, depending on proximity to nearest annotated gene
    • calculate the percent overlap of ACRs with UMRs

Scripts:

  • macs2_call_acrs.nam_lines.v5.sh (provided here on github)

Additional files:

  • genome_mapping_combinations.txt (provided here on github)
  • UMR bed files generated in step 8
  • ATAC-seq bam files generated in step 11
  • list of contigs lacking genes, generated in step 2
  • bed files of genome N-gaps, generated in step 2
  • bed files of gene annotations, generated in step 3

Additional software:

  • MACS2 2.2.7.1
  • BEDTools 2.29.2