annedodson/smallRNA-seq

C. elegans small RNA-seq analysis

This pipeline analyzes next-generation sequencing data from small RNAs in C. elegans. It has two major parts:

  1. Generate count matrices. Trims reads and maps them to the C. elegans genome, then counts the reads that map antisense to each gene to produce count matrices. This part of the pipeline is designed to run on a Linux-based high-performance computing cluster that uses Slurm.

    • generate_count_matrices.sh is the main file for Part 1 and calls all the other Part 1 scripts. On line 6 of generate_count_matrices.sh, specify the full pathname of your project directory:

       # Assign a variable to the pathname of the project (this is the main directory). Change this to fit your own path.
       main_dir=<project pathname>
      
    • Then execute lines 8-16 of generate_count_matrices.sh to generate the following directory structure:

       project name
         ├── logs
         ├── meta
         ├── raw_data
         ├── results
         └── scripts
      
    • Before continuing with the rest of generate_count_matrices.sh, make sure that:

      • You've copied your raw, demultiplexed fastq files into the raw_data directory.
      • All Part 1 scripts are in the scripts directory.
      • Your metadata file metadata.txt is in the meta directory. Column 1 of metadata.txt must contain the desired output filename, and there must also be a column containing the input filename. See metadata.txt in this repository for an example.
    • Note that this pipeline assumes each read begins with a 4-nucleotide barcode at the 5' end. If your reads have no 5' barcode and instead begin immediately with the insert, make the following two changes:

      • Change line 36 in select_5prime_barcode.sh from:

         grep -B 1 -A 2 -e ^$barcode1 -e ^$barcode2 $input_path$input_file | sed '/^--/d' > $output_path$new_name

        to:

         cp $input_path$input_file $output_path$new_name

        With this change, running select_5prime_barcode.sh will simply assign new, meaningful names to the fastq files using metadata.txt and place them in a new directory in results called sort_5prime.

      • Change line 23 in trim_5prime.sh from:

         cutadapt -u 4 -o $output $1 > ${2}/logs/trim5/${base}.txt

        to:

         cp $1 $output

        With this change, running trim_5prime.sh will simply copy the fastq files into a new directory in results called trim3_trim5 and add "_trim5" to the end of each filename.
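
The Part 1 setup steps (create the project directories, then confirm the inputs are in place) can be sketched as below. This is an illustration only and is not the code in generate_count_matrices.sh. The function names make_project_dirs and check_prerequisites are hypothetical, the fastq filename pattern is an assumption, and only the Part 1 scripts named in this README are checked.

```shell
#!/bin/sh
# Sketch only; not the code in generate_count_matrices.sh.
# Function names are hypothetical.

# Create the five project subdirectories under the given main directory.
make_project_dirs() {
    for sub in logs meta raw_data results scripts; do
        mkdir -p "$1/$sub" || return 1
    done
}

# Print each missing prerequisite; return non-zero if anything is missing.
check_prerequisites() {
    proj=$1
    missing=0
    # Assumption: raw read files match *.fastq* (e.g. .fastq or .fastq.gz).
    if [ -z "$(find "$proj/raw_data" -maxdepth 1 -name '*.fastq*' 2>/dev/null | head -n 1)" ]; then
        echo "No fastq files found in $proj/raw_data"
        missing=1
    fi
    if [ ! -f "$proj/meta/metadata.txt" ]; then
        echo "metadata.txt not found in $proj/meta"
        missing=1
    fi
    # Only the Part 1 scripts named in this README are checked here.
    for script in select_5prime_barcode.sh trim_5prime.sh; do
        if [ ! -f "$proj/scripts/$script" ]; then
            echo "$script not found in $proj/scripts"
            missing=1
        fi
    done
    return "$missing"
}

# Demo in a throwaway directory; replace with your own project pathname.
main_dir=$(mktemp -d)
make_project_dirs "$main_dir"
check_prerequisites "$main_dir" || echo "Project is not ready yet."
```

Run on an empty project, the check lists everything that still needs to be copied in; once the fastq files, metadata.txt, and scripts are in place it prints nothing and returns zero.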

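For the 5' barcode handling described in Part 1, the toy example below shows the effect of the two steps on a three-read fastq file: reads are kept only if they start with one of two barcodes, and the first 4 nucleotides are then removed. The barcode values and reads are made up for illustration, and awk is used as a stand-in for cutadapt -u 4 so that the example runs without cutadapt installed.

```shell
#!/bin/sh
# Sketch only: toy data and barcode values are made up for illustration.
barcode1=AGCG
barcode2=CGTC

workdir=$(mktemp -d)
cat > "$workdir/reads.fastq" <<'EOF'
@read1
AGCGTTTGGGAACC
+
IIIIIIIIIIIIII
@read2
TTTTACGTACGTAC
+
IIIIIIIIIIIIII
@read3
CGTCGATTACAGGA
+
IIIIIIIIIIIIII
EOF

# Step 1: keep records whose sequence line starts with either barcode.
# Same grep/sed pattern as quoted above, with the variables quoted.
grep -B 1 -A 2 -e "^$barcode1" -e "^$barcode2" "$workdir/reads.fastq" \
    | sed '/^--/d' > "$workdir/selected.fastq"

# Step 2: drop the first 4 characters of each sequence and quality line.
# awk stands in for `cutadapt -u 4` here.
awk 'NR % 4 == 2 || NR % 4 == 0 { print substr($0, 5); next } { print }' \
    "$workdir/selected.fastq" > "$workdir/selected_trim5.fastq"

cat "$workdir/selected_trim5.fastq"
```

Here read2 is discarded because it starts with neither barcode, and read1 and read3 are kept with 10-nucleotide sequences (TTTGGGAACC and GATTACAGGA) and matching 10-character quality strings.
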
  2. Differential analysis and visualization. Uses the count matrices generated in Part 1 to perform a simple wild-type vs. mutant comparison that identifies genes differentially targeted by small RNAs. This part of the pipeline is designed to run as an RStudio project (DA_and_visualization.Rproj).

    • main_script.R is the main file for Part 2.

    • Before beginning, make sure the count matrices are in the data directory and that the metadata file (.csv format) is in the meta directory. Examples of these files can be found in the example_files directory.

    • Part 2 outputs include the following:

      • A table of normalized counts (median of ratios method)
      • A list of differentially targeted genes and their corresponding log2 fold changes and adjusted p-values
      • A biplot of the top two principal components determined by principal component analysis
      • A volcano plot of log2 fold change vs. significance, with labels for the top 10 significant genes
      • The option to plot normalized counts for any given gene (specified by WormBase Gene ID)
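
For readers unfamiliar with the median of ratios method named in the Part 2 outputs, the normalization can be summarized as follows. This is the standard definition behind DESeq2's default size-factor estimation; the notation is introduced here for illustration, and it is an assumption that main_script.R uses the DESeq2 defaults. Let k_ij be the raw count for gene i in sample j, with m samples in total.

```latex
g_i = \left( \prod_{v=1}^{m} k_{iv} \right)^{1/m}, \qquad
s_j = \operatorname*{median}_{i \,:\, g_i > 0} \frac{k_{ij}}{g_i}, \qquad
\tilde{k}_{ij} = \frac{k_{ij}}{s_j}
```

Here g_i is the geometric mean of gene i across samples, s_j is the size factor for sample j (genes with a zero count in any sample have g_i = 0 and are excluded from the median), and the normalized count is the raw count divided by the sample's size factor.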

Software requirements

Software        Version   Used in
gcc             6.2.0     Part 1: Generate count matrices
python          2.7.12    Part 1: Generate count matrices
cutadapt        1.14      Part 1: Generate count matrices
fastqc          0.11.5    Part 1: Generate count matrices
bowtie          1.2.2     Part 1: Generate count matrices
samtools        1.9       Part 1: Generate count matrices
deeptools       3.0.2     Part 1: Generate count matrices
featureCounts   2.0.0     Part 1: Generate count matrices
R               3.5.1     Part 2: Differential analysis and visualization
DESeq2          1.22.2    Part 2: Differential analysis and visualization
tidyverse       1.2.1     Part 2: Differential analysis and visualization
ggrepel         0.8.1     Part 2: Differential analysis and visualization
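
Before running the pipeline, it may help to confirm that the command-line tools listed above are available. The sketch below only checks whether each executable is on PATH; it does not compare versions and does not check the R packages. The function name missing_tools is hypothetical, and the executable names are assumptions (bamCoverage stands in for deeptools, and Rscript for R).

```shell
#!/bin/sh
# Sketch only: reports which command-line tools are not on PATH.
# It does not compare versions against the table above.

# Print the names of commands from the argument list that are not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names are assumptions: bamCoverage stands in for deeptools,
# and Rscript for R.
missing=$(missing_tools gcc python cutadapt fastqc bowtie samtools bamCoverage featureCounts Rscript)

if [ -z "$missing" ]; then
    echo "All listed tools were found on PATH."
else
    echo "Not found on PATH:"
    echo "$missing"
fi
```

On a cluster, tools may only appear on PATH after the relevant environment has been loaded, so a tool reported as missing is not necessarily uninstalled.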
