Walks through installation and usage of FASTQC, MultiQC, Trimmomatic, and Salmon for transcriptomic data preprocessing. Includes Grid Engine shell scripts that can be looped over many files in a directory.

julianneyang/transcriptomicsonhoffman

transcriptomicsonhoffman

After you log in to Hoffman2 and request a computational node:

Preprocessing the data

Assuming your fastq files are in your current working directory:

  1. Install FastQC (here we create a new conda env to install fastqc)
conda create -n fastqc fastqc
conda activate fastqc
  1. Inside a directory with the raw data files, run FastQC.

Interactive: creates a directory called FastQC_output and stores the FastQC reports there.

mkdir FastQC_output/
fastqc *.fastq.gz -o FastQC_output/

Job submission (recommended). Note: you may need to provide the full filepath to the FastQC job script (FastQC.sh in the command below).

qsub ../rna_scripts/FastQC.sh
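
The contents of the repository's job script are not shown in this walkthrough. As a rough illustration only, a Grid Engine job script for this step might look like the sketch below. The file name fastqc_job_sketch.sh, the log name, and the h_data/h_rt values are assumptions, not the author's settings. The snippet writes the sketch to disk and syntax-checks it without running FastQC.

```shell
# Write a hypothetical job script (NOT the repository's FastQC.sh) and syntax-check it.
cat > fastqc_job_sketch.sh <<'EOF'
#!/bin/bash
# Run the job from the directory where qsub was called.
#$ -cwd
# Merge stderr into stdout and write both to one log file (name is an assumption).
#$ -j y
#$ -o fastqc_job_sketch.log
# Memory and runtime requests; placeholder values, adjust for your data.
#$ -l h_data=4G,h_rt=2:00:00

# Make "conda activate" usable in a non-interactive shell (assumes conda is on PATH).
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate fastqc

mkdir -p FastQC_output
fastqc *.fastq.gz -o FastQC_output/
EOF

# bash -n parses the script without executing it.
bash -n fastqc_job_sketch.sh && echo "syntax OK"
```

Submitting this sketch would follow the same pattern as above (qsub followed by the path to the script), from inside the directory that holds the raw FASTQ files.
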
  1. Aggregate the quality reports for all samples with MultiQC. MultiQC takes as input a directory containing the FastQC outputs. (Note: I had issues forcing MultiQC to use Python 3.10, so I used the workaround below.)

Deactivate the old conda environment and create a new one:

conda deactivate
conda create -n multiqc

Do not use conda to download MultiQC; it installs an outdated version. Instead, I used pip to install the development version, and I forced the install into $PROJECT, which has enough space, rather than the default $HOME location:

pip install --upgrade --force-reinstall git+https://github.com/MultiQC/MultiQC.git -t /u/project/jpjacobs/jpjacobs/rna_seq/

You may need to find the exact filepath to multiqc via the following command:

which multiqc

To run interactively, replace ~/.local/bin/multiqc with the exact filepath:

python ~/.local/bin/multiqc ./

Job submission (recommended). Do this within the directory where your outputs from FastQC are located.

cd FastQC_output
qsub multiqc.sh
  1. Copy the .html report to your local machine with scp, or push it to GitHub from Hoffman2. Open the .html report in a browser. For help interpreting MultiQC results, see the following resources:

  2. Trim adapters and low-quality bases with Trimmomatic. Trimmomatic is already installed in the kneaddata env, so we activate that env. Note that you can append additional Trimmomatic parameters; the command embedded in trimmomatic.sh uses very gentle trimming parameters and removes adapters assuming the sequencer was an Illumina HiSeq.

conda activate kneaddata

Interactive:

trimmomatic PE JJ1715_393_S43_R1_001.fastq.gz JJ1715_393_S43_R2_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:/u/home/j/jpjacobs/project-jpjacobs/software_rna_seq/Trimmomatic/trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36
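
To make the trimming parameters in the interactive command easier to adjust, the sketch below assembles the same steps piece by piece and prints the resulting command instead of running it. The input and output file names and the bare adapter file name are placeholders; substitute your own sample names and the full path to TruSeq3-PE.fa.

```shell
# Placeholder adapter path; point this at the adapters folder of your Trimmomatic install.
adapters="TruSeq3-PE.fa"

# ILLUMINACLIP fields, in order:
#   adapter FASTA : seed mismatches : palindrome clip threshold :
#   simple clip threshold : minimum adapter length : keepBothReads
clip="ILLUMINACLIP:${adapters}:2:30:10:2:True"
lead="LEADING:3"     # trim bases below quality 3 from the start of each read
trail="TRAILING:3"   # trim bases below quality 3 from the end of each read
minlen="MINLEN:36"   # drop reads shorter than 36 bases after trimming

# Hypothetical file names: two inputs, then paired/unpaired outputs for each direction.
inputs="sample_R1_001.fastq.gz sample_R2_001.fastq.gz"
outputs="fwd_paired.fq.gz fwd_unpaired.fq.gz rev_paired.fq.gz rev_unpaired.fq.gz"

trim_cmd="trimmomatic PE $inputs $outputs $clip $lead $trail $minlen"
echo "$trim_cmd"
```

Stricter trimming would mean raising the LEADING, TRAILING, or MINLEN values, or appending further Trimmomatic steps after them.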

Job submission for many files (assumes you are in the directory where your raw FASTQ files are located). You may need to change the filepath so that it points to the Trimmomatic job script (3-trimmomatic.sh in the command below):

for f in *R1_001.fastq.gz; do name=$(basename "$f" R1_001.fastq.gz); qsub ../../../software_rna_seq/rna_scripts/3-trimmomatic.sh "${name}R1_001.fastq.gz" "${name}R2_001.fastq.gz"; done
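
Before submitting one job per sample, it can help to confirm that every forward-read file has a matching reverse-read file. The helper below is a sketch that uses the same R1_001/R2_001 naming as the loop above; the function name list_read_pairs is made up for this example. It only prints file names and submits nothing.

```shell
# Print "R1<TAB>R2" for each forward-read file in a directory.
# Files without a matching R2 are reported on stderr and skipped.
list_read_pairs() {
  dir="$1"
  for r1 in "$dir"/*R1_001.fastq.gz; do
    [ -e "$r1" ] || continue                       # the glob matched nothing
    prefix=$(basename "$r1" R1_001.fastq.gz)       # sample prefix, as in the qsub loop
    r2="$dir/${prefix}R2_001.fastq.gz"
    if [ -e "$r2" ]; then
      printf '%s\t%s\n' "$(basename "$r1")" "$(basename "$r2")"
    else
      echo "missing mate for $r1" >&2
    fi
  done
}

# Check the current directory.
list_read_pairs .
```

If the list looks right and nothing is reported as missing, the qsub loop can be run as written.
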
  1. Install salmon. I downloaded salmon-1.10.0_linux_x86_64.tar.gz from https://github.com/COMBINE-lab/salmon/releases to the software_rna_seq folder, then unpacked it with tar:
tar xzvf salmon-1.10.0_linux_x86_64.tar.gz
  1. Use salmon to index the mouse transcriptome

Download the transcriptome file (I tried GENCODE first but got a lot of warnings, so I switched to Ensembl). Note that I've provided these files in this repo:

wget http://ftp.ensembl.org/pub/release-111/fasta/mus_musculus_c57bl6nj/cdna/Mus_musculus_c57bl6nj.C57BL_6NJ_v1.cdna.all.fa.gz

Download annotation file. Note I've provided this in the repo:

wget http://ftp.ensembl.org/pub/release-111/gtf/mus_musculus_c57bl6nj/Mus_musculus_c57bl6nj.C57BL_6NJ_v1.111.gtf.gz

Index transcriptome file. Note, I've provided it in this repo but feel free to build your own or update as new releases come out:

cd /u/home/j/jpjacobs/project-jpjacobs/software_rna_seq/salmon/salmon-latest_linux_x86_64
bin/salmon index -t Mus_musculus_c57bl6nj.C57BL_6NJ_v1.cdna.all.fa.gz -i Mus_musculus_c57bl6nj_index -p 8
  1. Run salmon on trimmed fastq files:
../salmon/salmon-latest_linux_x86_64/bin/salmon quant -i ../salmon/salmon-latest_linux_x86_64/Mus_musculus_c57bl6nj_index -l A -1 output_JJ1715_393_S43_R1_001.fastq_paired.fq.gz -2 output_JJ1715_393_S43_R2_001.fastq_paired.fq.gz -p 8 --gcBias --validateMappings -o JJ1715_393_quant

Job submission (Recommended)

for f in *R1_001.fastq_paired.fq.gz; do name=$(basename "$f" R1_001.fastq_paired.fq.gz); qsub ../rna_scripts/salmon.sh "${name}R1_001.fastq_paired.fq.gz" "${name}R2_001.fastq_paired.fq.gz"; done
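
Once the jobs finish, each salmon output directory should contain a quant.sf file. The sketch below lists output directories where that file is missing or empty. It assumes the directories end in _quant, as in the interactive example (JJ1715_393_quant); the job script may name them differently. The function name missing_quant is made up for this example.

```shell
# Print every *_quant directory under "$1" that lacks a non-empty quant.sf.
missing_quant() {
  dir="$1"
  for q in "$dir"/*_quant; do
    [ -d "$q" ] || continue             # the glob matched nothing, or not a directory
    [ -s "$q/quant.sf" ] || echo "$q"   # -s: file exists and is not empty
  done
}

# Check the current directory; no output means no problems were found.
missing_quant .
```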

Generating a count matrix

  1. Follow the instructions in tximport.R and txmeta.R to generate TPM/count matrices and gene-level annotations.
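
Gene-level summarization with tximport needs a table that maps transcript IDs to gene IDs. How the repository's R scripts build that table is not shown here. The sketch below is one possible way to derive it from the GTF downloaded earlier, using awk; the function name gtf_to_tx2gene and the output file name tx2gene.tsv are made up for this example. Transcript names in salmon's output come from the FASTA headers and may carry version suffixes that the GTF transcript_id lacks, so check that the IDs match (tximport has an ignoreTxVersion argument for this case).

```shell
# Read a GTF on stdin and print "transcript_id<TAB>gene_id" for each transcript record.
gtf_to_tx2gene() {
  awk -F '\t' '
    $3 == "transcript" {
      tx = ""; gene = ""
      # Column 9 holds attributes such as: gene_id "X"; transcript_id "Y";
      if (match($9, /transcript_id "[^"]+"/)) tx = substr($9, RSTART + 15, RLENGTH - 16)
      if (match($9, /gene_id "[^"]+"/))       gene = substr($9, RSTART + 9, RLENGTH - 10)
      if (tx != "" && gene != "") print tx "\t" gene
    }'
}

# Tiny made-up record to show the input format.
printf '1\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' | gtf_to_tx2gene

# Usage with the real annotation file:
# gzip -dc Mus_musculus_c57bl6nj.C57BL_6NJ_v1.111.gtf.gz | gtf_to_tx2gene > tx2gene.tsv
```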
