Skip to content

An advanced genotyping pipeline by next-generation sequencing (NGS) technology

Notifications You must be signed in to change notification settings

Hanryi/MarkerTech

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Marker Tech

platform copyright

A cost-saving genotyping pattern based on the next-generation sequencing (NGS) technology. We also build an online analysis system which provides several services for using the technology in an easier way. However, in order to meet diverse and highly customized needs, the main source codes are provided for users to modify as needed. The code we provide include primer design function and iBP-seq (improved Bulked-PCR Sequencing) analysis pipeline which can do both genotyping and epigenotyping. The primer design function generates special primer sequences flanking the maker for each template sequence. The NGS data for the sequence analysis pipeline should be generated by using the primer design function.

Preparation

Primer design

  1. Environment
    • python>=3.6.0
      • primer3-py>=0.6.1
  2. Data
    • a template sequence file in fasta format containing template sequences (header format: >Name|301-304; Sequence length>=601)

iBP-seq (genotyping) analysis

Due to the limitation of some modules, such as pysam, we recommend running these main codes on a Linux operating system when you use them.

Both the genotyping service for iBP-seq and epigenotyping service require these following configurations:

  1. Environment
    • python>=3.6.0
      • openpyxl>=3.0.7
      • XlsxWriter>=1.4.0
      • seaborn
      • numpy
      • pandas
      • matplotlib==3.4.3
      • sklearn
      • pysam==0.16.0.1
    • fastp>=0.21.0
    • bwa>=0.7.17
    • samtools>=1.12
  2. Data
    • a pair of paired-end sequencing data in fastq format
    • a barcode file which order is consistent with the actual order of use (separator='\n')
    • a mutation table with header include four columns (CHROM, POS, REF, ALT)
    • optional: a reference sequence file in fasta format

iBP-seq (epigenotyping) analysis

  1. Environment
    • python>=3.6.0
      • openpyxl>=3.0.7
      • XlsxWriter>=1.4.0
      • seaborn
      • numpy
      • pandas
      • matplotlib==3.4.3
      • sklearn
      • pysam==0.16.0.1
    • fastp>=0.21.0
    • bsmark>=0.23.1
    • samtools>=1.12

Ready to work

Before starting the analysis, if the sequencing file is in .gz compressed format, it also needs to be decompressed.

gzip -dk <filename>.fq.gz

And make sure you have the following empty directories in the working directory. If not, create them first.

Genotyping :

mkdir qc fq umi bam csv xlsx png

Epigenotyping:

mkdir qc fq umi sam bam csv xlsx png

Usage

Primer design

The input parameters of PrimerDesigner.py are: strategy code name (iBP-seq: 3), input template file, file name (including .xlsx suffix), shortest primer length, optimal primer length, longest primer length, shortest amplicon length, optimal amplicon length, longest amplicon length, minimum Tm of primer, optimal Tm of primer, maximum Tm of primer.

An example of the command line is as follows:

python3 PrimerDesigner.py 3 testSeq.fa ./filename.xlsx 18 20 25 280 286 320 50 60 65

iBP-seq (genotyping) analysis

Quality control

Filter out low quality (lower than 20) and sequence length less than 72bp (2*20bp primer + 6bp UMI + 18Bbp bridge + 8bp barcode = 72bp) reads.

fastp -i <filename_1>.fq -I <filename_2>.fq -o qc/clean_1.fq -O qc/clean_2.fq -q 20 --length_required 72 -j qc/<filename>.json -h qc/<filename>.html

Data division

Divide the mixed sequencing data to several files by using the barcode file and build-in bridge sequence (selected by the first parameter). The number of barcodes in the file decides the number of divided samples and the order of barcodes decides the order of each sample. Pair-end sequencing genome sequences are stored as pairs of fastq files in the fq directory and UMI sequences are stored as single fastq files in the umi directory.

python3 SeqPurifier.py iBP barcode/<barcode>.txt qc/clean_1.fq qc/clean_2.fq fq umi

Reference alignment

Align sequence to reference sequence or genome.

for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
	bwa mem -t 4 -M <Reference>.fa fq/${f}_1.fq fq/${f}_2.fq | samtools view -S -b - | samtools sort -o bam/${f}.sorted.bam -;
done

Build index for sorted bam file.

for b in $(ls bam/*);
do
	samtools index ${b};
done

Marker frequency

Calculate each mutated marker frequency for each sample.

python3.7 AltFraction.py iBP barcode/<barcode>.txt mutation/<mut>.csv bam umi csv

KDE Cluster

Identify different genotypes or methylation types by the minimum point of kernel density estimation (KDE) from the sample population.

for t in $(ls csv/*);
do
	python3.7 Minima.py ${t} xlsx png;
done

iBP-seq (epigenotyping) analysis

This pipeline is an application of iBP-seq (genotyping) method in processing methylation data.

Quality control

fastp -i <filename_1>.fq -I <filename_2>.fq -o qc/clean_1.fq -O qc/clean_2.fq -q 20 --length_required 72 -j qc/<filename>.json -h qc/<filename>.html

Data division

python3 SeqPurifier.py mBP barcode/<barcode>.txt qc/clean_1.fq qc/clean_2.fq fq umi

Reference alignment

Reference split
python3 MethReference.py barcode/<barcode>.txt ref/<ref>.fa
# Build index
for i in $(ls -d ref/*);
do
	bismark_genome_preparation ${i}/ ;
done
Alignment
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
	ref_name=$(sed -n $[${f}+1]p barcode/<barcode>.txt | sed "s/.*,//" | sed "s/\r//");
	lineage=$(python3 -c "print('/'.join(sorted('${ref_name}'.split('/'))))" | sed "s/\//_/g");
	bismark -N 0 -L 20 --bowtie2 --non_bs_mm /<abs_path>/ref/"${lineage}"/ --sam -o sam/ --q -1 fq/${f}_1.fq -2 fq/${f}_2.fq;
done

SAM processing

# Text processing
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++)); 
do
    grep "@" sam/${f}_1_bismark_bt2_pe.sam > sam/${f}-header;
    grep "XA:Z:0" sam/${f}_1_bismark_bt2_pe.sam > sam/${f}-temp1;
    grep "XB:Z:0" sam/${f}_1_bismark_bt2_pe.sam > sam/${f}-temp2;
    awk 'NR==FNR {f1[$1]=$0;next} {idx=$1;if(idx in f1)print$0}' sam/${f}-temp1 sam/${f}-temp2 > sam/${f}-temp4;
    awk 'NR==FNR {f1[$1]=$0;next} {idx=$1;if(idx in f1)print$0}' sam/${f}-temp2 sam/${f}-temp1 > sam/${f}-temp3;
    cat sam/${f}-header sam/${f}-temp3 sam/${f}-temp4 > sam/${f}_1_bismark_bt2_pe_nonmismatch.sam; 
done

# SAM to BAM
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
    samtools sort -n -O bam -o bam/${f}_1_bismark_bt2_pe_nonmismatch.bam sam/${f}_1_bismark_bt2_pe_nonmismatch.sam;
    lineage=$(sed -n $[${f}+1]p barcode/barcode.txt | sed "s/.*,//" | sed "s/\r//" | sed "s/\//_/g");
    bismark_methylation_extractor --no_overlap --comprehensive --bedGraph --cytosine_report --CX_context --genome_folder /<abs_path>/ref/${lineage}/ -p bam/${f}_1_bismark_bt2_pe_nonmismatch.bam -o bam/;
    gunzip bam/${f}_1_bismark_bt2_pe_nonmismatch.bismark.cov.gz;
    python3 magic.py bam/${f}_1_bismark_bt2_pe_nonmismatch.bismark.cov bam/${f}_1_bismark_bt2_pe_nonmismatch.CX_report.txt Me-information/${f}_result;done
    samtools sort bam/${f}_1_bismark_bt2_pe_nonmismatch.bam -o bam/${f}_1_bismark_bt2_pe_nonmismatch.sorted.bam;
    samtools index bam/${f}_1_bismark_bt2_pe_nonmismatch.sorted.bam;
done

Methylation activity

for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
	python3 MethFraction.py umi/${f}.fq bam/${f}_1_bismark_bt2_pe_nonmismatch.sorted.bam Me-information/${f}_result csv/${f}.csv;
done

Histogram plot

for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
	python3 MethHist.py csv/${f}.csv png;
done

Copyright

copyright (c) Li Lab of HZAU, All rights reserved.