
LISEQ-codes

The 'LISEQ-codes' project is a repository of the different codes used to produce the results presented in the EFSA external report "Closing gaps for performing a risk assessment on Listeria monocytogenes in ready-to-eat (RTE) foods: activity 3, the comparison of isolates from different compartments along the food chain, and from humans using whole genome sequencing (WGS) analysis". The report can be found in EFSA Supporting Publications: http://onlinelibrary.wiley.com/doi/10.2903/sp.efsa.2017.EN-1151/full.

The project has involved four institutes: Statens Serum Institut, Public Health England, Anses and University of Aberdeen.

The main objective of the study was to compare L. monocytogenes isolates collected in the EU from ready-to-eat (RTE) foods, from compartments along the food chain and from human cases, using whole genome sequencing (WGS). In order to trace and share the methodology used to produce the results, the source codes are provided.

The global methodology is described below; the input data files and source codes can be found in the 'Chapter x/' directories (the names of these directories correspond to the chapter numbers of the EFSA external report).

You can find the latest version of the tool at https://github.com/lguillier/LISEQ-codes
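
The whole repository, including the per-chapter input files and scripts, can be retrieved from a shell with git, for example:

# Clone the repository with all chapter directories
git clone https://github.com/lguillier/LISEQ-codes.git
cd LISEQ-codes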

Chapter 3

Bioinformatic workflows can be accessed on the PHE bioinformatics GitHub: https://github.com/phe-bioinformatics

Chapter 4

The trees generated in this chapter were built with RAxML (A. Stamatakis: "RAxML Version 8: A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies", Bioinformatics, 2014).

The SNP multifasta files used are available at https://github.com/lguillier/LISEQ-codes/tree/master/Chapter4. The following command lines were used to produce the Newick files (the number of threads, -T, and the number of bootstrap replicates, -N, should be adapted).

## CC101
raxml -T 7 -A S16 -m GTRGAMMA -s 2016-06-08.listeria_cc101_denovo_snps.CC101-list.txt.fa -n CC101_100BS-GAMMA -f a -x 12345 -N 100 -p 12345  -k
cp RAxML_bestTree.CC101_100BS-GAMMA RAxML_bestTree.CC101_100BS-GAMMA.nwk

## CC121
raxml -T 6 -A S16 -m GTRGAMMA -s 2016-06-08.listeria_cc121_snps.CC121-list.txt.fa -n CC121_1000BS-GAMMA -f a -x 12345 -N 1000 -p 12345  -k
cp RAxML_bestTree.CC121_1000BS-GAMMA RAxML_bestTree.CC121_1000BS-GAMMA.nwk

## CC4
raxml -T 30 -A S16 -m GTRGAMMA -s 2016-06-08.listeria_cc4_snps.CC4-list.txt.fa -n CC4_5000BS-GAMMA -f a -x 12345 -N 5000 -p 12345  -k
cp RAxML_bestTree.CC4_5000BS-GAMMA RAxML_bestTree.CC4_5000BS-GAMMA.nwk

## CC8
raxml -T 35 -A S16 -m GTRGAMMA -s 2016-06-08.listeria_cc8_snps.CC8-list.txt.fa -n CC8_2500BS-GAMMA -f a -x 12345 -N 2500 -p 12345  -k
cp RAxML_bestTree.CC8_2500BS-GAMMA RAxML_bestTree.CC8_2500BS-GAMMA.nwk

## CC9
raxml -T 35 -A S16 -m GTRGAMMA -s 2016-06-08.listeria_cc9_snps.CC9-list.txt.fa -n CC9_2500BS-GAMMA -f a -x 12345 -N 2500 -p 12345  -k
cp RAxML_bestTree.CC9_2500BS-GAMMA RAxML_bestTree.CC9_2500BS-GAMMA.nwk

The '.nwk' files were then labelled and coloured with the R script provided (all elements are in the Chapter4 folder: https://github.com/lguillier/LISEQ-codes/tree/master/Chapter4).

Chapter 7

Five attribution models were used in the study. They are listed below together with links to the corresponding programs.

The Dutch model

An example of the Dutch model for 30-locus rMLST, 864 isolates and 10,000 runs is given in "DutchModelrMLST_v5_SampleSizeCorrection_Attribution_Human.7z", available at https://github.com/lguillier/LISEQ-codes/tree/master/Chapter7. The archive needs to be extracted with 7-Zip-compatible software, which produces an .xlsm Excel file. The program runs as VBA macros in Excel. The input data have to be placed in the "Program" spreadsheet, starting with column "X", in the format given in the example. The attribution scores are displayed in columns J, K, L, …, depending on the number of sources. These scores have to be copied to the "Results" spreadsheet, where the attribution graphic is displayed.
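
For example, on a Linux system with the p7zip command-line tools installed (an assumption; any 7-Zip-compatible tool will do), the archive can be extracted with:

# Extract the Dutch model .xlsm workbook from the 7z archive
7z x DutchModelrMLST_v5_SampleSizeCorrection_Attribution_Human.7z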

The modified Hald model

The modified Hald model runs under WinBUGS, which can be downloaded from http://www.mrc-bsu.cam.ac.uk/software/bugs/the-bugs-project-winbugs/. The prevalence sub-model, "Attribution_Listeria_Prevalence_Hald_EFSA.odc", has to be run first and its results have to be fed into the main model, "Attribution_Listeria_RealModel_Hald_EFSA.odc". Both files are available at https://github.com/lguillier/LISEQ-codes/tree/master/Chapter7.

STRUCTURE model

The program can be downloaded from http://pritchardlab.stanford.edu/structure.html

The Asymmetric Island (AI) model (iSource)

The program can be downloaded from http://www.danielwilson.me.uk/iSource.html

The Aberdeen model

The model was implemented in Mathematica and can be run on any Linux system with Mathematica installed. An example of the model and the associated files ("AbdnAttribution_Mathematica.tgz") are available at https://github.com/lguillier/LISEQ-codes/tree/master/Chapter7. Instructions on how to install the files and run the model are given below.

*** Installation:

  • Uncompress the file in the directory of your choice. In a Unix shell:
   $ tar -xzf AbdnAttribution_Mathematica.tgz
  • Enter the program directory:
	 $ cd AbdnAttribution_Mathematica
  • The package contains the following:

    • AbdnAttribution.m : Mathematica script to run the Aberdeen attribution method. It does not need to be edited. Make sure that AbdnAttribution.m is executable, which can be achieved by running the following command once in the Unix shell:

    $ chmod a+x AbdnAttribution.m

    • Input_AbdnAttribution.ini : Editable file containing the settings used to run the source attribution.

    • Directory MLST with the data used for Listeria source attribution based on MLST information. This will be used as a benchmark to illustrate the functioning of the program below.

*** Initialisation:

Open Input_AbdnAttribution.ini. This can be done with any text editor. The format of the file is:

1 "**** Parameters for source attribution with AbdnAttribution.m ****"
2 "****"
3 "**** Please note: only lines 6, 8 and 10 should be edited! ****"
4 "****"
5 "---- Data directory/file ----" 
6       MLST/MLST_AbdnAttribution
7 "---- Reservoir to be attributed ----"
8       Poultry
9 "---- Number of iterations for sample size correction ----"
10      10000

Line 6: indicates [data directory]/[data file], i.e. the directory where the input and output data are stored (MLST in our example) and the name of the file containing the data (MLST_AbdnAttribution).

Line 8: the reservoir whose samples will be attributed. In the Listeria dataset, this can be Bovine, Fish, Human, Ovine, Poultry or Swine.

Line 10: Number of iterations for sample size correction.
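
Only these three lines need editing. As a purely illustrative sketch (not part of the original workflow), the reservoir can also be switched from the shell, here from Poultry to Bovine:

# Hypothetical convenience command: change the reservoir on line 8 of the settings file
sed -i 's/Poultry/Bovine/' Input_AbdnAttribution.ini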

*** Input data format (see example in the directory MLST, file MLST_AbdnAttribution.csv):

The input data are stored in a comma-separated values (CSV) file.

Each line of the file contains:

[Reservoir name], locus 1, locus 2, ..., locus n

The isolates for each reservoir type should be consecutive in the file; however, the particular order of the reservoirs does not matter. For instance, in MLST_AbdnAttribution.csv they are grouped in the following order: Bovine, Fish, Human, Ovine, Poultry, Swine.
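
To check that a data set follows this layout, the benchmark file shipped with the package can be inspected from the shell, for instance:

# Show the first few isolate records of the example MLST data set
head -n 5 MLST/MLST_AbdnAttribution.csv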

*** Running the code:

	$ ./AbdnAttribution.m

*** Output results:

Results are output to an *.xml file which contains the name of the attributed reservoir (it can be opened directly with Excel or LibreOffice Calc).

The output file is stored in [data directory] and named after [data file] and the attributed reservoir. In the MLST example, it is MLST_AbdnAttribution_Poultry.xls.

Chapter 8

Networks

In order to establish links between strains, one needs the pairwise SNP distances between strains for each CC and the R script given below. Note that the script produces five networks, for the 5, 10, 15, 20 and 25 SNP thresholds respectively. Only the 25 SNP threshold is used for linking strains together in chapter 8.

## library needed
library(edgebundleR)

## set file path
setwd("PATH_TO_ADAPT")

## Read SNP pairwise matrix
a<-read.table("cc2_final.txt") # File name to be adapted
colnames(a)=rownames(a)
am<-as.matrix(a)

## Generate networks
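## Distances d are rescaled to similarities 1/(d+1), so a cutoff of 1/(t+1) keeps only pairs differing by at most t SNPs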
g25<-edgebundle(1/(am+1),cutoff=(1/26),tension=0.5,fontsize = 14)
g25
saveEdgebundle(g25,file = "CC2_25SNP.html") # Output file name to be adapted

g20<-edgebundle(1/(am+1),cutoff=(1/21),tension=0.5,fontsize = 14)
g20
saveEdgebundle(g20,file = "CC2_20SNP.html")

g15<-edgebundle(1/(am+1),cutoff=(1/16),tension=0.5,fontsize = 14)
g15
saveEdgebundle(g15,file = "CC2_15SNP.html")

g10<-edgebundle(1/(am+1),cutoff=(1/11),tension=0.5,fontsize = 14)
g10
saveEdgebundle(g10,file = "CC2_10SNP.html")

g5<-edgebundle(1/(am+1),cutoff=(1/6),tension=0.5,fontsize = 14)
g5
saveEdgebundle(g5,file = "CC2_5SNP.html")
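
The commands above can be run interactively in R; alternatively, they can be saved in a plain R file (here called cc2_networks.R, a hypothetical name) and run non-interactively from the shell:

# Run the saved network script without opening an R session (hypothetical file name)
Rscript cc2_networks.R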

Timeline

The R script used to generate the timelines in chapter 8 is also provided at https://github.com/lguillier/LISEQ-codes/tree/master/Chapter8. The timelines produced are also in this folder (the file names indicate the CC and the cluster number from Table 8.1 of the report).

Chapter 9

Marker presence/absence

In order to determine the presence/absence of putative marker(s), the following script can be run in a Linux environment. The following programs and their dependencies need to be installed beforehand: bowtie2 and samtools.

The path, the name of the fasta file that contains the marker(s) and the names of the two paired-end read files should be adapted. The fasta file with the persistence markers is available at https://github.com/lguillier/LISEQ-codes/tree/master/Chapter9.

# Path 
cd /adapt_your_path

# Unzip .gz paired-end files
#gzip -d *.gz

# Build the Bowtie2 index for the marker sequences
bowtie2-build persistent.fasta persist

# Map the reads onto the reference marker sequence(s)
bowtie2 -q -x persist -1 150880_RL1500036102-1.bacterial-fastq-only.1-0-0.processed.R1.fastq -2 150880_RL1500036102-1.bacterial-fastq-only.1-0-0.processed.R2.fastq -S RL15000361.sam

# Convert the SAM file to BAM
samtools view -bS RL15000361.sam > RL15000361.bam

# Sort the BAM file
samtools sort RL15000361.bam RL15000361.sorted

# Keep only the mapped reads (-F 4) and list the marker sequences they map to
samtools view -F 4 RL15000361.sam
samtools view -bS -F 4 RL15000361.sam > RL15000361.bam
samtools view RL15000361.bam | cut -f3 | sort | uniq

For IGV visualisation of the mapping, the .fasta file and then the sorted (and indexed) .bam file should be loaded; the BAM can be indexed as shown below.
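
A possible follow-up, using the file names above: index the sorted BAM so IGV can load it, and list how many reads map to each marker (samtools index and idxstats are standard subcommands; note that older samtools versions write the sorted file as RL15000361.sorted.bam, while newer versions need samtools sort -o).

# Index the sorted BAM for IGV (and for idxstats)
samtools index RL15000361.sorted.bam

# One line per marker: name, length, number of mapped reads, number of unmapped reads
samtools idxstats RL15000361.sorted.bam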
