/
main.tex
750 lines (601 loc) · 29.8 KB
/
main.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
% This file was created (at least in part) by the script ParseMdtoLatex by Louis du Plessis
% (Available from https://github.com/taming-the-beast)
\documentclass[11pt]{article}
\input{preamble}
% Add your bibtex library here
\addbibresource{master-refs}
%%%%%%%%%%%%%%%%%%%%
% Do NOT edit this %
%%%%%%%%%%%%%%%%%%%%
\begin{document}
\renewcommand{\headrulewidth}{0.5pt}
\headsep = 20pt
\lhead{ }
\rhead{\textsc {BEAST v2 Tutorial}}
\thispagestyle{plain}
%%%%%%%%%%%%%%%%%%
% Tutorial title %
%%%%%%%%%%%%%%%%%%
\begin{center}
% Enter the name of your tutorial here
\textbf{\LARGE Tutorial using BEAST v2.4.7}\\\vspace{2mm}
% Enter a short description of your tutorial here
\textbf{\textcolor{mycol}{\Large SCOTTI Tutorial}}\\
\vspace{4mm}
% Enter the names of all the authors here
{\Large {\em Louis du Plessis and Nicola de Maio}}
\end{center}
Transmission tree reconstruction with the structured coalescent
%%%%%%%%%%%%%%%%%
% Tutorial body %
%%%%%%%%%%%%%%%%%
\section{Background}\label{background}
When applying phylodynamic models to pathogen genetic sequences
collected during an outbreak, we usually make the assumption that the
inferred genealogy approximates the true transmission tree. In most
cases the structure of the true transmission tree will be similar to the
inferred genealogy and inferences about epidemiological parameters
(e.g.~the $ R_e $) are valid. However, due to complicating
factors, such as within-host diversity and evolution, non-sampled
patients and transmission bottlenecks, it is difficult to reconstruct
transmission chains or draw conclusions about the transmission dynamics
between infected patients. In particular, there may be discrepancies
between the epidemiological and phylogenetic relatedness of hosts and
infection times are often biased.
SCOTTI (Structured COalescent Transmission Tree Inference)
\citep{deMaio2016} is a BEAST2 package that was developed to provide
more accurate reconstructions of transmission trees. The underlying
model is a structured coalescent model, where each host is modeled as a
population and migrations between populations represent new infections.
New sequencing technologies and protocols are making it easier to sample
the within-host diversity of an outbreak and SCOTTI can take advantage
of multiple sequences from each host to better resolve transmission
events. Furthermore, SCOTTI can model non-sampled hosts by dynamically
increasing or decreasing the number of populations (hosts). (In other
structured models available in BEAST2 the number of populations remain
constant and each population needs to be represented by at least one
sampled sequence). In addition to genetic sequences, SCOTTI also uses
epidemiological data about host exposure times, only allowing hosts to
transmit the disease during periods when they are infectious. Thus,
SCOTTI is able to model within-host diversity as well as non-sampled
hosts and multiply infected hosts. However, it is currently not able to
model transmission bottlenecks.
To make the inference tractable SCOTTI uses the same techniques as BASTA
\citep{deMaio2015}, which is a computationally efficient approximation
to the structured coalescent. In addition, a number of simplifying
assumptions are made. It is assumed that all hosts have the same
infection rate and that it stays constant over the course of infection.
It is also assumed that infection is equally likely between every pair
of hosts. Finally, it is assumed that all hosts have the same effective
population size, $ N_e $. Thus, all hosts have the same
within-host genetic diversity and thus all hosts have equal and constant
within-host dynamics. \clearpage
\section{Programs used in this
Exercise}\label{programs-used-in-this-exercise}
\subsubsection{BEAST2 - Bayesian Evolutionary Analysis Sampling Trees
2}\label{beast2---bayesian-evolutionary-analysis-sampling-trees-2}
BEAST2 (\url{http://www.beast2.org}) is a free software package for
Bayesian evolutionary analysis of molecular sequences using MCMC and
strictly oriented toward inference using rooted, time-measured
phylogenetic trees. This tutorial is written for BEAST v2.4.7 \citep{BEAST2book2014}.
\subsubsection{BEAUti2 - Bayesian Evolutionary Analysis
Utility}\label{beauti2---bayesian-evolutionary-analysis-utility}
BEAUti2 is a graphical user interface tool for generating BEAST2 XML
configuration files.
Both BEAST2 and BEAUti2 are Java programs, which means that the exact
same code runs on all platforms. For us it simply means that the
interface will be the same on all platforms. The screenshots used in
this tutorial are taken on a Mac OS X computer; however, both programs
will have the same layout and functionality on both Windows and Linux.
BEAUti2 is provided as a part of the BEAST2 package so you do not need
to install it separately.
\subsubsection{TreeAnnotator}\label{treeannotator}
TreeAnnotator is used to summarise the posterior sample of trees to
produce a maximum clade credibility tree. It can also be used to
summarise and visualise the posterior estimates of other tree parameters
(e.g.~node height).
TreeAnnotator is provided as a part of the BEAST2 package so you do not
need to install it separately.
\subsubsection{Tracer}\label{tracer}
Tracer (\url{http://tree.bio.ed.ac.uk/software/tracer}) is used to
summarise the posterior estimates of the various parameters sampled by
the Markov Chain. This program can be used for visual inspection and to
assess convergence. It helps to quickly view median estimates and 95\%
highest posterior density intervals of the parameters, and calculates
the effective sample sizes (ESS) of parameters. It can also be used to
investigate potential parameter correlations. We will be using Tracer
v1.6.0.
\subsubsection{FigTree}\label{figtree}
FigTree (\url{http://tree.bio.ed.ac.uk/software/figtree}) is a program
for viewing trees and producing publication-quality figures. It can
interpret the node-annotations created on the summary trees by
TreeAnnotator, allowing the user to display node-based statistics
(e.g.~posterior probabilities). We will be using FigTree v1.4.3.
\subsubsection{Python}\label{python}
Python (\url{https://www.python.org}) is an interpreted programming
language that is often used to write scripts for processing text files.
We will use two Python scripts during the tutorial. Both scripts should
work with Python 2.7.x and Python 3.x. There is also a third, optional,
script that makes use of the \lstinline!graph-tool! package to produce a
better looking figure.
Python should already be installed on most Mac OS X or Linux systems.
Both graphviz and graph-tool are available on Homebrew, and graphviz can
also be installed using \lstinline!apt-get! on Linux systems.
\clearpage
\section{Practical: SCOTTI tutorial}\label{practical-scotti-tutorial}
In this tutorial we analyse an outbreak of Foot and Mouth Disease Virus
(FMDV) that occurred in the South of England in 2007. FMDV is a highly
infectious disease that affects cloven-hoofed animals (such as cattle,
sheep, pigs, deer etc.) and can be easily spread between farms through
contact with contaminated vehicles or animal feed. The usual response to
FMDV is to cull all exposed livestock and quarantine surrounding farms,
thus the effects of the disease can be devastating to the farming sector
(the 2001 outbreak in the United Kingdom resulted in the culling of more
than 10 million cows and sheep and cost £8 billion). It is therefore
extremely important to trace the source and spread of the disease
between farms. Because of the high genetic variability of FMDV an
appreciable amount of genetic variation will accumulate over the course
of an outbreak. Thus, epidemiological and evolutionary dynamics occur on
the same timescale and we are dealing with a measurably evolving
population, which we can analyse in BEAST.
When analysing an FMDV outbreak we can treat each infected farm as an
infected host, with the transmission tree describing infections between
farms. In fact, when discussing livestock diseases it is common to refer
to infected farms as cases, instead of individual infected animals.
(This analogy holds because the genetic diversity of the virus within a
farm is much lower than between farms, thus each farm behaves more like
a single host).
\subsection{The Data}\label{the-data}
We analyse an outbreak of Foot and Mouth Disease Virus (FMDV) that
occurred in the South of England. The outbreak contained two distinct
clusters, in August and September of 2007, respectively. The dataset
contains 11 viral sequences from 10 farms. Four sequences were sampled
during the first cluster and a further 7 during the second cluster. In
addition, we also have the earliest and latest possible dates during
which each farm was infected with the disease, informed by the culling
time and first appearance of symptoms (Figure \ref{fig:outbreak}). The
data were first analysed in \citep{Cottam2008PlosPath} and later
reanalysed using SCOTTI in \citep{deMaio2016}.
\begin{figure}
\centering
\includegraphics[width=0.800000\textwidth]{figures/Outbreak.png}
\caption{Exposure times of infected farms and sampling dates of the viral sequences (Figure taken from \citep{Cottam2008PlosPath}). The orange shading estimates the time when animals showing symptoms of FMDV were present on a farm. Light blue shading indicates estimates for the incubation time of each farm. The dark blue shading is the infection date estimates in \citep{Cottam2008PlosPath}. The haplotype network of the sequenced strains is superimposed on the exposure times. The red dots indicate the dates of the sampled sequences.}
\label{fig:outbreak}
\end{figure}
\subsection{Creating the input files using the included Python
script}\label{creating-the-input-files-using-the-included-python-script}
Unlike most BEAST2 packages, SCOTTI does not have a BEAUti interface.
Although a BEAUti interface makes it much easier, it is not the only way
to create the input XML file for a BEAST2 analysis. Many advanced users
prefer to create the XML file in a text editor as this provides them
with greater flexibility and oversight. Even if you use BEAUti to create
your configuration files it is always a good idea to open the XML file
in a text editor and check that everything is where it should be. At
first glance a BEAST2 XML file may seem bewildering, but the file has a
rigid structure and after a while you will be able to interpret the
different elements of the configuration file. Being able to read and
understand a configuration file gives you a much better understanding of
how the different parts of an analysis fit together. In addition, it
gives you access to models (such as SCOTTI) that do not have BEAUti
interfaces. Once you are familiar with the structure of BEAST2 XML files
and you know the different inputs of the models you want to use you will
find that it is often faster to directly edit an XML file than to create
it in BEAUti.
In this tutorial we we will start at the shallow end of the pool and use
a Python script, \lstinline!SCOTTI_generate_xml.py!, to create the XML
configuration file. It is included with the SCOTTI package and is also
stored in the \lstinline!scripts/! directory of this tutorial.
\subsubsection{Installing the SCOTTI
package}\label{installing-the-scotti-package}
Before we can use SCOTTI we have to install the package somewhere where
BEAST2 can find it. Although we won't use BEAUti to create the input
configuration file, we still have to use BEAUti to install the package.
We will be using SCOTTI version 1.1.1 or above
for this tutorial.
\begin{framed}
Open \textbf{BEAUti} and open the \textbf{BEAST2 Package Manager} by
navigating to \textbf{File \textgreater{} Manage Packages}.
Install \textbf{SCOTTI} by selecting it and clicking the
\textbf{Install/Upgrade} button (Figure \ref{fig:packagemanager})
\end{framed}
\begin{figure}
\centering
\includegraphics[width=0.800000\textwidth]{figures/PackageManager.png}
\caption{Installing the SCOTTI package.}
\label{fig:packagemanager}
\end{figure}
\subsubsection{Running the Python
script}\label{running-the-python-script}
\begin{framed}
Open a terminal if you are using Mac OS X or Linux, or a Command Prompt
if you are using Windows.
Navigate to the directory where the \lstinline!SCOTTI_generate_xml.py!
script is stored.
Type \lstinline!python SCOTTI_generate_xml.py --help!
\end{framed}
This will display help for the different input arguments of the script:
\begin{lstlisting}[language=bash]
usage: SCOTTI_generate_xml.py [-h] [--fasta FASTA] [--output OUTPUT]
[--overwrite] [--dates DATES] [--hosts HOSTS]
[--hostTimes HOSTTIMES] [--maxHosts MAXHOSTS]
[--unlimLife] [--penalizeMigration]
[--numIter NUMITER]
[--mutationModel MUTATIONMODEL]
[--fixedAs FIXEDAS] [--fixedCs FIXEDCS]
[--fixedGs FIXEDGS] [--fixedTs FIXEDTS]
[--tracelog TRACELOG] [--screenlog SCREENLOG]
[--treelog TREELOG]
optional arguments:
-h, --help show this help message and exit
--fasta FASTA, -f FASTA
Input fasta alignment. If not specified, looks in th
working directory for SCOTTI_aligment.fas
--output OUTPUT, -o OUTPUT
output file that will contain the newly created SCOTTI
xml to be run with BEAST2.
--overwrite, -ov Overwrite output files.
--dates DATES, -d DATES
Input csv file with sampling dates associated to each
sample name (same name as fasta file). If not
specified, looks for SCOTTI_dates.csv
--hosts HOSTS, -ho HOSTS
Input csv file with sampling hosts associated to each
sample name (same name as fasta file). If not
specified, looks for SCOTTI_hosts.csv
--hostTimes HOSTTIMES, -ht HOSTTIMES
Input csv file with the earliest times in which hosts
are infectable, and latest time when hosts are
infective. Use same host names as those used for the
sampling hosts file. If not specified, looks for
SCOTTI_hostTimes.csv
...
...
...
--treelog TREELOG, -tl TREELOG
Logging frequency in the tree file.
Default=numIter/1000.
\end{lstlisting}
We see that we have to provide the script with four input files:
\begin{itemize}
\item
A fasta file, containing the genetic sequences.
\item
A dates file, containing the sampling dates for each sequence.
\item
A hosts file, containing a mapping from each sequence to the host it
was sampled from.
\item
A hostTimes file, containing the earliest and latest dates when hosts
are infectious. Thus, these dates correspond to the introduction and
removal of hosts to the outbreak (for instance in a hospital setting
these dates will correspond to a patient's admission and release
dates).
\end{itemize}
The dates, hosts and hostTimes files need to be csv-files
(comma-separated values). All dates have to be provided in numbers. All
four input files for the FMDV outbreak are provided in the
\lstinline!data/! directory of this tutorial.
The sequence file:
\begin{lstlisting}
>IP3c-46
-----------------GGTCTCACCCCTAGTAAGCCAACGACAGTCCCTGCGTTGCACTCCACA...
>IP5-51
-----------------GGTCTCACCCCTAGTAAGCCAACGACAGTCCCTGCGTTGCACTCCACA...
>IP8-62
-----------------GGTCTCACCCCTAGTAAGCCAACGACAGTCCCTGCGTTGCACTCCACA...
...
>IP1b-4
-----------------GGTCTCACCCCTAGTAAGCCAACGACAGTCCCTGCGTTGCACTCCACA...
\end{lstlisting}
The dates file:
\begin{lstlisting}
IP3c-46, 46.0
IP5-51, 51.0
IP8-62, 62.0
...
IP6b-54, 54.0
\end{lstlisting}
The hosts file:
\begin{lstlisting}
IP1b-3, IP1b
IP1b-4, IP1b
IP2b-6, IP2b
...
IP8-62, IP8
\end{lstlisting}
The hostTimes file:
\begin{lstlisting}
IP1b, -26.0, 7.0
IP2b, -16.0, 9.0
IP2c, -9.0, 10.0
...
IP8, 43.0, 63.0
\end{lstlisting}
Note that the script assumes that times are specified forward-in-time,
thus the largest times are the ones closest to the present. In this
analysis we specify time in days from the start of the outbreak (day 0).
Thus, in the hostTimes file, where times are negative, that indicates
that we believe those farms could have been infected weeks before the
outbreak was first detected.
\begin{framed}
Type in:
\lstinline!python SCOTTI_generate_xml.py --fasta ../data/FMDV.fasta --dates ../data/FMDV_dates.csv --hosts ../data/FMDV_hosts.csv --hostTimes ../data/FMDV_hostTimes.csv --output FMDV --maxHosts 20 --numIter 2000000 --tracelog 2000 --treelog 20000 --screenlog 20000!
\end{framed}
This will create the input XML file for the BEAST2 analysis. You may
have to change the paths to the input files before the command will work
(the \lstinline!--fasta!, \lstinline!--dates!, \lstinline!--hosts! and
\lstinline!--hostTimes! arguments). We also specified the following
additional optional arguments:
\begin{itemize}
\item
\lstinline!--output!: The name of the output XML file and the
\lstinline!.log! and \lstinline!.tree! files that will be produced by
BEAST2.
\item
\lstinline!--maxHosts!: The maximum number of hosts in the outbreak,
including observed and unobserved hosts. Since we have data from 10
farms and no cases were found in any other farms it is unlikely that
more than 20 farms were infected. We can also make the analysis run
faster by restricting the maximum number of hosts to a smaller number.
\item
\lstinline!--numIter!: We restrict the length of the chain to
2'000'000 states.
\item
\lstinline!--traceLog!, \lstinline!--treelog!,
\lstinline!--screenlog!: The frequency at which states are saved in
the log file and tree files or output to the screen. By default the
Python script will scale these frequencies to log 10'000 states and
1'000 trees. Since our chain only includes 2'000'000 states this
results in states being sampled too frequently which in turn results
in high auto-correlation between sampled states and low ESS values.
\end{itemize}
\begin{framed}
\textbf{Topic for discussion:} Read the help for the
\lstinline!--unlimLife! and \lstinline!--penalizeMigration! input
arguments to the Python script. Can you figure out what these arguments
do? How do you think setting these arguments to true will affect the
results?
\end{framed}
\subsection{Running the analysis}\label{running-the-analysis}
We should now be ready to run the analysis in BEAST2.
\begin{framed}
Open \textbf{BEAST2} and choose \lstinline!FMDV.xml! as the BEAST XML
file. If BEAGLE is installed check the box to use it. Then click
\textbf{Run} (Figure \ref{fig:beastmcmc}).
\end{framed}
\begin{figure}
\centering
\includegraphics[width=0.500000\textwidth]{figures/BEASTMCMCseed.png}
\caption{Running the analysis in BEAST2.}
\label{fig:beastmcmc}
\end{figure}
If you use the seed 2007 you should be able to duplicate the results in
the next section. The analysis should take between 10 and 20 minutes to
run. While the analysis is running open the XML file in a text editor
and see if you can identify the different model components.
\begin{framed}
\textbf{Topic for discussion:} The Python script we used to create the
configuration file only allows us to choose between a Jukes-Cantor or
HKY site model, without any rate heterogeneity or invariant sites. In
addition, it only allows one locus/partition in the alignment and uses a
strict clock. It also doesn't give us any options for setting different
priors or operators.
Do you think there are situations where you would want to use a
different site or clock model? What about the model priors? In
particular, which priors would you want to change?
Suppose you wanted to perform a SCOTTI analysis using a GTR substitution
with a lognormal relaxed clock. How would you go about it?
\end{framed}
\subsection{Analysing the output in Tracer and
FigTree}\label{analysing-the-output-in-tracer-and-figtree}
\begin{framed}
Open \textbf{Tracer} and load the file \lstinline!FMDV.log!. Click on
the \textbf{Trace} tab and examine the parameter traces.
\end{framed}
The Python script only logs a few key parameters (Figure
\ref{fig:tracer}). Note that not all parameters have perfect mixing. For
a real analysis we would run the chain much longer and aim to collect at
least 10'000 samples (this log file contains only 1'000 samples).
\begin{figure}
\centering
\includegraphics[width=0.800000\textwidth]{figures/Tracer.png}
\caption{The results of the analysis in Tracer.}
\label{fig:tracer}
\end{figure}
\begin{framed}
Click on \textbf{Estimates} and select \textbf{migmodel.numPops}.
Since this is an integer parameter click on \textbf{(I)nt} at the bottom
of the panel (Figure \ref{fig:numPops}).
\end{framed}
\begin{figure}
\centering
\includegraphics[width=0.800000\textwidth]{figures/Tracer_numPops.png}
\caption{Estimates for the number of infected farms.}
\label{fig:numPops}
\end{figure}
This parameter estimates the number of hosts in the outbreak. It is
bounded below by 10 (the number of sampled farms) and above by 20, which
was our input upper bound. However, it is difficult to interpret this
parameter as the number of farms that were infected, because (i) one
non-sampled deme may actually represent a series of hosts with no time
overlap, or a single host with unlimited exposure time (like
environmental contamination); (ii) non-sampled hosts are not necessarily
infected; (iii) many non-sampled hosts might just represent one host
with large population size, like an environmental endemic source.
The parameter \textbf{migModel.popSize} logs the effective population
size of the hosts, which is directly proportional to the genetic
diversity of the virus within each farm. We see that the effective
population size is quite low, which is probably due to the relatively
short period of the outbreak. The equal infection rate between farms is
logged by \textbf{migModel.rateMig} and the clock rate by
\textbf{mutationRate}. If the clock rate seems low for a virus, keep in
mind that we measure time in days. Thus, the units for substitutions are
in substitutions/site/day instead of the usual substitutions/site/year.
\begin{framed}
Select \textbf{tree.height} (Figure \ref{fig:tmrca})
\end{framed}
\begin{figure}
\centering
\includegraphics[width=0.800000\textwidth]{figures/Tracer_TMRCA.png}
\caption{Estimates for the TMRCA.}
\label{fig:tmrca}
\end{figure}
This parameter is the TMRCA of the sequences included in the
transmission tree. The median estimate is 67.36 days, with a 95\% HPD
between 60.76 and 78.55 days. Since the most recent sample was taken on
day 62 of the outbreak (IP8-62), this indicates that it is unlikely that
the outbreak started more than a week before it was discovered and
confirmed, which is an encouraging sign for the surveillance of foot and
mouth disease in the United Kingdom.
\begin{framed}
Open \textbf{TreeAnnotator} and set the \textbf{Burnin percentage} to
10. Leave the other options unchanged.
Load the file \lstinline!FMDV.trees! and change the \textbf{Output file}
to \lstinline!FMDV.MCC.tree! (Figure \ref{fig:treeannotator}).
\end{framed}
\begin{figure}
\centering
\includegraphics[width=0.500000\textwidth]{figures/TreeAnnotator.png}
\caption{Creating the MCC tree.}
\label{fig:treeannotator}
\end{figure}
\begin{framed}
Open \textbf{FigTree} and load \lstinline!FMDV.MCC.tree!.
Click on the arrow to the right \textbf{Appearance} and in the
\textbf{Colour by} dropdown box select \textbf{host}. Check
\textbf{Gradient} and increase the \textbf{Line Weight} to \textbf{5}.
Increase the \textbf{Font Size} under \textbf{Tip Labels} and change
\textbf{Display} to \textbf{host}. Check \textbf{Node Labels} and select
\textbf{host.prob} from the dropdown box and increase the \textbf{Font
size}.
Under \textbf{Node Shape} select \textbf{Circle} and set the size by
\textbf{host.prob} and the colour by \textbf{host}.
Check \textbf{Legend} and select \textbf{host} as \textbf{Attribute}.
\end{framed}
\begin{figure}
\centering
\includegraphics[width=1.000000\textwidth]{figures/FigTree.png}
\caption{The MCC tree in FigTree.}
\label{fig:figtree}
\end{figure}
We see that in the MCC tree (Figure \ref{fig:figtree}) one node is an
unsampled node (between the two outbreak clusters). Thus, there is
evidence of a non-sampled farm between IP2b and IP5. We further see that
while the posterior probabilities for some internal nodes are quite
high, it is below 0.5 for several nodes. Note that since SCOTTI does not
model the transmission bottleneck it cannot estimate the time of
infection. Thus, it is better to display the tree with a gradient
between hosts than with a discrete colouring.
\begin{framed}
Under \textbf{Node Labels} change the \textbf{Display} attribute first
to \textbf{host.set} and then to \textbf{host.set.prob} (Figure
\ref{fig:hostsets})
\end{framed}
\begin{figure}
\centering
\includegraphics[width=1.000000\textwidth]{figures/FigTreeSet.png}
\caption{The sets of probable hosts in FigTree.}
\label{fig:hostsets}
\end{figure}
This displays respectively the set of hosts with a non-zero posterior
probability and the posterior probability for those nodes. We see that
although many internal nodes have a large set of possible hosts, only
two or three hosts have reasonably large posterior probabilities for
each internal node.
\subsection{Constructing the transmission
network}\label{constructing-the-transmission-network}
The MCC tree is a summary of the set of posterior trees. Although it is
useful it is not necessarily the best way to look at the transmission
tree. We will use a Python script to construct a transmission network
that shows the probabilities of transmissions between every pair of
farms. We will use the script
\lstinline!Make_transmission_tree_alternative.py!, which uses the
\lstinline!graphviz! package. If this package is not installed you will
have to install it, either through Homebrew or by installing the
Anaconda Python distribution. Alternatively, if you have previously
installed the \lstinline!graph-tool! package you may use the script
\lstinline!Make_transmission_tree.py!, which will produce better looking
figures.
\begin{framed}
Open a terminal if you are using Mac OS X or Linux, or a Command Prompt
if you are using Windows.
Navigate to the directory where the
\lstinline!Make_transmission_tree_alternative.py! script is stored.
Type in
\lstinline!python Make_transmission_tree_alternative.py --input ../precooked_runs/FMDV.trees --outputF FMDV_transmission --burnin 10!
\end{framed}
This will create 3 files:
\begin{itemize}
\item
\lstinline!FMDV_transmission_direct_transmissions.jpg!
\item
\lstinline!FMDV_transmission_indirect_transmissions!
\item
\lstinline!FMDV_transmission_network.txt!
\end{itemize}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/FMDV_transmission_direct_transmissions.jpg}
\caption{The direct transmission network using graphviz.}
\label{fig:hostsets}
\end{figure}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/FMDV3_out_direct_transmissions.png}
\caption{The direct transmission network using graph-tool.}
\label{fig:hostsets}
\end{figure}
The network of direct transmissions is shown in
\href{fig:transmissiontree1}{Figure 10}. By default only transmissions
with a probability bigger than 10\% are shown. We can identify a few key
points from the transmission network. First, the chance of a direct
transmission between the first and the second cluster (betwen IP2b and
IP5) is quite low, thus there is a large chance that there are some
non-sampled farms. There also appears to be evidence for non-sampled
farm(s) between IP5 and the other farms in the second cluster. Secondly,
it appears likely that IP3b was responsible either directly or
indirectly, for the infection of IP6b, IP7 and IP8. In several parts of
the network there are a number of possible transmission routes, however
in most cases one route is more likely than others. For example, it is
more likely that IP1b infected IP2c and that IP7 infected IP8 than
vice-versa. It also appears likely that IP1b was the first infected
farm. This information is also summarised in the text file
\lstinline!FMDV_transmission_network.txt!.
Although this information is very useful for informing us about the
transmission dynamics of an outbreak we should be careful not to
over-interpret the results. There is still a lot of uncertainty in the
transmission network, thus a link in the network does not necessarily
indicate a definite transmission pair. Moreover, the presence of
non-sampled hosts make it difficult to unambiguously identify
transmission pairs or superspreaders.
\clearpage
\section{Useful Links}\label{useful-links}
\begin{itemize}
\item
SCOTTI website with documentation:
\url{https://bitbucket.org/nicofmay/scotti/}
\item
\href{http://www.beast2.org/book.html}{Bayesian Evolutionary Analysis
with BEAST 2} \citep{BEAST2book2014}
\item
BEAST 2 website and documentation: \url{http://www.beast2.org/}
\item
Join the BEAST user discussion:
\url{http://groups.google.com/group/beast-users}
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%
% Tutorial disclaimer %
%%%%%%%%%%%%%%%%%%%%%%%
% Please do not change the license
% Add the author names and relevant links
% Add any other aknowledgments here
\href{http://creativecommons.org/licenses/by/4.0/}{\includegraphics[scale=0.8]{figures/ccby.pdf}} This tutorial was written by Louis du Plessis and Nicola de Maio for \href{https://taming-the-beast.github.io}{Taming the BEAST} and is licensed under a \href{http://creativecommons.org/licenses/by/4.0/}{Creative Commons Attribution 4.0 International License}.
%%%%%%%%%%%%%%%%%%%%
% Do NOT edit this %
%%%%%%%%%%%%%%%%%%%%
Version dated: \today
%%%%%%%%%%%%%%%%
% REFERENCES %
%%%%%%%%%%%%%%%%
\printbibliography[heading=relevref]
\end{document}