/
main.tex
1277 lines (1052 loc) · 52.6 KB
/
main.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
% This file was created (at least in part) by the script ParseMdtoLatex by Louis du Plessis
% (Available from https://github.com/taming-the-beast)
\documentclass[11pt]{article}
\input{preamble}
% Add your bibtex library here
\addbibresource{master-refs}
%%%%%%%%%%%%%%%%%%%%
% Do NOT edit this %
%%%%%%%%%%%%%%%%%%%%
\begin{document}
\renewcommand{\headrulewidth}{0.5pt}
\headsep = 20pt
\lhead{ }
\rhead{\textsc {BEAST v2 Tutorial}}
\thispagestyle{plain}
%%%%%%%%%%%%%%%%%%
% Tutorial title %
%%%%%%%%%%%%%%%%%%
\begin{center}
% Enter the name of your tutorial here
\textbf{\LARGE Tutorial using BEAST v2.7.x}\\\vspace{2mm}
% Enter a short description of your tutorial here
\textbf{\textcolor{mycol}{\Large Skyline plots}}\\
\vspace{4mm}
% Enter the names of all the authors here
{\Large {\em Nicola F. Müller and Louis du Plessis}}
\end{center}
Inference of past population dynamics using Bayesian Coalescent Skyline
and Birth-Death Skyline plots.
%%%%%%%%%%%%%%%%%
% Tutorial body %
%%%%%%%%%%%%%%%%%
\hypertarget{background}{%
\section{Background}\label{background}}
Population dynamics influence the shape of the tree and consequently,
the shape of the tree contains some information about past population
dynamics. The so-called Skyline methods allow to extract this
information from phylogenetic trees in a non-parametric manner. It is
non-parametric since there is no underlying system of differential
equations governing the inference of these dynamics. In this tutorial we
will look at two different methods to infer these dynamics from sequence
data. The first one is the Coalescent Bayesian Skyline plot
\citep{Drummond2005}, which is based on the coalescent model, and the
second one is the Birth-Death Skyline plot \citep{Stadler2013} based on
the birth-death model. The conceptual difference between the coalescent
and birth-death approaches lies in the direction of the flow of time. In
the coalescent, time is modeled to move backwards, from the present to
the past, while in the birth-death approach it is modeled to go
forwards. Two other fundamental differences are the parameters that are
inferred and the way sampling is treated. \clearpage
\hypertarget{programs-used-in-this-exercise}{%
\section{Programs used in this
Exercise}\label{programs-used-in-this-exercise}}
\hypertarget{beast2---bayesian-evolutionary-analysis-sampling-trees-2}{%
\subsubsection{BEAST2 - Bayesian Evolutionary Analysis Sampling Trees
2}\label{beast2---bayesian-evolutionary-analysis-sampling-trees-2}}
BEAST2 (\url{http://www.beast2.org}) is a free software package for
Bayesian evolutionary analysis of molecular sequences using MCMC and
strictly oriented toward inference using rooted, time-measured
phylogenetic trees. This tutorial is written for BEAST v2.7.x \citep{Bouckaert2014 Bouckaert2019}.
\hypertarget{beauti---bayesian-evolutionary-analysis-utility}{%
\subsubsection{BEAUti - Bayesian Evolutionary Analysis
Utility}\label{beauti---bayesian-evolutionary-analysis-utility}}
BEAUti2 is a graphical user interface tool for generating BEAST2 XML
configuration files.
Both BEAST2 and BEAUti2 are Java programs, which means that the exact
same code runs on all platforms. For us it simply means that the
interface will be the same on all platforms. The screenshots used in
this tutorial are taken on a Mac OS X computer; however, both programs
will have the same layout and functionality on both Windows and Linux.
BEAUti2 is provided as a part of the BEAST2 package so you do not need
to install it separately.
\hypertarget{tracer}{%
\subsubsection{Tracer}\label{tracer}}
Tracer (\url{http://beast.community/tracer}) is used to summarise the
posterior estimates of the various parameters sampled by the Markov
Chain. This program can be used for visual inspection and to assess
convergence. It helps to quickly view median estimates and 95\% highest
posterior density intervals of the parameters, and calculates the
effective sample sizes (ESS) of parameters. It can also be used to
investigate potential parameter correlations. We will be using Tracer
v1.7.x.
\hypertarget{r-rstudio}{%
\subsubsection{R / RStudio}\label{r-rstudio}}
We will be using \href{/href\%7Bhttps://www.r-project.org}{R} to analyze
the output of the Birth-Death Skyline plot.
\href{https://www.rstudio.com/}{RStudio} provides a user-friendly
graphical user interface to R that makes it easier to edit and run
scripts. (It is not necessary to use RStudio for this tutorial).
\clearpage
\hypertarget{practical-bayesian-and-birth-death-skyline-plots}{%
\section{Practical: Bayesian and Birth-Death Skyline
Plots}\label{practical-bayesian-and-birth-death-skyline-plots}}
In this tutorial we will estimate the dynamics of the Egyptian Hepatitis
C epidemic from genetic sequence data collected in 1993.
The aim of this tutorial is to:
\begin{itemize}
\item
Learn how to infer population dynamics;
\item
Get to know how to choose the set-up of a skyline analysis;
\item
Get to know the advantages and disadvantages of the Coalescent
Bayesian Skyline Plot and the Birth-Death Skyline.
\end{itemize}
\hypertarget{the-data}{%
\subsection{The Data}\label{the-data}}
The dataset consists of an alignment of 63 Hepatitis C sequences sampled
in 1993 in Egypt \citep{Ray2000}. This dataset has been used previously
to test the performance of skyline methods
\citep{Drummond2005 Stadler2013}.
With an estimated 15-25\%, Egypt has the highest Hepatits C prevalence
in the world. In the mid 20th century, the prevalence of Hepatitis C
increased drastically (see Figure \ref{fig:prevalence} for estimates).
We will try to infer this increase from sequence data.
The alignment file can be downloaded from the Taming the BEAST website
at \url{https://taming-the-beast.org/tutorials/Skyline-plots/} by
downloading the file \passthrough{\lstinline!hcv.nexus!} from the
left-hand panel, under the heading \textbf{Data}.
\begin{figure}
\centering
\includegraphics[width=0.500000\textwidth]{figures/Estimated_number_hcv.png}
\caption{The growth of the effective population size of the Hepatitis C epidemic in Egypt \citep{Pybus2003}.}
\label{fig:prevalence}
\end{figure}
\hypertarget{creating-the-analysis-files-with-beauti}{%
\subsection{Creating the Analysis Files with
BEAUti}\label{creating-the-analysis-files-with-beauti}}
We will use BEAUti to generate the configuration file for BEAST2 from
the sequence alignment.
\hypertarget{install-beast-2-packages}{%
\subsubsection{Install BEAST 2
packages}\label{install-beast-2-packages}}
While the coalescent-based Bayesian Skyline Plot is integrated in the
BEAST2 core, we need to install the BDSKY package, which contains the
Birth-Death Skyline model. Installation of packages is done using the
package manager, which is integrated into BEAUti.
\begin{framed}
Open the \textbf{BEAST2 Package Manager} by navigating to \textbf{File
\textgreater{} Manage Packages}.
Install the \textbf{BDSKY} package by selecting it and clicking the
\textbf{Install/Upgrade} button (Figure \ref{fig:install}).
\end{framed}
After the installation of a package, the program is on your computer,
but BEAUti is unable to load the template files for the newly installed
model unless it is restarted. So, let's restart BEAUti to make sure we
have the \textbf{BDSKY} model at hand.
\begin{framed}
Close the \textbf{BEAST2 Package Manager} and \textbf{\emph{restart}}
BEAUti to fully load the \textbf{BDSKY} package.
\end{framed}
\begin{figure}
\centering
\includegraphics[width=0.750000\textwidth]{figures/install_bdsky.png}
\caption{Install the BDSKY package which contains the Birth-Death Skyline model.}
\label{fig:install}
\end{figure}
\clearpage
\hypertarget{setting-up-the-coalescent-bayesian-skyline-analysis}{%
\subsubsection{Setting up the Coalescent Bayesian Skyline
analysis}\label{setting-up-the-coalescent-bayesian-skyline-analysis}}
To start we have to import the alignment into BEAUti.
\begin{framed}
In the \textbf{Partitions} panel, import the nexus file with the
alignment by navigating to \textbf{File \textgreater{} Import Alignment}
in the menu and then finding the \passthrough{\lstinline!hcv.nexus!}
file on your computer \textbf{or} simply drag and drop the file into the
\textbf{BEAUti} window.
\end{framed}
BEAUti will recognize the sequences from the
\passthrough{\lstinline!*.nexus!} file as nucleotide data. It will do so
for sequence files with the character set of \textbf{A C G T N}, where
\textbf{N} indicates an unknown nucleotide. As soon as other non-gap
characters are included (e.g.~using \textbf{R} or \textbf{Y} to indicate
purines and pyramidines) BEAUti will not recognize the data as
nucleotides anymore (unless the type of data is specified in the
\passthrough{\lstinline!*.nexus!} file) and open a dialogue box to
confirm the data type.
The sequences were all sampled in 1993 so we are dealing with a
homochronous alignment and do not need to specify tip dates.
\begin{framed}
Skip the \textbf{Tip Dates} panel and navigate to the \textbf{Site
Model} panel.
\end{framed}
The next step is to specify the model of nucleotide evolution (the site
model). We will be using the GTR model, which is the most general
reversible model and estimates transition probabilities between
individual nucleotides separately. That means that the transition
probabilities between e.g.~\textbf{A} and \textbf{T} will be inferred
separately to the ones between \textbf{A} and \textbf{C}, however
transition probabilities from \textbf{A} to \textbf{C} will be the same
as \textbf{C} to \textbf{A} etc. (Note that the transition probabilities
here refer to the transition between \emph{any} two states in the
continuous time Markov-chain stochastic process that is used for the
substitution model, and not specifically to \emph{transitions} in the
context of genetics, i.e.~mutations from purines to purines or
pyramidines to pyramidines). Additionally, we allow for rate
heterogeneity among sites. We do this by changing the Gamma Category
Count to 4 (normally between 4 and 6).
\begin{framed}
Change the \textbf{Gamma Category Count} to 4, make sure that the
estimate box next to the \textbf{Shape} parameter of the Gamma
distribution is ticked and set \textbf{Subst Model} to \textbf{GTR}.
Make sure that the estimate box is ticked for all but one of the 6 rates
(there should be 5 ticked boxes) and that \textbf{Frequencies} are
estimated (Figure \ref{fig:model}).
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/choose_gtr.png}
\caption{Set a GTR site model with a Gamma Category Count of 4.}
\label{fig:model}
\end{figure}
\begin{framed}
\textbf{Topic for discussion:} Why are only 5 of the 6 rates of the GTR
model estimated?
\end{framed}
Because our sequences are contemporaneous (homochronous data) there is
no information in our dataset to estimate the clock rate (for more
information on this refer to the
\href{../Prior-selection/}{prior-selection} tutorial) and we have to use
external information to calibrate the clock. We will use an estimate
inferred in \citep{Pybus2001} to fix the clock rate. In this case all
the samples were contemporaneous (sampled at the same time) and the
clock rate is simply a scaling of the estimated tree branch lengths (in
substitutions/site) into calendar time.
\begin{framed}
Navigate to the \textbf{Clock Model} panel.
Leave the clock model as a \textbf{Strict Clock} and set
\textbf{Clock.rate} to 0.00079 s/s/y (Figure \ref{fig:clockmodel}).
(Note that BEAUti is smart enough to know that the clock rate cannot be
estimated on this dataset and grays out the estimate checkbox).
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/set_clockrate.png}
\caption{Set the clock rate to 0.00079 s/s/y.}
\label{fig:clockmodel}
\end{figure}
Now we are ready to set up the Coalescent Bayesian Skyline as a
tree-prior.
\begin{framed}
Navigate to the \textbf{Priors} panel and select \textbf{Coalescent
Bayesian Skyline} as the tree prior (Figure \ref{fig:coalescent}).
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/choose_bsp.png}
\caption{Choose the Coalescent Bayesian Skyline as a tree prior.}
\label{fig:coalescent}
\end{figure}
The Coalescent Bayesian Skyline divides the time between the present and
the root of the tree (the tMRCA) into segments, and estimates a
different effective population size (\passthrough{$ N_e $})
for each segment. The endpoints of segments are tied to the branching
times (also called coalescent events) in the tree (Figure
\ref{fig:coal_events}), and the size of segments is measured in the
number of coalescent events included in each segment. The Coalescent
Bayesian Skyline groups coalescent events into segments and jointly
estimates the \passthrough{$ N_e $} (\textbf{bPopSizes}
parameter in BEAST) and the size of each segment (\textbf{bGroupSizes}
parameter). To set the number of segments we have to change the
dimension of \textbf{bPopSizes} and \textbf{bGroupSizes} (note that the
dimension of both parameters always has to be the same). Note that the
length of a segment is not fixed, but dependent on the timing of
coalescent events in the tree (Figure \ref{fig:coal_events}), as well as
the number of events contained within a segment (\textbf{bGroupSizes}).
\begin{figure}
\centering
\includegraphics[width=0.750000\textwidth]{figures/coalescent_intervals.png}
\caption{Example tree where the red dotted lines show the time-points of coalescent events.}
\label{fig:coal_events}
\end{figure}
\begin{framed}
To change the number of segments we have to navigate to the
\textbf{Initialialization} panel, which is by default not visible.
Navigate to \textbf{View \textgreater{} Show Initialization Panel} to
make it visible and navigate to it (Figure \ref{fig:init}).
Set the dimension of \textbf{bPopSizes} and \textbf{bGroupSizes} to 4
(the default value is 5) after expanding the boxes for the two
parameters (Figure \ref{fig:dimensions}).
\end{framed}
\begin{figure}
\centering
\includegraphics[width=0.250000\textwidth]{figures/goto_initialization.png}
\caption{Show the initialization panel.}
\label{fig:init}
\end{figure}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/set_dimension.png}
\caption{Set the dimension of bPopSizes and bGroupSizes to 4.}
\label{fig:dimensions}
\end{figure}
This sets the number of segments equal to 4 (the parameter dimension),
which means \passthrough{$ N_e $} will be allowed to change
3 times between the tMRCA and the present (if we have
\passthrough{$ d $} segments,
\passthrough{$ N_e $} is allowed to change
\passthrough{$ d-1 $} times).
We can leave the rest of the priors as they are and save the XML file.
We want to shorten the chain length and decrease the sampling frequency
so the analysis completes in a reasonable time and the output files stay
small. (Keep in mind that it will be necessary to run a longer chain for
parameters to mix properly).
\begin{framed}
Navigate to the \textbf{MCMC} panel.
Change the \textbf{Chain Length} from 10'000'000 to 3'000'000.
Click on the arrow next to the \textbf{tracelog} and change the
\textbf{File Name} to \passthrough{\lstinline!\$(filebase).log!} and set
the \textbf{Log Every} to 3'000.
Click on the arrow next to the \textbf{treelog} and change the
\textbf{File Name} to \passthrough{\lstinline!\$(filebase)-\$(tree).log!}
and set the \textbf{Log Every} to 3'000.
Leave all other settings at their default values and save the file as
\passthrough{\lstinline!hcv_coal.xml!}.
(Note that since BEAST 2.7 the filenames used here are the default
filenames and should not need to be changed!)
\end{framed}
When we run the analysis \passthrough{lstinline!\$(filebase)!} in the
name of the \passthrough{\lstinline!*.log!} and
\passthrough{\lstinline!*.trees!} files will be replaced by the name of
the XML file. This is a good idea, since it makes it easy to keep track
of which XML files produced which output files.
Now we are ready to run the analysis.
\begin{framed}
Start \textbf{BEAST2} and choose the file
\passthrough{\lstinline!hcv_coal.xml!}.
If you have \textbf{BEAGLE} installed tick the box to \textbf{Use BEAGLE
library if available}, which will make the analysis run faster.
Hit \textbf{Run} to start the analysis.
\end{framed}
The analysis will take about 10 minutes to complete. Read through the
next section while waiting for your results or start preparing the XML
file for the \protect\hyperlink{sec:bdsky}{birth-death skyline}
analysis.
\hypertarget{the-coalescent-bayesian-skyline-parameterization}{%
\subsubsection{The Coalescent Bayesian Skyline
parameterization}\label{the-coalescent-bayesian-skyline-parameterization}}
The Coalescent Bayesian Skyline model uses the Kingman coalescent for
each segment, which assumes that the sequences are a small sample drawn
from a haploid population evolving under Wright-Fisher dynamics (Figure
\ref{fig:wrightfisher}). The model works by calculating the probability
of observing the tree under this assumption. This essentially boils down
to repeatedly asking the question of how likely it is for two lineages
to coalesce (have a common ancestor) in a given time.
\begin{figure}
\centering
\includegraphics[width=0.500000\textwidth]{figures/Rosenberg2002.png}
\caption{The basic principle behind the coalescent. Figure from \citep{Rosenberg2002}.}
\label{fig:wrightfisher}
\end{figure}
The effective population size (\passthrough{$ N_e $}) is
the inverse of the rate of coalescence
\passthrough{$ \lambda $}. The larger
\passthrough{$ N_e $} is the less likely lineages are to
coalesce. Thus, intervals in a sampled tree with many branching events
often coincide with periods when the population size was small.
Similarly, few branching events occur during periods of large population
size. (Note that these results are conditioned on sampling only a small
fraction of the population).
\begin{equation}
\lambda = \frac{1}{N_e}
\end{equation}
For an SIR model (\textbf{S}usceptible, \textbf{I}nfected and
\textbf{R}ecovered), \passthrough{$ N_e $} is proportional
to the overall population size \passthrough{$ N $} and the
number of infected \passthrough{$ I $} and inversely
proportional to the transmission rate
\passthrough{$ \theta $}.
\begin{equation}
N_e = \frac{I}{\theta} \frac{N}{S}
\end{equation}
Estimates of \passthrough{$ N_e $} therefore do not
directly tell us something about the number of infected, nor the
transmission rate. However, changes in
\passthrough{$ N_e $} can be informative about changes in
the transmission rate or the number of infected (if they do not cancel
out).
The Coalescent Bayesian Skyline model allows
\passthrough{$ N_e $} to change over time in a
nonparametric fashion (i.e.~we do not have to specify an equation
governing changes in \passthrough{$ N_e $} over time).
Another way to think about the model is as maximally-parameterized,
since it infers \passthrough{$ d $} change-point times
(segment boundaries) and a value for \passthrough{$ N_e $}
in each segment. This makes the Bayesian Skyline flexible enough to
model very complicated \passthrough{$ N_e $} dynamics,
provided that enough segments are specified. It may be tempting to
specify the maximum dimension for the model (each group contains only
one coalescent event, thus \passthrough{$ N_e $} changes at
each branching time in the tree), making it as flexible as possible.
This is the parameterization used by the Classic Skyline plot
\citep{Pybus2000}, which is the direct ancestor of the Coalescent
Bayesian Skyline plot. However, the only informative events used by the
Coalescent Bayesian Skyline plot are the coalescent events. Thus, using
a maximally-flexible parameterization with only one informative event
per segment often leads to erratic and noisy estimates of
\passthrough{$ N_e $} over time (especially if segments are
very short, see Figure \ref{fig:coal_events}). Grouping segments
together leads to smoother and more robust estimates.
Choosing the dimension for the Bayesian Skyline can be rather arbitrary.
If the dimension is chosen too low, not all population size changes are
captured, but if it is chosen too large, there may be too little
information in a segment to support a robust estimate. When trying to
decide if the dimension is appropriate it may be useful to consider the
average number of informative (coalescent) events per segment. (A tree
of \passthrough{$ n $} taxa has
\passthrough{$ n-1 $} coalescences, thus
\passthrough{$ N_e $} in each segment is estimated from on
average \passthrough{$ \frac\{n-1\}\{d\} $} informative
data points). Would this number of random samples drawn from a
hypothetical distribution allow you to accurately estimate the
distribution? If not, consider decreasing the dimension. There are
descendants of the coalescent skyline in BEAST that either estimate the
number of segments (Extended Bayesian Skyline \citep{Heled2008}) or do
not require the number of segments to be specified (Skyride
\citep{Minin2008}), but instead makes very strong prior assumptions
about changes in \passthrough{$ N_e $}.
\hypertarget{exploring-the-results-of-the-coalescent-bayesian-skyline-analysis}{%
\subsubsection{Exploring the results of the Coalescent Bayesian Skyline
analysis}\label{exploring-the-results-of-the-coalescent-bayesian-skyline-analysis}}
For the reconstruction of the population dynamics, we need two files,
the \passthrough{\lstinline!*.log!} file and the
\passthrough{\lstinline!*.trees!} file. The log file contains the
information about the group sizes and population sizes of each segment,
while the trees file is needed for the times of the coalescent events.
\begin{framed}
Load the logfile into \textbf{Tracer} to check mixing and parameter
estimates (Figure \ref{fig:tracer_bsp}).
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/bsp_tracer.png}
\caption{Loading the log file into Tracer.}
\label{fig:tracer_bsp}
\end{figure}
Because we shortened the chain most parameters have very low ESS values.
If you like, you can compare your results with the example results we
obtained with identical settings and a chain of 30,000,000
(\passthrough{\lstinline!hcv_coal_30M.log!}).
\begin{framed}
Navigate to \textbf{Analysis \textgreater{} Bayesian Skyline
Reconstruction}. From there open the \passthrough{\lstinline!*.trees!}
file. To get the correct dates in the analysis we should specify the
\textbf{Age of the youngest tip}. In our case it is 1993, the year where
all the samples were taken. If the sequences were sampled at different
times (heterochronous data), the age of the youngest tip is the time
when the most recent sample was collected.
Press \textbf{OK} to reconstruct the past population dynamics (Figure
\ref{fig:trees}).
\end{framed}
\begin{figure}
\centering
\includegraphics[width=0.750000\textwidth]{figures/open_trees.png}
\caption{Reconstructing the Bayesian Skyline plot in Tracer.}
\label{fig:trees}
\end{figure}
The output will have the years on the x-axis and the effective
population size on the y-axis. By default, the y-axis is on a log-scale.
If everything worked as it is supposed to work you will see a sharp
increase in the effective population size in the mid 20th century,
similar to what is seen on Figure \ref{fig:skyline}.
(Note that the reconstruction will only work if the
\passthrough{\lstinline!*.log!} and \passthrough{\lstinline!*.trees!}
files contain the exact same number of states and both files were logged
at the same frequency).
\begin{figure}
\centering
\includegraphics[width=0.750000\textwidth]{figures/skyline_analysis.png}
\caption{Coalescent Bayesian Skyline analysis output. The black line is the median estimate of the estimated effective population size (can be changed to the mean estimate). The two blue lines are the upper and lower bounds of the 95\% HPD interval. The x-axis is the time in years and the y-axis is on a log-scale.}
\label{fig:skyline}
\end{figure}
There are two ways to save the analysis, it can either be saved as a
\passthrough{\lstinline!*.pdf!} for display purposes or as a tab
delimited file.
\begin{framed}
Navigate to \textbf{File \textgreater{} Export Data Table}.
Enter the filename as \passthrough{\lstinline!hcv_coal.tsv!} and save
the file.
\end{framed}
The exported file will have five rows, the time, the mean, median, lower
and upper boundary of the 95\% HPD interval of the estimates, which you
can use to plot the data with other software (R, Matlab, etc).
\hypertarget{choosing-the-dimension}{%
\subsubsection{Choosing the Dimension}\label{choosing-the-dimension}}
If we compare the estimates of the population dynamics using different
dimensions, we see that most of the dynamics are already captured with
having only 2 dimensions, as shown in Figure \ref{fig:comparison}.
Adding more dimensions only changes the inferred effective population
size before 1900. Note that adding more dimensions adds a slight dip
before the increase in the effective population size (around 1900). When
comparing to the HPD intervals (Figure \ref{fig:skyline}) we see that
this dip is not significant and may not be indicative of a real decrease
in the effective population size before the subsequent increase.
\begin{figure}
\centering
\includegraphics[width=1.000000\textwidth]{figures/comparison_dimension.png}
\caption{Estimated mean effective population sizes using different dimensions.}
\label{fig:comparison}
\end{figure}
The choice of the number of dimensions can also have a direct effect on
how fast the MCMC converges (Figure \ref{fig:ess}). The slower
convergence with increasing dimension can be caused by e.g.~less
information per interval. To some extent it is simply caused by the need
to estimate more parameters though.
\begin{figure}
\centering
\includegraphics[width=0.500000\textwidth]{figures/ess_vs_dim_coal.png}
\caption{The ESS value of the posterior after running an MCMC chain with `$ 10^7 $` samples, logged every `$ 10^3 $` steps and a burnin of 10\% for using different dimensions of the Coalescent Bayesian Skyline.}
\label{fig:ess}
\end{figure}
\clearpage
\hypertarget{setting-up-the-birth-death-skyline-analysis}{%
\subsubsection{Setting up the Birth-Death Skyline
analysis}\label{setting-up-the-birth-death-skyline-analysis}}
In the first analysis, we used the coalescent approach to estimate
population dynamics. We now want to repeat the analysis using the
Birth-Death Skyline model. We will use the same model setup as in the
previous analysis and only change the tree prior.
\begin{framed}
Restart \textbf{BEAUti}, load \passthrough{\lstinline!hcv.nexus!} as
before and set up the same site and clock model as in the Coalescent
Bayesian Skyline analysis.
\end{framed}
We will need to set the prior to \textbf{Birth Death Skyline
Contemporary}, since the sequences were all sampled at the same point in
time. For heterochronous data (sequences sampled at different times), we
would use \textbf{Birth Death Skyline Serial}. As with the Coalescent
Bayesian Skyline, we need to set the number of dimensions. Here we set
the dimension for \passthrough{$ R_e $}, the effective
reproduction number, which denotes the average number of secondary
infections caused by an infected person at a given time during the
epidemic, i.e.~an \passthrough{$ R_e $} of 2 would mean
that every infected person causes two new infections on average. In
other words, an \passthrough{$ R_e $} above 1 means that
the number of cases are increasing, therefore the disease will cause an
exponentially growing epidemic, and an
\passthrough{$ R_e $} below 1 means that the epidemic will
die out.
\begin{framed}
Navigate to the \textbf{Priors} panel and select \textbf{Birth Death
Skyline Contemporary} as the tree prior (Figure \ref{fig:bdsky}).
Then, click on the button where it says \textbf{initial = {[}2.0{]}
{[}0.0, Infinity{]}} next to \textbf{reproductiveNumber}. A pop-up
window will open which allows us to change the dimension of the
parameter (Figure \ref{fig:dimensions_bdsky}). In this case we will keep
the default dimension of 10.
Press \textbf{OK} to close the pop-up window.
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/choose_bdsky.png}
\caption{Setting the prior on the tree to the Birth-Death Skyline.}
\label{fig:bdsky}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.750000\textwidth]{figures/choose_dimension_bdsky.png}
\caption{Setting the dimension of the reproductiveNumber parameter.}
\label{fig:dimensions_bdsky}
\end{figure}
This means that \passthrough{$ R_e $} will be allowed to
change at 9 equally spaced times between the origin of the epidemic and
the present time. Choosing this dimension can again be arbitrary and may
require the testing of a few different values. Too few intervals and not
all rate shifts are captured. Too many intervals and the intervals may
not contain enough information to infer parameters. (As with setting the
dimension of the Coalescent Bayesian Skyline the dimension of
\passthrough{$ R_e $} can also be set in the initialization
panel).
Besides \passthrough{$ R_e $}
(\textbf{reproductiveNumber}), the \textbf{Birth Death Skyline
Contemporary} model has 3 more parameters,
\textbf{becomeUninfectiousRate} (the rate at which infected patients
become uninfectious, \passthrough{$ \delta $}, through
recovery, death or isolation), \textbf{rho} (the proportion of lineages
sampled in the present, \passthrough{$ \rho $}) and the
\textbf{origin} (the time at which the index case became infected, which
is always earlier than the tMRCA of the tree). We may know some of these
parameters from literature or be able to estimate them from external
sources. For example, the average time that patients are able to
transmit a disease is informative about the
\textbf{becomeUninfectiousRate}. This prior knowledge we can incorporate
in our analysis by setting appropriate priors for these parameters.
We will use a lognormal prior for \passthrough{$ R_e $}.
This is a good prior distribution to use for rates since it is always
positive (a rate cannot be negative) and has a long tail defined over
all positive numbers. The long tail allows arbitrarily high estimates of
\passthrough{$ R_e $}, but does not place much weight on
very high rates. This agrees with our prior knowledge about
\passthrough{$ R_e $} (most diseases have an
\passthrough{$ R_e $} between 1.2 and 5. Measles is one of
the most infectious diseases we know about and has
\passthrough{$ R_e \approx 18 $}). If an epidemic is
neither growing or declining, it has an
\passthrough{$ R_e $} of 1, which we will use as a null
hypothesis, by setting a prior on \passthrough{$ R_e $}
centered around 1 (we assume that if there isn't a strong signal in an
interval for an epidemic to grow or decline that
\passthrough{$ R_e = 1 $}, i.e.~the epidemic size stays
constant). Note that this prior is used for each of the
\passthrough{$ R_e $} intervals (the Birth-Death Skyline
assumes that \passthrough{$ R_e $} is independent in each
of the intervals).
\begin{framed}
Select a \textbf{Log Normal} distribution for the
\textbf{reproductiveNumber} prior.
Click on the arrow to the left of \textbf{reproductiveNumber} to open
all the options for \passthrough{$ R_e $} settings
Set \textbf{M} to 0, which results in a median of 1. We set \textbf{S}
to 1.25, which places most weight below 7.82 (95\% quantile). (Figure
\ref{fig:r0prior}).
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/bdsky_prior_r0.png}
\caption{Setting the `$ R_e $` prior.}
\label{fig:r0prior}
\end{figure}
For the becoming uninfectious rate we will again use a log normal prior.
The inverse of the becoming uninfectious rate is the average infectious
period. In some patients an HCV infection only lasts a few weeks, while
in others it is a chronic infection lasting for many years. Setting
\passthrough{$ M=0 $} and
\passthrough{$ S=1.25 $} results in the same prior we used
for the \passthrough{$ R_e $}. In terms of the becoming
uninfectious rate, this translates to the 95\% quantiles for the
infectious period falling between 0.0862 years (31.5 days) and 11.59
years, with a median of 1 year. We will see later that there is a strong
signal in the data for a longer becoming uninfectious period.
\begin{framed}
Set the same prior for \textbf{becomeUninfectiousRate} as for
\textbf{reproductiveNumber} (Log Normal, with M=0.0, S=1.25) (Figure
\ref{fig:bURprior})
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/bdsky_prior_uninf.png}
\caption{Setting the becoming uninfectious rate prior.}
\label{fig:bURprior}
\end{figure}
The sampling proportion, \passthrough{$ \rho $}, represents
the proportion of HCV cases in Egypt in 1993 that are included in the
analysis. In 1993 Egypt had a population of roughly 60 million people,
and with a prevalence of at least 15\% this translates into millions of
cases, while we only have 63 sequences.
We will use a beta distribution for the prior on
\passthrough{$ \rho $}. Beta distributions are a very
flexible class of distributions that are only defined between 0 and 1,
making them ideal to use for proportions.
\begin{framed}
Select a \textbf{Beta} distribution for the \textbf{rho} prior.
Click on the arrow to the left of \textbf{rho} to open all the options
for the prior settings.
Alpha to 1 and Beta to 9999, reflecting our prior knowledge that our
dataset represents only a miniscule fraction of cases (Figure
\ref{fig:rhoprior}).
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/bdsky_prior_rho.png}
\caption{Setting the prior on `$ \rho $`.}
\label{fig:rhoprior}
\end{figure}
Finally, we need to set a prior for the origin of the epidemic. We will
once again use a log normal distribution for this parameter. Note that
the origin also has to be positive and needs to be bigger than the MRCA
of the tree. We know that HCV has been circulating in Egypt for at least
a hundred years, so we set a prior with a median value greater than 100.
\begin{framed}
Set a \textbf{Log Normal} prior for \textbf{origin} with \textbf{M = 5}
and \textbf{S = 0.5} (Figure \ref{fig:oriprior}), resulting in a median
prior estimate for the origin of 148 years.
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/bdsky_prior_ori.png}
\caption{Setting the prior on the origin of the epidemic.}
\label{fig:oriprior}
\end{figure}
The rest of the priors pertain to the site model parameters and we can
leave them as they are.
\begin{framed}
Navigate to the \textbf{MCMC} panel.
Change the \textbf{Chain Length} from 10'000'000 to 3'000'000.
Click on the arrow next to the \textbf{tracelog} and change the
\textbf{File Name} to \passthrough{lstinline!\$(filebase).log!} and set
the \textbf{Log Every} to 3'000.
Click on the arrow next to the \textbf{treelog} and change the
\textbf{File Name} to \passthrough{lstinline!\$(filebase)-\$(tree).log!}
and set the \textbf{Log Every} to 3'000.
Leave all other settings at their default values and save the file as
\passthrough{\lstinline!hcv_bdsky.xml!}.
(Note that since BEAST 2.7 the filenames used here are the default
filenames and should not need to be changed!)
\end{framed}
Now we are ready to run the analysis.
\begin{framed}
Start \textbf{BEAST2} and choose the file
\passthrough{\lstinline!hcv_bdsky.xml!}.
If you have \textbf{BEAGLE} installed tick the box to \textbf{Use BEAGLE
library if available}, which will make the analysis run faster.
Hit \textbf{Run} to start the analysis.
\end{framed}
Look at the topics for discussion below and read through the next
section while waiting for the analysis to finish.
\begin{framed}
\textbf{Topics for discussion:}
\begin{itemize}
\item
We set a prior on \passthrough{$ R_e $} in the
Birth-Death Skyline analysis, but did not set any prior for
\passthrough{$ N_e $} in the Coalescent Bayesian Skyline
analysis. Is there a prior on \passthrough{$ N_e $}? If
so, what is it?
\item
We fixed the clock rate to an independent estimate and set a strict
clock. If we had strong prior knowledge that there is substitution
rate variation over time in the Egyptian HCV epidemic, could we use a
relaxed clock here?
\end{itemize}
\end{framed}
\hypertarget{the-birth-death-skyline-parameterization}{%
\subsubsection{The Birth-Death Skyline
parameterization}\label{the-birth-death-skyline-parameterization}}
The birth-death model is parameterized very differently from the
coalescent model, using per lineage rates and an explicit sampling model
(whereas the coalescent model conditions on the samples). This makes the
birth-death model more powerful, but also much more complex. A basic
birth-death model has a birth rate
(\passthrough{$ \lambda $}), the rate at which lineages are
added to the tree, and a death rate
(\passthrough{$ \delta $}), the rate at which lineages are
removed from the tree (Figure \ref{fig:bd_model}). In an infectious
disease epidemic \passthrough{$ \lambda $} can be thought
of as the transmission rate, the rate at which infected individuals
infect susceptibles, while \passthrough{$ \delta $} can be
thought of as the becoming uninfectious rate, the rate at which infected
individuals recover, die or are isolated. In species tree inferences
these rates can be thought of in terms of speciation and extinction.
\begin{figure}
\centering
\includegraphics[width=0.250000\textwidth]{figures/bd_model.png}
\caption{A schematic of the birth-death model.}
\label{fig:bd_model}
\end{figure}
The \textbf{Birth Death Skyline Contemporary} model we used was
parameterized in terms of \passthrough{$ R_e $} and
\passthrough{$ \delta $}. Recall that
\passthrough{$ R_e > 1 $} means that an epidemic will keep
growing. We can see this from the definition of
\passthrough{$ R_e $} as the ratio of the birth and death
rates.
\begin{equation}
R_{e} = \frac{\lambda}{\delta}
\end{equation}
if \passthrough{$ \lambda > \delta $} then
\passthrough{$ R_e > 1 $}
epidemic grows
if \passthrough{$ \lambda = \delta $} then
\passthrough{$ R_e = 1 $}
epidemic stays constant
if \passthrough{$ \lambda < \delta $} then
\passthrough{$ R_e < 1 $}
epidemic declines
We used this paramerization simply because it is often easier to specify
priors for \passthrough{$ R_e $} than the transmission
rate, and because \passthrough{$ R_e $} is often more
informative for prevention efforts. In addition, the model also has a
sampling probability (\passthrough{$ \rho $}) parameter,
which in our analysis describes how likely it is that a person infected
with HCV in Egypt in 1993 was sampled in our dataset. The final
parameter is the origin. Whereas coalescent models work backward-in-time
from the sampled sequences, birth-death models work forward-in-time from
the origin. Hence, the model needs an origin time, which can also be
jointly estimated along with the other parameters. The origin will
always be at least as big, and usually bigger, than the tMRCA of the
sampled tree, since the sampled tree is by definition smaller than the
complete tree.
You may have noticed that there are many Birth-Death Skyline models
available in BEAUti. For example, the \textbf{Birth Death Skyline
Contemporary BDSParam} model is parameterized in terms of
\passthrough{$ \lambda, \delta $} and
\passthrough{$ \rho $} and is usually more appropriate for
macroevolutionary studies. The \textbf{Birth Death Skyline Serial} model
assumes that the data are heterochronous (sampled at different times).
It assumes that:
\begin{equation}
\delta = \psi + \mu
\end{equation}
where \passthrough{$ \psi $} is the rate at which lineages
are sampled through time and \passthrough{$ \mu $} is the
rate at which lineages are removed from the tree for any other reason
(death, recovery, extinction etc.). (In this case the
\passthrough{$ \rho $} parameter is no-longer available by
default, because samples are collected through time, and not just at one
timepoint). By default, the model is parameterized in terms of
\passthrough{$ R_e , \delta $} and
\passthrough{$ p $}, the sampling proportion:
\begin{equation}
p = \frac{\psi}{\psi + \mu}
\end{equation}
The sampling proportion is the proportion of all removed lineages that
were sampled, and can be used to obtain a rough estimate of the total
population size. This model is useful for studying infectious disease
dynamics, because samples are often collected over the course of an
epidemic. It can also be used for macro-evolutionary studies, when
fossil data (morphological traits or ancient DNA) are incorporated. In
that case a parameterization in terms of
\passthrough{$ \lambda, \mu $} and
\passthrough{$ \psi $} is preferable.
You can also see that the model \textbf{Birth Death Skyline Serial}
assumes that upon sampling a lineage is removed from the tree (e.g.~in a
disease model the sampled individual cannot transmit the disease after
sampling). The consequence for the phylogeny is that a sampled lineage
cannot be a direct ancestor of any other lineage in the tree. This
assumption can be relaxed, but we will not do so during this tutorial.
The Birth-Death Skyline model is very flexible and allows any or all of
these rates to change independently over time. This is done by dividing
the time from the origin to the most recent sample into dimension
\passthrough{$ d $} equally spaced intervals (see Figure
\ref{fig:bdsky_principle}). The rates are then allowed to change between
intervals. Since some rates (e.g.~\passthrough{$ \lambda $}
and \passthrough{$ \delta $}) are highly correlated, it is
not always a good idea to let all rates change over time because it can