/
main.tex
1413 lines (1146 loc) · 54.5 KB
/
main.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
% This file was created (at least in part) by the script ParseMdtoLatex by Louis du Plessis
% (Available from https://github.com/taming-the-beast)
\documentclass[11pt]{article}
\input{preamble}
% Add your bibtex library here
\addbibresource{master-refs}
%%%%%%%%%%%%%%%%%%%%
% Do NOT edit this %
%%%%%%%%%%%%%%%%%%%%
\begin{document}
\renewcommand{\headrulewidth}{0.5pt}
\headsep = 20pt
\lhead{ }
\rhead{\textsc {BEAST v2 Tutorial}}
\thispagestyle{plain}
%%%%%%%%%%%%%%%%%%
% Tutorial title %
%%%%%%%%%%%%%%%%%%
\begin{center}
% Enter the name of your tutorial here
\textbf{\LARGE Tutorial using BEAST v2.7.x}\\\vspace{2mm}
% Enter a short description of your tutorial here
\textbf{\textcolor{mycol}{\Large Introduction to BEAST2}}\\
\vspace{4mm}
% Enter the names of all the authors here
{\Large {\em Jūlija Pečerska, Veronika Bošková and Louis du Plessis}}
\end{center}
This is a simple introductory tutorial to help you get started with
using BEAST2 and its accomplices.
%%%%%%%%%%%%%%%%%
% Tutorial body %
%%%%%%%%%%%%%%%%%
\hypertarget{background}{%
\section{Background}\label{background}}
Before diving into performing complex analyses with BEAST2 one needs to
understand the basic workflow and concepts. While BEAST2 tries to be as
user-friendly as possible, the amount of possibilities can be
overwhelming.
In this simple tutorial you will get acquainted with the basic workflow
of BEAST2 and the software tools most commonly used to interpret the
results of analyses. Bear in mind that this tutorial is designed only to
help you get started using BEAST2. This tutorial does not discuss all
the choices and concepts in detail, as they are discussed in further
tutorials. Interspersed throughout the tutorial are topics for
discussion. These discussion topics are optional, however if you work
through them you will have a better understanding of the concepts
discussed in this tutorial. Feel free to skip the discussion topics and
come back to them later, while running the analysis file, or after
finishing the whole tutorial.
\clearpage
\hypertarget{programs-used-in-this-exercise}{%
\section{Programs used in this
Exercise}\label{programs-used-in-this-exercise}}
\hypertarget{beast2---bayesian-evolutionary-analysis-sampling-trees-2}{%
\subsubsection{BEAST2 - Bayesian Evolutionary Analysis Sampling Trees
2}\label{beast2---bayesian-evolutionary-analysis-sampling-trees-2}}
BEAST2 (\url{http://www.beast2.org}) is a free software package for
Bayesian evolutionary analysis of molecular sequences using MCMC and
strictly oriented toward inference using rooted, time-measured
phylogenetic trees. This tutorial is written for BEAST v2.7.x \citep{Bouckaert2014, Bouckaert2019}.
\hypertarget{beauti2---bayesian-evolutionary-analysis-utility}{%
\subsubsection{BEAUti2 - Bayesian Evolutionary Analysis
Utility}\label{beauti2---bayesian-evolutionary-analysis-utility}}
BEAUti2 is a graphical user interface tool for generating BEAST2 XML
configuration files.
Both BEAST2 and BEAUti2 are Java programs, which means that the exact
same code runs on all platforms. For us it simply means that the
interface will be the same on all platforms. The screenshots used in
this tutorial are taken on a Mac OS X computer; however, both programs
will have the same layout and functionality on both Windows and Linux.
BEAUti2 is provided as a part of the BEAST2 package so you do not need
to install it separately.
\hypertarget{treeannotator}{%
\subsubsection{TreeAnnotator}\label{treeannotator}}
TreeAnnotator is used to summarise the posterior sample of trees to
produce a maximum clade credibility tree. It can also be used to
summarise and visualise the posterior estimates of other tree parameters
(e.g.~node height).
TreeAnnotator is provided as a part of the BEAST2 package so you do not
need to install it separately.
\hypertarget{tracer}{%
\subsubsection{Tracer}\label{tracer}}
Tracer (\url{http://beast.community/tracer}) is used to summarise the
posterior estimates of the various parameters sampled by the Markov
Chain. This program can be used for visual inspection and to assess
convergence. It helps to quickly view median estimates and 95\% highest
posterior density intervals of the parameters, and calculates the
effective sample sizes (ESS) of parameters. It can also be used to
investigate potential parameter correlations. We will be using Tracer
v1.7.x.
\hypertarget{figtree}{%
\subsubsection{FigTree}\label{figtree}}
FigTree (\url{http://beast.community/figtree}) is a program for viewing
trees and producing publication-quality figures. It can interpret the
node-annotations created on the summary trees by TreeAnnotator, allowing
the user to display node-based statistics (e.g.~posterior
probabilities). We will be using FigTree v1.4.x.
\hypertarget{densitree}{%
\subsubsection{DensiTree}\label{densitree}}
Bayesian analysis using BEAST2 provides an estimate of the uncertainty
in tree space. This distribution is represented by a set of trees, which
can be rather large and difficult to interpret. DensiTree is a program
for qualitative analysis of sets of trees. DensiTree allows to quickly
get an impression of properties of the tree set such as well-supported
clades, distribution of tree heights and areas of topological
uncertainty.
DensiTree is provided as a part of the BEAST2 package so you do not need
to install it separately.
\clearpage
\hypertarget{practical-running-a-simple-analysis-with-beast2}{%
\section{Practical: Running a simple analysis with
BEAST2}\label{practical-running-a-simple-analysis-with-beast2}}
This tutorial will guide you through the analysis of an alignment of
sequences sampled from twelve primate species. The aim of this tutorial
is to co-estimate the gene phylogeny and the rate of evolution. More
generally, this tutorial aims to introduce new users to a basic workflow
and point out the steps towards performing a full analysis of sequencing
data within a Bayesian framework using BEAST2.
After completing this tutorial you should be able to:
\begin{itemize}
\item
Set up all the components of a simple BEAST2 analysis in BEAUti2
\item
Use a calibration node to calibrate the molecular clock
\item
Use Tracer, FigTree and DensiTree to check convergence and analyse
results
\end{itemize}
\hypertarget{the-data}{%
\subsection{The Data}\label{the-data}}
Before we can start, we need to download the input data for the
tutorial. For this tutorial we use a single NEXUS file,
\passthrough{\lstinline!primate-mtDNA.nex!}, which contains a sequence
alignment and metadata of the twelve primate mitochondrial genomes we
will be analysing. Among other information, the metadata contains
information to partition the alignment into 4 regions:
\begin{itemize}
\item
Non-coding region
\item
1st codon positions
\item
2nd codon positions
\item
3rd codon positions
\end{itemize}
The alignment file can be downloaded from the Taming the BEAST website
at \url{https://taming-the-beast.org/tutorials/Introduction-to-BEAST2/}
or from Github.
\begin{framed}
\textbf{Downloading from taming-the-beast.org}
A link to the alignment file,
\passthrough{\lstinline!primate-mtDNA.nex!}, is on the left-hand panel,
under the heading \textbf{Data}. \textbf{Right-click} on the link and
select \textbf{``Save Link As\ldots{}''} (Firefox and Chrome) or
\textbf{``Download Linked File As\ldots{}''} (Safari) and save the file
to a convenient location on your local drive. Note that some browsers
will automatically change the extension of the file from
\passthrough{\lstinline!.nex!} to \passthrough{\lstinline!.nex.txt!}. If
this is the case, simply rename the file again.
Alternatively, if you \textbf{left-click} on the link most browsers will
display the alignment file. You can then press \textbf{File
\textgreater{} Save As} to store a local copy of the file. Note that
some browsers will inject an HTML header into the file, which will make
it unusable in BEAST2 (making this the less preferable option for
downloading data files).
In the same way you can also download example
\passthrough{\lstinline!.xml!} files for the analyses in this tutorial,
as well as \emph{pre-cooked} output \passthrough{\lstinline!.log!} and
\passthrough{\lstinline!.trees!} files. We recommend only downloading
these files to check your results or if you become seriously stuck.
\end{framed}
\begin{framed}
\textbf{Downloading from Github}
If you navigate to the Github repository you can either download the raw
data files directly from \textbf{Github} or clone/download the
repository to your local drive.
Note that this tutorial is distributed under a \textbf{CC BY 4.0}
license, which gives anyone the right to freely use (and modify) it, as
long as appropriate credit is given and the updated material is licensed
in the same fashion.
\end{framed}
\hypertarget{creating-the-analysis-file-with-beauti}{%
\subsection{Creating the Analysis File with
BEAUti}\label{creating-the-analysis-file-with-beauti}}
To run analyses with BEAST, one needs to prepare a configuration file in
XML format that contains everything BEAST2 needs to run an analysis. A
BEAST2 XML file contains:
\begin{itemize}
\item
The data (typically a sequence alignment)
\item
The model specification
\item
Initial values and parameter constraints
\item
Settings of the MCMC algorithm
\item
Output options
\end{itemize}
Even though it is possible to create such files from scratch in a text
editor, it can be complicated and is not exactly straightforward. BEAUti
is a user-friendly program designed to aid you in producing a valid
configuration file for BEAST. If necessary, that file can later be
edited by hand, but it is recommended to use BEAUti for generating the
files (at least for the initial round of analysis).
\begin{framed}
Begin by starting \textbf{BEAUti2}.
\end{framed}
\hypertarget{importing-the-alignment}{%
\subsubsection{Importing the alignment}\label{importing-the-alignment}}
To give BEAST2 access to the data, the alignment has to be added to the
configuration file.
\begin{framed}
Drag and drop the file \passthrough{\lstinline!primate-mtDNA.nex!} into
the open BEAUti window (it should be on the \textbf{Partitions} tab).
Alternatively, use \textbf{File \textgreater{} Import Alignment} or
click on the \textbf{+} in the bottom left-hand corner of the window,
then locate and click on the alignment file.
\end{framed}
Once you have done that, the data should appear in the BEAUti window
which should look as shown in Figure \ref{fig:data}.
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/data.png}
\caption{Data imported into BEAUti2.}
\label{fig:data}
\end{figure}
\hypertarget{setting-up-shared-models}{%
\subsubsection{Setting up shared
models}\label{setting-up-shared-models}}
A common way to account for site-to-site rate heterogeneity (variation
in substitution rates between different sites) is to use a Gamma site
model. In this model, one assumes that rate variation follows a Gamma
distribution. To make the analysis tractable the Gamma distribution is
discretised to a small number of bins (4-6 usually). The mean of each
bin then acts as a multiplier for the overall substitution rate. The
transition probabilities are then calculated for each scaled
substitution rate. To calculate the likelihood for a site,
\textbf{P(data \textbar{} tree, substitution model)} is calculated under
each Gamma rate category and the results are summed up to average over
all possible rates. This is a handy approach if one suspects that some
sites are mutating faster than others but the precise position of these
sites in the alignment is unknown or random.
Another way to account for site-to-site rate heterogeneity is to split
the alignment into explicit partitions, and specify an independent
substitution model for each partition. This is especially relevant when
one knows exactly which positions in the alignment are expected to
evolve at different rates. In our example, we split the alignment into
coding and non-coding regions, and further split the coding region into
1st, 2nd and 3rd codon positions. This information is encoded as
metadata into the \passthrough{\lstinline!.nex!} file, which BEAUti
automatically processes to produce the four partitions in the
\textbf{Partitions} tab as shown in Figure \ref{fig:data}.
\begin{framed}
\textbf{Double-click} on the different partitions (under the
\textbf{File} column) to view the individual alignments.
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/2ndpos.png}
\caption{The partition for the 2nd codon positions in the coding region of the primate mtDNA alignment.}
\label{fig:2ndpos}
\end{figure}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/3rdpos.png}
\caption{The partition for the 3rd codon positions in the coding region of the primate mtDNA alignment.}
\label{fig:3rdpos}
\end{figure}
By looking at the alignments for the 2nd and 3rd codon positions (Figure
\ref{fig:2ndpos} and Figure \ref{fig:3rdpos}) we can immediately see a
clear difference between the two codon positions. For the 2nd codon
position many of the ancestral relationships are clear from shared
substitutions between groups, for instance between the great apes
(\emph{Homo sapiens}, \emph{Pan}, \emph{Gorilla} and \emph{Pongo} -
humans, chimpanzees, gorillas and orangutans) and between the old-world
monkeys (\emph{Macaca fuscata}, \emph{M. mulatta}, \emph{M.
fascicularis} and \emph{M. sylvanus} - the macaques). The lesser apes
(represented by \emph{Hylobates} - gibbons) share most substitutions
with the great apes, but occasionally share a substitution with the
macaques. For the 3rd codon position there are many more substitutions
and the groupings are not as clear.
\begin{framed}
\textbf{Topic for discussion:} Do you think there is a good case for
using independent substitution models on the different partitions in the
\passthrough{\lstinline!.nex!} file? Do you think this is sufficient for
taking all site-to-site rate variation into account?
How would you account for rate variation between sites \emph{within}
each partition?
\end{framed}
Since all of the sequences in this dataset are from the mitochondrial
genome (which is not believed to undergo recombination in birds and
mammals) they all share the same ancestry. By default, BEAST2 would
recover a separate, independent time-calibrated tree for each partition,
so we need to make sure that it uses all data to recover only a single
shared tree. For the sake of simplicity, we will also assume that the
partitions have the same evolutionary branch-rate distribution, and
hence share the clock model as well.
To make sure that the partitions share the same evolutionary history we
need to link the \textbf{clock model} and the \textbf{tree} in BEAUti.
\begin{framed}
Select all four data partitions the \textbf{Partitions} panel (use
\textbf{shift+click}) and click the \textbf{Link Trees} and \textbf{Link
Clock Models} buttons.
\end{framed}
You will see that the \textbf{Clock Model} and the \textbf{Tree} columns
in the table both changed to say \passthrough{\lstinline!noncoding!}.
Now we will rename both models such that the following options and
generated log files are easier to read. The resulting setup should look
as shown in Figure \ref{fig:link}.
\begin{framed}
Click on the first drop-down menu in the \textbf{Clock Model} column and
rename the shared clock model to \passthrough{\lstinline!clock!}.
Likewise, rename the shared tree to \passthrough{\lstinline!tree!}.
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/link.png}
\caption{Linked clock and tree models.}
\label{fig:link}
\end{figure}
\hypertarget{setting-the-substitution-model}{%
\subsubsection{Setting the substitution
model}\label{setting-the-substitution-model}}
In this analysis all of our sequences come from extant species and were
thus all sampled in the present day (assumed to be
\passthrough{$ t = 0 $}). Therefore we do not need to set
sampling dates and we skip the \textbf{Tip Dates} panel. Next, we need
to set the substitution model in the \textbf{Site Model} tab.
\begin{framed}
Select the \textbf{Site Model} tab.
\end{framed}
The options available in this panel depend on whether the alignment data
are in nucleotides, amino-acids, binary data or general data. The
settings available after loading the alignment will contain the default
values we normally want to modify.
The panel on the left shows each partition. Remember that we did not
link the substitution models in the previous step for the different
partitions, so each partition evolves under a different substitution
model, i.e.~we assume that different positions in the alignment
accumulate substitutions differently. We will need to set the site model
separately for each part of the alignment as these models are unlinked.
However, we think that all partitions evolve according to the same
general model (albeit with different parameter values).
\begin{framed}
Make sure that \passthrough{\lstinline!noncoding!} is selected.
\begin{itemize}
\item
Check the \textbf{estimate} checkbox for \textbf{Substitution Rate}.
\item
Set the \textbf{Gamma Category Count} to 4.
\item
Check the \textbf{estimate} box for the \textbf{Shape} parameter (it
should already be checked).
\item
Select \textbf{HKY} in the \textbf{Subst Model} drop-down menu.
\item
Ensure that the \textbf{estimate} box for \textbf{Kappa} is checked
(it should already be checked).
\item
Select \textbf{Empirical} from the \textbf{Frequencies} drop-down
menu.
\end{itemize}
Note that when you checked \textbf{estimate} for the substitution rate
the greyed out \textbf{Fix mean substitution rate} box at the bottom of
the window was also automatically checked.
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/substitution.png}
\caption{Site model setup.}
\label{fig:subst}
\end{figure}
The panel should now look like Figure \ref{fig:subst}.
We are using an HKY substitution model with empirical frequencies. This
will fix the frequencies to the proportions observed in the partition.
This approach means that we can get a good fit to the data without
explicitly estimating these parameters. To model site-to-site rate
variation within each partition we use a discrete Gamma site model with
4 categories. Now we \emph{could} repeat the above steps for each of the
remaining partitions or we can take a shortcut.
\begin{framed}
Select the remaining three partitions (use \textbf{shift+click}). The
window will now look like Figure \ref{fig:clone}.
Select \passthrough{\lstinline!noncoding!} and click \textbf{OK} to to
clone the site model for the other three partitions from
\passthrough{\lstinline!noncoding!}.
\end{framed}
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/clone.png}
\caption{Shortcut to clone site models between partitions.}
\label{fig:clone}
\end{figure}
\begin{framed}
\textbf{Topic for discussion:} Can you figure out the reason why
\textbf{Fix mean substitution rate} was automatically checked when you
checked \textbf{estimate} for the substitution rate? Don't worry if you
can't figure it out, it is explained in detail in later tutorials.
\end{framed}
\hypertarget{setting-the-clock-model}{%
\subsubsection{Setting the clock model}\label{setting-the-clock-model}}
Next, select the \textbf{Clock Model} tab at the top of the main window.
This is where we set up the molecular clock model. For this exercise we
are going to leave the selection at the default value of a strict
molecular clock, because this data is very clock-like and does not need
rate variation among branches to be included in the model.
\begin{framed}
Click on the \textbf{Clock Models} tab and view the setup \emph{(but
don't do anything)}.
\end{framed}
Note that the \textbf{estimate} box next to \textbf{Mean clock rate} is
not checked.
\hypertarget{setting-priors}{%
\subsubsection{Setting priors}\label{setting-priors}}
The \textbf{Priors} tab allows prior distributions to be specified for
each model parameter. The model selections made in the \textbf{Site
Model} and \textbf{Clock Model} tabs determine which parameters are
included in the model. For each of these parameters a prior distribution
needs to be specified. It is also possible to specify hyperpriors (and
hyper-hyperpriors etc.) for each of the model parameters. We also need
to specify a prior for the \textbf{Tree}. In this example the tree prior
is a null model for species diversity over time within the primates.
Here we specify that we wish to use the Calibrated Yule model as the
tree prior. This is a simple model of speciation that is generally
appropriate when considering sequences from different species.
\begin{framed}
Go to the \textbf{Priors} tab and select \textbf{Calibrated Yule Model}
in the drop-down menu next to \textbf{Tree.t:tree}.
\end{framed}
The \textbf{birthRate} parameter measures the rate of speciation in the
calibrated Yule model. We will set an unspecific Gamma prior for this
parameter.
\begin{framed}
For \textbf{birthRateY.t:tree} select \textbf{Gamma} from the drop-down
menu
\begin{itemize}
\item
Expand the options for \textbf{birthRateY.t:tree} using the arrow
button on the left.
\item
Set the \textbf{Alpha} (shape) parameter to \textbf{0.001} and the
\textbf{Beta} (scale) parameter to \textbf{1000}.
\end{itemize}
\end{framed}
Note that BEAUti displays a plot of the prior distribution on the right,
as well as a few of its quantiles. This is for easy reference and can
help us to decide if a prior is appropriate. The unspecific Gamma prior
we are using is defined on \passthrough{$ (0,\infty) $} and
has a very broad range of values included between its 2.5\% and 97.5\%
quantiles.
If we wanted to add a hyperprior on one of the parameters of the Gamma
prior we would check the \textbf{estimate} box on the right of the
parameter. We could also change the initial values or limits of the
model parameters by clicking on the boxes next to the drop-down menus.
Do \textbf{not} do this here, as we are \textbf{not} adding any
hyperpriors or changing limits in this analysis!
\textbf{We will leave the rest of the priors on their default values!}
The BEAUti panel should look as shown in Figure \ref{fig:priors}.
Please note that in general using default priors is frowned upon as
priors are meant to convey your prior knowledge of the parameters. It is
important to know what information the priors add to the MCMC analysis
and whether this fits your particular situation. In our case the default
priors are suitable for this particular analysis, however for further,
more complex analyses, we will require a clear idea of what the priors
mean. Getting this understanding is difficult and comes with experience.
The topic of choosing priors is discussed in more detail in later
tutorials.
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/priors.png}
\caption{Prior setup.}
\label{fig:priors}
\end{figure}
\hypertarget{adding-a-calibration-node}{%
\subsubsection{Adding a calibration
node}\label{adding-a-calibration-node}}
Since all of the samples come from a single time point, there is no
information on the actual height of the phylogenetic tree in time units.
That means the tree height (tMRCA) and substitution rate parameters will
not be distinguishable and BEAST2 will only be able to estimate their
product. To allow BEAST2 to separate these two parameters we need to
input additional information that will help calibrate the tree in time.
In a Bayesian analysis, additional information from external sources
should be encoded in the form of a prior distribution. Thus, we will
have to add a new prior to the model.
\begin{framed}
To add an extra prior to the model, press the \textbf{+ Add Prior}
button below the list of priors. If this doesn't automatically open a
\textbf{Taxon set editor} window, select \textbf{MRCA Prior} from the
drop-down menu.
You will see a dialogue box (\textbf{Taxon set editor}) that allows you
to select a subset of taxa from the phylogenetic tree. Once you have
created a taxon set you will be able to add calibration information for
its most recent common ancestor (MRCA) later on.
\begin{itemize}
\item
Set the \textbf{Taxon set label} to
\passthrough{\lstinline!human-chimp!}.
\item
Locate \passthrough{\lstinline!Homo\_sapiens!} in the left hand side
list and click the \textbf{\textgreater\textgreater{}} button to add
it to the \passthrough{\lstinline!human-chimp!} taxon set.
\item
Locate \passthrough{\lstinline!Pan!} in the left hand side list and
click the \textbf{\textgreater\textgreater{}} button to add it to the
\passthrough{\lstinline!human-chimp!} taxon set.
\end{itemize}
\end{framed}
The taxon set should now look like Figure \ref{fig:taxa}.
\begin{framed}
Click the \textbf{OK} button to add the newly defined taxon set to the
prior list.
\end{framed}
\begin{figure}
\centering
\includegraphics[width=0.600000\textwidth]{figures/taxa.png}
\caption{Calibration node taxon set.}
\label{fig:taxa}
\end{figure}
The new node we have added is a calibrated node on the human-chimpanzee
split to be used in conjunction with the Calibrated Yule prior. In order
for that to work we need to enforce monophyly. This will constrain the
tree topology so that the human-chimp grouping is kept monophyletic
during the course of the MCMC analysis.
\begin{framed}
Check the \textbf{monophyletic} checkbox next to the
\textbf{human-chimp.prior}.
\end{framed}
We now need to specify a prior distribution on the calibration node
based on our prior knowledge from fossils in order to calibrate our
tree. We will use a Normal distribution with mean 6 MYA and a standard
deviation of 0.5 million years. This will give a central 95\% range of
about 5-7 million years, which roughly corresponds to the current
consensus estimate of the date of the most recent common ancestor of
humans and chimpanzees.
\begin{framed}
Select \textbf{Normal} from the drop-down menu to the right of the newly
added \textbf{human-chimp.prior}.
\begin{itemize}
\item
Expand the distribution options using the arrow button on the left.
\item
Set the \textbf{Mean} of the distribution to \textbf{6}.
\item
Set the \textbf{Sigma} of the distribution to \textbf{0.5}.
\end{itemize}
\end{framed}
The final setup of the calibration node prior should look as shown in
Figure \ref{fig:calibration}.
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/calibration.png}
\caption{Calibration node prior setup.}
\label{fig:calibration}
\end{figure}
\begin{framed}
\textbf{Topic for discussion:} After setting the calibration node prior
a new prior for \passthrough{\lstinline!clockrate.c:clock!} magically
appeared in the priors tab! If you go back to the \textbf{Clock model}
tab you'll see that the \textbf{estimate} box next to \textbf{Mean clock
rate} is now checked and greyed out (cannot be unchecked).
What just happened?
\end{framed}
\hypertarget{setting-the-mcmc-options}{%
\subsubsection{Setting the MCMC
options}\label{setting-the-mcmc-options}}
Finally, the \textbf{MCMC} tab allows us to control the length of the
MCMC chain and the frequency of stored samples. It also allows one to
change the output file names.
\begin{framed}
Go to the \textbf{MCMC} tab.
\end{framed}
The \textbf{Chain Length} parameter specifies the number of steps the
MCMC chain will make before finishing (i.e.~the number of accepted
proposals). This number depends on the size of the dataset, the
complexity of the model and the precision of the answer required. The
default value of 10'000'000 is arbitrary and should be adjusted
accordingly. For this small dataset we initially set the chain length to
1'000'000 such that this analysis will take only a few minutes on most
modern computers (rather than hours). We leave the \textbf{Store Every}
and \textbf{Pre Burnin} fields at their default values.
\begin{framed}
Set the \textbf{Chain Length} to 1'000'000.
\end{framed}
Below these general settings you will find the logging settings. Each
particular option can be viewed in detail by clicking the arrow to the
left of it. You can control the names of the log files and how often
values will be stored in each of the files.
Start by expanding the \textbf{tracelog} options. This is the log file
you will use later to analyse and summarise the results of the run. The
\textbf{Log Every} parameter for the log file should be set relative to
the total length of the chain. Sampling too often will result in very
large files with little extra benefit in terms of the accuracy of the
analysis. Sampling too sparsely will mean that the log file will not
record sufficient information about the distributions of the parameters.
We normally want to aim to store no more than 10'000 samples so this
should be set to no less than chain length/10'000. For this analysis we
will make BEAST2 write to the log file every 200 samples.
\begin{framed}
Expand the \textbf{tracelog} options.
\begin{itemize}
\item
Set the \textbf{Log Every} parameter to \textbf{200}.
\item
Leave the filename as is (\passthrough{\lstinline!$(filebase).log!}).
\end{itemize}
\end{framed}
Next, expand the \textbf{screenlog} options. The screen output is simply
for monitoring the program's progress. Since it is not so important,
especially if you run your analysis on a remote computer or a computer
cluster, the \textbf{Log Every} can be set to any value. However, if it
is set too small, the sheer quantity of information being displayed on
the screen will actually slow the program down. For this analysis we
will make BEAST2 log to screen every 1'000 samples, which is the default
setting.
\begin{framed}
Expand the \textbf{screenlog} options.
\begin{itemize}
\item
Leave the \textbf{Log Every} parameter at the default value of 1'000.
\end{itemize}
\end{framed}
Finally, we can also change the tree logging frequency by expanding
\textbf{treelog.t:tree}. For big trees with many taxa each individual
tree will already be quite large, thus if you log lots of trees the tree
files can easily become extremely large. You will be amazed at how
quickly BEAST can fill up even the biggest of drives if the tree logging
frequency is too high! For this reason it is often a good idea to set
the tree logging frequency lower than the trace log (especially for
analyses with many taxa). However, be careful, as the post-processing
steps of some models (such as the Bayesian skyline plot) require the
trace and tree logging frequencies to be identical!
\begin{framed}
Expand the \textbf{treelog.t:tree} options.
\begin{itemize}
\item
Set the \textbf{File Name} to
\passthrough{\lstinline!primate-mtDNA.trees!}.
\item
Leave the \textbf{Log Every} parameter at the default value of 1'000.
\end{itemize}
\end{framed}
The final setup should look as in Figure \ref{fig:logs}.
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/logs.png}
\caption{Logging options.}
\label{fig:logs}
\end{figure}
\hypertarget{generating-the-xml-file}{%
\subsubsection{Generating the XML file}\label{generating-the-xml-file}}
We are now ready to create the BEAST2 XML file. This is the final
configuration file BEAST2 can use to execute the analysis.
\begin{framed}
Save the XML file under the name
\passthrough{\lstinline!primates-mtDNA.xml!} using \textbf{File
\textgreater{} Save}.
\end{framed}
\clearpage
\hypertarget{running-the-analysis}{%
\subsection{Running the analysis}\label{running-the-analysis}}
Now run BEAST2 and provide your newly created XML file as input. You can
also change the \textbf{random number seed} for the run. This number is
the starting point of a pseudo-random number chain BEAST2 will use to
generate the samples. As computers are unable to generate truly random
numbers, we have to resort to generating deterministic sequences of
numbers that only look random, but will be identical when the starting
seed is the same. If your MCMC run converges to the true posterior then
you will be able to draw the same conclusions regardless of which random
seed is provided. However, if you want to exactly reproduce the results
of a run you need to start it with the same random number seed.
\begin{framed}
Run the \textbf{BEAST2} program.
\begin{itemize}
\item
Select \passthrough{\lstinline!primates-mtDNA.xml!} as the
\textbf{Beast XML File}.
\item
Set the \textbf{Random number seed} to \textbf{777} (or pick your
favourite number).
\item
Check the \textbf{Use BEAGLE library if available} checkbox. If you
have previously installed BEAGLE this will make the analysis run
faster.
\end{itemize}
\end{framed}
\begin{figure}
\centering
\includegraphics[width=0.800000\textwidth]{figures/beast.png}
\caption{BEAST2 setup for the analysis.}
\label{fig:beast}
\end{figure}
The BEAST2 window should look similar to Figure \ref{fig:beast}.
\begin{framed}
Run \textbf{BEAST2} by clicking the \passthrough{\lstinline!Run!}
button.
\end{framed}
BEAST2 will run until the specified number of steps in the chain is
reached. While it is running, it will print the screenlog values to a
console and store the tracelog and tree log values to files located in
the same folder as the configuration XML file. The screen output will
look approximately as shown in Figure \ref{fig:beast_out}.
\begin{figure}
\centering
\includegraphics[width=0.800000\textwidth]{figures/beast_out.png}
\caption{BEAST2 screen output for the analysis.}
\label{fig:beast_out}
\end{figure}
The window will remain open when BEAST2 finished running the analysis.
When you try to close it, you may see BEAST2 asking the question: ``Do
you wish to save?''. Note that your log and trees files are always
saved, no matter what answer you choose for this question. Thus, the
question is only restricted to saving the BEAST2 screen output (which
contains some information about the hardware configuration, initial
values, operator acceptance rates and running time that are not stored
in the other output files).
\begin{framed}
\textbf{Topic for discussion:} While the analysis is running see if you
can identify which parts of the setup in BEAUti are concerned with the
data, the model and the MCMC algorithm.
Open the XML file in your favourite text editor. Can you recognize any
of the values you set in BEAUti? Can you identify the data, model
specification and MCMC settings in the XML file?
Can you find the likelihood, priors and hyperpriors in the XML file?
\end{framed}
\clearpage
\hypertarget{analysing-the-results}{%
\subsection{Analysing the results}\label{analysing-the-results}}
Once BEAST2 has finished running, open Tracer to get an overview of
BEAST2 output. When the main window has opened, choose
\passthrough{\lstinline!File > Import Trace File...!} and select the
file called \passthrough{\lstinline!primate-mtDNA.log!} that BEAST2 has
created, or simply drag the file from the file manager window into
Tracer.
\begin{framed}
Open \textbf{Tracer}. Drag and drop the
\passthrough{\lstinline!primate-mtDNA.log!} file that BEAST2 created
into the open Tracer window.
Alternatively, use \textbf{File \textgreater{} Import Trace
File\ldots{}} (or press the \textbf{+} button below the \textbf{Trace
Files} panel) then locate and click on
\passthrough{\lstinline!primate-mtDNA.log!}.
\end{framed}
The Tracer window should look as shown in Figure \ref{fig:tracer_bad}.
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/tracer_bad.png}
\caption{Tracer showing a summary of the BEAST2 run of primate data with MCMC chain length of 1'000'000.}
\label{fig:tracer_bad}
\end{figure}
Tracer provides a few useful summary statistics on the results of the
analysis. On the left side in the top window it provides a list of log
files loaded into the program at the moment. The window below shows the
list of statistics logged in each file. For each statistic it gives a
list of summary values such as the mean, standard error, median, and
others it can compute from the data. The summary values are displayed in
the top right window and a histogrom showing the distribution of the
statistic is in the bottom right window.
The log file contains traces for the posterior (this is the natural
logarithm of the product of the tree likelihood and the prior density),
prior, the likelihood, tree likelihoods and other continuous parameters.
Selecting a trace on the left brings up the summary statistics for this
trace on the right hand side. When first opened, the \textbf{posterior}
trace is selected and various statistics of this trace are shown under
the \textbf{Estimates} tab.
For each loaded log file we can specify a \textbf{Burn-In}, which is
shown in the file list table (top left) in Tracer. The burn-in is
intended to give the Markov Chain time to reach its equilibrium
distribution, particularly if it has started from a bad starting point.
A bad starting point may lead to over-sampling regions of the posterior
that actually have very low probability, before the chain settles into
the equilibrium distribution. Burn-in allows us to simply discard the
first \emph{N} samples of a chain and not use them to compute the
summary statistics. Determining the number of samples to discard is not
a trivial problem and depends on the size of the dataset, the complexity
of the model and the length of the chain. A good rule of thumb is to
always throw out at least the first 10\% of the whole chain length as
the burn-in (however, in some cases it may be necessary to discard as
much as 50\% of the MCMC chain).
Select the \textbf{TreeHeight} statistic in the left hand list to look
at the tree height estimated jointly for all partitions in the
alignment. Tracer plots the (marginal posterior) histogram for the
selected statistic and also give you summary statistics such as the mean
and median. The 95\% HPD stands for \emph{highest posterior density
interval} and represents the most compact interval on the selected
statistic that contains 95\% of the posterior density. It can be loosely
thought of as a Bayesian analogue to a confidence interval. The
\textbf{TreeHeight} statistic gives the marginal posterior distribution
of the age of the root of the entire tree (that is, the tMRCA).
\begin{framed}
Select \textbf{TreeHeight} in the bottom left hand list in Tracer and
view the different summary statistics on the right.
\end{framed}
You can also compare estimates of different parameters in Tracer. Once a
trace file is loaded into the program you can, for example, compare
estimates of the different mutation rates corresponding to the different
partitions in the alignment.
\begin{framed}
Select all four mutation rates by clicking the first mutation rate
(\textbf{mutationRate.noncoding}), then holding \textbf{Shift} and
clicking the last mutation rate (\textbf{mutationRate.3rdpos}).
Select the \textbf{Marginal Density} tab on the right to view the four
distributions together.
Select different options in the \textbf{Display} drop-down menu to
display the posterior distributions in different ways.
\end{framed}
You will be able to see all four distributions in one plot, similar to
what is shown in Figure \ref{fig:tracer_comparison}.
\begin{figure}
\centering
\includegraphics[max width=\textwidth, max height=0.9\textheight]{figures/Tracer_comparison_KDE.png}