
<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>admixr - tutorial • admixr</title>
<!-- jquery --><script src="https://code.jquery.com/jquery-3.1.0.min.js" integrity="sha384-nrOSfDHtoPMzJHjVTdCopGqIqeYETSXhZDFyniQ8ZHcVy08QesyHcnOUpMpqnmWq" crossorigin="anonymous"></script><!-- Bootstrap --><link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script><!-- Font Awesome icons --><link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" integrity="sha384-T8Gy5hrqNKT+hzMclPo118YTQO6cYprQmhrYwIiQ/3axmI1hQomh7Ud2hPOy8SP1" crossorigin="anonymous">
<!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/1.7.1/clipboard.min.js" integrity="sha384-cV+rhyOuRHc9Ub/91rihWcGmMmCXDeksTtCihMupQHSsi8GIIRDG0ThDc3HGQFJ3" crossorigin="anonymous"></script><!-- sticky kit --><script src="https://cdnjs.cloudflare.com/ajax/libs/sticky-kit/1.1.3/sticky-kit.min.js" integrity="sha256-c4Rlo1ZozqTPE2RLuvbusY3+SU1pQaJC0TjuhygMipw=" crossorigin="anonymous"></script><!-- pkgdown --><link href="../pkgdown.css" rel="stylesheet">
<script src="../pkgdown.js"></script><meta property="og:title" content="admixr - tutorial">
<meta property="og:description" content="">
<meta name="twitter:card" content="summary">
<!-- mathjax --><script src="https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
    <div class="container template-article">
      <header><div class="navbar navbar-default navbar-fixed-top" role="navigation">
  <div class="container">
    <div class="navbar-header">
      <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
        <span class="icon-bar"></span>
        <span class="icon-bar"></span>
        <span class="icon-bar"></span>
      </button>
      <span class="navbar-brand">
        <a class="navbar-link" href="../index.html">admixr</a>
        <span class="label label-default" data-toggle="tooltip" data-placement="bottom" title="Released package">0.7.1</span>
      </span>
    </div>

    <div id="navbar" class="navbar-collapse collapse">
      <ul class="nav navbar-nav">
<li>
  <a href="../index.html">
    <span class="fa fa-home fa-lg"></span>
     
  </a>
</li>
<li>
  <a href="../reference/index.html">Reference</a>
</li>
<li class="dropdown">
  <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">
    Articles
     
    <span class="caret"></span>
  </a>
  <ul class="dropdown-menu" role="menu">
<li>
      <a href="../articles/tutorial.html">admixr - tutorial</a>
    </li>
  </ul>
</li>
<li>
  <a href="../news/index.html">Changelog</a>
</li>
      </ul>
<ul class="nav navbar-nav navbar-right">
<li>
  <a href="https://github.com/bodkan/admixr">
    <span class="fa fa-github fa-lg"></span>
     
  </a>
</li>
      </ul>
</div>
<!--/.nav-collapse -->
  </div>
<!--/.container -->
</div>
<!--/.navbar -->

      
      </header><div class="row">
  <div class="col-md-9 contents">
    <div class="page-header toc-ignore">
      <h1>admixr - tutorial</h1>
                        <h4 class="author">Martin Petr</h4>
            
            <h4 class="date">2018-12-02</h4>
      
      <small class="dont-index">Source: <a href="https://github.com/bodkan/admixr/blob/master/vignettes/tutorial.Rmd"><code>vignettes/tutorial.Rmd</code></a></small>
      <div class="hidden name"><code>tutorial.Rmd</code></div>

    </div>

    
<p>This vignette describes how to perform various population admixture analyses with the <em>admixr</em> package, using the ADMIXTOOLS software suite for the underlying calculations.</p>
<div id="introduction" class="section level2">
<h2 class="hasAnchor">
<a href="#introduction" class="anchor"></a>Introduction</h2>
<p><a href="https://github.com/DReichLab/AdmixTools/">ADMIXTOOLS</a> is a widely used software package for calculating admixture statistics and testing population admixture hypotheses. However, although powerful and comprehensive, it is not exactly known for being user-friendly.</p>
<p>A typical ADMIXTOOLS workflow often involves a combination of <code>sed</code>/<code>awk</code>/shell scripting and manual editing to create different configuration files. These are then passed as command-line arguments to one of the ADMIXTOOLS commands, and control how a particular analysis is run. The results are then redirected to another file, which has to be parsed by the user to extract values of interest, often using command-line utilities again or (worse) by manual copy-pasting. Finally, the processed results are analysed in R, Excel or another program.</p>
<p>This workflow is very cumbersome, especially if one wants to explore many hypotheses involving different combinations of populations. Most importantly, however, it makes it difficult to follow the rules of best practice for reproducible science, as it is nearly impossible to construct fully automated reproducible “pipelines”.</p>
<p>This R package makes it possible to perform all stages of ADMIXTOOLS analyses entirely from R, completely removing the need for “low level” configuration of individual ADMIXTOOLS programs and allowing users to focus on the analysis itself.</p>
</div>
<div id="installation" class="section level2">
<h2 class="hasAnchor">
<a href="#installation" class="anchor"></a>Installation</h2>
<p><strong>Note that in order to use the <em>admixr</em> package, you need a working installation of ADMIXTOOLS!</strong> You can find installation instructions <a href="https://github.com/DReichLab/AdmixTools/blob/master/README.INSTALL">here</a>.</p>
<p><strong>Furthermore, you need to make sure that R can find ADMIXTOOLS binaries on the <code>$PATH</code>.</strong> If this is not the case, running <code>library(admixr)</code> will show a warning message with instructions on how to fix this.</p>
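<p>As a quick sanity check, you can ask R directly whether it sees the ADMIXTOOLS programs on the <code>$PATH</code>. The following is a minimal sketch using base R's <code>Sys.which()</code>. The list of program names is an assumption made for illustration, and your installation may provide a different set:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># names of some ADMIXTOOLS programs (an assumption; adjust to your installation)
binaries &lt;- c("qpDstat", "qpF4ratio", "qp3Pop", "qpAdm", "qpWave")

# Sys.which() returns the full path of each program, or "" if it is not found
found &lt;- nzchar(Sys.which(binaries))
names(found) &lt;- binaries
found</code></pre></div>
<p>Any <code>FALSE</code> entry suggests that the directory containing that program still needs to be added to the <code>$PATH</code> visible to R.</p>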
<p>To install <em>admixr</em> from GitHub you need to install the package <code>devtools</code> first. To do this, you can simply run (in R):</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" data-line-number="1"><span class="kw">install.packages</span>(<span class="st">"devtools"</span>)</a>
<a class="sourceLine" id="cb1-2" data-line-number="2">devtools<span class="op">::</span><span class="kw"><a href="http://www.rdocumentation.org/packages/devtools/topics/install_github">install_github</a></span>(<span class="st">"bodkan/admixr"</span>)</a></code></pre></div>
<p>Furthermore, if you want to follow the examples in this vignette, you will need the <a href="https://www.tidyverse.org">tidyverse</a> collection of packages for convenient manipulation and plotting of data, which you can install with:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" data-line-number="1"><span class="kw">install.packages</span>(<span class="st">"tidyverse"</span>)</a></code></pre></div>
<p>When everything is ready, you can run the following code to make functions in both packages available:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" data-line-number="1"><span class="kw">library</span>(admixr)</a>
<a class="sourceLine" id="cb3-2" data-line-number="2"><span class="kw">library</span>(tidyverse)</a>
<a class="sourceLine" id="cb3-3" data-line-number="3">── Attaching packages ────────────────────────────────────────── tidyverse <span class="dv">1</span>.<span class="fl">2.1</span> ──</a>
<a class="sourceLine" id="cb3-4" data-line-number="4">✔ ggplot2 <span class="dv">3</span>.<span class="fl">0.0</span>     ✔ purrr   <span class="dv">0</span>.<span class="fl">2.5</span></a>
<a class="sourceLine" id="cb3-5" data-line-number="5">✔ tibble  <span class="dv">1</span>.<span class="fl">4.2</span>     ✔ dplyr   <span class="dv">0</span>.<span class="fl">7.8</span></a>
<a class="sourceLine" id="cb3-6" data-line-number="6">✔ tidyr   <span class="dv">0</span>.<span class="fl">8.2</span>     ✔ stringr <span class="dv">1</span>.<span class="fl">3.1</span></a>
<a class="sourceLine" id="cb3-7" data-line-number="7">✔ readr   <span class="dv">1</span>.<span class="fl">2.1</span>     ✔ forcats <span class="dv">0</span>.<span class="fl">3.0</span></a>
<a class="sourceLine" id="cb3-8" data-line-number="8">── Conflicts ───────────────────────────────────────────── <span class="kw">tidyverse_conflicts</span>() ──</a>
<a class="sourceLine" id="cb3-9" data-line-number="9">✖ dplyr<span class="op">::</span><span class="kw"><a href="http://dplyr.tidyverse.org/reference/filter.html">filter</a></span>() masks stats<span class="op">::</span><span class="kw"><a href="http://www.rdocumentation.org/packages/stats/topics/filter">filter</a></span>()</a>
<a class="sourceLine" id="cb3-10" data-line-number="10">✖ dplyr<span class="op">::</span><span class="kw"><a href="http://dplyr.tidyverse.org/reference/lead-lag.html">lag</a></span>()    masks stats<span class="op">::</span><span class="kw"><a href="http://www.rdocumentation.org/packages/stats/topics/lag">lag</a></span>()</a></code></pre></div>
</div>
<div id="a-note-about-eigenstrat-format" class="section level2">
<h2 class="hasAnchor">
<a href="#a-note-about-eigenstrat-format" class="anchor"></a>A note about EIGENSTRAT format</h2>
<p>ADMIXTOOLS software uses a peculiar set of genetic file formats, which may seem strange if you are used to working with <a href="http://samtools.github.io/hts-specs/VCFv4.3.pdf">VCF files</a>. However, the basic idea remains the same - we want to store and access SNP data (REF/ALT alleles) of a set of individuals at a defined set of genomic positions.</p>
<p>EIGENSTRAT datasets always contain three kinds of files:</p>
<ul>
<li>
<code>ind</code> file - specifies a unique name, sex (optional - can be simply “U” for “undefined”) and label (such as population assignment) of each sample;</li>
<li>
<code>snp</code> file - specifies the positions of SNPs, REF/ALT alleles etc.;</li>
<li>
<code>geno</code> file - contains SNP data (one row per site, one character per sample) in a dense string-based format:
<ul>
<li>0: individual is homozygous ALT</li>
<li>1: individual is a heterozygote</li>
<li>2: individual is homozygous REF</li>
<li>9: missing data</li>
</ul>
</li>
</ul>
<p>Therefore, a VCF file is essentially a combination of all three files in a single package. Luckily for us, all three EIGENSTRAT files usually share a common path and prefix and should be placed in a single directory.</p>
<p>Let’s first download a small testing SNP dataset using a built-in <em>admixr</em> function <code><a href="../reference/download_data.html">download_data()</a></code>. This function downloads the data into a temporary directory (you can specify the destination using its <code>dirname</code> argument). In addition to this, the function returns a shared prefix of the whole dataset.</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb4-1" data-line-number="1">(prefix &lt;-<span class="st"> </span><span class="kw"><a href="../reference/download_data.html">download_data</a></span>())</a>
<a class="sourceLine" id="cb4-2" data-line-number="2">[<span class="dv">1</span>] <span class="st">"/var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/snps/snps"</span></a></code></pre></div>
<p>We can verify that there are indeed three files with this prefix:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" data-line-number="1"><span class="kw">list.files</span>(<span class="dt">path =</span> <span class="kw">dirname</span>(prefix), <span class="dt">pattern =</span> <span class="kw">basename</span>(prefix), <span class="dt">full.names =</span> <span class="ot">TRUE</span>)</a>
<a class="sourceLine" id="cb5-2" data-line-number="2">[<span class="dv">1</span>] <span class="st">"/var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/snps/snps.geno"</span></a>
<a class="sourceLine" id="cb5-3" data-line-number="3">[<span class="dv">2</span>] <span class="st">"/var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/snps/snps.ind"</span> </a>
<a class="sourceLine" id="cb5-4" data-line-number="4">[<span class="dv">3</span>] <span class="st">"/var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/snps/snps.snp"</span> </a></code></pre></div>
<p>Let’s look at their contents:</p>
<div id="ind-file" class="section level4">
<h4 class="hasAnchor">
<a href="#ind-file" class="anchor"></a><code>ind</code> file</h4>
<pre><code>Chimp        U  Chimp
Mbuti        U  Mbuti
Yoruba       U  Yoruba
Khomani_San  U  Khomani_San
Han          U  Han
Dinka        U  Dinka
Sardinian    U  Sardinian
Papuan       U  Papuan
French       U  French
Vindija      U  Vindija
Altai        U  Altai
Denisova     U  Denisova</code></pre>
<p>The first column (sample name) and the third column (population label) are generally not the same (sample names often have numerical suffixes to make them unique, etc.), but we kept them the same for simplicity. Importantly, when specifying population/sample arguments in <em>admixr</em> functions, the information in the third column is what is used. For example, if you have individuals such as “French1”, “French2”, “French3” in the first column of an <code>ind</code> file, all three sharing a “French” population label in the third column, specifying “French” in an <em>admixr</em> function will combine all three samples in a single population and work with it as a whole, instead of working with each individual separately.</p>
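<p>To make the grouping concrete, here is a small sketch in base R. The <code>ind</code> table below is hypothetical (it is not part of the example dataset) and the column names are chosen only for illustration. It shows which individuals would be pooled under each population label:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># a hypothetical ind file with several individuals per population
ind &lt;- read.table(
  text = "French1 U French
French2 U French
French3 U French
Han1 U Han",
  col.names = c("id", "sex", "label"),
  colClasses = "character"
)

# individuals grouped by the population label in the third column
groups &lt;- split(ind$id, ind$label)
groups</code></pre></div>
<p>Here, specifying “French” would refer to all three French individuals at once, while “Han” would refer to a single individual.</p>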
</div>
<div id="snp-file-first-3-lines" class="section level4">
<h4 class="hasAnchor">
<a href="#snp-file-first-3-lines" class="anchor"></a><code>snp</code> file (first 3 lines)</h4>
<pre><code>1_832756    1   0.008328    832756  T   G
1_838931    1   0.008389    838931  A   C
1_843249    1   0.008432    843249  A   T</code></pre>
<p>The columns of this file are, in order:</p>
<ol style="list-style-type: decimal">
<li>SNP string ID</li>
<li>chromosome</li>
<li>genetic distance</li>
<li>position along a chromosome</li>
<li>reference allele</li>
<li>alternative allele</li>
</ol>
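<p>If you want to inspect a <code>snp</code> file from R, a plain <code>read.table()</code> call is usually enough. The sketch below parses the three lines shown above. The column names are our own choice for illustration and are not defined by the file format:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># parse the three example lines; column names are illustrative only
snp &lt;- read.table(
  text = "1_832756 1 0.008328 832756 T G
1_838931 1 0.008389 838931 A C
1_843249 1 0.008432 843249 A T",
  col.names = c("id", "chrom", "gen", "pos", "ref", "alt"),
  # explicit classes prevent an allele column of "T" from being read as TRUE
  colClasses = c("character", "character", "numeric",
                 "integer", "character", "character")
)
snp</code></pre></div>
<p>To read the downloaded file instead, the <code>text</code> argument could be replaced by the file path, for example <code>paste0(prefix, ".snp")</code>.</p>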
</div>
<div id="geno-file-first-3-lines" class="section level4">
<h4 class="hasAnchor">
<a href="#geno-file-first-3-lines" class="anchor"></a><code>geno</code> file (first 3 lines)</h4>
<pre><code>902021012000
922221211222
922222122222</code></pre>
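<p>As an illustration of the encoding, the following sketch decodes the first line of the <code>geno</code> file in base R. It assumes that the characters on each line follow the order of samples in the <code>ind</code> file shown earlier:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># sample order, assumed to follow the rows of the ind file
samples &lt;- c("Chimp", "Mbuti", "Yoruba", "Khomani_San", "Han", "Dinka",
             "Sardinian", "Papuan", "French", "Vindija", "Altai", "Denisova")

# first line of the geno file: one character per sample
geno_line &lt;- "902021012000"

# split into single characters and convert to integer genotype codes
codes &lt;- as.integer(strsplit(geno_line, "")[[1]])

# 9 encodes missing data
codes[codes == 9] &lt;- NA

# 0 = homozygous ALT, 1 = heterozygote, 2 = homozygous REF
genotypes &lt;- setNames(codes, samples)
genotypes</code></pre></div>
<p>Under the stated assumption, the Chimp genotype is missing at this site, Yoruba is homozygous REF and Dinka is a heterozygote.</p>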
</div>
</div>
<div id="philosophy-of-admixr" class="section level2">
<h2 class="hasAnchor">
<a href="#philosophy-of-admixr" class="anchor"></a>Philosophy of <em>admixr</em>
</h2>
<p>The goal of <em>admixr</em> is to make ADMIXTOOLS analyses as trivial to run as possible, without having to worry about par/pop/left/right configuration files (as they are known in the jargon of ADMIXTOOLS) and other low-level details.</p>
<p>The only interface between you and ADMIXTOOLS is the following set of R functions:</p>
<ul>
<li><code><a href="../reference/f4ratio.html">d()</a></code></li>
<li><code><a href="../reference/f4ratio.html">f4()</a></code></li>
<li><code><a href="../reference/f4ratio.html">f4ratio()</a></code></li>
<li><code><a href="../reference/f4ratio.html">f3()</a></code></li>
<li><code><a href="../reference/qpAdm.html">qpAdm()</a></code></li>
<li><code><a href="../reference/qpWave.html">qpWave()</a></code></li>
</ul>
<p>Anything that would normally require <a href="https://gaworkshop.readthedocs.io/en/latest/contents/06_f3/f3.html">dozens of lines of shell scripts</a> can often be accomplished by running a single line of R code.</p>
</div>
<div id="internal-representation-of-eigenstrat-data" class="section level2">
<h2 class="hasAnchor">
<a href="#internal-representation-of-eigenstrat-data" class="anchor"></a>Internal representation of EIGENSTRAT data</h2>
<p>As we saw above, each EIGENSTRAT dataset has three components. Internally, <em>admixr</em> represents this data as a small S3 object created by the <code>eigenstrat</code> constructor function. This function accepts the path and prefix of a trio of EIGENSTRAT snp/ind/geno files and returns an R object of the <code>EIGENSTRAT</code> class:</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" data-line-number="1">snps &lt;-<span class="st"> </span><span class="kw"><a href="../reference/eigenstrat.html">eigenstrat</a></span>(prefix)</a></code></pre></div>
<div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb10-1" data-line-number="1">snps</a>
<a class="sourceLine" id="cb10-2" data-line-number="2"><span class="co">#&gt; EIGENSTRAT object</span></a>
<a class="sourceLine" id="cb10-3" data-line-number="3"><span class="co">#&gt; =================</span></a>
<a class="sourceLine" id="cb10-4" data-line-number="4"><span class="co">#&gt; components:</span></a>
<a class="sourceLine" id="cb10-5" data-line-number="5"><span class="co">#&gt;   ind file: /var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/snps/snps.ind</span></a>
<a class="sourceLine" id="cb10-6" data-line-number="6"><span class="co">#&gt;   snp file: /var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/snps/snps.snp</span></a>
<a class="sourceLine" id="cb10-7" data-line-number="7"><span class="co">#&gt;   geno file: /var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/snps/snps.geno</span></a></code></pre></div>
<p>This object simply encapsulates the paths to all three EIGENSTRAT components and makes it easy to pass the data to different <em>admixr</em> functions.</p>
<p>The following couple of sections describe how to use the <em>admixr</em> package on a set of example analyses.</p>
</div>
<div id="d-statistic" class="section level2">
<h2 class="hasAnchor">
<a href="#d-statistic" class="anchor"></a><span class="math inline">\(D\)</span> statistic</h2>
<p>Let’s say we are interested in the following question: <em>"Which populations today show evidence of Neanderthal admixture?"</em></p>
<p>One way of looking at this is using the following D statistic: <span class="math display">\[D(\textrm{present-day human W}, \textrm{African}, \textrm{Neanderthal}, \textrm{Chimp}).\]</span></p>
<p><span class="math inline">\(D\)</span> statistics are based on comparing the proportions of BABA and ABBA site patterns observed in data:</p>
<p><span class="math display">\[D = \frac{\textrm{# BABA sites - # ABBA sites}}{\textrm{# BABA sites + # ABBA sites}}.\]</span></p>
<p>Significant departure of <span class="math inline">\(D\)</span> from zero indicates an excess of allele sharing between the first and the third population (positive <span class="math inline">\(D\)</span>), or an excess of allele sharing between the second and the third population (negative <span class="math inline">\(D\)</span>). If we get <span class="math inline">\(D\)</span> that is not significantly different from 0, this suggests that the first and second populations form a clade, and don’t differ in their genetic affinity to the third population (this is the null hypothesis that the data is compared against).</p>
<p>Therefore, our <span class="math inline">\(D\)</span> statistic above simply tests whether some modern humans today admixed with Neanderthals, which would increase their genetic affinity to this archaic group compared to Africans (whose ancestors never met Neanderthals).</p>
<p>Let’s save some population names first to make the code below more readable:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" data-line-number="1">pops &lt;-<span class="st"> </span><span class="kw">c</span>(<span class="st">"French"</span>, <span class="st">"Sardinian"</span>, <span class="st">"Han"</span>, <span class="st">"Papuan"</span>, <span class="st">"Khomani_San"</span>, <span class="st">"Mbuti"</span>, <span class="st">"Dinka"</span>)</a></code></pre></div>
<p>Using the <em>admixr</em> package we can then calculate the <span class="math inline">\(D\)</span> statistic above simply by running:</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb12-1" data-line-number="1">result &lt;-<span class="st"> </span><span class="kw"><a href="../reference/f4ratio.html">d</a></span>(<span class="dt">W =</span> pops, <span class="dt">X =</span> <span class="st">"Yoruba"</span>, <span class="dt">Y =</span> <span class="st">"Vindija"</span>, <span class="dt">Z =</span> <span class="st">"Chimp"</span>, <span class="dt">data =</span> snps)</a></code></pre></div>
<p>The result is the following <code>data.frame</code>:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" data-line-number="1"><span class="kw">head</span>(result)</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="left">W</th>
<th align="left">X</th>
<th align="left">Y</th>
<th align="left">Z</th>
<th align="right">D</th>
<th align="right">stderr</th>
<th align="right">Zscore</th>
<th align="right">BABA</th>
<th align="right">ABBA</th>
<th align="right">nsnps</th>
</tr></thead>
<tbody>
<tr class="odd">
<td align="left">French</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">0.0313</td>
<td align="right">0.006933</td>
<td align="right">4.510</td>
<td align="right">15802</td>
<td align="right">14844</td>
<td align="right">487753</td>
</tr>
<tr class="even">
<td align="left">Sardinian</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">0.0287</td>
<td align="right">0.006792</td>
<td align="right">4.222</td>
<td align="right">15729</td>
<td align="right">14852</td>
<td align="right">487646</td>
</tr>
<tr class="odd">
<td align="left">Han</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">0.0278</td>
<td align="right">0.006609</td>
<td align="right">4.199</td>
<td align="right">15780</td>
<td align="right">14928</td>
<td align="right">487925</td>
</tr>
<tr class="even">
<td align="left">Papuan</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">0.0457</td>
<td align="right">0.006571</td>
<td align="right">6.953</td>
<td align="right">16131</td>
<td align="right">14721</td>
<td align="right">487694</td>
</tr>
<tr class="odd">
<td align="left">Khomani_San</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">0.0066</td>
<td align="right">0.006292</td>
<td align="right">1.051</td>
<td align="right">16168</td>
<td align="right">15955</td>
<td align="right">487564</td>
</tr>
<tr class="even">
<td align="left">Mbuti</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">-0.0005</td>
<td align="right">0.006345</td>
<td align="right">-0.074</td>
<td align="right">15751</td>
<td align="right">15766</td>
<td align="right">487642</td>
</tr>
</tbody>
</table>
<p>We can see that in addition to the input information, this <code>data.frame</code> contains additional columns:</p>
<ul>
<li>
<code>D</code> - <span class="math inline">\(D\)</span> statistic value</li>
<li>
<code>stderr</code> - standard error of the <span class="math inline">\(D\)</span> statistic calculated using the block jackknife</li>
<li>
<code>Zscore</code> - <span class="math inline">\(Z\)</span>-score (the number of standard errors by which <span class="math inline">\(D\)</span> differs from 0, i.e. how strongly we reject the null hypothesis of no admixture)</li>
<li>
<code>BABA</code>, <code>ABBA</code> - counts of observed site patterns</li>
<li>
<code>nsnps</code> - number of SNPs used for a given calculation</li>
</ul>
<p>(Output tables from other <em>admixr</em> functions follow a very similar format.)</p>
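<p>To see how these columns relate to each other, the following sketch recomputes <span class="math inline">\(D\)</span> and the <span class="math inline">\(Z\)</span>-score for the French row by hand, using the site pattern counts and the standard error from the table. Because the tabulated values are rounded, the results should only be expected to land close to the reported 0.0313 and 4.510, not to match them exactly:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># values copied from the French row of the table above
baba &lt;- 15802
abba &lt;- 14844
stderr_d &lt;- 0.006933

# D is the normalized difference between the two site pattern counts
d &lt;- (baba - abba) / (baba + abba)

# the Z-score is the number of standard errors D is away from 0
z &lt;- d / stderr_d

c(D = d, Zscore = z)</code></pre></div>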
<p>While we could certainly make some inferences by looking at the <span class="math inline">\(Z\)</span>-scores, tables in general are not the best representation of this kind of data, especially as the number of samples increases. This is how we can use the <a href="https://ggplot2.tidyverse.org"><code>ggplot2</code></a> package to plot the results:</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb14-1" data-line-number="1"><span class="kw">ggplot</span>(result, <span class="kw">aes</span>(<span class="kw">fct_reorder</span>(W, D), D, <span class="dt">color =</span> <span class="kw">abs</span>(Zscore) <span class="op">&gt;</span><span class="st"> </span><span class="dv">2</span>)) <span class="op">+</span></a>
<a class="sourceLine" id="cb14-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
<a class="sourceLine" id="cb14-3" data-line-number="3"><span class="st">  </span><span class="kw">geom_hline</span>(<span class="dt">yintercept =</span> <span class="dv">0</span>, <span class="dt">linetype =</span> <span class="dv">2</span>) <span class="op">+</span></a>
<a class="sourceLine" id="cb14-4" data-line-number="4"><span class="st">  </span><span class="kw">geom_errorbar</span>(<span class="kw">aes</span>(<span class="dt">ymin =</span> D <span class="op">-</span><span class="st"> </span><span class="dv">2</span> <span class="op">*</span><span class="st"> </span>stderr, <span class="dt">ymax =</span> D <span class="op">+</span><span class="st"> </span><span class="dv">2</span> <span class="op">*</span><span class="st"> </span>stderr))</a></code></pre></div>
<p><img src="tutorial_files/figure-html/d_plot-1.png" width="672"></p>
<p>(If you want to know more about data analysis using R, including plotting with ggplot2, I highly recommend <a href="http://r4ds.had.co.nz">this</a> free book.)</p>
<p>We can see that all three Africans have <span class="math inline">\(D\)</span> values not significantly different from 0, meaning that the data is consistent with the null hypothesis of no Neanderthal ancestry in Africans. On the other hand, the test rejects the null hypothesis for all non-Africans today, suggesting that Neanderthals admixed with the ancestors of present-day non-Africans. In fact, this is a similar test to the one that was used as evidence supporting the Neanderthal admixture hypothesis in the first place!</p>
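<p>The same conclusion can be read off the table programmatically. The sketch below re-enters the <span class="math inline">\(Z\)</span>-scores of the six rows printed above into a small stand-alone <code>data.frame</code> (named <code>res</code> here only to avoid overwriting <code>result</code>) and applies the same <code>abs(Zscore) &gt; 2</code> cutoff that was used for coloring the plot:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Z-scores copied from the six rows of the table above
res &lt;- data.frame(
  W = c("French", "Sardinian", "Han", "Papuan", "Khomani_San", "Mbuti"),
  Zscore = c(4.510, 4.222, 4.199, 6.953, 1.051, -0.074),
  stringsAsFactors = FALSE
)

# flag populations whose D is more than 2 standard errors away from 0
res$significant &lt;- abs(res$Zscore) &gt; 2

# populations for which the null hypothesis is rejected at this cutoff
res$W[res$significant]</code></pre></div>
<p>With the full <code>result</code> table, the same filter could be applied directly to its <code>Zscore</code> column.</p>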
</div>
<div id="f_4-statistic" class="section level2">
<h2 class="hasAnchor">
<a href="#f_4-statistic" class="anchor"></a><span class="math inline">\(f_4\)</span> statistic</h2>
<p>An alternative way of addressing the previous question is to use the <span class="math inline">\(f_4\)</span> statistic, which is very similar to the <span class="math inline">\(D\)</span> statistic and can be calculated as:</p>
<p><span class="math display">\[ f_4 = \frac{\textrm{# BABA sites - # ABBA sites}}{\textrm{# sites}}\]</span></p>
<p>Again, significant departure of <span class="math inline">\(f_4\)</span> from 0 is informative about gene flow, in a way analogous to the <span class="math inline">\(D\)</span> statistic.</p>
<p>To repeat the previous analysis using the <span class="math inline">\(f_4\)</span> statistic, we can run:</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb15-1" data-line-number="1">result &lt;-<span class="st"> </span><span class="kw"><a href="../reference/f4ratio.html">f4</a></span>(<span class="dt">W =</span> pops, <span class="dt">X =</span> <span class="st">"Yoruba"</span>, <span class="dt">Y =</span> <span class="st">"Vindija"</span>, <span class="dt">Z =</span> <span class="st">"Chimp"</span>, <span class="dt">data =</span> snps)</a></code></pre></div>
<div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb16-1" data-line-number="1"><span class="kw">head</span>(result)</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="left">W</th>
<th align="left">X</th>
<th align="left">Y</th>
<th align="left">Z</th>
<th align="right">f4</th>
<th align="right">stderr</th>
<th align="right">Zscore</th>
<th align="right">BABA</th>
<th align="right">ABBA</th>
<th align="right">nsnps</th>
</tr></thead>
<tbody>
<tr class="odd">
<td align="left">French</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">0.001965</td>
<td align="right">0.000437</td>
<td align="right">4.501</td>
<td align="right">15802</td>
<td align="right">14844</td>
<td align="right">487753</td>
</tr>
<tr class="even">
<td align="left">Sardinian</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">0.001798</td>
<td align="right">0.000427</td>
<td align="right">4.209</td>
<td align="right">15729</td>
<td align="right">14852</td>
<td align="right">487646</td>
</tr>
<tr class="odd">
<td align="left">Han</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">0.001746</td>
<td align="right">0.000418</td>
<td align="right">4.178</td>
<td align="right">15780</td>
<td align="right">14928</td>
<td align="right">487925</td>
</tr>
<tr class="even">
<td align="left">Papuan</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">0.002890</td>
<td align="right">0.000417</td>
<td align="right">6.924</td>
<td align="right">16131</td>
<td align="right">14721</td>
<td align="right">487694</td>
</tr>
<tr class="odd">
<td align="left">Khomani_San</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">0.000436</td>
<td align="right">0.000415</td>
<td align="right">1.051</td>
<td align="right">16168</td>
<td align="right">15955</td>
<td align="right">487564</td>
</tr>
<tr class="even">
<td align="left">Mbuti</td>
<td align="left">Yoruba</td>
<td align="left">Vindija</td>
<td align="left">Chimp</td>
<td align="right">-0.000030</td>
<td align="right">0.000410</td>
<td align="right">-0.074</td>
<td align="right">15751</td>
<td align="right">15766</td>
<td align="right">487642</td>
</tr>
</tbody>
</table>
<p>Comparing this to the <span class="math inline">\(D\)</span> statistic result above, we can see that both statistics lead to the same conclusions.</p>
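<p>The close relationship between the two statistics can be checked by hand. The following sketch uses the counts from the French row of the table: both statistics share the same numerator and differ only in the denominator, so they always have the same sign. Because the tabulated counts are rounded, the recomputed <span class="math inline">\(f_4\)</span> should only be expected to approximate the reported 0.001965:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># values copied from the French row of the table above
baba &lt;- 15802
abba &lt;- 14844
nsnps &lt;- 487753

# f4 divides the difference in counts by the total number of sites
f4_value &lt;- (baba - abba) / nsnps

# D divides the same difference by the number of informative sites only
d_value &lt;- (baba - abba) / (baba + abba)

c(f4 = f4_value, D = d_value)</code></pre></div>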
<p>You might be wondering why we have both <span class="math inline">\(f_4\)</span> and <span class="math inline">\(D\)</span> if they are so similar. The truth is that <span class="math inline">\(f_4\)</span> is, among other things, directly informative about the amount of shared genetic drift (“branch length”) between pairs of populations, which is, in many cases, a very useful theoretical property. Other than that, it’s often a matter of personal preference and so <em>admixr</em> provides separate functions for calculating both.</p>
</div>
<div id="f_4-ratio-statistic" class="section level2">
<h2 class="hasAnchor">
<a href="#f_4-ratio-statistic" class="anchor"></a><span class="math inline">\(f_4\)</span>-ratio statistic</h2>
<p>Now we know that non-Africans today carry <em>some</em> Neanderthal ancestry. But what if we want to know <em>how much</em> Neanderthal ancestry they have? What proportion of their genomes is of Neanderthal origin?</p>
<p>Unfortunately, we don’t have enough space here to explain all the details about the inner workings of the <span class="math inline">\(f_4\)</span>-ratio statistic. However, in general, when we are interested in estimating the <em>proportion</em> of ancestry in a population <span class="math inline">\(X\)</span> coming from some parental lineage <span class="math inline">\(B\)</span>, we can use a ratio of two <span class="math inline">\(f_4\)</span> statistics:</p>
<p><span class="math display">\[f_4\textrm{-ratio} = \frac{f_4(A, O; X, C)}{f_4(A, O; B, C)}.\]</span></p>
<p>Using <em>admixr</em>, we can calculate <span class="math inline">\(f_4\)</span>-ratios using the following code (<code>X</code> being a vector of samples for which we want to estimate Neanderthal ancestry):</p>
<div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb17-1" data-line-number="1">result &lt;-<span class="st"> </span><span class="kw"><a href="../reference/f4ratio.html">f4ratio</a></span>(<span class="dt">X =</span> pops, <span class="dt">A =</span> <span class="st">"Altai"</span>, <span class="dt">B =</span> <span class="st">"Vindija"</span>, <span class="dt">C =</span> <span class="st">"Yoruba"</span>, <span class="dt">O =</span> <span class="st">"Chimp"</span>, <span class="dt">data =</span> snps)</a></code></pre></div>
<p>The ancestry proportion (a number between 0 and 1) is given in the <code>alpha</code> column:</p>
<div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb18-1" data-line-number="1"><span class="kw">head</span>(result)</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="left">A</th>
<th align="left">B</th>
<th align="left">X</th>
<th align="left">C</th>
<th align="left">O</th>
<th align="right">alpha</th>
<th align="right">stderr</th>
<th align="right">Zscore</th>
</tr></thead>
<tbody>
<tr class="odd">
<td align="left">Altai</td>
<td align="left">Vindija</td>
<td align="left">French</td>
<td align="left">Yoruba</td>
<td align="left">Chimp</td>
<td align="right">0.023774</td>
<td align="right">0.006173</td>
<td align="right">3.851</td>
</tr>
<tr class="even">
<td align="left">Altai</td>
<td align="left">Vindija</td>
<td align="left">Sardinian</td>
<td align="left">Yoruba</td>
<td align="left">Chimp</td>
<td align="right">0.024468</td>
<td align="right">0.006079</td>
<td align="right">4.025</td>
</tr>
<tr class="odd">
<td align="left">Altai</td>
<td align="left">Vindija</td>
<td align="left">Han</td>
<td align="left">Yoruba</td>
<td align="left">Chimp</td>
<td align="right">0.022117</td>
<td align="right">0.005901</td>
<td align="right">3.748</td>
</tr>
<tr class="even">
<td align="left">Altai</td>
<td align="left">Vindija</td>
<td align="left">Papuan</td>
<td align="left">Yoruba</td>
<td align="left">Chimp</td>
<td align="right">0.037311</td>
<td align="right">0.005821</td>
<td align="right">6.410</td>
</tr>
<tr class="odd">
<td align="left">Altai</td>
<td align="left">Vindija</td>
<td align="left">Khomani_San</td>
<td align="left">Yoruba</td>
<td align="left">Chimp</td>
<td align="right">0.003909</td>
<td align="right">0.005923</td>
<td align="right">0.660</td>
</tr>
<tr class="even">
<td align="left">Altai</td>
<td align="left">Vindija</td>
<td align="left">Mbuti</td>
<td align="left">Yoruba</td>
<td align="left">Chimp</td>
<td align="right">0.000319</td>
<td align="right">0.005721</td>
<td align="right">0.056</td>
</tr>
</tbody>
</table>
<div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb19-1" data-line-number="1"><span class="kw">ggplot</span>(result, <span class="kw">aes</span>(<span class="kw">fct_reorder</span>(X, alpha), alpha, <span class="dt">color =</span> <span class="kw">abs</span>(Zscore) <span class="op">&gt;</span><span class="st"> </span><span class="dv">2</span>)) <span class="op">+</span></a>
<a class="sourceLine" id="cb19-2" data-line-number="2"><span class="st">  </span><span class="kw">geom_point</span>() <span class="op">+</span></a>
<a class="sourceLine" id="cb19-3" data-line-number="3"><span class="st">  </span><span class="kw">geom_errorbar</span>(<span class="kw">aes</span>(<span class="dt">ymin =</span> alpha <span class="op">-</span><span class="st"> </span><span class="dv">2</span> <span class="op">*</span><span class="st"> </span>stderr, <span class="dt">ymax =</span> alpha <span class="op">+</span><span class="st"> </span><span class="dv">2</span> <span class="op">*</span><span class="st"> </span>stderr)) <span class="op">+</span></a>
<a class="sourceLine" id="cb19-4" data-line-number="4"><span class="st">  </span><span class="kw">geom_hline</span>(<span class="dt">yintercept =</span> <span class="dv">0</span>, <span class="dt">linetype =</span> <span class="dv">2</span>) <span class="op">+</span></a>
<a class="sourceLine" id="cb19-5" data-line-number="5"><span class="st">  </span><span class="kw">labs</span>(<span class="dt">y =</span> <span class="st">"Neandertal ancestry proportion"</span>, <span class="dt">x =</span> <span class="st">"present-day individual"</span>)</a></code></pre></div>
<p><img src="tutorial_files/figure-html/f4ratio_plot-1.png" width="672"></p>
<p>We can make several observations:</p>
<ul>
<li>Again, we don’t see any significant Neanderthal ancestry in present-day Africans (proportion is consistent with 0%), which is what we confirmed using <span class="math inline">\(D\)</span> and <span class="math inline">\(f_4\)</span> above.</li>
<li>Most present-day non-Africans carry between 2% and 3% Neanderthal ancestry.</li>
<li>We see a much higher proportion of Neanderthal ancestry in people from Papua New Guinea - almost 4%! This is consistent with earlier studies that suggest additional archaic admixture events in the ancestors of present-day Papuans.</li>
</ul>
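<p>The observations above can be checked numerically with a small base R sketch. The data frame below is typed in by hand from the <code>head(result)</code> output, and the interval of two standard errors around each estimate mirrors the error bars used in the plot; the column names <code>lower</code>, <code>upper</code> and <code>excludes_zero</code> are introduced only for this example.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># alpha and stderr values copied from the f4-ratio table above
est &lt;- data.frame(
  X      = c("French", "Sardinian", "Han", "Papuan", "Khomani_San", "Mbuti"),
  alpha  = c(0.023774, 0.024468, 0.022117, 0.037311, 0.003909, 0.000319),
  stderr = c(0.006173, 0.006079, 0.005901, 0.005821, 0.005923, 0.005721),
  stringsAsFactors = FALSE
)

# approximate interval of +/- 2 standard errors, as in the plot
est$lower &lt;- est$alpha - 2 * est$stderr
est$upper &lt;- est$alpha + 2 * est$stderr

# does the interval exclude zero, i.e. is the ancestry proportion detectable?
est$excludes_zero &lt;- est$lower &gt; 0

# express the estimates as percentages for easier reading
est$alpha_percent &lt;- round(100 * est$alpha, 1)

est[, c("X", "alpha_percent", "lower", "upper", "excludes_zero")]</code></pre></div>
<p>With these numbers, the intervals for the two African samples (Khomani_San and Mbuti) overlap zero, while those of the four non-African samples do not.</p>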
</div>
<div id="f_3-statistic" class="section level2">
<h2 class="hasAnchor">
<a href="#f_3-statistic" class="anchor"></a><span class="math inline">\(f_3\)</span> statistic</h2>
<p>The <span class="math inline">\(f_3\)</span> statistic, also known as the 3-population statistic, is useful whenever we want to:</p>
<ol style="list-style-type: decimal">
<li>Estimate the branch length (shared genetic drift) between a pair of populations <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> with respect to a common outgroup <span class="math inline">\(C\)</span>. In this case, the higher the <span class="math inline">\(f_3\)</span> value, the longer the shared evolutionary time between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>.</li>
<li>Test whether population <span class="math inline">\(C\)</span> is a mixture of two parental populations <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>. A negative value of the <span class="math inline">\(f_3\)</span> statistic then serves as statistical evidence of this admixture.</li>
</ol>
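<p>To make the first use case more concrete, here is a minimal sketch of a naive <span class="math inline">\(f_3\)</span> estimator computed from allele frequencies, using the definition <span class="math inline">\(f_3(C; A, B) = E[(c - a)(c - b)]\)</span>. The function <code>f3_naive()</code> and the toy frequencies are hypothetical and serve only as an illustration; real implementations typically use bias-corrected estimators, so this is not what <em>admixr</em> runs internally.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># naive f3: average product of allele frequency differences to the outgroup C
f3_naive &lt;- function(a, b, c) {
  mean((c - a) * (c - b))
}

# hypothetical allele frequencies at four SNPs
freq_C       &lt;- c(0.5, 0.5, 0.5, 0.5)  # outgroup
freq_A       &lt;- c(0.7, 0.3, 0.6, 0.4)  # population A
freq_B_close &lt;- c(0.7, 0.3, 0.6, 0.5)  # drifted from C mostly together with A
freq_B_far   &lt;- c(0.5, 0.5, 0.6, 0.4)  # shares only part of that drift

f3_close &lt;- f3_naive(freq_A, freq_B_close, freq_C)
f3_far   &lt;- f3_naive(freq_A, freq_B_far, freq_C)

c(close = f3_close, far = f3_far)</code></pre></div>
<p>In this toy example the pair that shares more of its frequency changes relative to the outgroup gets the larger <span class="math inline">\(f_3\)</span> value, which is the property exploited in the example that follows.</p>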
<p>As an example, imagine we are interested in relative divergence times between pairs of present-day human populations, and want to know in which approximate order they split off from each other. To address this problem, we could use the <span class="math inline">\(f_3\)</span> statistic by fixing the <span class="math inline">\(C\)</span> outgroup as San, and calculating pairwise <span class="math inline">\(f_3\)</span> statistics between all pairs of present-day modern humans.</p>
<div class="sourceCode" id="cb20"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb20-1" data-line-number="1">pops &lt;-<span class="st"> </span><span class="kw">c</span>(<span class="st">"French"</span>, <span class="st">"Sardinian"</span>, <span class="st">"Han"</span>, <span class="st">"Papuan"</span>, <span class="st">"Mbuti"</span>, <span class="st">"Dinka"</span>, <span class="st">"Yoruba"</span>)</a>
<a class="sourceLine" id="cb20-2" data-line-number="2"></a>
<a class="sourceLine" id="cb20-3" data-line-number="3">result &lt;-<span class="st"> </span><span class="kw"><a href="../reference/f4ratio.html">f3</a></span>(<span class="dt">A =</span> pops, <span class="dt">B =</span> pops, <span class="dt">C =</span> <span class="st">"Khomani_San"</span>, <span class="dt">data =</span> snps)</a></code></pre></div>
<div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb21-1" data-line-number="1"><span class="kw">head</span>(result)</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="left">A</th>
<th align="left">B</th>
<th align="left">C</th>
<th align="right">f3</th>
<th align="right">stderr</th>
<th align="right">Zscore</th>
<th align="right">nsnps</th>
</tr></thead>
<tbody>
<tr class="odd">
<td align="left">French</td>
<td align="left">French</td>
<td align="left">Khomani_San</td>
<td align="right">0.000000</td>
<td align="right">-1.000000</td>
<td align="right">0.000</td>
<td align="right">-1</td>
</tr>
<tr class="even">
<td align="left">French</td>
<td align="left">Sardinian</td>
<td align="left">Khomani_San</td>
<td align="right">0.353447</td>
<td align="right">0.012527</td>
<td align="right">28.215</td>
<td align="right">249760</td>
</tr>
<tr class="odd">
<td align="left">French</td>
<td align="left">Han</td>
<td align="left">Khomani_San</td>
<td align="right">0.316964</td>
<td align="right">0.011914</td>
<td align="right">26.604</td>
<td align="right">253158</td>
</tr>
<tr class="even">
<td align="left">French</td>
<td align="left">Papuan</td>
<td align="left">Khomani_San</td>
<td align="right">0.306962</td>
<td align="right">0.011708</td>
<td align="right">26.218</td>
<td align="right">251648</td>
</tr>
<tr class="odd">
<td align="left">French</td>
<td align="left">Mbuti</td>
<td align="left">Khomani_San</td>
<td align="right">0.119283</td>
<td align="right">0.008448</td>
<td align="right">14.119</td>
<td align="right">271501</td>
</tr>
<tr class="even">
<td align="left">French</td>
<td align="left">Dinka</td>
<td align="left">Khomani_San</td>
<td align="right">0.190141</td>
<td align="right">0.010049</td>
<td align="right">18.922</td>
<td align="right">276964</td>
</tr>
</tbody>
</table>
<div class="sourceCode" id="cb22"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb22-1" data-line-number="1"><span class="co"># sort the population labels according to an increasing f3 value relative to Mbuti</span></a>
<a class="sourceLine" id="cb22-2" data-line-number="2">ordered &lt;-<span class="st"> </span><span class="kw">filter</span>(result, A <span class="op">==</span><span class="st"> "Mbuti"</span>, B <span class="op">!=</span><span class="st"> "Mbuti"</span>) <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">arrange</span>(f3) <span class="op">%&gt;%</span><span class="st"> </span>.[[<span class="st">"B"</span>]] <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">c</span>(<span class="st">"Mbuti"</span>)</a>
<a class="sourceLine" id="cb22-3" data-line-number="3"></a>
<a class="sourceLine" id="cb22-4" data-line-number="4"><span class="co"># plot heatmap of pairwise f3 values</span></a>
<a class="sourceLine" id="cb22-5" data-line-number="5">result <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb22-6" data-line-number="6"><span class="st">  </span><span class="kw">filter</span>(A <span class="op">!=</span><span class="st"> </span>B) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb22-7" data-line-number="7"><span class="st">  </span><span class="kw">mutate</span>(<span class="dt">A =</span> <span class="kw">factor</span>(A, <span class="dt">levels =</span> ordered),</a>
<a class="sourceLine" id="cb22-8" data-line-number="8">         <span class="dt">B =</span> <span class="kw">factor</span>(B, <span class="dt">levels =</span> ordered)) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb22-9" data-line-number="9"><span class="st">  </span><span class="kw">ggplot</span>(<span class="kw">aes</span>(A, B)) <span class="op">+</span><span class="st"> </span><span class="kw">geom_tile</span>(<span class="kw">aes</span>(<span class="dt">fill =</span> f3))</a></code></pre></div>
<p><img src="tutorial_files/figure-html/f3_plot-1.png" width="768"></p>
<p>We can see that when we order the heatmap labels based on values of pairwise <span class="math inline">\(f_3\)</span> statistics, the (already known) order of population splits pops up beautifully (i.e. San separated first, followed by Mbuti, etc.).</p>
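<p>The same ordering logic can be sketched in base R using only the rows printed by <code>head(result)</code> above, all of which have French as population <code>A</code>. The data frame <code>french_rows</code> is typed in by hand for this example; under the interpretation given earlier, a smaller <span class="math inline">\(f_3\)</span> value means less drift shared with French since the split from the San outgroup.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># f3(French, B; Khomani_San) values copied from the table above
french_rows &lt;- data.frame(
  B  = c("Sardinian", "Han", "Papuan", "Mbuti", "Dinka"),
  f3 = c(0.353447, 0.316964, 0.306962, 0.119283, 0.190141),
  stringsAsFactors = FALSE
)

# sort from the least to the most drift shared with French
by_f3 &lt;- french_rows[order(french_rows$f3), ]

by_f3$B</code></pre></div>
<p>Read from left to right, the resulting labels go from the population that shares the least drift with French to the one that shares the most.</p>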
</div>
<div id="qpwave-and-qpadm" class="section level2">
<h2 class="hasAnchor">
<a href="#qpwave-and-qpadm" class="anchor"></a>qpWave and qpAdm</h2>
<p>Both <em>qpWave</em> and <em>qpAdm</em> can be thought of as more complex and powerful extensions of the basic idea behind a simple <span class="math inline">\(f_4\)</span> statistic. Building upon this idea and generalizing it, the <em>qpWave</em> method makes it possible to find the lowest number of “streams of ancestry” between two groups of populations that is consistent with the data. Extending the concept of <span class="math inline">\(f_4\)</span> statistics even further, <em>qpAdm</em> allows us to estimate the proportions of ancestry that a set of ancestral populations contributed to our sample or population of interest.</p>
<p>Unfortunately, both methods are rather advanced topics that still lack proper documentation and beginner-friendly tutorials, and explaining them in detail is beyond the scope of this vignette. If you want to use them, it’s highly recommended that you read the official documentation describing the basic ideas of both methods (<a href="https://github.com/DReichLab/AdmixTools/blob/master/pdoc.pdf">distributed with ADMIXTOOLS</a>), <em>and</em> that you read the relevant supplementary sections of papers published by David Reich’s group. At the very least, I recommend reading:</p>
<ul>
<li><p>Note S6 of <em>“<a href="https://www.nature.com/articles/nature11258">Reconstructing Native American population history</a>”</em> by Reich et al. This paper first introduced the theoretical background of what later became <em>qpWave</em>.</p></li>
<li><p>Supplementary Information 10 of <em>“<a href="https://www.nature.com/articles/nature14317">Massive migration from the steppe was a source for Indo-European languages in Europe</a>”</em> by Haak et al., which gives a more concise overview of the <em>qpWave</em> method than Note S6 of Reich et al. 2012, and also introduces the <em>qpAdm</em> methodology for estimating admixture proportions.</p></li>
</ul>
<p>If you read these papers and the documentation distributed with ADMIXTOOLS carefully, you will have a solid understanding of both <em>qpWave</em> and <em>qpAdm</em>.</p>
<p>In the remainder of this section, I will assume that you are familiar with both methods, and will only explain how <em>admixr</em> makes running them much easier.</p>
<div id="qpwave" class="section level3">
<h3 class="hasAnchor">
<a href="#qpwave" class="anchor"></a><em>qpWave</em>
</h3>
<p>To run <em>qpWave</em>, you must provide a list of <em>left</em> and <em>right</em> populations (using the terminology of Haak et al. 2015 above). The aim of the method is to get an idea about the number of migration waves from <em>right</em> to <em>left</em> (with no back-migration from <em>left</em> to <em>right</em>!). This is done by estimating the rank of a matrix of all possible <span class="math inline">\(f_4\)</span> statistics</p>
<p><span class="math display">\[f_4(\textrm{left}_1, \textrm{left}_i; \textrm{right}_1, \textrm{right}_j),\]</span></p>
<p>where <span class="math inline">\(\textrm{left}_1\)</span> and <span class="math inline">\(\textrm{right}_1\)</span> are some fixed populations and the <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span> indices run over all other possible choices of populations.</p>
<p>As an example, let’s try to find the number of admixture waves from <em>right</em> = {Yoruba, Mbuti, Altai} into <em>left</em> = {French, Sardinian, Han}. We can do this using the function <code><a href="../reference/qpWave.html">qpWave()</a></code>, setting its arguments appropriately:</p>
<div class="sourceCode" id="cb23"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb23-1" data-line-number="1">result &lt;-<span class="st"> </span><span class="kw"><a href="../reference/qpWave.html">qpWave</a></span>(</a>
<a class="sourceLine" id="cb23-2" data-line-number="2"> <span class="dt">left =</span> <span class="kw">c</span>(<span class="st">"French"</span>, <span class="st">"Sardinian"</span>, <span class="st">"Han"</span>),</a>
<a class="sourceLine" id="cb23-3" data-line-number="3"> <span class="dt">right =</span> <span class="kw">c</span>(<span class="st">"Altai"</span>, <span class="st">"Yoruba"</span>, <span class="st">"Mbuti"</span>),</a>
<a class="sourceLine" id="cb23-4" data-line-number="4"> <span class="dt">data =</span> snps</a>
<a class="sourceLine" id="cb23-5" data-line-number="5">)</a></code></pre></div>
<p>The <code><a href="../reference/qpWave.html">qpWave()</a></code> function returns a data frame which shows the results of a series of matrix rank tests. The <code>rank</code> column is the matrix rank tested, <code>df</code>, <code>chisq</code> and <code>tail</code> give …, and <code>dfdiff</code>, <code>chisqdiff</code> and <code>taildiff</code> give the same, but always comparing the fit to the fit of a rank immediately lower.</p>
<div class="sourceCode" id="cb24"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb24-1" data-line-number="1">result</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="right">rank</th>
<th align="right">df</th>
<th align="right">chisq</th>
<th align="right">tail</th>
<th align="right">dfdiff</th>
<th align="right">chisqdiff</th>
<th align="right">taildiff</th>
</tr></thead>
<tbody>
<tr class="odd">
<td align="right">0</td>
<td align="right">4</td>
<td align="right">1.758</td>
<td align="right">0.7801969</td>
<td align="right">0</td>
<td align="right">0.000</td>
<td align="right">1.0000000</td>
</tr>
<tr class="even">
<td align="right">1</td>
<td align="right">1</td>
<td align="right">0.192</td>
<td align="right">0.6614221</td>
<td align="right">3</td>
<td align="right">1.566</td>
<td align="right">0.6671280</td>
</tr>
<tr class="odd">
<td align="right">2</td>
<td align="right">0</td>
<td align="right">0.000</td>
<td align="right">1.0000000</td>
<td align="right">1</td>
<td align="right">0.192</td>
<td align="right">0.6614221</td>
</tr>
</tbody>
</table>
<p>In this example, we see that matrix rank <span class="math inline">\(r = 0\)</span> cannot be rejected (<code>tail</code> <span class="math inline">\(p\)</span>-value = 0.780). Because Reich et al. 2012 showed that <span class="math inline">\(r + 1 \le n\)</span>, where <span class="math inline">\(n\)</span> is the number of admixture waves, we can interpret this as <em>left</em> populations having at least <span class="math inline">\(n = 1\)</span> stream of ancestry from the set of <em>right</em> populations. In this case, the most likely explanation is Neanderthal admixture in non-Africans today.</p>
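<p>The columns of the rank test table above can be cross-checked with a few lines of base R. The sketch below assumes that the <span class="math inline">\(f_4\)</span> matrix has one row fewer than there are <em>left</em> populations and one column fewer than there are <em>right</em> populations, that a test of rank <span class="math inline">\(r\)</span> on such a matrix has (rows <span class="math inline">\(- r\)</span>) <span class="math inline">\(\times\)</span> (columns <span class="math inline">\(- r\)</span>) degrees of freedom, and that <code>tail</code> is the upper-tail probability of a chi-squared distribution. These assumptions reproduce the printed values, but they are an interpretation of the output rather than a statement taken from the ADMIXTOOLS documentation. The data frame is typed in by hand and the <code>*_expected</code> and <code>*_recomputed</code> column names are introduced only here.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># values copied from the qpWave table above (3 left and 3 right populations)
rank_tests &lt;- data.frame(
  rank  = c(0, 1, 2),
  df    = c(4, 1, 0),
  chisq = c(1.758, 0.192, 0.000),
  tail  = c(0.7801969, 0.6614221, 1.0000000)
)
n_left  &lt;- 3
n_right &lt;- 3

# assumed degrees of freedom of a rank-r test on an (n_left - 1) x (n_right - 1) matrix
rank_tests$df_expected &lt;-
  (n_left - 1 - rank_tests$rank) * (n_right - 1 - rank_tests$rank)

# assumed meaning of `tail`: upper-tail chi-squared probability
rank_tests$tail_recomputed &lt;-
  pchisq(rank_tests$chisq, df = rank_tests$df, lower.tail = FALSE)

rank_tests</code></pre></div>
<p>Small differences between <code>tail</code> and <code>tail_recomputed</code> are expected because the printed <code>chisq</code> values are rounded to three decimal places.</p>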
<p>Now, what happens if we add Papuans to the <em>left</em> group?</p>
<div class="sourceCode" id="cb25"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb25-1" data-line-number="1">result &lt;-<span class="st"> </span><span class="kw"><a href="../reference/qpWave.html">qpWave</a></span>(</a>
<a class="sourceLine" id="cb25-2" data-line-number="2"> <span class="dt">left =</span> <span class="kw">c</span>(<span class="st">"Papuan"</span>, <span class="st">"French"</span>, <span class="st">"Sardinian"</span>, <span class="st">"Han"</span>),</a>
<a class="sourceLine" id="cb25-3" data-line-number="3"> <span class="dt">right =</span> <span class="kw">c</span>(<span class="st">"Altai"</span>, <span class="st">"Yoruba"</span>, <span class="st">"Mbuti"</span>),</a>
<a class="sourceLine" id="cb25-4" data-line-number="4"> <span class="dt">data =</span> snps</a>
<a class="sourceLine" id="cb25-5" data-line-number="5">)</a></code></pre></div>
<div class="sourceCode" id="cb26"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb26-1" data-line-number="1">result</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="right">rank</th>
<th align="right">df</th>
<th align="right">chisq</th>
<th align="right">tail</th>
<th align="right">dfdiff</th>
<th align="right">chisqdiff</th>
<th align="right">taildiff</th>
</tr></thead>
<tbody>
<tr class="odd">
<td align="right">0</td>
<td align="right">6</td>
<td align="right">29.150</td>
<td align="right">0.0000570</td>
<td align="right">0</td>
<td align="right">0.000</td>
<td align="right">1.0000000</td>
</tr>
<tr class="even">
<td align="right">1</td>
<td align="right">2</td>
<td align="right">0.603</td>
<td align="right">0.7395638</td>
<td align="right">4</td>
<td align="right">28.547</td>
<td align="right">0.0000097</td>
</tr>
<tr class="odd">
<td align="right">2</td>
<td align="right">0</td>
<td align="right">0.000</td>
<td align="right">1.0000000</td>
<td align="right">2</td>
<td align="right">0.603</td>
<td align="right">0.7395638</td>
</tr>
</tbody>
</table>
<p>We can now clearly reject rank <span class="math inline">\(r = 0\)</span>, but we see that the data is consistent with rank <span class="math inline">\(r = 1\)</span>, meaning that there must have been at least <span class="math inline">\(n = 2\)</span> streams of ancestry from <em>right</em> to <em>left</em> populations (<span class="math inline">\(r + 1 \le n\)</span>). Because this happened after we introduced Papuans to the <em>left</em> set, this could indicate a separate pulse of archaic introgression into Papuans, which is not surprising given what we know about the geographical patterns of archaic admixture in non-Africans today, and significantly more introgression observed in Papuans than any other present-day population.</p>
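<p>The reasoning applied to both tables can be summarized in a short sketch: find the lowest rank that is not rejected and add one to obtain the minimum number of streams of ancestry. The helper <code>lowest_consistent_rank()</code> is hypothetical (it is not an <em>admixr</em> function), the 0.05 significance threshold is an assumption made for this example, and the two data frames are typed in by hand from the <code>rank</code> and <code>tail</code> columns above.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># hypothetical helper: lowest rank whose test is not rejected at level alpha
lowest_consistent_rank &lt;- function(tests, alpha = 0.05) {
  consistent &lt;- tests$rank[tests$tail &gt; alpha]
  if (length(consistent) == 0) NA_integer_ else min(consistent)
}

# rank and tail columns copied from the two qpWave tables above
without_papuan &lt;- data.frame(rank = 0:2, tail = c(0.7801969, 0.6614221, 1))
with_papuan    &lt;- data.frame(rank = 0:2, tail = c(0.0000570, 0.7395638, 1))

r_without &lt;- lowest_consistent_rank(without_papuan)
r_with    &lt;- lowest_consistent_rank(with_papuan)

# minimum number of streams of ancestry, using r + 1 &lt;= n
c(without_papuan = r_without + 1, with_papuan = r_with + 1)</code></pre></div>
<p>With the values above, the lowest consistent rank is 0 without Papuans and 1 with Papuans, matching the interpretation given in the text.</p>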
</div>
<div id="qpadm" class="section level3">
<h3 class="hasAnchor">
<a href="#qpadm" class="anchor"></a><em>qpAdm</em>
</h3>
<p>The <em>qpAdm</em> method can be used to find, for a given target population, the proportions of ancestry coming from a set of ancestral populations. Importantly, since we often lack accurate representatives of the true ancestral populations, we can use a set of <em>reference</em> populations instead, under a crucial assumption that the <em>references</em> set is phylogenetically closer to true ancestral populations than a set of specified <em>outgroups</em>. For example, coming back to our example of using <span class="math inline">\(f_4\)</span>-ratio statistics to estimate the proportions of Neandertal ancestry in people today, we could define:</p>
<ul>
<li>some Europeans as the <em>target</em>;</li>
<li>Vindija Neanderthal and an African as two <em>reference</em> populations (two potential sources of ancestries in Europeans today);</li>
<li>
<em>outgroup</em> populations as Chimp, Altai Neanderthal and Denisovan (which are all further from the true ancestral populations - here Vindija and African - than the <em>reference</em> populations).</li>
</ul>
<p>Having defined all three population sets, we can run qpAdm with:</p>
<div class="sourceCode" id="cb27"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb27-1" data-line-number="1">result &lt;-<span class="st"> </span><span class="kw"><a href="../reference/qpAdm.html">qpAdm</a></span>(</a>
<a class="sourceLine" id="cb27-2" data-line-number="2">  <span class="dt">target =</span> <span class="kw">c</span>(<span class="st">"Sardinian"</span>, <span class="st">"Han"</span>, <span class="st">"French"</span>),</a>
<a class="sourceLine" id="cb27-3" data-line-number="3">  <span class="dt">sources =</span> <span class="kw">c</span>(<span class="st">"Vindija"</span>, <span class="st">"Yoruba"</span>),</a>
<a class="sourceLine" id="cb27-4" data-line-number="4">  <span class="dt">outgroups =</span> <span class="kw">c</span>(<span class="st">"Chimp"</span>, <span class="st">"Denisova"</span>, <span class="st">"Altai"</span>),</a>
<a class="sourceLine" id="cb27-5" data-line-number="5">  <span class="dt">data =</span> snps</a>
<a class="sourceLine" id="cb27-6" data-line-number="6">)</a></code></pre></div>
<p>The <code><a href="../reference/qpAdm.html">qpAdm()</a></code> function has an argument <code>details</code> (default TRUE) which makes the function return a list of three elements:</p>
<ul>
<li>
<code>proportions</code> - data frame with admixture proportions - this is what we care about;</li>
<li>
<code>ranks</code> - results of rank tests performed by <em>qpWave</em> - these evaluate how well the assumed target-references-outgroups population model matches the data;</li>
<li>
<code>subsets</code> - results of the “all subsets” analysis (see the <a href="https://github.com/DReichLab/AdmixTools/blob/master/pdoc.pdf">documentation</a> for more details).</li>
</ul>
<p>Let’s start with the <code>ranks</code> element:</p>
<div class="sourceCode" id="cb28"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb28-1" data-line-number="1">result<span class="op">$</span>ranks</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="left">target</th>
<th align="right">rank</th>
<th align="right">df</th>
<th align="right">chisq</th>
<th align="right">tail</th>
<th align="right">dfdiff</th>
<th align="right">chisqdiff</th>
<th align="right">taildiff</th>
</tr></thead>
<tbody>
<tr class="odd">
<td align="left">Sardinian</td>
<td align="right">1</td>
<td align="right">1</td>
<td align="right">0.006</td>
<td align="right">0.9362605</td>
<td align="right">3</td>
<td align="right">-0.006</td>
<td align="right">1.0000000</td>
</tr>
<tr class="even">
<td align="left">Sardinian</td>
<td align="right">2</td>
<td align="right">0</td>
<td align="right">0.000</td>
<td align="right">1.0000000</td>
<td align="right">1</td>
<td align="right">0.006</td>
<td align="right">0.9362605</td>
</tr>
<tr class="odd">
<td align="left">Han</td>
<td align="right">1</td>
<td align="right">1</td>
<td align="right">2.144</td>
<td align="right">0.1431157</td>
<td align="right">3</td>
<td align="right">-2.144</td>
<td align="right">1.0000000</td>
</tr>
<tr class="even">
<td align="left">Han</td>
<td align="right">2</td>
<td align="right">0</td>
<td align="right">0.000</td>
<td align="right">1.0000000</td>
<td align="right">1</td>
<td align="right">2.144</td>
<td align="right">0.1431157</td>
</tr>
<tr class="odd">
<td align="left">French</td>
<td align="right">1</td>
<td align="right">1</td>
<td align="right">3.814</td>
<td align="right">0.0508171</td>
<td align="right">3</td>
<td align="right">-3.814</td>
<td align="right">1.0000000</td>
</tr>
<tr class="even">
<td align="left">French</td>
<td align="right">2</td>
<td align="right">0</td>
<td align="right">0.000</td>
<td align="right">1.0000000</td>
<td align="right">1</td>
<td align="right">3.814</td>
<td align="right">0.0508171</td>
</tr>
</tbody>
</table>
<p>The row with rank = 1 represents a <em>qpWave</em> test with all <span class="math inline">\(n\)</span> <em>reference</em> populations set as the <em>left</em> set and all <em>outgroups</em> as the <em>right</em> set, and evaluates whether the ancestral sources themselves are descended from <span class="math inline">\(n\)</span> independent streams of ancestry. In our case, <span class="math inline">\(n = 2\)</span> (Vindija and Yoruba), which means that the data would have to be consistent with rank <span class="math inline">\(r = 1\)</span> to satisfy the inequality <span class="math inline">\(r + 1 \le n\)</span> proved by Reich et al., 2012. We see that this is true for all three target populations (<span class="math inline">\(p\)</span>-value &gt; 0.05 for all targets), and the simple model thus seems to be reasonably consistent with the data.</p>
<p>The rank = 2 row represents a <em>qpWave</em> test after adding a target population to the <em>left</em> group together with the <em>references</em>. This test makes sure that including the target population does not increase the rank of the <span class="math inline">\(f_4\)</span> matrix, meaning that the target can really be modelled as a mixture of ancestries from the <em>reference</em> set. If the <span class="math inline">\(p\)</span>-values turn out to be very low, this indicates that the assumed model does not fit the data very well and that a part of the ancestry in <em>target</em> possibly cannot be traced to any of the <em>references</em>. In our case, however, none of the rank = 2 test <span class="math inline">\(p\)</span>-values appear significant, and we can be reasonably sure that the <em>target</em> samples can be fully modelled as a mixture of all specified <em>references</em>.</p>
<p>Having made sure that our model is reasonably correct, we can now have a look at the <code>proportions</code> element, which contains the estimated admixture proportions from all specified sources, as well as standard errors for those proportions estimated using a block jackknife:</p>
<div class="sourceCode" id="cb29"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb29-1" data-line-number="1">result<span class="op">$</span>proportions</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="left">target</th>
<th align="right">Vindija</th>
<th align="right">Yoruba</th>
<th align="right">stderr_Vindija</th>
<th align="right">stderr_Yoruba</th>
<th align="right">nsnps</th>
</tr></thead>
<tbody>
<tr class="odd">
<td align="left">Sardinian</td>
<td align="right">0.025</td>
<td align="right">0.975</td>
<td align="right">0.006</td>
<td align="right">0.006</td>
<td align="right">499314</td>
</tr>
<tr class="even">
<td align="left">Han</td>
<td align="right">0.021</td>
<td align="right">0.979</td>
<td align="right">0.006</td>
<td align="right">0.006</td>
<td align="right">499654</td>
</tr>
<tr class="odd">
<td align="left">French</td>
<td align="right">0.022</td>
<td align="right">0.978</td>
<td align="right">0.006</td>
<td align="right">0.006</td>
<td align="right">499434</td>
</tr>
</tbody>
</table>
<p>If we compare this result to the <span class="math inline">\(f_4\)</span>-ratio values calculated above, we see that the <em>qpAdm</em> estimates are very close to what we got earlier.</p>
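<p>This comparison can be made explicit with a small base R sketch. Both sets of numbers are typed in by hand from the tables printed in this vignette; the named vector <code>f4ratio_alpha</code> and the columns <code>f4ratio</code>, <code>diff_in_stderr</code> and <code>sums_to_one</code> are introduced only for this example.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># f4-ratio estimates of Neanderthal ancestry (alpha column of the earlier table)
f4ratio_alpha &lt;- c(Sardinian = 0.024468, Han = 0.022117, French = 0.023774)

# qpAdm proportions and standard errors copied from the table above
qpadm &lt;- data.frame(
  target         = c("Sardinian", "Han", "French"),
  Vindija        = c(0.025, 0.021, 0.022),
  Yoruba         = c(0.975, 0.979, 0.978),
  stderr_Vindija = c(0.006, 0.006, 0.006),
  stringsAsFactors = FALSE
)

# match the f4-ratio estimate to each qpAdm target by name
qpadm$f4ratio &lt;- unname(f4ratio_alpha[qpadm$target])

# difference between both estimates in units of the qpAdm standard error
qpadm$diff_in_stderr &lt;- abs(qpadm$Vindija - qpadm$f4ratio) / qpadm$stderr_Vindija

# the two source proportions of each target should add up to one
qpadm$sums_to_one &lt;- abs(qpadm$Vindija + qpadm$Yoruba - 1) &lt; 1e-9

qpadm[, c("target", "Vindija", "f4ratio", "diff_in_stderr", "sums_to_one")]</code></pre></div>
<p>For all three targets the two estimates differ by well under one standard error.</p>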
<p>The third element in the list of results shows the outcome of an “all subsets” analysis, which involves testing all subsets of potential source populations. Each 1 in the “pattern” column means that the proportion of ancestry from that particular source population (in the order specified originally by the user) was forced to 0.0.</p>
<div class="sourceCode" id="cb30"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb30-1" data-line-number="1">result<span class="op">$</span>subsets</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="left">target</th>
<th align="left">pattern</th>
<th align="right">wt</th>
<th align="right">dof</th>
<th align="right">chisq</th>
<th align="right">tail</th>
<th align="right">Vindija</th>
<th align="right">Yoruba</th>
</tr></thead>
<tbody>
<tr class="odd">
<td align="left">Sardinian</td>
<td align="left">00</td>
<td align="right">0</td>
<td align="right">1</td>
<td align="right">0.006</td>
<td align="right">0.9362610</td>
<td align="right">0.025</td>
<td align="right">0.975</td>
</tr>
<tr class="even">
<td align="left">Sardinian</td>
<td align="left">01</td>
<td align="right">1</td>
<td align="right">2</td>
<td align="right">15953.171</td>
<td align="right">0.0000000</td>
<td align="right">1.000</td>
<td align="right">0.000</td>
</tr>
<tr class="odd">
<td align="left">Sardinian</td>
<td align="left">10</td>
<td align="right">1</td>
<td align="right">2</td>
<td align="right">16.564</td>
<td align="right">0.0002530</td>
<td align="right">0.000</td>
<td align="right">1.000</td>
</tr>
<tr class="even">
<td align="left">Han</td>
<td align="left">00</td>
<td align="right">0</td>
<td align="right">1</td>
<td align="right">2.144</td>
<td align="right">0.1431160</td>
<td align="right">0.021</td>
<td align="right">0.979</td>
</tr>
<tr class="odd">
<td align="left">Han</td>
<td align="left">01</td>
<td align="right">1</td>
<td align="right">2</td>
<td align="right">14965.791</td>
<td align="right">0.0000000</td>
<td align="right">1.000</td>
<td align="right">0.000</td>
</tr>
<tr class="even">
<td align="left">Han</td>
<td align="left">10</td>
<td align="right">1</td>
<td align="right">2</td>
<td align="right">14.454</td>
<td align="right">0.0007269</td>
<td align="right">0.000</td>
<td align="right">1.000</td>
</tr>
<tr class="odd">
<td align="left">French</td>
<td align="left">00</td>
<td align="right">0</td>
<td align="right">1</td>
<td align="right">3.814</td>
<td align="right">0.0508171</td>
<td align="right">0.022</td>
<td align="right">0.978</td>
</tr>
<tr class="even">
<td align="left">French</td>
<td align="left">01</td>
<td align="right">1</td>
<td align="right">2</td>
<td align="right">15441.258</td>
<td align="right">0.0000000</td>
<td align="right">1.000</td>
<td align="right">0.000</td>
</tr>
<tr class="odd">
<td align="left">French</td>
<td align="left">10</td>
<td align="right">1</td>
<td align="right">2</td>
<td align="right">16.028</td>
<td align="right">0.0003308</td>
<td align="right">0.000</td>
<td align="right">1.000</td>
</tr>
</tbody>
</table>
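<p>To make the “pattern” encoding concrete, here is a minimal base R sketch that decodes each pattern string into the source populations whose ancestry proportions were forced to zero. The helper <code>decode_pattern</code> is hypothetical (it is not part of <em>admixr</em>), and the sources are assumed to be listed in the order in which they were originally specified.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># source populations, in the order they were specified (assumed)
sources &lt;- c("Vindija", "Yoruba")

# hypothetical helper: return the sources marked with "1" in a pattern string
decode_pattern &lt;- function(pattern, sources) {
  flags &lt;- strsplit(pattern, "")[[1]] == "1"
  sources[flags]
}

# decode all patterns that appear in the table above
patterns &lt;- c("00", "01", "10")
forced_to_zero &lt;- lapply(setNames(patterns, patterns), decode_pattern, sources = sources)
forced_to_zero</code></pre></div>
<p>For the pattern <code>01</code> the sketch returns Yoruba, which agrees with the rows of the table in which the Yoruba proportion is 0.000.</p>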
</div>
</div>
<div id="grouping-samples" class="section level2">
<h2 class="hasAnchor">
<a href="#grouping-samples" class="anchor"></a>Grouping samples</h2>
<p>So far, we have been calculating statistics for individual samples. However, it is often useful to treat multiple samples as a single group or population. <em>admixr</em> provides a function called <code>relabel</code> that does just that.</p>
<p>Here is an example: let’s say we want to run a similar analysis to the one described in the <span class="math inline">\(D\)</span> statistic section, but we want to treat Europeans, Africans and archaics as combined populations. However, the <code>ind</code> file that we have does not contain grouped labels; each sample stands on its own:</p>
<pre><code>Chimp        U  Chimp
Mbuti        U  Mbuti
Yoruba       U  Yoruba
Khomani_San  U  Khomani_San
Han          U  Han
Dinka        U  Dinka
Sardinian    U  Sardinian
Papuan       U  Papuan
French       U  French
Vindija      U  Vindija
Altai        U  Altai
Denisova     U  Denisova</code></pre>
<p>To merge several individual samples under a combined label we can call <code>relabel</code> like this:</p>
<div class="sourceCode" id="cb32"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb32-1" data-line-number="1"><span class="co"># create a modified EIGENSTRAT object whose new ind file will</span></a>
<a class="sourceLine" id="cb32-2" data-line-number="2"><span class="co"># contain merged population labels</span></a>
<a class="sourceLine" id="cb32-3" data-line-number="3">modif_snps &lt;-<span class="st"> </span><span class="kw"><a href="../reference/relabel.html">relabel</a></span>(</a>
<a class="sourceLine" id="cb32-4" data-line-number="4">  snps,</a>
<a class="sourceLine" id="cb32-5" data-line-number="5">  <span class="dt">European =</span> <span class="kw">c</span>(<span class="st">"French"</span>, <span class="st">"Sardinian"</span>),</a>
<a class="sourceLine" id="cb32-6" data-line-number="6">  <span class="dt">African =</span> <span class="kw">c</span>(<span class="st">"Dinka"</span>, <span class="st">"Yoruba"</span>, <span class="st">"Mbuti"</span>, <span class="st">"Khomani_San"</span>),</a>
<a class="sourceLine" id="cb32-7" data-line-number="7">  <span class="dt">Archaic =</span> <span class="kw">c</span>(<span class="st">"Vindija"</span>, <span class="st">"Altai"</span>, <span class="st">"Denisova"</span>)</a>
<a class="sourceLine" id="cb32-8" data-line-number="8">)</a></code></pre></div>
<div class="sourceCode" id="cb33"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb33-1" data-line-number="1">modif_snps</a>
<a class="sourceLine" id="cb33-2" data-line-number="2"><span class="co">#&gt; EIGENSTRAT object</span></a>
<a class="sourceLine" id="cb33-3" data-line-number="3"><span class="co">#&gt; =================</span></a>
<a class="sourceLine" id="cb33-4" data-line-number="4"><span class="co">#&gt; components:</span></a>
<a class="sourceLine" id="cb33-5" data-line-number="5"><span class="co">#&gt;   ind file: /var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/snps/snps.ind</span></a>
<a class="sourceLine" id="cb33-6" data-line-number="6"><span class="co">#&gt;   snp file: /var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/snps/snps.snp</span></a>
<a class="sourceLine" id="cb33-7" data-line-number="7"><span class="co">#&gt;   geno file: /var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/snps/snps.geno</span></a>
<a class="sourceLine" id="cb33-8" data-line-number="8"><span class="co">#&gt; </span></a>
<a class="sourceLine" id="cb33-9" data-line-number="9"><span class="co">#&gt; modifiers:</span></a>
<a class="sourceLine" id="cb33-10" data-line-number="10"><span class="co">#&gt;   groups:  /var/folders/kk/s4cwdkx90pscz314mp0hhz480000gn/T//Rtmpiusthx/file5d946113abb5.ind</span></a></code></pre></div>
<p>We can see that the function <code>relabel</code> returned a modified <code>EIGENSTRAT</code> object, which contains a new item in the “modifiers” section - the path to a new ind file that will be used in downstream analyses. Let’s look at its contents:</p>
<pre><code>Chimp        U  Chimp
Mbuti        U  African
Yoruba       U  African
Khomani_San  U  African
Han          U  Han
Dinka        U  African
Sardinian    U  European
Papuan       U  Papuan
French       U  European
Vindija      U  Archaic
Altai        U  Archaic
Denisova     U  Archaic</code></pre>
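<p>The relabeling logic itself is simple and can be sketched in base R on a small table that mimics the <code>ind</code> file. This is only an illustration of the idea, not the code that <code>relabel</code> runs internally; the data frame <code>ind</code> and the list <code>groups</code> are names made up for this sketch.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># toy representation of the original ind file (id, sex, label)
ind &lt;- data.frame(
  id = c("Chimp", "Mbuti", "Yoruba", "Khomani_San", "Han", "Dinka",
         "Sardinian", "Papuan", "French", "Vindija", "Altai", "Denisova"),
  sex = "U",
  stringsAsFactors = FALSE
)
ind$label &lt;- ind$id  # initially, each sample is its own population

# group definitions, as passed to relabel() above
groups &lt;- list(
  European = c("French", "Sardinian"),
  African  = c("Dinka", "Yoruba", "Mbuti", "Khomani_San"),
  Archaic  = c("Vindija", "Altai", "Denisova")
)

# overwrite the label of every sample that belongs to a group
for (name in names(groups)) {
  ind$label[ind$id %in% groups[[name]]] &lt;- name
}
ind</code></pre></div>
<p>Samples that are not mentioned in any group (here Chimp, Han and Papuan) keep their original labels, as in the modified <code>ind</code> file shown above.</p>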
<p>Having the modified <code>EIGENSTRAT</code> object ready, we can then use “European”, “African” and “Archaic” labels in any of the <em>admixr</em> wrapper functions described above. For example:</p>
<div class="sourceCode" id="cb35"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb35-1" data-line-number="1">result &lt;-<span class="st"> </span><span class="kw"><a href="../reference/f4ratio.html">d</a></span>(<span class="dt">W =</span> <span class="st">"European"</span>, <span class="dt">X =</span> <span class="st">"African"</span>, <span class="dt">Y =</span> <span class="st">"Archaic"</span>, <span class="dt">Z =</span> <span class="st">"Chimp"</span>, <span class="dt">data =</span> modif_snps)</a></code></pre></div>
<p>Here is the result, showing (as we’ve seen above for individual samples) that Europeans show greater genetic affinity to archaic humans than present-day Africans do:</p>
<div class="sourceCode" id="cb36"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb36-1" data-line-number="1"><span class="kw">head</span>(result)</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="left">W</th>
<th align="left">X</th>
<th align="left">Y</th>
<th align="left">Z</th>
<th align="right">D</th>
<th align="right">stderr</th>
<th align="right">Zscore</th>
<th align="right">BABA</th>
<th align="right">ABBA</th>
<th align="right">nsnps</th>
</tr></thead>
<tbody><tr class="odd">
<td align="left">European</td>
<td align="left">African</td>
<td align="left">Archaic</td>
<td align="left">Chimp</td>
<td align="right">0.0225</td>
<td align="right">0.004404</td>
<td align="right">5.117</td>
<td align="right">15487</td>
<td align="right">14805</td>
<td align="right">489003</td>
</tr></tbody>
</table>
<p>Note that the <code><a href="../reference/f4ratio.html">d()</a></code> function correctly picks up the “group modifier” <code>ind</code> file from the provided <code>EIGENSTRAT</code> object and uses it in place of the original <code>ind</code> file.</p>
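<p>As a quick sanity check, the <span class="math inline">\(D\)</span> value in the table can be recomputed from the reported site pattern counts. The sketch below assumes the usual definition <span class="math inline">\(D = (BABA - ABBA) / (BABA + ABBA)\)</span> and approximates the Z-score as <span class="math inline">\(D\)</span> divided by its standard error; the variable names are illustrative only.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># values copied from the result table above
baba &lt;- 15487
abba &lt;- 14805
stderr_d &lt;- 0.004404

# D from the site pattern counts (assumed definition, see text)
d_value &lt;- (baba - abba) / (baba + abba)

# approximate Z-score: the estimate divided by its standard error
z_approx &lt;- d_value / stderr_d

round(d_value, 4)
z_approx</code></pre></div>
<p>The recomputed <span class="math inline">\(D\)</span> rounds to the tabulated 0.0225. The approximate Z-score lands close to, but not exactly on, the tabulated value, most likely because the table reports rounded numbers.</p>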
</div>
<div id="counting-presentmissing-snps" class="section level2">
<h2 class="hasAnchor">
<a href="#counting-presentmissing-snps" class="anchor"></a>Counting present/missing SNPs</h2>
<p>The <code>count_snps</code> function can be useful for quality control or for weighting admixture statistics (<span class="math inline">\(D\)</span>, <span class="math inline">\(f_4\)</span>, etc.) in regression analyses. It has two optional arguments:</p>
<ul>
<li>
<code>prop</code> - changes whether to report SNP counts or proportions (set to <code>FALSE</code> by default),</li>
<li>
<code>missing</code> - controls whether to count missing SNPs instead of present SNPs (set to <code>FALSE</code> by default).</li>
</ul>
<p>For each sample, count the SNPs present in that sample:</p>
<div class="sourceCode" id="cb37"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb37-1" data-line-number="1"><span class="kw"><a href="../reference/count_snps.html">count_snps</a></span>(snps)</a></code></pre></div>
<table class="table">
<thead><tr class="header">
<th align="left">id</th>
<th align="left">sex</th>
<th align="left">label</th>
<th align="right">present</th>
</tr></thead>
<tbody>
<tr class="odd">
<td align="left">Chimp</td>
<td align="left">U</td>
<td align="left">Chimp</td>
<td align="right">491273</td>
</tr>
<tr class="even">
<td align="left">Mbuti</td>
<td align="left">U</td>
<td align="left">Mbuti</td>
<td align="right">499334</td>
</tr>
<tr class="odd">
<td align="left">Yoruba</td>
<td align="left">U</td>
<td align="left">Yoruba</td>
<td align="right">499246</td>
</tr>
<tr class="even">
<td align="left">Khomani_San</td>
<td align="left">U</td>
<td align="left">Khomani_San</td>
<td align="right">499250</td>
</tr>
<tr class="odd">
<td align="left">Han</td>
<td align="left">U</td>
<td align="left">Han</td>
<td align="right">499654</td>
</tr>
<tr class="even">
<td align="left">Dinka</td>
<td align="left">U</td>
<td align="left">Dinka</td>
<td align="right">499362</td>
</tr>
<tr class="odd">
<td align="left">Sardinian</td>
<td align="left">U</td>
<td align="left">Sardinian</td>
<td align="right">499314</td>
</tr>
<tr class="even">
<td align="left">Papuan</td>
<td align="left">U</td>
<td align="left">Papuan</td>
<td align="right">499377</td>
</tr>
<tr class="odd">
<td align="left">French</td>
<td align="left">U</td>
<td align="left">French</td>
<td align="right">499434</td>
</tr>
<tr class="even">
<td align="left">Vindija</td>
<td align="left">U</td>
<td align="left">Vindija</td>
<td align="right">497544</td>
</tr>
<tr class="odd">
<td align="left">Altai</td>
<td align="left">U</td>
<td align="left">Altai</td>
<td align="right">497729</td>
</tr>
<tr class="even">
<td align="left">Denisova</td>
<td align="left">U</td>
<td align="left">Denisova</td>
<td align="right">497398</td>
</tr>
</tbody>
</table>
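<p>What the two optional arguments change can be illustrated on a toy genotype matrix in base R. This is only a sketch of the counting logic, not the implementation of <code>count_snps</code>: the matrix and sample names are made up, and it assumes the EIGENSTRAT convention in which a genotype value of 9 marks missing data.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># toy genotypes: rows are SNPs, columns are samples, 9 = missing (assumed)
geno &lt;- matrix(
  c(0, 2, 9, 1,    # sample_A
    9, 9, 2, 0,    # sample_B
    1, 0, 2, 2),   # sample_C
  nrow = 4,
  dimnames = list(NULL, c("sample_A", "sample_B", "sample_C"))
)

# default behaviour: number of SNPs present in each sample
present &lt;- colSums(geno != 9)

# analogous to missing = TRUE: number of missing SNPs in each sample
missing &lt;- colSums(geno == 9)

# analogous to prop = TRUE: proportion of SNPs present in each sample
prop_present &lt;- colMeans(geno != 9)

present
missing
prop_present</code></pre></div>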
</div>
<div id="data-filtering" class="section level2">
<h2 class="hasAnchor">
<a href="#data-filtering" class="anchor"></a>Data filtering</h2>
<div id="filtering-based-on-a-bed-file" class="section level3">
<h3 class="hasAnchor">
<a href="#filtering-based-on-a-bed-file" class="anchor"></a>Filtering based on a BED file</h3>
<p>A common situation in genomics is performing an analysis on a subset of the genome. However, EIGENSTRAT is a rather obscure file format, which makes it very difficult to find bioinformatics tools that support it. Luckily, <em>admixr</em> includes a function <code><a href="../reference/filter_bed.html">filter_bed()</a></code> that takes an <code>EIGENSTRAT</code> object and a BED file as inputs and produces a new object that contains an “exclude” modifier, describing which sites did not pass the filtering and must be excluded from later analyses.</p>
<div class="sourceCode" id="cb38"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb38-1" data-line-number="1">bed &lt;-<span class="st"> </span><span class="kw">file.path</span>(<span class="kw">dirname</span>(prefix), <span class="st">"regions.bed"</span>)</a></code></pre></div>
<div class="sourceCode" id="cb39"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb39-1" data-line-number="1"><span class="co"># run this if your BED file contains regions to keep in an analysis</span></a>
<a class="sourceLine" id="cb39-2" data-line-number="2">new_snps &lt;-<span class="st"> </span><span class="kw"><a href="../reference/filter_bed.html">filter_bed</a></span>(snps, bed)</a>
<a class="sourceLine" id="cb39-3" data-line-number="3"></a>
<a class="sourceLine" id="cb39-4" data-line-number="4"><span class="co"># run this if your BED file contains regions to remove from an analysis</span></a>
<a class="sourceLine" id="cb39-5" data-line-number="5">new_snps &lt;-<span class="st"> </span><span class="kw"><a href="../reference/filter_bed.html">filter_bed</a></span>(snps, <span class="st">"regions.bed"</span>, <span class="dt">remove =</span> <span class="ot">TRUE</span>)</a>
<a class="sourceLine" id="cb39-6" data-line-number="6"></a>
<a class="sourceLine" id="cb39-7" data-line-number="7"><span class="co"># run this if you want the filter_bed function to save the "exclude"</span></a>
<a class="sourceLine" id="cb39-8" data-line-number="8"><span class="co"># file to a specified location - can be useful for debugging</span></a>
<a class="sourceLine" id="cb39-9" data-line-number="9">new_snps &lt;-<span class="st"> </span><span class="kw"><a href="../reference/filter_bed.html">filter_bed</a></span>(snps, <span class="st">"regions.bed"</span>, <span class="dt">outfile =</span> <span class="st">"filtered_snps.snp"</span>)</a></code></pre></div>
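<p>Conceptually, the filtering amounts to checking each SNP against the BED intervals. The following base R sketch shows that logic on toy data; it is not how <code>filter_bed</code> is implemented. The tables <code>snp</code> and <code>regions</code> and the helper <code>overlaps_region</code> are made up for illustration, and the sketch assumes 0-based, half-open BED intervals and 1-based SNP positions.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># toy SNP table (chromosome and 1-based position)
snp &lt;- data.frame(
  chrom = c("1", "1", "2", "2"),
  pos   = c(100, 250, 50, 400),
  stringsAsFactors = FALSE
)

# toy BED regions (0-based start, end not included in 0-based coordinates)
regions &lt;- data.frame(
  chrom = c("1", "2"),
  start = c(0, 300),
  end   = c(200, 500),
  stringsAsFactors = FALSE
)

# hypothetical helper: does a SNP fall into any of the regions?
overlaps_region &lt;- function(chrom, pos) {
  any(regions$chrom == chrom &amp; regions$start &lt; pos &amp; pos &lt;= regions$end)
}
in_region &lt;- mapply(overlaps_region, snp$chrom, snp$pos, USE.NAMES = FALSE)

# BED file lists regions to keep: sites outside the regions are excluded
excluded_if_keeping &lt;- snp[!in_region, ]

# BED file lists regions to remove: sites inside the regions are excluded
excluded_if_removing &lt;- snp[in_region, ]</code></pre></div>
<p>The two data frames correspond to the two modes of <code>filter_bed</code> shown above (the default and <code>remove = TRUE</code>): in each case, the excluded sites are what ends up in the “exclude” file.</p>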
<p>If we want to run the whole analysis in a single pipeline, we can use the <code>%&gt;%</code> pipe operator and do the following:</p>
<p>(The <code>%&gt;%</code> operator simply takes what is on its left side and passes it as the first argument to the function on its right side. While it takes some time to get used to, it is very useful in longer multi-step “pipelines” because it makes them much more readable. In fact, the resulting code often reads <em>almost</em> like English! The <code>%&gt;%</code> pipe is automatically imported when you load the <code>tidyverse</code> library, and you can read more about it <a href="https://magrittr.tidyverse.org">here</a>.)</p>
<div class="sourceCode" id="cb40"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb40-1" data-line-number="1">snps <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb40-2" data-line-number="2"><span class="st">  </span><span class="kw"><a href="../reference/filter_bed.html">filter_bed</a></span>(<span class="st">"regions.bed"</span>) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb40-3" data-line-number="3"><span class="st">  </span><span class="kw"><a href="../reference/f4ratio.html">d</a></span>(<span class="dt">W =</span> <span class="st">"French"</span>, <span class="dt">X =</span> <span class="st">"Mbuti"</span>, <span class="dt">Y =</span> <span class="st">"Vindija"</span>, <span class="dt">Z =</span> <span class="st">"Chimp"</span>)</a></code></pre></div>
<p>This works because <code>data</code> is always the first argument in the formal definitions of <em>admixr</em> functions, so we don’t have to specify it manually.</p>
<p><strong>Important:</strong> the set of sites to be removed that is generated after running <code><a href="../reference/filter_bed.html">filter_bed()</a></code> is actually saved into a file (it’s a requirement of the underlying ADMIXTOOLS software)! While the function makes it very easy to do filtering without worrying about locations of intermediate files, it is important to keep in mind that the function still creates them. If you plan to run many <em>independent</em> calculations on a filtered subset of the data, it’s better to save the new <code>EIGENSTRAT</code> object to a variable first and re-use the same object multiple times, rather than running the whole pipeline for each analysis separately (which would create new copies of the intermediate file for each iteration).</p>
<p>(Note for ADMIXTOOLS power users: There is a rather obscure option called <a href="https://github.com/DReichLab/AdmixTools/blob/master/README.ROLLOFF#L48"><code>badsnpfile</code></a>, which specifies the path to a snp file with coordinates of sites to exclude from the calculation. The file specified in an “exclude” modifier of an <code>EIGENSTRAT</code> object simply fills in the <code>badsnpfile</code> parameter before starting an ADMIXTOOLS command.)</p>
</div>
<div id="filtering-out-potential-ancient-dna-damage-snps" class="section level3">
<h3 class="hasAnchor">
<a href="#filtering-out-potential-ancient-dna-damage-snps" class="anchor"></a>Filtering out potential ancient DNA damage SNPs</h3>
<p>In the ancient DNA field, we often need to repeat an analysis on a subset of data that is less likely to be influenced by ancient DNA damage, to verify that our results are not caused by artifacts in the data (due to biochemical properties of DNA degradation, ancient DNA damage will lead to an increase in C→T and G→A substitutions). Using a method similar to the one described in the BED filtering section above, we can use the <code>transversions_only()</code> function to generate a snp file with positions that carry transitions (C→T and G→A sites) and get back a new <code>EIGENSTRAT</code> object in which these sites are excluded from downstream <em>admixr</em> analyses.</p>
<div class="sourceCode" id="cb41"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb41-1" data-line-number="1"><span class="co"># generate snp file with positions of transitions</span></a>
<a class="sourceLine" id="cb41-2" data-line-number="2">new_snps &lt;-<span class="st"> </span><span class="kw"><a href="../reference/transversions_only.html">transversions_only</a></span>(snps)</a>
<a class="sourceLine" id="cb41-3" data-line-number="3"></a>
<a class="sourceLine" id="cb41-4" data-line-number="4"><span class="co"># perform the calculation only on transversions</span></a>
<a class="sourceLine" id="cb41-5" data-line-number="5"><span class="kw"><a href="../reference/f4ratio.html">d</a></span>(<span class="dt">W =</span> <span class="st">"French"</span>, <span class="dt">X =</span> <span class="st">"Dinka"</span>, <span class="dt">Y =</span> <span class="st">"Altai"</span>, <span class="dt">Z =</span> <span class="st">"Chimp"</span>, <span class="dt">data =</span> new_snps)</a></code></pre></div>
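<p>The distinction between transitions and transversions can be sketched in base R on a toy table of alleles. This is an illustration of the classification only, not the code used by <code>transversions_only</code>; the table <code>snp_alleles</code> and the helper <code>is_transition</code> are hypothetical names. Transitions are taken to be the A/G and C/T allele pairs (in either direction); everything else is a transversion.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># toy table of SNPs with their two alleles
snp_alleles &lt;- data.frame(
  id  = paste0("snp", 1:5),
  ref = c("C", "G", "A", "T", "A"),
  alt = c("T", "A", "C", "G", "G"),
  stringsAsFactors = FALSE
)

# hypothetical helper: TRUE for A/G and C/T pairs, in either direction
is_transition &lt;- function(ref, alt) {
  paste0(ref, alt) %in% c("AG", "GA", "CT", "TC")
}

transition &lt;- is_transition(snp_alleles$ref, snp_alleles$alt)

# sites that would be excluded (transitions) and kept (transversions)
excluded_sites &lt;- snp_alleles$id[transition]
kept_sites &lt;- snp_alleles$id[!transition]</code></pre></div>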
<p>Again, we could combine several filtering steps into one pipeline:</p>
<div class="sourceCode" id="cb42"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb42-1" data-line-number="1">snps <span class="op">%&gt;%</span><span class="st">                                    </span><span class="co"># take the original data</span></a>
<a class="sourceLine" id="cb42-2" data-line-number="2"><span class="st">  </span><span class="kw"><a href="../reference/filter_bed.html">filter_bed</a></span>(<span class="st">"regions.bed"</span>, <span class="dt">remove =</span> <span class="ot">TRUE</span>) <span class="op">%&gt;%</span><span class="st">  </span><span class="co"># remove sites that fall in the specified regions</span></a>
<a class="sourceLine" id="cb42-3" data-line-number="3"><span class="st">  </span><span class="kw"><a href="../reference/transversions_only.html">transversions_only</a></span>() <span class="op">%&gt;%</span><span class="st">                      </span><span class="co"># remove potential false SNPs due to aDNA damage</span></a>
<a class="sourceLine" id="cb42-4" data-line-number="4"><span class="st">  </span><span class="kw"><a href="../reference/f4ratio.html">d</a></span>(<span class="dt">W =</span> <span class="st">"French"</span>, <span class="dt">X =</span> <span class="st">"Dinka"</span>, <span class="dt">Y =</span> <span class="st">"Altai"</span>, <span class="dt">Z =</span> <span class="st">"Chimp"</span>) <span class="co"># calculate D on the filtered dataset</span></a></code></pre></div>
</div>
</div>
<div id="merging-eigenstrat-datasets" class="section level2">
<h2 class="hasAnchor">
<a href="#merging-eigenstrat-datasets" class="anchor"></a>Merging EIGENSTRAT datasets</h2>
<p>Another useful data processing function is <code>merge_eigenstrat</code>. This function takes two EIGENSTRAT datasets and merges them, producing a union of samples and intersection of SNPs from both of them. It returns a new <code>EIGENSTRAT</code> object that can be directly used in <em>admixr</em> analyses.</p>
<div class="sourceCode" id="cb43"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb43-1" data-line-number="1"><span class="co"># this is just example code - it will not run unless you specify the paths</span></a>
<a class="sourceLine" id="cb43-2" data-line-number="2">merged &lt;-<span class="st"> </span><span class="kw"><a href="../reference/merge_eigenstrat.html">merge_eigenstrat</a></span>(</a>
<a class="sourceLine" id="cb43-3" data-line-number="3">    <span class="dt">merged =</span> <span class="op">&lt;</span>prefix_of_the_merged_dataset<span class="op">&gt;</span>,</a>
<a class="sourceLine" id="cb43-4" data-line-number="4">    <span class="dt">a =</span> <span class="op">&lt;</span>first EIGENSTRAT object<span class="op">&gt;</span>,</a>
<a class="sourceLine" id="cb43-5" data-line-number="5">    <span class="dt">b =</span> <span class="op">&lt;</span>second EIGENSTRAT object<span class="op">&gt;</span></a>
<a class="sourceLine" id="cb43-6" data-line-number="6">)</a></code></pre></div>
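<p>What “union of samples and intersection of SNPs” means can be shown on two toy genotype matrices in base R. This is a conceptual sketch only and says nothing about how <code>merge_eigenstrat</code> works internally; the matrices, SNP identifiers and the name <code>merged_toy</code> are made up for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># toy dataset a: three SNPs genotyped in two samples
a &lt;- matrix(
  c(0, 1, 2,    # French
    2, 2, 0),   # Han
  nrow = 3,
  dimnames = list(c("rs1", "rs2", "rs3"), c("French", "Han"))
)

# toy dataset b: three SNPs genotyped in one sample
b &lt;- matrix(
  c(1, 0, 2),   # Vindija
  nrow = 3,
  dimnames = list(c("rs2", "rs3", "rs4"), "Vindija")
)

# intersection of SNPs: only sites present in both datasets are kept
shared_snps &lt;- intersect(rownames(a), rownames(b))

# union of samples: columns of both datasets side by side
merged_toy &lt;- cbind(a[shared_snps, , drop = FALSE], b[shared_snps, , drop = FALSE])
merged_toy</code></pre></div>
<p>The result contains all three samples, but only the two SNPs (rs2 and rs3) that are shared by both inputs.</p>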
</div>
  </div>

  <div class="col-md-3 hidden-xs hidden-sm" id="sidebar">
        <div id="tocnav">
      <h2 class="hasAnchor">
<a href="#tocnav" class="anchor"></a>Contents</h2>
      <ul class="nav nav-pills nav-stacked">
<li><a href="#introduction">Introduction</a></li>
      <li><a href="#installation">Installation</a></li>
      <li><a href="#a-note-about-eigenstrat-format">A note about EIGENSTRAT format</a></li>
      <li><a href="#philosophy-of-admixr">Philosophy of <em>admixr</em></a></li>
      <li><a href="#internal-representation-of-eigenstrat-data">Internal representation of EIGENSTRAT data</a></li>
      <li><a href="#d-statistic"><span class="math inline">\(D\)</span> statistic</a></li>
      <li><a href="#f_4-statistic"><span class="math inline">\(f_4\)</span> statistic</a></li>
      <li><a href="#f_4-ratio-statistic"><span class="math inline">\(f_4\)</span>-ratio statistic</a></li>
      <li><a href="#f_3-statistic"><span class="math inline">\(f_3\)</span> statistic</a></li>
      <li><a href="#qpwave-and-qpadm">qpWave and qpAdm</a></li>
      <li><a href="#grouping-samples">Grouping samples</a></li>
      <li><a href="#counting-presentmissing-snps">Counting present/missing SNPs</a></li>
      <li><a href="#data-filtering">Data filtering</a></li>
      <li><a href="#merging-eigenstrat-datasets">Merging EIGENSTRAT datasets</a></li>
      </ul>
</div>
      </div>

</div>


      <footer><div class="copyright">
  <p>Developed by Martin Petr.</p>
</div>

<div class="pkgdown">
  <p>Site built with <a href="http://pkgdown.r-lib.org/">pkgdown</a>.</p>
</div>

      </footer>
</div>

  
  </body>
</html>