
\documentclass[10pt]{report}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{geometry}
\geometry{margin=3cm}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{natbib}
\usepackage{scrextend}
\usepackage{graphicx}
\usepackage{placeins}
\usepackage{booktabs}
\usepackage{ltablex}
%\usepackage[table]{xcolor} % see global chunk options
\usepackage{soul} % include for kable
\usepackage{longtable}
\usepackage{array}
\usepackage{multirow}
\usepackage{wrapfig}
\usepackage{float}
\usepackage{colortbl}
\usepackage{pdflscape}
\usepackage{tabu}
\usepackage{threeparttable}
\usepackage[normalem]{ulem}
\usepackage[export]{adjustbox} % to add boxes around screenshots
\usepackage{tikz}
\usepackage{tikz-3dplot}

\graphicspath{ {Figures/} }
\newcommand{\dna}{\texttt{DNA}}
\newcommand{\rdna}{\texttt{rDNA}}
\newcommand{\rstudio}{\texttt{RStudio}}
\newcommand{\java}{\texttt{Java}}
\newcommand{\rjava}{\texttt{rJava}}
\newcommand{\R}{\texttt{R}}
\newcommand{\ucinet}{\texttt{Ucinet}}
\newcommand{\netminer}{\texttt{NetMiner}}
\newcommand{\gephi}{\texttt{Gephi}}
\newcommand{\visone}{\texttt{visone}}
\newcommand{\win}{\raisebox{-0.1em}{\includegraphics[height=1.5\fontcharht\font`\B]{03-6-winbutton}}}
\newcommand{\rrun}{\raisebox{-0.1em}{\includegraphics[height=1.5\fontcharht\font`\B]{03-5-rrun}}}
\newcommand{\github}{\href{https://github.com/leifeld/dna}{\texttt{GitHub}}}
\newcommand*{\fullref}[1]{\hyperref[{#1}]{ \nameref*{#1}}} % One single link
\newcommand{\code}[1]{% same color and decoration as knitr bash command
  \textcolor{codecolort}{%
    \sethlcolor{codecolorbg}\hl{%
      \texttt{#1}%
    }%
  }%
}
\newcommand{\infobox}[2]{%
  \begin{center}\fbox{%
    \parbox{#1}{#2}%
  }\end{center}%
}

\definecolor{codecolorbg}{rgb}{0.969, 0.969, 0.969}
\definecolor{codecolort}{rgb}{0.345, 0.345, 0.345}
\definecolor{black}{RGB}{0,0,0}
\definecolor{grey}{RGB}{240,240,240}
\definecolor{white}{RGB}{255,255,255}

\usepackage{suffix}

\newcommand\chapterauthor[1]{\authortoc{#1}\printchapterauthor{#1}}
\WithSuffix\newcommand\chapterauthor*[1]{\printchapterauthor{#1}}

\makeatletter
\newcommand{\printchapterauthor}[1]{%
  {\parindent0pt\vspace*{-25pt}%
  \linespread{1.1}\large\scshape#1%
  \par\nobreak\vspace*{35pt}}
  \@afterheading%
}
\newcommand{\authortoc}[1]{%
  \addtocontents{toc}{\vskip-10pt}%
  \addtocontents{toc}{%
    \protect\contentsline{chapter}%
    {\hskip1.3em\mdseries\scshape\protect\small#1}{}{}}
  \addtocontents{toc}{\vskip5pt}%
}
\makeatother


\newcommand{\Autorname}{\addtocontents{toc}{\hspace{0.47cm}\emph{\aut}\par}}

\setlength{\parindent}{0em}
\setlength{\parskip}{0.5em}

\bibpunct[: ]{(}{)}{;}{a}{}{,}

\emergencystretch 1.5em
\widowpenalty=10000
\clubpenalty=10000
\raggedbottom

\pagenumbering{roman} % roman numbering in frontmatter

\usepackage[
  unicode=true,
  pdfusetitle,
  bookmarks=true,
  bookmarksnumbered=true,
  bookmarksopen=true,
  bookmarksopenlevel=2,
  breaklinks=true,
  colorlinks=true,
  pdfstartview={XYZ null null 1},
  citecolor={blue}
 ]{hyperref}

\begin{document}
<<setup, include=FALSE, cache=FALSE, results='hide', message=FALSE, warning=FALSE>>=
library("knitr")
# get latest R version (fall back to a fixed version if the lookup fails)
site <- tryCatch({
  readLines("https://cran.r-project.org/bin/windows/base/", n = 10)
},
error = function(e) {
  warning("Internet connection necessary to check for newest R version")
  character(0)
})
R_vers <- site[grepl("^<title>", site)]
# escape the dots so that only literal "x.y.z" version strings match
R_vers <- unlist(regmatches(R_vers, gregexpr("\\d+\\.\\d+\\.\\d+", R_vers)))[1]
if (length(R_vers) == 0 || is.na(R_vers)) {
  R_vers <- "3.5.0"
}
# get latest RStudio version (same fallback logic)
site <- tryCatch({
  readLines("https://www.rstudio.com/products/rstudio/download/")
},
error = function(e) {
  warning("Internet connection necessary to check for newest RStudio version")
  character(0)
})
RS_vers <- site[grepl("<h4 id=\"download\"><strong>RStudio Desktop", site)]
RS_vers <- unlist(regmatches(RS_vers, gregexpr("\\d+\\.\\d+\\.\\d+", RS_vers)))[1]
if (length(RS_vers) == 0 || is.na(RS_vers)) {
  RS_vers <- "1.1.447"
}

# set global chunk options
opts_chunk$set(fig.path = 'figure/workshop-', fig.align = 'center', fig.show = 'hold', error = FALSE)
options(formatR.arrow = TRUE, width = 90, knitr.table.format = "latex")
knit_hooks$set(crop=hook_pdfcrop,
               document = function(x) {
                 x <- gsub('$RS_vers$', RS_vers, x, fixed = TRUE)
                 x <- sub('\\usepackage[]{color}', '\\usepackage[table]{xcolor}', x, fixed = TRUE)
                 x
                 })
@

\title{Discourse Network Analyzer Manual}
\date{\footnotesize{Last update: DNA 2.0 beta 22 with rDNA \Sexpr{packageVersion("rDNA")} on \today.}}
\author{Philip Leifeld, Johannes Gruber and Felix Rolf Bossner}

\maketitle
\setcounter{tocdepth}{1}
% change linkcolor in TOC
{\hypersetup{linkcolor=black}
\tableofcontents
}


\chapter{Introduction} \label{chp:intro}
\chapterauthor{Philip Leifeld and Johannes Gruber}
\FloatBarrier
\pagenumbering{arabic}

This manual demonstrates how to install, set up, and use the open-source standalone software \texttt{Discourse Network Analyzer} (\dna) and its companion \R\ package \rdna\ \citep{leifeld2018rdna}, which are designed for researchers using the method \emph{discourse network analysis}.%
\footnote{\emph{This manual is a work in progress and will be continuously updated during the year 2018.
See \url{https://github.com/leifeld/dna/blob/master/manual/} for the most recent version}.}
By combining content analysis and dynamic network analysis, this method can reveal the structure and dynamics of policy debates.
The method comprises three basic steps:
\begin{enumerate}
 \item annotating statements of actors in unstructured (text) sources,
 \item creating networks from the resulting structured data,
 \item analysing and interpreting the results by employing the toolbox of network analysis.
\end{enumerate}
The results can take a number of different forms, such as so-called congruence or conflict networks of actors or of concepts, affiliation networks of actors and concepts, and longitudinal versions of these networks (see Chapter~\ref{chp:algorithms} and \citealt{leifeld2017discourse} for a comprehensive overview of the method).

The benefit of using the \java\ software \dna\ is that it is specifically designed to aid the user in the first two of these basic steps of discourse network analysis.
It is mainly designed for qualitative annotation of actors' statements in order to structure the text.
The program can also create different kinds of network matrices based on these structured data and export them to other programs for further analysis and plotting.
Additionally, while the software is primarily designed for discourse network analysis of actors and concepts, it is also flexible with regard to the definition of new statement types, for example using user-defined variables like ``location'' or ``addressee'' (see Section~\ref{sec:stattype}).
While there are numerous alternative software packages for qualitative content analysis, very few were developed specifically with discourse network analysis in mind, and most therefore lack the functionality necessary for exporting network data.

The companion package \rdna\ for the statistical computing environment \R\ additionally helps with the third step mentioned above: analysis of the annotated statements.
\rdna\ takes the structured data from \dna\ and permits further in-depth analysis using network analysis.
While data can also be exported to other software such as \ucinet, \visone, \netminer, and \gephi, \R\ is the preferred choice as it facilitates reproducible research, is free and open source, and has a large community of users and developers who are engaged in all kinds of data analysis tasks.
\R\ has several packages developed specifically for network analysis, such as \texttt{statnet} \citep{handcock2008statnet}, \texttt{igraph} \citep{csardi2006igraph}, \texttt{xergm} \citep{leifeld2018temporal, leifeld2017xergm}, \texttt{sna} \citep{butts2016sna}, \texttt{network} \citep{butts2008network}, \texttt{ggraph} \citep{linpedersen2017ggraph}, and \texttt{tidygraph} \citep{linpedersen2017tidygraph}.
Most of these packages work seamlessly with data processed by \rdna\ and therefore add a myriad of possibilities to the native functions of our own \R\ package.

In recent years, discourse network analysis has been employed by a growing number of scholars across a wide range of policy sectors, such as
pension politics \citep{leifeld2013reconceptualising, leifeld2016policy},
climate politics \citep{fisher2013mapping, fisher2013where, broadbent2014inter, gkiouzepas2015climate, manfredo2014society, schneider2014punctuations, stoddart2015canada, wagner2017trends, yun2014framing},
software patents and property rights \citep{leifeld2012software},
internet policy \citep{breindl2013discourse, haunss2009ip},
infrastructure projects \citep{nagel2016polarisierung},
energy policy \citep{brutschin2013dynamics, haunss2017ausstieg, imbert2017inquiry, rinscheid2015crisis},
shooting rampages \citep{hurka2013framing},
abortion \citep{muller2014beleidscontroverse, muller2014discourscoalities, muller2015discourse},
outdoor sports \citep{stoddartetal2015environmentalists},
deforestation \citep{rantala2014multistakeholder},
higher education \citep{naegler2015partner},
international financial politics \citep{haunss2017finace},
and online deception \citep{wu2015dobnet}.

While a default toolbox of methods is available for discourse network analysis, new methods are currently being developed.
For example, one promising approach is the application of inferential network models to the temporal network structure produced by \dna\ in order to model policy debates at the micro level.
This will soon allow us to develop and test theories on how actors contribute statements to policy debates, and to forecast debates based on these theories \citep[for an outlook, see][]{leifeld2017discourse}.

The outline of this manual is as follows.
Chapter~\ref{chp:algorithms} describes the types of networks \dna\ can export.
Chapter~\ref{chp:installation} explains how to install \dna\ and \rdna, which both rely on a correctly configured \java\ runtime environment.
Consult this chapter only if you run into problems when installing the software on your own.
The following four chapters describe the usage of \dna\ in detail:
Chapter~\ref{chp:dna-prep} describes how to set up a project in \dna, including the creation of a database, adding and managing users, and how to set up or edit statement types and variables.
Chapter~\ref{chp:dna-import} explains how you can import and organise your raw data (i.\,e., documents).
Chapter~\ref{chp:dna-coding} provides an overview of how you annotate statements in \dna.
Even though this process is very straightforward, the chapter also introduces some functions that can help you to annotate material faster and more reliably.
Chapter~\ref{chp:dna-export} explains how data can be exported to other programs for further analysis.
What may have seemed abstract in Chapter~\ref{chp:algorithms} quickly becomes clear at this point---once you have exported a few example networks yourself.
Chapter~\ref{chp:rdna} is an introductory tutorial on using the \rdna\ package to perform additional analysis and plotting tasks using the infrastructure provided by \R.

Both \dna\ and \rdna\ can be downloaded from \github\ (see Chapter~\ref{chp:installation}).
Please feel free to post questions and bug reports to the issue tracker on \github.


\chapter{Methods for Network Construction} \label{chp:algorithms}
\chapterauthor{Philip Leifeld}
\FloatBarrier
This chapter summarises the main network algorithms implemented in \dna, first graphically and then using mathematical notation.

\section{Graphical Intuition}
Figure~\ref{fig:algo_aff} illustrates how actors (as yellow nodes on the left) and concepts (as blue nodes on the right) are connected by dashed lines.
These dashed lines represent the edges of a bipartite graph, also called an affiliation network.
Substantively, these edges represent statements that were annotated in a policy debate.
For example, actor~5 refers to concepts~3 and~4 in the debate.

\tikzset{
 actor/.style ={circle,draw=red!50!yellow,fill=red!20!yellow,thick,inner sep=1pt},
 category/.style ={circle,draw=blue!70,fill=blue!40,thick,inner sep=1pt},
 grey/.style ={line width=0.5mm, dashed,color=gray,inner sep=0pt},
 black/.style ={line width=0.5mm},
 annotation/.style = {fill=gray!50, rounded corners=3}
}

\begin{figure}[tbp]
 \begin{center}
  \begin{tikzpicture}
   \node [actor] (a1) at (0,4) {$a_1$};
   \node [actor] (a2) at (1,3) {$a_2$};
   \node [actor] (a3) at (0,2) {$a_3$};
   \node [actor] (a4) at (1,1) {$a_4$};
   \node [actor] (a5) at (0,0) {$a_5$};
   \node [category] (c1) at (5,4) {$c_1$};
   \node [category] (c2) at (7,3) {$c_2$};
   \node [category] (c3) at (6,2) {$c_3$};
   \node [category] (c4) at (5,0) {$c_4$};
   \node [category] (c5) at (7,1) {$c_5$};
   \draw [grey] (a1) to (c1);
   \draw [grey] (a2) to (c1);
   \draw [grey] (a2) to (c2);
   \draw [grey] (a2) to (c3);
   \draw [grey] (a3) to (c3);
   \draw [grey] (a3) to (c5);
   \draw [grey] (a4) to (c3);
   \draw [grey] (a5) to (c3);
   \draw [grey] (a5) to (c4);
   \draw [black] (a1) to (a2);
   \draw [black] (a2) to (a3);
   \draw [black] (a2) to (a5);
   \draw [black] (a3) to (a4);
   \draw [black] (a3) to (a5);
   \draw [black] (a2) to (a4);
   \draw [black] (a4) to (a5);
   \draw [black] (c1) to (c2);
   \draw [black] (c1) to (c3);
   \draw [black] (c2) to (c3);
   \draw [black] (c3) to (c4);
   \draw [black] (c3) to (c5);
   \node at (0,-2) {};
   \node at (7,5.5) {};
   \node [annotation] at (0,5) {actors};
   \node [annotation] at (6,5) {concepts};
   \node [annotation, text width=1.5cm, right=-3mm] at (2.5,-1) {affiliation network};
   \node [annotation, text width=1.5cm, right=-6mm] at (0,-1) {actor network};
   \node [annotation, text width=1.5cm, right=-6mm] at (5.5,-1) {concept network};
  \end{tikzpicture}
  \caption{Illustration: Affiliation network (dashed lines) between actors (yellow nodes, variable~1) and concepts (blue nodes, variable~2) and their induced actor congruence network (solid lines on the left) and concept congruence network (solid lines on the right).}
  \label{fig:algo_aff}
 \end{center}
\end{figure}

Based on this bipartite graph, an actor congruence network and a concept congruence network can be inferred.
For example, actors~1 and~2 jointly refer to the same concept~1, hence they are directly connected by an edge in the actor congruence network illustrated on the left.
If actors~1 and~2 shared more than one concept, their edge weight would be proportional to the number of concepts they shared.
Substantively, the strength of connection between two actors can be interpreted as their similarity in terms of the concepts they employ in the policy debate.

Conversely, concepts~1 and~3 are jointly referred to by the same actor~2, hence they are directly connected by an edge in the concept congruence network illustrated on the right.
If concepts~1 and~3 were jointly referred to by more than one actor, their edge weight would be greater than one.
Substantively, the edge weights between concepts can be interpreted as their similarity in terms of the actors that employ them in the policy debate.
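
As a minimal worked illustration, the affiliation network in Figure~\ref{fig:algo_aff} can be written as a binary matrix with actors $a_1, \dots, a_5$ in the rows and concepts $c_1, \dots, c_5$ in the columns.
The symbol $A$ is introduced only for this illustration and is not part of the notation used in the remainder of this chapter.
Assuming that each dashed line stands for exactly one statement and that qualifiers are disregarded, the two one-mode networks follow from simple matrix products:
\begin{align*}
 A &= \begin{pmatrix}
  1 & 0 & 0 & 0 & 0 \\
  1 & 1 & 1 & 0 & 0 \\
  0 & 0 & 1 & 0 & 1 \\
  0 & 0 & 1 & 0 & 0 \\
  0 & 0 & 1 & 1 & 0
 \end{pmatrix}, \\
 A A^\top &= \begin{pmatrix}
  1 & 1 & 0 & 0 & 0 \\
  1 & 3 & 1 & 1 & 1 \\
  0 & 1 & 2 & 1 & 1 \\
  0 & 1 & 1 & 1 & 1 \\
  0 & 1 & 1 & 1 & 2
 \end{pmatrix}, \qquad
 A^\top A = \begin{pmatrix}
  2 & 1 & 1 & 0 & 0 \\
  1 & 1 & 1 & 0 & 0 \\
  1 & 1 & 4 & 1 & 1 \\
  0 & 0 & 1 & 1 & 0 \\
  0 & 0 & 1 & 0 & 1
 \end{pmatrix}.
\end{align*}
The non-zero off-diagonal entries of $A A^\top$ correspond to the seven solid edges of the actor network, and those of $A^\top A$ to the five solid edges of the concept network.
The diagonal entries merely count how many concepts an actor refers to (or how many actors refer to a concept) and are not drawn as edges.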

Actors and concepts are merely one substantive application; other variables could have been encoded as variables~1 and~2 instead, such as persons and locations or speakers and addressees.

Simply modelling referral of a concept by an actor, however, is insufficient to capture agreement and opposition in policy debates.
For example, actor~1 and actor~2 may either both support concept~1, they may both reject concept~1, or one of them may refer to concept~1 in a positive way while the other one may refer to concept~1 in a negative way.
Depending on these configurations, one would infer a congruent or a conflictual relationship between actors~1 and~2.

Figure~\ref{fig:algo_binarycongruence} illustrates how two types of networks can be generated, congruence and conflict networks.
In a congruence network, edges are counted when both actors co-support or co-reject a concept.
In a conflict network, edges are counted when the two actors' agreement patterns differ.

\begin{figure}[tbp]
 \begin{center}
  \begin{tikzpicture}
   \node [actor] (a1) at (0,1) {$a_1$};
   \node [actor] (a2) at (0,0) {$a_2$};
   \node [category] (c1) at (3,0.5) {$c_1$};
   \draw [grey] (a1) to node [above=1mm,circle,solid,line width=0.2mm] {\textbf{+}} (c1);
   \draw [grey] (a2) to node [below=1mm,circle,solid,line width=0.2mm] {\textbf{+}} (c1);
   \draw [black] (a1) to (a2);
   \node [annotation,right=-3mm] at (0,2.3) {congruence networks};
  
   \node [actor] (a1) at (5.5,1) {$a_1$};
   \node [actor] (a2) at (5.5,0) {$a_2$};
   \node [category] (c1) at (8.5,0.5) {$c_1$};
   \draw [grey] (a1) to node [above=1mm,circle,solid,line width=0.2mm] {\textbf{+}} (c1);
   \draw [grey] (a2) to node [below=1mm,circle,solid,line width=0.2mm] {\textbf{--}} (c1);
   \draw [black] (a1) to (a2);
   \node [annotation,right=-3mm] at (5.5,2.3) {conflict networks};
  
   \node [actor] (a1) at (0,-1.5) {$a_1$};
   \node [actor] (a2) at (0,-2.5) {$a_2$};
   \node [category] (c1) at (3,-2) {$c_1$};
   \draw [grey] (a1) to node [above=1mm,circle,solid,line width=0.2mm] {\textbf{--}} (c1);
   \draw [grey] (a2) to node [below=1mm,circle,solid,line width=0.2mm] {\textbf{--}} (c1);
   \draw [black] (a1) to (a2);
  
   \node [actor] (a1) at (5.5,-1.5) {$a_1$};
   \node [actor] (a2) at (5.5,-2.5) {$a_2$};
   \node [category] (c1) at (8.5,-2) {$c_1$};
   \draw [grey] (a1) to node [above=1mm,circle,solid,line width=0.2mm] {\textbf{--}} (c1);
   \draw [grey] (a2) to node [below=1mm,circle,solid,line width=0.2mm] {\textbf{+}} (c1);
   \draw [black] (a1) to (a2);
  \end{tikzpicture}
  \caption{Illustration: congruence and conflict networks with a binary qualifier variable.}
  \label{fig:algo_binarycongruence}
 \end{center}
\end{figure}

Qualifiers like ``agreement'' need not be binary (e.\,g., positive or negative).
They could be represented by integer weights, such as intensity of agreement on a scale from $-5$ to $+5$.
In this case, one would need to define the edge weights between nodes in a congruence network as the absolute difference between the two weights in the affiliation network subtracted from the maximum possible difference and divided by the maximum possible difference.
Conversely, in a conflict network, one would need to define the edge weights between nodes as the absolute difference between the two weights in the affiliation network divided by the maximum possible difference, such that a value of 0 represents no conflict and 1 represents maximal conflict.
In either case, these fractions would need to be summed over all concepts (or, more generally, over all nodes of the second variable).
This calculation is illustrated in Figure~\ref{fig:algo_integercongruence}.
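As a minimal numerical sketch, assume a scale from $-5$ to $+5$, so that the maximum possible difference is $10$, and one statement per actor on the concept.
Two actors with weights $+2$ and $+5$ then obtain
\begin{equation*}
 \text{congruence: } \frac{10 - \vert 2 - 5 \vert}{10} = \frac{7}{10}, \qquad \text{conflict: } \frac{\vert 2 - 5 \vert}{10} = \frac{3}{10},
\end{equation*}
whereas weights of $-4$ and $+2$ yield $\frac{10 - \vert {-4} - 2 \vert}{10} = \frac{4}{10}$ for congruence and $\frac{\vert {-4} - 2 \vert}{10} = \frac{6}{10}$ for conflict.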

\begin{figure}[tbp]
 \begin{center}
  \begin{tikzpicture}
   \node [actor] (a1) at (0,1) {$a_1$};
   \node [actor] (a2) at (0,0) {$a_2$};
   \node [category] (c1) at (3,0.5) {$c_1$};
   \draw [grey] (a1) to node [above=1mm,circle,solid,line width=0.2mm] {$+2$} (c1);
   \draw [grey] (a2) to node [below=1mm,circle,solid,line width=0.2mm] {$+5$} (c1);
   \draw [black] (a1) to node [left=1mm,circle,solid,line width=0.2mm] {$\frac{7}{10}$} (a2);
   \node [annotation,right=-3mm] at (0,2.3) {congruence networks};
  
   \node [actor] (a1) at (5.5,1) {$a_1$};
   \node [actor] (a2) at (5.5,0) {$a_2$};
   \node [category] (c1) at (8.5,0.5) {$c_1$};
   \draw [grey] (a1) to node [above=1mm,circle,solid,line width=0.2mm] {$+2$} (c1);
   \draw [grey] (a2) to node [below=1mm,circle,solid,line width=0.2mm] {$+5$} (c1);
   \draw [black] (a1) to node [left=1mm,circle,solid,line width=0.2mm] {$\frac{3}{10}$} (a2);
   \node [annotation,right=-3mm] at (5.5,2.3) {conflict networks};
  
   \node [actor] (a1) at (0,-1.5) {$a_1$};
   \node [actor] (a2) at (0,-2.5) {$a_2$};
   \node [category] (c1) at (3,-2) {$c_1$};
   \draw [grey] (a1) to node [above=1mm,circle,solid,line width=0.2mm] {$-4$} (c1);
   \draw [grey] (a2) to node [below=1mm,circle,solid,line width=0.2mm] {$+2$} (c1);
   \draw [black] (a1) to node [left=1mm,circle,solid,line width=0.2mm] {$\frac{4}{10}$} (a2);
  
   \node [actor] (a1) at (5.5,-1.5) {$a_1$};
   \node [actor] (a2) at (5.5,-2.5) {$a_2$};
   \node [category] (c1) at (8.5,-2) {$c_1$};
   \draw [grey] (a1) to node [above=1mm,circle,solid,line width=0.2mm] {$-4$} (c1);
   \draw [grey] (a2) to node [below=1mm,circle,solid,line width=0.2mm] {$+2$} (c1);
   \draw [black] (a1) to node [left=1mm,circle,solid,line width=0.2mm] {$\frac{6}{10}$} (a2);
  \end{tikzpicture}
  \caption{Illustration: congruence and conflict networks with an integer qualifier variable.}
  \label{fig:algo_integercongruence}
 \end{center}
\end{figure}

Finally, it may be necessary to normalise the resulting affiliation, congruence, or conflict network.
Normalisation may be necessary to avoid a core--periphery structure in which the actors who refer to the most concepts end up at the centre of the network.
Normalisation corrects for the verbosity of actors (or, more generally, for the centrality of a node in the affiliation network).
Several normalisation methods are available, and they will be described below.
Graph clustering can then be applied to the normalised networks to identify coalitions in policy debates.
More details, especially on the topic of normalisation, can be found in \citet{leifeld2017discourse}.

The next section will introduce some formal notation to represent the data structures and transformations introduced above; then these transformations will be re-introduced more formally using mathematical notation, and finally normalisation methods will be proposed.

\section{Notation}\label{sec:notation}
$X$ is a three-dimensional array representing statement counts.
$x_{ijk}$ is a specific count value in this array, with the first index $i$ denoting an instance of the first variable (e.\,g., organization or actor $i$), the second index $j$ denoting an instance of the second variable (e.\,g., concept $j$), and the third index $k$ denoting a level on the qualifier variable (e.\,g., agreement = $1$).
For example, $x_{ijk} = 5$ could mean that actor $i$ mentions concept $j$ with intensity $k$ five times.
$X$ can be represented as a cuboid, as illustrated in Figure~\ref{fig:algo_cuboid}.

\begin{figure}
 \begin{center}
  % diagram adjusted from https://latex.org/know-how/440-tikz-3dplot
  \tdplotsetmaincoords{60}{125}
  \begin{tikzpicture}
   [tdplot_main_coords,
    grid/.style={very thin,gray},
    axis/.style={->,blue,thick},
    cube/.style={opacity=.5,very thick,fill=red}]
   %draw a grid in the x-y plane
   \foreach \x in {-0.5,0,...,2.5}
    \foreach \y in {-0.5,0,...,2.5}
     {
      \draw[grid] (\x,-0.5) -- (\x,2.5);
      \draw[grid] (-0.5,\y) -- (2.5,\y);
     }			
  
   %draw the axes
   \draw[axis] (0,0,0) -- (3,0,0) node[anchor=west]{$i$};
   \draw[axis] (0,0,0) -- (0,3,0) node[anchor=west]{$j$};
   \draw[axis] (0,0,0) -- (0,0,3) node[anchor=west]{$k$};
  
   %draw the bottom of the cube
   \draw[cube] (0,0,0) -- (0,2,0) -- (2,2,0) -- (2,0,0) -- cycle;
  
   %draw the back-right of the cube
   \draw[cube] (0,0,0) -- (0,2,0) -- (0,2,2) -- (0,0,2) -- cycle;
  
   %draw the back-left of the cube
   \draw[cube] (0,0,0) -- (2,0,0) -- (2,0,2) -- (0,0,2) -- cycle;
  
   %draw the front-right of the cube
   \draw[cube] (2,0,0) -- (2,2,0) -- (2,2,2) -- (2,0,2) -- cycle;
  
   %draw the front-left of the cube
   \draw[cube] (0,2,0) -- (2,2,0) -- (2,2,2) -- (0,2,2) -- cycle;
  
   %draw the top of the cube
   \draw[cube] (0,0,2) -- (0,2,2) -- (2,2,2) -- (2,0,2) -- cycle;
  \end{tikzpicture}
  \caption{Statements in the $X$ array can be represented in a cuboid data structure.}
  \label{fig:algo_cuboid}
 \end{center}
\end{figure}

Where the qualifier variable is binary, \emph{false} values are represented as $0$ and \emph{true} values as $1$ on the $k$ index, i.\,e., $K^\text{binary} = \{ 0; 1 \}$.
Where the qualifier variable is integer, the respective integer value is used as the level.
This implies that $k$ can take positive or negative values or 0, i.\,e., $K^\text{integer} \subseteq \mathbb{Z}$.
Note that all $k$ levels of the scale are included in $K$, not just those values that are empirically observed.
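As a small hypothetical illustration (the counts are invented and the slice notation $X_{\cdot\cdot k}$ is used only here), consider two actors, two concepts, and a binary qualifier.
$X$ then consists of one $2 \times 2$ slice per level $k$, with actors in the rows and concepts in the columns:
\begin{equation*}
 X_{\cdot\cdot 0} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad
 X_{\cdot\cdot 1} = \begin{pmatrix} 2 & 0 \\ 1 & 3 \end{pmatrix}.
\end{equation*}
Here, $x_{1,1,1} = 2$ means that the first actor supports the first concept in two statements, and $x_{1,2,0} = 1$ means that the first actor rejects the second concept in one statement.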

Indices with a prime denote a second instance of an element, e.\,g., $i'$ may denote another organization.
$Y$ denotes the output matrix to be obtained by applying a transformation to $X$.
Several transformations are possible and will be described below.

\section{Construction of One-Mode Networks} \label{sec:onemode}

\subsection{Congruence Networks} \label{subsec:congruence}
In a congruence network, the edge weight between nodes $i$ and $i'$ represents the number of times they co-support or co-reject second-variable nodes (if a binary qualifier is used) or the cumulative similarity between $i$ and $i'$ over their assessments of second-variable nodes (in the case of an integer qualifier variable).

In the integer case, the similarity between nodes $i$ and $i'$ is defined as the cumulative similarity over levels $k$ of the qualifier variable:
\begin{equation} \label{eq:congruence_integer}
 y_{ii'}^\text{congruence} = \Phi_{ii'}\left( \sum_{j = 1}^n \sum_{k} \sum_{k'} x_{ijk} x_{i'jk'} \left( 1 - \frac{\vert k - k' \vert}{\vert K \vert - 1} \right) \right)
\end{equation}
where $n$ is the number of second-variable nodes and $\Phi_{ii'}(\cdot)$ denotes a normalisation function (to be specified below).

Here, $\vert k - k' \vert$ is the difference in the assessment of second-variable node $j$ (e.\,g., a concept) by the two first-variable nodes $i$ and $i'$.
$\vert K \vert - 1$ is the maximum difference there can be, with $\vert K \vert$ indicating the number of levels of the qualifier variable.
For example, if the qualifier scale is $[-5; 5]$, $\vert K \vert - 1 = 10$.
The subtraction in the parentheses serves to convert distances (as in a conflict network) to similarities.
The similarities are summed over all statements: for each second-variable node $j$ (e.\,g., each concept), the similarity is computed for every combination of the levels $k$ (for node $i$) and $k'$ (for node $i'$) and weighted by how often that combination occurs.
This weighting occurs in the $x_{ijk} x_{i'jk'}$ part.
For example, if $i$ mentions $j$ at intensity $-4$ twice and $i'$ mentions $j$ at intensity $+2$ three times on an intensity scale $[-5; +5]$, this contributes $2 \cdot 3 \cdot (1 - \frac{\vert-4 - 2\vert}{11 - 1}) = 2 \cdot 3 \cdot 0.4 = 2.4$ to the edge weight between $i$ and $i'$ in the congruence network.

The binary case with $\vert K \vert = 2$ is a special case of the integer congruence network with a negative or positive agreement pattern, for example reflecting rejection or support of a concept by an actor.
In the binary case, congruent opinions always reduce to $1 - \frac{\vert k - k' \vert}{\vert K \vert - 1} = 1$, and differences in opinion always reduce to $1 - \frac{\vert k - k' \vert}{\vert K \vert - 1} = 0$.
Hence the binary case can be more easily expressed by counting the matches on the $k$ qualifier for all $j$ items without computing any distances:
\begin{equation} \label{eq:congruence_binary}
 y_{ii'}^\text{congruence binary} = \Phi_{ii'}\left( \sum_{j = 1}^n \sum_{k} x_{ijk} x_{i'jk} \right).
\end{equation}
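To make Equation~\ref{eq:congruence_binary} concrete, consider a hypothetical example with $n = 2$ second-variable nodes in which $\Phi_{ii'}$ is taken to be the identity function.
Suppose $i$ supports $j = 1$ twice ($x_{i,1,1} = 2$) and rejects $j = 2$ once ($x_{i,2,0} = 1$), while $i'$ supports $j = 1$ once ($x_{i',1,1} = 1$) and supports $j = 2$ three times ($x_{i',2,1} = 3$); all other counts are zero.
Then
\begin{equation*}
 y_{ii'}^\text{congruence binary} = \underbrace{0 \cdot 0 + 2 \cdot 1}_{j = 1} + \underbrace{1 \cdot 0 + 0 \cdot 3}_{j = 2} = 2,
\end{equation*}
because the only match is the joint support of $j = 1$.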

\subsection{Conflict Networks} \label{subsec:conflict}
The same logic as for the congruence network can be applied to produce conflict networks.
In the integer case, Equation~\ref{eq:congruence_integer} must be modified such that the relative distances are not subtracted from one, while everything else stays the same:
\begin{equation}
 y_{ii'}^\text{conflict} = \Phi_{ii'}\left( \sum_{j = 1}^n \sum_{k} \sum_{k'} x_{ijk} x_{i'jk'} \left( \frac{\vert k - k' \vert}{\vert K \vert - 1} \right) \right).
\end{equation}

In the binary case, Equation~\ref{eq:congruence_binary} must be modified such that contradictions instead of matches are counted.
In other words, instead of counting $x_{ijk} x_{i'jk}$, $x_{ijk} x_{i',j,(1-k)}$ must be counted:
\begin{equation}\label{eq:conflict_binary}
 y_{ii'}^\text{conflict binary} = \Phi_{ii'}\left( \sum_{j = 1}^n \sum_{k} x_{ijk} x_{i',j,(1-k)} \right).
\end{equation}
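As a numerical sketch with invented counts and $\Phi_{ii'}$ taken to be the identity function: if $i$ mentions $j$ at intensity $-4$ twice and $i'$ mentions $j$ at intensity $+2$ three times on the scale $[-5; +5]$, this contributes
\begin{equation*}
 2 \cdot 3 \cdot \frac{\vert {-4} - 2 \vert}{11 - 1} = 6 \cdot 0.6 = 3.6
\end{equation*}
to the edge weight between $i$ and $i'$ in the integer conflict network.
In the binary case, if $i$ supports $j$ twice ($x_{i,j,1} = 2$) and $i'$ rejects $j$ once ($x_{i',j,0} = 1$), Equation~\ref{eq:conflict_binary} yields a contribution of $x_{i,j,0} x_{i',j,1} + x_{i,j,1} x_{i',j,0} = 0 \cdot 0 + 2 \cdot 1 = 2$.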

\subsection{The Subtract Method}
In many empirical applications, it might make sense to combine the notions of congruence and conflict in a single signed and weighted network.
If only congruence is considered, for example, one overlooks the possibility that two actors may contradict each other on more concepts than they agree on.
For this reason, it might make sense to subtract conflict edge weights from congruence edge weights and thereby construct a signed, weighted graph using the \emph{subtract} method as follows:
\begin{equation}
 y_{ii'}^\text{subtract} = y_{ii'}^\text{congruence} - y_{ii'}^\text{conflict}.
\end{equation}
Here, positive $y_{ii'}^\text{subtract}$ values indicate congruence in excess of conflict while negative values indicate conflict in excess of congruence.
In some practical applications, for example visualisations of the congruence network, it may make sense to discard all negative values or to introduce some other threshold value $c$ and recode all values $y_{ii'}^\text{subtract} < c$ as $0$.
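For instance, with hypothetical values $y_{ii'}^\text{congruence} = 2$ and $y_{ii'}^\text{conflict} = 3$, the subtract method gives
\begin{equation*}
 y_{ii'}^\text{subtract} = 2 - 3 = -1,
\end{equation*}
which would be recoded as $0$ under a threshold of $c = 0$, whereas a pair of nodes with congruence $5$ and conflict $1$ would retain its weight of $5 - 1 = 4$.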

\subsection{The Ignore Method} \label{subsec:ignore}
In some applications, qualifiers do not matter substantively, or there is only one level on the qualifier variable.
In such applications, it is possible to just count all referrals of $j$ by $i$ across levels of $k$ to get the number of times $i$ mentions $j$ in any way, then do the same for $i'$, and multiply both to yield the similarity between $i$ and $i'$ in terms of overlap in $j$, disregarding the levels of $k$:
\begin{equation}\label{eq:ignore}
 y_{ii'}^\text{ignore} = \Phi_{ii'}\left( \sum_{j = 1}^n \left( \left( \sum_{k} x_{ijk} \right) \left( \sum_{k} x_{i'jk} \right) \right) \right).
\end{equation}
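As a hypothetical illustration with $\Phi_{ii'}$ taken to be the identity function, let $i$ refer to $j = 1$ twice and to $j = 2$ once, and let $i'$ refer to $j = 1$ once and to $j = 2$ three times, regardless of the qualifier levels involved.
Equation~\ref{eq:ignore} then gives
\begin{equation*}
 y_{ii'}^\text{ignore} = 2 \cdot 1 + 1 \cdot 3 = 5.
\end{equation*}
With a binary qualifier and without normalisation, this value equals the sum of the corresponding congruence and conflict counts from Equations~\ref{eq:congruence_binary} and~\ref{eq:conflict_binary}, because every pair of statements on the same $j$ is either a match or a contradiction.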

\section{Normalisation for One-Mode Networks}\label{sec:normalis}
\citet{leifeld2017discourse} discusses the normalisation of congruence networks.
Normalisation, however, is also possible for affiliation networks, as will be demonstrated below.

Normalisation can be necessary to correct networks for the activity or popularity of nodes.
For example, if some first-variable nodes refer to a substantial number of second-variable nodes while others refer to few, the former will be more likely to be connected to many other nodes and especially those with similar levels of activity, which leads to a core--periphery structure of the discourse network.
Normalisation corrects for this pattern by cancelling out the effect of activity or popularity of nodes.
This will often lead to a clear cluster structure based on the similarity of node profiles, instead of a core--periphery structure.

In the simplest case, normalisation can be switched off, in which case
\begin{equation}
 \Phi_{ii'}^\text{no}(\omega) = \omega.
\end{equation}

\subsection{Average Activity Normalisation of One-Mode Networks}
Edge weights can be divided by the \emph{average activity} of nodes $i$ and $i'$:
\begin{equation}\label{eq:activity}
 \Phi_{ii'}^\text{avg} (\omega) = \frac{\omega}{ \frac{1}{2} \left( \sum_{j = 1}^n \sum_{k} x_{ijk} + \sum_{j = 1}^n \sum_{k} x_{i'jk} \right) }.
\end{equation}
\emph{Average activity normalisation} is the most commonly applied form of normalisation and works both with binary and weighted $X$ arrays, i.\,e., with or without duplicate statements.
It divides each weight by the mean number of second-variable referrals of nodes $i$ and $i'$.
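For example, assume hypothetically that the unnormalised edge weight is $\omega = 2$, that $i$ makes three statements in total, and that $i'$ makes four.
Then
\begin{equation*}
 \Phi_{ii'}^\text{avg}(2) = \frac{2}{\frac{1}{2} (3 + 4)} = \frac{2}{3.5} \approx 0.57.
\end{equation*}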

\subsection{Jaccard Normalisation for One-Mode Networks}
With \emph{Jaccard normalisation}, we do not just count $i$'s and $i'$'s activity and sum them up independently, but we add up both their independent activities and their joint activity, i.\,e., both matches and non-matches:
\begin{equation}\label{eq:jaccard}
 \Phi_{ii'}^\text{Jaccard} (\omega) = \frac{\omega}{ \sum_{j = 1}^n \sum_{k} x_{ijk} [x_{i'jk} = 0] + \sum_{j = 1}^n \sum_{k} x_{i'jk}[x_{ijk} = 0] + \sum_{j = 1}^n \sum_{k} x_{ijk} x_{i'jk} }.
\end{equation}
Jaccard normalisation works best with binary $X$ arrays, i.\,e., if duplicate statements are not possible in the data structure.

\subsection{Cosine Normalisation for One-Mode Networks}
With \emph{cosine normalisation}, we modify Equation~\ref{eq:activity} to take the product in the denominator instead of the mean:
\begin{equation}\label{eq:cosine}
 \Phi_{ii'}^\text{cosine} (\omega) = \frac{\omega}{ \sqrt{(\sum_{j = 1}^n \sum_{k} x_{ijk})^2} \sqrt{(\sum_{j = 1}^n \sum_{k} x_{i'jk})^2} }.
\end{equation}
This works best when duplicates are admitted but can also be applied to binary $X$ arrays.
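
The three normalisation functions of this section can be sketched in a few lines of \R.
The following chunk is only an illustration under stated assumptions and not the implementation used in \dna\ or \rdna: the toy array \code{X}, its dimensions, and the helper names \code{activity}, \code{phi\_avg}, \code{phi\_jaccard}, and \code{phi\_cosine} are hypothetical.
The array is indexed as \code{X[i, j, k]}, and the unnormalised weight $\omega$ is computed while ignoring the qualifier.
<<eval=FALSE, results = 'tex'>>=
# Hypothetical toy data: 3 nodes (i), 4 second-variable nodes (j),
# 2 qualifier levels (k); all entries are binary
X <- array(0, dim = c(3, 4, 2))
X[1, 1, 2] <- 1
X[1, 2, 2] <- 1
X[1, 3, 1] <- 1
X[2, 1, 2] <- 1
X[2, 2, 1] <- 1
X[3, 4, 2] <- 1

# Total activity of node i: sum over j and k of x_ijk
activity <- function(X, i) sum(X[i, , ])

# Average activity normalisation
phi_avg <- function(omega, X, i, i2) {
  omega / (0.5 * (activity(X, i) + activity(X, i2)))
}

# Jaccard normalisation: non-matches of i, non-matches of i', and matches
phi_jaccard <- function(omega, X, i, i2) {
  only_i  <- sum(X[i, , ] * (X[i2, , ] == 0))
  only_i2 <- sum(X[i2, , ] * (X[i, , ] == 0))
  both    <- sum(X[i, , ] * X[i2, , ])
  omega / (only_i + only_i2 + both)
}

# Cosine normalisation as written in the cosine equation above
phi_cosine <- function(omega, X, i, i2) {
  omega / (sqrt(activity(X, i)^2) * sqrt(activity(X, i2)^2))
}

# Unnormalised weight between nodes 1 and 2, ignoring the qualifier
omega <- sum(rowSums(X[1, , ]) * rowSums(X[2, , ]))

phi_avg(omega, X, 1, 2)
phi_jaccard(omega, X, 1, 2)
phi_cosine(omega, X, 1, 2)
@
With these toy data, the unnormalised weight is $2$, and the three functions return $0.8$, $0.5$, and approximately $0.33$, respectively.
The sketch does not guard against nodes without any statements, for which the denominators would be zero.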

\section{Affiliation Networks}\label{sec:twomode}
While one-mode networks as portrayed in Section~\ref{sec:onemode} are most useful for analysing coalition structure in policy debates, affiliation networks convey more complexity.
This makes them harder to interpret as the data become more complex, but they can be more informative for less complex discourse networks.

The simplest case is to ignore the qualifier variable:
\begin{equation}
 y_{ij}^\text{affiliation ignore} = \Phi_{ij}\left(\sum_{k} x_{ijk} \right)
\end{equation}
This only makes sense if there is only one level in $K$ or if the qualifier variable does not matter substantively.

More interestingly, negative edges (e.\,g., rejection of concepts by actors) can be subtracted from positive edges (e.\,g., support of concepts by actors).
This yields a signed, weighted affiliation network.

In the integer case, the respective cells in $X$ can just be weighted by the respective level $k$.
If the weight is negative, this will subtract $x_{ijk}$ from the count:
\begin{equation}
 y_{ij}^\text{affiliation subtract integer} = \Phi_{ij}\left(\sum_{k} k x_{ijk} \right)
\end{equation}

In the binary case (assuming $K = \{0; 1\}$), $0$ values need to be transformed into $-1$ before they can be subtracted:
\begin{equation}
 y_{ij}^\text{affiliation subtract binary} = \Phi_{ij}\left(\sum_{k} \left( k x_{ijk} - (1 - k) x_{ijk} \right) \right)
\end{equation}
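
The binary subtraction can be illustrated with a short \R\ sketch.
It assumes a hypothetical toy array \code{X}, indexed as \code{X[i, j, k]}, whose first slice holds the statements with $k = 0$ and whose second slice holds those with $k = 1$, and it applies no normalisation.
The object names are chosen for illustration only and are not part of \rdna.
<<eval=FALSE, results = 'tex'>>=
# Hypothetical toy data: 2 first-variable nodes, 3 second-variable nodes,
# 2 qualifier levels (slice 1: k = 0, slice 2: k = 1)
X <- array(0, dim = c(2, 3, 2))
k_values <- c(0, 1)
X[1, 1, 2] <- 2 # node 1 agrees with node 1 of the second variable twice
X[1, 1, 1] <- 1 # ... and disagrees once
X[1, 2, 1] <- 1 # node 1 disagrees with node 2 of the second variable
X[2, 3, 2] <- 1 # node 2 agrees with node 3 of the second variable

# sum over k of (k * x_ijk - (1 - k) * x_ijk)
Y <- matrix(0, nrow = dim(X)[1], ncol = dim(X)[2])
for (l in seq_along(k_values)) {
  k <- k_values[l]
  Y <- Y + k * X[, , l] - (1 - k) * X[, , l]
}
Y
@
In this example, the first row of \code{Y} is $(1, -1, 0)$ and the second row is $(0, 0, 1)$: two agreements minus one disagreement leave a weight of $1$, while a single disagreement yields $-1$.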

Alternatively, in the binary case (assuming $K = \{0; 1\}$), it is possible to map all combinations of $k$ for each $(ij)$ dyad into a multiplex network with three distinct types of edges, where $0$ represents neither agreement nor disagreement, $1$ represents agreement, $2$ represents disagreement, and $3$ represents a mix of both agreement and disagreement.
This can be useful, for example, for visualising agreement, disagreement, and ambiguity/ambivalence in the same affiliation network using different colours.
More formally:
\begin{equation}
 y_{ij}^\text{affiliation combine binary} =
 \begin{cases}
    0 & \text{if } \sum_k x_{ijk} = 0 \\
    1 & \text{if } x_{i,j,k=0} = 0 \wedge x_{i,j,k=1} > 0 \\
    2 & \text{if } x_{i,j,k=1} = 0 \wedge x_{i,j,k=0} > 0 \\
    3 & \text{if } x_{i,j,k=0} > 0 \wedge x_{i,j,k=1} > 0
 \end{cases}
\end{equation}
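
The four cases can be reproduced compactly in \R\ by coding the presence of agreement as $1$ and the presence of disagreement as $2$ and adding the two.
This is a minimal sketch with a hypothetical toy array \code{X} (first slice: $k = 0$, second slice: $k = 1$), not the code used by \dna\ or \rdna.
<<eval=FALSE, results = 'tex'>>=
# Hypothetical toy data: 2 first-variable nodes, 3 second-variable nodes,
# 2 qualifier levels (slice 1: k = 0, slice 2: k = 1)
X <- array(0, dim = c(2, 3, 2))
X[1, 1, 2] <- 2 # agreement (twice) ...
X[1, 1, 1] <- 1 # ... and disagreement in the same dyad
X[1, 2, 1] <- 1 # disagreement only
X[2, 3, 2] <- 1 # agreement only

# 0 = no statement, 1 = agreement, 2 = disagreement, 3 = both
Y <- 1 * (X[, , 2] > 0) + 2 * (X[, , 1] > 0)
Y
@
Here the first row of \code{Y} is $(3, 2, 0)$ and the second row is $(0, 0, 1)$.
Because only the presence of statements matters, duplicate statements do not change the result.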

\section{Normalisation of Affiliation Networks}
Like one-mode networks, affiliation networks can be normalised.
With \emph{activity normalisation}, ties from more active nodes receive lower weights:
\begin{equation}
 \Phi_{ij}^\text{activity}(\omega) = \frac{\omega}{\sum_{j = 1}^n \sum_k x_{ijk}}
\end{equation}

With \emph{prominence normalisation}, ties to more prominent nodes receive lower weights:
\begin{equation}
 \Phi_{ij}^\text{prominence}(\omega) = \frac{\omega}{\sum_{i = 1}^m \sum_k x_{ijk}}
\end{equation}
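
Both normalisations amount to dividing the rows or the columns of the affiliation matrix by their totals.
The following \R\ sketch shows this for a hypothetical toy array \code{X}, indexed as \code{X[i, j, k]}, with the qualifier ignored; the object names are illustrative and not taken from \rdna.
<<eval=FALSE, results = 'tex'>>=
# Hypothetical toy data: 2 first-variable nodes, 3 second-variable nodes,
# 2 qualifier levels
X <- array(0, dim = c(2, 3, 2))
X[1, 1, 2] <- 1
X[1, 2, 2] <- 1
X[1, 3, 1] <- 1
X[2, 1, 2] <- 1

# Affiliation matrix with the qualifier ignored: sum over k
Y <- apply(X, c(1, 2), sum)

# Activity of each first-variable node and prominence of each
# second-variable node
act  <- apply(X, 1, sum)
prom <- apply(X, 2, sum)

# Divide each row by its activity, or each column by its prominence
Y_activity   <- sweep(Y, 1, act, "/")
Y_prominence <- sweep(Y, 2, prom, "/")
@
In this example, each of the three ties of the first node receives the weight $1/3$ under activity normalisation, whereas the single tie of the second node keeps the weight $1$.
Under prominence normalisation, the two ties to the first second-variable node receive $0.5$ each.
Nodes without any statements would lead to a division by zero, which the sketch does not handle.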

\section{Temporal Aggregation: Time Windows and Attenuation}\label{sec:longi}
Networks can be temporally smoothed.
For example, it is possible to create a series of temporally overlapping time slices and aggregate these slices into a single network to limit the temporal scope of congruence edges (\emph{time window algorithm}).
Using the same algorithm, it is possible to visualise change over time using animations.
Alternatively, it is possible to make the edge weight proportional to the time that has passed between the relevant statements of $i$ and $i'$ (\emph{attenuation algorithm}).
These methods are more advanced and are introduced in \citet{leifeld2016policy}.


\chapter{Installation of \dna\ and \rdna} \label{chp:installation}
\chapterauthor{Johannes Gruber and Philip Leifeld}
\FloatBarrier

This chapter explains how \dna\ and \rdna\ can be installed on common desktop operating systems.

As \dna\ is written in \java, both \dna\ and \rdna\ rely on \java\ to work on your computer properly.
Installing and configuring a valid \java\ Runtime Environment on your machine will thus be the first and only complicated step of the installation.
Following the simple steps below, one should not run into problems while setting up \java.
The advantage of the \java\ programming language for academic software is twofold: once the Runtime Environment is set up, the same program runs on different operating systems without changes to the source code, and \java\ is, for the most part, open source.
Apart from setting up the \java\ Runtime Environment, the installation of \dna\ and \rdna\ is identical across operating systems.

If you feel confident that \java\ is already correctly set up on your computer, you can therefore skip to Section~\ref{sec:installdna} if you like.
Otherwise please continue to the section for the operating system you wish to install \dna\ and \rdna\ on:
\fullref{sec:windows},
\fullref{sec:mac} or
\fullref{sec:linux}.

\enlargethispage{1cm}

For more experienced users, here is a short version of the steps described below:
\begin{enumerate}
\item (On Mac: install \href{https://support.apple.com/downloads/DL1572/en_US/javaforosx.dmg}{Apple's legacy version of \java}---even though we will never use it.)
\item Install \java\ (version~8) on your computer, either as a Runtime Environment (JRE) or as a Development Kit (JDK).
\item (On Windows and Mac: set \code{JAVA\_HOME} to the installation path of \java.)
\item Download the newest executable JAR from \url{https://github.com/leifeld/dna/releases}.
\item (On Linux: make the JAR file executable.) \\
      (On Mac: allow executing apps from an unidentified developer.)
\item You can now run the standalone \dna\ or continue to install \rdna\ as well.
\item Download and install \R\ (and \rstudio).
\item In \R: install the necessary \R\ packages \texttt{rJava} and \texttt{devtools}.
\item In \R: install \rdna\ via
<<eval=FALSE, results = 'tex', message = FALSE>>=
devtools::install_github("leifeld/dna/rDNA", args = "--no-multiarch")
@
\end{enumerate}

\section{Windows} \label{sec:windows}

\subsection{Installing \java\ on Windows}
To install the necessary \java\ Runtime Environment on your Windows computer, simply go to \url{https://www.java.com/en/download/manual.jsp}, scroll down to and download \code{Windows Offline (64-bit)} (see Figure~\ref{fig:downljava}; download \code{Windows Offline} instead if you are using a 32-bit version of Windows).
During the installation, you can accept all the default options, including the installation path.

\begin{figure}[tbp]
  \includegraphics[frame, width=\textwidth]{03-1-downljava}
  \caption{Downloading JDK from Oracle}
  \label{fig:downljava}
\end{figure}

Next, you should set \code{JAVA\_HOME} in your environment variables to tell your Windows PC where your \java\ installation lives.
This step is optional, but can prevent many of the issues users have had with \java\ in the past.
To set \code{JAVA\_HOME}, you need to navigate to the menu \code{edit the system environment variables}.
The easiest way to get there is to hit the \win\ button on your keyboard and enter \code{environment}.
Windows will then search for programs and settings menus that include this title and should usually display the menu we are looking for on top.\footnote{On older versions of Windows, this might not work.
On Windows~7 you can alternatively right-click on \code{My Computer} and select \code{Properties} $\rightarrow$ \code{Advanced}.
On Windows~8, go to \code{Control Panel} $\rightarrow$ \code{System} $\rightarrow$ \code{Advanced System Settings}.}
In this menu, you have to find the button \code{Environment variables...}.
Clicking this button should open the window shown in Figure~\ref{fig:javahome}.

Under \code{User Variables}, click \code{New}.\footnote{This sets \code{JAVA\_HOME} just for the current user.
If you want to make \java\ available for all users on the computer you are working on, you can create a \code{System Variable} instead.}
Enter the variable name \code{JAVA\_HOME} and the path to your Java installation in the field \code{Variable value}.
If you have not altered the default installation location, you should find \java\ in \code{"C:\textbackslash Program Files\textbackslash Java\textbackslash jre1.8.0\_151"} or, if you chose to install a 32-bit version of \java, in \code{"C:\textbackslash Program Files (x86)\textbackslash Java\textbackslash jre1.8.0\_151"} (which will cause problems, though, if you try to use it with a 64-bit version of \R).\footnote{Note that you have to repeat this procedure whenever the installation path of \java\ changes, for example whenever \java\ is updated.}

\begin{figure}
 \centering
 \includegraphics[width=0.6\textwidth]{03-2-javahome}
 \caption{Edit JAVA\_HOME to tell Windows where your \java\ lives.}
 \label{fig:javahome}
\end{figure}

Windows should now recognise \java\ and be able to run \java\ commands.
To test this, we can open the command prompt (press the \win\ button on your keyboard and simply enter \code{cmd} and then hit \code{Enter}) and type a \java\ command, e.\,g., \code{java -version}.
If the installation was successful, the output should display information about the \java\ version and build as depicted in Figure~\ref{fig:javvers}.

\begin{figure}
  \centering
  \includegraphics[width=0.65\linewidth]{03-3-javaVersionCommand}
  \caption{Testing Java installation in Windows command prompt}
  \label{fig:javvers}
\end{figure}

After installing \java, you are ready to use \dna\ and could skip to Section~\ref{sec:installdna} if you are not interested in installing \rdna\ as well.
In order to use \rdna, the rest of this section will explain how to install \R\ and the recommended \href{https://en.wikipedia.org/wiki/Integrated_development_environment}{integrated development environment (IDE)} \href{https://www.rstudio.com/products/RStudio/}{\rstudio}, which makes working with \R\ a lot easier and also looks a lot better than the default interface.

\subsection{Installing \R\ on Windows} \label{subsec:installr-win}
\begin{enumerate}
\item First, you need to download \R\ from \url{https://cran.r-project.org/bin/windows/base/}.
\item At the top of the page, click on \code{Download R \Sexpr{R_vers} for Windows} (or a newer version if available).
\item Install the downloaded file, e.\,g., \code{R-\Sexpr{R_vers}-win.exe}.
      Usually, it is fine to leave all default settings in the installation options.
\item Go to \url{https://www.rstudio.com/products/rstudio/download/}.
\item At the bottom of the page, under \code{Installers for Supported Platforms}, click on the link \code{RStudio \Sexpr{RS_vers} -- Windows Vista/7/8/10} (or a newer version if available).
      Again, the default installation options are fine in most cases and can be accepted without changes.
\item After installation, you can use \R\ by opening \rstudio.
\end{enumerate}

\subsection{Testing the Installation of \rstudio} \label{subsec:rtest}
Traditionally, the first test you perform in a new programming language is to write a ``Hello, World!'' program.
To do this in \R, you simply type \code{print("Hello World!")} in the console (in the lower left corner of \rstudio).
Alternatively, you can make \R\ perform a simple mathematical operation.
If everything is set up correctly, the output should look like this:
<<eval=TRUE, results = 'tex'>>=
print("Hello World!")
# You can also use R as a calculator
2 * 3
@

The chunk of code above marks the first time we are using \R\ commands in this manual.
It might be worth explaining what this means for users who are not familiar with documents containing \R\ code.
Whenever code is shown in this manual it is decorated with a light grey background.
Comments in \R\ code (i.\,e., text targeted at the user to explain what is happening in a specific line) are marked with a \code{\#} and are formatted in italic font and in dark grey.
The output, which is generated by running a command, is marked by two \code{\#} and formatted in black.
This means that any line that does not start with \code{\#\#} contains \R\ code you can copy and paste to the console in \rstudio\ and run.
Alternatively, you can copy the code into an \R\ script and execute it either by clicking on the \rrun\ button in the upper right corner of the script editor in \rstudio\ or by using the shortcut \code{Ctrl+Enter}.
Either way, the highlighted code or the line in which the caret is currently flashing is sent to the console and executed.
If this works fine, you should be able to continue to Section~\ref{sec:installdna}.
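
If you set \code{JAVA\_HOME} as described earlier in this section, you can also check from within \R\ whether the variable is visible to the \R\ session.
The following lines are a minimal sketch using only base \R\ functions; the object name \code{java\_home} is arbitrary.
<<eval=FALSE, results = 'tex'>>=
# Read the environment variable as seen by the current R session
java_home <- Sys.getenv("JAVA_HOME")

# Report whether it is set and points to an existing directory
if (nzchar(java_home) && dir.exists(java_home)) {
  message("JAVA_HOME points to: ", java_home)
} else {
  message("JAVA_HOME is not set or does not point to an existing directory.")
}
@
If the second message appears although you have set the variable, restarting \rstudio\ may help, since environment variables are usually read when a program starts.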

\section{macOS} \label{sec:mac}

\subsection{Installing \java\ on macOS}
On macOS, you have to install two versions of \java\ in order for \rdna\ to work properly.
The reasons behind this are too complicated to cover here.
Basically, Apple built its own version of \java, which needs to be on your machine, even though it is outdated.
Therefore we need to first install the legacy \java~6---which we will never use---before installing the correct \java\ Development Kit version~8.\footnote{If you do not wish to ever use \rdna\ or any other \R\ package that relies on \java, you might not need both versions and can just download the newest \java\ Runtime Environment.
However, installing \java\ version~8 before the legacy \java\ will cause problems if you ever change your mind.}

First, please download the file \url{https://support.apple.com/downloads/DL1572/en_US/javaforosx.dmg} and install it, accepting all defaults.
After this has finished, we can proceed to get the new version of the \java\ Development Kit.
Go to \url{http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html} and scroll down to \code{Java SE Development Kit 8u162}, accept the License Agreement and then click on \code{jdk-8u162-macosx-x64.dmg} to download the file (see Figure~\ref{fig:downljava2}).
Again, install the program accepting all defaults.

\begin{figure}[tbp]
  \centering
  \includegraphics[frame, width=0.75\textwidth]{03-1-downljava2}
  \caption{Downloading JDK from Oracle.}
  \label{fig:downljava2}
\end{figure}

After installing \java, you are ready to use \dna\ and could skip to Section~\ref{sec:installdna} if you are not interested in installing \rdna\ as well.
In order to use \rdna, the rest of this section will explain how to install \R\ and the recommended \href{https://en.wikipedia.org/wiki/Integrated_development_environment}{integrated development environment (IDE)} \href{https://www.rstudio.com/products/RStudio/}{\rstudio}, which makes working with \R\ a lot easier and also looks a lot better than \R's default interface.

\subsection{Installing \R\ on macOS} \label{subsec:installr-mac}
\begin{enumerate}
  \item First, you need to download \R\ from \url{https://cran.r-project.org/bin/macosx/}.
  \item At the top of the page, click on \code{R-\Sexpr{R_vers}.pkg} (or a newer version if available).
  \item Install the downloaded file.
        Usually, it is fine to leave all default settings in the installation options.
  \item Go to \url{https://www.rstudio.com/products/rstudio/download/}.
  \item At the bottom of the page, under \code{Installers for Supported Platforms}, click on the link \code{RStudio \Sexpr{RS_vers} -- Mac OS X 10.6+ (64-bit)} (or a newer version if available).
        Install RStudio by simply dragging the application icon in the downloaded \code{.dmg} file to your Applications folder.
  \item Then you need to install the program \texttt{Xcode} from the App Store. The program is very large and will take a while to install.
  \item After installation, you can use \R\ by opening \rstudio.
\end{enumerate}

To test your installation of \R, follow the instructions in Section~\ref{subsec:rtest}.

Working with \java\ from within \R\ on a Mac is a bit messy.
Apple's own version of \java, although important to have installed, does not run in combination with \R.
That is why we have to tell your system which version of \java\ to use by default.
To do this, we have to enter a few system commands, which you can either do in the Terminal app or directly from within \R\ using the \code{system} function:
<<eval=FALSE, results = 'tex'>>=
# list all installed versions of Java
system("/usr/libexec/java_home -V")
##Matching Java Virtual Machines (3):
## 1.8.0_162, x86_64:	"Java SE 8"	/Library/Java/JavaVirtualMachines/jdk1.8.0...
## 1.6.0_65-b14-468, x86_64:	"Java SE 6"	/Library/Java/JavaVirtualMachines/...
## 1.6.0_65-b14-468, i386:	"Java SE 6"	/Library/Java/JavaVirtualMachines/1....


# see default version of Java
system("java -version")
##java version "1.8.0_162"
##Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
##Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
@
If your output looks like the output above, you are almost ready to install \texttt{rJava}.
The only thing left to do is to associate \java\ with \R.
To do this, you can either use the terminal app, or you can invoke a system command directly from within \R\ using the \code{system} function:
<<eval=FALSE, engine = 'bash', results = 'tex'>>=
$sudo R CMD javareconf
@
Or in \R:
<<eval=FALSE, results = 'tex'>>=
system("sudo R CMD javareconf")
@

If \code{/usr/libexec/java\_home -V} does not show \code{1.8.0\_162} (or any other version starting with \code{1.8.}), you need to install \java\ version~8 again (see above) and possibly reboot your computer.
If \code{java -version} shows \code{java version "1.6.0\_65"}, but version~1.8 is listed in the output from the first command, you can set the default by executing the following command:
<<eval=FALSE, results = 'tex'>>=
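# Set JAVA_HOME for the current R session (an `export` inside system() would
# only affect a short-lived subshell and be lost immediately)
Sys.setenv(JAVA_HOME = system("/usr/libexec/java_home -v 1.8", intern = TRUE))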
@

After this, you should be able to continue to Section~\ref{sec:installdna}. However, depending on prior installations and the configuration of your machine, there can be other problems. You can find one nice tutorial and trouble-shooting guide \href{https://github.com/MTFA/CohortEx/wiki/Run-rJava-with-RStudio-under-OSX-10.10,-10.11-(El-Capitan)-or-10.12-(Sierra)#using-rstudioapp-or-rapp}{here}.

\section{Linux} \label{sec:linux}

\subsection{Installing \java\ on Linux}
%To Do: Add Suse and Debian commands where different
Since you are using Linux, we assume that you are sufficiently comfortable with using the terminal.

First, check if \java\ might already be installed:
<<eval=FALSE, engine = 'bash', results = 'tex'>>=
$java -version
@

If not, install it, e.\,g., via \code{APT}:
<<eval=FALSE, engine = 'bash', results = 'tex'>>=
$sudo apt-get install default-jdk
@

After installing \java, you are ready to use \dna\ and could skip to Section~\ref{sec:installdna} if you are not interested in installing \rdna\ as well.
In order to use \rdna, the rest of this section will explain how to install \R\ and the recommended \href{https://en.wikipedia.org/wiki/Integrated_development_environment}{integrated development environment (IDE)} \href{https://www.rstudio.com/products/RStudio/}{\rstudio}, which makes working with \R\ a lot easier and also looks a lot better than the default user interface.

\subsection{Installing \R\ on Linux}\label{subsec:installr-linux}
\begin{enumerate}
  \item Since the version of \R\ in the default repositories tends to be fairly outdated, we add the repository of the Comprehensive R Archive Network (CRAN) to our \code{sources.list}:
<<eval=FALSE, engine = 'bash', results = 'tex'>>=
$sudo add-apt-repository \
"deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu \
$(lsb_release -cs)/"
@
        Note that \code{lsb\_release -cs} returns the codename of your Ubuntu release, so that the matching repository on the CRAN server is selected automatically.
        Visit \href{https://cran.rstudio.com/bin/linux/}{CRAN} to see for which Linux distributions \R\ is available.
        \code{cran.rstudio.com} is also just one of several \href{https://cran.r-project.org/mirrors.html}{CRAN mirrors}, so you could replace it with a different one if you prefer.
  \item Next, you need to add the signing key of the CRAN repository to your keyring.
        Here is how you would accomplish this in Ubuntu:
<<eval=FALSE, engine = 'bash', results = 'tex'>>=
$sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
@
  \item Update apt and install \R\ (or \code{r-base-dev} if you wish to compile packages from source):
<<eval=FALSE, engine = 'bash', results = 'tex'>>=
$sudo apt-get update
$sudo apt-get install r-base
@
  \item Now install \rstudio\ via gdebi (and install gdebi first if you do not already have it):%
  \footnote{Alternatively, you can download an installation file from \url{https://www.rstudio.com/products/rstudio/download/}.}
<<eval=FALSE, engine = 'bash', results = 'tex'>>=
$sudo apt-get install gdebi-core
$wget https://download1.rstudio.org/rstudio-$RS_vers$-amd64.deb
$sudo gdebi -n rstudio-$RS_vers$-amd64.deb
$rm rstudio-$RS_vers$-amd64.deb
@
 
  Note that, as of version \Sexpr{RS_vers}, \rstudio\ depends on an outdated version of \texttt{libgstreamer}.
  This version has already been deprecated in some Linux distributions, which can lead to an error during the installation of \rstudio.
  If you run into trouble while installing \rstudio, you should try installing the old version of \texttt{libgstreamer} alongside the newer library:
 
<<eval=FALSE, engine = 'bash', results = 'tex'>>=
# Download files with wget
$wget http://ftp.ca.debian.org/debian/pool/main/g/gstreamer0.10/libgstreamer\
0.10-0_0.10.36-1.5_amd64.deb
$wget http://ftp.ca.debian.org/debian/pool/main/g/gst-plugins-base0.10/libgs\
treamer-plugins-base0.10-0_0.10.36-2_amd64.deb

# Now install with gdebi
$sudo gdebi libgstreamer0.10-0_0.10.36-1.5_amd64.deb
$sudo gdebi libgstreamer-plugins-base0.10-0_0.10.36-2_amd64.deb

# And then clean up
$sudo rm libgstreamer0.10-0_0.10.36-1.5_amd64.deb libgstreamer-plugins-base0.10-0_0.10.36-2_amd64.deb
@
  
  \item For Linux, there are a few other system dependencies for \rdna. You should install these using:

<<eval=FALSE, engine = 'bash', results = 'tex'>>=
$sudo apt-get install libudunits2-dev
$sudo apt-get build-dep libcurl4-gnutls-dev
$sudo apt-get install libcurl4-gnutls-dev
@
  \item After the installation has finished, you can use \R\ by opening \rstudio.
\end{enumerate}

To test your installation of \R, follow the instructions in Section~\ref{subsec:rtest}.

Before we can actually run \rdna, we need to associate \java\ with \R.
To do this, you should go back to the terminal:
<<eval=FALSE, engine = 'bash', results = 'tex'>>=
$sudo apt-get install r-cran-rjava
$sudo R CMD javareconf
@

If this finishes without errors, you are ready to start installing \dna\ and the \rdna\ package as described in Section~\ref{sec:installdna}.

\section{Installing \dna\ and \rdna} \label{sec:installdna}
Once \java\ is set up correctly, you can simply download the latest version of \dna\ as a JAR file from \url{https://github.com/leifeld/dna/releases} (see Figure~\ref{fig:downloadjar}).
JAR or \texttt{.jar} files are technically self-contained and executable archive files, which usually contain a computer program written in \java, along with all the files necessary to run the program.
Once the download is finished, you can start the program by double-clicking on the downloaded file.
However, on Linux, it is sometimes necessary to make the file executable first (e.\,g., via \code{\$chmod +x /path/to/your/dna.jar} or using \href{https://askubuntu.com/a/484719/570716}{a GUI method}).
On newer versions of macOS, a security exception needs to be made before you can run a program from an ``unidentified developer'' (i.\,e., if the program has not been registered with Apple).
To do so for \dna, control-click the program's icon, then choose \code{Open} from the shortcut menu.
If clicking on the file does not open the program on a Windows machine, right-click on the \texttt{.jar} file $\rightarrow$ \code{Open with} $\rightarrow$ \code{Use another app} and then navigate to the file \code{"C:\textbackslash Program Files\textbackslash Java\textbackslash jre1.8.0\_151\textbackslash bin\textbackslash javaw.exe"}.

If you are not interested in using \rdna, you can now skip to Chapter~\ref{chp:dna-prep}.

\begin{figure}
  \includegraphics[frame, width=\textwidth]{03-4-downloadjar}
  \caption{Download \dna\ jar file from GitHub releases page.}
  \label{fig:downloadjar}
\end{figure}

At this point, it is assumed that you have installed \R\ and have at least a minimal understanding of how the program works (see Section~\ref{subsec:rtest}).
If that is the case, we can go ahead and install \rdna\ from within \R.

First, we need to install the package \rjava\ \citep{urbanek2017rjava}, which is the most important dependency of \rdna:%
\footnote{Again, this sometimes does not work that easily on macOS. If the installation fails, you could try to install the package from source using \code{install.packages("rJava", type = "source")}.}\textsuperscript{,}%
  \footnote{Alternatively, it can make sense on Linux systems to install \rjava\ via apt: \code{sudo apt-get install r-cran-rjava}.}
<<eval=FALSE, results = 'tex', message = FALSE>>=
install.packages("rJava")
@

To see if this worked, or to troubleshoot potential problems, we can run a few \java\ commands from within \R:
\footnote{Loading \rjava\ for the first time regularly fails on macOS with the warning \code{...}.
If this is the case, try the command \code{sudo ln -s \$(/usr/libexec/java\_home)/jre/lib/server/libjvm.dylib /usr/local/lib} in your terminal app.}
<<eval=TRUE, results = 'tex', message = FALSE>>=
library("rJava")
# 1. initialize JVM
.jinit()

# 2. retrieve the Java-version
.jcall("java/lang/System", "S", "getProperty", "java.version")

# 3. retrieve JAVA_HOME location
.jcall("java/lang/System", "S", "getProperty", "java.home")

# 4. retrieve Java architecture
.jcall("java/lang/System", "S", "getProperty", "sun.arch.data.model")

# 5. retrieve architecture of OS (This should have 64 in it if step 4 displays
#    "64")
.jcall("java/lang/System", "S", "getProperty", "os.arch")

# 6. retrieve architecture of R as well (This should again have 64 in it if
#    step 4 and 5 display 64)
R.Version()$arch
@

For \rdna\ to work properly, you need to ensure that \rjava\ works correctly.
In particular, it is essential that the architectures of \java, your operating system, and your version of \R\ match (see comments~4, 5, and 6 in the code chunk above).
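
The comparison of the three architecture strings can also be automated.
The following helper is a hypothetical sketch and not part of \rjava\ or \rdna; it rests on the simple assumption that a 64-bit architecture string contains the characters \code{64}, as in \code{"64"}, \code{"amd64"}, or \code{"x86\_64"}.
<<eval=FALSE, results = 'tex'>>=
# Hypothetical helper: TRUE if all three architecture strings agree on being
# 64-bit, or if none of them is 64-bit
same_bitness <- function(java_arch, os_arch, r_arch) {
  is_64 <- grepl("64", c(java_arch, os_arch, r_arch), fixed = TRUE)
  all(is_64) || !any(is_64)
}

same_bitness("64", "amd64", "x86_64") # consistent 64-bit set-up
same_bitness("32", "amd64", "x86_64") # 32-bit Java on a 64-bit system
@
The first call returns \code{TRUE} and the second returns \code{FALSE}.
In practice, the three arguments would be the results of steps~4, 5, and 6 in the code chunk above.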

Once this is done, you should install the package \texttt{devtools} \citep{wickham2018devtools}, which permits installing \R\ packages from \github.
<<eval=FALSE, results = 'tex', message = FALSE>>=
install.packages("devtools")
@

Since we only need one function from the package \texttt{devtools} at this point, it is not necessary to invoke the \code{library} command to load the whole package.
Instead, you can write \code{devtools::} and then type the function you want to use.\footnote{The option \code{args = "--no-multiarch"} should normally not be necessary, but prevents errors on some operating systems.
Since \texttt{devtools} tries to test both the 32-bit and 64-bit version of a package during installation, the process inevitably fails as only one architecture of \java\ is available.}
<<eval=FALSE, results = 'tex', message = FALSE>>=
devtools::install_github("leifeld/dna/rDNA", args = "--no-multiarch")
@
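
To confirm that the packages mentioned so far have been installed, one possible check is to look them up in the list of installed packages.
This is a minimal sketch using base \R\ only; the object names \code{pkgs}, \code{installed}, and \code{not\_installed} are arbitrary.
<<eval=FALSE, results = 'tex'>>=
# Packages installed in the previous steps
pkgs <- c("rJava", "devtools", "rDNA")

# Look each package up in the library without loading it
installed <- pkgs %in% rownames(installed.packages())
names(installed) <- pkgs
installed

# Names of packages that still need to be installed (empty if none)
not_installed <- pkgs[!installed]
not_installed
@
Note that this only shows whether a package is present in your library, not whether it can be loaded successfully.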

After this is done as well, the final step of the installation is to test if \rdna\ can be loaded into \R\ correctly and to perform a basic operation with it---opening \dna\ from within \R.
In order to do so, you first need to download \dna, which can also be done in \R\ with the \code{dna\_downloadJar} command (see Chapter~\ref{chp:rdna} for more details on what these commands mean).
<<eval=TRUE, echo = FALSE, warning = FALSE>>=
rDNA::dna_downloadJar()
@
<<eval=FALSE, results = 'tex', message = FALSE>>=
# load the package first so that its functions are available
library("rDNA")

# download the DNA JAR file
dna_downloadJar()

# initialise the file you just downloaded
dna_init()

# start up DNA from R with the sample file to see if everything worked
dna_gui(infile = dna_sample())
@
If these commands can be executed correctly, you are ready to use both \dna\ and \rdna.


\chapter{Preparation of your \dna\ Workspace} \label{chp:dna-prep}
\chapterauthor{Felix Rolf Bossner and Johannes Gruber}
\FloatBarrier

After installing the program (see Chapter~\ref{chp:installation}), you can now create your first DNA database for your own research project.
How you set up a DNA database will mainly depend on the needs of your personal research design---which should usually be clear before you start analysing data.
Therefore, \dna\ can be customised during the creation of a new database in accordance with how you are planning to use the tool.

\section{Creating a New DNA Database}\label{sec:createnewdb}
In order to create a new DNA database file, you have to click on the index tab \code{File} (in the upper left corner of your DNA program window) and select the option \code{New DNA database} (see Figure~\ref{fig:newdb}).
As a result, a new window will open (see Figure~\ref{fig:dbchoose}), in which you find a menu that guides you step by step through the configuration of your personal DNA database.
\begin{figure}
  \includegraphics[frame, width=\linewidth]{04-1-newDatabase}
  \caption{Starting a new Database}
  \label{fig:newdb}
\end{figure}


\begin{figure}
  \includegraphics[frame, width=\linewidth]{04-3-chooseDB}
  \caption{Choose if database will be stored locally or remotely}
  \label{fig:dbchoose}
\end{figure}

Clicking on the first tab in the sidebar of this menu---\code{Database} (see Figure~\ref{fig:dbchoose})---opens a menu, which allows you to choose the file name and storage location of your database.
For this first step of your set-up, DNA provides you with two options with respect to the type of database in which your data are stored.
Which of these options best fits your research project depends on the circumstances of your coding process:

\begin{description}
\item[\code{Local .dna file}] The preset option \code{Local .dna file} means that the dataset is stored in a local file (technically an SQLite file) on your PC or device.
  This file, with the file extension .dna, can be moved on your machine, sent via email, uploaded and shared via a cloud file hosting service---such as Dropbox---and can generally be treated in the same way as any other file PC users are familiar with.
  A local .dna file will be sufficient in most user scenarios, for example, if you employ a single coder working on a single computer, if multiple coders work on a single dataset at non-overlapping intervals or when multiple coders work at the same time on different datasets, which you merge after the coding process (see Section~\ref{subsec:merge}).
  For most users, this simpler option will be adequate for using \dna.
  It is not necessary to be familiar with setting up and managing an SQLite or MySQL database.
  If you think the scenarios described above cover your intended use of \dna, you can now jump to the next section and start \fullref{subsec:createlocal}.
\item[\code{Remote database on a server}] However, for more experienced users or for research projects in which several coders want to work on the same database at the same time, a second option is included in \dna: \code{Remote database on a server}.
  This stores your data in a MySQL database, which can be hosted locally on your machine (although this would defeat the purpose), on a private server such as network-attached storage (NAS), or on an online cloud server.
  You should select this option if you employ a single coder working on multiple devices or multiple coders working on a single dataset at the same time.
  The preconditions for using this type of storage are that all coders have a stable connection to the database during the coding process---e.\,g., via the internet---and that you \href{https://dev.mysql.com/doc/mysql-getting-started/en/}{set up an online MySQL database} in advance.
  If this is how you want to proceed, you can now jump directly to the section that describes the necessary steps for \fullref{subsec:usingremote}.
\end{description}
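
Because a local .dna file is technically an SQLite file, it can also be inspected with any SQLite client outside of \dna.
The following Python sketch only lists the table names found in such a file.
The function name \code{list\_tables} and the file name in the comment are hypothetical, and no assumption is made about which tables \dna\ creates.
This step is optional and not needed for working with \dna.

```python
import sqlite3
from pathlib import Path


def list_tables(path):
    """Return the names of all tables in the SQLite file at `path`."""
    # Open the file read-only via a URI so that inspection cannot modify it.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Hypothetical usage (replace the file name with your own database file):
# print(list_tables("example.dna"))
```

Opening the file in read-only mode is a deliberate choice here: the sketch is meant for looking at the file, not for changing a database that coders are working on.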


\subsection{Creating a Local DNA File}\label{subsec:createlocal}
\begin{enumerate}
\item Click on the button \code{Browse} (see Figure~\ref{fig:dbchoose}).
A pop-up menu similar to the one shown in Figure~\ref{fig:path} should now be open.
\item In this pop-up menu, you can choose the storage location of your database on your local device from the \code{Save in} drop-down menu.
Enter the name of your database in the field \code{File Name} and confirm your choices by pressing the \code{Save} button (see Figure~\ref{fig:path}).
Now the pop-up menu will close.

\begin{figure}
  \centering
  \includegraphics[frame, width=0.75\linewidth]{04-4-path}
  \caption{Choose location of database window}
  \label{fig:path}
\end{figure}

\item \emph{Next, it is important that you confirm your choices again by pressing the \code{Apply} button} (see Figure~\ref{fig:apply}).
If you forget to press this button, you cannot create the database in the final step, because the program will report ``No database selected'' (see Figure~\ref{fig:nodb}).

\begin{figure}
  \includegraphics[frame, width=\linewidth]{04-5-apply}
  \caption{Apply database choice}
  \label{fig:apply}
\end{figure}
\end{enumerate}

If you employ only a single coder and do not want to change or supplement the preset standard research variables (\code{person}, \code{organization}, \code{concept}, \code{agreement}) or types of codeable statements (\code{Statement}, \code{Annotation}), you can now proceed directly to the \hyperref[sec:finalstep]{final step}.
If you use this manual as a beginner's tutorial for working with \dna, however, it is helpful to follow the steps outlined in Sections~\ref{sec:userman} and \ref{sec:stattype} in order to gain a better understanding of \dna's potential uses and functions.


\subsection{Creating and Using a Remote Database (MySQL)}\label{subsec:usingremote}
Before you can configure \dna\ for working with a remote MySQL database, it is necessary to execute at least three basic operations in MySQL (see Figure~\ref{fig:mysql}).\footnote{For a detailed introduction to database management with MySQL see \url{https://dev.mysql.com/doc/mysql-getting-started/en/}.
}
\begin{figure}
  \includegraphics[width=\linewidth]{04-6-mysql}
  \caption{Create MySQL database}
  \label{fig:mysql}
\end{figure}

\begin{enumerate}
\item You have to create a database on your MySQL server (\emph{usually with the command} \newline
\code{CREATE DATABASE DatabaseName}).
\item As you probably do not want to give all coders access to all other databases stored on your MySQL server, you should create distinct user profile(s) for the coding process of your \dna\ project.
Although \dna\ itself allows for
\hyperref[sec:userman]{managing multiple different coder roles}, we recommend creating a separate user profile for each individual coder, especially if they edit the content of your database simultaneously.
It is also advisable to protect access to your database with passwords, not only for security reasons, but also because \dna\ sometimes has problems signing in users without a password.
Consequently you would use the
\code{CREATE USER 'Username'@'\%' IDENTIFIED BY 'Password'}
  command.
Note that in this step you can also restrict a user's access to your database to a specific device by replacing \code{'\%'} with a particular host address, if necessary.
\item Finally, you have to equip the users with the necessary rights to edit your database.
In MySQL simply use
\code{GRANT ALL PRIVILEGES ON Databasename.* TO 'Username'@'\%'}, as it makes more sense to specify distinct user roles and rights directly in \dna\ (see Section~\ref{sec:userman}), where the options are tailored to discourse network-analytical coding purposes.
\end{enumerate}
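
The three operations above can be collected in a small script that prints the statements for each coder, so that they can be pasted into a MySQL client.
The following Python sketch is only an illustration: the function name \code{mysql\_setup\_statements} and the example names are hypothetical, the database name is written unquoted, and the sketch deliberately rejects names or passwords that it does not know how to escape safely.

```python
import re

# Conservative pattern for the names accepted by this sketch
# (letters, digits and underscores only).
_SIMPLE_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def mysql_setup_statements(database, username, password, host="%"):
    """Build the three set-up statements as plain strings."""
    for value in (database, username):
        if not _SIMPLE_NAME.match(value):
            raise ValueError(f"unsupported name: {value!r}")
    for value in (password, host):
        # No escaping is attempted here, so quotes and backslashes are refused.
        if "'" in value or "\\" in value:
            raise ValueError("quotes and backslashes are not supported")
    account = f"'{username}'@'{host}'"
    return [
        f"CREATE DATABASE {database};",
        f"CREATE USER {account} IDENTIFIED BY '{password}';",
        f"GRANT ALL PRIVILEGES ON {database}.* TO {account};",
    ]


# Hypothetical usage: print the statements for one coder.
# for statement in mysql_setup_statements("dna_project", "coder1", "secret"):
#     print(statement)
```

The \code{CREATE DATABASE} statement is only needed once; for every further coder, only the second and third statements have to be executed.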

Once the MySQL database is set up, select the option \code{Remote database on a server} in the first tab (\code{Database}) of the sidebar menu in \dna\ (see \fullref{sec:createnewdb}).
Enter the username and password created in the previous step in the fields \code{User} and \code{Password}, and specify the address of the database you want to connect to in the field \code{mysql://}.
If you want to access the database remotely from another device, you have to indicate the URL or IP address of your host server, the port (3306 by default, although it can be \href{https://dev.mysql.com/doc/refman/5.5/en/connecting.html}{configured manually}) and the name of your database in the format \code{Hostserveraddress:Port/Databasename}.
If you use DNA on the device hosting the database you can instead use the configuration shown in Figure~\ref{fig:localhost} (\code{localhost/Databasename}).
By clicking the button \code{Check} you can now check if DNA is able to connect to your database.
If this is successful, you will receive the message \code{Ok. Tables will be created} (see Figure~\ref{fig:localhost}); if not, DNA will report  \code{Error: Connection could not be established}.
In case of the latter, you should check the validity of your server address, username and password and---if necessary---repeat the steps outlined above.
It should be noted that, for security reasons, MySQL does not allow remote access with the ``root'' superuser profile in most cases.
Finally, as with creating a local .dna file, it is important that you confirm your choices again by pressing the \code{Apply} button (see Figure~\ref{fig:localhost}).
If you forget to press this button, you cannot create the database in the \hyperref[sec:finalstep]{final step}, because the program will report ``No database selected'' (see Figure~\ref{fig:nodb}).
\begin{figure}
  \includegraphics[width=\linewidth]{04-7-localhost}
  \caption{Connecting to local MySQL database}
  \label{fig:localhost}
\end{figure}
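
The two address formats for the \code{mysql://} field can be summarized in a short Python sketch.
The function name \code{dna\_mysql\_address} and the example host are hypothetical; the sketch only assembles the text to be typed into the field and does not attempt to connect to a server.

```python
DEFAULT_MYSQL_PORT = 3306  # MySQL's default port, as mentioned in the text


def dna_mysql_address(database, host="localhost", port=None):
    """Return the text for the mysql:// field: host[:port]/database."""
    if not database or "/" in database:
        raise ValueError("database name must be non-empty and must not contain '/'")
    if port is None:
        # Short form, e.g. when DNA runs on the machine hosting the database.
        return f"{host}/{database}"
    port = int(port)
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return f"{host}:{port}/{database}"


# Hypothetical usage:
# dna_mysql_address("Databasename")
# dna_mysql_address("Databasename", host="db.example.org", port=DEFAULT_MYSQL_PORT)
```

The first call yields the short local form, the second the full remote form with host, port and database name.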


\section{User Management: Multiple Coders and Permissions}\label{sec:userman}
This second step of preparing your DNA workspace allows you to generate multiple user identities with different sets of rights for different coders.
You can thus specify which parts of the dataset each coder can see or edit and thereby pre-structure your coding and research process.
To do so, click on the second tab \code{Coder} in the sidebar of the \code{Create new database} menu (see Figure~\ref{fig:addcoder}).
\begin{figure}
  \includegraphics[width=\linewidth]{04-8-addcoder}
  \caption{Adding a second coder to the database}
  \label{fig:addcoder}
\end{figure}

In the main window (see Figure~\ref{fig:addcoder}) you can now see a list with all coders and how many of the 12 possible actions they are permitted to perform.
Now you can either add a new user profile by clicking the \code{Add} button (see Figure~\ref{fig:addcoder}) or select an existing coder and adjust her/his user rights by clicking on the user and then on the \code{Edit} button (see Figure~\ref{fig:editcoder}).
Both options will open the pop-up menu shown in Figure~\ref{fig:coderdetail}.
\begin{figure}
  \includegraphics[width=\linewidth]{04-9-coderdetail}
  \caption{Configuring coder permissions}
  \label{fig:coderdetail}
\end{figure}

This pop-up menu allows you to configure an individual profile for each coder in three simple steps:
\begin{enumerate}
 \item You can choose the \emph{colour} for the coder (see Figure~\ref{fig:coderdetail}, step 1).
It is recommended to choose different and, if possible, clearly distinguishable colours for each coder, because this lets you see at a glance which user coded which statement: every coded statement is marked in the individual colour of its respective coder (see middle column of Figure~\ref{fig:chacoder}).
 \item You can enter the preferred name of each coder in the field \code{Name}.
If possible with respect to data protection rules, it is recommended to use the real names of the coders.
This makes it easier for them to select their profile (in the upper left of the main program window) the first time they start the program (see Figure~\ref{fig:chacoder}).
 \item The final step allows you to configure the \emph{permissions} of each coder individually by (de)selecting the respective rights via a click (see Figure~\ref{fig:coderdetail}, step 3).
By default, each new user has all 12 configurable permissions.
Which parts of the dataset an individual coder should be able to see or edit depends on your coding process.
For better orientation, a few practical implications of the 12 configurable permissions are listed below:

  \begin{description}
  \item[add documents] The user can add new documents (i.\,e., raw data) manually (via copy and paste or retyping) to the database $\Rightarrow$ user has (also) a research function.
  \item[import documents] The user can import new documents from other sources like~.txt or other~.dna files to the database or recode the metadata of multiple documents $\Rightarrow$ user has (also) a research function.
  \item[delete documents] The user can delete documents from the database or dataset.
This option requires at least the permission \code{view others' documents} if the user has an organizing or editing function (structuring the database for coding by other users), or the permissions \code{add documents} and \code{add statements} if the coder determines her/his own codes and organizes her/his own set of data.
  \item[edit documents] The user can edit her/his own documents (i.\,e., raw data),  but not necessarily the codings in these documents that were made by other users---which would require the permission \code{edit others' statements}---or the documents uploaded by other users---which requires the permission \code{edit others' documents}.
This option requires at least the permission \code{add documents} or \code{import documents} and should be selected if the user determines her/his own codes and organizes her/his own set of data or acts as a researcher for the other coders.
  \item[view others' documents] The user can view the documents uploaded by other users.
This option is necessary for a collaborative coding process in which only some of the users select and upload the raw data (i.\,e., documents) for all other users.
The option should not be selected if each coder comes up with her/his own codes and organizes her/his own set of data.
  \item[edit others' documents] The user can edit the documents uploaded by other users.
This option requires at least the other permission \code{view others' documents} and should be selected if a user organizes or edits the raw data provided by other users.
  \item[add statements] The coder actually codes the data by creating and editing statements.
If only some of the users select and upload the raw data, this option requires the additional permission \code{view others' documents}.
If the coder suggests her/his own codes and organizes her/his own set of data, this option requires either the additional permission \code{add documents} or \code{import documents}.
  \item[view others' statements] The coder can view the statements coded by other users.
For example, the coder ``DNA User'' would not see the yellow statement of the coder ``Admin'' in Figure~\ref{fig:chacoder} if this option were deselected for her/his user role.
This option should be deselected if you want to establish a blind coding process.
  \item[edit others' statements] The coder can edit or correct the statements coded by other users.
This option requires at least the other permission \code{view others' statements} and should only be selected for few users with an \emph{organizing},  \emph{controlling} or \emph{editing function}.
  \item[add coders] The user can add new coders (see Section~\ref{sec:userman}).
This option should only be selected for few users with an \emph{organizing} function.
  \item[edit statement types] The user can change or complement the variables of interest (see Section~\ref{sec:stattype}).
This option should only be selected for very few users or the researchers themselves because possible adjustment of these variables is usually only necessary in cases when the research design and/or research questions change fundamentally.
  \item[edit regex settings] The user can specify keywords which are highlighted in the text, along with a text color (see Section~\ref{sec:regex}).
For example, in Figure~\ref{fig:chacoder} the word \code{colors} is highlighted in the raw data text (middle column), because it was specified as a keyword in the \emph{regex highlighter sidebar} in the bottom left of the DNA window.
If a user does not have the right to edit the regex setting, the buttons \code{Add} and \code{Remove} in this highlighter would be hidden, but the keyword would nevertheless be visibly highlighted in the text and listed in the regex highlighter sidebar.
Thus, if you specify a distinct set of theory-based keywords in advance in order to render the coding procedure semi-automatic, you should not enable this option, or enable it only for \emph{few users}, as the respective coder could change the keywords.
However, if you do not have a theoretically relevant set of keywords in advance or just specify them as an aid for your coders, you can allow them to formulate such keywords by themselves.
 \end{description}

 Please keep in mind that every user can see and switch to other user identities, either accidentally or because of non-compliance, as s/he has to select her/his role the first time s/he starts the program and can change her/his role at any time (see above and Figure~\ref{fig:chacoder}).
\end{enumerate}
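
When planning coder profiles, it can help to write down the dependencies between permissions mentioned in the list above and to check a planned profile against them.
The following Python sketch is a planning aid only: the names \code{REQUIREMENTS} and \code{unmet\_requirements} are hypothetical, the table merely restates the recommendations given above, and nothing is implied about how \dna\ handles permissions internally.

```python
# Each permission maps to a list of alternatives; one alternative
# (a set of other permissions) has to be fully granted.
REQUIREMENTS = {
    "delete documents": [
        {"view others' documents"},
        {"add documents", "add statements"},
    ],
    "edit documents": [{"add documents"}, {"import documents"}],
    "edit others' documents": [{"view others' documents"}],
    "edit others' statements": [{"view others' statements"}],
}


def unmet_requirements(granted):
    """Return the granted permissions whose prerequisites are all missing."""
    granted = set(granted)
    problems = {}
    for permission, alternatives in REQUIREMENTS.items():
        if permission not in granted:
            continue
        # A permission is fine if at least one alternative is a subset
        # of the permissions that were granted.
        if not any(alternative <= granted for alternative in alternatives):
            problems[permission] = alternatives
    return problems


# Hypothetical usage: a profile that may correct others' statements
# but cannot see them is reported as inconsistent.
# unmet_requirements({"add statements", "edit others' statements"})
```

An empty result means that the planned profile does not contradict any of the dependencies listed in the table.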

\begin{figure}
  \includegraphics[width=\linewidth]{04-10-chacoder}
  \caption{Change coder identity}
  \label{fig:chacoder}
\end{figure}

\begin{figure}
  \includegraphics[width=\linewidth]{04-11-edicoder}
  \caption{Edit coder details}
  \label{fig:editcoder}
\end{figure}

Finally, approve your choices by clicking the \code{OK} button (see Figure~\ref{fig:coderdetail}, step 4).
You can change these settings later, either in the ``new database'' menu by selecting the respective user and clicking the \code{Edit} button (see Figure~\ref{fig:chacoder}) or via the coder settings in the main menu.
To do so, simply select a coder from the drop-down menu in the window at the top left of the main menu and then click the pencil icon underneath.
The same menu as depicted in Figure~\ref{fig:chacoder} will open up again.

\section{Statement Types and Variables} \label{sec:stattype}
Clicking on the third tab in the sidebar of the ``Create new database'' menu---\code{Statement Types} (see Figure~\ref{fig:stattype})---opens a menu, which allows you to adjust or supplement either the variables or the types of statements, which your coders derive from the raw data.
\begin{figure}
  \includegraphics[width=0.9\linewidth]{04-12-stattype}
  \caption{Edit Statement Types}
  \label{fig:stattype}
\end{figure}

\subsection{Adjusting the Variables of Interest} \label{subsec:adjusvarint}
The statement type \code{DNA Statement} represents a text portion of your raw data, where an actor reveals her/his opinion/belief/etc.
about an issue.
Thus, the main task of your coder(s) is to identify such text portions and extract the relevant data about the actor or her/his opinion/belief/etc.
Your research question or theory should not only dictate what kind of information should be coded as statements, but also which relevant variables of this information should be captured by the coder.
As you can see in the ``Statement Types'' menu, \dna's default configuration allows capturing four variables.
Selecting \code{DNA Statement} and clicking on the button \code{Edit} (see Figure~\ref{fig:stattype}) opens a pop-up window (see Figure~\ref{fig:statdetail}), which reveals the nature of these four preconfigured variables, along whose lines the coders can collect information:

\begin{figure}
  \centering
  \includegraphics{04-13-statdetail}
  \caption{Edit Statement Type details}
  \label{fig:statdetail}
\end{figure}

\begin{itemize}
  \item the \emph{person} who makes the statement.
  \item the \emph{organization} the speaker is affiliated with.
  \item the \emph{concept} (opinion/belief/etc.) which is raised by the actor.
  \item a dummy variable indicating whether the actor \emph{agrees} with the concept or not.
\end{itemize}

Furthermore, the pop-up window depicted in Figure~\ref{fig:statdetail} shows that each variable is assigned to a specific data type: While \code{person}, \code{organization} and \code{concept}, as nominal variables, are coded as short text, \code{agreement}, as a dichotomous variable, is coded as a
\href{https://en.wikipedia.org/wiki/Boolean_data_type#Python.2C_Ruby.2C_and_JavaScript}{boolean data type},
which accordingly only allows for two values (either agreement or non-agreement).
Neither the data type nor the name of a variable can be changed directly.
However, by selecting a variable and clicking on the \emph{trash symbol} (on the right side of the \code{Add Variable} button, Figure~\ref{fig:statdetail}, step 4), you can delete a variable and subsequently replace it with a new one.
Generating a new variable---either to replace one of the preconfigured variables or because you are interested in an additional or a different set of variables---is possible in five simple steps:

\begin{enumerate}
\item You have to \emph{select an existing variable} in order to activate the variable menu (see 1, Figure~\ref{fig:statdetail}).
\item Now you can enter the \emph{name} of the new variable in the \emph{text field} at the bottom of the pop-up window (see 2, Figure~\ref{fig:statdetail}).
For example, in Figure~\ref{fig:statdetail} we are interested in collecting the age of the person who makes the statement.
Please note that \dna\ does not allow spaces in variable names.
Putting a space in the variable name will disable the \code{Add Variable} button necessary for step 4.
\item Now you can choose the \emph{data type} of your variable by clicking on one of the four options.
In our example, we choose the option \code{integer}, as the age of a person is neither a nominal nor a dichotomous variable, but an
\href{https://en.wikipedia.org/wiki/Integer_(computer_science)}{integer number}
%Should be common sense. Is link needed?%
(see Figure~\ref{fig:statdetail}, step 3).
\item You have to click on the \code{Add Variable} button, which has the form of a \emph{green plus symbol} (see 4, Figure~\ref{fig:statdetail}).
If this button is disabled, you probably did not select an existing variable (step 1) or have a space in your variable name (see step 2).
\item Click the \code{OK} button to confirm your choices (see Figure~\ref{fig:statdetail}, step 5).
\end{enumerate}

Please note that, for the statement type ``DNA Statement'', you should only specify variables in which you have an actual research interest and which accordingly have to be coded for all statements by all coders.
If you are interested in additional and optional information about some statements, you can specify them as variables of the other preconfigured statement type---\emph{``Annotation''}.

\subsection{Adjusting the Statement Types}
There are very few research scenarios in which it is necessary to complement the two existing statement types with further ones or to adjust the type ``DNA Statement''.
One such scenario is the study of two parallel yet different research questions that employ the same dataset \emph{and} the same coders at the same time.
In this case, you could first rename the statement type ``DNA Statement'' by selecting it from the statement type menu, clicking the \code{Edit} button (see Figure~\ref{fig:stattype}), entering the new name (in this case: ``Statement for Research Project 1'') in the text field on top of the pop-up window (see Figure~\ref{fig:statdetail}) and pressing the \code{OK} button (see 5, Figure~\ref{fig:statdetail}).
Subsequently you would open a new pop-up window by clicking on the \code{Add} button in the statement type menu (left button in Figure~\ref{fig:stattype}).
Then name the new statement type (in this case: ``Statement for Research Project 2'') in the text field on top of the pop-up window and choose a color (different from the other type) by clicking on the colored button next to this text field.
Then you also need to specify the relevant variables, following the procedure described in Section~\ref{subsec:adjusvarint}.
However, please evaluate carefully whether it is really necessary for your second research interest to specify a second statement type, or whether it would be possible to either conceptualize it as a variable of the existing statement type or study it sequentially or with a different set of coders (and therefore in a different \dna\ dataset).
\emph{More than two statement types (besides ``Statement'' and ``Annotation'') can confuse the coders and therefore compromise the validity of the coding procedure}.

\section{Final Step: Approving your Workspace and Creating the DNA File} \label{sec:finalstep}
Finally, clicking on the \code{Summary} tab in the sidebar of the ``Create new database'' menu provides you with a summary of your choices in respect to the configuration of your coding process (see Figure~\ref{fig:summary}).
After checking each of the three pieces of information, you can now create your database by clicking on the \code{Create database} button.
If this button is disabled and you get the error ``No database selected'' (see Figure~\ref{fig:nodb}), you probably forgot to click the \code{Apply} button after specifying your database (see Section~\ref{subsec:createlocal}, step 3).
After creating the database, the new database will open in the main \dna\ window (see Figure~\ref{fig:newdb}) and you can proceed to uploading and organizing the raw data.
\begin{figure}
  \includegraphics[width=\linewidth]{04-14-summary}
  \caption{Summary of your about to be created \dna\ database}
  \label{fig:summary}
\end{figure}

\begin{figure}
  \includegraphics[width=\linewidth]{04-15-nodb}
  \caption{No database selected (e.\,g., if choice was not applied)}
  \label{fig:nodb}
\end{figure}


\chapter{Importing and Organizing your Raw Data} \label{chp:dna-import}
\chapterauthor{Felix Rolf Bossner and Johannes Gruber}
\FloatBarrier
This chapter describes how to upload and organize your research project's raw data, i.\,e., the text files (newspaper articles, press releases etc.) containing the uncoded statements, in \dna.
First, it is laid out how to open an existing database, either locally or from a remote location.
Then you will learn how to import new documents into \dna, either by importing one document at a time or by selecting multiple documents for import.
Finally, we explain how you can organize the documents in your database and how you can change your documents' metadata.

\section{Opening an Existing DNA Database}\label{chp:dna-open}
First of all, you have to choose the \dna\ database into which you want to upload and process your data.
To open a \dna\ database, simply follow the steps depicted in Figure~\ref{fig:opendb}: First, click on the index tab \code{File} and select the option \code{Open DNA database} (see Figure~\ref{fig:opendb}, step 1).
As a result, a pop-up window will appear, which allows you to choose between opening a \code{Local .dna file} or a \code{Remote database on a server}.
If your database is stored on a remote server, you should choose the second option and repeat the procedure outlined in \fullref{subsec:usingremote}.
If your dataset is stored in a folder on your local PC or device, you can proceed with the preset option and click on the button \code{Browse} (see Figure~\ref{fig:opendb}, step 2).
This opens a further pop-up window, in which you can find your database by choosing its storage location from the \code{Save in} drop-down menu (see step 3) and selecting the respective database (see step 4).
Then click on the button \code{Open} both in the pop-up and in the ``Open existing database...'' window (see steps 5 and 6).
\begin{figure}
  \includegraphics[width=\linewidth]{05-1-opendb}
  \caption{Open \dna\ database}
  \label{fig:opendb}
\end{figure}

\section{Importing Documents (Raw Data)}\label{sec:importdoc}
There are three
%four
different (partly semi-automatic) ways to upload your raw data and related descriptive information (title, date, author, source, section and type of document) into \dna: importing single documents manually via copy and paste, importing multiple documents semi-automatically from text files, and importing documents from other \dna\ databases%
%and using \rdna\ to import data which is already available in R (see Section~\ref{subsec:importr})%
.
All three
%four
will be explained in detail in this section.

\subsection{Importing Single Documents Manually via Copy and Paste}
The most basic way to import data to \dna\ requires you to manually copy and paste the content and the descriptive information for each of your documents into the text fields of a pop-up window, which you open by clicking on the index tab \code{Documents} and selecting the option \code{Add new document} (see Figure~\ref{fig:adddoc}).
This window has eight text boxes, in which you can enter information from and about your source data (see Figure~\ref{fig:adddoc}):

\begin{itemize}
  \item  The field \code{title} is mandatory and may include any kind of information, for instance a unique ID if you plan to collect additional information about the articles in a separate database.
Duplicate article titles are not allowed.
  \item The field \code{date} is also mandatory and preset to the current date and time.
You can change it by either clicking on the year, month, day or time and adjusting the respective value via the arrows on the right or by manually entering the date in the format \code{YYYY-MM-DD hh:mm:ss}.
Please make sure you enter the date correctly because otherwise the algorithms for longitudinal data (see Section~\ref{sec:longi}) will not work properly.
  \item The fields \code{author}, \code{source}, \code{section} and \code{type} are optional, but this additional information can help you to efficiently organize your data and ensure the reproducibility, transparency and future usage of your research project.
You can enter this information either manually or select an author, source, section or type you specified for a previously added document from the drop-down menu, which appears when you click on the downward arrow button on the left of the respective field.
  \item To insert the content of your document, copy your article from a website or any other text source and paste it in the \emph{text field (largest field at the bottom of the pop-up window)}.
Single line breaks are automatically removed, while double line breaks (paragraph breaks) are preserved.
Some escape sequences and special characters are automatically removed when text is inserted.
  \item If you want to add further meta information to your document, which does not fit the preset categories, you can use the field \code{notes}.
\end{itemize}
Finally---after checking your specifications---you can import the document to \dna\ by clicking the \code{Add} button.
\begin{figure}
  \includegraphics[frame, width=\linewidth]{05-2-adddoc}
  \caption{Add new document}
  \label{fig:adddoc}
\end{figure}
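
If you prepare dates outside of \dna\ before typing or pasting them into the \code{date} field, it can be useful to check that they follow the format \code{YYYY-MM-DD hh:mm:ss}.
The following Python sketch shows one way to do this with the standard library; the function names are hypothetical, and the sketch is independent of how \dna\ itself parses dates.

```python
from datetime import datetime

# strptime/strftime pattern corresponding to YYYY-MM-DD hh:mm:ss
DNA_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_document_date(text):
    """Parse a date string; raises ValueError if the format does not fit."""
    return datetime.strptime(text.strip(), DNA_DATE_FORMAT)


def format_document_date(moment):
    """Format a datetime object in the YYYY-MM-DD hh:mm:ss layout."""
    return moment.strftime(DNA_DATE_FORMAT)


# Hypothetical usage:
# format_document_date(datetime(2014, 3, 31, 12, 0, 0))
# parse_document_date("31.03.2014")  # raises ValueError (wrong format)
```

Dates in other layouts, for instance \code{DD.MM.YYYY}, are rejected by \code{parse\_document\_date} and have to be converted first.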


\subsection{Importing Multiple Documents Semi-automatically from Text Files}\label{subsec:multiimport}
If you want to analyze a greater number of articles, it quickly becomes tedious to manually copy and paste each document and its meta data.
This is why \dna\ also offers a semi-automatic way to upload multiple documents and their relevant meta data (author, date, source, type) at the same time.

\subsubsection{Downloading and Preparing your Raw Data}
This way of importing raw data to \dna\ requires that you save all documents as \emph{separate \code{.txt} files} (one file for each article) \emph{in a common folder}.
Please note that you have to use the \code{.txt} format for saving your data, as \dna\ cannot import \code{.doc} or \code{.pdf} files.\footnote{You can, however, save Word documents as .txt files or use an online converter to transform PDFs into .txt files.
Note that in both cases you need to make sure that the .txt file is saved with UTF-8 encoding.} In case you use the newspaper database of LexisNexis, which is available through many university libraries, for finding and retrieving your raw data, please make sure that you download all documents separately (by selecting the individual document before clicking the download button, see Figure~\ref{fig:nexis}, steps 1--2) and choose the document format \code{Text} (under \code{Format Options} in the Download pop-up menu, see Figure~\ref{fig:nexis}, steps 3--4) before downloading the data (see Figure~\ref{fig:nexis}, step 5).\footnote{If you use \rdna, it will soon also be possible to import LexisNexis data into \dna\ via \rdna\ and a new \R\ package called \href{https://github.com/JBGruber/LexisNexisTools}{LexisNexisTools} \citep{gruber2018lexis}.}
\begin{figure}
  \includegraphics[frame, width=\linewidth]{05-3-nexis}
  \caption{Downloading files from the LexisNexis newspaper archive}
  \label{fig:nexis}
\end{figure}
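
If a text file was saved in a different encoding, it can be re-saved as UTF-8 before the import.
The following Python sketch illustrates this for a single file.
The function name \code{save\_as\_utf8} is hypothetical, and the default source encoding \code{cp1252} is only an assumption that is common for files created on Windows; you need to know the actual encoding of your files.

```python
from pathlib import Path


def save_as_utf8(source, target, source_encoding="cp1252"):
    """Read `source` in `source_encoding` and write it to `target` as UTF-8."""
    # The source encoding is an assumption; a wrong value garbles the text.
    text = Path(source).read_text(encoding=source_encoding)
    target_path = Path(target)
    target_path.write_text(text, encoding="utf-8")
    return target_path


# Hypothetical usage:
# save_as_utf8("article-original.txt", "article-utf8.txt")
```

Writing to a separate target file keeps the original untouched, so that a wrong guess about the source encoding can be corrected afterwards.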

If you want to use the preset regex configurations (\hyperref[subsubsec:adjregex]{in contrast to adjusting them}) for automatically detecting and uploading the meta data of your documents, you should use a \emph{file name} in the format \code{DD.MM.YYYY - Author - Source - TYPE.txt} \emph{with spaces before and after the hyphens}, where \code{DD.MM.YYYY} is the date on which the article was published.
While \code{Author} and \code{Source} do not require a special format or length (e.\,g., you can use the first and/or last name of the author), the type of the document must always be indicated by capital letters.
For example, the file name of the
article \url{http://spon.de/aeclD}, which is used as an example here, would have the format \code{31.03.2014 - Ralf Neukirch - SPON International - DIGITALRESOURCE.txt}.
Please note that plain text files are sometimes saved as \code{.TXT} instead of \code{.txt} files.
While this is technically the same, it can cause problems while importing multiple text files.
If this is the case, you have to either change the preset Regex configuration or correct the \code{.txt} suffix manually in the file name(s).
Otherwise the automatic detection of your documents' meta data will not work.
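
Before importing a larger folder, the file names can be checked against this naming scheme with a short script.
The following Python sketch is an illustration under stated assumptions: the pattern \code{FILE\_NAME} is a hypothetical regular expression that mirrors the scheme described above and is not the regex used by \dna\ itself, and the helper names \code{parse\_file\_name} and \code{fix\_suffix\_case} are made up for this example.

```python
import re
from datetime import datetime
from pathlib import Path

# Hypothetical pattern for "DD.MM.YYYY - Author - Source - TYPE.txt";
# it mirrors the naming scheme but is not DNA's own regex.
FILE_NAME = re.compile(
    r"^(?P<date>\d{2}\.\d{2}\.\d{4}) - (?P<author>.+?) - "
    r"(?P<source>.+?) - (?P<type>[A-Z]+)\.txt$"
)


def parse_file_name(name):
    """Return the meta data in `name`, or None if it does not fit the scheme."""
    match = FILE_NAME.match(name)
    if match is None:
        return None
    try:
        date = datetime.strptime(match["date"], "%d.%m.%Y").date()
    except ValueError:
        return None  # impossible calendar date such as 31.02.2014
    return {
        "date": date,
        "author": match["author"],
        "source": match["source"],
        "type": match["type"],
    }


def fix_suffix_case(folder):
    """Rename files such as article.TXT to article.txt; return the new names."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() == ".txt" and path.suffix != ".txt":
            target = path.with_suffix(".txt")
            # Do not overwrite a different file that already has the target name.
            if target.exists() and not target.samefile(path):
                continue
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

Calling \code{parse\_file\_name} on every file name in the folder shows which files would need to be renamed; a result of \code{None} means that the name does not follow the scheme, for example because of a missing space around a hyphen or a suffix written as \code{.TXT}.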

\subsubsection{Importing your Raw Data into DNA}
If you prepared your data adequately, you can retrieve the documents and the relevant additional information in four simple steps (see Figure~\ref{fig:importtxt}):

\begin{enumerate}
  \item Click on the index tab \code{Documents} and select the option \code{Import text files} (see Figure~\ref{fig:importtxt}, step 1).
As a result, a new window will open, in which you press the button \code{Select folder} (see step 2).
This will open a further pop-up menu.
Here, you have to select the \emph{folder} in which you saved the text files of your raw data from the \emph{\code{Look in} drop-down menu} (see step 3) and click the button \code{Open} (see step 4).
 
\begin{figure}
  \includegraphics[width=\linewidth]{05-4-importtxt}
  \caption{Import text files}
  \label{fig:importtxt}
\end{figure}
 
  \item Now all documents stored in the respective folder should be listed in the main window of the \code{Import text files...} pop-up (see Figure~\ref{fig:txtfls}).
If this isn't the case, please check whether your documents are saved in the right file format (\code{.txt}).
To check whether \dna\ is able to automatically identify your documents' meta data, select one of the documents and click on the \code{Refresh} button (see Figure~\ref{fig:txtfls}).
If you specified the file names correctly, you can now see the respective meta data of the selected document in the fields \code{Title}, \code{Author}, \code{Source}, \code{Type} and \code{Date} of the \code{Preview} section at the bottom right of the ``Import text files'' window (see Figure~\ref{fig:txtfls}).
  \item If you want to adjust or amend the meta data manually, just select the document, \emph{uncheck} the box \code{Regex} of the field you want to edit and enter the new or additional information in the field on the left.
Then click the \code{Refresh} button again to check whether your changes were accepted.
  \item Finally, click on the button \code{Import files} to import all documents of the respective folder into your \dna\ database (you do not need to select each document for import).
  \begin{figure}
    \includegraphics[width=\linewidth]{05-5-txtfls}
    \caption{Meta data preview in the ``Import text files'' window}
    \label{fig:txtfls}
  \end{figure}
 
\end{enumerate}

\subsubsection{Adjusting the Regex Configuration for automatic identification of meta data}\label{subsubsec:adjregex}
The previous steps assumed that you use the preset configuration of \dna\ to detect and upload the meta data (Title, Author, Source, Type, Date) of your documents automatically into your database.
However, if you are interested in automatically importing additional information about your source data (in the fields \code{Section} or \code{Notes}) or if your file names depart from the naming system laid out here (but nevertheless contain all relevant information in a systematic order), \dna\ allows you to change, adjust or amend the pattern through which the meta data about your documents is derived from the file names.
The rules on which the ``translation'' of file names into meta data is based are formulated in the
\href{https://en.wikipedia.org/wiki/Regular_expression}{regular expression (in short: regex) syntax} and can be edited for each kind of information (Title, Author, Source, Section, Type, Notes, Date) in the field \code{Pattern} on the bottom left of the ``Import text files...'' window (see Figure~\ref{fig:txtfls}).
If you want to amend or adjust these settings, it is recommended to use a regex cheat sheet (see e.\,g.,
\href{https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/}{cheatography.com}
or
\href{http://www.txt2re.com/index-perl.php3?s=31.03.2014\%20-\%20Ralf\%20Neukirch\%20-\%20SPON\%20International\%20-\%20DIGITALRESOURCE.txt&-94&-102&80&77&75&81&-95&79&76&78&82&13&14&5&3}{this regex ``translator''}).
As further support, Figure~\ref{fig:regex} translates the preset regular expressions of the \dna\ \code{Import text files...} option.
\begin{figure}
  \includegraphics[width=\linewidth]{05-6-regex}
  \caption{Preset regular expressions of the text file import explained}
  \label{fig:regex}
\end{figure}

\subsection{Importing Documents from Other \dna\ Databases}\label{subsec:merge}
You can also import documents from other \dna\ databases.
This function is particularly relevant in two scenarios. First, if you want to use not only the \emph{raw data but also the coded statements} of an already finished research project, this function allows you to import both.
Second, it is useful if \emph{more than one person works on the same project at the same time} and you did not use multiple user roles (see Section~\ref{sec:userman}) to enable your coders to work on the same remote database.
In this second scenario, you should use the function to merge the codings, as it is usually difficult to merge the files manually later on.
It also helps you to avoid trouble with \emph{duplicate statement IDs} and article names, as \dna\ takes care of such duplicates automatically.

Make sure that you \emph{know which version of DNA} (DNA 2.0 or older) was used to create and edit the database from which you want to import data \emph{before} using the ``Import from DNA'' function.
If you use this manual as a beginner's tutorial for working with \dna, please download the file \code{sample.dna} from the \dna\ release page (\url{https://github.com/leifeld/dna/releases}).
This file contains a small selection of documents and statements from a larger project about congressional hearings on climate change, employed in the project described in \citet{fisher2013mapping, fisher2013where}.

To import documents (and the included coded statements), click on the index tab \code{Documents} and select the option \code{Import from DNA 2.0 file} if \dna\ 2.0 was used to create and edit the database.
As the internal structure of \code{.dna} files has changed significantly since version 1.31, databases created with an older version of \dna\ need to be imported using the separate option \code{Import from DNA 1.31 file} (see Figure~\ref{fig:importdna}, step 1).
As a result of either step, a further pop-up menu will open (see Figure~\ref{fig:importdna}).
In this window, select the folder in which the \code{.dna} file is stored from the \code{Look in} drop-down menu (see step 2) and \emph{select the respective \code{.dna} file} (see step 3).
Click the button \code{Open} (see step 4) to open the menu depicted in Figure~\ref{fig:importstat}.
\begin{figure}
  \includegraphics[width=\linewidth]{05-7-importdna}
  \caption{Import a \dna\ 2.0-database}
  \label{fig:importdna}
\end{figure}

\begin{figure}
  \centering
  \includegraphics[width=0.6\linewidth]{05-8-importstat}
  \caption{Import Statements menu}
  \label{fig:importstat}
\end{figure}

In this menu, you can select which documents (and their coded statements) from the original \dna\ database you want to import into your database, either by manually checking or unchecking the boxes to the left of the document titles or by using the function ``Keyword filter''.
This function is particularly helpful if you want to import only a few documents with a specific common characteristic (author, topic) from a very large dataset.
Clicking on the button \code{Keyword filter...} (see left button in Figure~\ref{fig:importstat}) opens a new pop-up window, in which you can enter a specific search term.
For example, if you downloaded and opened the \href{https://github.com/leifeld/dna/raw/master/manual/sample.dna}{\code{sample.dna}} file, you can select all congressional hearings of NGO representatives by \emph{entering the keyword ``NGO''} in the text field and pressing the button \code{OK} in the ``Keyword filter'' pop-up window (see Figure~\ref{fig:importstat}).
Now only the boxes of the three documents, which contain the hearings of NGO representatives Kateri Callahan, David Hamilton and Nayak Navin, should be checked, while the other boxes are unchecked.
The ``Keyword filter'' function is based on the same regex syntax described in \fullref{subsubsec:adjregex}.
This means that you can also use more specific regular expressions (see Figure~\ref{fig:regex} or \href{https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/}{regex cheatsheet}) to select certain articles.
For example, if you enter \code{\^{}N} in the ``Keyword filter'', \dna\ will select all articles starting with a capital N.
If you want to undo your selections, you can also automatically select or unselect all articles by pressing the button \code{(Un)select all} in the middle of the ``Import statements'' window (see Figure~\ref{fig:importstat}).
Pressing the right button \code{Import selected} in the same window imports all documents with a checked box (and the respective coded statements) in your \dna\ database (see Figure~\ref{fig:importstat}).
If you use this manual as a beginner's tutorial for working with \dna, you should try importing all documents and the respective statements from the file \href{https://github.com/leifeld/dna/raw/master/manual/sample.dna}{\code{sample.dna}} into your database.
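
The effect of a keyword or regular expression on the document selection can be tried out outside of \dna.
The following Python sketch assumes that the filter checks whether the expression is found anywhere in the document title; this is an assumption based on the two examples above, not a description of the \dna\ source code.
The function \texttt{keyword\_filter} and all titles except the first one are hypothetical.

\begin{verbatim}
```python
import re

# Only the first title appears in this manual; the others are hypothetical.
titles = [
    "109-1: Callahan, Kateri-NGO-Y",
    "Hypothetical title A-GOV-Y",
    "Nuclear hearing (hypothetical)-BUS-N",
]


def keyword_filter(titles, pattern):
    """Return the titles in which the regular expression is found."""
    regex = re.compile(pattern)
    return [title for title in titles if regex.search(title)]


print(keyword_filter(titles, "NGO"))  # plain search term
print(keyword_filter(titles, "^N"))   # titles starting with a capital N
```
\end{verbatim}

A plain search term such as \texttt{NGO} is itself a valid regular expression, which is why both kinds of input can be handled by the same function.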

%To DO
%\subsection{Importing Documents using \R\}\label{subsec:importr}


\section{Organizing Documents (Raw Data) }
\subsection{Deleting and Navigating Through Documents}
All your imported documents are listed in the upper middle table of the \dna\ main window.
If you click on an article, its corresponding text (i.\,e., the speech) will be displayed in the text area below the document table.
By clicking on, for example, the entry \code{109-1: Callahan, Kateri-NGO-Y} you open the speech of Kateri Callahan, a representative of the Alliance to Save Energy.
You can adjust the size of the document table (by clicking on the bar above the text area and moving it vertically with your cursor) or its columns (by clicking on the edge of the column and moving it horizontally with your cursor).
You can also customize the meta information which is displayed in the document table: Just right-click on any document and use the context menu that appears to (un-)check the boxes of the information you (don't) want to be displayed (see Figure~\ref{fig:batchrecode}, step 1).
A structured (and customised) overview of your raw data is essential for detecting missing information and thus efficiently controlling, organizing and coding your data.
For example, if you display the meta information ``Type'' (by checking the respective box in the context menu), you can see that the type of all documents from the \code{sample.dna} file is not listed.
\begin{figure}
  \includegraphics[width=\linewidth]{05-9-batchrecode}
  \caption{Customising the document table and editing meta data}
  \label{fig:batchrecode}
\end{figure}

The same context menu can be used to delete documents from your database by \emph{selecting the documents} you want to delete (pressing and holding the \code{Ctrl} key for selecting multiple documents), opening the context menu with a \emph{right click} and choosing the option \code{Delete selected documents}.

\subsection{Editing the Documents' Metadata (Author, Time etc...)}
\dna\ allows you to edit, delete or complement the descriptive information related to your raw data (title, date, author, source, section and type of document).
Similar to the procedures outlined in Section~\ref{sec:importdoc}, there is a manual as well as a semi-automatic way to adjust the meta data of your documents.

\subsubsection{Editing the documents' meta data manually}
The most basic way to edit your documents' metadata is to \emph{select the document} whose information you want to edit (by left-clicking on it) and to adjust the values in the \code{Document properties} submenu on the middle left of the \dna\ main window (see Figure~\ref{fig:batchrecode}, step 2), either by manually typing in the relevant information or by selecting an already specified author, source, section or type from the drop-down menu on the right of the respective meta field.
For example, in Figure~\ref{fig:batchrecode} (step 2) Kateri Callahan's speech was selected, and the value ``NGO'' (for Non-Governmental Organisation) was manually specified as ``Type of document'' by entering it in the field \code{Type} of the ``Document properties'' submenu.
Do not forget to press the button \code{Save} in the submenu (see Figure~\ref{fig:batchrecode}, step 2) to confirm your edits.

Please note that you can manually edit the meta data of only \emph{one document at a time}.
If you try to select multiple documents for editing, the ``Document properties'' submenu will disappear, returning ``(No document or permission)''.

\subsubsection{Editing the documents' meta data semi-automatically}
However, if you want to adjust the meta data of a greater number of articles, it quickly becomes tedious to manually edit information about each document.
This is why \dna\ also offers a semi-automatic way to edit, delete or complement the descriptive information related to your documents.
In order to edit your documents' meta data semi-automatically, click on the index tab \code{Documents} and select the option \code{Batch-recode meta-data} (see Figure~\ref{fig:batchrecode}, step 3).
As a result, a pop-up window similar to Figure~\ref{fig:recodewin} will open.
In the upper half of this pop-up window you find nine fields, which can be configured in order to adjust the meta data for \emph{multiple documents at once}:
\begin{figure}
  \centering
  \includegraphics[frame, width=0.75\linewidth]{05-10-recodewin}
  \caption{Meta information recode window}
  \label{fig:recodewin}
\end{figure}

\begin{itemize}
  \item The field \code{Target field:} specifies which kind of meta information (i.\,e., title, author, source, section, type, notes) should be adjusted. You choose the respective meta data category from the drop-down menu (which you open by clicking the arrow on the right of the target field).
  \item The field \code{Source field:} specifies, where the data you want to use for adjusting the target field is stored.
For example, if you simply want to delete or correct (e.\,g., misspelt) title-, author-, source-, section-, type- or notes-metadata, you usually choose the same field as source field as you have chosen as target field, since you want to adjust the data already stored in this field.
However, if you want to add new data to a (maybe empty or incomplete) target field, you have to choose the part of the meta information as source field, which contains the information, from which you want to derive the new data.
As the document title should contain all relevant meta information, \code{Title} is usually used as source field for the latter case.
  \item The field \code{Matching on target regex} allows you to automatically delimit the documents which you want to adjust, based on the information stored in the document's target field.
Similar to all regex implementations in \dna, you can either use search terms or regular expressions to filter the documents.
If you, for instance, misspelt the author ``Ralf Neukirch'' sometimes as ``Ralf Neu\emph{n}kirch'', you can correct all your misspellings by simply selecting ``Author'' as \code{Target field}, entering ``Ralf Neu\emph{n}kirch'' in the field \code{Matching on target regex:} and the correct version (``Ralf Neukirch'') in the field \code{New target field}.
As \code{Matching on target regex} automatically deselects all non-matching cases (here: all documents that do not have ``Ralf Neunkirch'' specified as their author), the meta information (here: ``Author'') remains the same for all other documents.
  \item The field \code{Matching on source regex} similarly allows you to automatically filter the documents of which you want to alter the meta data, based on the information stored in the document's source field.
For example, if you realise that Ralf Neukirch does not write for ``SPON International'' (as you erroneously specified), but for ``THE GUARDIAN'', you can simply correct all your misspecifications by first selecting \code{Source} as the \code{Target field} and \code{Author} as the \code{Source field}, secondly entering ``Ralf Neukirch'' in the field \code{Matching on source regex} and then specifying ``THE GUARDIAN'' as \code{New target field}.
  \item The field \code{\%target regular expression} allows you to specify/match a part of the target field, which you want to use as new information in the same field.
For example, if the field \code{Author} somehow contains the full document titles you can reduce the information in the field \code{Author} to just the name of the respective author by entering the regular expression \code{(?<=.+?---).+?(?= -)} (see Figure~\ref{fig:regex} or \href{https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/}{regex cheatsheet}) in the field \code{\%target regular expression} and entering \code{\%target} in the field \code{New target field}.
\emph{Please note that if you do not use this function, you should not change the preset value \code{.+} in this field}, because otherwise your recoding might not produce the expected results.
  \item The field \code{\%target replacement} defines a new value for the information in the target field---similarly to the fields \code{New target field} and \code{\%source replacement}.
If you use \code{\%target} as \code{New target field}, you have to specify the new, additional, corrected or reduced information in this field.
  \item The field \code{\%source regular expression} allows you to specify/match a part of the source field, which you want to use as new information in the target field.
For example, if your source field is \code{Title} and the titles of your documents have the recommended format (i.\,e., \code{DD.MM.YYYY - Author - Source - TYPE.txt} with blanks before and after the minuses; see Section~\ref{subsec:multiimport})  you can automatically specify the meta information for the field \code{Author} by (1.)~choosing \code{Author} as the \code{Target field} and \code{Title} as the \code{Source field}, (2.)~entering the regular expression \code{(?<=.+?---).+?(?= -)} (see Figure~\ref{fig:regex} or \href{https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/}{regex cheatsheet}) in the field \code{\%source regular expression} and (3.)~entering \code{\%source} in the field \code{New target field}.
\emph{Please note that if you do not use this function, you should not change the preset value \code{.+} in this field}, because otherwise your recoding might not produce the expected results.
  \item The field \code{\%source replacement}---similarly to the fields \code{New target field} and \code{\%target replacement}---defines a new value for the information in the target field.
If you use \code{\%source} as \code{New target field}, you have to specify the new, additional, corrected or reduced information in this field.
  \item The field \code{New target field} defines the new, corrected, reduced or additional data, which is entered in your target field (see examples above).
Please note that this field has to be set to \code{\%source} (preset value) if you use the functions \code{\%source regular expression} or \code{\%source replacement} and has to be set to \code{\%target} if you use the functions \code{\%target regular expression} or \code{\%target replacement}.
Otherwise, the respective functions will not work.
\end{itemize}
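
The interplay of target field, source field and the two matching fields can be summarised in a few lines of code.
The following Python sketch is a simplified model under stated assumptions: the function \texttt{recode}, the dictionary layout of the documents and the author ``Jane Doe'' are hypothetical, the \code{\%source} and \code{\%target} placeholders are left out, and the actual implementation in \dna\ may differ.
It replays the two examples from the list above, first correcting a misspelt author and then correcting the source of all documents by that author.

\begin{verbatim}
```python
import re


def recode(documents, target, source, new_value,
           match_target=".*", match_source=".*"):
    """Return copies of the documents with the target field recoded.

    A document is changed only if its target field and its source field
    each contain a match for the corresponding regular expression.
    """
    recoded = []
    for doc in documents:
        doc = dict(doc)  # copy, so that the input list stays untouched
        if (re.search(match_target, doc[target])
                and re.search(match_source, doc[source])):
            doc[target] = new_value
        recoded.append(doc)
    return recoded


# Hypothetical documents, reduced to two meta data fields.
docs = [
    {"author": "Ralf Neunkirch", "source": "SPON International"},
    {"author": "Jane Doe", "source": "SPON International"},
]

# Example 1: correct the misspelt author (target and source field are equal).
docs_fixed = recode(docs, "author", "author", "Ralf Neukirch",
                    match_target="Ralf Neunkirch")

# Example 2: correct the source of all documents written by Ralf Neukirch.
docs_fixed = recode(docs_fixed, "source", "author", "THE GUARDIAN",
                    match_source="Ralf Neukirch")
print(docs_fixed)
```
\end{verbatim}

Because the function returns copies, the original list plays the role of the preview table: the changes can be inspected before anything is overwritten.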

The lower half of the ``Recode document meta-data'' pop-up window (see Figure~\ref{fig:recodewin}) displays a table with four columns and a row for each of your documents, which help you to preview, control and trace back your changes to the meta data:

\begin{itemize}
  \item The column \code{ID} contains the individual ID of each of your documents.
This column can be particularly helpful if you specify a recoding procedure for a certain set of documents.
If you know the ID of a few exemplary documents from this set, you can quickly trace back and understand the consequences of your recoding specifications by scrolling down to the respective IDs and taking a look at the other columns of these documents.
  \item The column \code{Source field} displays the field, from which you get the meta data for recoding the target field.
It is particularly helpful to understand the sequence of information in the source field, if you want to specify a \code{\%source regular expression} or use \code{Matching on source regex} (for example, if only some source fields contain the relevant information).
  \item The column \code{Old target field} shows the meta data in the target fields prior to your adjustments.
It is particularly helpful if you want to use \code{\%target regular expression} or use \code{Matching on target regex} (for example, if you only want to change the value of a certain set of target fields).
  \item The column \code{New target field} displays the consequences of your adjustment.
It is particularly helpful for checking whether your recoding will be successful or whether some recoding outcomes are actually undesired (for example, if the target field already contained the relevant information, but is recoded nevertheless).
\end{itemize}

Your recodings are only applied if you press the button \code{Recode} (on the lower right of the \code{Recode document meta data} window, see Figure~\ref{fig:recodewin}).
\emph{Once a recoding is applied, it cannot be undone!} So please check the consequences of your recodings by using the table in the lower half of the window.
However, before pressing the \code{Recode} button, you \emph{can} revert all adjustments by pressing the button \code{Revert changes} and therefore are able to experiment with the meta data (regex) specifications.

As noted previously, all documents from the file \code{sample.dna} do not specify any meta data concerning the type of the respective document.
Both Figure~\ref{fig:recodewin} and Figure~\ref{fig:recoderegex} illustrate an exemplary semi-automatic procedure for complementing this information based on the information stored in the document title (here: the organisation to which the respective speaker belongs).
Thus in both examples, \code{Type} is selected as \code{Target field}, while \code{Title} is selected as \code{Source field}.

The example in Figure~\ref{fig:recodewin} uses manual search terms to specify the meta information for the document type.
By entering ``NGO'' in the field \code{Matching on source regex} the adjustments are limited to the documents, which contain ``NGO'' in the document title.
By entering ``NGO'' in the field \code{New target field}, the new value for \code{Type} is specified for the selected documents.
As you can see in the table on the lower half of the \code{Recode meta-data} window, this very simple procedure is insofar successful, as only the target fields of documents containing hearings of NGO-representatives are changed and the target fields of all other documents (including those with already correct \code{Type} information) remain unchanged.
However, this procedure would have to be repeated for each kind of organisation from the sample (NGO, GOV, BUS).

The more elegant way of semi-automatically specifying meta information is depicted in Figure~\ref{fig:recoderegex}, which uses the \emph{Regex-syntax}.
Here, by entering \texttt{\^{}?} in the field \code{Matching on target regex}, only those documents are selected for amendment which do not already contain any information about the document type (therefore excluding those documents with already correct \code{Type} information).
By specifying \code{(?<=.+?-)[A-Z]+} as \code{\%source regular expression} (and accordingly \code{\%source} as \code{New target field}), \dna\ is instructed to extract the first string of upper-case characters that follows a minus in the document title and set it as the new value for \code{Type}.
Thus you can recode the document type for all documents at once, ensuring that already specified values are not overwritten---as evident from the table in the lower half of the window.
\begin{figure}
  \includegraphics[width=\linewidth]{05-11-recoderegex}
  \caption{Meta information recode window (regex explained)}
  \label{fig:recoderegex}
\end{figure}
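
The logic of this second example can also be reproduced outside of \dna.
The following Python sketch is only an approximation: Python's \texttt{re} module supports fixed-width look-behind only, so a capture group is used in place of the look-behind expression quoted above.
The function \texttt{fill\_missing\_type}, the dictionary layout and the second document are hypothetical.

\begin{verbatim}
```python
import re

# Capture group instead of a look-behind: the first run of upper-case
# letters that directly follows a hyphen in the title.
TYPE_IN_TITLE = re.compile(r"-([A-Z]+)")


def fill_missing_type(documents):
    """Set an empty "type" field from the title; keep existing values."""
    for doc in documents:
        if doc["type"] == "":  # skip documents that already have a type
            match = TYPE_IN_TITLE.search(doc["title"])
            if match:
                doc["type"] = match.group(1)
    return documents


docs = [
    {"title": "109-1: Callahan, Kateri-NGO-Y", "type": ""},
    # Hypothetical document with an already specified type.
    {"title": "Hypothetical, Speaker-GOV-Y", "type": "BUS"},
]
print(fill_missing_type(docs))
```
\end{verbatim}

The check for an empty type plays the role of \code{Matching on target regex}: documents that already carry a value are left as they are.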


\FloatBarrier
\chapter{Coding the Data}\label{chp:dna-coding}
\chapterauthor{Johannes Gruber}
Now that you know how to create a database and organise the documents in it, it's time to start with the actual coding.
This section describes how to create, edit and navigate through statements as well as how to employ the regex highlighter and search function to make coding of statements easier and faster.
If you want to recreate the steps outlined in this section for practice, you should download the file \code{sample.dna} from the \dna\ release page (\url{https://github.com/leifeld/dna/releases}) and open it with the newest version of \dna\ from the same page (if you haven't already done so).
This sample is a small excerpt from a larger empirical research project that tries to map the ideological debates around American climate politics in the U.S. Congress over time.
Details about the dataset from which this excerpt is taken are provided by \citet{fisher2013mapping, fisher2013where}.
Here, it suffices to say that the \texttt{sample.dna} file contains speeches from hearings in the U.S.\ Congress in which interest groups and legislators make statements about their views on climate politics.

\section{Creating a \dna\ Statement}\label{sec:createstat}
\FloatBarrier
For the sake of this tutorial, we can create a new coder.
You could also select one of the existing coders from the drop-down menu to create statements but it's just nicer if our new statement is associated with our names.
Simply click on the plus sign in the coder menu in the upper left corner of the main window (see Figure~\ref{fig:newuser}, step 1).
In the new menu that opens, enter your name and just leave all permissions selected for now (see Figure~\ref{fig:newuser}, step 2).
Then you should choose a personal colour.
Either you select a predefined one from the swatches tab or you define your own colour using one of the menus in the other tabs (see Figure~\ref{fig:newuser}, step 3).
Later on, the statements you have created will be highlighted in this colour.
After accepting the edit, you can select yourself from the drop-down coder menu.
This should always be the first step before you or one of your coders start to work on the database.
\begin{figure}
  \includegraphics[frame, width=\linewidth]{06-01-newuser}
  \caption{Create your own personal user for this exercise}
  \label{fig:newuser}
\end{figure}
\begin{figure}
  \includegraphics[frame, width=\linewidth]{06-02-newstatement}
  \caption{Create new \dna\ Statement}
  \label{fig:newstatement}
\end{figure}

To code a new \dna\ Statement, simply select a chunk of text by pressing and holding your left mouse button while sliding over it.
When all the text you want to include in the statement is selected, press the right mouse button and select \code{Format as DNA Statement} from the context menu that appears (see Figure~\ref{fig:newstatement}).
In the new menu that opens, you need to provide the details for the statement.
Every \dna\ Statement consists of four pieces of information which you should ideally all provide (Figure~\ref{fig:statdetails}):

\begin{description}
  \item[Person] The person or actor who speaks or makes the statement.
In discourse network analysis it is most often assumed that organisations, not persons, are the important actors in a policy process.
So if you have decided from the start of a project that this is the case, you could leave the person field empty.
However, it might nevertheless be interesting in a later step if there are differences between persons from the same organisation, so it is advised that you complete all fields.
  \item[Organization] The organisation the person who makes a statement is affiliated with.
  \item[Concept] The concept to which the statement refers.
Concepts are usually abstract representations of the topics which are discussed.
In advocacy coalition research, for example, concepts are claims for
policy instruments such as ``CO\textsubscript{2} legislation will not hurt the economy''.
  \item[Agreement] A dummy variable (i.\,e., a variable with only two possible outcomes) indicating whether the actor agrees with the category or not.
Often this is a question of sentiment: if the speaker talks about the concept in a positive way we assume s/he agrees with it.
And likewise that s/he disagrees when making a statement in a negative tone.
As opposed to the other three fields, it is not possible to \emph{not} provide this information.
If you do not tick the box \code{agreement} you indicate disagreement with the selected concept.
\end{description}

There are two ways to provide the information: you can either click inside one of the boxes to write in a new category or you can choose a category from the drop-down menu (see Figure~\ref{fig:statdetails}).
In the first case, \dna\ will try to auto-complete your code by using all existing categories, so you can leave a field incomplete and thereby choose an existing category.
In the latter case, the drop-down menu shows all previously coded information, so that it is not necessary to reenter previously coded categories.
This is not only convenient but also serves a reliability purpose: multiple similar but not identical categories, created by misspelling or incorrect abbreviation, would lead to spurious results which could jeopardize analysis later on.

As you can see in Figure~\ref{fig:statdetails}, each statement also has a unique identification number.
The ID can't be changed by the user, but it can be used as a primary key if you want to record additional information about your statements in a separate database---e.\,g., to automatically merge the information again once you move to analysis in \R\ or other programs.
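
Using the statement ID as a primary key can be sketched as follows.
The Python example below is purely illustrative: the class \texttt{Statement}, the function \texttt{merge\_notes}, the example values and the notes are hypothetical and do not reflect how \dna\ or \rdna\ store or export statements.

\begin{verbatim}
```python
from dataclasses import dataclass


@dataclass
class Statement:
    """Hypothetical stand-in for a coded statement with its four fields."""
    statement_id: int
    person: str
    organization: str
    concept: str
    agreement: bool


def merge_notes(statements, notes_by_id):
    """Pair each statement with the external note stored under its ID."""
    return [(s, notes_by_id.get(s.statement_id)) for s in statements]


# Illustrative values only, not taken from the coded sample data.
statements = [
    Statement(42, "Kateri Callahan", "Alliance to Save Energy",
              "There should be legislation to regulate emissions.", True),
    Statement(43, "Hypothetical Person", "Hypothetical Organization",
              "There should be legislation to regulate emissions.", False),
]

# Additional information kept in a separate table, keyed by statement ID.
notes = {42: "double-check the wording"}

for statement, note in merge_notes(statements, notes):
    print(statement.statement_id, statement.person, note)
```
\end{verbatim}

Statements without an entry in the separate table are paired with \texttt{None}, so missing additional information remains visible after the merge.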

There are also two more symbols in the upper right corner of the Statement window.
The first is a plus sign, which creates a copy of the current statement; this can be helpful if the same text passage can be coded as multiple statements, for example, when it mentions multiple persons or organisations which refer to the same concept.
The second is a trash bin symbol, which completely removes your \dna\ Statement but leaves the document intact (see Figure~\ref{fig:statdetails}).

After you are done providing the detail information, you can simply click anywhere outside the menu to return to the main window.
The moment you leave the \dna\ Statement window, all your edits are saved in your database---so there is no need to apply the changes or save your progress periodically (see Figure~\ref{fig:statdetails}).
If you click outside the window by accident, you can return to edit the statement by simply clicking on it again (more in the next section).
After you return to the main page, the statement is now highlighted either in yellow or in the personal colour of the coder you selected (see Figure~\ref{fig:stathl}).
All \dna\ Statements have the colour which was selected when the \dna\ database was created.
The only way to change the highlight colour would be to create a new database.
However, by opening the \code{Settings} and selecting \code{Color statements by coder} the highlight colour changes to the colour of the coder who created each statement.
\begin{figure}
  \includegraphics[frame, width=\linewidth]{06-03-statdetails}
  \caption{Details of a \dna\ Statement}
  \label{fig:statdetails}
\end{figure}
\begin{figure}
  \includegraphics[frame, width=\linewidth]{06-04-stathl}
  \caption{\dna\ Statements are highlighted in yellow, annotations in grey by default}
  \label{fig:stathl}
\end{figure}

Besides \dna\ Statements, you can also create annotations.
In the sample database, annotations are highlighted in grey.
You can use this feature, for example, if you are not quite sure if a text passage is a statement.
Then you or another coder could review all annotations later on and decide if it should be coded as a statement or not.
In projects with multiple coders who work on a remote database, the feature can also serve as a means of communication between coders, for example, to provide instructions for a specific document or to make each other aware of a certain sentence or paragraph which might not be a statement but is nevertheless important in other ways.
Basically, annotations in \dna\ work in the same way as in MS Word, Google Docs and other applications.


\FloatBarrier
\section{Navigating Through Statements}
For this section, we focus our attention on the \code{Statements} window in the upper right corner of the main window.
This window is presented as a table with an ``ID'' column and a ``Text'' column in which the text underlying a \dna\ Statement is displayed (see Figure~\ref{fig:statements}).

There are three ways to navigate through statements.
The first is displayed if you select the option \code{all} below the list of statements.
As the name suggests, the window shows all statements, ordered by appearance in your database.
The statements in the first document are on top, the statements in the last document you added are at the bottom of the table.
By clicking on a statement, you jump to the document and the position in which it appears.

The second option is called \code{current}.
It provides a filter, so only the statements in the currently selected document are displayed in the Statements table.
If you navigate to a different document, this selection will change.
On top of the table, you should see the new statement you just created.
If you haven't added or removed a statement before, the ID of the statement should be ``42''.

The third option is called \code{filter}.
When you choose this option, the ``Statements'' box increases in size and shows five new fields in which you can enter keywords to filter by the corresponding fields of the \dna\ Statement (ID, person, organization, concept and agreement), as well as a drop-down menu where you can switch between statement types.
By entering, for instance, ``There should be legislation to regulate emissions.'' in the concept field, only statements regarding that concept will be displayed.
Again, it is not necessary to provide the full text but it is sufficient to write in enough characters to differentiate the code clearly from other ones in the database.
For example, if there were a different concept called ``There should be legislation for cap and trade.'', then you would have to enter at least ``There should be legislation \emph{t}'' before \dna\ could differentiate the two concepts.
Or even easier, you could simply enter ``emissions'' to display only statements for which the concept contains the word ``emissions''.
Additionally, you can again use regular expressions in the filter fields (see Section~\ref{subsubsec:adjregex}).
For instance, if you type \code{1[2-5]} into the ID field, \dna\ will display the statements with the IDs 12 to 15.
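
The effect of such a pattern can be tried out outside \dna. The following lines are a minimal sketch in base \R\ and only illustrate the pattern itself, not how \dna\ applies the filter internally; the vector \code{ids} is a made-up example. Whether the filter has to match the whole ID or only a part of it is not specified here, which matters once IDs have three digits:
<<eval=FALSE>>=
ids <- c(8, 12, 13, 15, 16, 112, 125)   # hypothetical statement IDs
# unanchored: matches any ID that contains 12, 13, 14 or 15
ids[grepl("1[2-5]", ids)]     # 12 13 15 112 125
# anchored to the whole ID with ^ and $
ids[grepl("^1[2-5]$", ids)]   # 12 13 15
@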

Instead of searching statements by just one variable, you can also combine filters.
By entering ``There should be legislation to regulate emissions.'' in the concept field and ``Bob'' in the person field, only the \dna\ Statements which concern this specific concept and were made by Bob Slaughter (the only Bob in the database) are displayed.

The filter fields all work in this way except \code{agreement}, which only takes the values 0 for disagreement and 1 for agreement.
\begin{figure}
  \includegraphics[width=\linewidth]{06-05-statements}
  \caption{Statements window in detail}
  \label{fig:statements}
\end{figure}


\section{Editing Statements}
\FloatBarrier
To edit a statement's details, you first need to select it, either by choosing it from the Statements table (see Figure~\ref{fig:statements}) or directly in the text.
This will open the same window as in Figure~\ref{fig:statdetails} which we discussed in detail in Section~\ref{sec:createstat}.

Besides that, you can also provide additional information for the variables in the \dna\ Statements.
To find this menu, you first have to toggle the option that displays a new window beneath the text window.
To do that, find the view options in the top right corner of the main window (see Figure~\ref{fig:editdetails}, step 1).
The three options you see here control how \dna\ looks or, more precisely, which menus are displayed.
The left option displays/hides the ``Coder'' and ``Document properties'' menus, while the right one displays/hides the ``Statements'', ``Search within document'' and ``Regex highlighter'' menus.
However, now we need the middle option which displays the mentioned field beneath the document text.
\begin{figure}
  \includegraphics[width=\linewidth]{06-06-editdetails}
  \caption{Statements window in detail}
  \label{fig:editdetails}
\end{figure}

In this window, you can make edits to persons, organizations or concepts in all \dna\ Statements at once, rather than to single statements.
The advantage is that you do not have to edit each individual statement if you notice that you misspelt a name, that two concepts which you have coded actually represent the same idea or if you simply find a better concept name which adds some clarity.
When you edit a concept (or person or organisation) in this menu, it is changed in all \dna\ Statements at once.
When this menu is first opened, it contains no information until you select either \code{DNA Statement} or \code{Annotation} from the left drop-down menu (see Figure~\ref{fig:editstatm}).
\begin{figure}
  \centering
  \includegraphics[frame, width=0.6\linewidth]{06-07-editstatm}
  \caption{Recode a whole concept label instead of individual statements}
  \label{fig:editstatm}
\end{figure}

To provide additional information for persons, organizations or concepts you have to click on the label symbol in the top right corner of the same window (see Figure~\ref{fig:editdetails}, step 2).
As in the other menu, the window contains no entries until you select whether you want to edit statements or annotations from the left drop-down list (see Figure~\ref{fig:editdetails}, step 3).
In the table presented now, you can add colour, type, alias and notes for each person, organization or concept.
This information is not directly used in \dna\ but, as we will see in Chapter~\ref{chp:rdna}, can be used during analysis with \rdna.

\FloatBarrier
\section{Using the Regex Highlighter and Search Function} \label{sec:regex}
The last part of this section about coding statements in \dna\ describes how you can use the ``Regex highlighter'' and ``Search within document'' functions to help you quickly identify relevant parts of the documents you will be coding.
Both functions are located on the right panel of the main window and can be folded or unfolded like every window in \dna\  (see Figure~\ref{fig:search}).

We begin with the \code{Search within document} menu as it is what most people are already familiar with.
By entering a search term in the text field, you can jump between all occurrences of that keyword in the current document in the same way as it works in popular text processing programs (such as MS Word) or in your web browser.
You can test this by entering the term ``greenhouse effect'' and browsing through the occurrences in Navin Nayak's hearing in the U.S.\ Congress (the second-to-last document in the database) with the left and right arrow buttons.
As you might notice, all mentions of the greenhouse effect are fairly close together, and Coder 1 has coded a large part of the section in which they occur as a \dna\ Statement.
This is what will often happen during coding: as authors of a text or speakers will order their document or speech around sub-topics, the statements you wish to code can be close together.
If you already know the most important keywords, it might therefore not be necessary to read the whole article or speech closely; it can be more fruitful and efficient to jump between potential statements and only skim the remaining lines.

One feature which sets the search function in \dna\ apart from other programs with which you are already familiar is, again, that it accepts regular expressions.
One useful aspect of regular expressions in this scenario is that you can search for multiple keywords at the same time by putting the vertical bar character \code{|}, which represents the \emph{OR} operator, between them.
Where you find this character depends on your keyboard layout; on many keyboards it is located next to the Shift key.
So for instance, you could write ``global warming\textbf{|}climate change'' to browse through all mentions of either of those terms in the current document.
Combining multiple keywords can serve as a quick way to browse through documents which are yet to be coded, or which have already been coded, to identify the most relevant parts.

The \emph{``Regex highlighter''} works in much the same way.
But instead of browsing through the occurrences of a key term, you can change its font colour to a colour of your choice.
Looking again at Figure~\ref{fig:search}, you can see that multiple words have a different font colour: ``dioxide'' is orange, ``greenhouse'' is green, ``clima'' is blue and so on.
This can help coders notice when theory-induced keywords appear in a paragraph.
Again, it helps if you know a few basics about regular expressions, but as you can see, most of the keywords are written in plain English, and you can do the same if you prefer.
However, consider that instead of the single regular expression \code{C[Oo0]2} you would then have to enter both ``CO2'' and ``C02''.
The Regex highlighter, just like the search function, is case insensitive by default.
If you wish to match only a specific capitalisation of a word, you can put the option \code{(?-i)} at the beginning of your search string.
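
The difference between the two settings can be reproduced with a short sketch in base \R. It only mimics the behaviour described above: the argument \code{ignore.case} belongs to \R's \code{grepl} function and stands in for the default of the highlighter, and the vector \code{words} is made up for illustration:
<<eval=FALSE>>=
words <- c("CO2", "C02", "co2", "CO3")
# case-insensitive matching, as in the default described above
grepl("C[O0]2", words, ignore.case = TRUE)    # TRUE TRUE TRUE FALSE
# case-sensitive matching: the lower-case variant is no longer found
grepl("C[O0]2", words, ignore.case = FALSE)   # TRUE TRUE FALSE FALSE
@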

\begin{figure}
  \includegraphics[width=\linewidth]{06-08-search}
  \caption{The Search and Regex highlighter functions}
  \label{fig:search}
\end{figure}
This concludes the chapter about how to code statements in \dna.
Even though this chapter turned out to be shorter than the two previous ones, this is where you will spend most of your time while using \dna.


\chapter{Exporting the Coded Data}\label{chp:dna-export}
\chapterauthor{Johannes Gruber}
\FloatBarrier
This chapter will walk you through the process of exporting data from \dna\ for further analysis in other programs and provides detailed information about each of the 15 options you can change during export.
If you intend to use \rdna\ to analyse your data---which is recommended---you will not have to export the data from \dna\ first as you can directly use the .dna database.
However, \rdna's function \code{dna\_network} mirrors the options in the export window and hence it makes sense to familiarise yourself with the different ways in which networks can be exported from \dna.

To open the window, simply open the drop-down menu ``Export'' and click on the only entry \code{Export network...}.
This will open the export window as depicted in Figure~\ref{fig:exportwin}.
The first thing you should do when you are not familiar with the different options is to tick the box \code{Display tooltips with instructions}.
This will put some further information about each option on your screen when you rest your mouse cursor over one of them.
\begin{figure}
  \includegraphics[width=\linewidth]{07-01-exportwin}
  \caption{The export window}
  \label{fig:exportwin}
\end{figure}


\section{Type of Network}\label{sec:typeofn}
The first thing to consider is which type of network you want to export.
There are three options: one-mode networks, two-mode networks and event lists.

If you select to export a \emph{``one-mode network''}, the resulting matrix will have the same nodes in the rows and columns  (e.\,g., organizations $\times$ organizations) (see Table~\ref{tab:onemode}).
The values in the cells in Table~\ref{tab:onemode} are a function of how often Variable 1 and Variable 2 are referenced together in \dna\ Statements.
These values represent the edge weights of a network.
With all other options left at their defaults, the edge weight in the one-mode case is the sum, over all values of Variable 2, of the products of the co-reference counts of, e.\,g., organisation A and organisation B (see Section~\ref{subsec:ignore}).
Note that the edge weight of an organisation with itself is always zero.
One could also expect these diagonal values to be aggregations of all statements an organisation made.
However, this would not make theoretical sense as there is no edge between an organisation and itself.
The choice of the value zero in cells where row and column refer to the same organisation is therefore deliberate.

<<eval=TRUE, echo=FALSE, results='hide', message=FALSE, warning=FALSE>>=
#' prep environment
library("rDNA")
library("kableExtra")
dna_init("dna-2.0-beta22.jar")
truncate <- function(x, trunc = 12){
  x <- ifelse(nchar(x) > trunc,
                        paste0(gsub("\\s+$", "",
                                    substr(x, start = 1, stop = trunc)),
                               "..."),
                        x)
}
trim <- function(x, n = 12, e = "..."){
  ifelse(nchar(x) > n,
         paste0(gsub("\\s+$", "",
                     strtrim(x, width = n)),
                e),
         x)
}
@
<<eval=TRUE, echo=FALSE, warning=FALSE>>=
conn <- dna_connection(dna_sample())
dt <- dna_network(conn,
                  networkType = "onemode",
                  statementType = "DNA Statement",
                  variable1 = "organization",
                  variable2 = "concept",
                  qualifier = "agreement",
                  qualifierAggregation = "ignore",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "",
      caption = "One-mode network (organizations $\\times$ organizations over concept)\\label{tab:onemode}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  row_spec(0, angle = 90)

@

A \emph{``Two-mode network''}, on the other hand, has different sets of nodes in the rows and columns of the resulting matrix (e.\,g., concepts $\times$ organizations) (see Table~\ref{tab:twomode}).
In this case, what you select in Variable 1 will become the rows of the matrix while Variable 2 will be the columns.
In the two-mode network case, with all other options left at their defaults, the edge weights are simple counts of all co-references of Variable 1 and 2.
<<eval=TRUE, echo=FALSE, warning=FALSE>>=
conn <- dna_connection(dna_sample())
dt <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "concept",
                  variable2 = "organization",
                  qualifier = "agreement",
                  qualifierAggregation = "ignore",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "Two-mode network (concept $\\times$ organizations)\\label{tab:twomode}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  row_spec(0, angle = 90)

@

The \emph{``Event list''} is a little different from the other two options.
Instead of summarising the statements over Variable 1 and 2, it simply lists all \dna\ Statements with each row containing all variables of a statement including the time which was set for the document a statement occurred in.
The event list is thus a more detailed version of the statements window in the \dna\ user interface.
As you can see in Table~\ref{tab:eventl}, this table is a lot larger than the matrices produced by the summarising algorithms used to create one-mode and two-mode networks.
<<eval=TRUE, echo=FALSE, warning=FALSE>>=
conn <- dna_connection(dna_sample())
dt <- dna_network(conn,
                  networkType = "eventlist",
                  statementType = "DNA Statement",
                  variable1 = "concept",
                  variable2 = "organization",
                  qualifier = "agreement",
                  qualifierAggregation = "ignore",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "Event list\\label{tab:eventl}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "float_left")) %>%
  row_spec(0, angle = 90) %>%
  landscape()

@
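
How the three network types relate to each other can be sketched in a few lines of base \R. The event list below is a made-up toy example, not the sample database, and the calculation assumes the default settings described above (qualifier ignored, no normalisation):
<<eval=FALSE>>=
# Toy event list: one row per statement (names and values are made up)
events <- data.frame(
  organization = c("Org A", "Org A", "Org B", "Org B", "Org B"),
  concept      = c("Concept 1", "Concept 2", "Concept 1", "Concept 1", "Concept 2"),
  agreement    = c(1, 0, 1, 1, 0)
)
# Two-mode matrix (concept x organization): simple co-reference counts
twomode <- unclass(table(events$concept, events$organization))
twomode["Concept 1", "Org B"]   # 2
# One-mode matrix (organization x organization): sum over all concepts
# of the products of the counts, with the diagonal set to zero
onemode <- crossprod(twomode)
diag(onemode) <- 0
onemode["Org A", "Org B"]       # 1 * 2 + 1 * 1 = 3
@
In this sketch, the two-mode matrix is a plain cross-tabulation of the event list, and the one-mode matrix is derived from the two-mode matrix.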


\section{Statement Type}
There are usually two options for \emph{``Statement type''} and you will probably never want to choose the second one---Annotations.
The default is to export \dna\ Statements instead, which are the type that actually contains information worth exporting as a network.
If you do wish to export your annotations, you will notice that the choices in Variable 1 and 2 are reduced to just the document variables: author, source, section and type.
This is because the other variables, like concept or organization, do not exist in the type ``Annotation''.
Thus the only option that really makes sense is to export annotations as event lists.

Choosing the statement type makes the most sense if you have created your own types while setting up the database (see Section~\ref{sec:stattype}).
As outlined there, very few scenarios exist in which you have more than one relevant statement type.

\section{File Format}
There are currently three different file formats which can be used for export: .csv, .dl and .graphml.
Alternatively, the data can be brought into \R\ by using the \rdna\ package to import data directly from a .dna file (see Chapter~\ref{chp:rdna}).

The \code{.csv} format (short for  comma-separated values) is basically a spreadsheet format in a text file.
If you open it with a text editor, it will look something like this:
<<eval=FALSE, engine = 'bash', results = 'tex'>>=
"";"Alliance to...";"Energy and E...";"Environmenta..."; "National Pet...";...
"CO2 legislation will not hurt the economy.";0;2;1;2;2;2;1
"Cap and trade is the solution.";0;0;0;0;0;0;1
"Climate change is caused by greenhouse gases (CO2).";1;0;0;0;0;1;1
"Climate change is real and anthropogenic.";0;0;0;0;0;1;2
"Emissions legislation should regulate CO2.";2;2;0;0;1;1;1
"There should be legislation to regulate emissions.";1;0;10;2;2;1;0
@
This is the exact same table as Table~\ref{tab:twomode}\footnote{Except that column names needed to be truncated to fit on this page.}
but instead of rows and columns, the values are stored in plain text, with each line containing one row of the table and the values separated by semicolons.
You can usually open these files in MS Excel, LibreOffice, Numbers (Apple's spreadsheet application) or other spreadsheet applications without any further steps.
Note, however, that your spreadsheet application sometimes needs to be set to use semicolons to separate values instead of commas.\footnote{This is usually only a problem in MS Excel running on a Windows computer where the CSV separator is set system-wide in the language options.
If you experience problems, you need to use the import function in Excel rather than opening the .csv file directly.}
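
If you prefer to read such a file into \R, the separator can be set explicitly. The following is a minimal sketch in base \R; the file written at the top is a made-up stand-in for a real export, so labels and values are purely illustrative:
<<eval=FALSE>>=
# Toy stand-in for an exported file (labels and values are made up)
csv_file <- tempfile(fileext = ".csv")
writeLines(c('"";"Org A";"Org B"',
             '"Concept 1";0;2',
             '"Concept 2";1;0'), csv_file)
# sep = ";" matches the export format; row.names = 1 uses the first column
# as row labels; check.names = FALSE keeps the column labels unchanged
net <- as.matrix(read.csv(csv_file, sep = ";", row.names = 1,
                          check.names = FALSE))
net["Concept 1", "Org B"]   # 2
@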

\code{.dl} files are for use with the network analysis software \ucinet.
Again, information in .dl files is stored in plain text and you can open the files using the text editor of your choice if you want to inspect the format in detail.
Taking again Table~\ref{tab:twomode} as an example, the respective .dl file looks like this:

<<eval=FALSE, engine = 'bash', results = 'tex'>>=
dl nr = 6, nc = 7, format = fullmatrix
row labels:
"CO2 legislation will not hurt the economy."
"Cap and trade is the solution."
"Climate change is caused by greenhouse gases (CO2)."
"Climate change is real and anthropogenic."
"Emissions legislation should regulate CO2."
"There should be legislation to regulate emissions."
col labels:
"Alliance to Save Energy"
"Energy and Environmental Analysis, Inc."
"Environmental Protection Agency"
"National Petrochemical & Refiners Association"
"Senate"
"Sierra Club"
"U.S. Public Interest Research Group"
data:
 0 2 1 2 2 2 1
 0 0 0 0 0 0 1
 1 0 0 0 0 1 1
 0 0 0 0 0 1 2
 2 2 0 0 1 1 1
 1 0 10 2 2 1 0
@
\ucinet\ is a Windows application which can also be used on Mac and Linux with tools such as Wine or Boot Camp.
A 60-day trial version with all features included is available for free on the \ucinet\ website:
\url{https://sites.google.com/site/ucinetsoftware/downloads}.
However, we advise using either \visone\ or \R\ for visualising your networks as both are free, under active development and, in the case of \R, open source.

The \code{.graphml} format is based on the open standard XML and, while still readable by humans, the output file for Table~\ref{tab:twomode} would be too long to display here.
\code{.graphml} files can be opened using \visone, which is ``a software tool intended for research and teaching in social network analysis'' (see \url{http://visone.info/html/about.html}).
Since it is a \java\ program like \dna\ itself, it can be run on any operating system for which \java\ is available, which means basically all of them.
You can take a glimpse at its visualisation capabilities in Figure~\ref{fig:visone}, which shows the plot of Table~\ref{tab:twomode}.

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{07-02-visone}
  \caption{Plotting the network portrayed in Table~\ref{tab:twomode} in Visone}
  \label{fig:visone}
\end{figure}


\section{Variable 1 and 2}
As mentioned earlier, Variable 1 and 2 do different things depending on whether you select one-mode or two-mode under \code{Type of network}.
In a one-mode network, Variable 1 will contain the node class used both for the rows and columns of the matrix.
For example, select the variable for organizations in order to export an organization $\times$ organization network such as shown in Table~\ref{tab:onemode}.
Variable 2 in a one-mode network denotes the variable through which the edges are aggregated (i.\,e., the values in the cells).
For example, if you export a one-mode network of organizations, what makes the most sense is to aggregate how often they co-reference the same concept.
That means that the second variable should be set to \code{concept}.
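
In \rdna, the same choice can be expressed through the arguments of \code{dna\_network}. The following sketch mirrors the call used to create Table~\ref{tab:onemode}; it assumes that the arguments left out here have suitable defaults and that the \dna\ jar file has the name shown, which you may need to adjust:
<<eval=FALSE>>=
library("rDNA")
dna_init("dna-2.0-beta22.jar")   # assumption: adjust to your jar file
conn <- dna_connection(dna_sample())
nw <- dna_network(conn,
                  networkType = "onemode",
                  statementType = "DNA Statement",
                  variable1 = "organization",  # rows and columns
                  variable2 = "concept",       # aggregated over
                  qualifier = "agreement",
                  qualifierAggregation = "ignore")
@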

In a two-mode network, the first variable denotes the node class for the rows, while the second variable denotes the node class used for the columns of the resulting network matrix.
For Table~\ref{tab:twomode}, for example, we set \code{Variable 1} to \code{concept} and \code{Variable 2} to \code{organization}.
Instead of seeing which concepts were referenced by which organisation, we could also create a table that displays concept $\times$ person.
To do that we simply set \code{Variable 1} to \code{concept} and \code{Variable 2} to \code{person}.
To make the logic underlying the use of the two variables even clearer, we can also choose to create a somewhat redundant two-mode matrix.
By setting \code{Variable 1} to \code{organization} and \code{Variable 2} to \code{person} we can count how often statements reference organisations and persons together.
Since every person belongs to just one organisation, the resulting matrix will count all statements by a person in one cell of their column, while all other cells in that column are zero:
<<eval=TRUE, echo=FALSE, warning=FALSE>>=
conn <- dna_connection(dna_sample())
dt <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "organization",
                  variable2 = "person",
                  qualifier = "agreement",
                  qualifierAggregation = "ignore",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "Two-mode network (variable1 = organization $\\times$ variable2 = person)\\label{tab:variables}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  row_spec(0, angle = 90)

@
When you look at the column for Christine Todd Whitman in Table~\ref{tab:variables}, you can see that all of her statements co-reference the Environmental Protection Agency.
This is, therefore, the organisation she belongs to.
A table like this would rarely be useful in a research project though, except maybe for checking whether there have been inconsistencies during the coding process.

\section{Qualifier and Qualifier Aggregation}\label{sec:qualifier}
The qualifier is a binary or integer variable which indicates different qualities or levels of association between variable 1 and variable 2.
In many cases, the qualifier is binary, with the only possible outcomes being agreement and disagreement.
However, the case could also be made to code different degrees of agreement, for instance, by using a scale from $-5$ = strongly disagree to $5$ = strongly agree.
You already saw the qualifier in Section~\ref{sec:createstat} when we created a new statement and had to tick a box in order to indicate support or rejection of the chosen concept.

The combination of qualifier and qualifier aggregation determines which algorithm will be used to calculate the edge weights (see Chapter~\ref{chp:algorithms}).
Again, this also depends on the chosen network type.
In case of the event list, for instance, there is just one option: ignore.
Choices for the other two network types are explained in the next two sub-sections.

\subsection{One-mode Networks}
\begin{description}
\item[Ignore] This does the same as in two-mode networks: it ignores the qualifier and simply aggregates all co-references of Variable 1 and 2.
As the ignore algorithm in \dna\ can be a little confusing for new users, this might be a good opportunity to complement Section~\ref{subsec:ignore} with an example.
As you can see in Table~\ref{tab:onemode}, the edge weight between the Senate and the Sierra Club is seven.
We can use Table~\ref{tab:disagree} to calculate this number by hand.

<<eval=TRUE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt1 <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "organization",
                  variable2 = "concept",
                  qualifier = "agreement",
                  qualifierAggregation = "ignore",
                  normalization = "no",
                  isolates = TRUE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = list(agreement = 0),
                 
                  verbose = TRUE)
colnames(dt1) <- paste(colnames(dt1), "- agree")
dt2 <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "organization",
                  variable2 = "concept",
                  qualifier = "agreement",
                  qualifierAggregation = "ignore",
                  normalization = "no",
                  isolates = TRUE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = list(agreement = 1),
                 
                  verbose = TRUE)
colnames(dt2) <- paste(colnames(dt2), "- disagree")

dt <- cbind(dt1, dt2)
#dt <- dt[, !colSums(dt) == 0|grepl("Senate|Sierra", colnames(dt))]
#dt <- dt[!rowSums(dt) == 0|grepl("Senate|Sierra", rownames(dt)), ]
dt <- dt[ , order(colnames(dt))]
colnames(dt) <- gsub(".*- ", "", colnames(dt))

kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "Agreement and disagreement to concepts by Senate and Sierra Club\\label{tab:disagree}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  add_header_above(c(" " = 1,
                     "Cap and trad..." = 2,
                     "Climate chan..." = 2,
                     "Climate chan..." = 2,
                     "CO2 legislat..." = 2,
                     "Emissions le..." = 2,
                     "There should..." = 2)) %>%
  column_spec(4, border_left = T) %>%
  column_spec(6, border_left = T) %>%
  column_spec(8, border_left = T) %>%
  column_spec(10, border_left = T) %>%
  column_spec(12, border_left = T) %>%
  row_spec(5, bold = T) %>%
  row_spec(6, bold = T)
@

Each value in Table~\ref{tab:disagree} represents a specific value in the three-dimensional array which was described in Chapter~\ref{chp:algorithms}.
One of the $x_{ijk}$ values is, for example, the count for $i =$ \emph{``Senate''}, $j =$ \emph{``CO2 legislation will not hurt the economy.''} and $k =$ \emph{disagree}.
Look in Table~\ref{tab:disagree} to see that this specific value is 2.

You can also see that the Senate and the Sierra Club co-reference three concepts: \emph{``CO2 legislation will not hurt the economy.''}, \emph{``Emissions legislation should regulate CO2.''} and \emph{``There should be legislation to regulate emissions.''}.
The other three are not co-referenced and therefore contribute nothing to the edge weight.
For each co-referenced concept, the contribution to the edge weight is the number of statements from the Senate multiplied by the number of statements from the Sierra Club (see Equation~\ref{eq:ignore}; we are ignoring the normalisation for now):
\infobox{9cm}{
  First, we calculate $\left( \sum_{k} x_{ijk} \right) \left( \sum_{k} x_{i'jk} \right)$ with $i = \text{Senate}$ and $i' = \text{Sierra Club}$ for each concept $j$:\\

  Concept 1: \hfill $(0 + 0) \cdot (0 + 0) = 0$ \\
  Concept 2: \hfill $(0 + 0) \cdot (1 + 0) = 0$ \\
  Concept 3: \hfill $(0 + 0) \cdot (1 + 0) = 0$ \\
  Concept 4: \hfill $(0 + 2) \cdot (2 + 0) = 4$ \\
  Concept 5: \hfill $(0 + 1) \cdot (1 + 0) = 1$ \\
  Concept 6: \hfill $(2 + 0) \cdot (1 + 0) = 2$ \\
 
  Then we solve $\sum_{j = 1}^{n}$ with $n = 6$ concepts:\\
     
  \centerline{\textbf{$0 + 0 + 0 + 4 + 1 + 2 = 7$}}
 
}

\item[Congruence] This option is only available for one-mode networks.
It means that only similarity or matches on the qualifier variable are counted in order to construct an edge.
In the case of a binary qualifier variable (e.\,g., (dis-)agreement), this means that the only statements counted are those where, for example, two organisations both support or both reject a concept.
You can see how this changes the output when you compare Table~\ref{tab:onemode} and Table~\ref{tab:qual3}.
Focusing again on the edge between the Senate and the Sierra Club, you can see now that the value dropped from seven to two as they do not support or reject all of the same concepts.
We can repeat the same calculation, this time using Equation~\ref{eq:congruence_binary} from Section~\ref{subsec:congruence}:

\infobox{9cm}{
  First, we calculate $\sum_{k} x_{ijk} x_{i'jk}$ for each concept $j$ and $k = \text{agreement}$ as well as $k = \text{disagreement}$:\\
 
  Concept 1:    \hfill $0 \cdot 0 + 0 \cdot 0 = 0$ \\
  Concept 2:    \hfill $0 \cdot 1 + 0 \cdot 0 = 0$\\
  Concept 3:    \hfill $0 \cdot 1 + 0 \cdot 0 = 0$\\
  Concept 4:    \hfill $0 \cdot 2 + 2 \cdot 0 = 0$\\
  Concept 5:    \hfill $0 \cdot 1 + 1 \cdot 0 = 0$\\
  Concept 6:    \hfill $2 \cdot 1 + 0 \cdot 0 = 2$\\

  Then we solve $\sum_{j = 1}^6$:\\
     
  \centerline{\textbf{$0 + 0 + 0 + 0 + 0 + 2 = 2$}}
}
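
Both hand calculations can be checked with a few lines of base \R. This is only a sketch: the counts are typed in by hand from the calculations above (the rows are Concept 1 to 6 in the order of Table~\ref{tab:disagree}), and the object names are chosen for illustration:
<<eval=FALSE>>=
# Statement counts per concept, split by agreement and disagreement
senate <- cbind(agree    = c(0, 0, 0, 0, 0, 2),
                disagree = c(0, 0, 0, 2, 1, 0))
sierra <- cbind(agree    = c(0, 1, 1, 2, 1, 1),
                disagree = c(0, 0, 0, 0, 0, 0))
# Ignore: add up both qualifier levels first, then multiply per concept
sum(rowSums(senate) * rowSums(sierra))   # 7
# Congruence: multiply within the same qualifier level only
sum(senate * sierra)                     # 2
@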

<<eval=TRUE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt <- dna_network(conn,
                  networkType = "onemode",
                  statementType = "DNA Statement",
                  variable1 = "organization",
                  variable2 = "concept",
                  qualifier = "agreement",
                  qualifierAggregation = "congruence",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "One-mode congruence network (organizations $\\times$ organizations over concept)\\label{tab:qual3}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  row_spec(0, angle = 90)
@

With an integer qualifier variable, the inverse of the absolute distance plus one (i.\,e., the proximity) is used instead of a match.

\item[Conflict] Conflict is basically the opposite of congruence: while in the congruence network the matches (i.\,e., co-supported or co-rejected concepts) are counted, conflict counts only non-matches.
With a binary qualifier, this means that organizations A and B are connected if one of them supports a concept and the other one rejects the concept.
We can see how this plays out in Table~\ref{tab:qual4}.
Taking the same example as above we can again calculate the edge weight of Senate and Sierra Club using Equation~\ref{eq:conflict_binary} from Section~\ref{subsec:conflict}:

\infobox{10cm}{
  First, we calculate $\sum_{k} x_{ijk} x_{i'jk'}$ for each concept $j$.
To make that easier to follow, we calculate $x_{ijk} x_{i'jk'}$ individually for the two possible cases: $k = \text{agreement}$ and $k' = \text{disagreement}$ as well as $k = \text{disagreement}$ and $k' = \text{agreement}$:\\
 
  Concept 1 ($k = \text{agreement;} k' = \text{disagreement}$): \hfill $0 \cdot 0 = 0$ \\
  Concept 1 ($k = \text{disagreement;} k' = \text{agreement}$): \hfill $0 \cdot 0 = 0$ \\
  Concept 2 ($k = \text{agreement;} k' = \text{disagreement}$): \hfill $0 \cdot 0 = 0$ \\
  Concept 2 ($k = \text{disagreement;} k' = \text{agreement}$): \hfill $0 \cdot 1 = 0$ \\
  Concept 3 ($k = \text{agreement;} k' = \text{disagreement}$): \hfill $0 \cdot 1 = 0$ \\
  Concept 3 ($k = \text{disagreement;} k' = \text{agreement}$): \hfill $0 \cdot 0 = 0$ \\
  Concept 4 ($k = \text{agreement;} k' = \text{disagreement}$): \hfill $0 \cdot 0 = 0$ \\
  Concept 4 ($k = \text{disagreement;} k' = \text{agreement}$): \hfill $2 \cdot 2 = 4$ \\
  Concept 5 ($k = \text{agreement;} k' = \text{disagreement}$): \hfill $0 \cdot 0 = 0$ \\
  Concept 5 ($k = \text{disagreement;} k' = \text{agreement}$): \hfill $1 \cdot 1 = 1$ \\
  Concept 6 ($k = \text{agreement;} k' = \text{disagreement}$): \hfill $2 \cdot 0 = 0$ \\
  Concept 6 ($k = \text{disagreement;} k' = \text{agreement}$): \hfill $0 \cdot 1 = 0$ \\
 
  Then we solve $\sum_{j = 1}^6 \sum_{k}$:\\
     
  \centerline{\textbf{$(0 + 0) + (0 + 0) + (0 + 0) + (0 + 4) + (0 + 1) + (0 + 0) = 5$}}
}

<<eval=TRUE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt <- dna_network(conn,
                  networkType = "onemode",
                  statementType = "DNA Statement",
                  variable1 = "organization",
                  variable2 = "concept",
                  qualifier = "agreement",
                  qualifierAggregation = "conflict",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "One-mode conflict network (organizations $\\times$ organizations over concept)\\label{tab:qual4}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  row_spec(0, angle = 90)
@

Note that the result of the congruence network plus the result of the conflict network equals the outcome of the ignore algorithm (here, $2 + 5 = 7$).
This is because the ignore network disregards the distinction between matches and non-matches on the qualifier and simply adds up all cases.
     
With an integer qualifier, the absolute distance is used instead (see Section~\ref{subsec:conflict}).

\item[Subtract] If the ignore network can be defined as adding up the cases of congruence and conflict, subtract does the opposite: a congruence network and a conflict network are created separately, and the conflict network ties are then subtracted from the congruence network ties.
Taking the example from above, the edge between Senate and Sierra Club is calculated from the results above:
\infobox{9cm}{
  Congruence: $y_{ii'}^\text{congruence binary} = 2$\\
  Conflict: $y_{ii'}^\text{conflict binary} = 5$\\
  Therefore:\\
 
  \centerline{\textbf{$2 - 5 = -3$}}
}

See Table~\ref{tab:qual5} to check that this is correct.
This outcome means that Senate and Sierra Club in fact disagree more often than they agree.
You can also see that this is true for several of the other organisation pairs, suggesting a quite controversial discourse.
<<eval=TRUE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt <- dna_network(conn,
                  networkType = "onemode",
                  statementType = "DNA Statement",
                  variable1 = "organization",
                  variable2 = "concept",
                  qualifier = "agreement",
                  qualifierAggregation = "subtract",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "One-mode subtract network (organizations $\\times$ organizations over concept)\\label{tab:qual5}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  row_spec(0, angle = 90)
@
\end{description}
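
The four one-mode aggregation rules can also be retraced with a few lines of base \R.
The following is a minimal sketch for illustration, not the code used by \dna: the object names are made up, and the statement counts for Senate and Sierra Club are simply read off the worked examples above (rows are the six concepts, columns are agreement and disagreement).

<<eval=FALSE, echo=TRUE>>=
# Statement counts per concept (rows) and qualifier value (columns),
# read off the worked examples above; object names are hypothetical.
senate <- cbind(agreement    = c(0, 0, 0, 0, 0, 2),
                disagreement = c(0, 0, 0, 2, 1, 0))
sierra <- cbind(agreement    = c(0, 1, 0, 2, 1, 1),
                disagreement = c(0, 0, 1, 0, 0, 0))

# congruence: both actors use the same qualifier value
congruence <- sum(senate * sierra)                 # 2
# conflict: opposite qualifier values (columns of one actor swapped)
conflict <- sum(senate * sierra[, 2:1])            # 5
# ignore: qualifier disregarded, totals per concept are multiplied
ignore <- sum(rowSums(senate) * rowSums(sierra))   # 7
# subtract: congruence minus conflict
subtract <- congruence - conflict                  # -3
@

The sketch also reproduces the relationship between the options: congruence and conflict add up to the ignore value ($2 + 5 = 7$).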

\subsection{Two-mode (Affiliation) Networks}
\begin{description}
\item[Ignore] The agreement qualifier variable is ignored, i.\,e., the network is constructed as if all values on the qualifier variable were the same.
In the most common case, the qualifier being (dis-)agreement, this means that co-references of a concept are aggregated, no matter if persons or organisations agree on them or not (see Table~\ref{tab:onemode}).
The way in which the edge weights are constructed was already mentioned in Section~\ref{sec:twomode}.
In the simplest case, a two-mode network with a binary qualifier, the edge weights for ignore are simple counts of co-references of variable 1 and 2.

\item[Subtract] If the qualifier is a regular binary (dis-)agreement value, all disagreeing statements will be subtracted from all agreeing statements.
For example, if an organisation mentions a concept two times in a positive way and three times in a negative way, there will be an edge weight of $2 - 3 = -1$ between the organization and the concept.
You can see how this plays out for the sample file in Table~\ref{tab:qual}.
The \emph{National Petrochemical \& Refiners Association}, for example, disagreed twice with both \emph{``CO2 legislation will not hurt the economy''} and \emph{``There should be legislation to regulate emissions.''}, resulting in edge weights of $-2$ for both concepts.
If the qualifier is an integer value, subtract basically does the same: the absolute values of the negative qualifier codes are subtracted from the positive ones.
For example, if an organisation agrees somewhat ($+1$) with a concept twice and then disagrees strongly ($-2$) with the same concept twice, the edge weight would be $2 \cdot 1 - 2 \cdot 2 = -2$.
<<eval=TRUE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "organization",
                  variable2 = "concept",
                  qualifier = "agreement",
                  qualifierAggregation = "subtract",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "Two-mode network (concept $\\times$ organizations) with disagreements subtracted from agreements\\label{tab:qual}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  column_spec(5, bold = T) %>%
  row_spec(0, angle = 90)
@

\item[Combine] This option is only available for two-mode networks.
Instead of doing mathematical operations on the qualifier values, combine creates a set of qualitative categories.
In the case of an organization $\times$ concept matrix, there are four possible outcomes: if an organisation neither agrees nor disagrees (i.\,e., never references the concept at all), the value is 0; if the organisation agreed with the concept whenever it mentioned it, the value is 1; if the organisation disagreed with the concept whenever it mentioned it, the value is 2; and if the organisation references a concept both in a positive and in a negative way (i.\,e., when its stance towards it is mixed or ambiguous), the edge value is 3.
You can see this in Table~\ref{tab:qual2}.
With an integer variable, this may become more complex, as more combinations are possible and hence more qualitative categories need to be created (see Section~\ref{sec:twomode}).
<<eval=TRUE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "concept",
                  variable2 = "organization",
                  qualifier = "agreement",
                  qualifierAggregation = "combine",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "include",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "Two-mode network (concept $\\times$ organizations) with disagreements and agreements combined\\label{tab:qual2}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  column_spec(3, bold = T) %>%
  row_spec(0, angle = 90)
@
\end{description}
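
The three two-mode aggregation rules for a binary qualifier can be sketched in base \R\ as follows.
The counts are hypothetical and chosen to cover all four outcomes of the combine rule; the code illustrates the logic described above and is not the implementation used in \dna.

<<eval=FALSE, echo=TRUE>>=
# Hypothetical numbers of agreeing and disagreeing statements that one
# organization makes about four concepts.
agree    <- c(c1 = 2, c2 = 3, c3 = 0, c4 = 0)
disagree <- c(c1 = 3, c2 = 0, c3 = 2, c4 = 0)

ignore   <- agree + disagree   # qualifier disregarded
subtract <- agree - disagree   # c1: 2 - 3 = -1
# combine: 0 = never mentioned, 1 = only agreement,
#          2 = only disagreement, 3 = mixed
combine  <- 1 * (agree > 0) + 2 * (disagree > 0)

# integer qualifier: two statements coded +1 and two coded -2
sum(c(1, 1, -2, -2))           # -2
@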

\section{Normalization}\label{sec:normal}
In most real-life networks, normalisation is important.
The reason is that some actors are more salient than others in public discussion.
Government officials in charge of a certain problem will usually be covered more intensely than public interest groups who have an opposite stance.
Other actors, such as political parties, follow a membership logic according to which they are by definition more diverse than others with regard to opinions.
That means that due to their activity and diversity, these actors have agreement or disagreement ties to most other actors in the network at some point.
This can obscure the real structure of a network and make it much harder to identify coalitions \citep{leifeld2017discourse}.

To cancel out this activity effect, one can divide the edge weight between two organisations by a function of the two organizations' activity or frequency of making statements (see Section~\ref{sec:normalis}).
This normalization procedure leaves us with a network of edge weights that reflect similarity in opinions without taking into account centrality or activity.
Once again, which options are available depends on the network type.
Only the option \code{no}, which switches off normalization and is the default value, is available for both types.

\begin{itemize}
  \item \textbf{Two-mode network}
  \begin{description}
    \item[Activity] This divides the edge weights by the activity of the node from the first variable.
For example, in a concept $\times$ organisation network in which organisation A has made four statements in total, the edge with concept B, which A mentioned once, has the value $1/4 = 0.25$.
    \item[Prominence] This divides the edge weights by the prominence of the node from the second variable.
For example, in a concept $\times$ organisation network in which concept B was mentioned eight times in total by all organisations and once by organisation A, the edge between A and B has the value $1/8 = 0.125$.
  \end{description}
  \item \textbf{One-mode network}
  \begin{description}
    \item[Average activity] Average activity divides edge weights between first-variable nodes by the average number of different second-variable nodes they are adjacent to in a two-mode network.
For example, if organization A makes statements about 20 different concepts and B makes statements about 60 different concepts, the edge weight between A and B in the congruence network is divided by $(20 + 60) / 2 = 40$.
To achieve better scaling, all edge weights in the resulting normalized one-mode network matrix are scaled between 0 and 1.
The respective algorithm can be found in Equation~\ref{eq:activity}.
    \item[Jaccard similarity] Jaccard similarity is a similarity measure with known normalizing properties.
In contrast to \code{average activity}, it divides the co-occurrence frequency by the activity count of both separate actors plus their joint activity.
The algorithm used for this normalisation can be found in Equation~\ref{eq:jaccard}.
    \item[Cosine similarity] Cosine similarity is another similarity measure with normalizing properties.
It divides edge weights by the product of the nodes' activity.
Find this algorithm in Equation~\ref{eq:cosine}.
  \end{description}
\end{itemize}
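
The numerical examples above can be retraced with a small, hypothetical matrix of statement counts.
The sketch below uses base \R\ only and assumes that organisations are in the rows and concepts in the columns, so that an organisation's activity is its row sum and a concept's prominence is its column sum; the exact definitions used by \dna\ are given in Section~\ref{sec:normalis}.

<<eval=FALSE, echo=TRUE>>=
# Hypothetical counts: organizations in rows, concepts in columns.
m <- matrix(c(1, 3,
              5, 0,
              2, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "C", "D"), c("B", "E")))

# activity: divide each row by the organization's number of statements
activity <- sweep(m, 1, rowSums(m), "/")
activity["A", "B"]     # 1 / 4 = 0.25

# prominence: divide each column by the concept's number of mentions
prominence <- sweep(m, 2, colSums(m), "/")
prominence["A", "B"]   # 1 / 8 = 0.125

# average activity (one-mode): divide the edge weight by the mean number
# of concepts the two organizations refer to, before any rescaling
w <- 10                # hypothetical unnormalized edge weight
w / mean(c(20, 60))    # 10 / 40 = 0.25
@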
Normalisation can be used in the different networks described in Section~\ref{sec:qualifier} to correct potential biases introduced by very active nodes.
By using a threshold value on the edge weights before visualising the network, normalisation can make it easier to remove low-intensity ties without discriminating against organisations with a low media profile.
Furthermore, the two normalisation algorithms which are based on vector similarities (Cosine and Jaccard) prepare networks to be fed into hierarchical cluster analyses, nonmetric multidimensional scaling, or other clustering techniques that are based on distance or similarity measures in order to identify coalitions in a policy debate as an alternative to community detection.
Details about normalisation can be found in \citet{leifeld2017discourse}.
In Table~\ref{tab:normal} you can see what the two-mode network from Table~\ref{tab:twomode} looks like after applying the activity normalisation algorithm.

<<eval=TRUE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "concept",
                  variable2 = "organization",
                  qualifier = "agreement",
                  qualifierAggregation = "ignore",
                  normalization = "activity",
                  isolates = FALSE,
                  duplicates = "include",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "Two-mode network (organizations $\\times$ concept) activity normalised\\label{tab:normal}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  row_spec(0, angle = 90)

@


\section{Duplicates}
The next option in the export window would be \code{Isolates}, but as you will see, it makes more sense to talk about that option last.
The background of the Duplicates feature is that multiple statements from the same person or organisation referring to the same concept do not always indicate stronger agreement or disagreement.
There are again several options for removing duplicates.
This time, they are independent of your other choices in the export window:

\begin{description}
  \item[Include all duplicates] By default, all statements are included in a network export.
  \item[Ignore per document] This removes duplicated statements which occur in the same document.
In a newspaper article, for example, the number of times an actor is quoted with a statement may be a function of the journalist's agenda or the reporting style of the news media outlet, rather than the actor's deliberate attempt to speak multiple times about a specific topic.
In these cases it makes sense to remove all duplicated statements in this article.
  \item[Ignore per calendar week/month/year] In cases in which one or several newspapers reprint interviews, quotes or reports multiple times in different documents, it might make sense to remove these artefacts over time rather than on a document level.
  \item[Ignore across date range] This removes all duplicated statements in all documents in the database.
Consequently, edge weights of the different networks are converted to 1 if there has been co-reference or co-rejection or 0 if there has not.
Usually, \code{Ignore across date range} therefore converts the network into a boolean matrix.
However, as you can see in Table~\ref{tab:dupl}, there is one value of 2 on the edge between ``Energy and Environmental Analysis, Inc.'' and ``Emissions legislation should regulate CO2.''.
This happened because this organisation contradicted itself by expressing both agreement and disagreement during the hearings.
\end{description}
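
The effect of the duplicate options can be sketched with base \R\ and the \texttt{duplicated} function.
The data frame and its column names are hypothetical, and the sketch assumes that two statements count as duplicates when actor, concept, and qualifier value are identical, which is consistent with the value of 2 discussed above but is not taken from the \dna\ source code.

<<eval=FALSE, echo=TRUE>>=
# Hypothetical statements; column names are made up for illustration.
st <- data.frame(
  document     = c("d1", "d1", "d2", "d2", "d3"),
  organization = c("A", "A", "A", "A", "B"),
  concept      = c("c1", "c1", "c1", "c1", "c1"),
  agreement    = c(1, 1, 1, 0, 1)
)

# ignore per document: identical statements within the same document
per_document <- st[!duplicated(st), ]
# ignore across date range: the document is no longer part of the key
key <- c("organization", "concept", "agreement")
across_range <- st[!duplicated(st[, key]), ]

# two-mode edge weights (qualifier ignored) after removing duplicates
table(across_range$organization, across_range$concept)
# A keeps a weight of 2 because it both agreed and disagreed with c1
@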

Again we apply this transformation to Table~\ref{tab:twomode} to illustrate how the resulting matrix is affected:
<<eval=TRUE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "organization",
                  variable2 = "concept",
                  qualifier = "agreement",
                  qualifierAggregation = "ignore",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "acrossrange",
                  timewindow = "no",
                  windowsize = 100,
                  excludeValues = character(),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "", 
      caption = "Two-mode network (organization $\\times$ concept) duplicates excluded across date range\\label{tab:dupl}") %>%
  kable_styling(latex_options = c("basic", "scale_down", "HOLD_position")) %>%
  row_spec(0, angle = 90)
@


\section{Time Series Options}
The options for time series are controlled by the four fields in the fourth row of the export window.
\code{Include from} and \code{Include until} can be used to narrow down the number of documents which are included.
The date and time which matter here are not when the \dna\ Statement was coded, but when the statement was made, which means the date set in the metadata of the document it occurred in.
The standard values are the time and date of the oldest and newest document in the database, respectively.
If you set, for example, \code{Include from} to ``2005-01-27 - 00:00:00'' the document titled ``109-876: Bluestein, Joel-BUS-Y'' will be excluded as this hearing was held one day earlier.

When \code{Moving time window} is set to anything other than the default \code{no time window}, \dna\ creates multiple overlapping time slices that are moved forward along the time axis.
For each time slice, the network will be constructed and exported as an individual file.
The statements in the sample database were made over a short period of time, which leaves us with only the short time units (seconds, minutes, hours, and days) and small window sizes for experimenting.
However, this is sufficient to illustrate how time series are exported in \dna.

You could, for example, be interested in a fast moving debate with actors changing opinions and/or sides daily.
In this case, you could set \code{Moving time window} to \code{using days} and \code{time window size} to one.
Consequently, \dna\ would export one network per day.
These networks do not necessarily feature all organisations or concepts but will only show active nodes, as you can see in Table~\ref{tab:time}.
Since there were only hearings on five of the selected 20 days (we excluded January 26th in the last step), the remaining 15 matrices are even completely empty.
<<eval=FALSE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "organization",
                  variable2 = "concept",
                  qualifier = "agreement",
                  qualifierAggregation = "ignore",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "acrossrange",
                  start.date = "27.01.2005",
                  stop.date = "16.02.2005",
                  start.time = "00:00:00",
                  stop.time = "00:00:00",
                  timewindow = "days",
                  windowsize = 1,
                  excludeValues = character(),
                  verbose = TRUE)
dt6 <- as.data.frame(dt[["networks"]][6], check.names = FALSE)
dt14 <- as.data.frame(dt[["networks"]][14], check.names = FALSE)
dt20 <- as.data.frame(dt[["networks"]][20], check.names = FALSE)
colnames(dt6) <- truncate(colnames(dt6))
colnames(dt14) <- truncate(colnames(dt14))
colnames(dt20) <- truncate(colnames(dt20))

row.names(dt6) <- truncate(row.names(dt6), 20)
row.names(dt14) <- truncate(row.names(dt14), 20)
row.names(dt20) <- truncate(row.names(dt20), 20)

#approach 1
kable(list(dt6, dt14, dt20), format = "latex", booktabs = TRUE, linesep = "",
      caption = "Time series two-mode networks (organizations $\\times$ concept)\\label{tab:time}") %>%
  group_rows(as.character(dt[["time"]][6]), 0, 2) %>%
  group_rows(as.character(dt[["time"]][14]), 3, 4) %>%
  group_rows(as.character(dt[["time"]][20]), 5, 7) %>%
  kable_styling(full_width = T) %>%
  column_spec(1, width = "5cm")

#approach 2
kable(dt6, format = "latex", booktabs = TRUE, linesep = "",
            caption = "Time series two-mode networks (organizations $\\times$ concept)\\label{tab:time}") %>%
  group_rows(as.character(dt[["time"]][6]), 0, 2)  %>%
  kable_styling(full_width = T) %>%
  row_spec(0, angle = 90) %>%
  column_spec(1, width = "5cm")
kable(dt14, format = "latex", booktabs = T) %>%
  group_rows(as.character(dt[["time"]][14]), 0, 1)  %>%
  kable_styling(full_width = T) %>%
  row_spec(0, angle = 90) %>%
  column_spec(1, width = "5cm")
kable(dt20, format = "latex", booktabs = T) %>%
  group_rows(as.character(dt[["time"]][20]), 0, 3)  %>%
  kable_styling(full_width = T) %>%
  row_spec(0, angle = 90) %>%
  column_spec(1, width = "5cm")


#kable_styling(kb, latex_options = c("HOLD_position"))
@


% bug in kableExtra (https://github.com/haozhu233/kableExtra/issues/134), table is hardcoded for now
\begin{knitrout}
\rowcolors{3}{white}{white} %rowcolors turned off
\begin{table}[H]

\caption{Time series two-mode networks (organizations $\times$ concept) (only three of five active days are included)\label{tab:time}}
\centering
\begin{tabu} to 0.6\linewidth {>{\raggedright\arraybackslash}p{5cm}>{\raggedleft}X>{\raggedleft}X>{\raggedleft}X}
\toprule
\addlinespace[0.3em]
\multicolumn{4}{l}{\textbf{2005-02-02}}\\
\rotatebox{90}{\hspace{1em} } & \rotatebox{90}{CO2 legislat...} & \rotatebox{90}{Emissions le...} & \rotatebox{90}{There should...}\\
\midrule
\hspace{1em}Environmental Protec... & 1 & 0 & 1\\
\hspace{1em}Senate & 1 & 1 & 1\\
\bottomrule
\end{tabu}
\vspace{2mm}

\begin{tabu} to 0.6\linewidth {>{\raggedright\arraybackslash}p{5cm}>{\raggedleft}X>{\raggedleft}X>{\raggedleft}X}
\toprule
\addlinespace[0.3em]
\multicolumn{4}{l}{\textbf{2005-02-10}}\\
\rotatebox{90}{\hspace{1em} } & \rotatebox{90}{Climate chan...} & \rotatebox{90}{Emissions le...} & \rotatebox{90}{There should...}\\
\midrule
\hspace{1em}Alliance to Save Ene... & 1 & 1 & 1\\
\bottomrule
\end{tabu}
\vspace{2mm}

\begin{tabu} to 0.6\linewidth {>{\raggedright\arraybackslash}p{5cm}>{\raggedleft}X>{\raggedleft}X>{\raggedleft}X>{\raggedleft}X>{\raggedleft}X>{\raggedleft}X}
\toprule
\addlinespace[0.3em]
\multicolumn{7}{l}{\textbf{2005-02-16}}\\
\rotatebox{90}{\hspace{1em} } & \rotatebox{90}{CO2 legislat...} & \rotatebox{90}{Cap and trad...} & \rotatebox{90}{Climate chan...} & \rotatebox{90}{Climate chan...} & \rotatebox{90}{Emissions le...} & \rotatebox{90}{There should...}\\
\midrule
\hspace{1em}National Petrochemic... & 1 & 0 & 0 & 0 & 0 & 1\\
\hspace{1em}Sierra Club & 1 & 0 & 1 & 1 & 1 & 1\\
\hspace{1em}U.S. Public Interest... & 1 & 1 & 1 & 1 & 1 & 0\\
\bottomrule
\end{tabu}
\end{table}

\end{knitrout}
In this special case, the time slices do not overlap because the window is one day wide and is moved forward one day at a time.
However, if we selected two or a higher number in \code{time window size}, consecutive slices would overlap because the window is still moved forward by one time unit at a time.
For example, if there had been hearings on the first, second and third of February, a network would be created for the first and second February and a second network would feature statements from the second and third.
If there were more dates included in the time range, this process would continue until the end of the time period is reached.
Other time units work in the same way.
To get mutually exclusive (i.\,e., non-overlapping) time slices, the user should select them manually from the output.
For example, if the window size is 10 days, you could select only files number 1, 11, 21, and so on.
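
The selection of non-overlapping slices can be sketched in base \R.
The example is hypothetical: it assumes a range of 30 days, a window size of 10 days, and one window starting on each day, and it leaves aside how \dna\ treats windows at the very end of the date range.

<<eval=FALSE, echo=TRUE>>=
# Hypothetical example: 30 days, window size of 10 days, one window
# starting on each day (incomplete windows at the end are left out).
days <- 1:30
size <- 10
starts <- days[days + size - 1 <= max(days)]
windows <- lapply(starts, function(s) s:(s + size - 1))

# consecutive windows overlap ...
intersect(windows[[1]], windows[[2]])   # days 2 to 10
# ... so keep every size-th window to obtain mutually exclusive slices
keep <- seq(1, length(windows), by = size)
keep                                    # 1 11 21
@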

Instead of time units, it is also possible to use \code{event time}.
This will create time slices of exactly 100 statement events, for example.
However, it is possible that multiple events have identical timestamps.
In this case, the resulting network time slice is more inclusive and also contains those statements that happened at the same time.

\section{Exclude from Variable}
The last row in the export window includes options which can be used to exclude statements which reference certain values.
This can make sense if you are, for example, only interested in a certain kind of organisation.
To include only NGOs in your network, you could select organization from \code{Exclude from variable} and then while holding the Ctrl key select ``Energy and Environmental Analysis, Inc.'', ``Environmental Protection Agency'', ``National Petrochemical \& Refiners Association'' and ``Senate'' from the list on the right.
The preview will now show which nodes are excluded.
The exported network omits all statements which reference one of these organisations and only contains statements made by NGOs:
<<eval=TRUE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "concept",
                  variable2 = "organization", 
                  qualifier = "agreement",
                  qualifierAggregation = "subtract",
                  normalization = "no",
                  isolates = FALSE,
                  duplicates = "acrossrange",
                  timewindow = "no",
                  windowsize = 1,
                  excludeValues = list("organization" =
                                         c("Energy and Environmental Analysis, Inc.",
                                           "Environmental Protection Agency",
                                           "Senate",
                                           "National Petrochemical & Refiners Association")),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "",  caption = "Two-mode network (organizations $\\times$ concept) with all non-NGOs excluded\\label{tab:exclud}") %>%
  kable_styling(latex_options = c("striped")) %>%
  row_spec(0, angle = 90)
@
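
Conceptually, the exclusion works like a filter on the statements before the network is built.
The following base \R\ sketch illustrates this with a hypothetical data frame of statements; only the organisation names are taken from the sample database, and the object names are made up.

<<eval=FALSE, echo=TRUE>>=
# Hypothetical statements; only the organization names are taken from
# the sample database.
st <- data.frame(
  organization = c("Senate", "Sierra Club",
                   "Environmental Protection Agency", "Sierra Club"),
  concept      = c("c1", "c1", "c2", "c2")
)
exclude <- list(organization =
                  c("Energy and Environmental Analysis, Inc.",
                    "Environmental Protection Agency",
                    "Senate",
                    "National Petrochemical & Refiners Association"))

# drop every statement whose organization is on the exclusion list
kept <- st[!(st$organization %in% exclude$organization), ]
unique(kept$organization)   # "Sierra Club"
@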

Another common application for this feature is to exclude concepts which are very general and do not add to the structure of the discourse network.
This will be shown in Section~\ref{sec:retnet} where one concept makes the difference between a fairly convoluted and a much more telling network visualisation.


\section{Isolates}
We saved isolates for last because now that all non-NGOs are removed from the export, it is easier to explain what this option does.
If you look again at Table~\ref{tab:exclud}, you will notice that not only are the edge weights smaller than before the exclusion, but the non-NGO organisation nodes are gone as well.
If one of the concepts had never been referenced by any of the NGOs, this concept would be gone as well.
This is because by default only nodes that show at least minimal activity are included in the exported networks.
As you can also see in Table~\ref{tab:time}, exclusion of statements can lead to very differently sized matrices.
However, if you have ever manually merged matrices of different sizes, you will know that it can be tedious to get everything into the right form.
In these situations, it is easier to merge multiple networks if they have the same matrix dimensions.
To achieve compatibility of the matrix dimensions, it is possible to include all nodes of the selected variable(s) in the whole database, irrespective of time, qualifiers, and excluded values (but without any edge weights larger than 0, i.\,e., as isolates).
You can see this in Table~\ref{tab:isolates} which mirrors the information of Table~\ref{tab:exclud} but has the same dimensions as, for example, Table~\ref{tab:twomode}.

<<eval=TRUE, echo=FALSE>>=
conn <- dna_connection(dna_sample(verbose = FALSE))
dt <- dna_network(conn,
                  networkType = "twomode",
                  statementType = "DNA Statement",
                  variable1 = "concept",
                  variable2 = "organization",
                  qualifier = "agreement",
                  qualifierAggregation = "subtract",
                  normalization = "no",
                  isolates = TRUE,
                  duplicates = "acrossrange",
                  timewindow = "no",
                  windowsize = 1,
                  excludeValues = list("organization" =
                                         c("Energy and Environmental Analysis, Inc.",
                                           "Environmental Protection Agency",
                                           "Senate",
                                           "National Petrochemical & Refiners Association")),
                  verbose = TRUE)
colnames(dt) <- truncate(colnames(dt))
kable(dt, format = "latex", booktabs = TRUE, linesep = "",
      caption = "Two-mode network showing columns and rows with all values (even excluded or empty ones)\\label{tab:isolates}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position")) %>%
  row_spec(0, angle = 90)

@
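
What including isolates does to the matrix dimensions can be sketched with a small helper in base \R.
The function \texttt{pad\_matrix} and all names in the example are hypothetical and only illustrate the idea of embedding a reduced network in a zero matrix of full size.

<<eval=FALSE, echo=TRUE>>=
# Hypothetical helper: embed a small matrix in a zero matrix that has
# one row and one column for every node in the database.
pad_matrix <- function(m, all_rows, all_cols) {
  full <- matrix(0, nrow = length(all_rows), ncol = length(all_cols),
                 dimnames = list(all_rows, all_cols))
  full[rownames(m), colnames(m)] <- m
  full
}

small <- matrix(c(1, -1), nrow = 1,
                dimnames = list("Sierra Club", c("c1", "c3")))
padded <- pad_matrix(small,
                     all_rows = c("Senate", "Sierra Club"),
                     all_cols = c("c1", "c2", "c3"))
dim(padded)   # 2 3
@
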
As became clear in this section, the many export options in \dna\ allow for an effective preparation of your data for further analysis.
From the created network tables it is often already possible to draw some cautious inferences.
From Table~\ref{tab:qual5}, for instance, it already became clear that there is more conflict than congruence between several of the actors, such as between Senate and Sierra Club.
<<eval=TRUE, echo=FALSE, warning=FALSE>>=
unlink(dna_sample())
@


\chapter{\rdna: Using \dna\ from \R} \label{chp:rdna}
\chapterauthor{Philip Leifeld and Johannes Gruber}
\FloatBarrier
\dna\ can be connected to the statistical computing environment \R\ \citep{coreteam2017r} through the \rdna\ package \citep{leifeld2018rdna}.
There are two advantages to working with \R\ on \dna\ data.

The first advantage is replicability.
The network export function of \dna\ has many options.
Remembering what options were used in an analysis can be difficult.
If the analysis is executed in \R, commands---rather than mouse clicks---are used to extract networks or attributes from \dna.
These commands are saved in an \R\ script file.
This increases replicability because the script can be re-used many times.
For example, after discovering a wrong code somewhere in the \dna\ database, it is sufficient to fix this problem in the \dna\ file and then re-run the \R\ script instead of manually setting all the options again.
This also reduces the probability of making errors.

The second advantage is the immense flexibility of \R\ in terms of statistical modelling.
Analysing \dna\ data in \R\ permits many forms of data analysis beyond simple visualization of the resulting networks.
Examples include cluster analysis or community detection, scaling and application of data reduction techniques, centrality analysis, and even statistical modelling of network data.
\R\ is also flexible in terms of combining and matching the data from \dna\ with other data sources.

\section{Getting Started with \rdna}
The first step is to install \R, which was explained in Section~\ref{chp:installation}.
Installing additional \R\ packages for network analysis and clustering, such as \texttt{statnet} \citep{handcock2008statnet, goodreau2008statnet, handcock2016statnet}, \texttt{xergm} \citep{leifeld2018temporal, leifeld2017xergm}, \texttt{igraph} \citep{csardi2006igraph}, and \texttt{cluster} \citep{maechler2017cluster}, is recommended.
Moreover, it is necessary to install and correctly set up the \texttt{rJava} package \citep{urbanek2017rjava}, on which the \rdna\ package depends, and the \texttt{devtools} package \citep{wickham2018devtools}, which permits installing \R\ packages from GitHub (see Section~\ref{sec:installdna}).
To install the packages necessary for this section, simply execute the following commands:

<<eval=FALSE>>=
install.packages("statnet")
install.packages("xergm")
install.packages("igraph")
install.packages("cluster")
install.packages("rJava")
install.packages("devtools")
@

Before going on, the \rdna\ package must be attached to the workspace.
If you have already set up \rdna\ (see Section~\ref{sec:installdna}) you can do this with:

<<eval=TRUE, results = 'tex', message = FALSE>>=
library("rDNA")
@

To ensure that the following results can be reproduced exactly, we should set the random seed in \R:

<<eval=TRUE, results = 'tex', message = FALSE>>=
set.seed(12345)
@

Now we are able to use the package.
The first step is to initialize \dna.
Out of the box, \rdna\ does not know where the \dna\ \texttt{.jar} file is located; by default, it looks for the newest version in your working directory.
To use \dna\ and \rdna\ together, we therefore need to register \dna\ with \rdna.
To do that, save the \dna\ \texttt{.jar} file to the working directory of the current \R\ session.
If you haven't already downloaded \dna, you can do this from \R\ by running \code{dna\_downloadJar()}.
This will place the newest version of \dna\ in your working directory.
Then you can initialize \dna\ as follows (with \texttt{dna-2.0-beta22.jar} in this example):

<<eval=FALSE>>=
dna_init("dna-2.0-beta22.jar")

# If you use the current beta version (e.g., you have just downloaded it via
# dna_downloadJar) you can omit the file name
dna_init()
@

After initializing \dna, we can open the \dna\ graphical user interface from the \R\ command line:

<<eval=FALSE>>=
dna_gui()
@

Alternatively, we can provide the file name of a local \dna\ database as an argument, and the database will be opened in \dna.
For example, we could open the \texttt{sample.dna} database which comes with the package.
To do that, we included a convenience function called \code{dna\_sample}.
You can use this function in one of two ways: either you use it to copy the sample database into your current working directory, from where \R\ can easily find it, or you use it directly from the package folder of \rdna:
<<eval=FALSE>>=
# copy sample.dna to your working directory and then open it via dna_gui
dna_sample()
dna_gui("sample.dna")

# or directly open the sample database
dna_gui(dna_sample())
@

In addition to opening the GUI, we will want to retrieve networks and attributes from \dna---which is the main purpose of \rdna!
For this to happen, a connection with a \dna\ database must first be established using the \code{dna\_connection} function, which works similarly to \code{dna\_gui}:

<<eval=TRUE, warning=FALSE>>=
conn <- dna_connection(dna_sample())
@

The \code{dna\_connection} function accepts a file name of the database including full or relative path (or, alternatively, a connection string to a remote \texttt{MySQL} database) and optionally the login and password for the database (in case a remote \texttt{MySQL} database is used).
Details about the connection can be printed by calling the resulting object named \code{conn}.

After initializing \dna\ and establishing a connection to a database, we can now retrieve data from \dna.
We will start with a simple example of a two-mode network from the sample database.
To compute the network matrix, the connection that we just established must be supplied to the \code{dna\_network} function:

<<eval=TRUE, results = 'tex'>>=
nw <- dna_network(conn)
@

The resulting matrix is the same that you have seen in Table~\ref{tab:twomode}.
Another useful fact about the \R\ environment is that there are functions for nearly every task you might want to perform in network analysis, and it is also very well suited for plotting.
The object we just created, for example, can easily be plotted with a single line of code using the \texttt{statnet} suite of packages:
<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 4, fig.height = 4, crop = TRUE>>=
# First attach the statnet package
library("statnet")

# Then simply use gplot
gplot(nw)
@

It is also easy to retrieve the attributes of a variable, for example the colours and types of organizations, using the \code{dna\_getAttributes} function:

<<eval=TRUE, results = 'tex'>>=
at <- dna_getAttributes(conn)
@

The result is a data frame with organizations in the rows and one column per organizational attribute:
<<eval=TRUE, echo = FALSE, results = 'tex'>>=
kable(at, format = "latex", booktabs = TRUE, linesep = "",
      caption = "Attributes of \"organization\" from the sample database\\label{tab:attribute}") %>%
  kable_styling(latex_options = c("striped", "scale_down", "HOLD_position"))
@
The next section will provide usage examples of both the \code{dna\_network} and the \code{dna\_getAttributes} functions.


\section{Retrieving Networks and Attributes}\label{sec:retnet}
This section will explore the \code{dna\_network} function and facilities for retrieving attributes in more detail.
The \code{dna\_network} function has a number of arguments, which resemble the export options in the \dna\ export window (see Chapter~\ref{chp:dna-export}).
The help page for the \code{dna\_network} function provides details on these arguments.
It can be opened using the command

<<eval=FALSE>>=
help("dna_network")
@

If you are using \rstudio, this will open the help window, which is by default in the lower right corner of the user interface. Instead of typing \code{help("Name\_of\_Function")} into the console you can also use \code{?Name\_of\_Function} or use the search bar in the help window in \rstudio.

We will start with a simple example: a one-mode congruence network of organizations in a policy debate.
We will use the same \texttt{sample.dna} database as in Chapter~\ref{chp:dna-export}.
As mentioned above, it is a small excerpt from a larger empirical research project that tries to map the ideological debates around American climate politics in the U.S. Congress over time.
Accordingly, one should expect to find a polarized debate with environmental groups on one side and industrial interest groups on the other side.
To compute a one-mode congruence network, the following code can be used:

<<eval=TRUE, results = 'tex'>>=
congruence <- dna_network(conn,
                          networkType = "onemode",
                          statementType = "DNA Statement",
                          variable1 = "organization",
                          variable2 = "concept",
                          qualifier = "agreement",
                          qualifierAggregation = "congruence",
                          duplicates = "document")
@

The result is an organization $\times$ organization matrix, where the cells represent on how many concepts any two actors (i.\,e., the row organization and the column organization) had the same issue stance (by values of the qualifier variable \code{agreement}).

If you have read Chapter~\ref{chp:dna-export}, you can clearly see now that the arguments of the \code{dna\_network} function resemble the options in the \dna\ export window.
Details on the various arguments of the function can be obtained by displaying the help page (\code{?dna\_network}).
In the code chunk above, \code{statementType = "DNA Statement"} indicates which statement type should be used for the network export.
In this case, the statement type \texttt{\dna\ Statement} contains the variables \code{organization}, \code{concept}, and \code{agreement}.
The argument \code{qualifierAggregation = "congruence"} causes \rdna\ to count how often the unique elements of \code{variable1} have an identical value on the \code{qualifier} variable (here: \code{agreement}) when they refer to a concept (\code{variable2}; more details can be found in Section~\ref{sec:typeofn}).

If the algorithm finds duplicate statements within documents---i.\,e., statements containing the same organization, concept, and agreement pattern---, only one of them is retained for the analysis (\code{duplicates = "document"}).
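
The counting rule behind \code{qualifierAggregation = "congruence"} can be illustrated with a small base \R\ sketch.
The statements below are hypothetical toy data (they are not taken from \texttt{sample.dna}), and the sketch counts each concept at most once per pair of organizations, which is a simplification and not necessarily the exact algorithm implemented in \dna:

<<eval=FALSE>>=
# Hypothetical toy statements, not part of the sample database
toy_statements <- data.frame(
  document     = c(1, 1, 2, 2, 3, 3),
  organization = c("A", "A", "B", "B", "C", "C"),
  concept      = c("c1", "c1", "c1", "c2", "c1", "c2"),
  agreement    = c(1, 1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Keep only one copy of identical statements within a document
toy_statements <- unique(toy_statements)

orgs     <- sort(unique(toy_statements$organization))
concepts <- sort(unique(toy_statements$concept))

# Binary organization x concept matrix for one value of the qualifier
stance_matrix <- function(value) {
  s <- toy_statements[toy_statements$agreement == value, ]
  counts <- table(factor(s$organization, levels = orgs),
                  factor(s$concept, levels = concepts))
  matrix(as.integer(counts > 0), nrow = length(orgs),
         dimnames = list(orgs, concepts))
}

pro <- stance_matrix(1)
con <- stance_matrix(0)

# Shared positive stances plus shared negative stances for each pair
toy_congruence <- pro %*% t(pro) + con %*% t(con)
diag(toy_congruence) <- 0
toy_congruence
@

In this toy example, A and B share one positive stance (on \texttt{c1}), B and C share one negative stance (on \texttt{c2}), and A and C share none.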

The resulting \code{congruence} matrix can be converted to a network object and plotted as follows:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 5, crop = TRUE>>=
nw <- network(congruence)
plot(nw,
     edge.lwd = congruence^2,
     displaylabels = TRUE,
     label.cex = 0.5,
     usearrows = FALSE,
     edge.col = "gray"
     )
@

A little background in \R\ is needed to understand what is happening here.
The plot function in \R\ is somewhat special as it can handle a lot of different objects.
Instead of trying to plot every object in the same way, \R\ will automatically detect its class and use a specific plotting method for it.
You can check the class of the object we are plotting yourself with \code{class(nw)}.
Since the class of this object is \code{network}, \R\ will automatically use \code{plot.network}.
That means that if you want to learn more about the plotting method we used above, you can simply type \code{help(plot.network)} into the \R\ console.
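
The dispatch mechanism itself can be tried out without any network data.
The following sketch defines a hypothetical class called \code{toyclass} and a matching \code{print} method; the class and its contents are made up for illustration, but \code{plot} selects its methods in the same way:

<<eval=FALSE>>=
# A hypothetical class used only for illustration
toy <- structure(list(nodes = 3), class = "toyclass")

# A method is an ordinary function named <generic>.<class>
print.toyclass <- function(x, ...) {
  cat("toy object with", x$nodes, "nodes\n")
  invisible(x)
}

class(toy)  # "toyclass"
print(toy)  # dispatches to print.toyclass
@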

Here, we additionally used the \code{edge.lwd} argument of the \code{plot.network} function to make the line width proportional to the strength of congruence between actors.
We used squared edge weights to emphasize the difference between low and high edge weights.
And we also displayed the labels of the nodes at half the normal size, suppressed arrow heads, and changed the colour of the edges to grey.
More information about the visualization capabilities of the \texttt{network} and \texttt{sna} packages is provided by \citet{butts2008social, butts2008network, butts2015network}.

As you can see now, the network is not particularly polarized.
We can suspect that some of the concepts are not very contested.
If they are supported by all actors, this may mask the extent of polarization with regard to the other concepts. We can see if this is the case by plotting the agreement and disagreement towards concepts with \code{dna\_barplot}:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 10, fig.height = 5, crop = TRUE>>=
dna_barplot(conn, of = "concept", fontSize = 10)
@

It looks like the concept ``There should be legislation to regulate emissions.'' is in fact very consensual.
By repeating the same command with a few different options we can check if the organizations do in fact all agree on this matter:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 10, fig.height = 5, crop = TRUE>>=
dna_barplot(conn,
            of = "organization",
            fontSize = 10,
            excludeValues = list("concept" =
            "There should be legislation to regulate emissions."),
            invertValues = TRUE)
@

Instead of plotting agreement towards the concepts, \code{of = "organization"} shows how often each organization agreed or disagreed in its statements.
By combining \code{excludeValues} and \code{invertValues = TRUE}, however, we tell \R\ to consider \emph{only} the statements about the concept we want to look at.
Again, you can get help with this function with the command \code{help(dna\_barplot)}.
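
The interplay of the two arguments can be pictured as a simple filter.
The sketch below uses hypothetical concept labels and plain base \R; it only mimics the selection logic described above and is not the code used inside \rdna:

<<eval=FALSE>>=
# Hypothetical concept labels of five statements
concept  <- c("regulate", "cap and trade", "regulate", "CO2", "regulate")
selected <- "regulate"

# excludeValues alone: drop statements that match the listed values
keep_excluded <- !(concept %in% selected)

# excludeValues together with invertValues = TRUE: keep only the matches
keep_inverted <- concept %in% selected

concept[keep_excluded]  # "cap and trade" "CO2"
concept[keep_inverted]  # "regulate" "regulate" "regulate"
@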

The plot shows that everyone but the ``National Petrochemical \& Refiners Association'' agrees to the concept we chose, which indicates that consensus on the statement ``There should be legislation to regulate emissions.'' obfuscates the real structure of the network.
Therefore we should exclude it from the congruence network.
To do that, we need to export and plot the congruence network again and use the \code{excludeValues} argument this time:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 5, crop = TRUE>>=
congruence <-
  dna_network(conn,
              networkType = "onemode",
              statementType = "DNA Statement",
              variable1 = "organization",
              variable2 = "concept",
              qualifier = "agreement",
              qualifierAggregation = "congruence",
              duplicates = "document",
              excludeValues = list("concept" =
              "There should be legislation to regulate emissions."))
nw <- network(congruence)
plot(nw,
     edge.lwd = congruence^2,
     displaylabels = TRUE,
     label.cex = 0.5,
     usearrows = FALSE,
     edge.col = "gray"
     )
@

This reveals the structure of the actor congruence network much better.
There are two camps revolving around environmental groups on the right and industrial interest groups and state actors on the left, with \texttt{Energy and Environmental Analysis, Inc.} taking a bridging position.
The strongest belief congruence ties can be found within, rather than between, the coalitions.

Next, we should tweak the congruence network further by changing the appearance of the nodes.
We can use the colours for the organization types saved in the database and apply them to the nodes in the network.
We can also make the size of each node proportional to its activity.
The \code{dna\_getAttributes} function serves to retrieve these additional data from \dna.
The result is a data frame with the relevant data for each organization in the \texttt{color} and \texttt{frequency} columns:%
\footnote{This prints the data.frame directly to the console. If you wish to examine the data.frame in a spreadsheet-style viewer instead, you can use \code{View(at)}.}

<<eval = TRUE, results = 'show'>>=
at <- dna_getAttributes(conn,
                        statementType = "DNA Statement",
                        variable = "organization")
at
@

To use these data in the congruence network visualization, we can use the plotting facilities of the \code{plot.network} function:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 5, crop = TRUE>>=
plot(nw,
     edge.lwd = congruence^2,
     displaylabels = TRUE,
     label.cex = 0.5,
     usearrows = FALSE,
     edge.col = "gray",
     vertex.col = at$color,
     vertex.cex = at$frequency
     )
@

This yields a clear visualization of the actor congruence network, with simultaneous display of the network structure including its coalitions, the actors' activity in the debate, and actor types.
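
Passing attribute columns directly to the plot relies on the attribute rows being in the same order as the rows of the network matrix.
If this is not guaranteed, for example because some values were excluded from the network, the attributes can be aligned by name first.
The following is a minimal sketch with hypothetical data; the column name \code{value} for the node labels is an assumption, and the square-root scaling is merely one possible way to dampen large differences in activity:

<<eval=FALSE>>=
# Hypothetical network matrix and attribute table
toy_matrix <- matrix(0, nrow = 3, ncol = 3,
                     dimnames = list(c("B", "C", "A"), c("B", "C", "A")))
toy_at <- data.frame(value     = c("A", "B", "C", "D"),
                     color     = c("#31a354", "indianred",
                                   "deepskyblue", "black"),
                     frequency = c(2, 9, 4, 1),
                     stringsAsFactors = FALSE)

# Reorder (and subset) the attribute rows to match the matrix rows
aligned <- toy_at[match(rownames(toy_matrix), toy_at$value), ]

# Square-root scaling keeps very active actors from dominating the plot
node_size <- sqrt(aligned$frequency)

aligned$value  # "B" "C" "A"
@

The vectors \code{aligned\$color} and \code{node\_size} could then be supplied as \code{vertex.col} and \code{vertex.cex}.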

Another way to visualize a discourse network is a two-mode network visualization.
To compute a two-mode network of organizations and concepts, the following code can be used:

<<eval=TRUE, results = 'tex'>>=
affil <- dna_network(conn,
                     networkType = "twomode",
                     statementType = "DNA Statement",
                     variable1 = "organization",
                     variable2 = "concept",
                     qualifier = "agreement",
                     qualifierAggregation = "combine",
                     duplicates = "document",
                     verbose = TRUE)
@

This creates a $7 \times 6$ matrix of organizations and their relations to concepts.
The argument \code{networkType = "twomode"} is necessary because the rows and columns of the \texttt{affil} matrix should contain different variables.
The arguments \code{variable1 = "organization"} and \code{variable2 = "concept"} define which variables should be used for the rows and columns, respectively.
The arguments \code{qualifier = "agreement"} and \code{qualifierAggregation = "combine"} define how the cells of the matrix should be populated:
\code{agreement} is a binary variable, and the \code{combine} option causes a cell to have a value of $0$ if the organization never refers to the concept, $1$ if the organization refers to the respective concept exclusively in a positive way, $2$ if the organization refers to the concept exclusively in a negative way, and $3$ if there are both positive and negative statements by the organization about the concept.
We have covered this before in detail in Section~\ref{sec:qualifier}.
\rdna\ reports on the \R\ console what each combination means (if you set \code{verbose = TRUE}).
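
The coding scheme of the \code{combine} option can also be written down as a one-line rule.
The helper \code{combine\_code} below is hypothetical and not part of \rdna; it takes the number of positive and negative statements an organization made about a concept and returns the corresponding cell value:

<<eval=FALSE>>=
# Hypothetical helper mirroring the coding described above:
# 0 = no mention, 1 = only positive, 2 = only negative, 3 = both
combine_code <- function(positive, negative) {
  as.integer(positive > 0) + 2L * as.integer(negative > 0)
}

# Four organization-concept pairs with different mention patterns
codes <- combine_code(positive = c(0, 2, 0, 1),
                      negative = c(0, 0, 3, 1))
codes  # 0 1 2 3
@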

As in the previous example, the resulting network matrix can be converted to a \texttt{network} object (as defined in the \texttt{network} package).
The colours of the edges can be stored as an edge attribute, and the resulting object can be plotted with different colours representing positive, negative, and ambivalent mentions.

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 6, crop = TRUE>>=
nw <- network(affil, bipartite = TRUE)
colors <- as.character(t(affil))
colors[colors == "3"] <- "deepskyblue"
colors[colors == "2"] <- "indianred"
colors[colors == "1"] <- "#31a354"
colors <- colors[colors != "0"]
set.edge.attribute(nw, "color", colors)
plot(nw,
     edge.col = get.edge.attribute(nw, "color"),
     vertex.col = c(rep("white", nrow(affil)),
                    rep("black", ncol(affil))),
     displaylabels = TRUE,
     label.cex = 0.5
     )
@

In this example, we first converted the two-mode matrix to a bipartite \texttt{network} object, then created a vector of colours for the edges (excluding zeros), and inserted this vector into the \texttt{network} object as an edge attribute.
It was necessary to work with the transposed \texttt{affil} matrix (using the \code{t} function) because the \code{set.edge.attribute} function expects edge attributes in a row-wise order while the \code{as.character} function returns them in a column-wise order based on the \texttt{affil} matrix.
Finally, we plotted the network object with edge colours and labels.
In the visualization, we used white nodes for organizations and black nodes for concepts.
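
The effect of the transposition is easiest to see on a small example.
The matrix below is hypothetical; it only demonstrates that \R\ unrolls a matrix column by column, so that the transposed matrix has to be unrolled to obtain the cell values row by row:

<<eval=FALSE>>=
# Hypothetical 2 x 3 two-mode matrix
m <- matrix(c(1, 0, 2,
              0, 3, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("org1", "org2"), c("c1", "c2", "c3")))

col_wise <- as.vector(m)     # 1 0 0 3 2 1 (column by column)
row_wise <- as.vector(t(m))  # 1 0 2 0 3 1 (row by row)

# Dropping the zeros leaves one value per edge, in row-wise order
edge_values <- row_wise[row_wise != 0]
edge_values  # 1 2 3 1
@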

A side note on colours in \R: As you can see in this code chunk, there are three kinds of colour specifications which \R\ can process:
simple general colour names like \emph{black} and \emph{white}, very specific colour names like \emph{deepskyblue} and \emph{indianred}, and hexadecimal numbers that represent a specific mixture of red, green and blue.
For finding the specific names, there is an extensive \href{https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf}{R colour cheatsheet}, which is also a good resource for getting a quick overview of colours in \R.
Using hexadecimal numbers might seem overly complicated for the task of picking a colour for a plot at first.
However, there are two reasons why you should consider it anyway.
First, it is nowadays very easy to install a colour picker addon in your web browser, which gives you the opportunity to pick any colour you like on a website and use it for your own plots in \R.
And second, because there is a fantastic website called \url{http://colorbrewer2.org/} that offers a number of very pleasant looking colour palettes which can make your plots appear more modern.
The easiest way to use colours from both sources is by entering the hexadecimal numbers, preceded by \texttt{\#}.
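
The three notations can be converted into each other with functions from the \texttt{grDevices} package, which is attached by default in \R.
The short sketch below uses the colours from the code chunk above:

<<eval=FALSE>>=
# Is a name one of the built-in colour names?
is_named <- "deepskyblue" %in% colors()  # TRUE

# Red, green, and blue components (0-255) of a name or a hex code
rgb_named <- col2rgb("deepskyblue")      # 0, 191, 255
rgb_hex   <- col2rgb("#31a354")          # 49, 163, 84

# And back from the components to a hexadecimal code
hex <- rgb(49, 163, 84, maxColorValue = 255)
hex  # "#31A354"
@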

We can now see the opinions of all actors on the various concepts.
The blue edge indicates that \texttt{Energy and Environmental Analysis, Inc.} has both positive and negative things to say about the concept \texttt{``Emissions legislation should regulate CO2''}.
This is why this organization acts as a bridge between the two camps in the congruence network.
Furthermore, we can now see more clearly that the concept we omitted in the congruence network, \texttt{``There should be legislation to regulate emissions''}, is viewed positively by four organizations, but still receives a negative mention by one actor.
The green edges span both camps, and this caused additional connections between the two groups.
The negative tie is ignored in the construction of the congruence network because conflicts are not counted and there is no second red tie to that concept.

\section{Cluster analysis}
In the last section we have seen how you can identify communities in a discourse by visually inspecting network plots.
However, this becomes difficult very quickly as soon as discourse network analysis is performed on a larger database with more than just a few actors.
Additionally, different levels of access to the media or other public forums can lead one organisation to be able to make many more statements than others and therefore mask the strength of ties between actors.
As briefly mentioned in Section~\ref{sec:normal}, it is therefore often fruitful to apply clustering techniques to DNA networks to get a better grip on the characteristics of a discourse.

\rdna\ users can do this very easily, since our function \code{dna\_cluster} already comprises options for several clustering algorithms, along with the appropriate normalisation techniques.
Again you can get help on the function with \code{help(dna\_cluster)}.

We can perform clustering directly on the connection to the sample database we have been using above and do not need to call \code{dna\_network} first:

<<eval=TRUE, results='tex'>>=
# Use the command on the connection to the sample
clust <- dna_cluster(conn)

# And simply type the name of the object to print information about it
clust
@

As you can see from the information printed to the screen when \code{clust} was called, the cluster method used by default is \texttt{ward.D2}.
The distance measure is set automatically, based on the properties of the network.
Normally, this defaults to Euclidean distance, except if the network matrix which is constructed in the background is binary.
Since the default of \code{dna\_cluster} is to exclude duplicated statements on the document level (\code{duplicates = "document"}), the resulting network matrix is binary in this case, meaning that 1 represents a connection between actors and 0 means there is no connection.
In effect, this normalises the network: it no longer matters how often an actor makes the same statement within a document, and the actors in the sample database each appear in only one document.
In other scenarios, however, actors might make the same statement many times over different documents.
In this case, you might want to use a different setting for the \code{duplicates} argument.
\code{duplicates = "acrossrange"}, for example, will make sure a binary network is created in all cases.
If you turn the removal of duplicates off, the Euclidean distance will be used for clustering in this case and the number of statements each organisation has made will influence the clustering results:

<<eval=TRUE, results = 'tex'>>=
clust <- dna_cluster(conn, duplicates = "include")
clust
@
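
Why the number of statements matters can be seen in a small base \R\ sketch.
The count matrix below is hypothetical, and Euclidean distance is used for both versions of the matrix purely to isolate the effect of the counts; it is not meant to reproduce the distance measure that \code{dna\_cluster} selects internally:

<<eval=FALSE>>=
# Hypothetical statement counts (rows: organizations, columns: concepts)
counts <- rbind(A = c(5, 5, 0),
                B = c(1, 1, 0),
                C = c(0, 0, 1))

# What remains once duplicates are removed: a binary matrix
binary <- (counts > 0) * 1

d_counts <- as.matrix(dist(counts, method = "euclidean"))
d_binary <- as.matrix(dist(binary, method = "euclidean"))

d_counts["A", "B"]  # about 5.66: same stances, different activity
d_counts["B", "C"]  # about 1.73
d_binary["A", "B"]  # 0: identical profiles once counts are ignored
@

With the raw counts, B ends up closer to C than to A even though A and B hold exactly the same positions.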

Objects created via \code{dna\_cluster} can be directly plotted in \R\ using the same plot command as above:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 5, fig.height = 7, out.width = "40%", out.height = "40%", crop = TRUE>>=
plot(clust)
@

\code{plot} produces a simple dendrogram in this case, which shows the different clusters as branches and the organisations as leaves of a tree.
This plot might be informative enough in some cases, but it is neither very appealing nor is it particularly easy to customise its shape, colours or labels.
This is why we have integrated a special plot function into \rdna\ to make the life of users easier and to produce arguably nicer plots which can often be directly used for publication:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 5, fig.height = 7, out.width = "40%", out.height = "40%", crop = TRUE>>=
dna_plotDendro(clust)
@

As you can see, this is essentially the same plot but the overly long labels on the x-axis were truncated (you can set the maximum number of characters using the argument \code{truncate =}), the leaves of the dendrogram have the colours chosen for each organisation in \dna, and the branches now consist of dashed lines to highlight where branches split.
When you look into the help for \code{dna\_plotDendro} you will see that the function is highly customisable and offers a range of extra features.

First of all, the plot suggests that the discussion consists of three broad clusters.
``National Petrochemical \& Refiners Association'', ``Energy and Environmental Analysis, Inc.'' and the ``Senate'' appear to form one cluster; the ``Alliance to Save Energy'', ``Sierra Club'' and ``U.S. Public Interest Research Group'' make up the second one; while the ``Environmental Protection Agency'' seems to be a single community on its own.
But remember that the data for this plot still include all duplicated statements and the concept which we identified in the last section as so consensual that it masks the conflict in this discussion.
So we need to remove this now.

Excluding values from the clustering is not a dedicated option in \code{dna\_cluster}, but it is very easy to do anyway.
Note the description of the \code{...} argument in \code{help(dna\_cluster)}: ``Additional arguments passed to dna\_network.''
This means that you can use nearly all arguments from \code{dna\_network} by simply writing them within the brackets after \code{dna\_cluster}.
When we completely remove duplicates and the problematic concept as described, the resulting dendrogram looks a bit different:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 5, fig.height = 7, out.width = "40%", out.height = "40%", crop = TRUE>>=
clust <- dna_cluster(conn,
                     duplicates = "acrossrange",
                     attribute1 = "type",
                     cutree.k = 2,
                     excludeValues = list("concept" =
                     "There should be legislation to regulate emissions."))
dna_plotDendro(clust, shape = "diagonal", colours = "brewer", rectangles = "red")
@

Now we can see that the two main branches of the dendrogram show the same communities we have identified in the network plot.
As you may have noticed, we also changed some other things about the plot, such as its shape and the selection of colours, and we drew red rectangles around the two main clusters.
As mentioned before, the function is highly customisable and can visualise up to four sets of information about the variables:

\begin{description}
\item[\code{leaf\_colours}] Takes the values \code{"attribute1"} or \code{"attribute2"} and colours the leaves accordingly. 
The values stored in attribute1 and attribute2 of the object you are plotting can be set in \code{dna\_cluster} and can be checked using the command \code{attributes(clust)\$colours}.
In this example, \code{attribute1} was set to type, which means that the type of organisation was used to colour leaves.
\item[\code{leaf\_ends}] This works in the same way as \code{leaf\_colours} but instead of the leaf colours, it assigns different shapes to the line ends of the leaves.
\item[\code{activity}] This can be either turned on or off via \code{TRUE} and \code{FALSE}. 
If turned on, the size of the line ends will be determined by the activity of the leaf in the network, i.\,e., how many statements an actor made (minus the duplicated statements, if you chose to exclude them).
\item[\code{rectangles}] You can either provide a colour value to draw rectangles around groups or leave this empty to make the boxes disappear (which is the default). The group membership of each organisation in the example above is determined by providing \code{cutree.k = 2} in \code{dna\_cluster}.
If you neither provide a value for \code{cutree.k} nor \code{cutree.h}, all leaves will belong to the same group.
\end{description}
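
The two cutting arguments can be explored with base \R\ alone.
The sketch below assumes that \code{cutree.k} and \code{cutree.h} correspond to the \code{k} and \code{h} arguments of the base function \code{cutree}; this correspondence and the toy matrix are assumptions made for illustration:

<<eval=FALSE>>=
# Hypothetical binary two-mode matrix (rows: actors, columns: concepts)
m <- rbind(a = c(1, 1, 0, 0),
           b = c(1, 1, 1, 0),
           c = c(0, 0, 1, 1),
           d = c(0, 0, 0, 1))

# Hierarchical clustering with the method that is the default above
hc <- hclust(dist(m), method = "ward.D2")

# Cut the tree into a fixed number of groups ...
groups_k <- cutree(hc, k = 2)
# ... or at a fixed height of the dendrogram
groups_h <- cutree(hc, h = 1.5)

groups_k  # a and b form one group, c and d the other
@

Both calls yield the same grouping here because the two pairs merge at a height of 1, whereas the final merge happens well above 1.5.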

Therefore it is possible to visualise a lot of different information in just one plot, which is great for publication, or if you want to use the plot to discover patterns for further analysis. 
The following plot shows this by employing the full potential of the function.

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
clust <- dna_cluster(conn,
                     variable1 = "person",
                     attribute1 = "value",
                     attribute2 = "type",
                     cutree.k = 2,
                     excludeValues = list("concept" =
                     "There should be legislation to regulate emissions."))

# Now the first attribute contains just the names of the persons
clust$attribute1

# You can replace these values by something more informative like this
clust$attribute1 <- c("male", "female", "male", "male", "male", 
                      "female", "male")

# You can change the legend by changing the colours attribute
attr(clust, "colours") <- c("Gender", "Organisation Type")

# Then you are ready to plot
library("ggplot2")
dna_plotDendro(clust,
               activity = TRUE,
               shape = "diagonal",
               truncate = 20,
               leaf_colours = "attribute1",
               colours = "brewer",
               custom_colours = "Set1",
               rectangles = "#e34a33",
               leaf_ends = "attribute2",
               ends_alpha = 0.8,
               font_size = 9,
               leaf_labels = "ticks") +
  coord_flip()
@

As becomes clear from this plot, the three NGOs are clustered together, while other organisations make up the second cluster.
You can also see that while the NGOs are roughly the same in terms of activity, the number of statements from persons who belong to business or government organisations or are in Congress differs widely.
Unsurprisingly, the gender of the speakers does not seem to influence in which cluster they end up, but at least you now know how to add this information.

This example also shows how you can make further changes to the plot using commands from the \texttt{ggplot2} package \citep{wickham2009ggplot2}: simply add them via \code{+} at the end of the plot function.
\texttt{ggplot2} is very popular in the \R\ community and you can find many tutorials online on how to produce plots with it.
As \code{dna\_plotDendro} essentially uses \texttt{ggplot2} under the hood, you can use functions from \texttt{ggplot2} and its extensions to manipulate the appearance of the dendrogram further.
In this case, we added \code{+ coord\_flip()} after the end of the call to \code{dna\_plotDendro} to rotate the plot by 90°.
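
The same mechanism can be sketched with a plain \texttt{ggplot2} plot, independent of \rdna.
The data frame \code{toy} and its values are made up for illustration; \code{ggplot}, \code{geom\_col} and \code{coord\_flip} are regular \texttt{ggplot2} functions:

<<eval=FALSE>>=
library("ggplot2")

# Made-up example data: number of statements per actor
toy <- data.frame(actor = c("A", "B", "C"),
                  statements = c(3, 7, 5))

# Build a plot object first ...
p <- ggplot(toy, aes(x = actor, y = statements)) +
  geom_col()

# ... then add a further component with `+`, here a flipped coordinate system
p_flipped <- p + coord_flip()
p_flipped
@

Because the plot is an ordinary object, it can be stored, extended step by step and printed whenever it is needed.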

Another option is not particularly useful with a small dataset but can work well with larger samples: when you set \code{circular = TRUE}, the dendrogram is plotted as a circle, which makes room for more leaves without requiring a large amount of horizontal space.

<<eval=TRUE, results = 'tex', fig.width = 7, fig.height = 5, out.width = "70%", out.height = "30%", crop = TRUE>>=
dna_plotDendro(clust,
               circular = TRUE,
               leaf_colours = "attribute2",
               leaf_labels = "nodes",
               colours = "brewer",
               custom_colours = "Set2",
               theme = "void")
@


What we have not explored so far is that \code{dna\_cluster} can use a range of clustering methods besides the default \texttt{ward.D2}.
First, we implemented all clustering methods from the \code{hclust} function in \R\ (see \code{help(hclust)} for details).
Second, we included the ``cluster\_edge\_betweenness'', ``cluster\_leading\_eigen'' and ``cluster\_walktrap'' algorithms from the \texttt{igraph} package \citep{csardi2006igraph}.
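
What switching between \code{hclust} methods means can be sketched outside of \rdna\ with base \R.
The matrix \code{m} is made-up toy data, and the use of Jaccard distances via \code{dist} is an assumption for illustration, not necessarily what \code{dna\_cluster} does internally:

<<eval=FALSE>>=
# Toy actor x concept matrix (1 = actor agrees with the concept)
m <- rbind(a = c(1, 1, 0, 0),
           b = c(1, 1, 1, 0),
           c = c(0, 0, 1, 1),
           d = c(0, 1, 1, 1))

# Jaccard distances between actors (assumption for this sketch)
d <- dist(m, method = "binary")

# Run hclust with several agglomeration methods and cut each tree
# into two groups
methods <- c("ward.D2", "centroid", "average")
memberships <- sapply(methods, function(x) {
  cutree(hclust(d, method = x), k = 2)
})

# One row per actor, one column per method
memberships
@

Comparing the columns of \code{memberships} shows directly whether the methods agree on the grouping of the actors.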

We can compare a few of these algorithms by running \code{dna\_cluster} and \code{dna\_plotDendro} multiple times and then arranging the individual plots in a grid:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 7, fig.height = 7, crop = TRUE>>=
# We save the concept we want to exclude in a new object, so we do not have to
# repeat this line multiple times
excludeConcept <- list("concept" =
                         "There should be legislation to regulate emissions.")

# Then we call dna_cluster four times with four different clustering methods
clust_centroid <- dna_cluster(conn,
                              clust.method = "centroid",
                              cutree.k = 2,
                              excludeValues = excludeConcept)
clust_edbe <- dna_cluster(conn,
                              clust.method = "edge_betweenness",
                              cutree.k = 2,
                              excludeValues = excludeConcept)
clust_leei <- dna_cluster(conn,
                              clust.method = "leading_eigen",
                              cutree.k = 2,
                              excludeValues = excludeConcept)
clust_walktrap <- dna_cluster(conn,
                              clust.method = "walktrap",
                              cutree.k = 2,
                              excludeValues = excludeConcept)

# Now we plot the four clustered objects but save the plots in objects instead
# of printing them to the screen directly
dend_centroid <- dna_plotDendro(clust_centroid, show_legend = FALSE, truncate = 25) +
  ggtitle("centroid")
dend_edbe <- dna_plotDendro(clust_edbe, show_legend = FALSE, truncate = 25) +
  ggtitle("edge_betweenness")
dend_leei <- dna_plotDendro(clust_leei, show_legend = FALSE, truncate = 25) +
  ggtitle("leading_eigen")
dend_walktrap <- dna_plotDendro(clust_walktrap, show_legend = FALSE, truncate = 25) +
  ggtitle("walktrap")

# Now we arrange the plots in a grid and save to a new object 'grid'
library("gridExtra")
grid <- grid.arrange(dend_centroid,
                     dend_edbe,
                     dend_leei,
                     dend_walktrap)

@

As you can see, all but one of the cluster methods arrive at basically the same result:
NGOs belong to one main cluster, while the other organisations form the second one.
Only clustering the sample data based on the leading eigenvector algorithm from the \texttt{igraph} package arrives at a different result.

Once your plot is displayed in the plot pane of \rstudio, you can use the export button to save the image as a PNG or PDF file for later use.
Alternatively, you can also use another command from \texttt{ggplot2} to save the plot, which can often lead to better results than using \rstudio's built-in feature:

<<eval=FALSE>>=
ggsave(plot = grid, filename = "Cluster methods.pdf", device = "pdf",
       scale = 2, width = 10, height = 15, units = "cm")
@

You can experiment with your own data now and see how the different algorithms group your actors.
However, which clustering algorithm works best with your data is a question on which we cannot offer advice here.

\section{Heatmaps}
Once you know which actors are clustered together, the next step of an analysis would be to determine what stances bind them together.
One way to do so would be to use the \code{dna\_network()} function to retrieve (dis-)agreement of the actors towards all the different concepts.
This information is already present in objects created with \code{dna\_cluster()} and can be retrieved from them using the \$ operator.
Instead of printing it to your console, you can use the \code{View} command to open it in a spreadsheet-style data viewer:

<<eval=FALSE, results = 'tex'>>=
clust <- dna_cluster(conn)
View(clust$network)
@
<<eval=TRUE, results = 'tex', echo=FALSE>>=
clust <- dna_cluster(conn)
dt <- clust$network
#colnames(dt) <- truncate(colnames(dt), trunc = 20)
kable(dt, format = "latex", booktabs = TRUE, linesep = "") %>%
  kable_styling(latex_options = c("striped", "HOLD_position"), font_size = 7) %>%
    row_spec(0, angle = 90)
@

The suffix ``- 0'' or ``- 1'' after each concept label stands for disagreement or agreement, respectively. However, this table is hard to read, especially when the database contains more than just a few organisations or concepts.

One way to make it easier to inspect the distribution of (dis-)agreement on concepts among the members of each cluster is to use heatmap plots.
Heatmaps are especially helpful for getting a quick overview of a matrix and seeing where values are high or low.
Combined with the dendrograms introduced in the last section, they can thus give insight into which concepts determine the clustering.
The function to do this in \rdna\ is called \code{dna\_plotHeatmap} and, just like \code{dna\_plotDendro}, takes objects that were created with \code{dna\_cluster}:
<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
clust <- dna_cluster(conn)
dna_plotHeatmap(clust)
@

The dendrograms on the x- and y-axes of the heatmap show how organisations and concepts are grouped together by the clustering algorithm used to construct the object \code{clust}.
As you can see, the main feature the NGOs have in common is their agreement with many of the concepts, while the other organisations seem to be more cautious in stating agreement with any of the statements.

Since the heatmap is ordered by the dendrograms on its axes, changing the clustering algorithm will also change the heatmap.
This makes it easy to compare how concepts and actors are clustered together.
As many users---and some publishers---may prefer grayscale plots, we also use the following plot to demonstrate how to accomplish that with \code{dna\_plotHeatmap}:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
clust <- dna_cluster(conn,
                     clust.method = "walktrap")
dna_plotHeatmap(clust,
                colours = "gradient",
                custom_colours = c('gray', 'black'))
@

Using the walktrap algorithm changed the order of the four upper organisations and now groups business and government organisations together.
Note, however, that the dendrogram on the x-axis of the plot has not changed.
This is because the concepts cannot be clustered using the ``walktrap'' algorithm and are instead ordered based on the default method \code{ward.D2}.
If you would rather not display it, you can turn off the drawing of the dendrogram on the x-axis completely by setting \code{dendro\_x = FALSE}.

As you can see, by setting \code{colours} to \code{gradient} and providing two colour values, you can change the colour for low (first colour) and high (second colour) values in the underlying matrix.
If you provide more than two values within \code{c()}, the first value is still used for low values and the last value for high values.
All additional colour values in between are also used to construct the colour gradient.
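
The general idea of building a gradient from two or more colours can be sketched with the base \R\ function \code{colorRampPalette}.
This only illustrates the concept and is not meant as a description of how \code{dna\_plotHeatmap} computes its colours internally:

<<eval=FALSE>>=
# Five steps between a low colour (gray) and a high colour (black)
two_colours <- colorRampPalette(c("gray", "black"))(5)

# With three colours, the middle one becomes the midpoint of the gradient
three_colours <- colorRampPalette(c("gray", "white", "black"))(5)

two_colours
three_colours
@

In both vectors the first element is the colour for the lowest values and the last element the colour for the highest values.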

Alternatively, \code{colours} accepts the setting \code{brewer}, which automatically chooses a pleasant set of colours for the heatmap.
In this case, \code{custom\_colours} (if used) needs to be the name of one of the available colour palettes from the \texttt{RColorBrewer} package \citep{neuwirth2014rcolorbrewer}.
You can display the available options using the \R\ command \code{RColorBrewer::display.brewer.all()}.
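
If you prefer a textual overview over the graphical one, the palette names can also be listed directly.
A short sketch, assuming the \texttt{RColorBrewer} package is installed:

<<eval=FALSE>>=
library("RColorBrewer")

# Names of all palettes that ship with RColorBrewer
palette_names <- rownames(brewer.pal.info)
palette_names

# The five colours of the sequential palette "YlOrRd" as hex codes
ylorrd <- brewer.pal(n = 5, name = "YlOrRd")
ylorrd
@

Any of the names in \code{palette\_names} is a candidate value for \code{custom\_colours} when \code{colours} is set to \code{brewer}.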

As you will have noted, the labels of the concept are separated by (dis-)agreement and the words ``yes'' and ``no'' are added at the end.
You can control which suffixes are used for the different qualifier levels by providing a list with ``translations'' of each level, as in the example below.
There, the level \code{"0"} is replaced by ``disagree'' and \code{"1"} by ``agree''.
More levels can be named in the same fashion by adding more entries to this list.
If your qualifier levels have a different meaning, the default values should always be changed as \rdna\ does not detect the meaning automatically.

As mentioned before, \code{dna\_cluster} removes duplicates on the document level by default, which leaves the sample dataset binary.
If the heatmap is not binary, it might make sense to display the exact values inside the plot by setting \code{values = TRUE}.

<<eval=TRUE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, out.height = "30%", out.width = "60%", crop = TRUE>>=
clust <- dna_cluster(conn,
                     duplicates = "include",
                     clust.method = "walktrap")
dna_plotHeatmap(clust,
                values = TRUE,
                truncate = 20,
                colours = "brewer",
                custom_colours = "YlOrRd",
                dendro_x = FALSE,
                dendro_y_size = 0.4,
                qualifierLevels = list("0" = "disagree",
                                       "1" = "agree"),
                show_legend = FALSE)
@

Consult \code{help(dna\_plotHeatmap)} to see what other options you can set. For example, read up on the \code{dendro\_y\_size} option we used above to see what it means.
Also note that you can use the \code{...} argument to pass arguments to \code{dna\_plotDendro}.
Try using a different \code{shape} argument, for example, to see how this works.

\section{Multi-dimensional scaling}
The heatmap plot gave you an idea of why the cluster analysis assigned certain group memberships to actors.
Yet what you cannot assess so far is how different or similar the clusters are to each other.
The network matrix usually has too many columns to compare all actors with each other at once.
One common way around this problem is to reduce the number of dimensions (in this case, agreement and disagreement with each concept) until the data can be plotted in a two-dimensional space.
That is exactly what (non-metric) multidimensional scaling (MDS) does.\footnote{Specifically, we use Kruskal's non-metric multidimensional scaling, which makes most sense for our kind of network data.}
Taking the agreement and disagreement information on all concepts, MDS reduces the differences and similarities between actors so that they can be plotted in a two-dimensional space.
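
To make the idea more tangible, here is a minimal sketch of Kruskal's non-metric MDS on made-up data, using \code{isoMDS} from the \texttt{MASS} package.
The toy matrix and the choice of Jaccard distances are assumptions for illustration and may differ from what \rdna\ does internally:

<<eval=FALSE>>=
library("MASS")

# Toy actor x concept matrix (1 = actor agrees with the concept);
# all rows differ, because isoMDS cannot handle zero distances
m <- rbind(a = c(1, 1, 0, 0),
           b = c(1, 1, 1, 0),
           c = c(0, 0, 1, 1),
           d = c(0, 1, 1, 1),
           e = c(1, 0, 0, 1))

# Dissimilarities between actors (assumption: Jaccard distance)
d <- dist(m, method = "binary")

# Reduce the four concept dimensions to two MDS dimensions
mds <- isoMDS(d, k = 2, trace = FALSE)

mds$points  # one row of x/y coordinates per actor
mds$stress  # how well the two dimensions reproduce the distances
@

The coordinates in \code{mds\$points} are what ends up on the two axes of an MDS plot; a lower stress value indicates a better fit.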

In \rdna\ we can perform this with the now well-known \code{dna\_cluster} function and the \code{dna\_plotCoordinates} command.

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
clust <- dna_cluster(conn)
dna_plotCoordinates(clust)
@

Each point in this plot represents an organisation, and the points are highlighted with colours and shapes that represent their membership in one of the clusters.
In this case, clusters are derived with the \code{pam} function from the \texttt{cluster} package \citep{maechler2017cluster}, and by default the silhouette width is used to determine the best number of clusters automatically, which in this case happens to be one.

However, looking at the plot, you probably wonder why there only appear to be three dots.
In fact, there is one point for every one of the seven organisations, but since they are so similar, they are plotted in almost exactly the same place.
There are, however, tools to make them visible nevertheless.
The first of these tools is called jittering: by adding random noise to the data, it is possible to prevent overplotting.
In the \code{dna\_plotCoordinates} function, this can be done by providing one or two numeric values to the \code{jitter} argument:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
dna_plotCoordinates(clust, jitter = c(0.5, 1.7))
@

Now you can see points for all of the seven organisations.
Additionally, polygons now encompass all points of the same cluster to visualise the area of a specific cluster.
The first jitter value in the code chunk above determines the limits of how much a point can be displaced left or right on the x-axis.
If a second value is provided, the points are additionally jittered on the y-axis.

Since the jittering happens randomly, the plot would usually look different every time.
However, to ensure reproducibility of plots, we added a seed argument.
As long as you do not change the seed, plots will look the same every time you plot them.
Yet, if you change the seed, this will alter the position of the jittered points:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
dna_plotCoordinates(clust, jitter = c(0.5, 1.7), seed = 23455)
@
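
The interplay of jitter and seed can be sketched in base \R.
The helper \code{jitter\_points} is hypothetical and written only for this illustration; it is not the code that \code{dna\_plotCoordinates} uses:

<<eval=FALSE>>=
# Hypothetical helper: displace each value by at most `amount`
# in either direction, reproducibly for a given seed
jitter_points <- function(x, amount, seed) {
  set.seed(seed)
  x + runif(length(x), min = -amount, max = amount)
}

# Three points that overlap exactly, plus one separate point
x <- c(0, 0, 0, 1)

jit_a <- jitter_points(x, amount = 0.5, seed = 12345)
jit_b <- jitter_points(x, amount = 0.5, seed = 12345)
jit_other <- jitter_points(x, amount = 0.5, seed = 23455)

identical(jit_a, jit_b)      # same seed, same displacement
identical(jit_a, jit_other)  # different seed, different displacement
@
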

Use the jittering option cautiously, though, since it distorts the appearance of the plot heavily if you choose high jitter values as in the examples above.

To avoid the problem of distortion, you can use the other tool for showing overplotted points in \rdna.
Use the option \code{label = TRUE} to plot the actor labels close to their respective dots.
In this case, the labels are moved instead of the points if actors appear very close together.

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
dna_plotCoordinates(clust, label = TRUE)
@

The downside in this case is that the polygons have disappeared again, since you need at least three points in one cluster before they become visible.

Like in the plot functions we showed before, \code{dna\_plotCoordinates} again comes with several arguments to style the plots.
Specifically, you can provide colours to \code{custom\_colours} and numeric values to \code{custom\_shape}.

Another important point to highlight about \code{dna\_plotCoordinates} is that it offers two additional cluster algorithms to determine the groups.
The first one, \code{pam}, was already mentioned above.
The second one, \code{cluster\_louvain}, is from the \texttt{igraph} package \citep{csardi2006igraph} and can be chosen with \code{clust\_method = "louvain"}:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
dna_plotCoordinates(clust,
                    label = TRUE,
                    custom_colours = c("red", "green", "blue"),
                    custom_shape = c(4,5),
                    clust_method = "louvain")
@

If you would rather use one of the clustering methods explained above, you can simply set the method in \code{dna\_cluster} and then choose \code{clust\_method = "inherit"}:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
clust <- dna_cluster(conn, clust.method = "edge_betweenness", cutree.k = 2)
dna_plotCoordinates(clust,
                    draw_polygons = TRUE,
                    label = TRUE,
                    jitter = c(1.5, 1.7),
                    clust_method = "inherit",
                    seed = 12345)
@

Other ways to style the plot include setting the labels of the axes:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
dna_plotCoordinates(clust,
                    axis_labels = c("Dimension A", "Dimension B"),
                    stress = FALSE,
                    title = character())
@

Or you can again use \texttt{ggplot2} commands to modify the plot much further:

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=

# ggplot
library("ggplot2")
clust$group <- gsub("Group", "Cluster", clust$group)
dna_plotCoordinates(clust,
                    jitter = c(1.5, 1.7),
                    clust_method = "inherit",
                    axis_labels = c("Dimension A", "Dimension B"),
                    custom_shape = c(1, 5),
                    stress = FALSE,
                    title = character(),
                    label = TRUE,
                    label_size = 2,
                    point_size = 3,
                    label_background = TRUE) +
  theme_bw() +
  labs(caption = paste0("Nonmetric MDS with STRESS = ",
                        round(attributes(clust$mds)$stress, 3),
                        ". Clusters derived from ",
                        clust$method,
                        " calculation.")) +
  theme(panel.grid.minor = element_blank(),
        text = element_text(size = 9),
        panel.grid.major = element_line(colour = "grey",
                                        size = 0.3,
                                        linetype = 2),
        panel.background = element_rect(fill = "#F8F2E4"),
        legend.position = c(0.85, 0.85),
        legend.background = element_rect(fill = alpha(0.4)),
        legend.title = element_blank())
@


% \section{Correspondence analysis}
% TODO

\section{Network plots}
We have already shown above how you can use the \R\ infrastructure to produce some basic network plots with the \texttt{statnet} suite of packages.
Besides that, there are a number of packages in the \R\ universe that are well suited to your network plotting needs.
The already mentioned \texttt{igraph} package \citep{csardi2006igraph}, for example, comes with very powerful functions for plotting networks.
Other packages, such as \texttt{networkD3} \citep{allaire2017networkD3}, are capable of creating interactive network plots, which are great to share, for example, on a website.
The package \texttt{ggraph} \citep{linpedersen2017ggraph} was specifically developed to employ the same logic as \texttt{ggplot2}, the so-called grammar of graphics, to implement a consistent logic for plotting networks.
This makes it possible to produce highly customised plots by combining a set of relatively few individual functions which each serve to control one specific aspect of a plot.

Whatever package you might prefer though, all of them have one thing in common: unlike graphical solutions such as \gephi\ or \visone, using \R\ for network analysis and plotting enables you to do reproducible research.
This often outweighs the alleged convenience of GUI applications, as reproducibility is not just of academic value, but also enables you to copy the same plotting code again and again in your own research, once you are happy with the appearance of the network plot.

However, we do acknowledge that bringing the data into the right form and tweaking your network plots to look good can be daunting for beginners or even advanced \R\ users who are new to network analysis.
This is why \rdna\ comes with two dedicated functions to create network plots: \code{dna\_plotNetwork} and \code{dna\_plotHive}.
Both are capable of handling \dna\ network objects:\footnote{Both functions technically first create an \texttt{igraph} object from a \dna\ network which is then plotted using \texttt{ggraph}.}\footnote{There is a good chance that your own plot will not look exactly like this.
Due to your screen size and the size of the plot window, the size of nodes and labels might differ.
However, you can easily control this by changing the arguments \code{node\_size} and \code{font\_size}.}

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 6, fig.height = 4, crop = TRUE>>=
# Plot onemode network
nw <- dna_network(conn, networkType = "onemode")
dna_plotNetwork(nw)
@
<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 6, fig.height = 4, crop = TRUE>>=
# Or twomode
nw <- dna_network(conn, networkType = "twomode")
dna_plotNetwork(nw, truncate = 20, label_repel = 0) +
  coord_flip()
@

As you might be aware though, network visualisation poses one problem: the choice of the right layout algorithm can be quite difficult and there is unfortunately not one that would solve the problems for all datasets.
In \code{dna\_plotNetwork}, more than a dozen options from the \texttt{igraph} package are available---not all of which are useful for plotting networks from \dna\ data. Here is the selection of algorithms which we think makes the most sense:

\begin{description}
  \item[nicely] This is probably the most useful option. Not technically a layout algorithm itself, it lets the \texttt{igraph} package pick an appropriate layout. See \code{?igraph::nicely} for the details.
  
  \item[bipartite] This algorithm is only really useful for two-mode networks. It minimises edge-crossings by plotting the nodes for each of the two variables in a separate row. See \code{?igraph::as\_bipartite}.
  
  \item[circle] Arranges the nodes in a circle.
  
  \item[dh] Uses Davidson and Harel's simulated annealing algorithm to place nodes. See \code{?igraph::with\_dh}.
  
  \item[drl] Uses the force directed algorithm from the DrL toolbox to place nodes. See \code{?igraph::with\_drl}.
  
  \item[fr] Spreads the nodes based on the force-directed algorithm of Fruchterman and Reingold. See \code{?igraph::with\_fr}.
  
  \item[gem] Places nodes on the plane using the GEM force-directed layout algorithm. See \code{?igraph::with\_gem}.
  
  \item[graphopt] Employs the Graphopt algorithm based on alternating attraction and repulsion to place nodes.
  
  \item[kk] Uses the spring-based algorithm by Kamada and Kawai to place nodes. See \code{?igraph::with\_kk}.
  
  \item[lgl] Uses the algorithm from Large Graph Layout to place nodes. See \code{?igraph::with\_lgl}.
  
  \item[mds] Performs metric multidimensional scaling for generating the coordinates of the nodes. This algorithm is what comes closest to the ``stress minimization'' layout from \visone. See \code{?igraph::with\_mds}.
  
  \item[randomly] Places nodes randomly into the plot. See \code{?igraph::randomly}.
  
  \item[star] This places one node in the centre and spreads the rest of the nodes in equal distances around it. See \code{?igraph::as\_star}.
\end{description}

To get a better idea of how this works, look at the code chunk and plot below to see a small selection of the algorithms in action.

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 7, crop = TRUE>>=
# First, let's construct a onemode network
nw <- dna_network(conn,
                  networkType = "onemode",
                  qualifierAggregation = "congruence",
                  duplicates = "document",
                  excludeValues = list("concept" =
                  "There should be legislation to regulate emissions."))

# As with the options for dna_cluster we produce four different plot objects
nw_fr <- dna_plotNetwork(nw, layout = "fr", show_legend = FALSE) +
   ggtitle("Fruchterman Reingold")
nw_graphopt <- dna_plotNetwork(nw, layout = "graphopt", show_legend = FALSE) +
   ggtitle("Graphopt")
nw_mds <- dna_plotNetwork(nw, layout = "mds", show_legend = FALSE) +
   ggtitle("MDS")
nw_kk <- dna_plotNetwork(nw, layout = "kk") +
   ggtitle("Kamada and Kawai")

# The plots are again arranged in a grid
grid <- grid.arrange(nw_fr, nw_graphopt, nw_mds, nw_kk)

@

There are again several options to style the plot.

Some critics, however, doubt the usefulness of the common network plots---often referred to as hairballs---you see above.
\citet{krzywinski2012hive}, for example, have argued that these network plots ``lack reproducibility and perceptual uniformity because they do not use a node coordinate system''. In the function \code{dna\_plotNetwork} a seed is automatically set to ensure at least reproducibility when you run the same code. You can change the argument \code{seed} to get an idea of how much chance is involved in calculating the layout.

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 8, fig.height = 4, crop = TRUE>>=
nw_1 <- dna_plotNetwork(nw, show_legend = FALSE, seed = 1)
nw_2 <- dna_plotNetwork(nw, show_legend = FALSE, seed = 2)
nw_12345 <- dna_plotNetwork(nw, show_legend = FALSE, seed = 12345)

# The plots are again arranged in a grid
grid <- grid.arrange(nw_1, nw_2, nw_12345, nrow = 1)
@
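
Why the seed matters can also be seen directly in \texttt{igraph}, independent of \rdna.
The sketch below uses a made-up ring graph and the Fruchterman-Reingold layout; a layout is simply a matrix with one row of coordinates per node, and force-directed layouts start from random positions:

<<eval=FALSE>>=
library("igraph")

# Toy graph: seven nodes connected in a ring
g <- make_ring(7)

# Same seed, same starting positions, same layout
set.seed(12345)
layout_a <- layout_with_fr(g)
set.seed(12345)
layout_b <- layout_with_fr(g)

# A different seed gives different starting positions
set.seed(1)
layout_c <- layout_with_fr(g)

identical(layout_a, layout_b)  # reproducible with a fixed seed
head(layout_a)                 # x/y coordinates of the first nodes
head(layout_c)
@
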

To get around this problem, \citet{krzywinski2012hive} suggest so-called hive plots. These plots feature radially distributed linear axes on which the nodes are positioned. You can choose what the axes represent, which makes it easier to assess whether and how a certain attribute of a network influences clustering among actors.

In \rdna, you can produce hive plots with the \code{dna\_plotHive} function, which is very similar to \code{dna\_plotNetwork}, except that you additionally provide an argument to produce the axes.

<<eval=TRUE, message = FALSE, warning = FALSE, results = 'tex', fig.width = 4, fig.height = 4, crop = TRUE>>=
dna_plotHive(nw, axis = "type")
@


% TODO
% More styling options
% include interactive network plots from visNetwork and networkD3


\section{Adding and manipulating documents in DNA from R}\label{sec:man-docs}
In \rdna, there are also several functions to manipulate the documents within a \dna\ database from \R: \code{dna\_getDocuments}, \code{dna\_addDocument}, \code{dna\_removeDocument} and \code{dna\_setDocuments}.

\code{dna\_getDocuments} retrieves a dataframe with all documents from a \dna\ connection into the \R\ environment.
This can be useful in multiple scenarios:
If you have lost or never had original source files for the documents in the database, you can export the dataframe from \R\ into a common format, such as a CSV file, which can be opened using, for example, MS Excel, LibreOffice, or Numbers (see \code{?write.csv}).
This can also be useful if you have used \dna\ to add metadata to your documents.
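
Such an export could look as follows.
To keep the sketch self-contained, it uses a made-up data frame \code{docs\_toy} in place of the result of \code{dna\_getDocuments} and writes to a temporary directory:

<<eval=FALSE>>=
# Made-up stand-in for a data frame of documents
docs_toy <- data.frame(id = 1:2,
                       title = c("Hearing A", "Hearing B"),
                       text = c("First text.", "Second text."),
                       stringsAsFactors = FALSE)

# Write the data frame to a CSV file without row names
csv_file <- file.path(tempdir(), "documents.csv")
write.csv(docs_toy, file = csv_file, row.names = FALSE)

# Reading the file back in is a quick check that nothing was lost
docs_back <- read.csv(csv_file, stringsAsFactors = FALSE)
docs_back
@
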

A second scenario in which this can be useful is when you want to calculate some statistics about your texts, such as the length of texts in words or characters, readability or lexical diversity, or when you want to support your manual coding by determining, for example, the most important words via keyness or tf-idf algorithms (see \citet{kasper2017text} for an overview of working with text in \R).
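
Very simple statistics of this kind need nothing beyond base \R.
The two example sentences below are made up; with real data, the \code{text} column of the documents data frame would take their place:

<<eval=FALSE>>=
# Made-up example texts
texts <- c("Emissions should be regulated.",
           "We oppose any new legislation on emissions.")

# Length of each text in characters and in words
n_chars <- nchar(texts)
n_words <- lengths(strsplit(texts, "\\s+"))

# Word frequencies across all texts (lower case, punctuation removed)
words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(texts)), "\\s+"))
word_freq <- sort(table(words), decreasing = TRUE)

n_chars
n_words
word_freq
@

For anything more elaborate, such as readability scores or keyness, dedicated text analysis packages are the better choice.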

Another scenario would be if you want to use \R\ instead of \dna\ to recode some of your documents' metadata.
In the example below, we first retrieve the documents from \dna\ and then alter two of the columns: ``type'' and ``author''.

<<eval=FALSE, results = 'tex'>>=
docs <- dna_getDocuments(conn)
docs
@
<<eval=TRUE, results = 'tex', echo=FALSE>>=
docs <- dna_getDocuments(conn)
colnames(docs) <- trim(colnames(docs))
docs$text <- gsub("\n", "", trim(docs$text, n = 14))
docs$title <- gsub("\n", "", trim(docs$title, n = 14))
kable(docs, format = "latex", booktabs = TRUE, linesep = "") %>%
  kable_styling(latex_options = c("striped", "HOLD_position"), font_size = 5)
@
<<eval=FALSE, results = 'tex'>>=
# Fill empty column with the same word in every row
docs$type <-  "hearing"
# Use regular expression to change the order of name and surname of the authors
docs$author <- sub('^(.*), (.*)', '\\2 \\1', docs$author)
docs
@
<<eval=TRUE, results = 'tex', echo=FALSE>>=
docs$type <-  "hearing"
docs$author <- sub('^(.*), (.*)', '\\2 \\1', docs$author)
docs$text <- gsub("\n", "", trim(docs$text, n = 14))
docs$title <- gsub("\n", "", trim(docs$title, n = 14))
kable(docs, format = "latex", booktabs = TRUE, linesep = "") %>%
  kable_styling(latex_options = c("striped", "HOLD_position"), font_size = 5)
@
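
The regular expression used above can be tried out in isolation.
The author names in this sketch are made up; note that entries without a comma are left unchanged:

<<eval=FALSE>>=
# Made-up author entries in "surname, first name" order
authors <- c("Doe, Jane", "Smith, John", "Some Organisation")

# \\1 is the part before ", " and \\2 the part after it;
# the replacement swaps the two and joins them with a space
authors_new <- sub('^(.*), (.*)', '\\2 \\1', authors)
authors_new
@
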

So how do we get this new dataframe back into \dna? The answer is the next function we will discuss: \code{dna\_setDocuments}.
This function has just a few arguments but is very powerful and can also be dangerous if not used correctly.
We therefore recommend that you make a backup copy of your database (which you should do regularly anyway) before you start to alter anything.
After you have done this, you can provide a dataframe with ten columns:

\begin{description}
  \item[ID] must be integer and must contain the document IDs.
  \item[title] must contain the document titles as character objects.
  \item[text] must contain the document texts as character objects.
  \item[coder] must contain the coder IDs as integer values. % (see dna_getCoders).
  \item[author] must contain the document authors as character objects.
  \item[source] must contain the document sources as character objects.
  \item[section] must contain the document sections as character objects.
  \item[notes] must contain the document notes as character objects.
  \item[type] must contain the document types as character objects.
  \item[date] must contain the document dates as POSIXct objects or as numeric objects indicating milliseconds since 1970-01-01.
\end{description}
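
For documents that do not come from an existing database, a data frame of this shape can be assembled by hand.
The following sketch uses made-up content and assumes lower-case column names such as \code{id}, in line with the \code{docs\$id} column used elsewhere in this chapter; check the output of \code{dna\_getDocuments} for the exact names expected by your version of \rdna:

<<eval=FALSE>>=
# Made-up documents with the ten columns described above
new_docs <- data.frame(
  id      = 101:102,                          # integer document IDs
  title   = c("Press release", "Interview"),
  text    = c("Text of the first document.",
              "Text of the second document."),
  coder   = c(1L, 1L),                        # must exist in the database
  author  = c("Jane Doe", "John Smith"),
  source  = c("Agency", "Newspaper"),
  section = c("", ""),
  notes   = c("", ""),
  type    = c("release", "interview"),
  date    = as.POSIXct(c("2018-01-15", "2018-02-01"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Basic checks of the column types before handing the data to DNA
stopifnot(is.integer(new_docs$id),
          is.integer(new_docs$coder),
          is.character(new_docs$text),
          inherits(new_docs$date, "POSIXct"))

# Alternative representation of the dates: milliseconds since 1970-01-01
date_ms <- as.numeric(new_docs$date) * 1000
@
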

Since we retrieved the object \code{docs} from the database, it already meets all these conditions.
In other cases, however, you will have to rearrange the data a bit until you can push it to \dna.
Before you actually change anything, it is recommended to do a test run, or in other words \code{simulate} what would be done to the database:

<<eval=FALSE, results = 'tex'>>=
dna_setDocuments(conn, documents = docs, removeStatements = FALSE,
                 simulate = TRUE)
@

Note that \code{simulate = TRUE} is the default and that you need to set \code{simulate = FALSE} before anything really happens.
You should also be cautious with the option \code{removeStatements}: when set to \code{TRUE}, it will destroy part of your work if a document in your new dataframe has the same ID as a document already present in the database.

\subsection{Adding newspaper articles using LexisNexisTools}
One common way of retrieving texts for analysis in \dna\ is by downloading articles from the commercial newspaper archive LexisNexis.
Many university libraries have access to this database which maintains a collection of newspaper articles from many major outlets across Europe and North America.
Its powerful search engine also allows for a finely grained search string to limit the number of articles for a specific topic.

To convert the raw files from LexisNexis though, we need another \R\ package: \texttt{LexisNexisTools} \citep{gruber2018lexis}.
You can install this package via \code{devtools::install\_github("JBGruber/LexisNexisTools")}.
The workflow looks as follows:

<<eval=TRUE, warning = FALSE, message=FALSE, results = 'hide'>>=
library("LexisNexisTools")
# This places a sample TXT file in your working directory
lnt_sample()

# Look for TXT files
my_files <- list.files(pattern = "TXT", ignore.case = TRUE)

# If this contains other files not from LexisNexis you can subset the vector
my_files <- my_files[grepl("^sample.TXT$", my_files)]

# Now the files can be converted into an LNToutput object
LNToutput <- lnt_read(my_files)

# And this object can be converted to work in rDNA
docs <- lnt_convert(LNToutput, to = "rDNA")
@

This object is almost ready to add to the \dna\ database.
We only need to adjust the ID column, so that IDs are not duplicated, and set a valid coder ID (i.\,e., one that is already present in the database).

<<eval=TRUE, warning = FALSE, results = 'tex'>>=
docs_orig <- dna_getDocuments(conn)

# We create new unique IDs by adding the highest ID from the original data
docs$id <- docs$id + max(docs_orig$id)

# We list the coder IDs present in the database and assign one of them
# (here: coder 2) to the new documents
unique(docs_orig$coder)
docs$coder <- 2

# Now we combine the original with the new documents and push them to DNA
docs <- rbind(docs_orig, docs)
dna_setDocuments(conn, documents = docs, simulate = FALSE)
@

Now the documents from LexisNexis are in the \dna\ database:

<<eval=FALSE, results = 'tex'>>=
docs <- dna_getDocuments(conn)
docs
@
<<eval=TRUE, results = 'tex', echo=FALSE>>=
docs <- dna_getDocuments(conn)
colnames(docs) <- trim(colnames(docs))
docs$text <- gsub("\n", "", trim(docs$text, n = 14))
docs$title <- gsub("\n", "", trim(docs$title, n = 14))
docs$author <- gsub("\n", "", trim(docs$author, n = 10))
docs$source <- gsub("\n", "", trim(docs$source, n = 6))
docs$section <- gsub("\n", "", trim(docs$section, n = 6))
docs$type <- gsub("\n", "", trim(docs$type, n = 6))
docs$date <- gsub("\n", "", trim(as.character(docs$date), n = 10, e = ""))
kable(docs, format = "latex", booktabs = TRUE, linesep = "") %>%
  kable_styling(latex_options = c("striped", "HOLD_position"), font_size = 5)
@


\subsection{Adding other documents}

Besides LexisNexis, there are myriad other sources from which you can get documents for your discourse network analysis.
These can come in a number of different shapes and (file) formats.
The task, then, is to get the documents into \dna\ with as little manual work as possible.
Luckily, many \R\ users have faced similar problems and developed packages that make it easy to read in (text) data from, for example, MS Word, Excel, and PDF files, or directly from web pages via web scraping.
Many online tutorials explain how to read raw text data in your specific format into \R, which is why we do not cover this part here.
However, before you can hand the documents over to \dna, you have to bring them into the correct format.

To illustrate the process, we use the dataset ``Irish budget speeches from 2010'' from \citet{lowe2013validate}, which is included in the \texttt{quanteda} package \citep{benoit2018quanteda}:\footnote{Install \texttt{quanteda} first with \code{install.packages("quanteda")} if you do not have it on your system already.}

<<eval=TRUE, results = 'tex', message=FALSE>>=
library("quanteda")
# First load the data
corpus <- data_corpus_irishbudget2010

# To get to the text data, we extract the documents element of the corpus
# object, which is a data.frame
df <- corpus$documents
@
<<eval=TRUE, results = 'tex', echo=FALSE>>=
df2 <- df
row.names(df2) <- trim(row.names(df2), n = 14)
df2$texts <- trim(df2$texts, n = 14)
kable(df2, format = "latex", booktabs = TRUE, linesep = "") %>%
  kable_styling(latex_options = c("striped", "HOLD_position"), font_size = 7)
@

Now the easiest way to arrive at the correct format is to create a new data.frame and fill it with the columns from the budget data.
Note that not all of the columns in \dna\ make sense for all data sources.
Simply leave a column empty if it does not apply in your case, as is done for ``section'' in the following:

<<eval=TRUE, results = 'tex'>>=
docs_new <- data.frame(id = df$number, 
                       title = row.names(df), 
                       text = df$texts, 
                       coder = 1, 
                       author = paste(df$foren, df$name), 
                       source = "Budget Statement 2010", 
                       section = "", 
                       notes = df$party, 
                       type = df$debate, 
                       date = df$year,
                       stringsAsFactors = FALSE)
@
<<eval=TRUE, results = 'tex', echo=FALSE>>=
docs_new2 <- docs_new
docs_new2$title <- trim(docs_new2$title, n = 14)
docs_new2$text <- trim(docs_new2$text, n = 14)
docs_new2$source <- trim(docs_new2$source, n = 10)
kable(docs_new2, format = "latex", booktabs = TRUE, linesep = "") %>%
  kable_styling(latex_options = c("striped", "HOLD_position"), font_size = 5)
@

Before you proceed, check whether the columns already have the classes required to hand the data.frame over to \dna:
<<eval=TRUE, results = 'tex'>>=
lapply(docs_new, class)
@

Compare this output with the requirements for the data.frame (see \code{?dna\_setDocuments}) and you can see that only the ID column does not have the correct class.
In this case, it is easy to correct that:

<<eval=TRUE, results = 'tex'>>=
docs_new$id <- as.integer(docs_new$id)
@
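
If you would rather not compare the classes by eye, one option is to compare them against a data.frame that already has the right format, such as the one returned by \code{dna\_getDocuments}.
The following is a minimal sketch in base \R; the toy data.frames \code{reference} and \code{candidate} and the helper \code{first\_class} are hypothetical and only stand in for the real objects:

<<eval=FALSE>>=
# Toy stand-ins (hypothetical): 'reference' mimics a correctly formatted
# data.frame, 'candidate' mimics newly assembled documents
reference <- data.frame(id = 1L, title = "a", coder = 1L,
                        stringsAsFactors = FALSE)
candidate <- data.frame(id = 2, title = "b", coder = 1L,
                        stringsAsFactors = FALSE)

# First class of every column (some objects carry more than one class)
first_class <- function(df) {
  vapply(df, function(x) class(x)[1], character(1))
}

# Names of the shared columns whose class differs from the reference
cols <- intersect(names(reference), names(candidate))
mismatch <- cols[first_class(reference)[cols] != first_class(candidate)[cols]]
mismatch
@

Every column name returned in \code{mismatch} needs to be converted before the import.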

The remaining steps are the same as before:
\begin{enumerate}
  \item Retrieve the documents already in \dna\ with \code{dna\_getDocuments} (can be skipped if no documents are present).
  \item Make sure the new documents have unique IDs.
  \item Use \code{rbind} to append the new documents to the old data.frame.
  \item Use \code{dna\_setDocuments} to push the data.frame into the \dna\ database.
\end{enumerate}

<<eval=TRUE, results = 'tex'>>=
# 1.
docs_orig <- dna_getDocuments(conn)

# 2.
docs_new$id <- docs_new$id + max(docs_orig$id)

# 3.
docs <- rbind(docs_orig, docs_new)

# 4.
dna_setDocuments(conn, documents = docs, simulate = FALSE)
@
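
If you import documents regularly, steps 2 and 3 can be wrapped in a small helper.
The function below is only a sketch and not part of \rdna; its name \code{append\_documents} is made up for illustration, and it assumes that both data.frames have the same columns.
It also covers the case in which no documents are present yet, where \code{max} on an empty ID column would not return a usable offset:

<<eval=FALSE>>=
# Hypothetical helper (not part of rDNA): shift the IDs of 'new' past the
# highest ID in 'old' and stack both data.frames
append_documents <- function(old, new) {
  stopifnot(identical(names(old), names(new)))
  # With no existing documents there is nothing to shift past
  offset <- if (nrow(old) > 0) max(old$id) else 0L
  new$id <- as.integer(new$id + offset)
  rbind(old, new)
}

# Toy example: existing IDs 1 and 2, incoming IDs 1 and 2
old <- data.frame(id = 1:2, title = c("a", "b"), stringsAsFactors = FALSE)
new <- data.frame(id = 1:2, title = c("c", "d"), stringsAsFactors = FALSE)
combined <- append_documents(old, new)
combined$id
@

The combined data.frame could then be handed to \code{dna\_setDocuments}, ideally with \code{simulate = TRUE} first.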

This should be all it takes to get documents from pretty much any source into \dna.
\bibliographystyle{apalike}
\bibliography{dna-manual}

\end{document}