/
CO18.tex
833 lines (700 loc) · 52.4 KB
/
CO18.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
%\documentclass[12pt]{article}
\documentclass[runningheads]{llncs}
%\bibliographystyle{plainnat}
\bibliographystyle{splncs04}
\usepackage{graphicx}
\usepackage{amssymb,amsmath,color}
\usepackage[breaklinks=true]{hyperref}
\renewcommand\UrlFont{\color{blue}\rmfamily}
\usepackage[anythingbreaks]{breakurl}
%\usepackage[round]{natbib}
%\usepackage{amsthm}
%\usepackage{algorithm}
%\usepackage{algorithmicx}
%\usepackage[noend]{algpseudocode}
\newtheorem{thm}{Theorem}
%\newtheorem{definition}{Definition}
\newcommand{\beq}{\begin{equation}}
\newcommand{\eeq}{\end{equation}}
\newcommand{\ben}{\begin{enumerate}}
\newcommand{\een}{\end{enumerate}}
\newcommand{\mc}[1]{\ensuremath{\mathcal{#1}}}
\newcommand{\mb}[1]{\ensuremath{\mathbb{#1}}}
\newcommand{\ul}[1]{\ensuremath{\underline{#1}}}
\newcommand{\RM}{\emph{RM}}
\newcommand{\A}{\emph{A}}
\newcommand{\B}{\emph{B}}
\newcommand{\C}{\emph{C}}
\newcommand{\EE}{\bb{E}}
\newcommand{\PP}{\bb{P}}
\newcommand{\bpi}{\bar{\pi}}
\newcommand{\bp}{\bar{p}}
% \newcommand{\comment}[1]{}
\newcommand{\comment}[1]{\textcolor{red}{\sc #1}}
\title{Risk-Limiting Audits by Stratified Union-Intersection Tests of Elections (SUITE)}
\author{
Kellie Ottoboni\inst{1}\orcidID{0000-0002-9107-3402} \and
Philip B.~Stark\inst{1}\orcidID{0000-0002-3771-9604} \and
Mark Lindeman\inst{2}\orcidID{0000-0001-8815-815X} \and
Neal McBurnett\orcidID{0000-0001-8667-1830}
}
\authorrunning{K.~Ottoboni et al.}
\titlerunning{SUITE Risk-Limiting Audits}
\institute{
Department of Statistics, University of California, Berkeley, CA, USA \and
Verified Voting Foundation, Philadelphia, PA, USA}
\date{Version: \today}
\begin{document}
\maketitle
\begin{abstract}
Risk-limiting audits (RLAs) offer a statistical guarantee: if a full manual tally of the paper ballots would show that the reported election outcome is wrong, an RLA has a known minimum chance of leading to a full manual tally.
RLAs generally rely on random samples.
Stratified sampling---partitioning the population of ballots into disjoint
strata and sampling independently from the strata---may simplify logistics or increase efficiency compared
to simpler sampling designs, but makes risk calculations harder.
We present SUITE, a new method for conducting RLAs using stratified samples.
SUITE considers all possible partitions of outcome-changing error across strata.
For each partition, it combines $P$-values from stratum-level tests into a combined
$P$-value; there is no restriction on the tests used in different strata.
SUITE maximizes the combined $P$-value over all partitions of outcome-changing error.
The audit can stop if that maximum is less than the risk limit.
Voting systems in some Colorado counties (comprising 98.2\% of voters)
allow auditors to check how the system interpreted each ballot,
which allows \emph{ballot-level comparison} RLAs.
Other counties use \emph{ballot polling}, which is less efficient.
Extant approaches to conducting an RLA of a statewide contest would require major changes to
Colorado's procedures and software, or would sacrifice the efficiency of ballot-level comparison.
SUITE does not.
It divides ballots into two strata: those cast in counties that can conduct ballot-level comparisons, and the rest.
Stratum-level $P$-values are found by methods derived here.
The resulting audit is substantially more efficient than statewide ballot polling.
SUITE is useful in any state with a mix of voting systems or that uses stratified sampling
for other reasons.
We provide an open-source reference implementation and exemplar calculations in Jupyter notebooks.
\keywords{stratified sampling, nonparametric tests, Fisher's combining function, sequential hypothesis tests, Colorado risk-limiting audits, maximizing $P$-values over nuisance parameters, union-intersection test,
intersection-union test}
\end{abstract}
\noindent
\textbf{Acknowledgements.}
We are grateful to Ronald L.~Rivest, Steven N.~Evans, and Wentao Zhan for helpful conversations and suggestions.
\section{Introduction}
A risk-limiting audit (RLA) of an election contest is a procedure that
has a known minimum chance of leading to a full manual tally of the ballots if the electoral outcome according to that tally would differ from the reported outcome.
\emph{Outcome} means the winner(s) (or, for instance, whether there is a runoff)---not the numerical vote totals.
RLAs require a durable, voter-verifiable record of voter intent, such as paper ballots,
and they assume that this audit trail is sufficiently complete and accurate that a full hand
tally would show the true electoral outcome.
That assumption is not automatically satisfied: a \emph{compliance audit}
\cite{starkWagner12}
is required to check whether the paper trail is trustworthy.
Current methods for risk-limiting audits are generally \emph{sequential hypothesis testing procedures}: they examine more ballots, or batches of ballots,
until either (i)~there is strong statistical evidence that a full hand tabulation would confirm the outcome,
or (ii)~the audit has led to a full hand tabulation, the result of which should become the official
result.
RLAs have been conducted in California, Colorado, Indiana, Ohio, Virginia, and Denmark, and are required by law in Colorado (CRS~1-7-515) and Rhode Island (SB~413A and HB~5704A).
The most efficient and transparent sampling design for risk-limiting audits selects individual ballots uniformly at random, with or without replacement~\cite{stark10c}.
Risk calculations for such samples can be made simple without sacrificing rigor~\cite{stark10d,lindemanStark12}.
However, to audit contests that cross jurisdictional boundaries then requires coordinating sampling in different counties, and may require different counties to use the lowest common denominator method for assessing risk from the sample, which would not take full advantage of the capabilities of some voting systems.
For instance, any system that uses paper ballots as the official record can conduct \emph{ballot-polling} audits, while \emph{ballot-level comparison audits} require systems to generate \emph{cast vote records} that can be checked manually against a human reading of the paper~\cite{lindemanEtal12,lindemanStark12}.
(These terms are described in Section~\ref{sec:combiningMethods}.)
Stratified RLAs have been considered previously, primarily to conform with legacy audit laws under which counties draw audit samples independently of each other, but also to allow auditors to start the audit before all vote-by-mail or provisional ballots have been tallied, by sampling independently from ballots cast in person, by mail, and provisionally, as soon as subtotals for each group are available \cite{stark08a,higginsEtal11}.
However, extant methods address only a single approach to auditing, batch-level comparisons, and only a particular test statistic.
Here, we introduce SUITE, a more general approach to conducting RLAs using stratified samples.
SUITE is a twist on \emph{intersection-union} tests \cite{pesarinSalmaso10}, which represent the null hypothesis
as the intersection of a number of simpler hypotheses, and the alternative hypothesis as a union of their alternatives.
In contrast, here, the null is the union of simpler hypotheses, and the alternative is the intersection of their alternatives.
The approach involves finding the maximum $P$-value over a vector of nuisance parameters that describe the simple hypotheses,
all allocations of tabulation error across strata for which a full count would find a different electoral outcome than was reported.
(A \emph{nuisance parameter} is a property of the population that is not of direct interest, but that affects the probability distribution of the data.
\emph{Overstatement} is error that made the margin of one or more winners over one or more losers appear larger than it really was.
The total overstatement across strata determines whether the reported outcome is correct;
the overstatements in individual strata are nuisance parameters that affect the distribution of the audit sample.)
The basic building block for the method is testing whether the overstatement error in a single stratum is greater than or equal to a quota.
Fisher's combining function is used to merge $P$-values for tests in different strata into a single $P$-value for the hypothesis that the overstatement in every stratum is greater than or equal to its quota.
If that hypothesis can be rejected for \emph{all} stratum-level quotas that could change the outcome---that is, if the maximum combined $P$-value is sufficiently small---the audit can stop.
It is not actually necessary to consider all possible quotas: the $P$-value
has a bounded modulus of continuity, which allows us to find upper and lower bounds
everywhere using only values on a discrete grid.
We present a numerical procedure, implemented in Python,
to find bounds on the maximum $P$-value when there are two strata.
The procedure can be generalized to more than two strata.
Section~\ref{sec:stratified} presents the new approach to stratified auditing.
Section~\ref{sec:combiningMethods} illustrates the method by solving a problem pertinent to Colorado:
combining ballot polling in one stratum with ballot-level comparisons in another.
This requires straightforward modifications to the mathematics behind ballot polling and ballot-level comparison to allow the overstatement to be compared to specified thresholds other than the overall contest margin; those modifications are described in Sections~\ref{sec:comparisonError} and~\ref{sec:ballotPollError}.
Section~\ref{sec:examples} gives numerical examples of simulated audits, using parameters intended to reflect how the procedure would work in Colorado.
We provide example software implementing the risk calculations for
our recommended approach in Python Jupyter notebooks.\footnote{%
See \url{https://github.com/pbstark/CORLA18}.
}
Section~\ref{sec:discussion} gives recommendations and
considerations for implementation.
\section{Stratified audits} \label{sec:stratified}
\emph{Stratified sampling} involves partitioning a population
into non-overlapping groups and drawing independent random samples from those groups.
\cite{stark08a,higginsEtal11} developed RLAs based on comparing stratified samples of batches of ballots to hand counts of the votes in those batches: batch-level comparison RLAs, using a particular test statistic.
The method we develop here is more general and more flexible: it can be used with any test statistic, and test statistics in different strata need not be the same---which is key to combining audits of ballots cast using diverse voting technologies.
Here and below, we consider auditing a single plurality contest at a time, although the same sample can be used to audit more than one contest (and super-majority contests), and there are ways of combining audits of different contests into a single process \cite{stark09c,stark10d}.
We use terminology drawn from a number of papers, notably \cite{lindemanStark12}.
An \emph{overstatement error} is an error that caused the margin between \emph{any} reported
winner and \emph{any} reported loser to appear larger than it really was.
An \emph{understatement error} is an error that caused the margin between \emph{every} reported
winner and \emph{every} reported loser to appear to be smaller than it really was.
Overstatements cast doubt on outcomes; understatements do not, even though they are tabulation errors.
We use $w$ to denote a reported winner and $\ell$ to denote a reported loser.
The total number of reported votes for candidate~$w$ is $V_{w}$
and the total for candidate~$\ell$ is $V_{\ell}$.
Thus $V_{w} > V_{\ell}$, since $w$ is reported to have gotten more votes than $\ell$.
Let $V_{w\ell} \equiv V_{w} - V_{\ell} > 0$ denote the contest-wide margin (in votes) of
$w$ over $\ell$.
We have $S$ strata.
Let $V_{w\ell,s}$ denote the margin (in votes) of reported winner $w$ over reported loser $\ell$
in stratum $s$.
Note that $V_{w\ell,s}$ might be negative in one stratum, but $\sum_{s=1}^S V_{w\ell,s} = V_{w\ell} > 0$.
Let $A_{w\ell}$ denote the margin (in votes) of reported winner $w$ over reported loser $\ell$ that a full hand count would show: the \emph{actual} margin, in contrast to the \emph{reported} margin $V_{w\ell}$.
Reported winner $w$ really beat reported loser $\ell$ if and only if $A_{w\ell} > 0$.
Define $A_{w\ell,s}$ to be the actual margin (in votes) of $w$ over $\ell$ in stratum $s$.
Let $\omega_{w\ell,s} \equiv V_{w\ell,s} - A_{w\ell,s}$ be the \emph{overstatement}
of the margin of $w$ over $\ell$ in stratum $s$.
Reported winner $w$ really beat reported loser
$\ell$ if and only if $\omega_{w\ell} \equiv \sum_s \omega_{w\ell,s} < V_{w\ell}$.
An RLA is a test of the hypothesis that the outcome is wrong, that is, that $w$ did not really beat $\ell$:
$\sum_s \omega_{w\ell, s} \ge V_{w\ell}$.
The null is true if and only if there exists \textit{some} $S$-tuple of real numbers $(\lambda_s)_{s=1}^S$ with $\sum_s \lambda_s = 1$ such that $\omega_{w\ell, s} \ge \lambda_s V_{w\ell}$ for all $s$.\footnote{%
``If'' is straightforward. For ``only if,'' suppose $\omega_{w\ell} \ge V_{w\ell}$. Set $\lambda_s = \frac{\omega_{w\ell, s}}{\sum_t \omega_{w\ell, t}}$. Then $\sum_s \lambda_s = 1$, and $\omega_{w\ell, s} = \lambda_s \omega_{w\ell} \ge \lambda_s V_{w\ell}$ for all $s$.
}
Thus if we can reject the conjunction hypothesis $\cap_s \{ \omega_{w\ell,s} \ge \lambda_s V_{w\ell} \}$ at significance level $\alpha$ for all $(\lambda_s)$ such that $\sum_s \lambda_s = 1$, we can stop the audit, and the risk limit will be $\alpha$.
\subsection{Fisher's combination method}
Fix $\mathbf{\lambda} \equiv (\lambda_s)_{s=1}^S$, with $\sum_s \lambda_s = 1$.
To test the conjunction hypothesis that stratum null hypotheses are true, that is, that $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$ for all $s$, we use Fisher's combining function.
Let $p_s(\lambda_s)$ be the $P$-value of the hypothesis $\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$.
If the null hypothesis is true, then
\beq \label{eq:fisher}
\chi(\mathbf{\lambda}) = -2 \sum_{s=1}^S \ln p_s(\lambda_s)
\eeq
has a probability distribution that is dominated by the chi-square distribution with $2S$~degrees
of freedom.\footnote{%
If the stratum-level tests had continuously distributed $P$-values, the distribution would be exactly
chi-square with $2S$~degrees of freedom, but if any of the $P$-values has atoms when
the null hypothesis is true, it is in general stochastically smaller.
This follows from a coupling argument along the lines of Theorem~4.12.3 in \cite{grimmett01}.
}
Fisher's combined statistic will tend to be small when all stratum-level null hypotheses are true.
If any is false, then as the sample size increases, Fisher's combined statistic will tend to grow.
If, for all $\mathbf{\lambda}$ with $\sum_s \lambda_s = 1$, we can reject the conjunction
hypothesis at level $\alpha$
(i.e., if the minimum value of Fisher's combined statistic over all $\lambda$ is larger than the $1-\alpha$ quantile of the chi-square
distribution with $2S$~degrees of freedom), the audit can stop.
If the audit is allowed to ``escalate'' in steps, increasing the sample size sequentially, then either the tests used in the separate strata have to be sequential tests, or multiplicity needs to be taken into account, for instance by adjusting the risk limit at each step.
Otherwise, the overall procedure can have a risk limit that is much larger than $\alpha$.
For examples of controlling for multiplicity when using non-sequential testing procedures in an RLA, see \cite{stark08a,stark09a}.
The stratum-level $P$-value $p_s(\lambda)$ could be a $P$-value for the hypothesis
$\omega_{w\ell,s} \ge \lambda_s V_{w\ell}$ from any test procedure.
We assume, however, that $p_s$ is based on a one-sided test, and that the tests
for different values of $\lambda$ ``nest'' in the sense that if $a > b$,
then $p_s(a) > p_s(b)$.
This monotonicity is a reasonable requirement because the evidence that the overstatement
is greater than $a$ should be weaker than the evidence that the overstatement is greater than
$b$, if $a > b$.
In particular, this monotonicity holds for the tests proposed in Sections~\ref{sec:comparisonError}
and \ref{sec:ballotPollError}.
One could use a function other than Fisher's to combine the stratum-level $P$-values into a $P$-value for the conjunction hypothesis, provided it satisfies these properties (see \cite{pesarinSalmaso10}):
\begin{itemize}
\item the function is non-increasing in each argument and symmetric with respect to rearrangements of the arguments
\item the combining function attains its supremum when one of the arguments approaches zero
\item for every level $\alpha$, the critical value of the combining function is finite and strictly smaller than the function's supremum.
\end{itemize}
For instance, one could use Liptak's function, $T = \sum_i \Phi^{-1}(1-p_i)$,
or Tippett's function, $T = \max_i (1-p_i)$.
Fisher's function is convenient for this application because the tests in different strata are independent, so the chi-squared distribution dominates the distribution of $\chi(\cdot)$ when the null hypothesis is true.
If tests in different strata were correlated, the null distribution of the combination function would need to be calibrated by simulation; some other combining function might have better properties than Fisher's \cite{pesarinSalmaso10}.
\subsection{Maximizing Fisher's combined $P$-value for $S=2$}
We now specialize to $S=2$ strata.
The set of $\mathbf{\lambda} = (\lambda_1, \lambda_2)$ such that $\sum_s \lambda_s = 1$ is then a one-dimensional family: if $\lambda_1 = \lambda$, then $\lambda_2 = 1-\lambda$.
For a given set of data, finding the maximum $P$-value over all $\mathbf{\lambda}$
is thus a one-dimensional optimization problem.
We provide two software solutions to the problem.
The first approach approximates the maximum via a grid search, refining the
grid once the maximum has been bracketed.
This is not guaranteed to find the global maximum exactly, although it can approximate
the maximum as closely as one desires by refining the mesh, since the objective function is continuous.
Appendix~\ref{sec:appendix-max-fisher} derives a bound on the approximation error for the special
case when $p_1$ and $p_2$ are $P$-values from a comparison audit and a ballot-polling audit, respectively.
A second, more rigorous approach uses bounds on Fisher's combining function $\chi$ for all $\lambda$,
which are valid when the $P$-values are monotonic functions of $\lambda$.
(A lower bound on $\chi$ implies an upper bound on the $P$-value: if, for all $\lambda$, the lower bound is
larger than the $1-\alpha$ quantile of the chi-squared distribution with 4~degrees of freedom, the maximum $P$-value is no larger than $\alpha$.)
Some values of $\lambda$ can be ruled out \emph{a priori}, because (for instance) $\omega_{w\ell,s} \le V_{w\ell,s}+N_s$, where $N_s$ is the number of ballots cast in stratum $s$, and thus
\beq
1 - \frac{V_{w\ell,2}+N_2}{V_{w\ell}} \le \lambda \le \frac{V_{w\ell,1}+N_1}{V_{w\ell}}.
\eeq
Let $\lambda_-$ and $\lambda_+$ be lower and upper bounds on $\lambda$.
\comment{Replace the paragraph with an argument from the local modulus of continuity}
When the $p_s(\cdot)$ are monotonically increasing functions of $\lambda$, $p_1(\lambda)$ increases monotonically and $p_2(1-\lambda)$ decreases monotonically.
Suppose $[a, b) \subset [\lambda_-, \lambda_+]$.
Then for all $\lambda \in [a, b)$, $-2\ln p_1(\lambda) \ge -2\ln p_1(b)$ and
$-2\ln p_2(1-\lambda) \ge -2\ln p_2(1-a)$.
Thus
\beq\label{eq:lowerbound}
\chi(\lambda) = -2(\ln p_1(\lambda)+ \ln p_2(1-\lambda))
\ge -2(\ln p_1(b) + \ln p_2(1-a)) \equiv \chi_-[a,b).
\eeq
This gives a lower bound for $\chi$ on the interval $[a, b)$; the corresponding
upper bound is $\chi(\lambda) \le -2(\ln p_1(a) + \ln p_2(1-b)) \equiv \chi_+[a,b)$.
Partitioning $[\lambda_-, \lambda_+]$ into a collection of intervals $[a_k, a_{k+1})$
and finding $\chi_-[a_k, a_{k+1})$ and $\chi_+[a_k, a_{k+1})$ for each
yields piecewise-constant lower and upper bounds for $\chi(\lambda)$.
If, for all $\lambda \in [\lambda_-, \lambda_+]$, the lower bound on $\chi$
is larger than the $1-\alpha$ quantile of the chi-square distribution with 4~degrees of freedom,
the audit can stop.
On the other hand, if for some $\lambda \in [\lambda_-, \lambda_+]$,
the upper bound is less than the $1-\alpha$ quantile of the chi-square distribution with 4~degrees of freedom,
or if $\chi(a_k)$ is less than this quantile at any grid point $\{a_k\}$,
the sample size in one or both strata needs to increase.
If the lower bound is less than the $1-\alpha$ quantile on some interval,
but $\chi(a_k)$ is above this quantile at every grid point $\{a_k\}$, then one should improve the lower bound by refining the grid and/or by increasing the sample size in one or both strata.
\section{Auditing cross-jurisdictional contests}\label{sec:combiningMethods}
As mentioned above, stratified sampling can simplify audit logistics by allowing jurisdictions
to sample ballots independently of each other, or by allowing a single jurisdiction to sample
independently from different collections of ballots (e.g., vote-by-mail versus cast in person).
SUITE allows stratified samples to be combined into an RLA of contests that include ballots from more
than one stratum.
We present an example where SUITE is helpful for a different reason:
it enables an RLA to take
advantage of differences among voting systems to reduce audit sample sizes, which solves
a current problem in Colorado.
CRS~1-7-515 requires Colorado to conduct risk-limiting audits beginning in 2017.
%(There are provisions to allow the Secretary of State to exempt some counties.)
The first risk-limiting election audits under this statute were conducted in November, 2017; the second were conducted in July, 2018.\footnote{%
See \url{https://www.sos.state.co.us/pubs/elections/RLA/2017RLABackground.html}
}
Counties cannot audit contests that cross jurisdictional boundaries (\emph{cross-jurisdictional} contests, such as gubernatorial contests and most federal contests)
on their own: margins and risk limits apply to entire contests, not to the portion of a
contest included in a county.
Colorado has not yet conducted an RLA of a cross-jurisdictional contest, although it has performed RLA-like procedures on individual jurisdictions' portions of some cross-jurisdictional contests.
To audit statewide elections and contests that cross county lines, Colorado will need to implement new approaches and make some changes to its auditing software, RLATool.
Colorado's voting systems are heterogeneous.
Some counties (containing about 98\% of active voters, as of this writing) have
voting systems that export cast vote records (CVRs) in a way that the paper ballot corresponding to each CVR can be identified uniquely and retrieved.
We call counties with such voting systems \emph{CVR counties}.
In CVR counties, auditors can manually check the accuracy of the voting system's interpretation of individual ballots.
In other counties (\emph{legacy} or \emph{no-CVR} counties) there is no way to check the accuracy
of the system's interpretation of voter intent for individual ballots.
Contests entirely contained in CVR counties can be audited using \emph{ballot-level comparison audits} \cite{lindemanStark12}, which compare CVRs to the auditors' interpretation of voter intent directly from paper ballots.
Ballot-level comparison audits are currently the most efficient approach to risk-limiting audits in that they require examining fewer ballots than other methods do, when the outcome of the contest under audit is in fact correct.
Contests involving no-CVR counties can be audited using \emph{ballot-polling audits} \cite{lindemanEtal12,lindemanStark12}, which generally require examining more ballots than ballot-level comparison audits to attain the same risk limit.
Colorado's challenge is to audit contests that include ballots cast in both CVR counties and no-CVR counties. There is no literature on how to combine ballot polling with ballot-level comparisons to audit
cross-jurisdictional contests
that include voters in CVR counties and voters in no-CVR counties.\footnote{%
See~\cite{Rivest-2018-bayesian-tabulation-audits}
for a different (Bayesian) approach to auditing contests that include both CVR counties
and no-CVR counties. In general, Bayesian audits are not risk-limiting.
}
Colorado could simply revert to ballot-polling audits for cross-jurisdictional contests that include votes in no-CVR counties, but that would entail a loss of efficiency.
Alternatively, Colorado could use batch-level comparison audits, with single-ballot batches in CVR counties and larger batches in no-CVR counties.\footnote{%
Since so few ballots are cast in no-CVR counties, cruder approaches might work, for instance, pretending that no-CVR counties had CVRs, but treating any ballot sampled from a no-CVR county as if it had a 2-vote overstatement error. See \cite{banuelosStark12}.
}
The statistical theory for such audits has been worked out (see, e.g., \cite{stark08a,stark09c,stark09b,stark10d} and Section~\ref{sec:appendix-comparison}, below); indeed, this is the method that was used in several of California's pilot audits, including the audit in Orange County, California.
However, batch-level comparison audits were found to be less efficient than ballot-polling audits in these pilots \cite{CA_SOS_EAC}.
Moreover, to use batch-level comparison audits in Colorado would require major changes to RLATool, for reporting batch-level contest results prior to the audit, for drawing the sample, for reporting audit findings, and for determining when the audit can stop.
The changes would include modifying data structures, data uploads, random sampling procedures, and the county user interface.
No-CVR counties would also have to revise their audit procedures.
Among other things, they would need to report vote subtotals
for physically identifiable groups of ballots before the audit starts.
No-CVR counties with voting systems that can only report subtotals by precinct
might have to make major changes to how they handle ballots, for instance, sorting all ballots by precinct.
These are large changes.
We show here that SUITE makes possible a ``hybrid'' RLA that keeps the advantages of ballot-level comparison audits in CVR counties but does not require major changes to how no-CVR counties audit, nor major changes to RLATool.
The key is to use stratified sampling with two strata: ballots cast in CVR counties and those cast in no-CVR counties.
In order to use Equation~\ref{eq:fisher}, we must develop stratum-level tests for the overstatement error that are appropriate for the corresponding voting system.
Sections~\ref{sec:comparisonError} and~\ref{sec:ballotPollError} describe these tests for overstatement in the CVR and no-CVR strata, respectively.
\subsection{Comparison audits of overstatement quotas}
\label{sec:comparisonError}
To use comparison auditing in the approach to stratification described above requires extending previous work to test whether the overstatement error is greater than or equal to $\lambda_s V_{w\ell}$, rather than simply $V_{w\ell}$.
Appendix~\ref{sec:appendix-comparison} derives this generalization for arbitrary batch sizes, including batches consisting of one ballot.
The derivation considers only a single contest, but the
MACRO test statistic \cite{stark09c,stark10d} automatically extends the result to
auditing any number of contests simultaneously.
The derivation is for plurality contests, including ``vote-for-$k$'' plurality contests.
Majority and super-majority contests are a minor
modification~\cite{stark08a}.\footnote{%
So are some forms of preferential and approval voting, such as Borda count, and
proportional representation contests, such as D'Hondt~\cite{starkTeague14}.
For a derivation of ballot-level comparison risk-limiting audits for super-majority contests,
see \url{https://github.com/pbstark/S157F17/blob/master/audit.ipynb}. (Last visited 14~May 2018.)
Changes for IRV/STV are more complicated.
}
\subsection{Ballot-polling audits of overstatement quotas}
\label{sec:ballotPollError}
To use the new stratification method with ballot polling requires a different approach than \cite{lindemanEtal12} took: their approach tests whether $w$ got a larger \emph{share} of the votes than $\ell$, but we need to test whether the margin \emph{in votes} in the stratum is greater than or equal to a threshold (namely, $\lambda_s V_{w\ell}$).
This introduces a nuisance parameter, the number of ballots with votes for either $w$ or $\ell$.
We address this by maximizing the probability ratio in Wald's Sequential Probability Ratio Test
\cite{wald45} over all possible values of the nuisance parameter.
Appendix~\ref{sec:appendix-polling} develops the test.
\section{Numerical examples}\label{sec:examples}
Jupyter notebooks containing calculations for hybrid stratified audits intended to be relevant for Colorado are available at \url{https://www.github.com/pbstark/CORLA18}.
\texttt{hybrid-audit-example-1} contains two hypothetical elections.
The first has $110,000$ cast ballots, of which
9.1\% were in no-CVR counties.
The \emph{diluted margin} (the margin in votes, divided by the total number of ballots cast) is $1.8\%$.
In 94\% of 10,000 simulations in which
the reported results were correct, drawing 700~ballots from the CVR stratum and 500~ballots
from the no-CVR stratum (1,200 ballots in all)
allowed SUITE to confirm the outcome at 10\% risk.
For the remaining 6\%, further expansion of the audits would have been necessary.
If it were possible to conduct a ballot-level comparison audit for the entire contest,
an RLA with risk limit 10\% could terminate after examining 263~ballots if it found no errors.
A ballot-polling audit of the entire contest would have been expected to examine about 14,000 ballots, more than 10\% of ballots cast.
The hybrid audit is less efficient than a ballot-level comparison audit, but far more efficient than a ballot-polling audit.
The second hypothetical election has 2~million cast ballots, of which 5\% were cast in no-CVR counties.
The diluted margin is about $20\%$.
The workload for SUITE at 5\% risk is quite low:
In 93\% of 10,000 simulations in which
the reported results were correct, auditing 50~ballots from the
CVR stratum and 25~ballots from the no-CVR stratum
would have confirmed the outcome.
If it were possible to conduct a ballot-level comparison audit for the entire contest,
an RLA at risk limit 5\% could terminate after examining 31~ballots if it found no errors.
The additional work for the hybrid stratified audit is disproportionately in the no-CVR counties.
A second notebook, \texttt{hybrid-audit-example-2}, illustrates the
workflow for SUITE for an election with 2~million ballots cast.
The reported margin is just over $1\%$, but the reported winner
and reported loser are actually tied in both strata.
The risk limit is 5\%.
For a sample of 7600~ballots from the CVR stratum and 400~ballots from the no-CVR stratum,
the maximum combined $P$-value is 1, so the audit cannot stop there.
A third notebook, \texttt{fisher\_combined\_pvalue}, illustrates the numerical methods used
to check whether the maximum combined $P$-value is below the risk limit.
It includes code for the tests in the two strata,
for the lower and upper bounds $\lambda_-$ and $\lambda_+$ for $\lambda$,
for evaluating Fisher's combining function on a grid,
and for maximizing the $P$-value, using the local modulus of continuity
for Equation~\ref{eq:fisher-hybrid} to bound any approximation error.
\section{Discussion} \label{sec:discussion}
We present SUITE, a new class of procedures for RLAs based on stratified random sampling.
SUITE is agnostic about the capability of voting equipment in different strata, unlike
previous methods, which require batch-level comparisons in every stratum.
SUITE allows arbitrary tests to be used in different strata; if those tests are sequentially valid, then the overall RLA is sequential.
(Otherwise, multiplicity adjustments might be needed if one wants an audit that escalates in stages.
See~\cite{stark08a,stark09a} for two approaches.)
Like other RLA methods, SUITE poses auditing as a hypothesis test.
The null hypothesis is a union over all partitions of outcome-changing error across strata.
The hypothesis is rejected if the maximum $P$-value over all such partitions is sufficiently
small.
Each possible partition yields an intersection hypothesis, tested by
combining $P$-values from different strata using Fisher's combining function (or a suitable
replacement).
Among other things, the new approach solves a current problem in Colorado:
how to conduct RLAs of contests that cross jurisdictional lines, such as statewide
contests and many federal contests.
We give numerical examples in Jupyter notebooks
that can be modified to estimate the workload for different contest sizes, margins, and risk limits.
In our numerical experiments, the new method requires auditing far fewer ballots than previous approaches would.
\appendix
\section{Comparison tests for an overstatement quota}\label{sec:appendix-comparison}
\subsection{Notation}
\begin{itemize}
\item $\mathcal{W}$: the set of reported winners of the contest
\item $\mathcal{L}$: the set of reported losers of the contest
\item $N_s$ ballots were cast in stratum $s$. (The contest might not appear on all $N_s$ ballots.)
\item $P$ ``batches'' of ballots are in stratum $s$. A batch contains one or more ballots. Every ballot in stratum $s$ is in exactly one batch.
\item $n_p$: number of ballots in batch $p$. $N_s = \sum_{p=1}^P n_p$.
\item $v_{pi} \in \{0, 1\}$: reported votes for candidate $i$ in batch $p$
\item $a_{pi} \in \{0, 1\}$: actual votes for candidate $i$ in batch $p$.
If the contest does not appear on any ballot in batch $p$, then $a_{pi} = 0$.
\item $V_{w\ell,s} \equiv \sum_{p=1}^P (v_{pw} - v_{p\ell})$:
Reported margin in stratum $s$ of reported winner $w \in \mathcal{W}$ over reported loser
$\ell \in \mathcal{L}$, in votes.
\item $V_{w\ell}$:
overall reported margin in votes of reported winner $w \in \mathcal{W}$ over reported loser
$\ell \in \mathcal{L}$ for the entire contest (not just stratum $s$)
\item $V \equiv \min_{w \in \mathcal{W}, \ell \in \mathcal{L}} V_{w \ell}$:
smallest reported overall margin in votes between any reported winner and reported loser
\item $A_{w\ell,s} \equiv \sum_{p=1}^P (a_{pw} - a_{p\ell})$:
actual margin in votes in the stratum of reported winner $w \in \mathcal{W}$ over reported loser
$\ell \in \mathcal{L}$
\item $A_{w\ell}$:
actual margin in votes of reported winner $w \in \mathcal{W}$ over reported loser
$\ell \in \mathcal{L}$ for the entire contest (not just in stratum $s$)
\end{itemize}
\subsection{Reduction to maximum relative overstatement}
If the contest is entirely contained in stratum $s$, then
the reported winners of the contest are the actual winners if
$$
\min_{w \in \mathcal{W}, \ell \in \mathcal{L}} A_{w\ell,s} > 0.
$$
Here, we address the case that the contest may include a portion outside the stratum.
To combine independent samples in different strata, it is convenient
to be able to test whether the net overstatement error in a stratum is greater than or equal to a given threshold.
Instead of testing that condition directly, we will test a condition that is sufficient
but not necessary for the inequality to hold, to get a computationally simple test that
is still conservative (i.e., the level is not larger than its nominal value).
For every winner, loser pair $(w, \ell)$, we want to test
whether the overstatement error is greater than or equal to some threshold, generally
one tied to the reported margin between $w$ and $\ell$.
For instance, for a hybrid stratified audit, we set the threshold to be
$\lambda_s V_{w\ell}$.
We want to test whether
$$
\sum_{p=1}^P (v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V_{w\ell} \ge \lambda_s.
$$
The maximum of sums is not larger than the sum of the maxima; that is,
$$
\max_{w \in \mathcal{W}, \ell \in \mathcal{L}}
\sum_{p=1}^P (v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V_{w\ell}
\le
\sum_{p=1}^P \max_{w \in \mathcal{W}, \ell \in \mathcal{L}}
(v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V_{w\ell}.
$$
Define
$$
e_p \equiv \max_{w \in \mathcal{W} \ell \in \mathcal{L}}
(v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell})/V_{w\ell}.
$$
Then no reported margin is overstated by a fraction $\lambda_s$ or more
if
$$
E \equiv \sum_{p=1}^P e_p < \lambda_s.
$$
Thus if we can reject the hypothesis $E \ge \lambda_s$, we can conclude that
no pairwise margin was overstated by as much as a fraction $\lambda_s$.
Testing whether $E \ge \lambda_s$ would require a very large sample if we knew nothing at
all about $e_p$ without auditing batch $p$: a single large value of $e_p$ could make
$E$ arbitrarily large.
But there is an \emph{a priori} upper bound for $e_p$.
Whatever the reported votes $v_{pi}$ are in batch~$p$, we can find the
potential values of the actual votes $a_{pi}$ that would make the
error $e_p$ largest, because $a_{pi}$ must be between 0 and $n_p$,
the number of ballots in batch~$p$:
$$
\frac{v_{pw}-a_{pw} - v_{p\ell} + a_{p\ell}}{V_{w\ell}} \le
\frac{v_{pw}- 0 - v_{p\ell} + n_p}{V_{w\ell}}.
$$
Hence,
\begin{equation} \label{eq:uDef}
e_p \le \max_{w \in \mathcal{W}, \ell \in \mathcal{L}}
\frac{v_{pw} - v_{p\ell} + n_p}{V_{w\ell}} \equiv u_p.
\end{equation}
Knowing that $e_p \le u_p$ might let us conclude reliably that $E < \lambda_s$
by examining only a small number of batches---depending on the
values $\{ u_p\}_{p=1}^P$ and on the values of $\{e_p\}$ for the audited batches.
To make inferences about $E$, it is helpful to work with the \emph{taint}
$t_p \equiv \frac{e_p}{u_p} \le 1$.
Define $U \equiv \sum_{p=1}^P u_p$.
Suppose we draw batches at random with replacement, with probability $u_p/U$
of drawing batch $p$ in each draw, $p = 1, \ldots, P$.
(Since $u_p \ge 0$, these are all positive numbers, and they sum to 1,
so they define a probability distribution on the $P$ batches.)
Let $T_j$ be the value of $t_p$ for the batch $p$ selected in the $j$th draw.
Then $\{T_j\}_{j=1}^n$ are IID, $\mathbb{P} \{T_j \le 1\} = 1$, and
$$
\mathbb{E} T_1 = \sum_{p=1}^P \frac{u_p}{U} t_p =
\frac{1}{U}\sum_{p=1}^P u_p \frac{e_p}{u_p} =
\frac{1}{U} \sum_{p=1}^P e_p = E/U.
$$
Thus $E = U \mathbb{E} T_1$.
So, if we have strong evidence that
$\mathbb{E} T_1 < \lambda_s/U$, we have
strong evidence that $E < \lambda_s$.
This approach can be simplified even further by noting that $u_p$ has
a simple upper bound that does not depend on $v_{pi}$.
At worst, the reported result for batch $p$ shows $n_p$ votes for the
``least-winning'' apparent winner of the contest with the smallest margin,
but a hand interpretation would show that all $n_p$ ballots in the batch
had votes for the runner-up in that contest.
Since
$V_{w\ell} \ge V\equiv \min_{w \in \mathcal{W}, \ell \in \mathcal{L}} V_{w \ell}$ and $0 \le v_{pi} \le n_p$,
$$
u_p = \max_{w \in \mathcal{W}, \ell \in \mathcal{L}}
\frac{v_{pw} - v_{p\ell} + n_p}{V_{w\ell}}
\le \max_{w \in \mathcal{W}, \ell \in \mathcal{L}}
\frac{n_p - 0 + n_p}{V_{w\ell}}
\le \frac{2n_p}{V}.
$$
Thus if we use $2n_p/V$ in lieu of $u_p$, we still get conservative results.
(We also need to re-define $U$ to be the sum of those upper bounds.)
An intermediate, still conservative approach would be to use this upper bound for
batches that consist of a single ballot, but use the sharper bound (\ref{eq:uDef})
when $n_p > 1$.
Regardless, for the new definition of $u_p$ and $U$,
$\{T_j\}_{j=1}^n$ are IID, $\mathbb{P} \{T_j \le 1\} = 1$,
and
$$
\mathbb{E} T_1 = \sum_{p=1}^P \frac{u_p}{U} t_p =
\frac{1}{U}\sum_{p=1}^P u_p \frac{e_p}{u_p} =
\frac{1}{U} \sum_{p=1}^P e_p = E/U.
$$
So, if we have evidence that $\mathbb{E} T_1 < \lambda_s/U$, we have evidence that
$E < \lambda_s$.
\subsection{Testing $\mathbb{E} T_1 \ge \lambda_s/U$}
A variety of methods are available to test whether $\mathbb{E} T_1 < \lambda_s/U$.
One particularly elegant sequential method is based on Wald's Sequential Probability
Ratio Test (SPRT) \cite{wald45}.
Harold Kaplan pointed out this method on a website that no longer exists.
A derivation of this \emph{Kaplan-Wald} method is in Appendix A of~\cite{starkTeague14};
to apply the method here, take $t = \lambda_s$ in their equation~18.
A different sequential method, the \emph{Kaplan-Markov} method (also due to Harold Kaplan),
is given in~\cite{stark09b}.
\section{Ballot-polling tests for an overstatement quota}\label{sec:appendix-polling}
In this section, we derive a ballot-polling test of the hypothesis that the margin (in votes) in a single stratum is greater than or equal to a threshold $c$.
\subsection{Wald's SPRT with a nuisance parameter}
Consider a single stratum $s$ containing $N_s$ ballots, of which
$N_{w,s}$ have a vote for $w$ but not for $\ell$, $N_{\ell,s}$ have a vote for $\ell$ but not for $w$, and $N_{u,s} = N_s - N_{w,s} - N_{\ell,s}$ have votes for both $w$ and $\ell$ or neither $w$ nor $\ell$, including undervotes and invalid ballots.
Ballots are drawn sequentially without replacement, with equal probability of selecting each as-yet-unselected ballot in each draw.
We want to test the compound hypothesis that $N_{w,s} - N_{\ell,s} \le c$ against the alternative that $N_{w,s} = V_{w,s}$, $N_{\ell,s} = V_{\ell,s}$, and $N_{u,s} = V_{u,s}$, with $V_{w,s} - V_{\ell,s} > c$.
The values $V_{w,s}$, $V_{\ell,s}$, and $V_{u,s}$ are the reported results for stratum $s$
(or values related to those reported results; see \cite{lindemanEtal12}).
The value of $c$ is inferred from the definition $\omega_{w\ell, s} \equiv V_{w\ell, s} - (N_{w,s} - N_{\ell,s})$.
Thus,
$$ c = V_{w, s} - V_{\ell, s} - \omega_{w\ell, s} = V_{w\ell, s} - \lambda_s V_{w\ell}.$$
In this problem, $N_{u,s}$ (equivalently, $N_{w,s} + N_{\ell,s}$) is a nuisance parameter: we care about $ N_{w,s} - N_{\ell,s}$, but the probability distribution of the sample depends also on $N_{u,s}$.
Let $X_k$ be $w$, $\ell$, or $u$ according to whether the ballot selected on the $k$th draw
shows a vote for $w$ but not $\ell$, $\ell$ but not $w$, or something else.
Let $W_n \equiv \sum_{k=1}^n 1_{X_k = w}$; and define $L_n$ and $U_n$ analogously.
Then $W_n + L_n + U_n = n$.
The probability of a given data sequence $X_1, \ldots, X_n$ under the alternative hypothesis is
\begin{equation} \label{eq:alternative_prob}
\frac{\prod_{i=0}^{W_n-1} (V_{w,s}-i) \;
\prod_{i=0}^{L_n-1} (V_{\ell,s}-i) \;
\prod_{i=0}^{U_n-1} (V_{u,s}-i)}
{\prod_{i=0}^{n-1} (N_s-i)},
\end{equation}
using the convention that $\prod_{i=0}^{-1} d_i \equiv 1$.
If $L_n \ge W_n - cn/N_s$, the data obviously do not provide evidence against the null, so we suppose that
$L_n < W_n - cn/N_s$, in which case, the element of the null that will maximize the probability of the observed data has $N_{w,s} - c = N_{\ell,s}$.
Under the null hypothesis, the probability of $X_1, \ldots, X_n$ is
\begin{equation} \label{eq:null_prob}
\frac{ \prod_{i=0}^{W_n-1} (N_{w,s}-i) \;
\prod_{i=0}^{L_n-1}(N_{w,s}-c - i)
\prod_{i=0}^{U_n-1} (N_{u,s}-i)}
{\prod_{i=0}^{n-1} (N_s-i)},
\end{equation}
for some value $N_{w,s}$ and the corresponding $N_{u,s} = N_s - 2N_{w,s}+c$.
How large can that probability be under the null?
The probability under the null is maximized by any integer
$x \in \{ \max(W_n, L_n+c), \ldots, (N_s-U_n)/2 \}$ that maximizes
$$
\prod_{i=0}^{W_n-1} (x-i) \; \prod_{i=0}^{L_n-1} (x-c-i) \; \prod_{i=0}^{U_n-1} (N_s-2x+c - i),
$$
again using the convention that $\prod_{i=0}^{-1} d_i \equiv 1$.
The logarithm is monotonic, so any maximizer $x^*$ also maximizes
$$ f(x) = \sum_{i=0}^{W_n-1} \ln (x-i) + \sum_{i=0}^{L_n-1} \ln (x-c-i) + \sum_{i=0}^{U_n-1} \ln (N_s-2x+ c - i).$$
(Here, we use the convention that $\sum_{i=0}^{-1} d_i \equiv 0$.)
If we consider the domain of $f$ to be real numbers, not just integers, the second derivative of $f$ is everywhere negative, so $f$ has a unique maximum on $[ \max(W_n, L_n+c), (N_s-U_n)/2]$,
either at one of the endpoints or somewhere on the interval.
The derivative $f'(x)$ gives information about where the maximum occurs:
$$ f'(x) = \sum_{i=0}^{W_n-1} \frac{1}{ x-i } + \sum_{i=0}^{L_n-1} \frac{1}{ x-c-i} -2 \sum_{i=0}^{U_n-1} \frac{1}{N_s-2x+ c - i}.$$
If $f'(x)$ does not change signs, then the maximum is at one of the endpoints;
set $x^*$ to be the endpoint that makes $f$ larger.
Otherwise, the (real-valued) maximum occurs where $f'(x) = 0$ and can be found using a root finder such as Brent's method; alternatively, the maximum can be found using concave optimization.
Since the maximizer for the original problem must be an integer, set $x^*$ to be the value in $\{\lfloor x \rfloor, \lceil x \rceil\}$ that makes $f$ larger; by concavity, that is the integer maximizer.
%The first two terms on the right increase monotonically with $x$ and the last term decreases monotonically with $x$.
%This yields bounds without having to evaluate $f$ everywhere.
%Suppose $y < z$.
%Then for all integer $x$ between $y$ and $z$,
%$$ f(x) \le \sum_{i=0}^{W_n-1} \ln (z-i) + \sum_{i=0}^{L_n-1} \ln (z-c-i) + \sum_{i=0}^{U_n-1} \ln (N_s-2y+c-i).$$
%The optimization problem can be solved using a branch and bound approach.
%For instance, start by evaluating
%$$
% f^+(x) \equiv \sum_{i=0}^{W_n-1} \ln (x-i) + \sum_{i=0}^{L_n-1} \ln (x-c-i)
%$$
%and
%$$
% f^-(x) \equiv \sum_{i=0}^{U_n-1} \ln (N_s-2x+c-i)
%$$
%at $\max(W_n, L_n+c)$, $(N_s-U_n)/2$, and their midpoint,
% to get the values of $f = f^+ + f^-$ at those three points, along with
%upper bounds on $f$ on the ranges between them.
%At stage $j$, we have evaluated $f$, $f^+$, and $f^-$ at $j$ points $x_1 < x_2 < \ldots < x_j$, and
%we have upper bounds on $f$ on the $j-1$ ranges $R_m = \{x_m, x_m+1, \ldots, x_{m+1}\}$
%between those points.
%Let $U_m$ be the upper bound on $f(x)$ for $x \in R_m$.
%Suppose that for some $h$, $f(x_h) = \max_{m=1}^j U_m$.
%Then $x^* = x_h$ is a global maximizer of $f$.
%If there is some $U_m > \max_i f(x_i)$, then subdivide the range with the largest $U_m$,
%calculate $f$, $f^+$, and $f^-$ at the new point, and repeat.
%This algorithm must terminate by identifying a global maximizer $x^*$ after a finite number of steps.
A conservative $P$-value for the null hypothesis after $n$ items have been drawn is thus
$$
P_n = \frac{\prod_{i=0}^{W_n-1} (x^*-i) \; \prod_{i=0}^{L_n-1} (x^*-c-i) \; \prod_{i=0}^{U_n-1} (N_s-2x^*+c-i)}{\prod_{i=0}^{W_n-1}(V_{w,s}-i) \; \prod_{i=0}^{L_n-1} (V_{\ell,s}-i) \; \prod_{i=0}^{U_n-1} (V_{u,s}-i)}.
$$
Because the test is built on Wald's SPRT, the sample can expand sequentially and
(if the null hypothesis is true) the chance that $P_n < p$ is never larger than $p$.
That is, $\Pr \{ \inf_n P_n \le p \} \le p$ if the null is true.
A Jupyter notebook implementing this approach is given in \url{https://github.com/pbstark/CORLA18}.
\section{Maximizing Fisher's combined $P$-value for a hybrid stratified audit}\label{sec:appendix-max-fisher}
Below, we present an algorithm for finding the maximum $P$-value in a hybrid stratified audit that combines
ballot-level comparison with ballot polling.
Let $p_1(\lambda)$ be the $P$-value from a ballot-polling audit and $p_2(1-\lambda)$ be the
$P$-value from a ballot comparison audit.
The calculation of these $P$-values is derived in Appendix~\ref{sec:appendix-polling}
and~\ref{sec:appendix-comparison}, respectively.
To find the minimum combined $P$-value, we first compute $\chi(\lambda)$
for a grid of values $\lambda_0 = \lambda_- < \lambda_1 < \cdots < \lambda_n = \lambda_+$,
where $\lambda_{j+1} - \lambda_j = \delta$.
To account for the fact that $\chi$ may be smaller for values of $\lambda$ between the
sampled values, we bound its variation between grid points.
The \emph{local modulus of continuity}, $\omega(\delta; \lambda)$, is an upper bound on how much $\chi$ can change in a $\delta$-neighborhood of $\lambda$:
$$\lvert \chi(\lambda) - \chi(\lambda + \delta) \rvert \leq \omega(\delta; \lambda).$$
For all $\lambda \in [\lambda_j, \lambda_{j+1}]$,
\begin{equation}
\chi(\lambda) \ge \max \{ \chi(\lambda_j) - \omega(\lambda-\lambda_j; \lambda_j),
\chi(\lambda_{j+1}) - \omega(\lambda_{j+1}-\lambda; \lambda_{j+1}) \}.
\end{equation}
Thus for all $\lambda \in [\lambda_-, \lambda_+]$,
\begin{equation}
\chi(\lambda) \ge \chi^*_n \equiv \min_{j=0}^{n-1} \min_{\mu \in [\lambda_j, \lambda_{j+1}]}
\max \{ \chi(\lambda_j) - \omega(\mu-\lambda_j; \lambda_j),
\chi(\lambda_{j+1}) - \omega(\lambda_{j+1}-\mu; \lambda_{j+1}) \}.
\end{equation}
If that lower bound is above $\chi_\alpha$, the audit can stop.
If not, then either the true minimum is below $\chi_\alpha$ or the grid $\{\lambda_j \}$ is too coarse.
Finding a lower bound on $\chi(\lambda)$ between gridpoints can be simplified in a number of ways, for instance, using a (bound on a) global modulus of continuity instead of a local modulus of continuity.
The algorithm to approximate $\min_\lambda \chi(\lambda)$ is as follows:
\begin{enumerate}
\item Pick a grid of values $\lambda_0 \equiv \lambda_- < \lambda_1 < \cdots < \lambda_n \equiv \lambda_+$.
\item Evaluate $\chi(\lambda_j)$, $j=0, \ldots, n$.
\item If, for any $j$, $\chi(\lambda_j) \le \chi_\alpha$, the audit must escalate.
\item Otherwise, check whether $\chi^*_n \ge \chi_\alpha$.
\item If so, the audit can stop: the minimum $\chi(\lambda)$ is above $\chi_\alpha$.
If not, refine the grid of values of $\lambda$ and return to step~2.
\end{enumerate}
\subsection{Bounding the local modulus of continuity for Fisher's combining function}
Fisher's combining function applied to the stratumwise $P$-values is
\begin{align}\label{eq:fisher-hybrid}
\chi(\lambda) = -2 & \Bigg[ n_1 \log\left( 1 - \frac{\lambda V_{w\ell}}{2N_1 \gamma}\right) - o_1 \log(1 - 1/2\gamma) - o_2 \log(1 - 1/\gamma) \nonumber \\
& - u_1 \log(1 + 1/2\gamma) - u_2 \log(1 + 1/\gamma)+ \sum_{i=0}^{W_n - 1} \log(N_{w,2}(1-\lambda) - i) \nonumber \\
& +\sum_{i=0}^{L_n - 1} \log(N_{w,2}(1-\lambda) - c(1-\lambda) - i) +\sum_{i=0}^{U_n - 1} \log(N_2 - 2N_{w,2}(1-\lambda) + c(1-\lambda) - i) \nonumber \\
& - \sum_{i=0}^{W_n - 1} \log(V_{w,2} - i) - \sum_{i=0}^{L_n - 1} \log(V_{\ell,2} - i) - \sum_{i=0}^{U_n - 1} \log(V_{u,2} - i) \Bigg].
\end{align}
We seek the modulus of continuity (or an upper bound on it) for this function.
While $c(\lambda)$ and $N_w(\lambda)$ must be integer values
(as they represent numbers of ballots),
the $P$-value from the ballot-polling SPRT and thus $\chi(\lambda)$ make sense for non-integer values of
$c(\lambda)$ and $N_w(\lambda)$, so to allow the use of calculus, we will extend the domain
to real values.
Moduli of continuity obey two key properties:
\begin{itemize}
\item \textbf{Sublinearity}: $\omega(a f + b g, \delta) \le \lvert a \rvert \omega_f(\delta) + \lvert b \rvert \omega_g (\delta)$
\item \textbf{Composition}: $\omega(g \circ f, \delta) = \omega_g(\omega_f(\delta))$
\end{itemize}
\noindent First, we find the modulus of continuity for several simple functions:
\begin{itemize}
\item $\log(ax - b)$: the modulus of continuity is $\log(1 + a\delta)$.
\item $c(\lambda) = (V_{w,2}-V_{\ell,2}) - \lambda V_{w\ell}$: since this function is linear in $\lambda$, the modulus of continuity is $V_{w\ell}\delta$.
\item $N_{w,2}(\lambda)$. When $N_{u,2}$ is known, this is also linear in $c(\lambda)$.
When $N_{u,2}$ is not known, $N_{w,2}(\lambda)$ increases with $c$, at a rate between $0$ and $1$.
For $N_{w,2}(\lambda)$ in the interior of the domain, taking the implicit derivative with respect to $c$ shows that
$$\frac{d N_{w, 2}}{dc} = \frac{ \sum_{i=0}^{L_n-1} \frac{1}{(N_{w,2} - c - i)^2} + \sum_{i=0}^{U_n-1} \frac{2}{(N_2 - 2N_{w,2} + c - i)^2} }{\sum_{i=0}^{W_n-1} \frac{1}{(N_{w,2} - i)^2} + \sum_{i=0}^{L_n-1} \frac{1}{(N_{w,2} - c - i)^2} + \sum_{i=0}^{U_n-1} \frac{4}{(N_2 - 2N_{w,2} + c - i)^2} } \in [0, 1].$$
Thus it also has modulus of continuity at most $V_{w\ell}\delta$.
Moreover, since $N_{w,2}$ is dominated by $c$, the terms $N_{w,2}(\lambda)-c(\lambda)$ and $-2N_{w,2}(\lambda)+c(\lambda)$ have modulus of continuity $V_{w\ell}\delta$ and $2V_{w\ell}\delta$, respectively.
\item For terms that don't involve $\lambda$, the modulus of continuity is 0.
\end{itemize}
There are four terms in $\chi(\lambda)$ that involve $\lambda$.
Using the two key properties of linearity and composition, as well as the moduli for simple functions above, we obtain the following bound on the global modulus of continuity:
\begin{align}
\omega_\chi(\delta) = & 2W_n \log(1 + V_{w\ell}\delta) + 2L_n \log(1+ V_{w\ell}\delta) + \nonumber \\
& 2U_n \log(1 + 2V_{w\ell}\delta) + 2n_1 \log\left(1 + \frac{V_{w\ell}}{2N_1\gamma}\delta\right)
\end{align}
\bibliography{./pbsBib}
\end{document}