\documentstyle[11pt] {report}
\marginparwidth 0pt
\oddsidemargin 0pt
\evensidemargin 0pt
\marginparsep 0pt
\topmargin=0.0in
\textwidth=6.25in
\textheight=8.25in
\pagestyle{plain}
\parskip=5pt
\parindent=30pt
\begin{document}
\thispagestyle{empty}
\centerline{\Huge\bf ASURV}
\bigskip
\centerline{\Huge\bf Astronomy SURVival Analysis }
\bigskip
\bigskip
\bigskip
\bigskip
\centerline{\LARGE\bf A Software Package for Statistical Analysis of }
\bigskip
\centerline{\LARGE\bf Astronomical Data Containing Nondetections}
\bigskip
\bigskip
\bigskip
\bigskip
\begin{tabbing}
xxxxxxxxxxx\= \kill
\>xxxxxxxxxxxxxxx\= \kill
{\Large\bf Developed by: } \\
\> \\
\> {\Large\bf Takashi Isobe } (Center for Space Research, MIT) \\
\> \\
\> {\Large\bf Michael LaValley } (Dept. of Statistics, Penn State)\\
\> \\
\>{\Large\bf Eric Feigelson }(Dept. of Astronomy \& Astrophysics, Penn State) \\
\> \\
\> \\
\> \\
{\Large\bf Available from: } \\
\> \\
\> code@stat.psu.edu \\
\> \\
\>or, \\
\> \\
\> Eric Feigelson \\
\> Dept. of Astronomy \& Astrophysics \\
\> Pennsylvania State University \\
\> University Park PA 16802 \\
\> (814) 865-0162 \\
\> Email: edf@astro.psu.edu (Internet)\\
\end{tabbing}
\bigskip
\bigskip
\bigskip
\bigskip
\bigskip
\centerline{\Large\bf Rev. 1.2, Summer 1992.}
\newpage
\centerline{\Large\bf TABLE OF CONTENTS}
\bigskip
\bigskip
\noindent{\large\bf
\begin{center}
\begin{verbatim}
1 Introduction ............................................ 3
2 Overview of ASURV ....................................... 4
2.1 Statistical Functions and Organization .......... 4
2.2 Cautions and caveats ............................ 6
3 How to run ASURV ........................................ 9
3.1 Data Input Formats .............................. 9
3.2 KMESTM instructions and example ................. 10
3.3 TWOST instructions and example .................. 11
3.4 BIVAR instructions and example .................. 13
4 Acknowledgements ........................................ 21
Appendices ................................................ 22
A1 Overview of survival analysis .................... 22
A2 Annotated Bibliography on Survival Analysis ...... 23
A3 How Rev 1.2 is Different From Rev 0.0 ............ 26
A4 Errors Removed in Rev 1.1 ........................ 28
A5 Errors Removed in Rev 1.2 ........................ 28
A6 Obtaining and Installing ASURV ................... 28
A7 User Adjustable Parameters ....................... 30
A8 List of subroutines used in ASURV Rev 1.2 ........ 32
\end{verbatim}
\end{center}
}
\bigskip
\bigskip
\bigskip
\centerline{\Large\bf NOTE}
\begin{small}
This program and its documentation are provided `AS IS' without
warranty of any kind. The entire risk as to the results and performance
of the program is assumed by the user. Should the program prove
defective, the user assumes the entire cost of all necessary corrections.
Neither the Pennsylvania State University nor the authors of the program
warrant, guarantee or make any representations regarding use of, or the
results of the use of, the program in terms of correctness, accuracy,
reliability, or otherwise. Neither shall they be liable for any direct or
indirect damages arising out of the use, results of use, or inability to
use this product, even if the University has been advised of the possibility
of such damages or claim. The use of any registered Trademark depicted
in this work is merely ancillary; the authors have no affiliation with
or approval from these Trademark owners.
\end{small}
\newpage
\centerline{\Large\bf 1 Introduction}
Observational astronomers frequently encounter the situation where they
observe a particular property (e.g. far infrared emission, calcium line
absorption, CO emission) of a previously defined sample of objects, but
fail to detect all of the objects. The data set then contains nondetections as
well as detections, preventing the use of simple and familiar statistical
techniques in the analysis.
A number of astronomers have recently recognized the existence
of statistical methods, or have derived similar methods, to deal with these
problems. The methods are collectively
called `survival analysis' and nondetections are called
`censored' data points. {\sl\bf ASURV} is a menu-driven stand-alone computer
package designed to assist astronomers in using methods from survival
analysis. Rev. 1.2 of {\sl\bf ASURV} provides the maximum-likelihood
estimator of the censored distribution,
several two-sample tests, correlation tests and linear regressions as
described in our papers in the {\it\bf Astrophysical Journal} (Feigelson
and Nelson, 1985; Isobe, Feigelson, and Nelson, 1986).
No statistical procedure can magically recover information that was
never measured at the telescope. However, there is frequently important
information implicit in the failure to detect some objects which can be
partially recovered under reasonable assumptions. We purposely provide
a variety of statistical tests - each
based on different models of where upper limits truly lie - so that the
astronomer can judge the importance of the different assumptions.
There are also reasons for {\bf not} applying these methods. We describe
some of their known limitations in {\bf \S 2.2}.
Because astronomers are frequently unfamiliar with the field of
statistics, we devote Appendix {\bf \S A1} to background
material. Both general issues concerning censored data and specifics
regarding the methods used in {\sl\bf ASURV} are discussed. More
mathematical presentations can be found in the references given in Appendix
{\bf \S A2}. Users of Rev 0.0, distributed between 1988 and 1991, are
strongly encouraged to
read Appendices {\bf \S A3-A5} to be familiar with the changes made in the
package. Appendices {\bf \S A6-A8} are needed only by code installers and
those needing to modify the I/O or array sizes.
Users wishing to get right down to work should read {\bf \S 2.1} to
find the appropriate program, and follow the instructions in {\bf \S 3}.
\bigskip
\bigskip
\centerline{\Large\bf Acknowledging ASURV}
We would appreciate it if studies using this package included phrasing
similar to `... using {\sl\bf ASURV} Rev 1.2 ({\bf B.A.A.S.} reference),
which implements the methods presented in ({\bf Ap. J.} reference)', where
the {\bf B.A.A.S.} reference is the most recent published {\sl\bf ASURV}
Software Report (Isobe and Feigelson 1990; LaValley, Isobe and Feigelson
1992) and the {\bf Ap. J.} reference is Feigelson and Nelson (1985) for
univariate problems and Isobe, Feigelson, and Nelson (1986) for bivariate
problems. Other general works appropriate for referral include the review
of survival methods for astronomers by Schmitt (1985), and the survival
analysis monographs by Miller (1981) or Lawless (1982).
\newpage
\centerline{\Large\bf 2 Overview of ASURV}
\medskip
\noindent{\large\bf 2.1 Statistical Functions and Organization}
The statistical methods for dealing with censored data might be
divided into a $2 \times 2$ grid: parametric $vs.$ nonparametric, and univariate $vs.$
bivariate. Parametric methods assume that the censored survival times
are drawn from a parent population with a known distribution function ($e.g.$
Gaussian, exponential), and this is the principal assumption of survival
models for industrial reliability. Nonparametric models make
no assumption about the
parent population, and frequently rely on maximum-likelihood techniques.
Univariate methods are devoted to determining the characteristics of the
population from which the censored sample is drawn, and comparing such
characteristics for two or more censored samples. Bivariate methods
measure whether the censored property of the sample is correlated with
another property of the objects and, if so, perform a regression that
quantifies the relationship between the two variables.
We have chosen to concentrate on nonparametric models, since the
underlying distribution of astronomical populations is usually unknown.
The linear regression methods, however, are either fully parametric
($e.g.$ EM algorithm regression) or semi-parametric
($e.g.$ Cox regression, Buckley-James regression).
Most of the methods are described in more detail in the astronomical
literature by Schmitt (1985), Feigelson and Nelson (1985) and Isobe et al.
(1986). The generalized Spearman's rho utilized in {\sl\bf ASURV}
Rev 1.2 is derived by Akritas (1990).
The program within {\sl\bf ASURV} giving univariate methods is
{\sl\bf UNIVAR}. Its first subprogram is {\sl\bf KMESTM}, giving the
Kaplan-Meier estimator for the distribution function of a randomly
censored sample. First derived in
1958, it lies at the root of non-parametric survival analysis. It is the
unique, self-consistent, generalized maximum-likelihood estimator for the
population from which the sample was drawn. When formulated in cumulative
form, it has analytic asymptotic (for large N) error bars. The median is
always well-defined, though the mean is not if the lowest point in the
sample is an upper limit. It is identical to the differential
`redistribute-to-the-right' algorithm, independently derived by Avni et al.
(1980) and others, but the differential form does not have simple analytic
error analysis.
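As an illustrative sketch only (not the {\sl\bf ASURV} Fortran itself), the
product-limit form of the Kaplan-Meier estimator can be written in a few
lines. The function below is ours; it assumes the survival-analysis
convention in which censored points are lower limits, so astronomical
upper limits would be handled by negating the data first:

```python
import numpy as np

def kaplan_meier(values, detected):
    """Product-limit (Kaplan-Meier) estimator for right-censored data.

    values   : measured values (censored entries are lower limits here;
               negate the data first to treat astronomical upper limits)
    detected : True for detections, False for censored points
    Returns a list of (value, survival probability) steps.
    """
    values = np.asarray(values, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    # sort by value, placing detections before censored points at ties
    order = np.lexsort((~detected, values))
    v, d = values[order], detected[order]
    n = len(v)
    surv = 1.0
    steps = []
    for i in range(n):
        if d[i]:                            # steps occur only at detections
            surv *= 1.0 - 1.0 / (n - i)     # n - i points remain at risk
            steps.append((v[i], surv))
    return steps
```

With no censoring this reduces to the ordinary empirical distribution
function; a censored point simply leaves the risk set without forcing a
step, which is equivalent to the `redistribute-to-the-right' algorithm
mentioned above.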
The second univariate program is {\sl\bf TWOST}, giving a variety
of ways to test whether two censored samples are drawn from the same parent
population. They are mostly generalizations of standard tests for
uncensored data, such as the Wilcoxon and logrank nonparametric two-sample
tests. They differ in how the censored data are weighted or `scored' in
calculating the statistic. Thus each is more sensitive to differences at
one end or the other of the distribution. The Gehan and logrank tests are
widely used in biometrics, while some of the others are not. The tests
offered in Rev 1.2 differ significantly from those offered in Rev 0.0 and
details of the changes are given in {\bf \S A3}.
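To make the idea of `scoring' concrete, here is a minimal sketch of the
pairwise scoring at the heart of the Gehan test (for right-censored data;
the function names are ours, and the package itself assesses significance
with asymptotic variances rather than this raw statistic):

```python
import itertools

def gehan_score(xi, di, xj, dj):
    """Pairwise Gehan score: +1 when observation i is certainly larger
    than j, -1 when certainly smaller, else 0.  d* is True for a
    detection (exact value); a censored value is a lower limit."""
    if dj and xi > xj:
        return 1     # true_i >= xi > xj = true_j, so i is surely larger
    if di and xi < xj:
        return -1    # true_i = xi < xj <= true_j, so i is surely smaller
    return 0

def gehan_statistic(x1, d1, x2, d2):
    """Sum of pairwise Gehan scores between sample 1 and sample 2."""
    return sum(gehan_score(a, da, b, db)
               for (a, da), (b, db) in
               itertools.product(zip(x1, d1), zip(x2, d2)))
```

A large |statistic| relative to its variance indicates the two samples
are unlikely to share a parent population; the different two-sample tests
correspond to different weightings of such pairwise comparisons.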
{\sl\bf BIVAR} provides methods for bivariate data, giving three
correlation tests and three linear regression analyses. Cox hazard model
(correlation test), EM algorithm, and Buckley-James method (linear
regressions) can treat several independent variables if the dependent
variable contains only one kind of censoring ($i.e.$ upper or lower limits).
Generalized Kendall's tau (correlation test), Spearman's rho
(correlation test), and Schmitt's binned linear regression can treat
mixed censoring including censoring in the independent variable, but
can have only one independent variable.
{\sl\bf COXREG} computes the correlation probabilities by Cox's
proportional hazard model and {\sl\bf BHK} computes the generalized
Kendall's tau. {\sl\bf SPRMAN} computes correlation probabilities
based on a generalized version of Spearman's rank order correlation
coefficient. {\sl\bf EM} calculates linear regression
coefficients by the EM algorithm assuming a normal distribution of
residuals, {\sl\bf BJ} calculates the Buckley-James regression with
Kaplan-Meier residuals, and {\sl\bf TWOKM} calculates the binned
two-dimensional Kaplan-Meier distribution and associated linear
regression coefficients derived by Schmitt (1985). A bootstrapping
procedure provides error analysis for Schmitt's method in Rev 1.2. The
code for the EM algorithm follows that given by
Wolynetz (1979) except for minor changes. The code for the Buckley-James
method is adapted from Halpern (Stanford Dept. of Statistics; private
communication). Code for the Kaplan-Meier estimator and some of the
two-sample tests was adapted from that given in Lee (1980). {\sl\bf COXREG},
{\sl\bf BHK}, {\sl\bf SPRMAN}, and the {\sl\bf TWOKM} code were developed
entirely by us.
Detailed remarks on specific subroutines can be found in the comments at
the beginning of each subroutine. The reference for the source of the code
for the subroutine is given there, as well as an annotated list of the
variables used in the subroutine.
\newpage
\noindent{\Large\bf 2.2 Cautions and Caveats}
The Kaplan-Meier estimator works with any underlying distribution
($e.g.$ Gaussian, power law, bimodal), but only if the censoring is `random'.
That is, the probability that the measurement of an object is censored
cannot depend on the value of the censored variable. At first glance, this
may seem to be inapplicable to most astronomical problems: we detect the
brighter objects in a sample, so the distribution of upper limits always
depends on brightness. However, two factors often serve to randomize
the censoring distribution. First, the censored variable may not be
correlated with the variable by which the sample was initially
identified. Thus, infrared observations of a sample of radio bright
objects will be randomly censored if the radio and infrared emission are
unrelated to each other. Second, astronomical objects in a sample usually
lie at different distances, so that brighter objects are not always the
most luminous. In cases where the variable of interest is censored at
a particular value ($e.g.$ flux, spectral line equivalent width, stellar
rotation rate) rather than randomly censored, parametric methods
appropriate to `Type 1' censoring should be used. These are described by
Lawless (1982) and Schneider (1986), but are not included in this package.
Thus, the censoring mechanism of each study MUST be understood
individually to judge whether the censoring is or is not likely to be
random. A good example of this judgment process is provided by
Magri et al. (1988). The appearance of the data, even if the upper limits
are clustered at one end of the distribution, is NOT a reliable measure. A
frequent (if philosophically distasteful) escape from the difficulty of
determining the nature of the censoring in a given experiment is to define
the population of interest to be the observed sample. The Kaplan-Meier
estimator then always gives a valid redistribution of the upper limits,
though the result may not be applicable in wider contexts.
The two-sample tests are somewhat less restrictive than the
Kaplan-Meier estimator, since they seek only to compare two samples rather
than determine the true underlying distribution. Because of this, the
two-sample tests do not require that the censoring patterns of the two samples
are random. The two versions of the Gehan test in {\sl\bf ASURV} assume
that the censoring patterns of the two samples are the same, but
the version with hypergeometric variance is more reliable in the case of
different censoring patterns. The logrank test results appear to be
correct as long as the censoring patterns are not very different.
Peto-Prentice seems to be the test least affected by differences in
the censoring patterns. There is little known about the limitations of
the Peto-Peto test. These issues are discussed in Prentice and Marek (1979),
Latta (1981) and Lawless (1982).
Two of the bivariate correlation coefficients, generalized Kendall's tau
and Cox regression, are known to be inaccurate when many tied values
are present in the data. Ties are particularly common when the data are
binned. Caution should be used in these cases. It is not known how the
third correlation method, generalized Spearman's rho, responds to ties in the
data. However, there is no reason to believe that it is more accurate than
Kendall's tau in this case, and it should also be used with caution in the
presence of tied values.
Cox regression, though widely used in biostatistical applications,
formally applies only if the `hazard rate' is constant; that is, the
cumulative distribution function of the censored variable falls
exponentially with increasing values. Astronomical luminosity functions,
in contrast, are frequently modeled by power law distributions. It is not
clear whether or not the results of a Cox regression are significantly
affected by this difference.
There are a variety of limitations to the three linear regression
methods -- {\sl\bf EM}, {\sl\bf BJ}, and {\sl\bf TWOKM} -- presented here.
First, only Schmitt's binned method implemented in {\sl\bf TWOKM} works when
censoring is present in both variables. Second, {\sl\bf\ EM} requires that
the residuals about the fitted line follow a Gaussian distribution.
{\sl\bf BJ} and {\sl\bf TWOKM} are less restrictive, requiring only that the
censoring distribution about the fitted line is random. Both we and
Schneider (1986) find little difference in the regression
coefficients derived from these two methods. Third, the Buckley-James
method has a defect in that the final solution occasionally oscillates
rather than converges smoothly. Fourth, there is considerable uncertainty
regarding the error analysis of the regression coefficients of all three
models. {\sl\bf\ EM} gives analytic formulae based on asymptotic normality,
while Avni and Tananbaum (1986) numerically calculate and examine the
likelihood surface. {\sl\bf BJ} gives an analytic formula for the slope only, but it
lies on a weak theoretical foundation. We now provide Schmitt's bootstrap
error analysis for {\sl\bf TWOKM}, although this may be subject to high
computational expense for large data sets. Users may thus wish to run
all methods and evaluate the uncertainty with caution. Fifth, Schmitt's
binned regression implemented in {\sl\bf TWOKM} has a number of drawbacks
discussed by Sadler et al. (1989), including slow or failed convergence
to the two-dimensional Kaplan-Meier distribution, arbitrary choice of
bin size and origin, and problems if either too many or too few bins are
selected. In our judgment, the most reliable linear regression method
may be the Buckley-James regression, and we suggest that Schmitt's regression
be reserved for problems with censoring present in both variables.
If we consider the field of survival analysis from a broader
perspective, we note a number of deficiencies with respect to censored
statistical problems in astronomy (see Feigelson, 1990). Most importantly,
survival analysis assumes the upper limits in a given experiment are
precisely known, while in astronomy they frequently represent n times
the RMS noise level in the experimental detector, where n = 2, 3, 5
or whatever. {\bf It is possible that all existing survival methods will
be inaccurate for astronomical data sets containing many points very close
to the detection limit.} Methods that combine the virtues of survival
analysis (which treat censoring) and measurement error models (which
treat known measurement errors in both uncensored and censored points)
are needed. See the discussion by Bickel and Ritov (1992) on this important
matter. A related deficiency is the absence of
weighted means or regressions associated with censored samples.
Survival analysis also has not yet produced any true multivariate
techniques, such as a Principal Components Analysis that permits
censoring. There also seems to be a dearth of nonparametric
goodness-of-fit tests like the Kolmogorov-Smirnov test.
Finally, we note that this package - {\sl\bf ASURV} - is not unique.
Nearly a dozen software packages for analyzing censored data have been
developed (Wagner and Meeker 1985). Four are part of large multi-purpose
commercial statistical software systems: SAS, SPSS, BMDP, and GLIM.
These packages are available on many university central computers. We have
found BMDP to be the most useful for astronomers (see Appendices to
Feigelson and Nelson 1985, Isobe et al. 1986 for instructions on its use).
It provides most of the functions in {\sl\bf KMESTM} and {\sl\bf TWOST}
as well as a full Cox regression, but no linear regressions. Other packages
(CENSOR, DASH, LIMDEP, STATPAC, STAR, SURVAN, SURVCALC, SURVREG) were written
at various universities, medical institutes, publishing houses and industrial
labs; they have not been evaluated by us.
\newpage
\centerline{\Large\bf 3 How to Run ASURV}
\medskip
\noindent{\large\bf 3.1 Data Input Formats}
{\sl\bf ASURV} is designed to be menu-driven. There are two basic input
structures: a data file, and a command file. The data file is assumed to
reside on disk. For each observed object, it contains the measured value
of the variable of interest and an indicator as to whether it is detected
or not. Listed below are the possible values that the {\bf censoring indicator
} can take on. Upper limits are most common in astronomical applications and
lower limits are most common in lifetime applications.
\begin{verbatim}
For univariate data: 1 : Lower Limit
0 : Detection
-1 : Upper Limit
For bivariate data: 3 : Both Independent and Dependent Variables are
Lower Limits
2 : Only independent Variable is a Lower Limit
1 : Only Dependent Variable is a Lower Limit
0 : Detected Point
-1 : Only Dependent Variable is an Upper Limit
-2 : Only Independent Variable is an Upper Limit
-3 : Both Independent and Dependent Variables are
Upper Limits
\end{verbatim}
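For concreteness, the bivariate codes above can be decoded mechanically.
The helper below is ours, not part of the package:

```python
def censoring_flags(ind):
    """Decode an ASURV bivariate censoring indicator (see table above).

    Returns (x_limit, y_limit) for the independent and dependent
    variables: each is 'lower', 'upper', or None for a detection.
    """
    if ind not in (-3, -2, -1, 0, 1, 2, 3):
        raise ValueError("unknown censoring indicator: %r" % ind)
    x = 'lower' if ind in (2, 3) else 'upper' if ind in (-2, -3) else None
    y = 'lower' if ind in (1, 3) else 'upper' if ind in (-1, -3) else None
    return x, y
```

Note the symmetry: negating the indicator swaps upper and lower limits,
which is why a sign flip of the data converts one censoring type into the
other.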
The command file may reside on disk, but is more frequently
given interactively at the terminal.
For the univariate program {\sl\bf UNIVAR}, the format of the data file
is determined in the subroutine {\sl\bf DATAIN}. It is currently set
at 10*(I4,F10.3) for {\sl\bf KMESTM}, where each column represents a
distinct subsample. For {\sl\bf TWOST}, it is set at I4,10*(I4,F10.3),
where the first column gives the sample identification number. Table 1
gives an example of a {\sl\bf TWOST} data file called 'gal2.dat'. It
shows a hypothetical study where six normal galaxies, six starburst galaxies
and six Seyfert galaxies were observed with the IRAS satellite. The variable
is the log of the 60-micron luminosity. The problems we address are the
estimation of the luminosity function of each sample and a determination
of whether they differ from each other at a statistically significant
level. Command file input formats are given in subroutine {\sl\bf UNIVAR},
and inputs are parsed in subroutines {\sl\bf DATA1} and {\sl\bf DATA2}. All
data input and output formats can be changed by the user as described in
appendix {\bf \S A7}.
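The Fortran formats above are fixed-column. As a sketch of what
10*(I4,F10.3) means in practice, here is how one KMESTM data line could
be read outside the package (the function is ours):

```python
def parse_kmestm_line(line, nsub=1):
    """Parse one KMESTM data line in 10*(I4,F10.3) fixed format:
    each subsample occupies a 14-character field, a 4-character
    censoring indicator followed by a 10-character value.
    """
    pairs = []
    for k in range(nsub):
        field = line[14 * k: 14 * (k + 1)]
        pairs.append((int(field[:4]), float(field[4:])))
    return pairs
```

Because the format is fixed-column rather than whitespace-delimited, a
value that drifts out of its 14-character field will be silently
misread, which is worth checking when preparing data files by hand.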
For the bivariate program {\sl\bf BIVAR}, the data file format is
determined by the subroutine {\sl\bf DATREG}. It is currently set
at I4,10F10.3. In most cases, only two variables are used with input
format I4,2F10.3. Table 2 shows an example of a bivariate censored problem.
Here one wishes to investigate possible relations between infrared and
optical luminosities in a sample of galaxies. {\sl\bf BIVAR} command file
input formats are sometimes a bit complex. The examples below illustrate
various common cases. The formats for the basic command inputs are given in
subroutine {\sl\bf BIVAR}. Additional inputs for the Spearman's rho
correlation, EM algorithm, Buckley-James method, and Schmitt's method are
given in subroutines {\sl\bf R3}, {\sl\bf R4}, {\sl\bf R5}, and {\sl\bf R6}
respectively.
The current version of {\sl\bf ASURV} is set up for data sets with
fewer than 500 data points and occupies about 0.46 MBy of core memory.
For problems with more than 500 points, the parameter values in the
subroutines {\sl\bf UNIVAR} and {\sl\bf BIVAR} must be changed as described
in appendix {\bf\S A7}. Note that for the generalized Kendall's tau
correlation coefficient, and possibly other programs, the computation time
goes up as a high power of the number of data points.
{\sl\bf ASURV} has been designed to work with data that can
contain either upper limits (right censoring) or lower limits (left
censoring). Most of these procedures assume restrictions on the
type of censoring present in the data. Kaplan-Meier estimation and
the two-sample tests presented here can work with data that has either
upper limits or lower limits, but not with data containing both. Cox
regression, the EM algorithm, and Buckley-James regression can use
several independent variables if the dependent variable
contains only one type of censored point (it can be either upper or lower
limits). Kendall's tau, Spearman's rho, and Schmitt's binned regression can
have mixed censoring, including censoring in the independent variable, but
they can only have one independent variable.
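These restrictions are easy to check mechanically before choosing a
method. A small sketch (our helper, using the univariate indicator codes
of {\bf \S 3.1}):

```python
def univariate_censoring_kind(indicators):
    """Classify a univariate sample's censoring: the Kaplan-Meier
    estimator and the two-sample tests accept upper OR lower limits,
    but not a mixture of both."""
    has_lower = any(i == 1 for i in indicators)
    has_upper = any(i == -1 for i in indicators)
    if has_lower and has_upper:
        raise ValueError("mixed upper and lower limits are not supported")
    return 'lower' if has_lower else 'upper' if has_upper else 'none'
```

A data set that fails this check can still be analyzed with Kendall's
tau, Spearman's rho, or Schmitt's binned regression, which tolerate
mixed censoring.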
\bigskip
\bigskip
\noindent{\large\bf 3.2 KMESTM Instructions and Example}
Suppose one wishes to see the Kaplan-Meier estimator for the infrared
luminosities of the normal galaxies in Table 1. If one runs {\sl\bf ASURV}
interactively from the terminal, the conversation looks as follows:
\begin{verbatim}
Data type : 1 [Univariate data]
Problem : 1 [Kaplan-Meier]
Command file? : N [Instructions from terminal]
Data file : gal1.dat [up to 9 characters]
Title : IR Luminosities of Galaxies
[up to 80 characters]
Number of variables : 1 [if several variables in data file]
Name of variable : IR [ up to 9 characters]
Print K-M? : Y
Print out differential
form of K-M? : Y
25.0 [Starting point is set to 25]
5 [5 bins set]
2.0 [Bin size is set equal to 2]
Print original data? : Y
Need output file? : Y
Output file : gal1.out [up to 9 characters]
Other analysis? : N
\end{verbatim}
If an output file is not specified, the results will be sent to the
terminal screen.
If a command file residing on disk is preferred, run {\sl\bf ASURV}
interactively as above, specifying 'Y' or 'y' when asked if a command file is
available. The command file would then look as follows:
\begin{verbatim}
gal1.dat ... data file
IR Luminosities of Galaxies ... problem title
1 ... number of variables
1 ... which variable?
IR ... name of the variable
1 ... printing K-M (yes=1, no=0)
1 ... printing differential K-M (yes=1, no=0)
25.0 ... starting point of differential K-M
5 ... number of bins
2.0 ... bin size
1 ... printing data (yes=1, no=0)
gal1.out ... output file
\end{verbatim}
All inputs are read starting from the first column.
The resulting output is shown in Table 3. It shows, for example,
that the estimated normal galaxy cumulative IR luminosity function is 0.0
below log(L) = 26.9, 0.63 $\pm$ 0.21 for 26.9 $<$ log(L) $<$ 28.5,
0.83 $\pm$ 0.15 for 28.5 $<$ log(L) $<$ 30.1, and 1.00 above log(L) = 30.1.
The estimated mean for the sample is 27.8 $\pm$ 0.5. The 'C' beside two
values indicates these are nondetections; the Kaplan-Meier function does
not change at these values.
\bigskip
\bigskip
\noindent{\large\bf 3.3 TWOST Instructions and Example}
Suppose one wishes to see two sample tests comparing the subsamples
in Table 1. If one runs {\sl\bf ASURV} interactively from the terminal, the
conversation goes as follows:
\begin{verbatim}
Data type : 1 [Univariate data]
Problem : 2 [Two sample test]
Command file? : N [Instructions from terminal]
Data file : gal2.dat [up to 9 characters]
Problem Title : IR Luminosities of Galaxies
[up to 80 characters]
Number of variables : 1
[if the preceding answer is more
than 1, the number of the variable
to be tested is now requested]
Name of variable : IR [up to 9 characters]
Number of groups : 3
Which combination ? : 0 1 [by the indicators in column one
of the data set]
Name of subsamples : Normal [up to 9 characters]
Starburst
Need detail print out ? : N
Print full K-M? : Y
Print out differential
form of K-M? : N
Print original data? : N
Output file? : Y
Output file name? : gal2.out [up to 9 characters]
Other analysis? : N
\end{verbatim}
A command file that performs the same operations goes as follows, after
answering 'Y' or 'y' where it asks for a command file:
\begin{verbatim}
gal2.dat ... data file
IR Luminosities of Galaxies ... title
1 ... number of variables
1 ... which variable?
IR ... name of the variable
3 ... number of groups
0 1 2 ... indicators of the groups
0 1 0 1 ... first group for testing
second group for testing
need detail print out ? (if Y:1, N:0)
need full K-M print out? (if Y:1, N:0)
0 ... printing differential K-M (yes=1, no=0)
0 ... print original data? (if Y:1, N:0)
Normal ... name of first group
Starburst ... name of second group
gal2.out ... output file
\end{verbatim}
The resulting output is shown in Table 4. It shows that the
hypothesis that the IR luminosities of the normal and starburst
galaxies are drawn from the same distribution is rejected at the 5\%
level in both the Gehan and logrank tests.
\bigskip
\bigskip
\noindent{\large\bf 3.4 BIVAR Instructions and Example}
Suppose one wishes to test for correlation and perform regression
between the optical and infrared luminosities for the galaxy sample in
Table 2. If one runs {\sl\bf ASURV} interactively from the terminal, the
conversation goes as follows:
\begin{verbatim}
Data type : 2 [Bivariate data]
Command file? : N [Instructions from terminal]
Title : Optical-Infrared Relation
[up to 80 characters]
Data file : gal3.dat [up to 9 characters]
Number of Indep. variable: 1
Which methods? : 1 [Cox hazard model]
another one ? : Y
: 4 [EM algorithm regression]
: N
Name of Indep. variable : Optical
Name of Dep. variable : Infrared
Print original data? : N
Save output ? : Y
Name of Output file : gal3.out
Tolerance level ? : N [Default : 1.0E-5]
Initial estimate ? : N
Iteration limits ? : Y
Max. iterations : 50
Other analysis? : N
\end{verbatim}
The user may notice that the above test run includes several input
parameters specific to the EM algorithm (tolerance level through maximum
iterations). Several of the regression procedures require such extra
inputs, particularly Schmitt's two-dimensional Kaplan-Meier estimator
method, which requires specification of the bins.
A command file that performs the same operations goes as follows,
following the request for a command file name:
\begin{verbatim}
Optical-Infrared Relation .... title
gal3.dat .... data file
1 1 2 .... 1. number of independent variables
2. which independent variable
3. number of methods
1 4 .... method numbers (Cox, and EM)
Optical Infrared .... name of indep.& dep
variables
0 .... no print of original data
gal3.out .... output file name
1.0E-5 .... tolerance
0.0 0.0 0.0 0.0 .... estimates of coefficients
50 .... iteration
\end{verbatim}
The resulting output is shown in Table 5. It shows that the
correlation between optical and IR luminosities is significant at a
confidence level P $<$ 0.01\%, and the linear fit is
$\log(L_{IR}) \propto (1.05 \pm 0.08)\,\log(L_{OPT})$. It is important to run all
of the methods in order to get a more complete understanding of the
uncertainties in these estimates.
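The EM fit can also be sketched outside the package. The following is our own simplified Python version of the idea, not {\sl\bf ASURV}'s Fortran code: each upper limit on the dependent variable is replaced by its conditional expectation under the current Gaussian model, ordinary least squares is re-run on the completed data, and the residual scale is updated with the conditional variances, until the coefficients stabilize. On the Table 2 data it recovers a slope close to, though not identical with, the value in Table 5.

```python
import math

def phi(z): return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
def Phi(z): return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# gal3.dat from Table 2: (optical, infrared, upper-limit flag on infrared).
data = [(27.2, 28.5, 0), (25.4, 26.9, 0), (27.2, 29.7, 1), (25.9, 28.1, 1),
        (28.8, 30.1, 0), (25.3, 27.6, 1), (26.5, 29.0, 1), (27.1, 29.0, 0),
        (28.9, 30.2, 0), (29.9, 32.4, 1), (27.0, 28.5, 0), (29.8, 31.1, 0),
        (30.1, 31.9, 0), (29.7, 32.3, 1), (28.4, 30.4, 0), (29.3, 31.8, 0)]

def em_fit(data, tol=1e-5, max_iter=100):
    xs = [x for x, _, _ in data]
    n = len(data)
    a, b, sig = 0.0, 1.0, 1.0                  # crude starting values
    for _ in range(max_iter):
        # E-step: replace each upper limit y by E[Y | Y < y] for the
        # current line; 'extra' accumulates the conditional variances.
        ys, extra = [], 0.0
        for x, y, cen in data:
            mu = a + b * x
            if cen:
                beta = (y - mu) / sig
                h = phi(beta) / Phi(beta)
                ys.append(mu - sig * h)
                extra += sig * sig * (1 - beta * h - h * h)
            else:
                ys.append(y)
        # M-step: ordinary least squares on the completed data.
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (yy - my) for x, yy in zip(xs, ys))
        b_new = sxy / sxx
        a_new = my - b_new * mx
        rss = sum((yy - a_new - b_new * x) ** 2 for x, yy in zip(xs, ys))
        done = abs(b_new - b) < tol and abs(a_new - a) < tol
        a, b, sig = a_new, b_new, math.sqrt((rss + extra) / n)
        if done:
            break
    return a, b, sig

a, b, sig = em_fit(data)
```

This sketch omits the standard-error calculation and uses a maximum-likelihood scale update, so its uncertainty estimates differ from the program's output; it is meant only to show the E-step/M-step structure behind the tolerance, initial-estimate, and iteration-limit prompts.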
To use the Buckley-James method, Spearman's rho, or Schmitt's
method from a command file, the following extra inputs are needed:
\begin{verbatim}
(for B-J method)
1.0e-5 tolerance
30 iteration
(for Spearman's Rho)
1                      print out the ranks used in computation;
                       if not, set to 0
(for Schmitt's)
10 10 bin # for X & Y
10                     if you want to set the binning
                       information, set it to a positive
                       integer; if not, set to 0.
1.e-5 tolerance
30 iteration
0.5 0.5 bin size for X & Y
26.0 27.0 origins for X & Y
1                      print out two dim KM estimate;
                       if not needed, set to 0.
100 # of iterations for bootstrap error
analysis; if you do not want error
analysis, set to 0
-37 negative integer seed for random number
generator used in error analysis.
\end{verbatim}
\newpage
\centerline{\Large\bf Table 1}
\bigskip
\bigskip
\centerline{\large\bf Example of UNIVAR Data Input for TWOST}
\bigskip
\centerline{\large\bf IR,Optical and Radio Luminosities of Normal,}
\centerline{\large\bf Starburst and Seyfert Galaxies}
\begin{verbatim}
____
0 0 28.5 |
0 0 26.9 |
0 -1 29.7 Normal galaxies
0 -1 28.1 |
0 0 30.1 |
0 -1 27.6 ____
1 -1 29.0 |
1 0 29.0 |
1 0 30.2 Starburst galaxies
1 -1 32.4 |
1 0 28.5 |
1 0 31.1 ____
2 0 31.9 |
2 -1 32.3 Seyfert galaxies
2 0 30.4 |
2 0 31.8 ____
| | |
#1 #2 #3
---I---I---------I--
Column # 4 8 17
Note : #1 : indicator of subsamples (or groups)
If you need only the K-M estimator, leave out this indicator.
#2 : indicator of censoring
#3 : variable (in this case, the values are in log form)
\end{verbatim}
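The layout above can be checked mechanically. A minimal Python sketch (ours, splitting on whitespace rather than reading the fixed columns a Fortran format statement would use) parses the Table 1 rows and recovers the group and upper-limit counts that TWOST reports for the Normal/Starburst comparison:

```python
# Rows of Table 1 as they would appear in gal2.dat
# (group indicator, censoring indicator, value).
raw = """\
0  0 28.5
0  0 26.9
0 -1 29.7
0 -1 28.1
0  0 30.1
0 -1 27.6
1 -1 29.0
1  0 29.0
1  0 30.2
1 -1 32.4
1  0 28.5
1  0 31.1
2  0 31.9
2 -1 32.3
2  0 30.4
2  0 31.8
"""

rows = [tuple(float(tok) for tok in line.split()) for line in raw.splitlines()]
groups = {}
for g, cen, val in rows:
    groups.setdefault(int(g), []).append((val, cen == -1))

# Counts for groups 0 and 1, the pair tested in the TWOST example.
n_points = len(groups[0]) + len(groups[1])
n_limits = sum(is_lim for vals in (groups[0], groups[1])
               for _, is_lim in vals)
print(n_points, n_limits)
```

The two printed numbers match the "\# OF DATA POINTS" and "\# OF UPPER LIMITS" lines of the TWOST output in Table 4.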
\newpage
\centerline{\Large\bf Table 2}
\bigskip
\bigskip
\centerline{\large\bf Example of Bivariate Data Input for MULVAR}
\bigskip
\centerline{\large\bf Optical and IR luminosity relation of IRAS galaxies}
\begin{verbatim}
0 27.2 28.5
0 25.4 26.9
-1 27.2 29.7
-1 25.9 28.1
0 28.8 30.1
-1 25.3 27.6
-1 26.5 29.0
0 27.1 29.0
0 28.9 30.2
-1 29.9 32.4
0 27.0 28.5
0 29.8 31.1
0 30.1 31.9
-1 29.7 32.3
0 28.4 30.4
0 29.3 31.8
| | |
#1 #2 #3
_________ ______
Optical IR
---I---------I---------I--
Column # 4 13 22
Note : #1 : indicator of censoring
         #2 : independent variable (may be more than one)
#3 : dependent variable
\end{verbatim}
\newpage
\centerline{\Large\bf Table 3}
\bigskip
\centerline{\large\bf Example of KMESTM Output}
\bigskip
\begin{small}
\begin{verbatim}
KAPLAN-MEIER ESTIMATOR
TITLE : IR Luminosities of Galaxies
DATA FILE : gal1.dat
VARIABLE : IR
# OF DATA POINTS : 6 # OF UPPER LIMITS : 3
VARIABLE RANGE KM ESTIMATOR ERROR
FROM 0.000 TO 26.900 1.000
FROM 26.900 TO 28.500 0.375 0.213
27.600 C
28.100 C
FROM 28.500 TO 30.100 0.167 0.152
29.700 C
FROM 30.100 ONWARDS 0.000 0.000
SINCE THERE ARE LESS THAN 4 UNCENSORED POINTS,
NO PERCENTILES WERE COMPUTED.
MEAN= 27.767 +/- 0.515
DIFFERENTIAL KM ESTIMATOR
BIN CENTER D
26.000 3.750
28.000 1.250
30.000 1.000
32.000 0.000
34.000 0.000
-------
SUM = 6.000
(D GIVES THE ESTIMATED DATA POINTS IN EACH BIN)
INPUT DATA FILE: gal1.dat
CENSORSHIP X
0 28.500
0 26.900
-1 29.700
-1 28.100
0 30.100
-1 27.600
\end{verbatim}
\end{small}
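The Kaplan-Meier column in Table 3 can be reproduced by hand. In this short Python sketch (ours, not the program's Fortran; it assumes no tied values), upper limits are negated into right-censored points, the ordinary product-limit formula is applied, and the step curve is mapped back to the original scale and integrated to recover the mean:

```python
# gal1.dat input from Table 3: IR values, with upper limits flagged.
values   = [28.5, 26.9, 29.7, 28.1, 30.1, 27.6]
is_limit = [False, False, True, True, False, True]

# Negation turns the upper limits into right-censored points.
data = sorted(zip([-v for v in values], is_limit))
n = len(data)
surv = 1.0
steps = []                 # (original value, curve value just below it)
for i, (t, censored) in enumerate(data):
    if not censored:       # an "event" in the flipped scale; n - i at risk
        surv *= (n - i - 1) / (n - i)
        steps.append((-t, 1.0 - surv))

# Integrate the step curve upward from 0 to recover the reported mean.
steps.sort()
mean, prev = 0.0, 0.0
for v, p in steps:
    mean += p * (v - prev)
    prev = v

km = {v: round(p, 3) for v, p in steps}
print(km)                  # K-M estimator on each interval below v
print(round(mean, 3))
```

The dictionary gives the estimator on the interval just below each detected value, matching the VARIABLE RANGE rows of Table 3, and the integrated mean matches the tabulated 27.767 (the sketch omits the error estimate).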
\newpage
\centerline{\Large\bf Table 4}
\bigskip
\centerline{\large\bf Example of TWOST Output}
\bigskip
\begin{small}
\begin{verbatim}
******* TWO SAMPLE TEST ******
TITLE : IR Luminosities of Galaxies
DATA SET : gal2.dat
VARIABLE : IR
TESTED GROUPS ARE Normal AND Starburst
# OF DATA POINTS : 12, # OF UPPER LIMITS : 5
# OF DATA POINTS IN GROUP I : 6
# OF DATA POINTS IN GROUP II : 6
GEHAN`S GENERALIZED WILCOXON TEST -- PERMUTATION VARIANCE
TEST STATISTIC = 1.652
PROBABILITY = 0.0986
GEHAN`S GENERALIZED WILCOXON TEST -- HYPERGEOMETRIC VARIANCE
TEST STATISTIC = 1.687
PROBABILITY = 0.0917
LOGRANK TEST
TEST STATISTIC = 1.814
PROBABILITY = 0.0696
PETO & PETO GENERALIZED WILCOXON TEST
TEST STATISTIC = 1.730
PROBABILITY = 0.0837
PETO & PRENTICE GENERALIZED WILCOXON TEST
TEST STATISTIC = 1.706
PROBABILITY = 0.0881
KAPLAN-MEIER ESTIMATOR
DATA FILE : gal2.dat
VARIABLE : Normal
# OF DATA POINTS : 6 # OF UPPER LIMITS : 3
VARIABLE RANGE KM ESTIMATOR ERROR
FROM 0.000 TO 26.900 1.000
FROM 26.900 TO 28.500 0.375 0.213
27.600 C
28.100 C
FROM 28.500 TO 30.100 0.167 0.152
29.700 C
FROM 30.100 ONWARDS 0.000 0.000
SINCE THERE ARE LESS THAN 4 UNCENSORED POINTS,
NO PERCENTILES WERE COMPUTED.
MEAN= 27.767 +/- 0.515
KAPLAN-MEIER ESTIMATOR
DATA FILE : gal2.dat
VARIABLE : Starburst
# OF DATA POINTS : 6 # OF UPPER LIMITS : 2
VARIABLE RANGE KM ESTIMATOR ERROR
FROM 0.000 TO 28.500 1.000
FROM 28.500 TO 29.000 0.600 0.219
29.000 C
FROM 29.000 TO 30.200 0.400 0.219
FROM 30.200 TO 31.100 0.200 0.179
FROM 31.100 ONWARDS 0.000 0.000
32.400 C
PERCENTILES
75 TH 50 TH 25 TH
17.812 28.750 29.900
MEAN= 29.460 +/- 0.460
\end{verbatim}
\end{small}
\newpage
\centerline{\Large\bf Table 5}
\bigskip
\bigskip
\centerline{\large\bf Example of BIVAR Output}
\bigskip
\begin{center}
\begin{verbatim}
CORRELATION AND REGRESSION PROBLEM
TITLE IS Optical-Infrared Relation
DATA FILE IS gal3.dat
INDEPENDENT DEPENDENT
Optical AND Infrared
NUMBER OF DATA POINTS : 16
UPPER LIMITS IN Y X BOTH MIX
6 0 0 0
CORRELATION TEST BY COX PROPORTIONAL HAZARD MODEL
GLOBAL CHI SQUARE = 18.458 WITH
1 DEGREES OF FREEDOM
PROBABILITY = 0.0000
LINEAR REGRESSION BY PARAMETRIC EM ALGORITHM
INTERCEPT COEFF : 0.1703 +/- 2.2356
SLOPE COEFF 1 : 1.0519 +/- 0.0793
STANDARD DEVIATION : 0.3687
ITERATIONS : 27