Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uniform detection function without adjustments giving error #79

Closed
LHMarshall opened this issue Mar 21, 2023 · 11 comments
Closed

Uniform detection function without adjustments giving error #79

LHMarshall opened this issue Mar 21, 2023 · 11 comments
Assignees
Labels
Milestone

Comments

@LHMarshall
Copy link
Member

data(book.tee.data)
region <- book.tee.data$book.tee.region
egdata <- book.tee.data$book.tee.dataframe
samples <- book.tee.data$book.tee.samples
obs <- book.tee.data$book.tee.obs

# fit a half-normal detection function
result <- ddf(dsmodel=~mcds(key="unif", formula=~1), data=egdata, method="ds",
              meta.data=list(width=4))

image

@erex
Copy link
Member

erex commented Mar 22, 2023

@LHMarshall see related issue
DistanceDevelopment/Distance#151

about knock-on consequences of uniform detection function and Hessian for dht2

@erex
Copy link
Member

erex commented May 12, 2023

@erex
Copy link
Member

erex commented May 30, 2023

@LHMarshall
Copy link
Member Author

LHMarshall commented Jun 7, 2023

@lenthomas how should the expected cluster size be calculated in the case of a uniform detection function with no adjustments?

I need an alternative for this situation than the current code based on all sorts that are NULL in this situation. I guess there is no variability from the N_hat and Nc_hat values so is it just the variability observed in the E[cluster size] values? The standard deviation of the cluster sizes observed in each stratum divided by n-1?

    # This computes the se(E(s)). It essentially uses 3.37 from Ads but in
    # place of using 3.25, 3.34 and 3.38, it uses 3.27, 3.35 and an equivalent
    # cov replacement term for 3.38. This uses line to line variability
    # whereas the other formula measure the variance of E(s) within the lines
    # and it goes to zero as p approaches 1.
    if(se & options$varflag!=1){
      numRegions <- length(unique(samples$Region.Label))
      if(options$varflag==2){
        cov.Nc.Ncs <- rep(0, numRegions)
        scale <- clusters$summary$Area/clusters$summary$CoveredArea

        for(i in 1:numRegions){
          c.stratum.data <- clusters$Nhat.by.sample[
              as.character(clusters$Nhat.by.sample$Region.Label) ==
              as.character(region.table$Region.Label[i]), ]

          i.stratum.data <- individuals$Nhat.by.sample[
              as.character(individuals$Nhat.by.sample$Region.Label) ==
              as.character(region.table$Region.Label[i]), ]

          Li <- sum(c.stratum.data$Effort.x)
          cov.Nc.Ncs[i] <- covn(c.stratum.data$Effort.x/(scale[i]*Li),
                                c.stratum.data$Nhat/scale[i],
                                i.stratum.data$Nhat/scale[i],
                                options$ervar)
        }
      }else{
        cov.Nc.Ncs <- as.vector(by(obs$size*(1 - obs$pdot)/obs$pdot^2,
                                   obs$Region.Label, sum))
        cov.Nc.Ncs[is.na(cov.Nc.Ncs)] <- 0
      }

      cov.Nc.Ncs[is.nan(cov.Nc.Ncs)] <- 0
      if(numRegions > 1){
        cov.Nc.Ncs <- c(cov.Nc.Ncs, sum(cov.Nc.Ncs))
      }
      cov.Nc.Ncs <- cov.Nc.Ncs +
                     diag(t(clusters$vc$detection$partial)%*%
                          solvecov(model$hessian)$inv%*%
                          individuals$vc$detection$partial)
      se.Expected.S <- as.vector(clusters$N$cv)^2 +
                        as.vector(individuals$N$cv)^2 -
                        2*cov.Nc.Ncs/
                         (as.vector(individuals$N$Estimate)*
                          as.vector(clusters$N$Estimate))
      Expected.S[is.nan(Expected.S)] <- 0
      se.Expected.S[se.Expected.S<=0 | is.nan(se.Expected.S)] <- 0
      se.Expected.S <- as.vector(Expected.S)*sqrt(se.Expected.S)

      Expected.S <- data.frame(Region        = clusters$N$Label,
                               Expected.S    = as.vector(Expected.S),
                               se.Expected.S = as.vector(se.Expected.S))
    }else{
      Expected.S[is.nan(Expected.S)] <- 0
      Expected.S <- data.frame(Region     = clusters$N$Label,
                               Expected.S = as.vector(Expected.S))
    }

@LHMarshall
Copy link
Member Author

It no longer crashes... but it omits the se of the expected cluster size. I notice that there is still variability in the abundance estimates due to the encounter rate variability which will need to be incorporated into the se of the expected cluster size.

>   dfm <- ds( dat , 
+              transect = "line",
+              truncation = 400,
+              key = "unif",
+              optimizer = "R",
+              dht_se = TRUE)
Starting AIC adjustment term selection.
Fitting uniform key function
AIC= 18142.155
Fitting uniform key function with cosine(1) adjustments
AIC= 18143.256

Uniform key function selected.
> summary(dfm)

Summary for distance analysis 
Number of observations :  1514 
Distance range         :  0  -  400 

Model       : Uniform key function 
AIC         :  18142.15 
Optimisation:  mrds (nlminb) 

Detection function parameters
Scale coefficient(s):  
NULL

                    Estimate
Average p                  1
N in covered region     1514

Summary for clusters

Summary statistics:
  Region      Area CoveredArea   Effort    n   k           ER        se.ER      cv.ER
1    20A  6536.115 12721112000 15901390  390 305 2.452616e-05 1.075680e-06 0.04385848
2    35A  3486.357  7406071200  9257589  321 134 3.467425e-05 5.024032e-06 0.14489228
3    35B  3433.566 10967019200 13708774  307 194 2.239442e-05 2.196862e-06 0.09809864
4    53A  3508.205  6576117600  8220147  209 171 2.542534e-05 1.146461e-06 0.04509128
5    53B  4125.793  7253688800  9067111  287 164 3.165286e-05 2.784125e-06 0.08795807
6  Total 21090.037 44924008800 56155011 1514 968 2.740042e-05 1.123309e-06 0.04099607

Abundance:
  Label     Estimate           se         cv          lcl          ucl       df
1   20A 2.003822e-04 8.788461e-06 0.04385848 1.838212e-04 0.0002184353 304.0000
2   35A 1.511085e-04 2.189446e-05 0.14489228 1.136239e-04 0.0002009593 133.0000
3   35B 9.611589e-05 9.428838e-06 0.09809864 7.924424e-05 0.0001165796 193.0000
4   53A 1.114966e-04 5.027526e-06 0.04509128 1.020057e-04 0.0001218706 170.0000
5   53B 1.632414e-04 1.435840e-05 0.08795807 1.372605e-04 0.0001941401 163.0000
6 Total 7.223447e-04 2.961329e-05 0.04099607 6.664233e-04 0.0007829587 374.6173

Density:
  Label     Estimate           se         cv          lcl          ucl       df
1   20A 3.065770e-08 1.344600e-09 0.04385848 2.812392e-08 3.341975e-08 304.0000
2   35A 4.334282e-08 6.280040e-09 0.14489228 3.259103e-08 5.764163e-08 133.0000
3   35B 2.799302e-08 2.746077e-09 0.09809864 2.307928e-08 3.395293e-08 193.0000
4   53A 3.178167e-08 1.433076e-09 0.04509128 2.907632e-08 3.473874e-08 170.0000
5   53B 3.956608e-08 3.480156e-09 0.08795807 3.326888e-08 4.705522e-08 163.0000
6 Total 3.425052e-08 1.404137e-09 0.04099607 3.159896e-08 3.712458e-08 374.6173

Summary for individuals

Summary statistics:
  Region      Area CoveredArea   Effort    n   k           ER        se.ER     cv.ER mean.size    se.mean
1    20A  6536.115 12721112000 15901390  264 305 1.660232e-05 3.116001e-06 0.1876847 0.6769231 0.07176285
2    35A  3486.357  7406071200  9257589  647 134 6.988861e-05 1.730274e-05 0.2475760 2.0155763 0.14483992
3    35B  3433.566 10967019200 13708774  483 194 3.523291e-05 9.024984e-06 0.2561521 1.5732899 0.15410961
4    53A  3508.205  6576117600  8220147  238 171 2.895325e-05 5.958753e-06 0.2058060 1.1387560 0.14945122
5    53B  4125.793  7253688800  9067111  511 164 5.635753e-05 1.051900e-05 0.1866476 1.7804878 0.15575080
6  Total 21090.037 44924008800 56155011 2143 968 3.827586e-05 4.060756e-06 0.1060918 1.4154557 0.06098587

Abundance:
  Label     Estimate           se        cv          lcl          ucl       df
1   20A 0.0001356434 2.545818e-05 0.1876847 9.405647e-05 0.0001956178 304.0000
2   35A 0.0003045708 7.540442e-05 0.2475760 1.880057e-04 0.0004934072 133.0000
3   35B 0.0001512182 3.873485e-05 0.2561521 9.197454e-05 0.0002486224 193.0000
4   53A 0.0001269674 2.613066e-05 0.2058060 8.493381e-05 0.0001898035 170.0000
5   53B 0.0002906494 5.424901e-05 0.1866476 2.016855e-04 0.0004188555 163.0000
6 Total 0.0010090492 1.070519e-04 0.1060918 8.195958e-04 0.0012422956 420.9479

Density:
  Label     Estimate           se        cv          lcl          ucl       df
1   20A 2.075290e-08 3.895002e-09 0.1876847 1.439027e-08 2.992876e-08 304.0000
2   35A 8.736076e-08 2.162843e-08 0.2475760 5.392612e-08 1.415251e-07 133.0000
3   35B 4.404114e-08 1.128123e-08 0.2561521 2.678688e-08 7.240938e-08 193.0000
4   53A 3.619157e-08 7.448441e-09 0.2058060 2.421005e-08 5.410273e-08 170.0000
5   53B 7.044692e-08 1.314875e-08 0.1866476 4.888405e-08 1.015212e-07 163.0000
6 Total 4.784483e-08 5.075945e-09 0.1060918 3.886175e-08 5.890438e-08 420.9479

Expected cluster size
  Region Expected.S
1    20A  0.6769231
2    35A  2.0155763
3    35B  1.5732899
4    53A  1.1387560
5    53B  1.7804878
6  Total  1.3969080

LHMarshall added a commit that referenced this issue Jun 8, 2023
…late the se of expected cluster size

Reference #79
@lenthomas
Copy link
Member

Hi @LHMarshall I'm afraid I'm not familiar with the code here at all, but I do think I know what should happen from the statistics perspective. When using a uniform+no adj then the expected cluster size is the mean cluster size, so the standard error of mean cluster size is the standard deviation of cluster size divided by sqrt (n). (Note, not n-1.) This formula can be used at both the stratum level and, presumably, at the global level. I say "presumably" because I assume that the "Total" expected cluster size is simply the mean cluster size taken over all observed individuals. It may not be that simple however -- for example it might be an area-weighted mean if the samples in each stratum are taken to be representative of the groups in that stratum and the strata differ in area. So it's important to know how the "Total" E(S) is arrived at in order to know exactly how to calculate the "Total" se(E(S)). Hope that makes sense.

@LHMarshall
Copy link
Member Author

@lenthomas Ok I've looked into how the E(S) values are calculated. The total E(S) value looks to be a weighted mean based on the estimated abundance / density which makes sense. What does that mean for calculating the se of the total E(S) as there will be variability associated with the values which the weighted mean is based on due to encounter rate?

See exploratory code below

> fit <- ds(dist.dat,
+           key = "unif",
+           nadj = 0,
+           truncation = 0.5)
Fitting uniform key function
AIC= -346.574
> 
> fit.sum <- summary(fit)
> 
> fit.sum$dht$individuals$N$Estimate
[1] 1024.239  285.000 2100.000 3409.239
> fit.sum$dht$clusters$N$Estimate
[1]  84.13043  84.64286  85.00000 253.77329
> # The Expected.S values are calculated as
> fit.sum$dht$individuals$N$Estimate/fit.sum$dht$clusters$N$Estimate
[1] 12.174419  3.367089 24.705882 13.434192
> 
> fit.sum$dht$Expected.S$Expected.S
[1] 12.174419  3.367089 24.705882 13.434192
> 
> tmp <- na.omit(dist.dat2)
> 
> # This is simply the mean of the observed clusters in each stratum for the stratum estimates
> mean.size.by.strat <- by(tmp$size, tmp$Region.Label, mean)
> mean.size.by.strat
tmp$Region.Label: central
[1] 12.17442
------------------------------------------------------- 
tmp$Region.Label: NE
[1] 3.367089
------------------------------------------------------- 
tmp$Region.Label: SW
[1] 24.70588
> 
> # But it is not quite just the overall mean for the global estimate
> mean(tmp$size)
[1] 13.652
> # But it's definately not a weighted mean in terms of area...
> # must be a weighted mean in terms of estimated density/abundance?
> sum(mean.size.by.strat)/3
[1] 13.4158
> area.props <- region@area/sum(region@area)
> sum(mean.size.by.strat*area.props)
[1] 7.407668
> 
> # Yup this seems to work but the estimated density / abundance has some variability associated with it...
> Nhat.props <- fit.sum$dht$clusters$N$Estimate[1:3]/sum(fit.sum$dht$clusters$N$Estimate[1:3])
> sum(mean.size.by.strat*Nhat.props)
[1] 13.43419

Simulated data used different areas and different densities in each stratum - also different mean cluster sizes

image

@lenthomas
Copy link
Member

Hmm... that means E(s) is estimated likely using hat(N) / hat(N)_s, i.e., ratio of estimated total population size to estimated cluster size (see formula 3.36 in Marques and Buckland 2004 - chapter 3 of the Advanced Distance Sampling book (although my notation is different)). Equation 3.37 has the variance calculation and 3.38 has the relevant covariate. Does that help?

@LHMarshall
Copy link
Member Author

LHMarshall commented Jun 19, 2023

@lenthomas and therein lies the problem... with the uniform model and no adjustments we now have no hessian matrix as required in 3.38. Is the covariance just 0 then if that is entirely associated with the detection function estimation? I assumed you meant covariance rather than covariates above? With the uniform and no adjustments you cannot have covariates right?

@lenthomas
Copy link
Member

That's right no covariates with uniform.

@lenthomas lenthomas self-assigned this Jun 20, 2023
@lenthomas lenthomas added the bug label Jun 20, 2023
@lenthomas lenthomas added this to the mrds 2.3.0 milestone Jun 20, 2023
LHMarshall added a commit that referenced this issue Jun 26, 2023
@LHMarshall
Copy link
Member Author

@lenthomas ok so I just made the covariance 0 in this case (commit should be reference above).... my output matches my initial understanding of formula 3.37 (but see below as I realised that the values I can get at easily are not the ones in the formula in the book) and also these estimates bear no resemblance to the group sd(size)/sqrt(n)! See image below...

So to check those values (and they match) I calculated
E[s]^2 * {var_hat(N_hat)/N_hat^2 + var_hat(Ns_hat)/Ns_hat^2)
where N_hat is the total estimated abundance of individuals and Ns is the total estimated abundance of clusters.
In the book it uses the estimates of individual and cluster abundance for the covered area not the whole region.

I couldn't easily get at the relevant values for cluster and individual estimates for the covered areas by stratum which are the values that 3.37 is based on. But I think the above is ok because of the rules for multiplication by a constant which is all that has happened to these values right?

image
> fit.sum

Summary for distance analysis 
Number of observations :  248 
Distance range         :  0  -  0.5 

Model       : Uniform key function 
AIC         :  -343.801 
Optimisation:  mrds (nlminb) 

Detection function parameters
Scale coefficient(s):  
NULL

                    Estimate
Average p                  1
N in covered region      248

Summary for clusters

Summary statistics:
   Region  Area CoveredArea Effort   n  k        ER      se.ER
1 central 337.5         330    330  82 22 0.2484848 0.03012937
2      NE  37.5          40     40  82  8 2.0500000 0.36005952
3      SW 150.0         150    150  84 15 0.5600000 0.04760952
4   Total 525.0         520    520 248 45 0.4661688 0.03495183
       cv.ER
1 0.12125235
2 0.17563879
3 0.08501701
4 0.07497677

Abundance:
    Label  Estimate        se         cv       lcl      ucl       df
1 central  83.86364 10.168663 0.12125235  65.23222 107.8165 21.00000
2      NE  76.87500 13.502232 0.17563879  50.90757 116.0882  7.00000
3      SW  84.00000  7.141428 0.08501701  70.02145 100.7691 14.00000
4   Total 244.73864 18.349713 0.07497677 209.43403 285.9946 20.82922

Density:
    Label  Estimate         se         cv       lcl       ucl       df
1 central 0.2484848 0.03012937 0.12125235 0.1932807 0.3194563 21.00000
2      NE 2.0500000 0.36005952 0.17563879 1.3575351 3.0956841  7.00000
3      SW 0.5600000 0.04760952 0.08501701 0.4668097 0.6717941 14.00000
4   Total 0.4661688 0.03495183 0.07497677 0.3989220 0.5447516 20.82922

Summary for individuals

Summary statistics:
   Region  Area CoveredArea Effort    n  k        ER     se.ER
1 central 337.5         330    330  982 22  2.975758 0.3960008
2      NE  37.5          40     40  268  8  6.700000 1.1038892
3      SW 150.0         150    150 2102 15 14.013333 1.1491971
4   Total 525.0         520    520 3352 45  6.395368 0.4228860
      cv.ER mean.size   se.mean
1 0.1330756 11.975610 0.4032298
2 0.1647596  3.268293 0.1691760
3 0.0820074 25.023810 0.5432991
4 0.0661238 13.516129 0.6161852

Abundance:
    Label Estimate        se        cv       lcl       ucl       df
1 central 1004.318 133.65028 0.1330756  762.4461 1322.9197 21.00000
2      NE  251.250  41.39585 0.1647596  170.6236  369.9757  7.00000
3      SW 2102.000 172.37956 0.0820074 1763.4913 2505.4867 14.00000
4   Total 3357.568 222.01515 0.0661238 2934.3388 3841.8413 30.87844

Density:
    Label  Estimate        se        cv       lcl       ucl       df
1 central  2.975758 0.3960008 0.1330756  2.259100  3.919762 21.00000
2      NE  6.700000 1.1038892 0.1647596  4.549962  9.866017  7.00000
3      SW 14.013333 1.1491971 0.0820074 11.756609 16.703245 14.00000
4   Total  6.395368 0.4228860 0.0661238  5.589217  7.317793 30.87844

Expected cluster size
   Region Expected.S se.Expected.S cv.Expected.S
1 central  11.975610     2.1559842    0.18003127
2      NE   3.268293     0.7870732    0.24082090
3      SW  25.023810     2.9558941    0.11812326
4   Total  13.718995     1.3714792    0.09996936

LHMarshall added a commit that referenced this issue Jul 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants