Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential bug when using Pred_TF and covariates #304

Open
Cole-Monnahan-NOAA opened this issue May 5, 2021 · 2 comments
Open

Potential bug when using Pred_TF and covariates #304

Cole-Monnahan-NOAA opened this issue May 5, 2021 · 2 comments

Comments

@Cole-Monnahan-NOAA
Copy link
Collaborator

@zoyafuso-NOAA discovered some strange behavior when trying to do a simulation, he and @Lewis-Barnett-NOAA and I have been digging into it a bit.

We're convinced there's a bug. Below is a reproducible example. Their goal is to take a set of data and add on a set of points at which to make predictions with the simulation feature. But these aren't real data hence the use of Pred_TF=1 for these points.

Let model 'orig' be fit just to the real data, and model 'pred' fit to the same data with the predictive points appended to the data and covariate_data. We expect the model fits to be identical.

However they are not in the case that:

  • The predictive points have different lat/lon positions
  • The covariate info for the predictive points is included
  • A density covariate is included

In this case the JNLL and MLEs are identical, but the index is different. I was able to track the difference down as far as eta2_gtp which is calculated like this

eta2_gct(g,c,t) += (gamma2_cp(c,p) + Xi2_gcp(g,c,p)) * X2_gctp(g,c,t,p);

Since gamma is identical (it's a parameter), and Xi2_gcp are the same, it must be a difference in the X2_gctp variables. I don't think those are saved in the Report. I suspect this is related to how covariates are extrapolated and such but don't really understand that part well.

I see two related issues here. (1) Why does adding out-of-sample data change the index? (2) Why is it even possible to use a covariate_data input that has fewer rows than the main inputs (n_i)? And why does that resolve this index difference? What values does it use for the predictive points? @James-Thorson-NOAA Do you have any ideas why this might occur? You can run the code below to reproduce.

To get models orig and pred to match, either drop the covariate or use covdat_orig for the covariate_data for the pred model. Or drop the jittering of longitude. All three actions result in identical models.

@Cole-Monnahan-NOAA
Copy link
Collaborator Author

Oops I forgot to attach the code. Here it is:

## I installed the development branches of these on 2021-05-04
packageVersion('FishStatsUtils') # 2.10.0
packageVersion('VAST')           # 3.8.0


library(VAST)
example <- load_example( data_set="covariate_example" )
dat_orig <- subset(as.data.frame(example$sampling_data), Year==2001)
covdat_orig <- subset(example$covariate_data, Year==2001)
covdat_orig$LOG_DEPTH <- log(covdat_orig$BOT_DEPTH)
covdat_orig$Year <- NA
covdat_orig <- covdat_orig[,c('Lon', 'Lat', 'LOG_DEPTH', 'Year')]

## Now add some arbitrary prediction points onto it, taking 200
## known points and jittering them slightly
dat_pred <- rbind(dat_orig, dat_orig[1:200,])
covdat_pred <- rbind(covdat_orig, covdat_orig[1:200,])
pred_TF <- rep(1, nrow(dat_pred)); pred_TF[1:nrow(dat_orig)] <- 0
## jitter the positions
set.seed(123423)
dat_pred$Lon <- dat_pred$Lon+rnorm(nrow(dat_pred),0,.1)*pred_TF
covdat_pred$Lon <- dat_pred$Lon

settings <- FishStatsUtils::make_settings(
  Version="VAST_v13_0_0", n_x=50,
  Region=example$Region, purpose="index2",
  fine_scale=FALSE, bias.correct=FALSE,
  ObsModel=c(2, 1), max_cells=Inf)
settings$FieldConfig[1:2,] <- 0

## Fit the model twice. Theoretically these should be identical
## b/c the new predicted points have no influence on the
## likelihood.
fit_orig <- fit_model(settings=settings,
                      getsd=FALSE, newtonsteps=1,
                      working_dir=paste0(getwd(), "/VAST_fit1/"),
                      Lat_i=dat_orig$Lat, Lon_i=dat_orig$Lon,
                      t_i=dat_orig$Year, b_i=dat_orig$Catch_KG,
                      a_i=dat_orig$AreaSwept_km2,
                      X2_formula= ~ LOG_DEPTH,
                      covariate_data=covdat_orig)

fit_pred <- fit_model(settings=settings,
                      getsd=FALSE, newtonsteps=1,
                      working_dir=paste0(getwd(), "/VAST_fit2/"),
                      Lat_i=dat_pred$Lat, Lon_i=dat_pred$Lon,
                      t_i=dat_pred$Year, b_i=dat_pred$Catch_KG,
                      a_i=dat_pred$AreaSwept_km2,
                      X2_formula= ~ LOG_DEPTH,
                      covariate_data=covdat_pred,
                      PredTF_i=pred_TF)

## Compare fits and indices
fit_orig$Report$jnll- fit_pred$Report$jnll
fit_orig$parameter_estimates$par- fit_pred$parameter_estimates$par
fit_orig$Report$Index_ctl[1]-fit_pred$Report$Index_ctl[1]

orig <- fit_orig$Report
pred <- fit_pred$Report
all.equal(orig,pred)
## Xi's are the same
all.equal(orig$Xi2_gcp,pred$Xi2_gcp)

## But the eta2s are different so it must be X2
plot(orig$eta2_gct, pred$eta2_gct)

## plot_results(fit_orig, plot_set=12, working_dir=paste0(getwd(),'/VAST_fit1/'))
## plot_results(fit_pred, plot_set=12, working_dir=paste0(getwd(), '/VAST_fit2/'))

@James-Thorson-NOAA
Copy link
Owner

Thanks for reaching out! My suspicion as you say is that these two runs are doing a different job "behind the scenes" in interpolating the covariates provided by covariate_data. You can see ?make_covariates to see a bit more.

I hope to organize a meeting of us four so we can discuss what you think the intended default behavior should be. This doesn't appear to be a bug in the sense of having a typo in the code, but I guess the point is that VAST isn't doing what you're expecting, so there's some issue of default behavior and/or documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants