Suggested changes to the queryset documentation in viewsforecasting #16

Open · angelicalmcgowan opened this issue Oct 7, 2022 · 6 comments
@angelicalmcgowan (Collaborator) commented Oct 7, 2022

  1. I've drafted a suggestion for a new folder and file name structure in the Documentation folder of the viewsforecasting repo, to make it easier for external users to navigate the content of the repo and understand what they're looking at (and to save us from having to explain this in writing repeatedly). As requested, this has been pushed to the documentation branch of viewsforecasting.

- UPDATE: Structure has been approved by HH; Chandler will implement it.

  2. If possible, I'd also like to apply the following table structure to the queryset documentation markdown files in the Documentation folder of viewsforecasting (the tables showing the variables that go into each queryset in the fatalities002 sub-models), so that externals can consult these files without further instruction from us (a rough sketch of how such a table could be rendered is included after this list):
| Queryset | Source variable (in database) | Table of source variable (in database) | Transformations applied to source variable (in queryset) | Processed variable (in queryset) |
| --- | --- | --- | --- | --- |
| fatalities002_imfweo | ged_sb_best_sum_nokgi | ged2_cm | "missing.fill", "ops.ln" | ln_ged_sb_dep |

- UPDATE: Structure has been approved by HH, but has yet to be implemented.

  3. We previously discussed adding "Date of access" to the queryset documentation, in order to adhere to the new ACLED attribution policy. This would need to show when the source/raw variable data were fetched and ingested from the data provider(s). How do we best do this in practice? Or should it perhaps be added to the codebooks, to be updated upon each data ingestion?
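
For (2), a minimal sketch of how such a table could be rendered to markdown from per-queryset metadata; the row dictionaries and the way the information is collected here are illustrative assumptions, not the actual documentation code:

```python
# Minimal sketch: render the proposed queryset documentation table as markdown.
# The row dictionaries are illustrative; the real code would collect this
# information from the queryset definitions themselves.

rows = [
    {
        "queryset": "fatalities002_imfweo",
        "source_variable": "ged_sb_best_sum_nokgi",
        "source_table": "ged2_cm",
        "transformations": '"missing.fill", "ops.ln"',
        "processed_variable": "ln_ged_sb_dep",
    },
]

header = (
    "| Queryset | Source variable (in database) | Table of source variable (in database) "
    "| Transformations applied to source variable (in queryset) | Processed variable (in queryset) |"
)
lines = [header, "| --- | --- | --- | --- | --- |"]
for row in rows:
    lines.append(
        "| {queryset} | {source_variable} | {source_table} "
        "| {transformations} | {processed_variable} |".format(**row)
    )

print("\n".join(lines))
```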
@angelicalmcgowan (Collaborator Author)

@SofiaNordenving @chandlervincentwilliams Adding you for info now and for follow-up on these points while I'm away, in case they're not implemented before then.

@angelicalmcgowan (Collaborator Author)

@chandlervincentwilliams I forgot to mention (2) above the other day – could you please implement this as well?

@hhegre @chandlervincentwilliams @SofiaNordenving - any thoughts on how to implement (3) above? The core idea is to have a record of when we ingested various datasets to meet attribution policies from our data providers.

@chandlervincentwilliams (Collaborator)

Some comments on the above points:

  1. I was able to add these new folders in the repo and have written some code that should populate the ensembles and surrogate models into these folders on GitHub. Working on the queryset model documentation now.

  2. When I start writing the documentation code in Python, I can see if I am able to add this information; I should be able to.

  3. I think this may be best added to the ingestion script. Perhaps some code that populates a column with the date of ingestion? (A rough sketch of this idea follows below.)
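
For (3), a minimal sketch of what "populating a column with the date of ingestion" could look like, assuming the ingested data arrive as a pandas DataFrame; the function, column, and metadata names are placeholders, not part of the actual ingestion script:

```python
# Minimal sketch: stamp ingested data with a date of access/ingestion.
# Function, column, and metadata names are placeholders for illustration only.
import datetime

import pandas as pd


def stamp_ingestion_date(raw: pd.DataFrame, source_name: str) -> tuple[pd.DataFrame, dict]:
    """Attach today's date to the ingested data and return dataset-level metadata."""
    accessed = datetime.date.today().isoformat()
    data = raw.copy()
    data["date_of_access"] = accessed  # constant column recording when the data were fetched
    metadata = {"source": source_name, "date_of_access": accessed, "n_rows": len(data)}
    return data, metadata
```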

@angelicalmcgowan (Collaborator Author) commented Nov 17, 2022 via email

@SofiaNordenving (Collaborator)

I don't know exactly where this belongs, but I think it could be related to (3). This thought came out of a discussion with Malika this morning, which I think can be summarised like this: when we query data for model creation we need to apply a time lag for data that we are missing; for GED it is 1 month. The problem for a lot of our data now is that it is not updated even though new data exist, so without checking what is actually in the database we don't know what the lag should be. I don't think it is realistic that we will be able to keep all the data up to date with the providers (many of them don't have a consistent update schedule). What would be great is if this documentation somehow included the timing of the last ingested data: not only when it was ingested, but up to what month there is data for that variable. (A rough sketch of how this could be computed is below.)
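
A rough sketch of the "up to what month there is data" part, assuming the data sit in a pandas DataFrame with a (month_id, unit) MultiIndex; the level name and overall setup are assumptions, not the actual database layout:

```python
# Rough sketch: for each variable, find the last month_id with any non-missing value.
# Assumes a DataFrame indexed by (month_id, unit); the level name is a placeholder.
import pandas as pd


def last_month_with_data(df: pd.DataFrame) -> pd.Series:
    """Return, per column, the largest month_id that has at least one non-missing value."""
    has_data = df.notna().groupby(level="month_id").any()
    return has_data.apply(lambda col: col[col].index.max())
```

Something like this could sit next to the date of access in the documentation or codebooks, so both when the data were ingested and how far they reach are visible.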

@hhegre (Collaborator) commented Nov 23, 2022

Excellent idea. Some of this information is at the dataset level, not the predictor level (e.g. access date, last month with updated data). Somewhere in our GitHub system there should be a list of each "dataset" (e.g. ACLED, WDI, VDEM, ...) with this meta-information, and some link to the individual features we extract and ingest from them. For example, something like the table below.
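
To make that concrete, one possible shape for such a dataset-level list (the dates are placeholders, not actual access dates; the last column would link to the per-queryset documentation):

| Dataset | Date of access | Last month with updated data | Features extracted / ingested |
| --- | --- | --- | --- |
| ACLED | YYYY-MM-DD | YYYY-MM | (link to feature list) |
| WDI | YYYY-MM-DD | YYYY-MM | (link to feature list) |
| VDEM | YYYY-MM-DD | YYYY-MM | (link to feature list) |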
