Suggested changes to the queryset documentation in viewsforecasting #16

Open · angelicalmcgowan opened this issue Oct 7, 2022 · 6 comments
@angelicalmcgowan (Collaborator) commented Oct 7, 2022

  1. I've drafted a suggestion for a new folder and file name structure in the Documentation folder of the viewsforecasting repo, to make it easier for external users to navigate the content of the repo and understand what they're looking at (and to save us from having to explain this in writing repeatedly). As requested, this has been pushed to the documentation branch of viewsforecasting.

- UPDATE: Structure has been approved by HH; Chandler will implement it.

  2. If possible, I'd also like to apply the following table structure to the queryset documentation markdown files in the Documentation folder of viewsforecasting (the tables showing the variables that go into each queryset in the fatalities002 sub-models), so that externals can consult these files without further instruction from us (a rough sketch of how such a table could be rendered is included after this list):
| Queryset | Source variable (in database) | Table of source variable (in database) | Transformations applied to source variable (in queryset) | Processed variable (in queryset) |
| --- | --- | --- | --- | --- |
| fatalities002_imfweo | ged_sb_best_sum_nokgi | ged2_cm | "missing.fill", "ops.ln" | ln_ged_sb_dep |

- UPDATE: Structure has been approved by HH, but has yet to be implemented.

  3. We previously discussed adding "Date of access" to the queryset documentation, in order to adhere to the new ACLED attribution policy. This would need to show when the source/raw variable data were fetched and ingested from the data provider(s). How do we best do this in practice? Or should it perhaps be added to the codebooks, to be updated upon each data ingestion?
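
For (2), a minimal sketch of how such a table could be rendered to markdown from per-queryset metadata; the row dictionaries and the way the information is collected here are illustrative assumptions, not the actual documentation code:

```python
# Minimal sketch: render the proposed queryset documentation table as markdown.
# The row dictionaries are illustrative; the real code would collect this
# information from the queryset definitions themselves.

rows = [
    {
        "queryset": "fatalities002_imfweo",
        "source_variable": "ged_sb_best_sum_nokgi",
        "source_table": "ged2_cm",
        "transformations": '"missing.fill", "ops.ln"',
        "processed_variable": "ln_ged_sb_dep",
    },
]

header = (
    "| Queryset | Source variable (in database) | Table of source variable (in database) "
    "| Transformations applied to source variable (in queryset) | Processed variable (in queryset) |"
)
lines = [header, "| --- | --- | --- | --- | --- |"]
for row in rows:
    lines.append(
        "| {queryset} | {source_variable} | {source_table} "
        "| {transformations} | {processed_variable} |".format(**row)
    )

print("\n".join(lines))
```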
@angelicalmcgowan (Collaborator Author)

@SofiaNordenving @chandlervincentwilliams Adding you for info now and for follow-up on these points while I'm away, in case they're not implemented before then.

@angelicalmcgowan (Collaborator Author)

@chandlervincentwilliams I forgot to mention (2) above the other day – could you please implement this as well?

@hhegre @chandlervincentwilliams @SofiaNordenving - any thoughts on how to implement (3) above? The core idea is to have a record of when we ingested various datasets to meet attribution policies from our data providers.

@chandlervincentwilliams (Collaborator)

Some comments on the above points:

  1. I was able to add these new folders in the repo and have written some code that should populate the ensembles and surrogate models into these folders on GitHub. Working on the queryset model documentation now.

  2. When I start writing the documentation code in Python, I can see if I am able to add this information; I should be able to.

  3. I think this may be best added to the ingestion script. Perhaps some code that populates a column with the date of ingestion? (A rough sketch of this idea follows below.)
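
For (3), a minimal sketch of what "populating a column with the date of ingestion" could look like, assuming the ingested data arrive as a pandas DataFrame; the function, column, and metadata names are placeholders, not part of the actual ingestion script:

```python
# Minimal sketch: stamp ingested data with a date of access/ingestion.
# Function, column, and metadata names are placeholders for illustration only.
import datetime

import pandas as pd


def stamp_ingestion_date(raw: pd.DataFrame, source_name: str) -> tuple[pd.DataFrame, dict]:
    """Attach today's date to the ingested data and return dataset-level metadata."""
    accessed = datetime.date.today().isoformat()
    data = raw.copy()
    data["date_of_access"] = accessed  # constant column recording when the data were fetched
    metadata = {"source": source_name, "date_of_access": accessed, "n_rows": len(data)}
    return data, metadata
```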

@angelicalmcgowan (Collaborator Author) commented Nov 17, 2022 via email

@SofiaNordenving (Collaborator)

I don't know exactly where this belongs, but I think it could be related to (3). This thought came out of a discussion with Malika this morning, which I think can be summarised like this: when we query data for model creation we need to apply a time lag for data that we are missing; for GED it is 1 month. The problem for a lot of our data now is that it is not updated even though new data exist, so without checking what is actually in the database we don't know what the lag should be. I don't think it is realistic that we will be able to keep all the data up to date with the providers (many of them don't have a consistent update schedule). What would be great is if this documentation somehow included the timing of the last ingested data: not only when it was ingested, but up to what month there is data for that variable. (A rough sketch of how this could be computed is below.)
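
A rough sketch of the "up to what month there is data" part, assuming the data sit in a pandas DataFrame with a (month_id, unit) MultiIndex; the level name and overall setup are assumptions, not the actual database layout:

```python
# Rough sketch: for each variable, find the last month_id with any non-missing value.
# Assumes a DataFrame indexed by (month_id, unit); the level name is a placeholder.
import pandas as pd


def last_month_with_data(df: pd.DataFrame) -> pd.Series:
    """Return, per column, the largest month_id that has at least one non-missing value."""
    has_data = df.notna().groupby(level="month_id").any()
    return has_data.apply(lambda col: col[col].index.max())
```

Something like this could sit next to the date of access in the documentation or codebooks, so both when the data were ingested and how far they reach are visible.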

@hhegre (Collaborator) commented Nov 23, 2022

Excellent idea. Some of this information is at the dataset level, not the predictor level (e.g. access date, last month with updated data). Somewhere in our GitHub system there should be a list of each "dataset" (e.g. ACLED, WDI, VDEM, ...) with this meta-information, and some link to the individual features we extract and ingest from them. For example, something like the table below.
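
To make that concrete, one possible shape for such a dataset-level list (the dates are placeholders, not actual access dates; the last column would link to the per-queryset documentation):

| Dataset | Date of access | Last month with updated data | Features extracted / ingested |
| --- | --- | --- | --- |
| ACLED | YYYY-MM-DD | YYYY-MM | (link to feature list) |
| WDI | YYYY-MM-DD | YYYY-MM | (link to feature list) |
| VDEM | YYYY-MM-DD | YYYY-MM | (link to feature list) |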
