Consider lessening # of CSV files #32

e-lo · 2018-12-13T21:38:03Z

From a tester:

We do not have “Point of interest” (POI) and “Project” separated into different tables. Also, our “Forecast” and “Scenario” tables are combined into one table called “TrafficForecast”. This might be a very small thing, but I am curious to know why they were separated (especially the POI one). One of the two reasons I can think of is that combining them would make the CSV file bulky.

However, with separate tables, the processing time for analysis (joining multiple tables) will increase. Also, since the tables are related to each other, more QC queries will be required to check that they are properly linked. An Access DB does not let you add a record that is not properly linked to the other tables when a table relationship is defined. With CSV files, however, I think the user will only know whether the mapping is correct after uploading all the data and running the QC script in Python. Therefore, it would be better to have fewer unique identifier (ID) fields in the entire DB.

Another reason I could think of for having a separate POI table is to avoid duplicating the POI or facility name. However, I feel it is rare for the same segment to go through construction more than 2 times without a change in Area Type (AT) and functional class (FC).

For the Scenario table, I think they are expecting to get multiple run results (which can be true in the case of transit). So it might be alright to have two different tables. However, in the data we obtained from the state agencies, we always got the final run results (not the intermediate ones). In the cases where we had multiple forecasts, it was due to human error. The one situation where we could have gotten results from different runs was when we were doing deep dives. But I doubt the states maintain and update that level of data.
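The linkage check the tester describes (what Access enforces through table relationships) can be sketched in pandas as a lookup for "orphan" rows. This is only an illustration under assumed names: the tables, the `project_id` key, and the `find_orphans` helper are hypothetical and are not taken from the project's schema or QC script.

```python
import pandas as pd

# Hypothetical tables and column names, for illustration only.
projects = pd.DataFrame({"project_id": [1, 2], "name": ["A", "B"]})
forecasts = pd.DataFrame({"forecast_id": [10, 11, 12], "project_id": [1, 2, 3]})


def find_orphans(child, parent, key):
    """Return rows of `child` whose `key` value has no match in `parent`."""
    return child[~child[key].isin(parent[key])]


# Forecast rows that point at a project_id missing from the projects table.
orphans = find_orphans(forecasts, projects, "project_id")
print(orphans)
```

A check like this could run per file before upload, so a mapping error shows up early rather than after all the data is loaded.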

e-lo · 2018-12-13T21:45:31Z

One of the many issues of creating a DB w/out a DB :-)

To respond to a few things:

Multiple forecasts/Scenario: As a forecaster I routinely had many (dozens?) of forecasts for a specific scenario; as models and methodology were updated, we were asked to update forecasts. So I think it is important to be able to have multiple "model runs" per scenario. This will be increasingly important as we move towards systems where we examine results from many different models and outcomes, which is happening in some places now, and more so now that TMIP has been endorsing it.

One thing that is difficult is having multiple public forecasts for a single scenario/project, so perhaps only one should be kept public. Having the ability to make the others private will likely increase the chance of people studying which forecasts were good and why, and having lots of forecasts documented using various methodologies and assumptions for the same scenario and project will help us get better over time.

The multiple-file approach is basically meant to support "letting a file lie" once it is complete, with no need to edit it further, and to keep the files legible horizontally in a browser.
The merge time for the files is negligible in pandas, which has a very efficient merge algorithm.
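As a rough sketch of what that merge looks like, the separate files can be joined back into one flat table with `DataFrame.merge`. The table contents and column names below are hypothetical stand-ins, not the project's actual schema; `validate="many_to_one"` makes pandas raise an error if a key is duplicated on the lookup side.

```python
import pandas as pd

# Hypothetical stand-ins for the separate CSV files; in practice each
# would be loaded with pd.read_csv(...).
projects = pd.DataFrame(
    {"project_id": [1, 2], "project_name": ["Bridge", "Bypass"]}
)
scenarios = pd.DataFrame(
    {"scenario_id": [100, 101], "project_id": [1, 2]}
)
forecasts = pd.DataFrame(
    {"forecast_id": [10, 11, 12], "scenario_id": [100, 100, 101]}
)

# Several forecasts ("model runs") can share one scenario, and each
# scenario belongs to one project, so both joins are many-to-one.
flat = forecasts.merge(
    scenarios, on="scenario_id", how="left", validate="many_to_one"
).merge(
    projects, on="project_id", how="left", validate="many_to_one"
)
print(flat)
```

The flat table is rebuilt on demand, so the individual files can stay untouched once they are complete.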

e-lo added the flow improvement label Dec 13, 2018

e-lo assigned e-lo and gregerhardt Dec 13, 2018

e-lo added the schema (Something with the schema itself) label Jan 4, 2019
