Consider lessening # of CSV files #32

Open
e-lo opened this issue Dec 13, 2018 · 1 comment
Labels
flow improvement schema Something with the schema itself

Comments

@e-lo
Owner

e-lo commented Dec 13, 2018

From a tester:

We do not keep “Point of interest” (POI) and “Project” in separate tables. Also, our “Forecast” and “Scenario” data live in a single table called “TrafficForecast”. This might be a very small thing, but I am curious why they were separated (especially the POI one). One of the two reasons I can think of is that combining them would make the CSV file bulky.

However, separate tables increase the processing time for analysis, since multiple tables have to be joined. Also, because the tables are related to each other, more QC queries are required to check that they are properly linked. An Access DB with table relationships defined will not let you add a record that is not properly linked to the other tables. With CSV files, however, I think the user will only know whether the mapping is correct after uploading all the data and running the QC script in Python. Therefore, it would be better to have fewer unique identifier (ID) fields in the entire DB.

The other reason I could think of for having a separate POI table is to avoid duplicating the POI or facility name. However, I feel that the same segment going through construction more than twice without a change in Area Type (AT) and functional class (FC) is rare.

For the Scenario table, I think they are expecting multiple run results (which can be true for transit), so it might be all right to have two different tables. However, in the data we obtained from the state agencies, we always got the final run results (not the intermediate ones). In the cases where we had multiple forecasts, it was due to human error. The one situation where we could have obtained results from different runs was when we were doing deep dives. But I doubt the states maintain and update that level of data.
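
The linkage check the tester describes (finding records whose ID does not match any row in the related table) can be run on the CSV files before anything is uploaded. Below is a minimal sketch using only the Python standard library. The column names (`scenario_id`, `forecast_id`, and so on), the sample rows, and the `find_orphans` helper are hypothetical and are not taken from the project's actual schema or QC script.

```python
import csv
import io


def find_orphans(child_rows, parent_rows, child_key, parent_key):
    """Return child rows whose foreign key has no match in the parent table."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[child_key] not in parent_ids]


# Hypothetical file contents; the real column names may differ.
SCENARIOS_CSV = """scenario_id,project_id,scenario_year
S1,P1,2030
S2,P1,2040
"""

FORECASTS_CSV = """forecast_id,scenario_id,forecast_value
F1,S1,12000
F2,S2,15500
F3,S9,9800
"""

# io.StringIO stands in for open("scenarios.csv") etc. so the sketch is self-contained.
scenarios = list(csv.DictReader(io.StringIO(SCENARIOS_CSV)))
forecasts = list(csv.DictReader(io.StringIO(FORECASTS_CSV)))

# Forecast rows that point at a scenario_id missing from the scenario table.
orphans = find_orphans(forecasts, scenarios, "scenario_id", "scenario_id")
for row in orphans:
    print(f"forecast {row['forecast_id']} references unknown scenario {row['scenario_id']}")
```

A check like this only reports broken links after the fact; unlike an Access relationship, it does not prevent a bad row from being written in the first place.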

@e-lo
Owner Author

e-lo commented Dec 13, 2018

One of the many issues of creating a DB w/out a DB :-)

To respond to a few things:

  1. Multiple forecasts per scenario: As a forecaster, I routinely had many (dozens?) of forecasts for a specific scenario; as models and methodology were updated, we were asked to update forecasts. So I think it is important to be able to have multiple "model runs" per scenario. This will be increasingly important as we move towards systems where we examine results from many different models and outcomes, which is happening in some places already and more so now that TMIP has been endorsing it.

One thing that is difficult is having multiple public forecasts for a single scenario/project, so perhaps only one would be kept public. Having the ability to keep the others private will likely increase the chance of people studying which forecasts were good and why. Having lots of forecasts documented using various methodologies and assumptions for the same scenario and project will help us get better over time.

  2. The multiple-file approach is basically meant to support "letting a file lie" once it is complete, with no need to edit it further, and to keep each file legible horizontally in a browser.

  3. Merging the files takes negligible time in pandas, which has a very efficient merge algorithm.
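
To make the last point concrete, here is a rough sketch of flattening separate tables back into one wide table with `DataFrame.merge`. The table contents and column names (`project_id`, `scenario_id`, `forecast_value`, and so on) are made up for illustration and are not the project's actual schema. Note that scenario `S1` carries two forecasts, the "multiple model runs per scenario" case from point 1.

```python
import pandas as pd

# Hypothetical tables; in practice each would come from pd.read_csv(...).
projects = pd.DataFrame({
    "project_id": ["P1", "P2"],
    "project_name": ["Bridge widening", "New interchange"],
})
scenarios = pd.DataFrame({
    "scenario_id": ["S1", "S2", "S3"],
    "project_id": ["P1", "P1", "P2"],
    "scenario_year": [2030, 2040, 2030],
})
forecasts = pd.DataFrame({
    "forecast_id": ["F1", "F2", "F3", "F4"],
    "scenario_id": ["S1", "S1", "S2", "S3"],  # S1 has two model runs
    "forecast_value": [12000, 12500, 15500, 9800],
})

# Left-join forecasts -> scenarios -> projects.
# validate="many_to_one" raises MergeError if a key is duplicated on the
# right-hand table, which doubles as a basic uniqueness check on the IDs.
flat = (
    forecasts
    .merge(scenarios, on="scenario_id", how="left", validate="many_to_one")
    .merge(projects, on="project_id", how="left", validate="many_to_one")
)

print(flat)
```

Passing `indicator=True` to `merge` adds a `_merge` column that flags rows with no match on the right-hand side, which is one way to surface linkage problems during the same step.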

@e-lo e-lo added the schema Something with the schema itself label Jan 4, 2019