Add schema for Airline Reporting Carrier On-Time Performance Dataset #2

Open
wants to merge 1 commit into master
Conversation

djalova
Collaborator

@djalova djalova commented Dec 8, 2020

No description provided.

@xuhdev
Collaborator

xuhdev commented Dec 15, 2020

Could you resolve the conflicts? It seems like this is now unblocked.

@edwardleardi
Collaborator

This dataset is 81 GB. Should we hold off on adding it until we add code that properly handles datasets that can't be fully loaded into memory?

We probably want to implement download resuming and maybe use a dependency like Dask for loading large datasets into pandas DataFrames.

@xuhdev
Collaborator

xuhdev commented Dec 17, 2020

@edwardleardi Yes, I agree. How about we make this PR independent of the monolithic issue?

@xuhdev xuhdev added the dataset Add or update one or a few specific datasets label Dec 17, 2020
@djalova
Collaborator Author

djalova commented Dec 17, 2020

The problem I had with this one was that I had trouble running the tests locally. I can try running them again on a machine with more storage.

@edwardleardi
Collaborator

edwardleardi commented Dec 17, 2020

@djalova yeah, the tests in this repo actually download and load every dataset that is part of the dataset schema. I don't think we'll ever get this to pass without proper handling for large datasets like this one. As things stand currently, we would need a machine with 81 GB of memory to load the dataset when running the test. Plus it would need to download the whole dataset, so I don't think we want to be doing that for every test.

@edwardleardi
Collaborator

@xuhdev Yeah we should definitely make it independent.

How do you think we should approach implementing features related to loading large datasets in the future? It would probably be on a loader-dependent basis, right? Since loading large datasets depends on the dataset type and the Python object you want to load it into, right?

What do you think about creating a new epic for loaders? We can add issues for an initial loader implementation, like we did with CSVPandasLoader, and then separate issues for getting that loader to handle large datasets.

@bdwyer2
Collaborator

bdwyer2 commented Dec 17, 2020

As things stand currently, we would need a machine with 81 GB of memory to load the dataset when running the test.

It seems like the need for beefier CI/CD infrastructure keeps coming up. Let's make a note to discuss this when we get back from the holidays next year.

@xuhdev
Collaborator

xuhdev commented Dec 17, 2020

@edwardleardi I don't see this issue as very urgent: most people who use this dataset would already have a large amount of RAM; otherwise the dataset may not be very useful, depending on the use case (especially if even a subdataset can't be loaded). If we truly want some disk-backed option, we could add a CSVSqliteLoader, which loads the CSV file into a SQLite database and lets the user manipulate the database from there. This would introduce sqlite as an additional dependency, and it's perhaps best done as a separate package.

About the epic: sure, after the first release we would definitely split different kinds of tasks into different epics if that would make them easier to manage. Currently we only distinguish between pre-release and post-release, perhaps because we aren't focusing yet on what happens after the first release.

@xuhdev
Collaborator

xuhdev commented Dec 17, 2020

Large dataset support in general would be interesting, and should the demand arise we should do something to keep up with loading large datasets; it's just unclear to me yet what we should do about it. Perhaps we can open an issue and revisit this later?

@edwardleardi
Copy link
Collaborator

Currently we only distinguish between pre-release and post-release, perhaps because we aren't focusing yet on what happens after the first release.

Got it, good idea

Perhaps we can open an issue and revisit this later?

Opened here: https://app.zenhub.com/workspaces/pydax-5fcfdd73254483001e3f3b55/issues/codait/pydax/100
