Create datasets package with shortcuts to acquire datasets as DataFrames #199

Open
frreiss opened this issue Jun 4, 2021 · 0 comments
frreiss commented Jun 4, 2021

Our notebooks and experiment scripts frequently repeat a pattern:

  • Download a reference data set (if not already present)
  • Read the data set with one of our reader functions
  • Convert everything in the data set to DataFrames

We should wrap these three steps into a single function so that we and our users don't need to write this code over and over again.

Suggested API:

  • Main entry point at tp.dataset.download_<data set name>(), with optional arguments to specify:
    • cache directory
    • fold name
    • whether to return a DataFrame per document or a single stacked DataFrame
  • Each download_<name>() function performs the following steps:
    • If the raw data set isn't present, download it
    • Convert the entire raw data set into DataFrames
    • Stack the DataFrames into a single large DataFrame with a leading column holding the fold name, and write it as a single Parquet file in the cache directory
    • Use the cached Parquet file for subsequent reads of the data set
    • If the user requested a DataFrame per document, split the single large DataFrame into multiple smaller ones
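
The steps above could be sketched roughly as follows. This is only an illustration of the proposed flow, not an existing API: `download_example`, `stack_folds`, `split_by_document`, the `load_raw` callback, the `doc_num` column, and the cache file name are all hypothetical names made up for this sketch. It also assumes a Parquet engine such as pyarrow is installed for pandas to use.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

import pandas as pd

# Hypothetical callback: downloads the raw data set into the given directory
# if it is missing, parses it, and returns {fold name: [one DataFrame per document]}.
RawLoader = Callable[[Path], Dict[str, List[pd.DataFrame]]]


def stack_folds(folds: Dict[str, List[pd.DataFrame]]) -> pd.DataFrame:
    """Stack per-document DataFrames, adding leading 'fold' and 'doc_num' columns."""
    pieces = []
    for fold_name, docs in folds.items():
        for doc_num, doc_df in enumerate(docs):
            piece = doc_df.copy()
            piece.insert(0, "doc_num", doc_num)  # assumed helper column for splitting
            piece.insert(0, "fold", fold_name)
            pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)


def split_by_document(stacked: pd.DataFrame) -> Dict[str, List[pd.DataFrame]]:
    """Undo stack_folds: recover {fold name: [one DataFrame per document]}."""
    result: Dict[str, List[pd.DataFrame]] = {}
    # sort=False keeps folds and documents in their order of first appearance.
    for (fold_name, _), group in stacked.groupby(["fold", "doc_num"], sort=False):
        doc_df = group.drop(columns=["fold", "doc_num"]).reset_index(drop=True)
        result.setdefault(fold_name, []).append(doc_df)
    return result


def download_example(
    load_raw: RawLoader,
    cache_dir: Union[str, Path] = "outputs",
    fold: Optional[str] = None,
    per_document: bool = False,
):
    """Return the data set as one stacked DataFrame or as per-document DataFrames."""
    cache_dir = Path(cache_dir)
    cache_file = cache_dir / "example.parquet"  # hypothetical cache file name
    if not cache_file.exists():
        # First call: fetch and parse the raw data, then cache the stacked result.
        cache_dir.mkdir(parents=True, exist_ok=True)
        stack_folds(load_raw(cache_dir)).to_parquet(cache_file, index=False)
    # Every call reads from the cached Parquet file.
    stacked = pd.read_parquet(cache_file)
    if fold is not None:
        stacked = stacked[stacked["fold"] == fold].reset_index(drop=True)
    return split_by_document(stacked) if per_document else stacked
```

One design note on this sketch: because the split relies on grouping rows, a document whose DataFrame has zero rows would not survive the round trip through the stacked file, so a real implementation might need to track document boundaries separately.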