Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ExcelHandler #1950

Closed
amontanez24 opened this issue Apr 22, 2024 · 0 comments · Fixed by #1962
Closed

Add ExcelHandler #1950

amontanez24 opened this issue Apr 22, 2024 · 0 comments · Fixed by #1962
Assignees
Labels
feature request Request for a new feature
Milestone

Comments

@amontanez24
Copy link
Contributor

Problem Description

As a user, I'd like an streamlined way to load my data and metadata from files so that I can get right to using SDV.

Expected behavior

  • In the sdv.io subpackage, add a folder called local
  • In that folder add a class called ExcelHandler

__init__

Parameters

from sdv.io.local import ExcelHandlder

handler = ExcelHandler() 

read

Functionality

Internally, reading should use the read_excel function from pandas. A few things should be hardcoded by default

  • Pandas should not detect an index column from the data
  • Pandas should not try to infer datetime formats (or cast them to np.datetime objects). Any datetime column should be left as a dtype 'object'
  • After reading the data, we should use it to infer a MultiTableMetadata object. (Even if there is only 1 table, we should still create a MultiTableMetadata object.)

Parameters

  • (required) file_path: A string describing the path of the Excel file to read
  • sheet_name: A list of strings denoting which sheets in the Excel file to read from
    • (default) None: Read all the sheets in the file
    • list(str): Read only the sheets listed

Returns

  • data: A dictionary mapping each table name to a pandas DataFrame with the data. The table name is the same as the sheet name
  • metadata: A MultiTableMetadata object that describes the data

write

Functionality
Internally, writing should use the to_excel function from pandas. A few things should be hardcoded by default

  • Do not write the index column
  • Each table of the synthetic data should be written as a new sheet within the file. The name of the sheet should be the same as the name of the table
  • If a sheet already exists with the same name, completely override it

Parameters

  • (required) synthetic_data: A dictionary that maps each table name to a pandas.DataFrame containing data from it

  • (required) file_name: The name of the excel file to write

  • sheet_name_suffix: A string with a suffix to add to each sheet name

    • (default) None: The name of the table should be the name of the sheet
    • (str) Append this string as the suffix. Eg. suffix of "_synthetic" will make sheets with "TABLENAME_synthetic"
  • mode: A string signaling which mode of writing to use

    • (default) 'w': Write sheets to a new file, clearing any existing file that may exist
    • 'a': Append new sheets within the existing file. Note: You cannot append data to existing sheets.

Additional context

  • We will add a number of local file handlers for different file types. Therefore the implementation of this class should also add a base class.
  • Optionally, the init, read and write functions can include a subset of arguments that the corresponding pandas functions use
    • if both the read and write for pandas are the same for a parameter (eg. decimal), then put it in the init.
    • We can ignore most of these parameters. Only add ones that seem impactful
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Request for a new feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants