Add CSVHandler #1949

Closed
amontanez24 opened this issue Apr 22, 2024 · 0 comments · Fixed by #1958

amontanez24 commented Apr 22, 2024

Problem Description

As a user, I'd like a streamlined way to load my data and metadata from files so that I can get right to using SDV.

Expected behavior

  • In the sdv.io subpackage, add a folder called local
  • In that folder add a class called CSVHandler

__init__

Parameters

  • sep: The separator used, if not a comma
    • (default) ',': A comma separates the values in a row
  • encoding: The character encoding to use
    • (default) 'UTF'
    • Other options: See Python's list of standard encodings
from sdv.io.local import CSVHandler

handler = CSVHandler(sep='\t', encoding='UTF') 
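
The default 'UTF' may look unusual, but Python's codec registry accepts it as an alias of UTF-8. Below is a minimal sketch of how the constructor could validate the encoding up front. The helper name `normalize_encoding` is hypothetical and not part of SDV, and raising a ValueError is an assumption rather than something this issue specifies.

```python
import codecs


def normalize_encoding(encoding='UTF'):
    """Return the canonical codec name, or raise ValueError if it is unknown.

    Hypothetical helper for illustration only; not part of SDV.
    """
    try:
        return codecs.lookup(encoding).name
    except LookupError as error:
        raise ValueError(f'Unknown encoding: {encoding!r}') from error


print(normalize_encoding('UTF'))  # utf-8
```

Validating in `__init__` would surface a mistyped encoding before any file is opened, instead of part-way through a read or write.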

read

Functionality

Internally, reading should use the read_csv function from pandas. A few things should be hardcoded by default:

  • Use the sep and encoding options from init
  • Pandas should not detect an index column from the data
  • Pandas should not try to infer datetime formats (or cast them to np.datetime64 objects). Any datetime column should be left as dtype 'object'
  • Pandas should not error if there is a badly formatted line. We should just raise a warning and read the remaining lines.
  • After reading the data, we should use it to infer a MultiTableMetadata object. (Even if there is only 1 CSV file, we should still create a MultiTableMetadata object.)
from sdv.io.local import CSVHandler

handler = CSVHandler()
data, metadata = handler.read(folder_name='project/data')
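
The hardcoded options listed above map fairly directly onto `pandas.read_csv` arguments. The following is a rough sketch of reading a single file, assuming pandas 1.3 or later (where `on_bad_lines` is available). The helper name `read_table` is hypothetical, and metadata inference is left out because it is SDV-specific.

```python
import pandas as pd


def read_table(path, sep=',', encoding='UTF'):
    """Read one CSV file with the options described in this issue.

    Hypothetical helper for illustration only; not part of SDV.
    """
    return pd.read_csv(
        path,
        sep=sep,
        encoding=encoding,
        index_col=False,      # never use a data column as the index
        parse_dates=False,    # leave datetime-like columns as plain strings
        on_bad_lines='warn',  # warn about malformed rows and keep reading
    )
```

A `read` method could call a helper like this once per file and collect the results into the dictionary of tables.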

Parameters

  • (required) folder_name: The name of the folder that contains the CSV files. It can include the full path to the folder
  • file_names: A list of file names inside the folder to read
    • (default) None: Read all files in the folder that end with ".csv"
    • list(str): Only files with these names will be read into Python
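
One way the two `file_names` cases could be resolved is sketched below using only the standard library. The helper name `resolve_csv_files` is hypothetical, and raising `FileNotFoundError` for a missing file is an assumption; the issue does not say what should happen in that case.

```python
from pathlib import Path


def resolve_csv_files(folder_name, file_names=None):
    """Map each table name to the path of its CSV file.

    Hypothetical helper for illustration only; not part of SDV.
    """
    folder = Path(folder_name)
    if file_names is None:
        # Default: every file in the folder that ends with ".csv".
        paths = sorted(
            path for path in folder.iterdir()
            if path.is_file() and path.name.endswith('.csv')
        )
    else:
        paths = [folder / name for name in file_names]
        missing = [str(path) for path in paths if not path.is_file()]
        if missing:
            # Assumption: fail loudly rather than silently skipping files.
            raise FileNotFoundError(f'Files not found: {missing}')

    # The table name is the file name without its '.csv' suffix.
    return {path.stem: path for path in paths}
```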

Returns

  • data: A dictionary mapping each table name to a pandas DataFrame with the data. The table name is the same as the file name (excluding the '.csv' suffix)
  • metadata: A MultiTableMetadata object that describes the data

write

Functionality
Internally, writing should use the to_csv function from pandas. A few things should be hardcoded by default:

  • Use the sep and encoding options from init
  • Do not write the index column
from sdv.io.local import CSVHandler

handler = CSVHandler()
handler.write(
    synthetic_data,
    folder_name='project/synthetic_data',
    file_name_suffix='_v1',
    mode='x',
)

Parameters

  • (required) synthetic_data: A dictionary that maps each table name to a pandas.DataFrame containing that table's data
  • file_name_suffix: An optional suffix to add when writing each file
    • (default) None: Do not add a suffix. The file name will be the same as the table name with a ".csv" extension
    • string: Append the suffix after the table name. E.g., a suffix "_synthetic" will write a file as "TABLENAME_synthetic.csv"
  • mode: A string signaling which mode of writing to use
    • (default) 'x': Write to new files, raising an error if a file with the same name already exists
    • 'w': Write to new files, overwriting any existing files with the same name
    • 'a': Append the new CSV rows to any existing files
    • 'a': Append the new CSV rows to any existing files
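
The suffix and mode rules above can be sketched with the standard library alone. The helper name `plan_output_paths` is hypothetical, and checking every target before writing anything is a design assumption, chosen so that mode 'x' cannot leave a folder half-written.

```python
from pathlib import Path


def plan_output_paths(table_names, folder_name, file_name_suffix=None, mode='x'):
    """Return {table_name: output path}, failing early in 'x' mode.

    Hypothetical helper for illustration only; not part of SDV.
    """
    if mode not in ('x', 'w', 'a'):
        raise ValueError(f"mode must be 'x', 'w' or 'a', got {mode!r}")

    suffix = file_name_suffix or ''
    folder = Path(folder_name)
    paths = {name: folder / f'{name}{suffix}.csv' for name in table_names}

    if mode == 'x':
        # Check all targets up front so that no file is written on failure.
        existing = sorted(str(path) for path in paths.values() if path.exists())
        if existing:
            raise FileExistsError(f'Files already exist: {existing}')

    return paths
```

Each resulting path could then be handed to `DataFrame.to_csv` together with `sep`, `encoding`, `index=False` and the chosen `mode`.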

Additional context

  • We will add a number of local file handlers for different file types (see Add ExcelHandler #1950). Therefore the implementation of this class should also add a base class.
  • Optionally, the init, read and write functions can include a subset of arguments that the corresponding pandas functions use
    • If a parameter is the same for both the pandas read and write functions (e.g. decimal), then put it in the init.
    • We can ignore most of these parameters. Only add ones that seem impactful
    • If there are some that the different file types have in common, consider adding to the Base.
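
A possible shape for the base class and its CSV subclass is sketched below. The name `BaseLocalHandler`, the abstract-method approach, and the exact signatures are all assumptions made for illustration; the method bodies are deliberately left out.

```python
from abc import ABC, abstractmethod


class BaseLocalHandler(ABC):
    """Hypothetical common interface for local file handlers."""

    @abstractmethod
    def read(self, folder_name, file_names=None):
        """Return (data, metadata) for the files in the folder."""

    @abstractmethod
    def write(self, synthetic_data, folder_name, file_name_suffix=None, mode='x'):
        """Write each table in synthetic_data to the folder."""


class CSVHandler(BaseLocalHandler):
    """Sketch of the CSV handler; only the constructor is filled in."""

    def __init__(self, sep=',', encoding='UTF', decimal='.'):
        # decimal is accepted by both read_csv and to_csv, so it lives here.
        self.sep = sep
        self.encoding = encoding
        self.decimal = decimal

    def read(self, folder_name, file_names=None):
        raise NotImplementedError('omitted in this sketch')

    def write(self, synthetic_data, folder_name, file_name_suffix=None, mode='x'):
        raise NotImplementedError('omitted in this sketch')
```

With this layout, another handler such as the one proposed in #1950 would only need to implement `read` and `write` and add its own constructor options.
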
@amontanez24 amontanez24 added the feature request Request for a new feature label Apr 22, 2024
@amontanez24 amontanez24 added this to the 1.12.2 milestone May 13, 2024