make optional loading of data into pandas dataframe in sf.load() #25

Open
batmaxx opened this issue Feb 27, 2024 · 3 comments

Comments

@batmaxx

batmaxx commented Feb 27, 2024

Feature Suggestion

Description

The load() function (in load.py) downloads the data (zip + unzip) and then immediately loads it into a pandas DataFrame.
Would it be possible to add an argument that only downloads the data (and thus returns nothing), skipping the DataFrame creation altogether?
That would let people who only want to download the data, or who use something other than pandas, still benefit from the nice functionality implemented in _maybe_download_dataset (URL, caching, filename...).

Code

An easy non-breaking change could be to add an argument create_pandas_dataframe=True:

def load(dataset, variant=None, market=None, start_date=None, end_date=None,
         parse_dates=None, index=None, refresh_days=30, create_pandas_dataframe=True):
    """
    Load the dataset from local disk and return it as a Pandas DataFrame.
    ....
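    :param create_pandas_dataframe:
        If True (default), load the downloaded data into a Pandas DataFrame.
        If False, only download the data and return None.

    :return:
        Pandas DataFrame with the data, or None if
        create_pandas_dataframe is False.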
    """

    assert dataset is not None

    # Convert dataset name, variant, and market to lower-case.
    dataset = dataset.lower()
    if variant is not None:
        variant = variant.lower()
    if market is not None:
        market = market.lower()

    # Dict with dataset arguments.
    dataset_args = {'dataset': dataset, 'variant': variant, 'market': market}

    # Download file if it does not exist on local disk, or if it is too old.
    _maybe_download_dataset(**dataset_args, refresh_days=refresh_days)

    # Optionally load the data into a Pandas DataFrame and return it.
    if create_pandas_dataframe:
        # Lambda function for converting strings to dates. Format: YYYY-MM-DD
        date_parser = lambda x: pd.to_datetime(x, yearfirst=True, dayfirst=False)
    
        # Print status message.
        print('- Loading from disk ... ', end='')
    
        # Full path for the CSV-file on local disk.
        path = _path_dataset(**dataset_args)
        if start_date or end_date:
            print('\n- Applying filter ... ', end='')
            path = _filtered_file(path, start_date, end_date=end_date)
        
        # Load dataset into Pandas DataFrame.
        df = pd.read_csv(path, sep=';', header=0,
                        parse_dates=parse_dates, date_parser=date_parser)
    
        # Set the index and sort the data.
        if index is not None:
            # Set the index.
            df.set_index(index, inplace=True)
    
            # Sort the rows of the DataFrame according to the index.
            df.sort_index(ascending=True, inplace=True)
    
        # Print status message.
        print('Done!')
        
        return df

    # Download-only mode: there is nothing to return.
    return None

Example

import simfin as sf

# only download data
sf.load_income(variant='quarterly-full-asreported', market='us', create_pandas_dataframe=False)

# download data and return pandas dataframe
sf.load_balance(variant='quarterly-full-asreported', market='us')

Happy to make a PR if necessary. Thanks!

@thf24
Member

thf24 commented Mar 11, 2024

So basically you only want to use _maybe_download_dataset(**dataset_args, refresh_days=refresh_days)?
There is no need to call the "load" method if you can just call "_maybe_download_dataset", or am I missing something? :)

@thf24
Member

thf24 commented Mar 11, 2024

I mean, we basically already have the function you need, which is "_maybe_download_dataset"?
The only thing the load method adds on top of it is the lowercase conversion.
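
For illustration, the lowercase conversion plus delegation could also live in a thin public wrapper. The sketch below is hypothetical: the name `download` is not part of the library, and `_maybe_download_dataset` is replaced by a stand-in that only records its arguments, using the keyword names seen in the call above.

```python
calls = []  # records what the stand-in helper receives


def _maybe_download_dataset(dataset, variant, market, refresh_days):
    """Stand-in for the private helper; it only records its arguments."""
    calls.append((dataset, variant, market, refresh_days))


def download(dataset, variant=None, market=None, refresh_days=30):
    """Hypothetical public entry point: normalise case, then delegate."""
    if dataset is None:
        raise ValueError("dataset must not be None")

    def lower(value):
        # Leave missing optional arguments as None.
        return value.lower() if value is not None else None

    _maybe_download_dataset(dataset=lower(dataset), variant=lower(variant),
                            market=lower(market), refresh_days=refresh_days)


download("Income", variant="Quarterly", market="US")
```

A wrapper like this would keep load() untouched while still giving users an entry point without a leading underscore.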

@batmaxx
Author

batmaxx commented Mar 14, 2024

Yes, using _maybe_download_dataset directly works for sure; however, the leading "_" signals that it is meant for internal use only. That's why I thought making this explicit with an extra argument would make it easier for users to either download and load the dataset, or only download it.

For my use case I'm fine using _maybe_download_dataset directly. If this change is not deemed important, then I can close this issue 👍
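
As a rough sketch of the "something other than pandas" use case: once a file is on disk, it can be read with the standard library alone. The separator `;` and the header row match the read_csv call quoted earlier in this issue; the file path, column names, and values below are invented sample data, not the real dataset layout.

```python
import csv
import tempfile
from pathlib import Path

# Invented sample standing in for a downloaded dataset file.
path = Path(tempfile.mkdtemp()) / "sample.csv"
path.write_text(
    "Ticker;Report Date;Revenue\n"
    "AAA;2020-03-31;100\n"
    "AAA;2020-06-30;120\n"
    "BBB;2020-03-31;50\n"
)


def revenue_by_ticker(csv_path):
    """Stream the file row by row and sum the Revenue column per ticker."""
    totals = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter=";"):
            ticker = row["Ticker"]
            totals[ticker] = totals.get(ticker, 0) + int(row["Revenue"])
    return totals


totals = revenue_by_ticker(path)
```

Streaming the rows this way never holds the whole file in memory, which is one reason someone might prefer to skip the DataFrame step.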
