Partially updating and syncing large CSV-files #1

Open
Hvass-Labs opened this issue Oct 1, 2019 · 0 comments
Introduction

Partial updating and syncing of the CSV-files was originally part of the idea for this project: the full CSV-files would only be downloaded the first time, and the following downloads would be partial, so as to sync your local CSV-files with the server's new data.

It turned out this wasn't necessary for the fundamental data (Income Statements, Balance Sheets, Cash-Flow Statements), because those CSV-files are fairly small and the compressed ZIP-files downloaded from the server are typically only a few MB each. It is therefore much easier, and probably also faster, to download the full data-files each time a user wants to update them, rather than have the server create a different subset of the data for each individual user.

However, the dataset with share-prices could benefit from an efficient data-syncing method, because the ZIP-file is currently about 70 MB and it unpacks into a CSV-file that is nearly 400 MB, and these files will grow with time.

So a data-syncing method might become necessary eventually, and the idea is therefore described here in case someone wants to implement it in the future. The idea is explained for CSV-files in general.

Overall Idea

The overall idea is that we need a timestamp associated with each CSV-file, which tells us when the most recent data-item in the CSV-file was added to the SimFin server's database. Sometimes this timestamp can be read directly from the data; other times it must be provided by the SimFin server, e.g. in a separate file called 'timestamp.txt' inside the ZIP-file for each dataset.
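A minimal sketch of reading such a timestamp on the client side, assuming the server bundles a 'timestamp.txt' file with an ISO-8601 date inside each dataset's ZIP-file (the file-name and format are assumptions, not part of the current SimFin API):

```python
import zipfile
from datetime import datetime

def read_dataset_timestamp(zip_path):
    """Return the timestamp stored in 'timestamp.txt' inside the dataset's ZIP-file."""
    with zipfile.ZipFile(zip_path) as zf:
        text = zf.read('timestamp.txt').decode('utf-8').strip()
    # Example contents: '2019-10-01T14:30:00'
    return datetime.fromisoformat(text)
```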

If a user does not have the dataset and timestamp on their local hard-disk, then the entire dataset is downloaded from the SimFin server. This can be done very quickly, because the server just needs to send a pre-packaged ZIP-file, as it does now.
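The client-side decision between a full and a partial download could look roughly like this. The helper functions `download_full()` and `download_partial()` and the file-naming scheme are hypothetical placeholders for illustration:

```python
import os

def update_dataset(dataset, data_dir):
    """Download the full dataset the first time, otherwise only the new rows."""
    csv_path = os.path.join(data_dir, dataset + '.csv')
    ts_path = os.path.join(data_dir, dataset + '-timestamp.txt')

    if not (os.path.exists(csv_path) and os.path.exists(ts_path)):
        # First download: the server sends the pre-packaged ZIP-file as it does today.
        # download_full() is a hypothetical helper, not an existing SimFin function.
        download_full(dataset, csv_path, ts_path)
    else:
        # Later downloads: only request rows added after the local timestamp.
        with open(ts_path) as f:
            local_timestamp = f.read().strip()
        # download_partial() is likewise a hypothetical helper.
        download_partial(dataset, local_timestamp, csv_path, ts_path)
```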

But when the user already has a version of the dataset and timestamp on their local hard-disk, then the timestamp is first read from disk and sent to the SimFin server along with the other arguments that specify which dataset is wanted. The server then looks up all rows in the given dataset whose timestamps are more recent than the timestamp provided by the user. This data is written to a new CSV-file, together with a new timestamp for the most recent data in that file, and compressed into a ZIP-file which is sent to the user. Afterwards the temporary CSV-file and ZIP-file can be deleted on the server.
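A rough sketch of the server-side step, assuming the dataset is available to the server as a Pandas DataFrame with a 'Timestamp' column recording when each row was added to the database. The column name and file-names are illustrative only:

```python
import zipfile
import pandas as pd

def make_partial_zip(df, user_timestamp, zip_path):
    """Pack the rows added after user_timestamp into a ZIP-file for the user."""
    # 'Timestamp' is an assumed column with the time each row was added
    # to the server's database.
    df_new = df[df['Timestamp'] > user_timestamp]
    new_timestamp = df_new['Timestamp'].max()

    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        # The CSV-file with only the new rows, plus the new timestamp
        # which the client stores for the next sync.
        zf.writestr('partial.csv', df_new.to_csv(index=False))
        zf.writestr('timestamp.txt', str(new_timestamp))
    return zip_path
```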

The user then unpacks the ZIP-file they received from the server and simply appends the new CSV-file to the end of the old CSV-file already located on their hard-disk.
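A sketch of that append step, skipping the repeated header row; the function and file-names are illustrative:

```python
def append_csv(local_path, new_path):
    """Append the rows of new_path to local_path, skipping the header row."""
    with open(new_path) as f_new:
        new_lines = f_new.readlines()
    with open(local_path, 'a') as f_local:
        # The local CSV-file already has a header, so drop the first line.
        f_local.writelines(new_lines[1:])
```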

New data from the SimFin server can be appended to the local CSV-file daily. This means that data for each ticker will be spread throughout the local CSV-file. It is also possible that the dates will be unsorted in the local CSV-file, if older data should ever be added to the SimFin server's database in the future. But when the data is loaded with Pandas into a DataFrame with a multi-level index of e.g. Ticker and Date, the rows can then be sorted correctly, as shown in the sketch below.
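A sketch of loading the local CSV-file so the rows end up grouped by ticker and sorted by date, even though the file on disk is unsorted. The file-name and the ';' separator follow SimFin's bulk CSV-files; adjust as needed:

```python
import pandas as pd

df = pd.read_csv('us-shareprices-daily.csv', sep=';', parse_dates=['Date'])

# Setting a multi-level index and sorting groups all rows for each ticker
# and puts the dates in order, regardless of the order on disk.
df = df.set_index(['Ticker', 'Date']).sort_index()
```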

All of this would be handled automatically by SimFin's Python API and server, so the user never notices it happening behind the scenes. The user would just call the normal function for loading a dataset.

For this to work, the server must be reasonably fast at generating the ZIP-files with new data. This is also why it probably doesn't make sense to do it for smaller datasets of only a few MB, where it is likely faster for the server to just send a pre-packaged ZIP-file.
