
[Freeze][Resume After Silver + Streamlit] Fix #769 - OHLCV with CSV Data Store #858

Closed

Conversation


@kdetry kdetry commented Apr 4, 2024

Fixes #769

Changes proposed in this PR:

  • OHLCV data-writing logic is removed
  • OHLCV is integrated with CSV Data Store
  • OHLCV tests are updated to use CSV Data Store
  • all functions (psutil.py is an example) that are no longer used anywhere except in tests have been deprecated

@kdetry kdetry changed the base branch from main to issue685-duckdb-integration April 4, 2024 13:43
@idiom-bytes (Member)

There are a bunch of functions from plutil.py that are not being used anymore. These should be re-wired to use CDS, to prove that CSV-writing works in exactly the same way.

These commands are only being used in tests

initialize_rawohlcv_df,
load_rawohlcv_file,
save_rawohlcv_file,

The following are still being used elsewhere (in common code)

concat_next_df,
has_data,
newest_ut,
oldest_ut,

@idiom-bytes idiom-bytes marked this pull request as draft April 12, 2024 14:11
@idiom-bytes idiom-bytes changed the title Fix #769 - OHLCV with CSV Data Store [Freeze][Do Not Merge] Fix #769 - OHLCV with CSV Data Store Apr 12, 2024
@idiom-bytes idiom-bytes changed the title [Freeze][Do Not Merge] Fix #769 - OHLCV with CSV Data Store [Freeze][Resume After DuckDB] Fix #769 - OHLCV with CSV Data Store Apr 12, 2024
@idiom-bytes idiom-bytes changed the title [Freeze][Resume After DuckDB] Fix #769 - OHLCV with CSV Data Store [Freeze][Resume After Silver + Streamlit] Fix #769 - OHLCV with CSV Data Store Apr 12, 2024
Review thread on pdr_backend/lake/csv_data_store.py (resolved):

  # read the last record from the file
  last_file = pl.read_csv(file_path)
- return int(last_file["timestamp"][-1])
+ return UnixTimeMs(int(last_file["timestamp"][-1]))
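The diff above wraps the plain int in a UnixTimeMs value. A minimal stdlib sketch of the same pattern, with UnixTimeMs as a stand-in for pdr_backend's time type (assumed here to be an int subclass) and the csv module in place of polars:

```python
import csv


class UnixTimeMs(int):
    """Stand-in for pdr_backend's UnixTimeMs (assumed to be an int subclass)."""


def last_timestamp(file_path: str) -> UnixTimeMs:
    # read all rows and take the final "timestamp" value
    with open(file_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return UnixTimeMs(int(rows[-1]["timestamp"]))
```

Because UnixTimeMs subclasses int, callers that expect a plain int keep working, while type checks and helper methods on the wrapper become available.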
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm putting a comment here, for lack of a better place.)

The implementation of _get_from_value() and _get_to_value() currently depends on the name of the file.

This is too brittle, and will be hard to maintain.

Better: auto-detect the values by reading the file itself. It's not that expensive, and will be far better to work with. This is what we've done in the past; and we can re-use that past code. Details:

  • A baseline version: simply load the whole file, and read the first entry and last entry.
  • Even faster: open the file but don't load it all; read the second line (for the "from" value), or iterate quickly to the last line and read it (for the "to" value).
  • The latter has already been built for you! It's how we used to do things, before porting to polars.
    • It's in the file pdutil.py, circa Nov 23, 2023
    • Function oldest_ut() finds the "from" value
    • Function newest_ut() finds the "to" value
    • They both rely on lower-level helper functions. Nice and clean:)
    • And it's all unit-tested too. See test_pdutil.py from that time
    • The function load_csv() may be useful too
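The fast approach described above can be sketched as follows. This is a hypothetical reimplementation in the spirit of the old oldest_ut()/newest_ut() helpers (the names are borrowed from the comment; the real pre-polars code lived in pdutil.py). It assumes the timestamp is the first CSV column:

```python
import os


def oldest_ut(file_path: str) -> int:
    """Return the timestamp of the first data row (line 2, after the header)."""
    with open(file_path, "r") as f:
        f.readline()              # skip the CSV header
        first_row = f.readline()  # first data row
    return int(first_row.split(",")[0])


def newest_ut(file_path: str) -> int:
    """Return the timestamp of the last data row, without loading the file."""
    with open(file_path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell() - 2  # skip a possible trailing newline
        # walk backwards until we hit the newline that precedes the last row
        while pos > 0:
            f.seek(pos)
            if f.read(1) == b"\n":
                break
            pos -= 1
        last_row = f.read().decode()
    return int(last_row.split(",")[0])
```

Neither function reads more than a handful of bytes beyond the rows it needs, so the cost stays constant even as files grow.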

@kdetry (Contributor, Author) commented May 7, 2024

We create a folder for each dataset and populate it with files named according to the timestamp range of their data. Each file contains 1000 rows of data.

test_data_folder/x_from_0000000001_to_00000212112.csv
test_data_folder/x_from_00000212112_to_00004912112.csv
test_data_folder/x_from_00004912112_to_00408912312.csv

A baseline version: simply load the whole file, and read the first entry and last entry.

So actually the _get_from_value and _get_to_value methods do not read a single shared file; they detect the last file and return its "from" or "to" value, parsed from the filename.

Review thread on pdr_backend/lake/csv_data_store.py (resolved):
@idiom-bytes (Member) commented May 6, 2024

@kdetry have you
(1) read Trent's feedback?
(2) written a ticket, or updated this in the duckdb-integration code?

Please complete (1) and (2), then close this PR.

@kdetry (Contributor, Author) commented May 7, 2024

I added the following "DoD"s to the main task. I'm closing this PR for now; we can handle it after shipping.

  • Change the IO methods to functions, in a new module cvsutil.py
  • Use them in the CSVDataStore class to make it cleaner and thinner

Development

Successfully merging this pull request may close these issues.

[Lake][OHCLV] Update OHCLV data_factory to use csv_data_writer
3 participants