Where to store historical data? #52

Open
Bisaloo opened this issue Feb 12, 2022 · 1 comment

Bisaloo commented Feb 12, 2022

Opening this issue so we have a public & central place to discuss this matter.

Having the historical data on a branch seems suboptimal:

  • it is difficult to discover (although this could be changed by advertising it more in the README & the pkgdown website)
  • inevitable growth of the branch size will have ripple effects on all operations in the main branch (in particular clones & checkouts)

The cleanest option is probably to store this data in an actual database, hosted on an external service. This makes sense since we're not actually changing the file contents, just adding new files, and therefore don't need a Version Control System. But:

  • this costs money
  • it requires more maintenance / learning how to use a new service

Another, simpler (albeit imperfect) option would be to store the historical data in a separate GitHub repository. This uses tools we already know, is free, public & easy to find.

Bisaloo commented May 8, 2024

We go from 553 MB to 81 MB by switching to Parquet. It is probably worth doing this alongside the storage migration.

Here is a script:

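# Read all CSV snapshots in the current directory as a single lazy dataset
arrow::open_csv_dataset(".") |>
  # Derive the partition columns from the snapshot timestamp
  dplyr::mutate(
    snapshot_year = lubridate::year(snapshot_time),
    snapshot_month = lubridate::month(snapshot_time),
    snapshot_day = lubridate::day(snapshot_time)
  ) |>
  # Load into memory: the hms conversion below is done on a regular data frame
  dplyr::collect() |>
  # The date now lives in the partition columns, so keep only the time of day
  dplyr::mutate(
    snapshot_time = hms::as_hms(snapshot_time)
  ) |>
  # Write Parquet files (the default format), one directory per day
  arrow::write_dataset(
    "../cransays-history",
    partitioning = c("snapshot_year", "snapshot_month", "snapshot_day")
  )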

Or do we want to partition on something else, such as package name? 🤔 (Partitioning by package name is very inefficient in terms of storage.)
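To make the partitioning question more concrete, here is a minimal sketch of what a date-partitioned dataset looks like on disk and how it could be read back. It uses a toy data frame rather than the real history: the `package` column, its values, and the `cransays-history-demo` directory name are assumptions made for illustration only.

```r
library(arrow)
library(dplyr)

# Toy stand-in for the history data (hypothetical column names and values)
history <- data.frame(
  package = c("pkgA", "pkgB", "pkgA"),
  snapshot_year = c(2024L, 2024L, 2024L),
  snapshot_month = c(5L, 5L, 5L),
  snapshot_day = c(7L, 7L, 8L)
)

# Hypothetical output location inside the session's temporary directory
out_dir <- file.path(tempdir(), "cransays-history-demo")

# Same partitioning scheme as the script above
write_dataset(
  history, out_dir,
  partitioning = c("snapshot_year", "snapshot_month", "snapshot_day")
)

# Hive-style layout: one nested directory per partition value, e.g.
# snapshot_year=2024/snapshot_month=5/snapshot_day=7/
partition_files <- list.files(out_dir, recursive = TRUE)

# Filtering on the partition columns lets arrow skip directories
# that cannot contain matching rows
one_day <- open_dataset(out_dir) |>
  filter(snapshot_year == 2024, snapshot_month == 5, snapshot_day == 8) |>
  collect()
```

With this layout, the number of files grows with the number of snapshot days. Passing `partitioning = "package"` instead would create one directory per package, which is presumably where the storage overhead mentioned above would come from.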
