Improve statistics for downloads #4642

Open
jonatas opened this issue Apr 24, 2024 · 8 comments
jonatas commented Apr 24, 2024

Is your feature request related to a problem?

I had a meeting with @simi to follow up on and continue the draft @segiddins started in #3560, so let's break down the problem here.

Problem: The current DownloadGem does not offer granularity or insights to the team creating a gem. The idea is to improve the support by giving more granularity and detail about user behavior while installing gems.

Describe the solution you'd like

Introduce new, granular tracking of downloads, allowing users to know more about when gems are installed, and publicly expose more statistics about gems being downloaded.

The gem page can present daily, weekly, and monthly totals. The public view can also show hourly downloads for "today".

The ideal scenario would also include the location the downloads come from, but I haven't investigated enough whether we have such a granular level of information available.

Describe alternatives you've considered

I haven't checked alternatives, as PostgreSQL is already in the stack and TimescaleDB was already the suggestion.

Additional context

I'm very glad to work on and support RubyGems. I've been a Rubyist for almost two decades, and for the last three years I've worked at Timescale, the company behind the TimescaleDB extension, as a Developer Advocate. I also created the timescaledb gem. My plan is to break the work down into a few PRs:

  1. Introduce TimescaleDB to the stack, setting up tests and creating the new Downloads hypertable (see the migration sketch after this list).
  2. Track downloaded gems and introduce a clone of the Fastly job that just stores the data in TimescaleDB.
  3. Introduce continuous aggregates for storing download totals over daily, monthly, and yearly timeframes.
  4. Backfill data from all S3 buckets.
  5. Migrate front-end statistics to use the continuous aggregates.
  6. Clean up old statistics and counters.
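
Purely as a sketch of what steps 1 and 3 might look like, assuming a plain SQL migration (the timescaledb gem also ships helpers); the table, columns, and view name are placeholders, not a proposed schema:

```ruby
# Hypothetical migration: a downloads hypertable plus a daily continuous aggregate.
class CreateDownloadsHypertable < ActiveRecord::Migration[7.1]
  # Continuous aggregates cannot be created inside a transaction block.
  disable_ddl_transaction!

  def up
    create_table :downloads, id: false do |t|
      t.timestamptz :ts, null: false
      t.bigint :rubygem_id, null: false
      t.bigint :version_id, null: false
    end

    # Turn the plain table into a hypertable partitioned by time.
    execute "SELECT create_hypertable('downloads', 'ts')"

    # Daily totals per gem version, materialized incrementally by TimescaleDB.
    execute <<~SQL
      CREATE MATERIALIZED VIEW downloads_per_day
      WITH (timescaledb.continuous) AS
      SELECT time_bucket('1 day', ts) AS day,
             rubygem_id,
             version_id,
             count(*) AS downloads
      FROM downloads
      GROUP BY day, rubygem_id, version_id
      WITH NO DATA;
    SQL
  end

  def down
    execute "DROP MATERIALIZED VIEW IF EXISTS downloads_per_day"
    drop_table :downloads
  end
end
```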
@segiddins
Member

See also https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/ for how PyPI handles this.

@colby-swandale
Member

colby-swandale commented Apr 26, 2024

👋🏻 Heyo, I'm Colby. I maintain the infrastructure for rubygems.org and wanted to jump in to help get this done. I wanted to ask some questions to better understand what changes introducing TimescaleDB will bring.

I appreciate Timescale putting their hand up to help us here; it's super appreciated by everyone. My big takeaway from this proposal is that it introduces a runtime dependency to rubygems.org, something we already have (e.g. Fastly) but look to limit where possible. What is the benefit of running a Timescale Cloud instance, versus our use case being simple enough that the Timescale Postgres extension could handle it relatively easily? I've also heard that a TimescaleDB offering inside AWS is in active development. Is this far away?

Our download logs only go back as far as 2015, when we moved to Fastly, so you'll probably need to add a step to backfill gem versions created before that date. You can probably also limit the backfill to 365.days.ago to reduce the amount of logs needing to be parsed/inserted.
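
To make that scoping concrete, a rough sketch of limiting the backfill to the last year by walking the log bucket by date prefix; the bucket name, key layout, and the enqueue_log_for_backfill helper are all assumptions, not the real rubygems.org setup:

```ruby
require "aws-sdk-s3"
require "date"

s3     = Aws::S3::Client.new(region: "us-east-1")        # placeholder region
bucket = "rubygems-fastly-log-archive"                    # placeholder bucket name

# Walk only the last 365 days of log files, assuming keys are laid out by
# date prefix (e.g. "2024/04/24/..."), which is an assumption here.
(Date.today - 365..Date.today).each do |day|
  prefix = day.strftime("%Y/%m/%d/")
  s3.list_objects_v2(bucket: bucket, prefix: prefix).each do |page|
    page.contents.each do |object|
      # enqueue_log_for_backfill is a hypothetical helper that parses the log
      # file and inserts rows into the downloads hypertable.
      enqueue_log_for_backfill(object.key)
    end
  end
end
```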

@simi
Member

simi commented Apr 26, 2024

> Our download logs only go back as far as 2015, when we moved to Fastly, so you'll probably need to add a step to backfill gem versions created before that date. You can probably also limit the backfill to 365.days.ago to reduce the amount of logs needing to be parsed/inserted.

@colby-swandale What data could be used to backfill pre-Fastly gems? In case there is none, we can just mark those versions as incomplete statistics-wise.

@jonatas
Author

jonatas commented Apr 26, 2024

Hello Colby! Thanks for reaching out!

> What is the benefit of running a Timescale Cloud instance, versus our use case being simple enough that the Timescale Postgres extension could handle it relatively easily?

The cloud offering allows for elastic compute and storage, high availability, replicas, etc. It would also be great marketing for our product, but the open-source version just works.

> I've also heard that a TimescaleDB offering inside AWS is in active development. Is this far away?

I don't have enough details to share any estimates, but I will try to check with the team.

> My big takeaway from this proposal is that it introduces a runtime dependency to rubygems.org, something we already have (e.g. Fastly) but look to limit where possible.

I totally agree, and I was even thinking about how these statistics could be a separate service, like rubygems-analytics: the only thing we need is the same files from S3, and maybe to transport some RubyGems metadata like rubygem_id and version_id, but the rest would be totally isolated.

So I'm also happy to move it to an independent process to isolate the entire scenario. If you agree, I can first bring a POC that runs totally independently.

@simi
Member

simi commented Apr 26, 2024

> I totally agree, and I was even thinking about how these statistics could be a separate service, like rubygems-analytics: the only thing we need is the same files from S3, and maybe to transport some RubyGems metadata like rubygem_id and version_id, but the rest would be totally isolated.

@colby-swandale On the other hand, a new isolated app will add maintenance burden. 🤔 @jonatas do you have any idea/estimate of what kind of response time we can get for the most complex queries planned?

@jonatas
Author

jonatas commented Apr 26, 2024

I don't think we'll have anything over a second. Everything will be pre-processed, so I imagine the average query will be under 300 ms.
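
For context, a rough sketch of the kind of read the UI would do, assuming the hypothetical downloads_per_day continuous aggregate from the migration sketch above; it only scans pre-aggregated daily rows rather than raw per-download events, which is why sub-second responses are expected:

```ruby
# Hypothetical read-only model over the downloads_per_day continuous aggregate;
# the view, columns, and `rubygem` variable are placeholders.
class DownloadsPerDay < ApplicationRecord
  self.table_name = "downloads_per_day"
end

# Last 30 days of daily totals for one gem: a small, index-friendly scan.
DownloadsPerDay.where(rubygem_id: rubygem.id)
               .where("day >= now() - interval '30 days'")
               .order(:day)
```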

@jonatas
Author

jonatas commented May 8, 2024

Hi folks, I just created this POC with the basic code to allow us to collect hourly statistics from the raw data.

We can run over all available logs and just pre-load the data into some instance, but I still don't have access to run it.

@simi brought up the point of making it an isolated service versus running it on the current infrastructure, and I'd love it if we could align on one approach.

I see a lot of positive impact in building an isolated server which just tracks downloads. I don't think this type of feature needs to be part of the main app, and the extra database would add a new layer of complexity over ActiveRecord, as it uses a different connection.

On an isolated server we'd need to mimic LogTickets or just have access to the S3 API to list and consume all the files:

  • We'll need a listener to subscribe to notifications about newly generated logs to process.
  • Create an endpoint for statistics that can be consumed by the official website (see the sketch at the end of this comment).
  • Drop the old counters from rubygems.org and replace the source with service calls.

I'm very open to going either way. I can also build on the point @segiddins reached before. I just explored this as a POC and am looking for more feedback before we proceed to the production implementation. I think that as an isolated server we have more opportunity to develop other types of analysis and even detect patterns.
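
To illustrate the second bullet, a rough sketch of what such an endpoint could look like on the isolated service, reusing the hypothetical DownloadsPerDay model from above; the route, parameters, and JSON shape are illustrative only:

```ruby
# Hypothetical controller on the isolated rubygems-analytics service.
class StatisticsController < ApplicationController
  # GET /rubygems/:rubygem_id/downloads/daily
  def daily
    rows = DownloadsPerDay.where(rubygem_id: params[:rubygem_id])
                          .where("day >= now() - interval '90 days'")
                          .order(:day)

    # rubygems.org would consume this JSON instead of keeping local counters.
    render json: rows.map { |row| { day: row.day, downloads: row.downloads } }
  end
end
```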

@simi
Member

simi commented May 8, 2024

> @simi brought up the point of making it an isolated service versus running it on the current infrastructure, and I'd love it if we could align on one approach.

This was raised by @colby-swandale actually. We need to ensure Timescale service health is not going to affect the health of the rest of the service. I thought we did something special for OpenSearch, but it seems we don't. 🤔 @colby-swandale would you mind deciding whether it is OK to start with a built-in API with some reasonable timeouts, or rather start with an isolated service?
