Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch oldest authored_at correctly instead of oldest authored_at per page #4460

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

albertchae
Copy link
Contributor

@albertchae albertchae commented Feb 17, 2024

Fixes #4448

Previously, the "since date" was using oldest version.authored_at for a given
page. This change fixes it to get the oldest version.authored_at in
the database. Because authored_at depends on either built_at or
created_at
, I added a scope to replicate this behavior via
SQL and get the minimum. This should only be 1 additional query for one column.

I modified existing tests to check for the "versions since" text and
also have 2 pages of actual content.

@albertchae albertchae marked this pull request as ready for review February 17, 2024 06:26
@@ -87,27 +87,37 @@ class VersionsControllerTest < ActionController::TestCase
end

assert_select ".gem__version__date sup", text: "*", count: 1

assert_select ".t-list__heading", text: /1 versions since January 01, 2000/, count: 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"1 versions" and not "1 version" is an existing bug that could probably be fixed using https://guides.rubyonrails.org/i18n.html#pluralization in

versions_since: "%{count} versions since %{since}"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted. Would you mind to open PR to fix this please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR for this is #4466

def set_oldest_version_date
oldest_created_at = @rubygem.versions.order(:created_at).first
oldest_built_at = @rubygem.versions.order(:built_at).first
@oldest_version_date = [oldest_created_at, oldest_built_at].compact.map(&:authored_at).min
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to move this logic into SQL side using MIN function and get the oldest version date in 1 query directly? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this should be possible.

I used rubygem id=183850 as an example, which is https://rubygems.org/gems/sorbet-static. It has the most versions (10991) in the database as of the Feb 12 dump on https://rubygems.org/pages/data.

I first tried this

SELECT
    min(least (created_at, built_at))
FROM
    versions
WHERE
    rubygem_id = 183850;

However, this doesn't exactly replicate the behavior in the application code

def authored_at
return built_at if rely_on_built_at?
created_at
end
# Originally used to prevent showing misidentified dates for gems predating RubyGems,
# this method also covers cases where a Gem::Specification date is obviously invalid due to build-time considerations.
def rely_on_built_at?
return false if created_at.to_date != RUBYGEMS_IMPORT_DATE
built_at && built_at <= RUBYGEMS_IMPORT_DATE
end

In fact, for sorbet-static, the SQL query above returns 2019-05-17 00:00:00 but Rubygem.find(183850).versions.map(&:authored_at).min returns Sat, 18 May 2019, so there is a one day difference.

If we want to duplicate the application code into SQL, I believe this achieves the same thing

SELECT
    min(
        CASE WHEN DATE(created_at) = '2009-07-25'
            AND built_at < created_at THEN
            built_at
        ELSE
            created_at
        END) AS authored_at
FROM
    versions
WHERE
    rubygem_id = 183850;

So to generate this query, I could replace this code with

@oldest_version_date = @rubygem.versions.minimum("case when DATE(created_at) = '#{Version::RUBYGEMS_IMPORT_DATE}' and built_at < created_at then built_at else created_at end")

or if we prefer Arel

versions = Version.arel_table

created_at_date = Arel::Nodes::NamedFunction.new(
  'DATE',
  [versions[:created_at]]
)

rely_on_built_at = created_at_date.eq(Version::RUBYGEMS_IMPORT_DATE).and(versions[:built_at].lt(versions[:created_at]))
authored_at_case_statement = Arel::Nodes::Case.new
                                 .when(rely_on_built_at).then(versions[:built_at])
                                 .else(versions[:created_at])
@oldest_version_date = @rubygem.versions.minimum(authored_at_case_statement)

Do you think it's worth the additional complexity of the case statement to replicate the same behavior as Version#authored_at? I don't fully understand where built_at comes from, especially when a gem doesn't predate RUBYGEMS_IMPORT_DATE, so I'm not sure what the implications are of blindly using least(created_at, built_at) and would probably lean towards duplicating the authored_at logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also checked the performance of these 2 queries vs the min+least or case statement. Surprisingly, at least on my local database, while the query planner thinks the separate checks on created_at and built_at have lower cost, the execution time of either the min+least or case statement are both much shorter.

rubygems_development=# explain analyze select min(built_at) from versions where rubygem_id=183850;
                                                                            QUERY PLAN                                                            >
-------------------------------------------------------------------------------------------------------------------------------------------------->
 Result  (cost=41.96..41.97 rows=1 width=8) (actual time=2365.886..2365.896 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..41.96 rows=1 width=8) (actual time=2365.874..2365.875 rows=1 loops=1)
           ->  Index Scan using index_versions_on_built_at on versions  (cost=0.43..429251.91 rows=10336 width=8) (actual time=2365.871..2365.872 >
                 Index Cond: (built_at IS NOT NULL)
                 Filter: (rubygem_id = 183850)
                 Rows Removed by Filter: 1071279
 Planning Time: 1.692 ms
 Execution Time: 2366.057 ms
(9 rows)
rubygems_development=# explain analyze select min(created_at) from versions where rubygem_id=183850;

                                                                             QUERY PLAN                                                           >
-------------------------------------------------------------------------------------------------------------------------------------------------->
 Result  (cost=41.88..41.89 rows=1 width=8) (actual time=2718.618..2718.619 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..41.88 rows=1 width=8) (actual time=2718.605..2718.606 rows=1 loops=1)
           ->  Index Scan using index_versions_on_created_at on versions  (cost=0.43..428426.01 rows=10336 width=8) (actual time=2718.604..2718.60>
                 Index Cond: (created_at IS NOT NULL)
                 Filter: (rubygem_id = 183850)
                 Rows Removed by Filter: 1068171
 Planning Time: 22.439 ms
 Execution Time: 2719.168 ms
(9 rows)
rubygems_development=# explain analyze select min(least(created_at, built_at)) from versions where rubygem_id=183850;
                                                                      QUERY PLAN                                                                  >
-------------------------------------------------------------------------------------------------------------------------------------------------->
 Aggregate  (cost=30761.62..30761.63 rows=1 width=8) (actual time=187.097..187.098 rows=1 loops=1)
   ->  Bitmap Heap Scan on versions  (cost=196.53..30709.94 rows=10336 width=16) (actual time=20.778..185.133 rows=10991 loops=1)
         Recheck Cond: (rubygem_id = 183850)
         Heap Blocks: exact=6022
         ->  Bitmap Index Scan on index_versions_on_rubygem_id  (cost=0.00..193.95 rows=10336 width=0) (actual time=18.283..18.284 rows=10991 loop>
               Index Cond: (rubygem_id = 183850)
 Planning Time: 0.541 ms
 Execution Time: 187.278 ms
(8 rows)
rubygems_development=# explain analyze select min(case when created_at::date = '2009-07-25'::date and built_at < created_at then built_at else created_at end) as authored_at from versions where rubygem_id=183850;
                                                                     QUERY PLAN                                                                   >
-------------------------------------------------------------------------------------------------------------------------------------------------->
 Aggregate  (cost=30813.30..30813.31 rows=1 width=8) (actual time=94.130..94.132 rows=1 loops=1)
   ->  Bitmap Heap Scan on versions  (cost=196.53..30709.94 rows=10336 width=16) (actual time=4.185..90.765 rows=10991 loops=1)
         Recheck Cond: (rubygem_id = 183850)
         Heap Blocks: exact=6022
         ->  Bitmap Index Scan on index_versions_on_rubygem_id  (cost=0.00..193.95 rows=10336 width=0) (actual time=3.377..3.378 rows=10991 loops=>
               Index Cond: (rubygem_id = 183850)
 Planning Time: 2.007 ms
 Execution Time: 94.376 ms
(8 rows)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albertchae thanks for detailed research on this. I'm also using sorbet-static for "extreme" cases performance testing, good choice!

I did quick check in production DB and I see similar pattern. It is most likely caused by uneven distribution of versions (thanks for this extreme sorbet-static gem). I did quick test also on more common case like nokogiri gem and select min(case when created_at::date = '2009-07-25'::date and built_at < created_at then built_at else created_at end) as authored_at from versions where rubygem_id=? query works well in all cases and if I understand it well, it will keep the same value logic as today (but with bug fixed).

So for now, I would prefer to add @oldest_version_date = @rubygem.versions.minimum("case when DATE(created_at) = '#{Version::RUBYGEMS_IMPORT_DATE}' and built_at < created_at then built_at else created_at end") (maybe hide the SQL into model somehow) in controller.

And for the future, we can check if there's any way to synchronise dates in DB with RUBYGEMS_IMPORT_DATE constant and normalize them in DB so one MIN(...) clause will be enough.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for now, I would prefer to add @oldest_version_date = @rubygem.versions.minimum("case when DATE(created_at) = '#{Version::RUBYGEMS_IMPORT_DATE}' and built_at < created_at then built_at else created_at end") (maybe hide the SQL into model somehow) in controller.

Sounds good. I pushed up a commit to add this as a scope to the Version model. One small note, I changed the query to use <= to match the application code exactly, but I think we get the same result with or without the =

And for the future, we can check if there's any way to synchronise dates in DB with RUBYGEMS_IMPORT_DATE constant and normalize them in DB so one MIN(...) clause will be enough.

This sounds like a good long term fix!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I pushed it up as a separate commit for ease of review, but let me know if you'd like me to squash them. I also edited the PR description to match the new implementation.

Copy link

codecov bot commented Feb 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1d33ca5) 97.15% compared to head (c26f464) 97.15%.
Report is 6 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4460   +/-   ##
=======================================
  Coverage   97.15%   97.15%           
=======================================
  Files         391      391           
  Lines        8259     8264    +5     
=======================================
+ Hits         8024     8029    +5     
  Misses        235      235           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

albertchae added a commit to albertchae/rubygems.org that referenced this pull request Feb 19, 2024
From rubygems#4460 (comment)

Added a test to verify this string is being pluralized properly.

Also added every locale with varying degrees of confidence
- es/pt-BR: I know enough to be fairly confident this is correct
- ja: I know enough to be fairly confident this is correct and verified
  with native speaker
- fr: I verified with native speaker
- de/nl: I used online translation tools to check singular vs plural
- zh-CN/zh-TW: I don't think they pluralize, but I just copied the
  existing string into both `other` and `one` which maintains behavior
  before this PR anyway.
simi pushed a commit that referenced this pull request Feb 19, 2024
From #4460 (comment)

Added a test to verify this string is being pluralized properly.

Co-authored-by: Albert Chae <albertchae@users.noreply.github.com>
@albertchae albertchae force-pushed the 4448-correct-oldest-version-date branch from 0d2f04e to d286515 Compare February 20, 2024 09:02
created_at
END
SQL
)&.in_time_zone
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum query by itself returns a Time but #authored_at returns an ActiveSupport::TimeWithZone. So I added in_time_zone to convert the Time ➡️ ActiveSupport::TimeWithZone.

But I don't think it is strictly necessary because the equality check in the tests pass without the conversion and the rendering on the versions page is the same as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is no timezone conversion currently happening in the codebase and all times are just printed to the page as stored in DB. IMHO not needed as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, removed in_time_zone

…page

Fixes rubygems#4448

Previously, the "since date" was using oldest `version.authored_at` for a given
page. This change fixes it to get the oldest `version.authored_at` in
the database. Because `authored_at` depends on either `built_at` or
`created_at`, I fetch both and take the min. This should only be 2
additional queries of 1 record each.

I modified existing tests to check for the "versions since" text and
also have 2 pages of actual content.
Replicate `Version#authored_at` logic in a SQL case statement and call
SQL `minimum` on that to get the oldest `authored_at`

Add tests for the model using similar structure as tests for `authored_at`
@albertchae albertchae force-pushed the 4448-correct-oldest-version-date branch from d286515 to b146132 Compare February 20, 2024 09:11
@@ -53,6 +53,19 @@ class Version < ApplicationRecord # rubocop:disable Metrics/ClassLength
validate :metadata_attribute_length
validate :no_dashes_in_version_number, on: :create

scope :oldest_authored_at, lambda {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to make this class level method instead of scope? Scope is usually used to return collection (is chainable).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah that makes sense, changed to a class method and moved closer to #authored_at

…ethod

Also now that `.oldest_authored_at` is a class method and not a scope,
moved it next to `#authored_at
@albertchae
Copy link
Contributor Author

@simi thanks for all the reviews and comments so far. Was there anything else you wanted me to address for this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"All versions of #{gem} since #{date}" has paginated date
2 participants