Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV task fails if uuid is used for active_storage_attachments.record_id #462

Open
shouichi opened this issue Jul 29, 2021 · 15 comments
Open

Comments

@shouichi
Copy link
Contributor

shouichi commented Jul 29, 2021

Background

We use uuid for primary keys, thus active_storage_attachments.record_id is uuid. On the other hand, maintenance_tasks_runs.id is bigint. Therefore, when a CSV task tries to upload a CSV as an ActiveStorage::Attachment, the attachment fails to reference the CSV task's run record (MaintenanceTasks::Run) because of the type inconsistency.

Possible solutions

  • Document that the generated migration file must be manually edited when uuid is used.
  • Respect g.orm :active_record, primary_key_type: :uuid when generating the migration file.

Thanks in advance.

@shouichi shouichi changed the title CSV fails if uuid is used for active_storage_attachments.record_id CSV task fails if uuid is used for active_storage_attachments.record_id Jul 29, 2021
@shouichi
Copy link
Contributor Author

Unfortunately simply changing maintenance_tasks_runs.id to uuid did not work because of the following code.

id: available_task_runs.select("MAX(id) as id").group(:task_name)

@etiennebarrie
Copy link
Member

Thanks for the report. Indeed we'd want to support primary_key_type: :uuid, but not sure how to deal with this query yet.

@shouichi
Copy link
Contributor Author

shouichi commented Aug 4, 2021

The project only uses sqlite for testing. It seems that the first step is using both sqlite and postgres for testing. Is this the right direction @etiennebarrie ?

@etiennebarrie
Copy link
Member

Yes, if it's necessary. I wonder if we could just test UUIDs with sqlite?

@shouichi
Copy link
Contributor Author

shouichi commented Aug 5, 2021

Mmm, maybe we could use blob as a primary key and store uuid. I'll take a look.

@shouichi
Copy link
Contributor Author

I tried using string type as a primary key but it seems this comes with a lot of trouble because we cannot rely on database to generate ids. Personally testing both sqlite and postgress seems the way to go because it truly tests type: :uuid. @etiennebarrie What do you think?

@shouichi
Copy link
Contributor Author

Changed primary keys to uuid, ran tests, and found that the current code depends on monotonically increasing ID more than I thought. For example, Run.last or Run.where("id < ?", @cursor). It would be better to change incrementally. The first step should be to stop depending on monotonically increasing IDs. Does this make sense?

@etiennebarrie
Copy link
Member

etiennebarrie commented Aug 30, 2021

I'm not sure how which approach we should take:

  • stop depending on monotonically-increasing IDs
  • make all code dependencies on monotonically-increasing IDs abstract so we can change the behavior when using UUIDs, but keep the behavior as it is for existing users.

For the first solution: from your investigation, do we need a migration, or can we rely on the existing (task_name, created_at) index? It seems that would be the case for last run and pagination (runs#show), but for the query you mentioned above (#462 (comment)), is that the case too?

If we can avoid a migration, then the first solution is great: we don't need two code paths for UUIDs. But if we need more work for existing users, I think the second solution would be preferable.

In these situations, I tend to do a spike so that I can see all the moving parts, even if I end up shipping change incrementally. I would favor having a branch where we can see everything we will need, even things aren't perfect or tests fail.

@shouichi
Copy link
Contributor Author

shouichi commented Sep 1, 2021

Thanks for the advice.

If we can avoid a migration, then the first solution is great: we don't need two code paths for UUIDs. But if we need more work for existing users, I think the second solution would be preferable.

Honestly, I'm not sure if we can avoid migration or not.

In these situations, I tend to do a spike so that I can see all the moving parts, even if I end up shipping change incrementally. I would favor having a branch where we can see everything we will need, even things aren't perfect or tests fail.

OK. I'm going to open a PR that simply changes pk type to uuid so we can discuss it more in detail. Then, I'm also going to open several PRs against the first PR branch. So that PRs stay small (though the last merge might end up large).

@dhnaranjo
Copy link

dhnaranjo commented Nov 28, 2022

I found myself in this situation and decided to hack it out with some monkeypatching and I thought I'd just offer this to anyone else who has a similar short-sighted need.

I'm using Zeitwerk overrides as described here: https://guides.rubyonrails.org/classic_to_zeitwerk_howto.html#decorating-classes-and-modules-from-engines

It uses ActiveRecordExtended for a window function and CTE because I was already using it, I'm sure we wouldn't want to pull this dependency into the gem but it could certainly be written without it.

# app/overrides/maintenance_tasks/run_override.rb

module MaintenanceTasks
  class Run < ApplicationRecord
    scope :last_completed_per_task, -> do
      window_cte = define_window(:ended_at_window)
        .partition_by(:task_name, order_by: {ended_at: :desc})
        .select(:id)
        .select_window(:row_number, over: :ended_at_window, as: :ended_position)

      completed
        .with(run_ended_positions: window_cte)
        .joins("JOIN run_ended_positions ON run_ended_positions.id = maintenance_tasks_runs.id")
        .where(run_ended_positions: {ended_position: 1})
    end
  end
end
# app/overrides/maintenance_tasks/task_data_index_override.rb

module MaintenanceTasks
  class TaskDataIndex
    class << self
      def available_tasks
        tasks = []

        task_names = Task.available_tasks.map(&:name)

        active_runs = Run.with_attached_csv.active.where(task_name: task_names)
        active_runs.each do |run|
          tasks << TaskDataIndex.new(run.task_name, run)
          task_names.delete(run.task_name)
        end

        last_runs = Run.with_attached_csv.last_completed_per_task.where(task_name: task_names)
        task_names.map do |task_name|
          last_run = last_runs.find { |run| run.task_name == task_name }
          tasks << TaskDataIndex.new(task_name, last_run)
        end

        # We add an additional sorting key (status) to avoid possible
        # inconsistencies across database adapters when a Task has
        # multiple active Runs.
        tasks.sort_by! { |task| [task.name, task.status] }
      end
    end
  end
end

The meat of the change above is substituting the old:

        completed_runs = Run.completed.where(task_name: task_names)
        last_runs = Run.with_attached_csv.where(id: completed_runs.select("MAX(id) as id").group(:task_name))

for the new scope:

        last_runs = Run.with_attached_csv.last_completed_per_task.where(task_name: task_names)

With no other changes

# db/migrate/###_maintenance_tasks_runs_uuid_pkey.rb

class MaintenanceTasksRunsUuidPkey < ActiveRecord::Migration[7.0]
  def change
    add_column :maintenance_tasks_runs, :uuid, :uuid, default: "gen_random_uuid()", null: false
    change_table :maintenance_tasks_runs do |t|
      t.remove :id
      t.rename :uuid, :id
    end
    execute "ALTER TABLE maintenance_tasks_runs ADD PRIMARY KEY (id);"
  end
end

And now I can run CSV tasks. No tests, no promises, works for me.

@gmcgibbon
Copy link
Member

gmcgibbon commented Jan 19, 2024

Can we not refactor to use run timestamps in all of these cases? If someone is interested in issuing a PR where we use created_at and not id, I think that would solve both problems identified in this issue. There may be further issues, but IMO refactoring to not depend on incrementing IDs is better in the long run.

Copy link

This issue has been marked as stale because it has not been commented on in two months.
Please reply in order to keep the issue open. Otherwise, it will close in 14 days.
Thank you for contributing!

@github-actions github-actions bot added the stale label Mar 20, 2024
@karnowski
Copy link

Any movement on this? We're still impacted.

@github-actions github-actions bot removed the stale label Apr 2, 2024
@etiennebarrie
Copy link
Member

Are you interested in contributing? I think with the task_name, status, created_at DESC index, we could change this query to use timestamps like @gmcgibbon said. The main blocker for us maintainers is that we don't use UUID as primary keys, so we're not really confident in contributing this, but as mentioned above, it's something we want to support since it's in Rails/Active Record.

@karnowski
Copy link

My team is currently pretty heads down, but we could potentially look at it later this year. If this issue gets closed as stale, can we reopen it later in the year when we're able to better contribute?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants