Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify the reasoning for Define Model Class Migrations #284

Open
pirj opened this issue Apr 9, 2021 · 8 comments
Open

Clarify the reasoning for Define Model Class Migrations #284

pirj opened this issue Apr 9, 2021 · 8 comments

Comments

@pirj
Copy link
Member

pirj commented Apr 9, 2021

  1. https://rails.rubystyle.guide/#define-model-class-migrations claims that

or cause serious data corruption

Can you think of a use case when an old migration could cause a serious data corruption?
Yes, it will most probably break db:migrate on a clean database.

  1. There's yet another reason not to use real models in migrations.
    If you refer to a real model in an older migration (e.g. for update_all), Rails will cache column information.
    And subsequent migrations that would alter that table's schema won't affect the column cache.
    This will lead to breakages due to an outdated (fixed for the moment when the model was first referred to in migrations) table schema. Specifically, if you add a column in a later migration and update_all that new column, this will result in an error.
    Also, if you run rake db:migrate db:seed in one command, this will also affect your seeds, and it will also fail with:
ActiveModel::UnknownAttributeError: unknown attribute 'some_attr' for ModelName
@andyw8
Copy link
Contributor

andyw8 commented Apr 9, 2021

For reference, this was originally added in #205 and discussed in #143

@pirj
Copy link
Member Author

pirj commented Jul 19, 2021

There are two more things that recently hit me:

  1. We suggest defining the class on the top level:
class MigrationProduct < ActiveRecord::Base
  self.table_name = :products
end

class ModifyDefaultStatusForProducts < ActiveRecord::Migration

while the MigrationProduct can be defined by multiple migrations.
If it happens that self.table_name = is different, the last one wins.
I don't have a good example, maybe something like:

class MigrationUser < ActiveRecord::Base
  self.table_name = 'accounts' # not 'users'
end

The suggestion would be to define migration model classes inside migration classes:

class ModifyDefaultStatusForProducts < ActiveRecord::Migration
  class MigrationProduct < ActiveRecord::Base
    self.table_name = :products
  end
  ...

@pirj
Copy link
Member Author

pirj commented Jul 19, 2021

  1. Migration models are loaded as needed in development, while loaded all at once in production with eager_load. (this needs to be confirmed with the Rails code, if the migrator loads them or it loads the whole db/migrate and instantiates migration classes as needed.

Still, the issue was as following:

# migration 1
class AddCategoryToFoos < ActiveRecord::Migration[6.0]
  def change
    add_column :foos, :category, :string
  end
end

# migration 2
class BackfillCategoryToFoos < ActiveRecord::Migration[6.0]
  class MigrationFoo < ActiveRecord::Base
    self.table_name = 'foos'
  end
  def change
    MigrationFoo.find_each { |foo| foo.update!(category: 'primary') } # <= The real problem is here
  end
end

# migration 3
class RestrainCategoryToFoosToNotNull < ActiveRecord::Migration[6.0]
  def change
    change_column_null :foos, :category, false # BOOM, `PG::NotNullViolation: ERROR: column "category" contains null values`
  end
end

The problem is with foo.update!(category: 'primary').
ActiveRecord::Dirty has no awareness of the recently added category column, and save! decided not to do an UPDATE.

I don't have a good explanation as to why this happened. My theory is that if those two last migrations were executed in the same run, Rails loaded both, and immediately cached model's schema for MigrationFoo.
This was also happening on local.

Might be related to:

config.active_record.use_schema_cache_dump enables users to get schema cache information from db/schema_cache.yml (generated by rails db:schema:cache:dump), instead of having to send a query to the database to get this information. Defaults to true.

But I couldn't find db/schema_cache.yml locally.

One suggestion would be to dynamically create model classes instead:

def change
  MigrationFoo = Class.new(ActiveRecord::Base) { self.table_name = 'foos' }
  
  MigrationFoo.find_each { |foo| foo.update!(category: 'primary') }
end

but I can't recommend it without testing it. And I have some doubts if schema cache will be updated for a new model pointing to the same table, or an existing cached definition (that e.g. lacks category column info) will be used.

@pirj
Copy link
Member Author

pirj commented Sep 16, 2021

Another approach is to use:

def change
  MigrationFoo.reset_column_information
  
  MigrationFoo.find_each { |foo| foo.update!(category: 'primary') }
end

See https://stackoverflow.com/a/9728033/202914 for explanation.
Otherwise, an error is raised:

unknown attribute 'category' for MyCoolMigration::MigrationFoo.

Apparently, migrations are loaded all at once, or somehow column information is shared between MyCoolMigration::MigrationFoo and MyPreviousMigration::MigrationFoo.

I'm on the fence if it's worth adding to the guide, as it's for this relatively rare case when a project tries to keep migrations from early days in a state that allows running rails db:create db:migrate rather than rails db:create db:schema:load.

@pirj
Copy link
Member Author

pirj commented Oct 3, 2021

reset_column_information is what Rails suggests using:

Sometimes you'll want to add a column in a migration and populate it immediately after. In that case, you'll need to make a call to reset_column_information in order to ensure that the model has the latest column data from after the new column was added.

@pirj
Copy link
Member Author

pirj commented Dec 17, 2021

One case when it's not possible to properly use migration model in migration when you use it to schedule an Active Job job:

class MigrationUser < ActiveRecord::Base
  self.table_name = :users
end

def up
  MigrationUser.find_each { |user| SyncWithThirdParty.perform_later(user) }
end

as this will end up with a failure to deserialize it, as the global id would use a weird model name.

As a workaround, global id key should be overridden in the migration model.

@khamusa
Copy link

khamusa commented Apr 4, 2022

An example of data corruption (logical):

  1. You write a migration that adds a new nullable column, and you write code to make sure that certain types of records get backfilled with some type of value. For this, you use ModelName.where(...).update_all(...)
  2. If you have a gem such as act_as_paranoid or any code that added a default scope to your model, migrating data this way will prevent records not matching your default scope from being updated.

In the case of act_as_paranoid, that could not be a big issue since it will likely exclude soft-deleted records, which you may not care that much (I do, though). If you try, however, to update your newly added column to be non-nullable, that could break your production deployment because of the presence of soft-deleted records (which would still be null as they were excluded by the default scope), while it's very likely you won't notice this in development (unless you're constantly soft-deleting records locally, which most developers won't do).

One could argue this is a problem with the usage of default scopes, and I'd also agree. Doesn't change the fact that avoiding direct references to your models in migrations prevents this kind of error.

Thought it was worth sharing.

@matteeyah
Copy link

I agree with not using regular models in migrations - that's just asking for trouble, but I strongly disagree with defining model classes inside of migration - that is too cumbersome and goes against the first principle of the rails doctrine (Optimize for programmer happiness), since it's a fact that it causes headaches for engineers.

I propose just using raw SQL to change data in a migration like this. I know some people will disagree (maybe even barf a little) when reading this. However, this produces simpler and cleaner code, is much easier for developers to use, but is controversial and not widespread. It's very common in Basecamp (https://discuss.rubyonrails.org/t/patterns-for-data-only-rails-migrations/75320/10) and plays nicely with what the migrations were designed to be. I'll quote the linked reply from DHH here for ease of readability

We mix data and sql statements in migrations all the time at Basecamp. This is what migrations were designed to do! They’re intended to be moments in time. Not to be replayable forever. The stable unit is schema.rb (or structure.sql).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants