
Add bulk upsert to lucky #789

Draft
robcole wants to merge 1 commit into main from add-bulk-upsert-to-lucky

Conversation


@robcole robcole commented Jan 12, 2022

Adds an upsert overload for Arrays to create large numbers of records at the same time.
Uses PG's UNNEST behavior to allow for a near-infinite (buyer beware) number of insertions
rather than being limited by PG's bind parameter restrictions (64k total binds, which would prevent
more than a few thousand upserts at a time depending on the number of column inserts).
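
For context, here is a minimal sketch of the SQL shape this approach produces. It is illustrative only; the table, columns, connection URL, and exact clauses are assumptions, not the literal SQL this PR generates:

require "db"
require "pg"

# Assumes a `users` table with a unique index on (email, username).
# Each column is bound once as an array parameter; UNNEST expands the
# arrays into rows, so the bind count stays at one per column instead
# of one per value.
emails    = ["a@example.com", "b@example.com"]
usernames = ["a", "b"]

sql = <<-SQL
  INSERT INTO users (email, username)
  SELECT * FROM UNNEST($1::text[], $2::text[])
  ON CONFLICT (email, username)
  DO UPDATE SET username = EXCLUDED.username
  SQL

DB.open("postgres://localhost/my_app_development") do |db|
  db.exec(sql, args: [emails, usernames])
end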

Common use cases for this include building large numbers of records for insertion into stat grouping tables (similar to materialized views, but with more atomicity for which records are refreshed), loading records from CSV files or JSON files into a model, etc.

Example uses (pseudocode):

# Rollup Models
answer_stat_data = AnswerQuestion.new.correct(true).group_by(:user_id).group_count.map do |count_hash|
  { user_id: count_hash.keys.first, correct_count: count_hash.values.first }
end

UserStat.upsert!(answer_stat_data)

# Loading Data from Files
user_marketing_preferences = CSV.parse(File.read("user_opt_in.csv")).map do |profile|
  { user_id: profile[0], subscribed_on: profile[1] }
end

UserMarketingProfile.upsert(user_marketing_preferences)

# Default Relationships for models
class Draw < BaseModel
  table do
    column start_at : Time
    belongs_to bonspiel : Bonspiel

    has_many games : Game
  end

  def games!
    return previous_def.sort_by(&.sheet) unless previous_def.empty?

    SaveGame.upsert(
      bonspiel!.sheets.times.map do |n|
        {sheet: (n + 1).to_i16, draw_id: self.id}
      end.to_a
    )
  end
  ...
end

@robcole robcole marked this pull request as draft January 12, 2022 05:30
private def set_timestamps(collection)
  collection.map do |record|
    record.created_at.value ||= Time.utc if record.responds_to?(:created_at)
    record.updated_at.value = Time.utc if record.responds_to?(:updated_at)
@robacarp robacarp (Contributor) Jan 12, 2022

Could these be NOW() statements instead? Especially when you're iterating thousands of rows it's worth saving as much time as you can.

Contributor

Yeah, we were considering this! The one thing we needed to work through for this was a MAJOR data changeup -- these would then need to be NOW() instead of bound $n params, which changes up the models a lot.
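
For illustration, a hedged sketch of the difference (table and column names are assumptions, not code from this PR):

# Binding a timestamp means shipping a value over as a parameter for every batch;
# letting the database stamp the rows keeps it out of the binds entirely:
sql_with_now = <<-SQL
  INSERT INTO users (email, created_at, updated_at)
  SELECT email, NOW(), NOW()
  FROM UNNEST($1::text[]) AS t(email)
  SQL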

Member

Related: #393 (note the issue on updated_at here that would require a trigger to know when to change)
Somewhat related: luckyframework/lucky#1583

@jwoertink jwoertink (Member) left a comment

I LOVE that you guys are tackling this. I know it's a super difficult problem to solve, and to make it extra hard, we have to keep additional considerations in mind with the Lucky core goals.

  • Is this type-safe? (as in providing more safety than just what Crystal gives you)
  • Does this help me catch bugs at compile-time?
  • Does it add security by ensuring bad data can't be used?

I know the current upsert doesn't quite work like a normal DB upsert. @paulcsmith wanted to add ON CONFLICT support directly into the query builder, which would add some extra support. You can read up on the original PR #334.

Also, while you're working on this, keep in mind how this would look coming from an HTML form. How do the params get built and passed over to this? One tip I've learned while working on Lucky is: don't be afraid to break away from the normal "Rails mindset" coding styles. Sometimes an API that looks a bit funky / off-putting at first may end up being a better solution in the long run. A few extra characters of typing could mean a few fewer specs and more confidence!

Feel free to hit me up if you have any questions I can answer. I'll keep an eye on the PR and try to chime in when I can. Stoked to see this! 🚀

@grepsedawk (Contributor)

keep in mind how this would look coming from an HTML form. How do the params get built and passed over to this?
@jwoertink

Anything that is manual (from a form) I'd generally consider "small enough" to not care about utilizing bulk operations directly from an HTML form.

To me, bulk operations are more for:

  • CSV imports
  • API requests/syncs (in our production use case, syncing entire gmail inboxes from the gmail API)
  • generative data (generate 10k records for a seed DB)

@robacarp robacarp (Contributor) commented Jan 12, 2022

The tricky part with this is how delicate it is to provide a nice interface to the developer while still retaining as much compile time safety as possible. It may be possible to do both in an easy way by dividing out the concerns here. Consider the string builder approach:

Instead of:

"this is a title #{data} #{data.metadata ? data.processed : ""}"

String builder has an intuitive and easy to read interface:

String.build do |b|
  b << "this is a title "
  b << data
  b << data.processed if data.metadata
end

In the same way I wonder if it's possible or perhaps even more safe to iteratively build the upsert. Here's some "imagine code" that hopefully shows what I mean:

class BulkUserOperation < UserOperation
end

upsert_op = BulkUserOperation.build_upsert do |upserter|
  DataSource.each_row do |row|
    upserter << upserter.bind row
    upserter.commit
  end
end

upsert_op.finish

Here are my assumptions:

  • Bulk data operations are consuming some sort of firehose -- a giant CSV, or whatever. We don't want to attempt to store this entire firehose in memory because it could never end. Consider an event stream, e.g. Kinesis or the Twitter firehose.
  • DataSource.each_row yields tuples to the block, maybe named tuples
  • Builder provides a way to bind a tuple to something more concrete. If the bind fails, perhaps a raise or perhaps it simply counts the invalid data. Counting invalid data is often what I want in any bulk operation.
  • #<< runs validations or other before-transformations as defined in the Operation, and then stashes the data in a temporary list.
  • commit would allow the Operation to decide when to emit a batch of records. Some logic or configuration to determine how frequent a commit is, similar to #upsert_lookup_columns in the current API.
  • finish would emit any remaining records

Handling bulk data operations is tricky, and handling upserts properly is tricky too. Perhaps Avram needs some better tooling around bulk data operations independent of upserts.

@grepsedawk (Contributor)

Nice to haves:

  • Throw a clean error when the unique key (UK) does not exist

@robacarp (Contributor)

After looking over this and watching the discussion, I'm aware that there were at least three different assumptions about what a bulk insert/upsert might be used for. I observed these assumptions:

  • Bulk upsert would be used for consuming a CSV or other finite data source, outside any http request cycle. I think this is the assumption under which much of this code is written. This might be a one-time job (switching database engines) or it might be a daily job (ingesting some CSV from an LMS, etc).
  • Stream consumption. An infinite data source which may be moving incredibly fast and never end. This was my assumption and Twitter's Compliance Firehose is one such datastream.
  • Backing for a REST API endpoint which would allow some api customer to post bulk data. For example marking a collection of emails as "Read" in an inbox app.

They're not fundamentally incompatible, nor is it absolutely necessary that whatever is implemented here facilitate all three of these. I do think it might be helpful to identify the use cases which need to be discussed. It might provide some clearer guidance around validations, compile time guards, etc.

@grepsedawk (Contributor)

This might be a one-time job (switching database engines) or it might be a daily job (ingesting some CSV from an LMS, etc).

Or inside the request cycle, like a user uploading a CSV.

@grepsedawk (Contributor)

will/crystal-pg#244 needs to be merged/released or monkeypatched into avram before this is viable
/cc @robcole

@matthewmcgarvey matthewmcgarvey (Member) left a comment

Catching up on all the work y'all have done and threw some comments in there 👍

@robcole robcole (Author) commented Jan 20, 2022

Nice to haves:

  • Throw a clean error when the unique key (UK) does not exist

@grepsedawk totally forgot about this one, lmk if you want to pair on it later this week too

@robcole robcole force-pushed the add-bulk-upsert-to-lucky branch 4 times, most recently from 17b0f97 to 8284f68 on January 21, 2022 03:47
@jwoertink jwoertink (Member) left a comment

Honestly, this is pretty chonky, but it doesn't look "crazy".... I don't have any immediate use for it, but from what it sounds like, @grepsedawk used it and it seems to work?

The only thing I'm not seeing (and it may be ok for now) is no way to use this with Lucky::Params, right?

@robcole Can you just update the original post for this PR with an example code of how someone would use this? That will give me (and other readers) an easy overview of what this PR is really doing.

@robcole robcole (Author) commented Jan 25, 2022

Honestly, this is pretty chonky, but it doesn't look "crazy".... I don't have any immediate use for it, but from what it sounds like, @grepsedawk used it and it seems to work?

The only thing I'm not seeing (and it may be ok for now) is no way to use this with Lucky::Params, right?

@robcole Can you just update the original post for this PR with an example code of how someone would use this? That will give me (and other readers) an easy overview of what this PR is really doing.

It was designed for bulk importing, which isn't something I often do with params. I'll see if I can find an example where it might be a fit, but you would usually be iterating over a collection of responses from some source (API, CSV, or internal model building, typically). Will add in the examples of using it today or tomorrow.

I think you’d just need an array of params since we use the same general interface as upsert and it should work fine, but I haven’t tested that yet.

@jwoertink (Member)

I think you’d just need an array of params since we use the same general interface as upsert

Ok, I think that's fine. We can worry about adding in support from params later then. I think your usage makes sense, so just a quick code example. Just a few lines, nothing crazy, and that would be good!

@jwoertink (Member)

Overall, I love the direction, and I think this is a super useful feature. There are just a few things that concern me, so I'd like to talk about possible solutions or directions we can take. And please forgive me if any of this is incorrect and I missed something.

Right now, this implementation piggybacks off of SaveOperations to help keep things a bit more consistent and make use of some existing methods so we don't end up duplicating things 👍 However, this leads to several new issues...

For example, say I have a normal SaveUser operation.

class SaveUser < User::SaveOperation
  upsert_lookup_columns :email, :username
  upsert_unique_on :email, :username
  needs passcode : Int32
  attribute password : String

  before_save do
    validate_required(password)
    validate_email_format(email)
    validate_unique(username)
  end

  after_commit do |user|
    EmailUser.new(user, passcode).deliver
  end
end

# This works fine
SaveUser.upsert(passcode: 1234, email: "...", username: "...") do |op, user|
end

# This leads to some issues
SaveUser.upsert([{email: "...", username: "..."}])
  • The compiler would fail because needs isn't passed in
  • None of the validations or callbacks would run
  • The interface is different which could make it confusing.
  • It doesn't currently support params

So at this level, it's almost not even a SaveOperation.... The lack of validations and callbacks is the biggest part, I think. Assuming you're pulling in bulk data from a CSV which you don't control, there's no way to guarantee that a malformed email won't get inserted into your DB. Since postgres won't handle validations other than NULL or not null, and possibly uniqueness (if you have the index), your app would be at risk of bad data.

Here's my suggestions, but I'm totally open to others as well.

  1. We rename this upsert method to something else.
  • PRO: Less confusion with the current upsert method
  • PRO: Allows us to mentally separate the interface so we don't necessarily need to conform to the method() {} / method! API
  • CON: It doesn't solve the SaveOperation issues with callbacks, validations, etc...
  2. We create a new operation type like Avram::SQLOperation or something... Avram::BulkOperation...
  • PRO: We completely remove ourselves from trying to cram in to a current setup that doesn't really work
  • PRO: We could keep the current name, and usage since it's a whole different object type
  • PRO: It could be expanded later for other uses...
  • CON: It's just another operation type... Which also means even more code to manage
  • CON: It's going to require duplicating a lot of code. (to me, this isn't completely horrible provided we have enough specs...)
  3. We leave this PR as-is, but we add the @[Experimental()] annotation, and ensure that it's documented as an "at your own risk" feature (a rough sketch follows below).
  • PRO: We can merge this sooner than later.
  • CON: We have to come back to this later since it's incomplete.
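
For option 3, a minimal sketch of what the annotation could look like (the module and method shape here are assumptions, not the PR's actual code):

module Avram::Upsert
  # Hedged sketch: the real overload in this PR may be generated by a macro
  # and have a different signature.
  @[Experimental("Bulk upsert is new and its API may change. Use at your own risk.")]
  def self.upsert(records : Array(T)) forall T
    # ... build and run the UNNEST-based bulk insert ...
  end
end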

I'm sure there's other options I may not be thinking of....

@robcole, I know you and @grepsedawk have put in a ton of work into this, and I really appreciate it ❤️ . I want to make sure it's clear that I'm not shooting down all your efforts and hard work. I think having this feature in Avram will be huge! We just need to make sure that features like this live up to Lucky's core goals. I'd be happy to also hop on a voice chat in Discord if either of you (or anyone else) would like to chat more about potential solutions, or just to hash out any concerns.

@grepsedawk grepsedawk (Contributor) commented Jan 26, 2022

I side towards 3.
I really do feel the proper name for this is in a SaveOperation.upsert(Array) and this would allow us to vet out real life uses as time moves on.
Also, WORST CASE with 3 is a user has 2 save operations: 1 for bulk things and 1 for normal things

It allows us all to feel out the API a bit more before making any big changes

Also, I could be wrong but I think the way @robcole did the splat it would actually support the needs of the save operation!

@robcole robcole (Author) commented Jan 26, 2022

I side towards 3. I really do feel the proper name for this is in a SaveOperation.upsert(Array) and this would allow us to vet out real life uses as time moves on. Also, WORST CASE with 3 is a user has 2 save operations: 1 for bulk things and 1 for normal things

It allows us all to feel out the API a bit more before making any big changes

Also, I could be wrong but I think the way @robcole did the splat it would actually support the needs of the save operation!

I also side with 3, but with a note that when we bring in sql operations or bulk operations, I’m 💯 down to work on refactors to bring the two interfaces into more uniformity. I’d like to have a sensible way to have it still optionally yield an operation, but for the current code, it doesn’t feel like it brings enough value to change the API around that much.

A few things I’m less certain of for bulk upsert in general based on how I’ve used it:

  • I’ve never wanted callbacks to run when using it. By default they don’t in Rails, so this is something that could be an option that gets added, but by default, it isn’t something I’ve ever wanted. Frequently these updates have been 500k+ inserts over a matter of seconds, so having callbacks could really be problematic on some common workloads.

  • I think validations should always be optional. I can understand those being a lot more important and seems like an easy extension to add, once there’s some notion of how we want to handle “3 of 50 objects failed validation — what do you want to do?” Similar arguments around collection size might apply here.

The way I designed it should be a somewhat easy port for things like params once we have the following concepts in Lucky to make sure everything feels smooth.

@robcole robcole (Author) commented Jan 26, 2022

@jwoertink In the event you're curious - https://github.com/zdennis/activerecord-import is where I got a lot of the API design inspiration (on top of your existing upsert code) as it's what I used up until Rails finally added it. Validations are optional in it, but the interface around reporting and fixing those validations can be a bit wonky - basically you're doing collection.select(&:invalid?).map(&:errors).

For the recent Rails additions, AFAIK these are all intended to be bulk operations, so callbacks and validations are skipped as well: https://www.bigbinary.com/blog/bulk-insert-support-in-rails-6

Given this, LMK if you've got more thoughts here - definitely appreciating the feedback, as it'll help me make sure I'm on the same page in general with Lucky's API goals.

@jwoertink (Member)

Oh sweet. Thanks for the links @robcole. I'll take a look at those.

Pulling things from Rails can be nice since it's so well established, but it's also a double-edged sword. Since Rails isn't type-safe and doesn't enforce any sort of data integrity, you're more likely to have bad data, which leads to bugs in production. For example, in Rails, an empty string is fine. If the column is String and set to not null in postgres, an empty string would be valid, but in Lucky that's not the case. Empty strings are treated like nil. We also make required column validations required. They're more "opt-out" than they are "opt-in".
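
A hedged illustration of that point (the operation and column names are assumptions):

# In a typical Avram save operation, a required String column rejects "" the
# same way it rejects nil, so the record below would not be saved:
SaveUser.create(name: "") do |operation, user|
  user             # => nil, because the save failed
  operation.valid? # => false; the blank name fails the required validation
end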

I just had another idea.... Since one of the goals for Lucky is to try and always provide an escape hatch for when there's something we can't do at a high-level, what if we just dropped this whole API down a level?

For example, instead of doing this off the SaveOperation, it becomes lower level like

upsert = Avram::BulkUpsert(Thing).new(array_of_named_tuples_for_thing)
upsert.run

After writing that, I didn't really like it, but the premise is still good... At a lower level, we don't need to necessarily follow Lucky convention. This is similar to when you do something like UserQuery.new.where("complex sql ?", "no safety"). That's a small snippet, but the same idea. By doing that, later on we could create a whole new high-level wrapper around it that includes all the other stuff we want!

What are your thoughts on that?

@@ -100,5 +123,9 @@ module Avram::Upsert
\{% raise "Please use the 'upsert_lookup_columns' macro in #{@type} before using '{{ method.id }}'" %}
Contributor

[screenshot]
This is an interesting case: I was passing a map directly, which didn't fall through to self.upsert(Array) OR the standard upsert, but was caught by this method's error when I passed an Enumerable (from the map).
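
A hedged sketch of the situation being described (operation and values are illustrative):

# `.each.map` yields an Iterator rather than an Array, so it doesn't match the
# Array overload and lands on this macro's error instead:
rows = (1..5).each.map { |n| {sheet: n.to_i16, draw_id: 1_i64} }
SaveGame.upsert(rows)      # falls through to the error raised here
SaveGame.upsert(rows.to_a) # materializing with .to_a matches the Array overload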

grepsedawk added a commit to grepsedawk/avram that referenced this pull request Feb 1, 2022
will/crystal-pg@v0.24.0...v0.26.0
crystal-lang/crystal-db@v0.10.1...v0.11.0

This adds the upstream changes from crystal-pg v0.11

This also allows us to remove the monkey patches required for null array
elements in avram, added/utilized by the bulk upsert functionality luckyframework#789
jwoertink pushed a commit that referenced this pull request Feb 1, 2022
will/crystal-pg@v0.24.0...v0.26.0
crystal-lang/crystal-db@v0.10.1...v0.11.0

This adds the upstream changes from crystal-pg v0.11

This also allows us to remove the monkey patches required for null array
elements in avram, added/utilized by the bulk upsert functionality #789
@@ -0,0 +1,8 @@
# Can be removed once https://github.com/will/crystal-pg/pull/244 is merged.
Contributor

This can be removed now #807

Adds an `upsert` overload for Arrays to create large numbers of records at the same time.
Uses PG's UNNEST behavior to allow for a near-infinite (buyer beware) number of insertions
rather than being limited by PG's bind parameter restrictions (64k total binds, which would prevent
more than a few thousand upserts at a time depending on the number of column inserts).

Co-authored-by: Alex Piechowski <alex@piechowski.io>
Co-authored-by: robacarp <robacarp@users.noreply.github.com>
@jwoertink (Member)

@robcole After talking it over with the team, there have been some suggestions: what if we made this a separate shard extension, similar to how we did https://github.com/luckyframework/avram_slugify? If we go that route, you could keep the current implementation as-is. This would allow people to use the feature and give feedback / contribute to really flesh out the different use cases. If you're cool with that, I could even start a new repo under Lucky, and I think I can give you access to that. Let me know your thoughts, or you can hit me up on Discord if you want 👍

@robcole robcole (Author) commented Feb 12, 2022

@jwoertink I think it makes more sense to have this in Avram, but I agree that it isn’t ready yet. Have you done any work on the concept of a BulkOperation or SQLOperation?

My plan was to refactor this and add a BulkOperation which yields the operation in the same way as others, but would have a concept of multiple records, and initially wouldn’t handle validations and callbacks, but could down the road. Haven’t had time to get to the refactor yet but it’s on my list still.

@jwoertink (Member)

Oh, ok. Yeah, I agree that it should be in Avram core too, but just not with the current implementation. If you're open to some refactors, then we can keep this open 👍

I haven't started on any of the bulk stuff. Work has just been way too crazy lately, so I've fallen behind a bit on my Lucky duties ☹️ Once I get back in the swing of things, I do want to tackle bulk stuff in general. I have a few ideas on how I think we can handle it, I just need to put that stuff down on paper 😅

@robcole robcole (Author) commented Feb 12, 2022

Oh, ok. Yeah, I agree that it should be in Avram core too, but just not with the current implementation. If you're open to some refactors, then we can keep this open 👍

I haven't started on any of the bulk stuff. Work has just been way too crazy lately, so I've fallen behind a bit on my Lucky duties ☹️ Once I get back in the swing of things, I do want to tackle bulk stuff in general. I have a few ideas on how I think we can handle it, I just need to put that stuff down on paper 😅

Cool, I’ll take a stab at it soonish (booting two personal projects / business ideas on Lucky now so I’ve been short on time recently too), and hopefully we can piece together what a BulkOperation looks like from here with this PR.

Once I get to that, I’ll also add a couple more tests that help illuminate where this bulk insert is most useful as well, which is stuff like CSV loading and ETL batch operations that write to rollup tables.

@jwoertink (Member)

Sweet! Stoked. And no rush on any of that. I don't plan on another release until around summer, and if it's not ready, we can always do the next release 😄 I just wanted to make sure you weren't waiting on me for anything. Sounds like we're on the same page though 👍 Thanks for being understanding!
