
interest in support for in-memory data? #25

Open
jamesturk opened this issue Jul 8, 2016 · 8 comments

Comments

@jamesturk

I wanted to check whether a patch to support something like:

CopyMapping(model=Entry, data=huge_list_of_unsaved_entries)

would be welcome. I have some code, which I was considering turning into a library, that uses Postgres COPY to replace Django's bulk_create. It takes the form of a function with the signature:

def postgres_bulk_copy(objects, ModelCls=None,
                       table_name=None,
                       fields=None, not_null=None, null=None,
                       ):

I was considering abstracting this into a generic library to use Postgres' COPY TO/FROM when I found this library & figured I'd check if you'd be open to expanding the scope for this use case.

@palewire
Owner

palewire commented Jul 9, 2016

Maybe! Just to make sure I understand, the idea is to insert data into an existing table managed by a Django model?

@jamesturk
Author

jamesturk commented Jul 9, 2016

Yep, the way I have it working is that you instantiate a bunch of objects of the model type. It'd look something like:

entries = []
for data in data_from_some_source():
    # instantiating Entry here gets us things like proper field defaults
    entries.append(Entry(**data))

# slow way
Entry.objects.bulk_create(entries)

# fast way
CopyMapping(model=Entry, data=entries).save()

Internally it'd essentially write all of the data to a temporary CSV file and then load it via the same COPY mechanism.

@palewire
Owner

palewire commented Jul 9, 2016

Let's do it!


@virusdefender

Is there any progress?

@denhartog
Contributor

I thought about this when I first came across this project a few weeks ago. If I remember correctly, the in-memory object would need read() and readline() methods, based on psycopg2's copy_expert() method (see: https://github.com/psycopg/psycopg2/blob/6da3e7ee69971cd6cb692765a4d66a5ce405f104/psycopg/cursor_type.c#L1366)
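For what it's worth, a minimal sketch of such an object (names hypothetical; it implements only the read() and readline() methods that psycopg2's copy_from()/copy_expert() call, so rows could be streamed from a generator without materializing the whole CSV in memory):

```python
class IterableToFile:
    """A minimal file-like wrapper over an iterable of CSV lines.

    psycopg2's copy_from()/copy_expert() only call read() and
    readline() on the object they are given, so implementing those
    two methods is enough to feed COPY from any iterable.
    """

    def __init__(self, lines):
        self._iter = iter(lines)
        self._buffer = ''

    def readline(self, size=-1):
        # Hand back buffered leftovers first, then pull the next line.
        if self._buffer:
            line, self._buffer = self._buffer, ''
            return line
        return next(self._iter, '')

    def read(self, size=-1):
        if size is None or size < 0:
            # Drain everything that's left.
            rest, self._buffer = self._buffer + ''.join(self._iter), ''
            return rest
        # Accumulate until we can satisfy the requested size.
        while len(self._buffer) < size:
            chunk = next(self._iter, None)
            if chunk is None:
                break
            self._buffer += chunk
        result, self._buffer = self._buffer[:size], self._buffer[size:]
        return result
```

Something like cursor.copy_expert(sql, IterableToFile(csv_lines)) could then stream rows straight from a generator rather than a temporary file.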

@jamesturk
Author

jamesturk commented Jan 12, 2018

I never got a chance to come back & contribute what I had, but here it is as a POC, in case someone else wants to pick this up.

import csv
from io import StringIO
from django.db import connection


def postgres_bulk_copy(objects, ModelCls=None,
                       table_name=None,
                       fields=None, not_null=None, null=None,
                       include_id=False,
                       ):
    _fields = []
    _not_null = []
    _null = []

    # if a model was passed in, use it to get defaults
    if ModelCls:
        for f in ModelCls._meta.get_fields():
            if (not f.auto_created and
                    not (f.is_relation and f.many_to_many) and
                    f.column):
                _fields.append(f.column)
                if f.null:
                    _null.append(f.column)
                else:
                    _not_null.append(f.column)

        # prefer passed in overrides if they exist
        table_name = table_name or ModelCls._meta.db_table
        fields = fields or _fields
        not_null = not_null or _not_null
        null = null or _null
        if include_id:
            # copy rather than append, to avoid mutating a caller's list
            fields = fields + ['id']

        objects = [
            {k: getattr(o, k) for k in fields} for o in objects
        ]

    tmp = StringIO()
    w = csv.DictWriter(tmp, fieldnames=fields)
    w.writeheader()
    w.writerows(objects)

    # rewind so copy_expert reads from the start
    tmp.seek(0)

    with connection.cursor() as cursor:
        # FORCE_NULL and FORCE_NOT_NULL are only valid inside the
        # parenthesized option list (PostgreSQL 9.4+), not appended
        # after "WITH CSV HEADER"
        options = ['FORMAT csv', 'HEADER']
        if null:
            options.append('FORCE_NULL ({})'.format(', '.join(null)))
        if not_null:
            options.append('FORCE_NOT_NULL ({})'.format(', '.join(not_null)))
        sql = 'COPY {} ({}) FROM STDIN WITH ({})'.format(
            table_name, ', '.join(fields), ', '.join(options))

        cursor.copy_expert(sql, tmp)
        # commit here so lookups immediately afterward see the new rows
        cursor.connection.commit()
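For anyone picking this up: the statement construction could also be pulled out into a standalone, unit-testable helper. A sketch (name hypothetical), using the parenthesized COPY option list from PostgreSQL 9.4+, which is the only form that accepts FORCE_NULL:

```python
def build_copy_sql(table_name, fields, null=(), not_null=()):
    """Build a COPY ... FROM STDIN statement using the option-list
    syntax, with optional FORCE_NULL / FORCE_NOT_NULL column lists."""
    options = ['FORMAT csv', 'HEADER']
    if null:
        options.append('FORCE_NULL ({})'.format(', '.join(null)))
    if not_null:
        options.append('FORCE_NOT_NULL ({})'.format(', '.join(not_null)))
    return 'COPY {} ({}) FROM STDIN WITH ({})'.format(
        table_name, ', '.join(fields), ', '.join(options))
```

For example, build_copy_sql('app_entry', ['name', 'url'], null=['url']) produces COPY app_entry (name, url) FROM STDIN WITH (FORMAT csv, HEADER, FORCE_NULL (url)).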

@palewire
Owner

Thanks for sharing your work, @jamesturk. If anybody wants to take a run at a pull request it would be greatly appreciated.

@palewire
Owner

palewire commented Feb 6, 2018

I think it's only partial progress towards the goal here, but I just merged #68, which adds support for submitting file objects in addition to file paths.
