Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: to_sql() add parameter "method" to control insertions method (#8… #21401

Merged
merged 18 commits into from
Dec 28, 2018

Conversation

schettino72
Copy link
Contributor

@schettino72 schettino72 commented Jun 9, 2018

…953)

@pep8speaks
Copy link

pep8speaks commented Jun 9, 2018

Hello @schettino72! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 27, 2018 at 23:51 Hours UTC

@schettino72
Copy link
Contributor Author

  • Added docs
  • removed COPY implementation from code and added on docs

There is only one problem with flake8 complaining about a long.
The problem is that it is a url link, not sure how to break in into multiple lines...

@schettino72
Copy link
Contributor Author

Note: I am targeting this to 0.24, not 0.23.x.
IMO an addition of parameter should not be made a on bug release.
Otherwise code written for 0.32.2 using the new parameter would not run on 0.23.1...

@gfyoung gfyoung added IO SQL to_sql, read_sql, read_sql_query Performance Memory or execution speed performance labels Jun 10, 2018

sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
table_name, columns)
cur.copy_expert(sql=sql, file=s_buf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice explanation here! Do you mind moving some of it to the parameter description? That's probably where viewers will look first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, indeed (similar to my comment above), I would at least put a basic explanation of each option in the parameter description.
And then you can refer to below for more detailed performance considerations / example ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe we can put this example in the user guide (io.rst) ?

@gfyoung
Copy link
Member

gfyoung commented Jun 10, 2018

Note: I am targeting this to 0.24, not 0.23.x.

@schettino72 : That's exactly what we would do as well. Good call!

@gfyoung
Copy link
Member

gfyoung commented Jun 10, 2018

@schettino72 : Looks like you have a lint error according to Travis here.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Added some comments, didn't look at the tests yet

pandas/core/generic.py Outdated Show resolved Hide resolved

sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
table_name, columns)
cur.copy_expert(sql=sql, file=s_buf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, indeed (similar to my comment above), I would at least put a basic explanation of each option in the parameter description.
And then you can refer to below for more detailed performance considerations / example ..

This usually provides a big performance for Analytic databases
like *Presto* and *Redshit*, but has worse performance for
traditional SQL backend if the table contains many columns.
For more information check SQLAlchemy `documention <http://docs.sqlalchemy.org/en/latest/core/dml.html?highlight=multivalues#sqlalchemy.sql.expression.Insert.values.params.*args>`__.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you start the url on the next line, I think flake8 will be OK with the too long line, like:

    For more information check SQLAlchemy `documention 
    <http://docs.sqlalchemy.org/en/latest/core/dml.html?highlight=multivalues#sqlalchemy.sql.expression.Insert.values.params.*args>`__.


- `'default'`: Uses standard SQL `INSERT` clause
- `'multi'`: Pass multiple values in a single `INSERT` clause.
It uses a **special** SQL syntax not supported by all backends.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'It' should be indented equally to the `'multi'` of the line above (now it is one space more I think)

traditional SQL backend if the table contains many columns.
For more information check SQLAlchemy `documention <http://docs.sqlalchemy.org/en/latest/core/dml.html?highlight=multivalues#sqlalchemy.sql.expression.Insert.values.params.*args>`__.
- callable: with signature `(pd_table, conn, keys, data_iter)`.
This can be used to implement more performant insertion based on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here about the indentation

The parameter ``method`` controls the SQL insertion clause used.
Possible values are:

- `'default'`: Uses standard SQL `INSERT` clause
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to say this is a single insert statement for each row?


sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
table_name, columns)
cur.copy_expert(sql=sql, file=s_buf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe we can put this example in the user guide (io.rst) ?

pandas/io/sql.py Outdated
def insert(self, chunksize=None, method=None):

# set insert method
if method in (None, 'default'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the 'None' needed here?

@codecov
Copy link

codecov bot commented Jun 10, 2018

Codecov Report

Merging #21401 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21401   +/-   ##
=======================================
  Coverage   91.89%   91.89%           
=======================================
  Files         153      153           
  Lines       49596    49596           
=======================================
  Hits        45576    45576           
  Misses       4020     4020
Flag Coverage Δ
#multiple 90.29% <ø> (ø) ⬆️
#single 41.86% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 96.12% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 415012f...1e5d1cc. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 10, 2018

Codecov Report

Merging #21401 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21401      +/-   ##
==========================================
+ Coverage   92.22%   92.22%   +<.01%     
==========================================
  Files         169      169              
  Lines       50897    50903       +6     
==========================================
+ Hits        46940    46946       +6     
  Misses       3957     3957
Flag Coverage Δ
#multiple 90.65% <ø> (ø) ⬆️
#single 42.29% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 96.84% <ø> (ø) ⬆️
pandas/core/arrays/categorical.py 95.56% <0%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a0c91e...f4ffbfc. Read the comment docs.

@schettino72
Copy link
Contributor Author

Thanks for quick review and tips. Ready for next round 😁

@gfyoung
Copy link
Member

gfyoung commented Jun 10, 2018

@schettino72 : Seems like you have test failures...

@jreback
Copy link
Contributor

jreback commented Jun 13, 2018

lgtm. @jorisvandenbossche merge when satisfied.

@jreback
Copy link
Contributor

jreback commented Jun 18, 2018

lgtm. ping @jorisvandenbossche

table_name = table.name

sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
table_name, columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realise this is just an example, but isn't this piece vulnerable to SQL injection? thepsycopg2.sql module has an Identifier object designed to handle SQL identifiers appropriately. See also Example 41-1. Quoting Values In Dynamic Queries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dahlbaek no expert on this, but if you could provide the code that you would replace it with to make it more robust, that would certainly be welcome!

@danfrankj
Copy link
Contributor

@schettino72 would love to see this this ship - think we'll get it into the next release?

@TomAugspurger
Copy link
Contributor

What's the status on this?

@schettino72 there's a merge conflict with master, if you could merge master and push.

@wxianxin
Copy link

@schettino72 The magnificent you are the only hope for this unbelievable feature to be delivered at this wonderful release.

@schettino72
Copy link
Contributor Author

Sorry, I dont have the time now to work on this.

It seems the conflict is only on README file.
IMO should be resolved at the time the pull request is actually merged (by someone with write access), not something i can do.
Otherwise I might be playing catch up forever...

@tgriek
Copy link

tgriek commented Dec 17, 2018

Waiting for this! Thanks guys.

pandas/io/sql.py Outdated Show resolved Hide resolved
@mroeschke
Copy link
Member

@jreback Merged master, changed the method default to None and all green. Functionality looks good to me, but not sure if there's any doc followups that needs to be made.

s_buf = StringIO()
writer = csv.writer(s_buf)
writer.writerows(data_iter)
s_buf.seek(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this needs testing.

@mroeschke
Copy link
Member

@jreback Added a test for the example in io.rst and all green.

@jreback jreback added this to the 0.24.0 milestone Dec 27, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor comment on the whatsnew.

doc/source/whatsnew/v0.24.0.rst Outdated Show resolved Hide resolved
pandas/io/sql.py Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Dec 28, 2018

@TomAugspurger ?

@TomAugspurger
Copy link
Contributor

+1 aside from the example. I'll push a commit updating it.

@mroeschke mroeschke merged commit c1af4f5 into pandas-dev:master Dec 28, 2018
@mroeschke
Copy link
Member

Thanks @jreback, @TomAugspurger.

And thank you @schettino72!

@danfrankj
Copy link
Contributor

Thank you all! Looking forward to using this :)

@maxgrenderjones
Copy link
Contributor

Hooray! Thanks :)

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
…ndas-dev#8… (pandas-dev#21401)

* ENH: to_sql() add parameter "method" to control insertions method (pandas-dev#8953)

* ENH: to_sql() add parameter "method". Fix docstrings (pandas-dev#8953)

* ENH: to_sql() add parameter "method". Improve docs based on reviews (pandas-dev#8953)

* ENH: to_sql() add parameter "method". Fix unit-test (pandas-dev#8953)

* doc clean-up

* additional doc clean-up

* use dict(zip()) directly

* clean up merge

* default --> None

* Remove stray default

* Remove method kwarg

* change default to None

* test copy insert snippit

* print debug

* index=False

* Add reference to documentation
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
…ndas-dev#8… (pandas-dev#21401)

* ENH: to_sql() add parameter "method" to control insertions method (pandas-dev#8953)

* ENH: to_sql() add parameter "method". Fix docstrings (pandas-dev#8953)

* ENH: to_sql() add parameter "method". Improve docs based on reviews (pandas-dev#8953)

* ENH: to_sql() add parameter "method". Fix unit-test (pandas-dev#8953)

* doc clean-up

* additional doc clean-up

* use dict(zip()) directly

* clean up merge

* default --> None

* Remove stray default

* Remove method kwarg

* change default to None

* test copy insert snippit

* print debug

* index=False

* Add reference to documentation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use multi-row inserts for massive speedups on to_sql over high latency connections