ENH: 'to_sql()' add param 'method' to control insert statement (#21103) #21199

schettino72 · 2018-05-25T02:52:13Z

Also revert default insert method to NOT use multi-value.

This is WIP. I would like to gather feedback on the API change before going further.

support callables as parameter?
support to_sql() when used on SQLite3 (without SQLAlchemy)
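
For readers unfamiliar with the distinction, here is a minimal sketch of the two statement shapes being discussed: one single-row INSERT run per row, versus one multi-value INSERT per chunk. It uses the stdlib `sqlite3` module rather than pandas internals, and the function names `insert_default` and `insert_multi` are hypothetical, chosen only for illustration.

```python
import sqlite3


def insert_default(cur, table, keys, rows):
    """Run one single-row INSERT per row via executemany."""
    cols = ", ".join('"{}"'.format(k) for k in keys)
    marks = ", ".join("?" for _ in keys)
    sql = "INSERT INTO {} ({}) VALUES ({})".format(table, cols, marks)
    cur.executemany(sql, rows)


def insert_multi(cur, table, keys, rows):
    """Run one INSERT that carries every row in a single VALUES list."""
    rows = list(rows)
    cols = ", ".join('"{}"'.format(k) for k in keys)
    one_row = "({})".format(", ".join("?" for _ in keys))
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, cols, ", ".join(one_row for _ in rows))
    # The statement binds len(rows) * len(keys) parameters in one go.
    cur.execute(sql, [value for row in rows for value in row])


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t_default (a INTEGER, b INTEGER)")
cur.execute("CREATE TABLE t_multi (a INTEGER, b INTEGER)")

rows = [(i, i * 10) for i in range(5)]
insert_default(cur, "t_default", ["a", "b"], rows)
insert_multi(cur, "t_multi", ["a", "b"], rows)

# Both strategies should leave identical table contents.
default_rows = cur.execute("SELECT * FROM t_default").fetchall()
multi_rows = cur.execute("SELECT * FROM t_multi").fetchall()
print(default_rows == multi_rows)
```

The table name is interpolated directly into the SQL text, which is acceptable for a sketch but not for untrusted input.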

Sample file for performance benchmarking

import time

import numpy as np
import pandas as pd
from sqlalchemy import create_engine


N_COLS = 20
N_ROWS = 200000
# one of to_sql insert methods: None/'default', 'multi', 'copy'
METHOD = 'copy'

engine = create_engine('postgresql://postgres:@localhost/pandas_perf')
conn = engine.connect()

start = time.time()
df = pd.DataFrame({n: np.arange(0, N_ROWS, 1) for n in range(N_COLS)})

# convert df to sql table
df.to_sql('test', conn, index=False, if_exists='replace',
          chunksize=1000, method=METHOD)

print('WRITE: {}'.format(time.time() - start))

closes #21103: "too many SQL variables" Error with pandas 0.23 - enable multivalues insert #19664 issue
closes #21146: to_sql() performance regression (#19664) when DF contains many columns
#8953: Use multi-row inserts for massive speedups on to_sql over high latency connections
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
update docstrings/docs
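
The "too many SQL variables" error referenced above comes down to simple arithmetic: a multi-value INSERT binds one parameter per cell, so a chunk needs columns × rows parameters. The sketch below works that out for the benchmark's shape. The limit of 999 is an assumption for illustration (SQLite's default `SQLITE_MAX_VARIABLE_NUMBER` before version 3.32.0); the real limit depends on the backend and how it was built, and the helper names are hypothetical.

```python
def bound_parameters(n_cols, rows_per_statement):
    """Number of bound parameters in one multi-value INSERT."""
    return n_cols * rows_per_statement


def max_rows_per_statement(n_cols, variable_limit):
    """Largest row count per statement that stays within the limit."""
    return max(1, variable_limit // n_cols)


N_COLS = 20
CHUNKSIZE = 1000
ASSUMED_LIMIT = 999  # assumption: older SQLite default, not a universal value

print(bound_parameters(N_COLS, CHUNKSIZE))            # 20000
print(max_rows_per_statement(N_COLS, ASSUMED_LIMIT))  # 49
```

Under that assumed limit, a 20-column frame could send at most 49 rows per multi-value statement, far below a chunksize of 1000.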

pep8speaks · 2018-05-25T02:52:17Z

Hello @schettino72! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 30, 2018 at 03:40 Hours UTC

schettino72 · 2018-05-25T03:03:20Z

Timings for 200,000 rows, 20 columns, chunksize 1000, running Python 3.6 against PostgreSQL:

Method	Time (seconds)
default	36.6
multi	56.4
copy	2.2

jorisvandenbossche · 2018-05-28T07:58:08Z

Does COPY work for all sqlalchemy flavors?

In the past, we have always limited the pandas implementation to SQLAlchemy core constructs that are backend agnostic, within their limitations.
But given the big difference, maybe we should reconsider that.

schettino72 · 2018-05-28T08:51:37Z

> Does COPY work for all sqlalchemy flavors?

No. I do not have enough knowledge of the various DB systems. I just know it works with PostgreSQL and does NOT work with SQLite.

I think the main point is to allow the insertion method to be modified without requiring any monkey-patching.

I can drop the COPY handling from the patch if you guys wish... I included it to make sure the API was good enough to really support other methods, since the default and multi-value implementations are very similar.

IMO it does not hurt to include some code for a specific backend as long as the user needs to explicitly choose to use it, but I understand you guys don't want the burden of supporting code for every different backend.

codecov · 2018-05-29T12:01:38Z

Codecov Report

Merging #21199 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #21199      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         153      153              
  Lines       49506    49538      +32     
==========================================
+ Hits        45467    45499      +32     
  Misses       4039     4039

Flag	Coverage Δ
#multiple	`90.24% <ø> (ø)`	⬆️
#single	`41.87% <ø> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/generic.py	`96.12% <ø> (ø)`	⬆️
pandas/core/indexes/base.py	`96.61% <0%> (-0.08%)`	⬇️
pandas/core/frame.py	`97.22% <0%> (ø)`	⬆️
pandas/core/arrays/categorical.py	`95.68% <0%> (ø)`	⬆️
pandas/core/indexes/datetimelike.py	`96.89% <0%> (+0.09%)`	⬆️
pandas/core/tools/datetimes.py	`84.98% <0%> (+0.54%)`	⬆️
pandas/io/formats/printing.py	`93.08% <0%> (+3.7%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f91e28c...66cec62.

jorisvandenbossche · 2018-06-06T13:19:54Z

So as I said on the issue (#21103 (comment)), I think a good start might be to have this option and allow users to pass a custom function, but not to actually implement such a 'copy' method ourselves (although it is a good test case).
(If we include the copy method, we would need much more extensive testing across different platforms, datatypes, etc.)
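
To make the "custom function" idea concrete, here is a hedged sketch of a user-supplied COPY-based insert callable. It assumes the callable receives `(table, conn, keys, data_iter)`, which is the signature documented for the `method` parameter in released pandas; the WIP patch discussed here may have used a different one. It also assumes a PostgreSQL connection through psycopg2, whose cursors provide `copy_expert`. The helper `build_copy_request` is a hypothetical name introduced only to keep the SQL and CSV construction separate from the database call.

```python
import csv
from io import StringIO


def build_copy_request(table_name, keys, data_iter):
    """Return the COPY statement and an in-memory CSV buffer of the rows."""
    buf = StringIO()
    csv.writer(buf).writerows(data_iter)
    buf.seek(0)
    columns = ", ".join('"{}"'.format(k) for k in keys)
    sql = "COPY {} ({}) FROM STDIN WITH CSV".format(table_name, columns)
    return sql, buf


def psql_insert_copy(table, conn, keys, data_iter):
    """Insert one chunk using PostgreSQL COPY instead of INSERT.

    Assumed arguments: `table` exposes `.name` and `.schema`, `conn` is a
    SQLAlchemy connection, `keys` are column names, and `data_iter` yields
    row tuples.
    """
    if table.schema:
        table_name = "{}.{}".format(table.schema, table.name)
    else:
        table_name = table.name
    sql, buf = build_copy_request(table_name, keys, data_iter)

    dbapi_conn = conn.connection  # raw DBAPI connection behind SQLAlchemy
    with dbapi_conn.cursor() as cur:
        cur.copy_expert(sql=sql, file=buf)  # psycopg2-specific call
```

With a pandas version whose `method` parameter accepts a callable, this would be passed as `df.to_sql('test', conn, index=False, if_exists='replace', method=psql_insert_copy)`. Because it relies on COPY, it is not expected to work on SQLite or other backends.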

schettino72 · 2018-06-06T13:25:38Z

Sure, no problem. I will remove the "copy" implementation.

jorisvandenbossche · 2018-06-06T14:10:52Z

Be sure to keep it as a test case in the tests, and maybe also as an example in the documentation.

jreback · 2018-06-07T11:13:34Z

For 0.23.1?

jorisvandenbossche · 2018-06-07T11:15:04Z

At least something, if not this, then a revert of the original PR.

jreback · 2018-06-07T11:18:57Z

Fair enough. There are 2 or 3 issues open for 0.23.1; please adjust to what makes sense.

jorisvandenbossche · 2018-06-07T14:24:18Z

I opened a PR to simply revert: #21355. We can keep this PR for 0.23.2 or 0.24.0.

schettino72 · 2018-06-09T11:02:01Z

Sorry, I rebased this onto master and force-pushed... It seems GitHub decided to close this automatically. Re-opened as #21401.

schettino72 added a commit to schettino72/pandas that referenced this pull request May 29, 2018

refs pandas-dev#21199. Fix code for SQLiteTable and unit-tests.

8528936

schettino72 added a commit to schettino72/pandas that referenced this pull request May 30, 2018

refs pandas-dev#21199. to_sql() support passing a callable to `method…

66cec62

…` parameter.

danfrankj mentioned this pull request May 31, 2018

"too many SQL variables" Error with pandas 0.23 - enable multivalues insert #19664 issue #21103

Closed

jreback added the IO SQL to_sql, read_sql, read_sql_query label Jun 4, 2018

jreback added this to the 0.23.1 milestone Jun 7, 2018

jorisvandenbossche removed this from the 0.23.1 milestone Jun 7, 2018

jorisvandenbossche mentioned this pull request Jun 7, 2018

to_sql() performance regression (#19664) when DF contains many columns #21146

Closed

schettino72 closed this Jun 9, 2018

schettino72 force-pushed the tosql-insert-param-21103 branch from 66cec62 to abfac97 Compare June 9, 2018 10:51
