Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: 'to_sql()' add param 'method' to control insert statement (#21103) #21199

Closed

Conversation

schettino72
Copy link
Contributor

@schettino72 schettino72 commented May 25, 2018

Also revert default insert method to NOT use multi-value.

This is WIP. I would like to gather feedback on API change before going on.

  • support callables as parameter?
  • support to_sql() when used on SQLite3 (without SQLAlchemy)

Sample file for performance benchmarking

import time

import numpy as np
import pandas as pd
from sqlalchemy import create_engine


N_COLS = 20
N_ROWS= 200000
# one of to_sql insert methods: None/'default', 'multi', 'copy'
METHOD = 'copy'

engine = create_engine('postgresql://postgres:@localhost/pandas_perf')
conn = engine.connect()

start = time.time()
df = pd.DataFrame({n: np.arange(0, N_ROWS, 1) for n in range(N_COLS)})

# convert df to sql table
df.to_sql('test', conn, index=False, if_exists='replace',
          chunksize=1000, method=METHOD)

print('WRITE: {}'.format(time.time() - start))

@pep8speaks
Copy link

pep8speaks commented May 25, 2018

Hello @schettino72! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 30, 2018 at 03:40 Hours UTC

@schettino72
Copy link
Contributor Author

For 200,000 rows, 20 columns, chunksize 1000. Running python 3.6, postgres:

Method time(seconds)
default 36.6
multi 56.4
copy 2.2

@jorisvandenbossche
Copy link
Member

Does COPY work for all sqlalchemy flavors?

In the past, we always have limited the pandas implementation to sqlachemy core constructs that are backend agnostic, within its limitations.
But given the big difference, we maybe should reconsider that.

@schettino72
Copy link
Contributor Author

Does COPY work for all sqlalchemy flavors?

No. I do not have enough knowledge of various DB systems. I just know it works with Postgresql and does NOT work with SQLite.

I think the main point is to allow the insertion method to be modified without requiring any monkey-patching.

I can drop the COPY handling from the patch if you guys wish... I included it to make sure the API was good enough to really use other methods. As the default and multi-value implementation are very similar.

IMO it does not hurt to include some code for a specific backend as long as the user needs to explicitly choose to do so, but I understand you guys dont want the burden to support code for every different backend.

schettino72 added a commit to schettino72/pandas that referenced this pull request May 29, 2018
@codecov
Copy link

codecov bot commented May 29, 2018

Codecov Report

Merging #21199 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21199      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         153      153              
  Lines       49506    49538      +32     
==========================================
+ Hits        45467    45499      +32     
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.24% <ø> (ø) ⬆️
#single 41.87% <ø> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 96.12% <ø> (ø) ⬆️
pandas/core/indexes/base.py 96.61% <0%> (-0.08%) ⬇️
pandas/core/frame.py 97.22% <0%> (ø) ⬆️
pandas/core/arrays/categorical.py 95.68% <0%> (ø) ⬆️
pandas/core/indexes/datetimelike.py 96.89% <0%> (+0.09%) ⬆️
pandas/core/tools/datetimes.py 84.98% <0%> (+0.54%) ⬆️
pandas/io/formats/printing.py 93.08% <0%> (+3.7%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f91e28c...66cec62. Read the comment docs.

schettino72 added a commit to schettino72/pandas that referenced this pull request May 30, 2018
@jreback jreback added the IO SQL to_sql, read_sql, read_sql_query label Jun 4, 2018
@jorisvandenbossche
Copy link
Member

So as I said on the issue (#21103 (comment)), I think a good start might be to have this option and allow users to pass a custom function, but not to actually implement such a 'copy' method ourselves (but it is a good test case).
(if we include the copy method, we should have much more extensive testing for different platforms and datatypes etc)

@schettino72
Copy link
Contributor Author

sure. no problems. I will remove the "copy" implementation.

@jorisvandenbossche
Copy link
Member

Be sure to keep it as a test case in the tests, and maybe also a good example for the documentation

@jreback jreback added this to the 0.23.1 milestone Jun 7, 2018
@jreback
Copy link
Contributor

jreback commented Jun 7, 2018

for 0.23.1 ?

@jorisvandenbossche
Copy link
Member

At least something, if not this, then a revert of the original PR.

@jreback
Copy link
Contributor

jreback commented Jun 7, 2018

fair enough. there are 2 or 3 issues open in 0.23.1. pls adjust to what makes sense.

@jorisvandenbossche
Copy link
Member

I opened the PR to simply revert: #21355, we can keep this PR for 0.23.2 or 0.24.0

@schettino72
Copy link
Contributor Author

Sorry, I rebased this to master and force pushed... It seems github decided to automatically close this. Re-opened as #21401

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
4 participants