
"too many SQL variables" Error with pandas 0.23 - enable multivalues insert #19664 issue #21103

Closed
tripkane opened this issue May 17, 2018 · 37 comments
Labels
IO SQL (to_sql, read_sql, read_sql_query), Regression (functionality that used to work in a prior pandas version)
Milestone

Comments

@tripkane

tripkane commented May 17, 2018

# imports
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# create a DataFrame with 20,000 rows
df = pd.DataFrame(np.arange(0, 20000, 1))
# create an in-memory sqlite engine to connect pandas with sqlite3
engine = create_engine('sqlite://')
# open a connection
conn = engine.connect()
# write the DataFrame to a SQL table
df.to_sql('test', engine, if_exists='replace', chunksize=1000)
# read the results back and print them
result = conn.execute("select * from test")
for row in result:
    print(row['index'])
conn.close()

Problem description

In pandas 0.22 I could write a DataFrame of reasonable size to SQL without error. Now I receive "OperationalError: (sqlite3.OperationalError) too many SQL variables" when converting a DataFrame with ~20k+ rows to SQL. After looking around, I suspect the problem lies in the limit set by sqlite3, SQLITE_MAX_VARIABLE_NUMBER, which is 999 by default (based on their docs). It can apparently only be changed by recompiling sqlite with a different value. I can confirm that a DataFrame of 499 rows works, and that the test above works with 20k rows when a chunksize of 499 is passed to to_sql. In my real case the limit is 76. These numbers clearly depend on the data size of each row, so a method is required to estimate the chunksize based on data type and number of columns. #19664

Expected output

runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Traceback

runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
Traceback (most recent call last):

File "", line 1, in
runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "H:/Tests/Pandas_0.23_test.py", line 19, in
df.to_sql('test',engine, if_exists='fail',chunksize=500)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 2127, in to_sql
dtype=dtype)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 450, in to_sql
chunksize=chunksize, dtype=dtype)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 1149, in to_sql
table.insert(chunksize)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 663, in insert
self._execute_insert(conn, keys, chunk_iter)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 638, in _execute_insert
conn.execute(*self.insert_statement(data, conn))

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 948, in execute
return meth(self, multiparams, params)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 269, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1060, in _execute_clauseelement
compiled_sql, distilled_params

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1200, in _execute_context
context)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1413, in _handle_dbapi_exception
exc_info

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 203, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 186, in reraise
raise value.with_traceback(tb)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1193, in _execute_context
context)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 507, in do_execute
cursor.execute(statement, parameters)

OperationalError: (sqlite3.OperationalError) too many SQL variables [SQL: 'INSERT INTO test ("index", "0") VALUES (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?,

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@kristang

This is also an issue when connecting to MSSQL. It fails silently and will not stop the function or return an exception, but will throw one if you set "chunksize" manually.

@tripkane
Author

@kristang: I just played around a bit more. Have you tried a chunksize of 75 or less? My case works with that chunksize for the local db, and strangely with 499 for an in-memory sqlite db, neither of which matches the limit stated by sqlite. Have you tried different chunksizes in your case?

@kristang

I tried 999, 1000 and 2100, but then just ended up downgrading to 0.22 to get on with my day.

@TomAugspurger
Contributor

@tripkane could I bother you to narrow down the original example a bit? It should be minimal, and something that's appropriate for sticking in a unit test.

Could you also include the traceback you get?

@tripkane
Author

@kristang: :) Fair enough, maybe try these lower numbers out when you have time.

@tripkane
Author

@TomAugspurger: Hi Tom. Yep, can do. So reduce it to my actual case (i.e. a locally saved database, not in memory) and create the data from np.arange like the :memory: version?

@TomAugspurger
Contributor

TomAugspurger commented May 17, 2018 via email

@tripkane
Author

tripkane commented May 17, 2018

@TomAugspurger: I have shortened it and updated the comments. Basically, chunksize must be used, and one needs to estimate it from the data size of one row for each individual case. This person implemented such a solution with peewee: https://stackoverflow.com/questions/35616602/peewee-operationalerror-too-many-sql-variables-on-upsert-of-only-150-rows-8-c Anyway, I hope this helps; for those with this issue, in the meantime you can't go wrong with something very small like chunksize=1 :)

@tripkane
Author

@TomAugspurger: After further testing, the initial assumption seems correct: the failure comes from the SQLITE_MAX_VARIABLE_NUMBER limit of 999. The maximum allowable chunksize for df.to_sql is given by chunksize = 999 // (cols + 1), where cols was generated for the test by cols = np.random.randint(1, 20) and used to create df = pd.DataFrame(np.random.random((20000, cols))).
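
For anyone hitting this, a minimal sketch of that estimate (an illustration, not pandas API: it assumes SQLite's default compile-time limit of 999 bind variables, and that each row of a multi-values chunk consumes one parameter per column plus one for the index):

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

SQLITE_MAX_VARIABLE_NUMBER = 999  # SQLite's default compile-time limit

cols = np.random.randint(1, 20)
df = pd.DataFrame(np.random.random((20000, cols)))
engine = create_engine('sqlite://')  # in-memory db, as in the example above

# each row of a multi-values INSERT uses (cols + 1) bind parameters
# (the columns plus the index), so a chunk must stay at or below 999
safe_chunksize = SQLITE_MAX_VARIABLE_NUMBER // (cols + 1)
df.to_sql('test', engine, if_exists='replace', chunksize=safe_chunksize)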

@jorisvandenbossche
Member

@tripkane thanks for the minimal example! I can confirm this regression.

@jorisvandenbossche added this to the 0.23.1 milestone on May 17, 2018
@jorisvandenbossche added the Regression and IO SQL labels on May 17, 2018
@tripkane
Author

@jorisvandenbossche: You're welcome, happy to help.

@schettino72
Contributor

Since you guys are looking into this... can you check whether the change that caused this regression actually improved performance?

As I reported in #8953, this change actually made the operation slower for me, and IMO it should be reverted...

@TomAugspurger
Contributor

TomAugspurger commented May 18, 2018 via email

@tripkane
Author

@TomAugspurger: I tested this quickly with essentially the same example as above for two DataFrame lengths, 20k and 200k rows, comparing chunksize=1 with chunksize=999//(cols+1) for cols = 10; in both cases the latter was twice as fast when connected to a local sqlite db.
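
A rough sketch of the timing comparison described above (local_test.db is a hypothetical file name; absolute timings will obviously vary by machine and driver):

import time

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

cols = 10
df = pd.DataFrame(np.random.random((20000, cols)))
engine = create_engine('sqlite:///local_test.db')  # hypothetical local sqlite file

for chunk in (1, 999 // (cols + 1)):
    start = time.time()
    df.to_sql('timing_test', engine, if_exists='replace', chunksize=chunk)
    print('chunksize=%d: %.1fs' % (chunk, time.time() - start))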

@schettino72
Contributor

@TomAugspurger I created issue #21146 to track this performance regression.
I did not profile it, but I did some experiments... my conclusion is that performance on 0.23 is better for a few columns but degrades quickly as the number of columns grows.

@cpcloud
Member

cpcloud commented May 22, 2018

@pandas-dev/pandas-core How do people feel about reverting #19664? It both breaks existing code and degrades performance significantly. Happy to put up the patch.

@TomAugspurger
Contributor

TomAugspurger commented May 22, 2018 via email

@TomAugspurger
Contributor

cc @danfrankj

@jorisvandenbossche
Member

In the original issue (#8953), having an optional keyword for this was discussed, since which default is best depends a lot on the exact circumstances (this discussion was largely ignored in the PR; sorry about not raising it there).
So I think making it an option, with the default as it was before, is certainly a possible way forward.

I think it would still be useful for someone with some time to go through the discussion in that issue to check the raised use cases, how it interacts with chunksize, etc., as I don't recall everything that was discussed there.

@cpcloud
Member

cpcloud commented May 22, 2018

IMO pandas should allow more flexibility here: users pass in a function that will get called with the SQLAlchemy table and they can construct the DML themselves. The default DML construction function would be the previous working code. Then any such optimization that comes up in the future doesn't need yet another keyword argument. I wonder if there isn't already a way to do this without monkey patching.

@schettino72
Contributor

@cpcloud As of today there is no way to achieve this without monkey patching.
I use a monkey patched version to use COPY ... FROM.

I could work on a patch to make this more flexible. I suggest adding a parameter insert_method to to_sql(). Apart from the default (previous) implementation, we could add multi-value (the 0.23 version) and COPY FROM.

schettino72 added a commit to schettino72/pandas that referenced this issue May 25, 2018
…s-dev#21103)

Also revert default insert method to NOT use multi-value.
@jorisvandenbossche
Member

The way without monkeypatching is subclassing the table class and overriding the appropriate method, but that is not necessarily cleaner than monkeypatching, as it is private methods you are overriding anyway.

I like the proposed idea of allowing more flexibility. But we should think a bit about the exact interface we want (what does a user need to create the appropriate construct? The sqlalchemy table, the data (in which form?), the connection, ...).

@schettino72
Contributor

@jorisvandenbossche

The way without monkeypatching is subclassing the table class and overriding the appropriate method, but that is not necessarily cleaner than monkeypatching, as it is private methods you are overriding anyway.

to_sql() has no direct control of which table class will be used. So currently, subclassing is not enough, you MUST monkeypatch.

Also you would need to be careful with subclassing the table class as there are 2 implementations: SQLAlchemy and SQLite.

Another problem of monkeypatching/subclassing (as opposed to a parameter) is that the same method has to be used globally by an application and cannot be fine-tuned for each call of to_sql().

But we should think a bit about the exact interface we want (what does a user need to create the appropriate construct? The sqlalchemy table, the data (in which form?), the connection, ...).

I think just follow the API already used for _execute_insert(self, conn, keys, data_iter)

  • a reference to table object
  • a connection
  • list of keys/columns
  • and data_iter with values

That was enough for me to implement postgresql COPY without having to repeat Pandas specific processing of columns and conversion of values.
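
To illustrate that interface, here is a sketch of a COPY-based insert function following the (table, conn, keys, data_iter) signature described above; it assumes a psycopg2 connection, and psql_insert_copy is a name chosen for illustration, not something pandas provides:

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # table: the pandas SQLTable, conn: a SQLAlchemy connection,
    # keys: list of column names, data_iter: iterator over row tuples
    dbapi_conn = conn.connection  # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)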

@jorisvandenbossche
Member

to_sql() has no direct control of which table class will be used. So currently, subclassing is not enough, you MUST monkeypatch.

Yeah, for to_sql that is true, but in principle you can subclass and then use the methods of that class. With emphasis on "can", as I also don't really recommend it as I said in my previous comment.

Anyway, let's leave that discussion as I think we agree that we need some way to customize it (at least to switch between the single / multi values to solve the regression).

I think just follow the API already used for _execute_insert(self, conn, keys, data_iter)

Would there be occasions where a user would want the actual dataframe instead of data_iter?

@cpcloud what do you think of this interface?

@schettino72 One option (for the PR) would also be to implement this parameter, but leave the postgres-copy one as an example of how to use this (to not add an option that only works for a single engine for now; although given its superior speed, we should really think how to deal with this).

@danfrankj
Contributor

Hi all! My apologies for the regression introduced. I like the solution in #21199 much better.

For context, the motivation for needing a multivalues insert was to interact with a presto database. In that case, a normal insert occurs row by row incurring a network call thus making it prohibitively slow. This may be the case for other analytic databases as well (redshift, bigquery).

But we definitely should not have this be the default, so thank you @schettino72 for cleaning this up, and again, apologies.

@danfrankj
Contributor

Just verified that multivalues has this behavior for Redshift as well.

In [60]: df = pd.DataFrame({'foo': np.arange(100)})

In [61]: conn.dialect
Out[61]: <sqlalchemy.dialects.postgresql.psycopg2.PGDialect_psycopg2 at 0x11d8e9748>

In [62]: conn.dialect.supports_multivalues_insert = False

In [63]: %time df.to_sql('test_no_multivalues_insert', con=conn, schema='default', index=False)
CPU times: user 21.4 ms, sys: 11.6 ms, total: 32.9 ms
Wall time: 38.7 s

In [64]: conn.dialect.supports_multivalues_insert = True

In [65]: %time df.to_sql('test_multivalues_insert', con=conn, schema='default', index=False)
CPU times: user 7.88 ms, sys: 2.38 ms, total: 10.3 ms
Wall time: 4.26 s

@jorisvandenbossche
Member

@schettino72 do you have time to work on the PR?
Other short-term option for 0.23.1 is to revert the original change, and target the PR for 0.24.0

@schettino72
Contributor

@schettino72 do you have time to work on the PR?

Sure. I was waiting for more feedback on the API. From my side, I guess I just need to update docstrings and docs...

Should I target 0.24 or 0.23.1?

@TomAugspurger
Contributor

TomAugspurger commented Jun 6, 2018 via email

@jorisvandenbossche
Member

Yes, we certainly want to do something for 0.23.1 (at a minimum revert). But it would be nice to get the option in there at the same time.
I am just not fully sure about the function interface (the exact API of the passed function may need more discussion). Maybe for 0.23.1 we could limit it to adding method='default'/'multi', and leave the ability to pass a function for 0.24.0.
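
For reference, a usage sketch of how such a keyword could look from the caller's side (assuming it lands roughly as discussed, i.e. a method argument on to_sql that defaults to the previous row-by-row behaviour and also accepts 'multi' or a callable like the COPY sketch above):

# single-row inserts (the pre-0.23 behaviour, and the proposed default)
df.to_sql('test', engine, if_exists='replace')

# multi-values insert, keeping each chunk under SQLite's 999-variable limit
df.to_sql('test', engine, if_exists='replace',
          chunksize=999 // (len(df.columns) + 1),
          method='multi')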

@gnfrazier

Had the initial issue using 0.23.0; updating to 0.23.4 corrected the problem. Thanks for the fix.

@theSuiGenerisAakash

On 1.0.1 and still facing the same issue

@vaibhav-chopda

@theSuiGenerisAakash try removing method='multi'. This might help.

@minderchen

Reduce chunksize to see whether it will work, e.g. chunksize=100.

@theSuiGenerisAakash

@theSuiGenerisAakash try removing method='multi'. This might help.

This defeats the purpose

@eromoe

eromoe commented May 18, 2020

Happens on pandas 1.0.1; this issue should be reopened!
The session is sqlite, without method='multi':

df_current.to_sql(data_schema.__tablename__, db_engine, index=False, if_exists='append')

OS: Win 10. It seems some people don't have this problem on Linux and Mac.

@AntoineWeber

AntoineWeber commented Dec 17, 2020

Getting the error on 1.1.5. For a wide dataframe (72 cols), I cannot go above a chunksize of 10.

Uploading my dataframe (of only 5 MB) initially took 30 min; adding method='multi' sped it up about 10x to around 3 min (with a chunksize of 10). Trying bigger chunks gives me the "too many parameters" error.

I am working on Win10.

EDIT: It actually sometimes works and sometimes fails. That's unfortunate.
