
Use multi-row inserts for massive speedups on to_sql over high latency connections #8953

Closed
maxgrenderjones opened this issue Dec 1, 2014 · 49 comments · Fixed by #21401
Labels
IO SQL to_sql, read_sql, read_sql_query Performance Memory or execution speed performance

Comments

@maxgrenderjones
Contributor

I have been trying to insert ~30k rows into a MySQL database using pandas-0.15.1, oursql-0.9.3.1 and sqlalchemy-0.9.4. Because the machine is across the Atlantic from me, calling data.to_sql was taking >1 hr to insert the data. On inspecting with wireshark, the issue is that it is sending an INSERT for every row, then waiting for the ACK before sending the next, and, long story short, the ping times are killing me.

However, following the instructions from SQLAlchemy, I changed

def _execute_insert(self, conn, keys, data_iter):
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement(), data)

to

def _execute_insert(self, conn, keys, data_iter):
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement().values(data))

and the entire operation completes in less than a minute. (To save you a click, the difference is between multiple calls to INSERT INTO foo (columns) VALUES (rowX) and one massive INSERT INTO foo (columns) VALUES (row1), (row2), (row3).) Given how often people are likely to use pandas to insert large volumes of data, this feels like a huge win that would be great to include more widely.
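
For illustration, here is a hedged sketch of the two statement shapes, compiled with SQLAlchemy against a throwaway table (the table and column names are invented, and the exact parameter naming varies by SQLAlchemy version):

from sqlalchemy import Column, Integer, MetaData, Table
from sqlalchemy.dialects import postgresql

foo = Table("foo", MetaData(), Column("a", Integer), Column("b", Integer))
rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]

# Current behaviour: one single-row statement, executed once per row (executemany):
print(foo.insert().compile(dialect=postgresql.dialect()))
# INSERT INTO foo (a, b) VALUES (%(a)s, %(b)s)

# Proposed behaviour: a single statement carrying every row:
print(foo.insert().values(rows).compile(dialect=postgresql.dialect()))
# INSERT INTO foo (a, b) VALUES (%(a_m0)s, %(b_m0)s), (%(a_m1)s, %(b_m1)s), ...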

Some challenges:

  • Not every database supports multirow inserts (SQLite and SQLServer didn't in the past, though they do now). I don't know how to check for this via SQLAlchemy
  • The MySQL server I was using didn't allow me to insert the data all in one go, I had to set the chunksize (5k worked fine, but I guess the full 30k was too much). If we made this the default insert, most people would have to add a chunk size (which might be hard to calculate, as it might be determined by the maximum packet size of the server).

The easiest way to do this would be to add a multirow= boolean parameter (default False) to the to_sql function, and then leave the user responsible for setting the chunksize, but perhaps there's a better way?
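
To make the proposal concrete, a purely illustrative sketch of what that dispatch could look like inside _execute_insert (not actual pandas code; the multirow argument is hypothetical):

def _execute_insert(self, conn, keys, data_iter, multirow=False):
    data = [dict(zip(keys, row)) for row in data_iter]
    if multirow:
        # one INSERT ... VALUES (row1), (row2), ... statement per chunk
        conn.execute(self.insert_statement().values(data))
    else:
        # current behaviour: one single-row INSERT executed once per row
        conn.execute(self.insert_statement(), data)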

Thoughts?

@jorisvandenbossche jorisvandenbossche added the IO SQL to_sql, read_sql, read_sql_query label Dec 2, 2014
@jorisvandenbossche
Member

This seems reasonable. Thanks for investigating this!

For the implementation, it will depend on how sqlalchemy deals with database flavors that do not support this (I can't test this at the moment, but it seems that sqlalchemy raises an error, e.g. http://stackoverflow.com/questions/23886764/multiple-insert-statements-in-mssql-with-sqlalchemy). Also, if it has the consequence that a lot of people will have to set chunksize, doing this by default is indeed not a good idea (unless we set chunksize to a value by default).
So adding a keyword maybe seems better.

@artemyk @mangecoeur @hayd @danielballan

@artemyk
Contributor

artemyk commented Dec 2, 2014

Apparently SQLAlchemy has a flag dialect.supports_multivalues_insert (see e.g. http://pydoc.net/Python/SQLAlchemy/0.8.3/sqlalchemy.sql.compiler/; it is possibly called supports_multirow_insert in other versions, see https://www.mail-archive.com/mediawiki-commits@lists.wikimedia.org/msg202880.html).

Since this has the potential to speed up inserts a lot, and we can check for support easily, I'm thinking maybe we could do it by default, and also set chunksize to a default value (e.g. 16kb chunks... not sure what's too big in most situations). If the multirow insert fails, we could throw an exception suggesting lowering the chunksize?
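
A quick, hedged way to probe that flag on a given engine (the attribute name differs across SQLAlchemy versions, hence the getattr fallbacks):

from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")
dialect = engine.dialect
supports_multi = getattr(dialect, "supports_multivalues_insert",
                         getattr(dialect, "supports_multirow_insert", False))
print(supports_multi)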

@maxgrenderjones
Contributor Author

Now I just need to persuade the SQLAlchemy folks to set supports_multivalues_insert to true on SQL Server >2005 (I hacked it into the code and it works fine, but it's not on by default).

On a more on-topic note, I think the chunksize could be tricky. On my mysql setup (which I probably configured to allow large packets), I can set chunksize=5000, on my SQLServer setup, 500 was too large, but 100 worked fine. However, it's probably true that most of the benefits from this technique come from going from inserting 1 row at a time to 100, rather than 100 to 1000.

@danielballan
Contributor

What if chunksize=None meant "adaptively choose a chunksize"? Attempt something like 5000, 500, 50, 1. Users could turn this off by specifying a chunksize. If the overhead from these attempts is too large, I like @maxgrenderjones's suggestion: chunksize=10 is a better default than chunksize=1.
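
Purely as a sketch of the "adaptively choose a chunksize" idea (all names here are hypothetical, and a real implementation would need to roll back a partially inserted chunk before retrying with a smaller size):

def insert_with_fallback(insert_chunk, rows, sizes=(5000, 500, 50, 1)):
    """Try progressively smaller chunk sizes until one succeeds."""
    for size in sizes:
        try:
            for start in range(0, len(rows), size):
                insert_chunk(rows[start:start + size])
            return size  # the chunksize that worked
        except Exception:
            continue  # too big for this server; try a smaller chunk
    raise RuntimeError("even single-row inserts failed")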

@jorisvandenbossche
Member

On that last comment ("chunksize=10 is a better default than chunksize=1") -> that is not fully true, I think. The current situation is to do one execute call that runs many single-row insert statements (which is not a chunksize of 1), while chunksize=10 would mean doing a lot of execute calls, each with one multi-row insert.
And I don't know if this is necessarily faster; much depends on the situation. For example, with the current code and with a local sqlite database:

In [4]: engine = create_engine('sqlite:///:memory:') #, echo='debug')

In [5]: df = pd.DataFrame(np.random.randn(50000, 10))

In [6]: %timeit df.to_sql('test_default', engine, if_exists='replace')
1 loops, best of 3: 956 ms per loop

In [7]: %timeit df.to_sql('test_default', engine, if_exists='replace', chunksize=10)
1 loops, best of 3: 2.23 s per loop

But of course this does not use the multi-row feature.

@nhockham

We've figured out how to monkey-patch - might be useful to someone else. Put this code in place before calling to_sql.

from pandas.io.sql import SQLTable

def _execute_insert(self, conn, keys, data_iter):
    print("Using monkey-patched _execute_insert")
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement().values(data))

SQLTable._execute_insert = _execute_insert

@jorisvandenbossche
Member

Maybe we can just start by adding this feature through a new multirow=True keyword (with a default of False for now), and then we can always see later whether we can enable it by default?

@maxgrenderjones @nhockham interested to do a PR to add this?

@mangecoeur
Contributor

@jorisvandenbossche I think it's risky to start adding keyword arguments to address specific performance profiles. If you can guarantee that it's faster in all cases (if necessary by having it determine the best method based on the inputs) then you don't need a flag at all.

Different DB setups may have different performance optimizations (different DB perf profiles, local vs network, big memory vs fast SSD, etc.); if you start adding keyword flags for each, it becomes a mess.

I would suggest creating subclasses of SQLDatabase and SQLTable to address performance-specific implementations; they would be used through the object-oriented API. Perhaps a "backend switching" method could be added, but frankly using the OO API is very simple, so this is probably overkill for what is already a specialized use-case.
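
A rough sketch of that subclassing route, reusing the multi-row insert from earlier in the thread (SQLTable and _execute_insert are pandas internals, so their signatures may change):

from pandas.io.sql import SQLTable

class MultiRowSQLTable(SQLTable):
    """SQLTable variant that emits one multi-row INSERT per chunk."""

    def _execute_insert(self, conn, keys, data_iter):
        data = [dict(zip(keys, row)) for row in data_iter]
        conn.execute(self.insert_statement().values(data))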

I created such a sub-class for loading large datasets to Postgres (it's actually much faster to save the data to CSV and then use the built-in, non-standard COPY FROM sql command than to use inserts, see https://gist.github.com/mangecoeur/1fbd63d4758c2ba0c470#file-pandas_postgres-py). To use it you just do PgSQLDatabase(engine, <args>).to_sql(frame, name, <kwargs>)
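
For reference, a simplified, hedged version of the CSV + COPY idea behind that gist, going through psycopg2's copy_expert on the raw DBAPI connection (it assumes a PostgreSQL engine and an existing table whose columns match the frame):

import io

def pg_copy_from_dataframe(df, table_name, engine):
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    raw = engine.raw_connection()  # psycopg2 connection underneath
    try:
        with raw.cursor() as cur:
            cur.copy_expert("COPY {} FROM STDIN WITH CSV".format(table_name), buf)
        raw.commit()
    finally:
        raw.close()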

@artemyk
Contributor

artemyk commented Feb 26, 2015

Just for reference, I tried running the code by @jorisvandenbossche (Dec 3rd post) using the multirow feature. It's quite a bit slower, so the speed trade-offs here are not trivial:

In [4]: engine = create_engine('sqlite:///:memory:') #, echo='debug')

In [5]: df = pd.DataFrame(np.random.randn(50000, 10))

In [6]: 

In [6]: %timeit df.to_sql('test_default', engine, if_exists='replace')
1 loops, best of 3: 1.05 s per loop

In [7]: 

In [7]: from pandas.io.sql import SQLTable

In [8]: 

In [8]: def _execute_insert(self, conn, keys, data_iter):
   ...:         data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
   ...:         conn.execute(self.insert_statement().values(data))
   ...:     

In [9]: SQLTable._execute_insert = _execute_insert

In [10]: 

In [10]: reload(pd)
Out[10]: <module 'pandas' from '/usr/local/lib/python2.7/site-packages/pandas/__init__.pyc'>

In [11]: 

In [11]: %timeit df.to_sql('test_default', engine, if_exists='replace', chunksize=10)
1 loops, best of 3: 9.9 s per loop

Also, I agree that adding keyword parameters is risky. However, the multirow feature seems pretty fundamental. Also, 'monkey-patching' is probably not more robust to API changes than keyword parameters.

@mangecoeur
Contributor

It's as I suspected. Monkey-patching isn't the solution I was suggesting - rather that we ship a number of performance-oriented subclasses that the informed user could use through the OO interface (to avoid loading the functional API with too many options).


@maxgrenderjones
Contributor Author

As per the initial ticket title, I don't think this approach is going to be preferable in all cases, so I wouldn't make it the default. However, without it, pandas to_sql is unusable for me, so it's important enough for me to keep requesting the change. (It's also become the first thing I change when I upgrade my pandas version.) As for sensible chunksize values, I don't think there is one true n, as the packet size will depend on how many columns there are (and what's in them) in hard-to-predict ways. Unfortunately SQLServer fails with an error message that looks totally unrelated (but isn't) if you set the chunksize too high (which is probably why multirow inserts aren't turned on in SQLAlchemy except with a patch), but it works fine with mysql. Users may need to experiment to determine what value of n is likely to result in an acceptably large packet size (for whatever their backing database is). Having pandas choose n is likely to land us way further down in the implementation details than we want to be (i.e. the opposite direction from the maximum-possible-abstraction SQLAlchemy approach).

In short, my recommendation would be to add it as a keyword, with some helpful commentary about how to use it. This wouldn't be the first time a keyword was used to select an implementation (see: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html), but that perhaps isn't the best example, as I haven't the first idea what raw= means, even having read the explanation!

@dragonator4

I have noticed that it also consumes a huge amount of memory: a 1.6+ GB DataFrame with some 700,000 rows and 301 columns requires almost 34 GB during the insert! That is wildly inefficient. Any ideas on why that might be the case? Here is a screen clip:

[screenshot of memory usage during the insert]

@andreacassioli

Hi guys,
any progress on this issue?

I am trying to insert around 200K rows using to_sql, but it takes forever and consumes a huge amount of memory! Using chunksize helps with the memory, but the speed is still very slow.

My impression, looking at the MSSQL database trace, is that the insertion is actually performed one row at a time.

The only viable approach now is to dump to a csv file on a shared folder and use BULK INSERT. But it is very annoying and inelegant!

@ostrokach

ostrokach commented Oct 20, 2016

@andreacassioli You can use odo to insert a DataFrame into an SQL database through an intermediary CSV file. See Loading CSVs into SQL Databases.

I don't think you can come even close to BULK INSERT performance using ODBC.
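
A hedged example of the odo route (the connection URI and table name are placeholders):

from odo import odo

# odo converts via an intermediary CSV and the database's bulk loader where it can
odo(df, 'postgresql://user:password@localhost/mydb::my_table')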

@andreacassioli

@ostrokach thank you, indeed I am using csv files now. If I could get close, I would trade a bit of time for simplicity!

@indera

indera commented Mar 3, 2017

I thought this might help somebody:
http://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow

@jorisvandenbossche
Member

@indera pandas does not use the ORM, only sqlalchemy Core (which is what the doc entry there suggests to use for large inserts)

@russlamb

russlamb commented Mar 6, 2017

Is there any consensus on how to work around this in the meantime? I'm inserting several million rows into postgres and it takes forever. Is CSV / odo the way to go?

@jreback
Contributor

jreback commented Mar 6, 2017

@russlamb a practical way to solve this problem is simply to bulk upload. This is somewhat db-specific though, so odo has solutions for postgres (and maybe mysql) I think. For something like sqlserver you have to 'do this yourself' (IOW you have to write it).

@indera

indera commented Mar 6, 2017

For sqlserver I used the FreeTDS driver (http://www.freetds.org/software.html and https://github.com/mkleehammer/pyodbc) with SQLAlchemy entities, which resulted in very fast inserts (20K rows per data frame):

import logging
from urllib import parse  # assumed import: quotes the raw ODBC connection string

import sqlalchemy as db  # assumed import: the snippet refers to sqlalchemy as "db"
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
logger = logging.getLogger(__name__)

# Engine singleton; DB_HOST, DB_PORT, DB_NAME, DB_USER and DB_PASS are assumed to
# come from configuration elsewhere.
DB_POOL = None


class DemographicEntity(Base):
    __tablename__ = 'DEMOGRAPHIC'

    patid = db.Column("PATID", db.Text, primary_key=True)
    # alternative definition kept from the original snippet:
    # patid = db.Column("PATID", db.Text, primary_key=True, autoincrement=False, nullable=True)
    birth_date = db.Column("BIRTH_DATE", db.Date)
    birth_time = db.Column("BIRTH_TIME", db.Text(5))
    sex = db.Column("SEX", db.Text(2))


def get_db_url(db_host, db_port, db_name, db_user, db_pass):
    params = parse.quote(
        "Driver={{FreeTDS}};Server={};Port={};"
        "Database={};UID={};PWD={};"
        .format(db_host, db_port, db_name, db_user, db_pass))
    return 'mssql+pyodbc:///?odbc_connect={}'.format(params)


def get_db_pool():
    """
    Create the database engine connection.
    @see http://docs.sqlalchemy.org/en/latest/core/engines.html

    :return: Engine object which can either be used directly
            to interact with the database, or can be passed to
            a Session object to work with the ORM.
    """
    global DB_POOL

    if DB_POOL is None:
        url = get_db_url(db_host=DB_HOST, db_port=DB_PORT, db_name=DB_NAME,
                         db_user=DB_USER, db_pass=DB_PASS)
        DB_POOL = db.create_engine(url,
                                   pool_size=10,
                                   max_overflow=5,
                                   pool_recycle=3600)

    try:
        DB_POOL.execute("USE {db}".format(db=DB_NAME))
    except db.exc.OperationalError:
        logger.error('Database {db} does not exist.'.format(db=DB_NAME))

    return DB_POOL


def save_frame(df):
    # executemany of the entity's Core insert(): one parameter set per row
    db_pool = get_db_pool()
    records = df.to_dict(orient='records')
    db_pool.execute(DemographicEntity.__table__.insert(), records)
 

@jorisvandenbossche
Member

Is CSV / odo the way to go?

This solution will almost always be faster I think, regardless of the multi-row / chunksize settings.

But, @russlamb, it is always interesting to hear whether such a multi-row keyword would be an improvement in your case. See eg #8953 (comment) on a way to easily test this out.

I think there is agreement that we want to have a way to specify this (without necessarily changing the default). So if somebody wants to make a PR for this, that is certainly welcome.
There was only some discussion on how to add this ability (a new keyword vs a subclass using the OO API).

@indera

indera commented Mar 6, 2017

@jorisvandenbossche The document I linked above mentions "Alternatively, the SQLAlchemy ORM offers the Bulk Operations suite of methods, which provide hooks into subsections of the unit of work process in order to emit Core-level INSERT and UPDATE constructs with a small degree of ORM-based automation."

What I am suggesting is to implement a sqlserver-specific version of to_sql which under the hood uses the SQLAlchemy ORM for speedups, as in the code I posted above.
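
For completeness, the ORM Bulk Operations route mentioned there looks roughly like this (assuming the engine/DemographicEntity from the earlier snippet and a DataFrame df whose columns match the mapped attribute names):

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=get_db_pool())
session = Session()
# emits a Core-level executemany INSERT, bypassing per-object ORM overhead
session.bulk_insert_mappings(DemographicEntity, df.to_dict(orient='records'))
session.commit()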

@mangecoeur
Contributor

mangecoeur commented Mar 7, 2017 via email

@mangecoeur
Contributor

mangecoeur commented Mar 7, 2017 via email

@dfernan

dfernan commented Jun 1, 2017

Is this getting fixed/taken care of? As of now inserting pandas dataframes into a SQL db is extremely slow unless it's a toy dataframe. Let's decide on a solution and push it forward?

@ostrokach

@dfernan As mentioned above, you may want to look at odo. Using an intermediary CSV file will always be orders of magnitude faster than going through sqlalchemy, no matter what kind of improvements happen here...

@jreback jreback added this to the 0.23.1 milestone Jun 4, 2018
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.1, 0.24.0 Jun 7, 2018
@gfyoung gfyoung added the Performance Memory or execution speed performance label Jun 10, 2018
@citynorman

citynorman commented Oct 14, 2018

I found d6tstack much simpler to use; it's a one-liner d6tstack.utils.pd_to_psql(df, cfg_uri_psql, 'benchmark', if_exists='replace') and it's much faster than df.to_sql(). It supports postgres and mysql. See https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb

@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Nov 6, 2018
@VincentLa14

VincentLa14 commented Nov 25, 2018

I've been using the Monkey Patch Solution:

from pandas.io.sql import SQLTable

def _execute_insert(self, conn, keys, data_iter):
    print("Using monkey-patched _execute_insert")
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement().values(data))

SQLTable._execute_insert = _execute_insert

for some time now, but now I'm getting an error:

TypeError: insert_statement() missing 2 required positional arguments: 'data' and 'conn'

Is anyone else getting this? I'm on Python 3.6.5 (Anaconda) and pandas==0.23.0

@hnazkani

Is this getting fixed? Currently, df.to_sql is extremely slow and can't be used at all for many practical use cases. The odo project seems to have been abandoned already.
I have the following use cases in financial time series where df.to_sql is pretty much not usable:

  1. copying historical csv data to a postgres database - can't use df.to_sql and had to go with custom code around psycopg2's copy_from functionality
  2. streaming data (coming in batches of ~500-3000 rows per second) to be dumped to a postgres database - again, df.to_sql performance is pretty disappointing, as it takes too much time to insert these natural batches of data into postgres.

The only place where I find df.to_sql useful now is to create tables automatically!!! - which is not the use case it was designed for.
I am not sure if other people share the same concern, but this issue needs some attention for "dataframes-to-database" interfaces to work smoothly.
Look forward.

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Dec 27, 2018
mroeschke pushed a commit that referenced this issue Dec 28, 2018
#21401)

* ENH: to_sql() add parameter "method" to control insertions method (#8953)

* ENH: to_sql() add parameter "method". Fix docstrings (#8953)

* ENH: to_sql() add parameter "method". Improve docs based on reviews (#8953)

* ENH: to_sql() add parameter "method". Fix unit-test (#8953)

* doc clean-up

* additional doc clean-up

* use dict(zip()) directly

* clean up merge

* default --> None

* Remove stray default

* Remove method kwarg

* change default to None

* test copy insert snippit

* print debug

* index=False

* Add reference to documentation
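
For anyone landing on this thread now: the change merged above shipped in pandas 0.24.0 as a method argument on to_sql, so the multi-row behaviour discussed here can be switched on per call (chunksize still has to respect the server's packet/parameter limits):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")
df = pd.DataFrame({"a": range(1000), "b": range(1000)})

# method="multi" sends many rows per INSERT statement; keep chunksize modest so
# each statement stays under the backend's limits.
df.to_sql("demo", engine, if_exists="replace", index=False,
          method="multi", chunksize=200)
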
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
…ndas-dev#8… (pandas-dev#21401)

* ENH: to_sql() add parameter "method" to control insertions method (pandas-dev#8953)

* ENH: to_sql() add parameter "method". Fix docstrings (pandas-dev#8953)

* ENH: to_sql() add parameter "method". Improve docs based on reviews (pandas-dev#8953)

* ENH: to_sql() add parameter "method". Fix unit-test (pandas-dev#8953)

* doc clean-up

* additional doc clean-up

* use dict(zip()) directly

* clean up merge

* default --> None

* Remove stray default

* Remove method kwarg

* change default to None

* test copy insert snippit

* print debug

* index=False

* Add reference to documentation
@jconstanzo

Hey, I'm getting this error when I try to perform a multi-insert to a SQLite database:

This is my code:
df.to_sql("financial_data", con=conn, if_exists="append", index=False, method="multi")

and I get this error:

Traceback (most recent call last):

  File "<ipython-input-11-cf095145b980>", line 1, in <module>
    handler.insert_financial_data_from_df(data, "GOOG")

  File "C:\Users\user01\Documents\Code\FinancialHandler.py", line 110, in insert_financial_data_from_df
    df.to_sql("financial_data", con=conn, if_exists="append", index=False, method="multi")

  File "C:\Users\user01\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 2531, in to_sql
    dtype=dtype, method=method)

  File "C:\Users\user01\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 460, in to_sql
    chunksize=chunksize, dtype=dtype, method=method)

  File "C:\Users\user01\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 1547, in to_sql
    table.insert(chunksize, method)

  File "C:\Users\user01\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 686, in insert
    exec_insert(conn, keys, chunk_iter)

  File "C:\Users\user01\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 609, in _execute_insert_multi
    conn.execute(self.table.insert(data))

TypeError: insert() takes exactly 2 arguments (1 given)

Why is this happening? I'm using Python 3.7.3 (Anaconda), pandas 0.24.2 and sqlite3 2.6.0.

Thank you very much in advance!

@jorisvandenbossche
Member

@jconstanzo can you open this as a new issue?
And if possible, can you try to provide a reproducible example? (eg a small example dataframe that can show the problem)

@Bauxitedev

@jconstanzo Having the same issue here. Using method='multi' (in my case, in combination with chunksize) seems to trigger this error when you try to insert into a SQLite database.

Unfortunately I can't really provide an example dataframe because my dataset is huge, that's the reason I'm using method and chunksize in the first place.

@jconstanzo

I'm sorry for the delay. I just opened an issue for this problem: #29921

@ban04toufuonline

How to hack this? @maxgrenderjones

Now I just need to persuade the SQLAlchemy folks to set supports_multivalues_insert to true on SQL Server >2005 (I hacked it into the code and it works fine, but it's not on by default).

On a more on-topic note, I think the chunksize could be tricky. On my mysql setup (which I probably configured to allow large packets), I can set chunksize=5000, on my SQLServer setup, 500 was too large, but 100 worked fine. However, it's probably true that most of the benefits from this technique come from going from inserting 1 row at a time to 100, rather than 100 to 1000.
