pandas to_sql is slow with fast_executemany when SQL trace/extended events is running #1215
-
Posting this here since I think it might be relevant (please close this if not). We use pandas `to_sql()` with `fast_executemany` to load data into SQL Server. We've started encountering an issue where inserts become dramatically slower whenever SQL Sentry is capturing events, but I can reproduce the slowdown with a simple trace even without SQL Sentry.
Why would a trace have such a huge effect (>100x slower)? Is it because of the large number of events being captured? I'll talk to the DBAs about what events they're capturing on prod and whether they can reduce the overhead; it's an all-day monitoring suite. I also see somewhat different results when I have a trace running and also use the `chunksize` argument in `to_sql`.
-
This really sounds like something you should be asking the SQL Sentry people about, but one thing you might try would be to use this `.to_sql()` method https://gist.github.com/gordthompson/1fb0f1c3f5edbf6192e596de8350f205 (to avoid `.to_sql()` using `.executemany()`) and see if it makes any difference.
-
That isn't really surprising - tracing always has overhead and it's often very, very large, especially if it's logging all the data too. Even ODBC trace will cause applications to become many times slower.