
sqldf does not support large data sets #80

Open

dqaudithillstone opened this issue Feb 6, 2020 · 1 comment

Comments

@dqaudithillstone

My dataframe has more than 120,000 records. When I use sqldf to query it, the program runs without error if I set limit 20000, but it fails once the limit is removed.

Source code:

import pandas as pd
from pandasql import sqldf

def pysqldf(q):
    return sqldf(q, globals())

df = pd.read_csv('./dataAnalyse/ht.csv', encoding='utf_8_sig')  # supports a path containing Chinese characters
q = """
select a.[合同编号] as htbh1, b.[合同编号] as htbh2, a.[合同名称] as htmc1, b.[合同名称] as htmc2
from df a left join df b
on a.年度 = b.年度 and a.公司名称 = b.公司名称 and a.[合同编号] <> b.[合同编号]
limit 20000;
"""
df0 = pysqldf(q)
print(df0.info())

The error message is as follows:

PS D:\python\myworks> & C:/python/Python37-32/python.exe d:/python/myworks/DataAnalyse/xsd.py
Traceback (most recent call last):
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 3206, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 3173, in _fetchall_impl
    return self.cursor.fetchall()
sqlite3.OperationalError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\python\Python37-32\lib\site-packages\pandasql\sqldf.py", line 61, in __call__
    result = read_sql(query, conn)
  File "C:\python\Python37-32\lib\site-packages\pandas\io\sql.py", line 438, in read_sql
    chunksize=chunksize,
  File "C:\python\Python37-32\lib\site-packages\pandas\io\sql.py", line 1231, in read_query
    data = result.fetchall()
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 3212, in fetchall
    self.cursor, self.context)
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 1843, in _handle_dbapi_exception
    from e
sqlalchemy.exc.OperationalError: (OperationalError) None None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:/python/myworks/DataAnalyse/xsd.py", line 20, in <module>
    df0=pysqldf(q)
  File "d:/python/myworks/DataAnalyse/xsd.py", line 7, in pysqldf
    return sqldf(q,globals())
  File "C:\python\Python37-32\lib\site-packages\pandasql\sqldf.py", line 156, in sqldf
    return PandaSQL(db_uri)(query, env)
  File "C:\python\Python37-32\lib\site-packages\pandasql\sqldf.py", line 63, in __call__
    raise PandaSQLException(ex)
pandasql.sqldf.PandaSQLException: (OperationalError) None None
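Since the query succeeds when a limit is applied, one possible workaround (an untested sketch reusing the pysqldf helper and dataframe from above; the 20,000-row page size is simply the value that worked in this report) is to page through the self-join with LIMIT/OFFSET and concatenate the pages in pandas:

import pandas as pd

# Same join as above, but with no trailing LIMIT so pages can be appended.
q_base = """
select a.[合同编号] as htbh1, b.[合同编号] as htbh2, a.[合同名称] as htmc1, b.[合同名称] as htmc2
from df a left join df b
on a.年度 = b.年度 and a.公司名称 = b.公司名称 and a.[合同编号] <> b.[合同编号]
"""

pages = []
offset = 0
while True:
    # Each call re-runs the join, which is slow, but it keeps each
    # fetched result set small enough to materialize.
    page = pysqldf(q_base + f" limit 20000 offset {offset};")
    if page.empty:
        break
    pages.append(page)
    offset += 20000

df0 = pd.concat(pages, ignore_index=True)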

@zbrookle

@dqaudithillstone Part of the performance issue with this package is that it uses SQLite as a backend, which means you lose one of the main benefits of pandas: in-memory computation. As a solution to this problem, a few months back I created a package called dataframe_sql, which addresses it by parsing the SQL and translating it to native pandas operations. Hope this helps!
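For comparison, the self-join in the original question can be written directly in native pandas (a minimal sketch using the column names from the question; note that pandas' default merge is an inner join, whereas the original SQL used a left join, so rows with no matching partner are dropped here rather than kept as NaN):

import pandas as pd

df = pd.read_csv('./dataAnalyse/ht.csv', encoding='utf_8_sig')

# Pair rows sharing the same year and company but with different contract numbers.
pairs = df.merge(df, on=['年度', '公司名称'], suffixes=('_a', '_b'))
pairs = pairs[pairs['合同编号_a'] != pairs['合同编号_b']]

result = pairs.rename(columns={
    '合同编号_a': 'htbh1', '合同编号_b': 'htbh2',
    '合同名称_a': 'htmc1', '合同名称_b': 'htmc2',
})[['htbh1', 'htbh2', 'htmc1', 'htmc2']]
print(result.info())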
