
sqldf does not support large data sets #80

Open

dqaudithillstone opened this issue Feb 6, 2020 · 1 comment

Comments

@dqaudithillstone

My dataframe has more than 120,000 records. When I use sqldf to query it, the program runs without error if I set limit 20000, but it fails once the limit is removed.

Source code:

import pandas as pd
from pandasql import sqldf

def pysqldf(q):
    return sqldf(q, globals())

df = pd.read_csv('./dataAnalyse/ht.csv', encoding='utf_8_sig')  # supports a path containing Chinese characters
q = """
select a.[合同编号] as htbh1, b.[合同编号] as htbh2, a.[合同名称] as htmc1, b.[合同名称] as htmc2
from df a left join df b
on a.年度 = b.年度 and a.公司名称 = b.公司名称 and a.[合同编号] <> b.[合同编号]
limit 20000;
"""
df0 = pysqldf(q)
print(df0.info())

The error message is as follows:

PS D:\python\myworks> & C:/python/Python37-32/python.exe d:/python/myworks/DataAnalyse/xsd.py
Traceback (most recent call last):
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 3206, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 3173, in _fetchall_impl
    return self.cursor.fetchall()
sqlite3.OperationalError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\python\Python37-32\lib\site-packages\pandasql\sqldf.py", line 61, in __call__
    result = read_sql(query, conn)
  File "C:\python\Python37-32\lib\site-packages\pandas\io\sql.py", line 438, in read_sql
    chunksize=chunksize,
  File "C:\python\Python37-32\lib\site-packages\pandas\io\sql.py", line 1231, in read_query
    data = result.fetchall()
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 3212, in fetchall
    self.cursor, self.context)
  File "C:\python\Python37-32\lib\site-packages\sqlalchemy\engine\base.py", line 1843, in _handle_dbapi_exception
    from e
sqlalchemy.exc.OperationalError: (OperationalError) None None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:/python/myworks/DataAnalyse/xsd.py", line 20, in <module>
    df0=pysqldf(q)
  File "d:/python/myworks/DataAnalyse/xsd.py", line 7, in pysqldf
    return sqldf(q,globals())
  File "C:\python\Python37-32\lib\site-packages\pandasql\sqldf.py", line 156, in sqldf
    return PandaSQL(db_uri)(query, env)
  File "C:\python\Python37-32\lib\site-packages\pandasql\sqldf.py", line 63, in __call__
    raise PandaSQLException(ex)
pandasql.sqldf.PandaSQLException: (OperationalError) None None
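Since the query succeeds when a limit is applied, one possible workaround (an untested sketch reusing the pysqldf helper and dataframe from above; the 20,000-row page size is simply the value that worked in this report) is to page through the self-join with LIMIT/OFFSET and concatenate the pages in pandas:

import pandas as pd

# Same join as above, but with no trailing LIMIT so pages can be appended.
q_base = """
select a.[合同编号] as htbh1, b.[合同编号] as htbh2, a.[合同名称] as htmc1, b.[合同名称] as htmc2
from df a left join df b
on a.年度 = b.年度 and a.公司名称 = b.公司名称 and a.[合同编号] <> b.[合同编号]
"""

pages = []
offset = 0
while True:
    # Each call re-runs the join, which is slow, but it keeps each
    # fetched result set small enough to materialize.
    page = pysqldf(q_base + f" limit 20000 offset {offset};")
    if page.empty:
        break
    pages.append(page)
    offset += 20000

df0 = pd.concat(pages, ignore_index=True)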

@zbrookle

@dqaudithillstone Part of the performance issue with this package is that it uses SQLite as a backend, which means you lose one of the main benefits of pandas: in-memory computation. As a solution to this problem, a few months back I created a package called dataframe_sql, which addresses it by parsing the SQL and translating it to native pandas operations. Hope this helps!
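For comparison, the self-join in the original question can be written directly in native pandas (a minimal sketch using the column names from the question; note that pandas' default merge is an inner join, whereas the original SQL used a left join, so rows with no matching partner are dropped here rather than kept as NaN):

import pandas as pd

df = pd.read_csv('./dataAnalyse/ht.csv', encoding='utf_8_sig')

# Pair rows sharing the same year and company but with different contract numbers.
pairs = df.merge(df, on=['年度', '公司名称'], suffixes=('_a', '_b'))
pairs = pairs[pairs['合同编号_a'] != pairs['合同编号_b']]

result = pairs.rename(columns={
    '合同编号_a': 'htbh1', '合同编号_b': 'htbh2',
    '合同名称_a': 'htmc1', '合同名称_b': 'htmc2',
})[['htbh1', 'htbh2', 'htmc1', 'htmc2']]
print(result.info())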
