From redshift DB, read_sql_query distorts values, but unload does not #2216

Open
lampretl opened this issue Apr 20, 2023 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@lampretl

Describe the bug

In our private Redshift database on AWS, I have a table with a column probability of type numeric(38, 20); its values are between 0 and 1. In an AWS SageMaker Jupyter notebook, I query/download the content of that table. Some of the values obtained via read_sql_query are negative, but the ones obtained via unload are all positive.

How to Reproduce

When executing the following code (con is an open Redshift connection and staging_area() is our helper that returns an S3 staging path):

query = "SELECT user_id, probability p FROM models.churn_short_term_predictions"
df1 = wr.redshift.read_sql_query(query, con=con)
print(df1.p.min())
df2 = wr.redshift.unload(query, con=con, path=staging_area())
print(df2.p.min())

the output is unexpectedly:

-0.09223216322371685616
0.00028089172873606330

Expected behavior

The outputs df1.p.min() and df2.p.min() should be equal, and both must certainly be between 0 and 1.

Your project

No response

Screenshots

photo_2023-04-20_18-37-54

Screenshot at 2023-04-20 18-41-48

OS

Linux (AWS)

Python version

3.8.16

AWS SDK for pandas version

3.0.0

Additional context

No response

@lampretl lampretl added the bug Something isn't working label Apr 20, 2023
@kukushking kukushking self-assigned this Apr 21, 2023
@kukushking
Contributor

Thanks @lampretl, looking into it.

@kukushking
Contributor

Hi @lampretl, I've added a test case to simulate an overflow, which is most likely what happens here, but I wasn't able to reproduce it so far. Does this reproduce consistently for you? What are the actual MIN and MAX values in Redshift?

@lampretl
Author

lampretl commented Apr 21, 2023

If I run that Jupyter cell several times, over different days, I always get the slightly negative result with read_sql_query.

According to DBeaver, the min and max values in my table are 0.00028089172873606330 and 0.98937731753215710000. The table has 2360571 rows.

@kukushking
Contributor

Could you share the output of pip freeze, please?

@lampretl
Author

Of course. If I run ! pip freeze in a Jupyter notebook in SageMaker Studio on AWS, I get:

alembic==1.10.3
appdirs @ file:///home/conda/feedstock_root/build_artifacts/appdirs_1603108395799/work
asn1crypto==1.5.1
asttokens==2.2.1
attrs==22.2.0
awscli==1.27.70
awsio @ https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
awswrangler==3.0.0
backcall==0.2.0
bcrypt==4.0.1
beautifulsoup4==4.12.2
bokeh==2.4.3
boto3==1.26.70
botocore==1.29.70
brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1666764652625/work
cached-property @ file:///home/conda/feedstock_root/build_artifacts/cached_property_1615209429212/work
certifi==2022.12.7
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1671179356964/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1661170624537/work
click==8.1.3
cloudpickle @ file:///home/conda/feedstock_root/build_artifacts/cloudpickle_1674202310934/work
cmaes==0.9.1
colorama==0.4.4
colorlog==6.7.0
conda==22.11.1
conda-content-trust @ file:///home/conda/feedstock_root/build_artifacts/conda-content-trust_1621370699668/work
conda-package-handling @ file:///home/conda/feedstock_root/build_artifacts/conda-package-handling_1669907009957/work
conda_package_streaming @ file:///home/conda/feedstock_root/build_artifacts/conda-package-streaming_1669733752472/work
contextlib2==21.6.0
contourpy==1.0.7
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography-split_1675828607636/work
cycler==0.11.0
Cython @ file:///home/conda/feedstock_root/build_artifacts/cython_1673054071802/work
decorator==5.1.1
dgl==0.9.1.post1
dill==0.3.6
docutils==0.16
executing==1.2.0
fonttools==4.38.0
fsspec==2023.1.0
gevent==22.10.2
google-pasta==0.2.0
greenlet==2.0.2
h5py @ file:///home/conda/feedstock_root/build_artifacts/h5py_1675704810568/work
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1663625384323/work
imageio==2.25.1
importlib-metadata==4.13.0
importlib-resources==5.10.2
inotify-simple==1.2.1
ipykernel==5.5.6
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==8.0.6
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1663332044897/work
jsonschema==4.17.3
jupyter-client==6.1.5
jupyter-core==4.9.2
jupyterlab-widgets==3.0.7
kiwisolver==1.4.4
libmambapy @ file:///home/conda/feedstock_root/build_artifacts/mamba-split_1671598321536/work/libmambapy
lightgbm==3.3.5
llvmlite==0.36.0
lxml==4.9.2
Mako==1.2.4
mamba @ file:///home/conda/feedstock_root/build_artifacts/mamba-split_1671598321536/work/mamba
MarkupSafe==2.1.2
matplotlib==3.5.3
matplotlib-inline==0.1.6
multiprocess==0.70.14
natsort==8.3.1
networkx @ file:///home/conda/feedstock_root/build_artifacts/networkx_1673151334029/work
numba==0.53.1
numpy==1.21.6
opencv-python==4.7.0.68
optuna==3.1.0
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1673482170163/work
pandas==1.3.5
paramiko==3.0.0
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pathos==0.3.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
pkgutil_resolve_name==1.3.10
platformdirs==3.2.0
plotly==5.9.0
pluggy @ file:///home/conda/feedstock_root/build_artifacts/pluggy_1667232663820/work
pooch @ file:///home/conda/feedstock_root/build_artifacts/pooch_1643032624649/work
portalocker==2.7.0
pox==0.3.2
ppft==1.7.6.6
prompt-toolkit==3.0.36
protobuf==3.20.2
protobuf3-to-dict==0.1.5
psutil==5.9.0
psycopg2-binary==2.9.6
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==11.0.0
pyasn1==0.4.8
pycosat @ file:///home/conda/feedstock_root/build_artifacts/pycosat_1666834293299/work
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pyfunctional==1.4.3
Pygments==2.14.0
pyinstrument==3.4.2
pyinstrument-cext==0.2.4
PyNaCl==1.5.0
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1672659226110/work
pyparsing==3.0.9
pyrsistent==0.19.3
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1673864280276/work
PyYAML==5.4.1
pyzmq==19.0.0
redshift-connector==2.0.910
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1673863902341/work
retrying==1.3.4
rsa==4.7.2
ruamel.yaml @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml_1666827402316/work
ruamel.yaml.clib @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml.clib_1670412724006/work
s3fs==0.4.2
s3transfer==0.6.0
sagemaker==2.146.0
sagemaker-experiments==0.1.42
sagemaker-pytorch-training==2.7.0
sagemaker-training==4.4.5
schema==0.7.5
scikit-learn==1.0.2
scipy==1.7.3
scramp==1.4.4
seaborn==0.12.2
seedir==0.4.2
shap @ file:///home/conda/feedstock_root/build_artifacts/shap_1655716944211/work
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
slicer @ file:///home/conda/feedstock_root/build_artifacts/slicer_1608146800664/work
smclarify==0.3
smdebug==1.0.24b20230214
smdebug-rulesconfig==1.0.1
soupsieve==2.4.1
SQLAlchemy==1.4.47
sqlalchemy-redshift==0.8.14
stack-data==0.6.2
tabulate==0.9.0
tenacity==8.2.1
threadpoolctl @ file:///home/conda/feedstock_root/build_artifacts/threadpoolctl_1643647933166/work
toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1657485559105/work
torch @ https://aws-pytorch-unified-cicd-binaries.s3.us-west-2.amazonaws.com/r1.12.1_sm/20230106-033032/626f1a6ec58817e524705e0917dbab97c81b440e/torch-1.12.1%2Bcpu-cp38-cp38-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cpu/torchaudio-0.12.1%2Bcpu-cp38-cp38-linux_x86_64.whl
torchdata @ https://download.pytorch.org/whl/test/torchdata-0.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
torchvision @ https://download.pytorch.org/whl/cpu/torchvision-0.13.1%2Bcpu-cp38-cp38-linux_x86_64.whl
tornado==6.2
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1662214488106/work
traitlets==5.9.0
typing_extensions==4.4.0
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1673452138552/work
wcwidth==0.2.6
Werkzeug==2.2.2
widgetsnbextension==4.0.7
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1675982654259/work
zope.event==4.6
zope.interface==5.5.2
zstandard==0.19.0

@lampretl
Author

@kukushking Have you been able to reproduce the bug?
If not, shall I post here a dataframe containing the table that causes the issue?

@kukushking
Contributor

@lampretl Unfortunately, no, I wasn't able to reproduce this. Yes please, if you could share a test case here, it would be much appreciated!

@lampretl
Author

@kukushking I was not able to reproduce the wrong output by uploading a dataframe, but I have found the source of the problem: internal casting of Redshift's numeric(38,20) to pandas' float64. Have a look at my screenshot:

Screenshot at 2023-05-31 16-54-49
When I run the usual query, read_sql_query fetches the data incorrectly, but when I use an explicit cast with ::numeric(18,17), the obtained result seems correct. You can also see how much the results differ: e.g. in the first row, we get -2% instead of 70%.

So my questions are:

  • would it be possible that you integrate automatic appropriate casting for read_sql_query?
  • are we sure no such incorrect loading occurs with unload?
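As a side note on the float64 casting mentioned above: this minimal local sketch (no Redshift needed) only illustrates the precision mismatch, not the sign flip itself, which the maintainers could not reproduce. It uses the MIN value reported from DBeaver to show that a numeric(38, 20) value cannot survive a round trip through float64 exactly:

```python
import decimal

# Exact MIN reported from DBeaver for the numeric(38, 20) column.
exact_min = decimal.Decimal("0.00028089172873606330")

# float64 cannot represent this value exactly: Decimal(float(...)) exposes
# the nearest binary double, which differs from the stored decimal.
assert decimal.Decimal(float(exact_min)) != exact_min
```

This is why keeping the column as a decimal type end to end, rather than letting it pass through float64, preserves the stored values.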

@kukushking
Contributor

kukushking commented Jun 9, 2023

Hi @lampretl, yes, it looks like that's the issue. I still wasn't able to reproduce it with the same numpy, pandas, and arrow versions (all returned values are decimals wrapped in a pandas object column), but there is something I think you can do:

import pyarrow as pa

df1 = wr.redshift.read_sql_query(query, con=redshift_con, dtype={"c0": "int64", "p": pa.decimal128(38, 37)})

You can force read_sql_query to return decimals with the expected precision/scale. Please give it a try; if that matches your db schema, you should have no issues.

Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants