From redshift DB, read_sql_query distorts values, but unload does not #2216

Open
lampretl opened this issue Apr 20, 2023 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@lampretl

Describe the bug

In our private Redshift database on AWS, I have a table with a column probability of type numeric(38, 20); its values are between 0 and 1. In an AWS SageMaker Jupyter notebook, I query/download the content of that table. Some of the values obtained via read_sql_query are negative, but the ones obtained via unload are all positive.

How to Reproduce

When executing the following code (con is an open Redshift connection and staging_area() is our helper that returns an S3 staging path):

query = "SELECT user_id, probability p FROM models.churn_short_term_predictions"
df1 = wr.redshift.read_sql_query(query, con=con)
print(df1.p.min())
df2 = wr.redshift.unload(query, con=con, path=staging_area())
print(df2.p.min())

the output is unexpectedly:

-0.09223216322371685616
0.00028089172873606330

Expected behavior

The outputs df1.p.min() and df2.p.min() should be equal, and both must certainly be between 0 and 1.

Your project

No response

Screenshots

photo_2023-04-20_18-37-54

Screenshot at 2023-04-20 18-41-48

OS

Linux (AWS)

Python version

3.8.16

AWS SDK for pandas version

3.0.0

Additional context

No response

@lampretl lampretl added the bug Something isn't working label Apr 20, 2023
@kukushking kukushking self-assigned this Apr 21, 2023
@kukushking
Contributor

Thanks @lampretl, looking into it.

@kukushking
Contributor

Hi @lampretl, I've added a test case to simulate an overflow, which is most likely what happens here, but I wasn't able to reproduce it so far. Does this reproduce consistently for you? What are the actual MIN and MAX values in Redshift?

@lampretl
Author

lampretl commented Apr 21, 2023

If I run that Jupyter cell several times, over different days, I always get the slightly negative result with read_sql_query.

According to DBeaver, the min and max values in my table are 0.00028089172873606330 and 0.98937731753215710000. The table has 2360571 rows.

@kukushking
Contributor

Could you share the output of pip freeze, please?

@lampretl
Author

Of course. If I run ! pip freeze in a Jupyter notebook in SageMaker Studio on AWS, I get:

alembic==1.10.3
appdirs @ file:///home/conda/feedstock_root/build_artifacts/appdirs_1603108395799/work
asn1crypto==1.5.1
asttokens==2.2.1
attrs==22.2.0
awscli==1.27.70
awsio @ https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
awswrangler==3.0.0
backcall==0.2.0
bcrypt==4.0.1
beautifulsoup4==4.12.2
bokeh==2.4.3
boto3==1.26.70
botocore==1.29.70
brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1666764652625/work
cached-property @ file:///home/conda/feedstock_root/build_artifacts/cached_property_1615209429212/work
certifi==2022.12.7
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1671179356964/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1661170624537/work
click==8.1.3
cloudpickle @ file:///home/conda/feedstock_root/build_artifacts/cloudpickle_1674202310934/work
cmaes==0.9.1
colorama==0.4.4
colorlog==6.7.0
conda==22.11.1
conda-content-trust @ file:///home/conda/feedstock_root/build_artifacts/conda-content-trust_1621370699668/work
conda-package-handling @ file:///home/conda/feedstock_root/build_artifacts/conda-package-handling_1669907009957/work
conda_package_streaming @ file:///home/conda/feedstock_root/build_artifacts/conda-package-streaming_1669733752472/work
contextlib2==21.6.0
contourpy==1.0.7
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography-split_1675828607636/work
cycler==0.11.0
Cython @ file:///home/conda/feedstock_root/build_artifacts/cython_1673054071802/work
decorator==5.1.1
dgl==0.9.1.post1
dill==0.3.6
docutils==0.16
executing==1.2.0
fonttools==4.38.0
fsspec==2023.1.0
gevent==22.10.2
google-pasta==0.2.0
greenlet==2.0.2
h5py @ file:///home/conda/feedstock_root/build_artifacts/h5py_1675704810568/work
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1663625384323/work
imageio==2.25.1
importlib-metadata==4.13.0
importlib-resources==5.10.2
inotify-simple==1.2.1
ipykernel==5.5.6
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==8.0.6
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1663332044897/work
jsonschema==4.17.3
jupyter-client==6.1.5
jupyter-core==4.9.2
jupyterlab-widgets==3.0.7
kiwisolver==1.4.4
libmambapy @ file:///home/conda/feedstock_root/build_artifacts/mamba-split_1671598321536/work/libmambapy
lightgbm==3.3.5
llvmlite==0.36.0
lxml==4.9.2
Mako==1.2.4
mamba @ file:///home/conda/feedstock_root/build_artifacts/mamba-split_1671598321536/work/mamba
MarkupSafe==2.1.2
matplotlib==3.5.3
matplotlib-inline==0.1.6
multiprocess==0.70.14
natsort==8.3.1
networkx @ file:///home/conda/feedstock_root/build_artifacts/networkx_1673151334029/work
numba==0.53.1
numpy==1.21.6
opencv-python==4.7.0.68
optuna==3.1.0
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1673482170163/work
pandas==1.3.5
paramiko==3.0.0
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pathos==0.3.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
pkgutil_resolve_name==1.3.10
platformdirs==3.2.0
plotly==5.9.0
pluggy @ file:///home/conda/feedstock_root/build_artifacts/pluggy_1667232663820/work
pooch @ file:///home/conda/feedstock_root/build_artifacts/pooch_1643032624649/work
portalocker==2.7.0
pox==0.3.2
ppft==1.7.6.6
prompt-toolkit==3.0.36
protobuf==3.20.2
protobuf3-to-dict==0.1.5
psutil==5.9.0
psycopg2-binary==2.9.6
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==11.0.0
pyasn1==0.4.8
pycosat @ file:///home/conda/feedstock_root/build_artifacts/pycosat_1666834293299/work
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pyfunctional==1.4.3
Pygments==2.14.0
pyinstrument==3.4.2
pyinstrument-cext==0.2.4
PyNaCl==1.5.0
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1672659226110/work
pyparsing==3.0.9
pyrsistent==0.19.3
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1673864280276/work
PyYAML==5.4.1
pyzmq==19.0.0
redshift-connector==2.0.910
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1673863902341/work
retrying==1.3.4
rsa==4.7.2
ruamel.yaml @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml_1666827402316/work
ruamel.yaml.clib @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml.clib_1670412724006/work
s3fs==0.4.2
s3transfer==0.6.0
sagemaker==2.146.0
sagemaker-experiments==0.1.42
sagemaker-pytorch-training==2.7.0
sagemaker-training==4.4.5
schema==0.7.5
scikit-learn==1.0.2
scipy==1.7.3
scramp==1.4.4
seaborn==0.12.2
seedir==0.4.2
shap @ file:///home/conda/feedstock_root/build_artifacts/shap_1655716944211/work
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
slicer @ file:///home/conda/feedstock_root/build_artifacts/slicer_1608146800664/work
smclarify==0.3
smdebug==1.0.24b20230214
smdebug-rulesconfig==1.0.1
soupsieve==2.4.1
SQLAlchemy==1.4.47
sqlalchemy-redshift==0.8.14
stack-data==0.6.2
tabulate==0.9.0
tenacity==8.2.1
threadpoolctl @ file:///home/conda/feedstock_root/build_artifacts/threadpoolctl_1643647933166/work
toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1657485559105/work
torch @ https://aws-pytorch-unified-cicd-binaries.s3.us-west-2.amazonaws.com/r1.12.1_sm/20230106-033032/626f1a6ec58817e524705e0917dbab97c81b440e/torch-1.12.1%2Bcpu-cp38-cp38-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cpu/torchaudio-0.12.1%2Bcpu-cp38-cp38-linux_x86_64.whl
torchdata @ https://download.pytorch.org/whl/test/torchdata-0.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
torchvision @ https://download.pytorch.org/whl/cpu/torchvision-0.13.1%2Bcpu-cp38-cp38-linux_x86_64.whl
tornado==6.2
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1662214488106/work
traitlets==5.9.0
typing_extensions==4.4.0
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1673452138552/work
wcwidth==0.2.6
Werkzeug==2.2.2
widgetsnbextension==4.0.7
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1675982654259/work
zope.event==4.6
zope.interface==5.5.2
zstandard==0.19.0

@lampretl
Author

@kukushking Have you been able to reproduce the bug?
If not, shall I post here a dataframe containing the table that causes the issue?

@kukushking
Contributor

@lampretl Unfortunately, no, I wasn't able to reproduce this. Yes please, if you could share a test case here, it would be much appreciated!

@lampretl
Author

@kukushking I was not able to reproduce the wrong output by uploading a dataframe, but I have found the source of the problem: internal casting of Redshift's numeric(38,20) to pandas' float64. Have a look at my screenshot:

Screenshot at 2023-05-31 16-54-49
When I run the usual query, read_sql_query fetches the data incorrectly, but when I use an explicit cast with ::numeric(18,17), the obtained result seems correct. You can also see how much the results differ: e.g. in the first row, we get -2% instead of 70%.

So my questions are:

  • would it be possible that you integrate automatic appropriate casting for read_sql_query?
  • are we sure no such incorrect loading occurs with unload?
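As a side note on the float64 casting mentioned above: this minimal local sketch (no Redshift needed) only illustrates the precision mismatch, not the sign flip itself, which the maintainers could not reproduce. It uses the MIN value reported from DBeaver to show that a numeric(38, 20) value cannot survive a round trip through float64 exactly:

```python
import decimal

# Exact MIN reported from DBeaver for the numeric(38, 20) column.
exact_min = decimal.Decimal("0.00028089172873606330")

# float64 cannot represent this value exactly: Decimal(float(...)) exposes
# the nearest binary double, which differs from the stored decimal.
assert decimal.Decimal(float(exact_min)) != exact_min
```

This is why keeping the column as a decimal type end to end, rather than letting it pass through float64, preserves the stored values.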

@kukushking
Contributor

kukushking commented Jun 9, 2023

Hi @lampretl, yes, it looks like that's the issue. I still wasn't able to reproduce it with the same numpy, pandas, and arrow versions (all returned values are decimals wrapped in a pandas object column), but there is something I think you can do:

import pyarrow as pa

df1 = wr.redshift.read_sql_query(query, con=redshift_con, dtype={"c0": "int64", "p": pa.decimal128(38, 37)})

You can force read_sql_query to return decimals with the expected precision/scale. Please give it a try; if that matches your db schema, you should have no issues.

Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants