[BUG] Error with prepare_aliccp() #1226

Open
ZhanqiuHu opened this issue Nov 9, 2023 · 2 comments
Labels
bug Something isn't working status/needs-triage

Comments

@ZhanqiuHu

I ran into this error when running prepare_aliccp() on the downloaded Ali-CCP dataset.

Traceback (most recent call last):
  File "/share/suh-scrap/zh338/aliccp/preprocess.py", line 13, in <module>
    prepare_aliccp(DATA_DIR, convert_train=False, convert_test=True)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 164, in prepare_aliccp
    _convert_data(
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 449, in _convert_data
    merlin.io.Dataset(tmp_files, dtypes=dtypes).to_parquet(out_dir)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 380, in __init__
    self.infer_schema()
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 1240, in infer_schema
    dtypes = self.sample_dtypes(n=n, annotate_lists=True)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 1264, in sample_dtypes
    _real_meta = _set_dtypes(_real_meta, self.dtypes)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 1301, in _set_dtypes
    chunk[col] = chunk[col].astype(dtype)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 448, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 230, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 170, in astype_nansafe
    return arr.astype(dtype, copy=True)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
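
The last frames of the traceback suggest that a column containing None is being cast to a plain NumPy integer dtype. A minimal sketch of that failure, assuming an object-dtype column (the values here are made up and are not taken from Ali-CCP):

```python
import pandas as pd

# Hypothetical stand-in for one feature column: integers stored as Python
# objects, with a missing value stored as None.
col = pd.Series([1, None, 3], dtype=object)

try:
    col.astype("int32")  # plain int32 has no way to represent None
    error = None
except (TypeError, ValueError) as exc:
    # The exact exception type and message may differ between pandas versions.
    error = exc

print(type(error).__name__, error)
```

If this sketch matches what happens inside `_set_dtypes`, the problem lies in the data and the requested dtype rather than in how `prepare_aliccp()` is called.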

I saw another issue (#507) describing a similar problem, but it didn't mention a solution or workaround. Is there a workaround to avoid this error?

Thanks!

ZhanqiuHu added the bug and status/needs-triage labels Nov 9, 2023

@ibraheemalayan

Same issue here. Any updates?

@ibraheemalayan

The dataset contains None values, which you can see by displaying the head of the dataset:

[Screenshot: head of the dataset, showing None values]
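
Since the screenshot is not reproduced here, the same check can be sketched in code. This assumes the data has been loaded into a pandas DataFrame; the column names and values below are placeholders, not the real Ali-CCP features:

```python
import pandas as pd

# Hypothetical sample frame with missing entries stored as None.
df = pd.DataFrame(
    {"user_id": [10, None, 12], "item_id": [7, 8, None]},
    dtype=object,
)

# Show the first rows, as in the screenshot.
print(df.head())

# Count missing values per column; isna() treats None and NaN as missing.
null_counts = df.isna().sum()
print(null_counts[null_counts > 0])
```

Any column with a non-zero count here would fail a cast to plain int32.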

I solved it by changing

dtypes = {f.name: "int32" for f in _Features().features}

to

dtypes = {f.name: "Int32" for f in _Features().features}

(Capitalized Int32 is the pandas nullable integer dtype, which can hold missing values.)
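
A minimal sketch of why the capitalized dtype helps, again using a hypothetical frame with placeholder column names rather than the real `_Features()` list:

```python
import pandas as pd

# Hypothetical sample frame with missing entries stored as None.
df = pd.DataFrame(
    {"user_id": [10, None, 12], "item_id": [7, 8, None]},
    dtype=object,
)

# Same shape as the dtypes mapping quoted above, built from the
# placeholder columns instead of _Features().features.
dtypes = {name: "Int32" for name in df.columns}

# The nullable Int32 dtype stores missing entries as <NA> instead of failing.
converted = df.astype(dtypes)
print(converted.dtypes)
print(converted)
```

Missing entries survive the cast as `<NA>`, so downstream code should be prepared to handle nulls in these columns.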

With the new dtypes:

[Screenshot: dataset preview after applying the new dtypes]

ibraheemalayan added a commit to ibraheemalayan/models that referenced this issue May 4, 2024
Since the dataset contains many null values, Int32 (nullable integer) should be used instead