Index error with Categorify on transform step for columns with 100% NaNs #1865

Open
lecardozo opened this issue Oct 2, 2023 · 0 comments · May be fixed by #1869

@lecardozo

I was running a workflow.transform(sampled_dataset) step on a sample of my inference dataset and received the following error:

Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 510, in transform
    encoded = _encode(
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 1707, in _encode
    if isinstance(df[cl].dropna().iloc[0], (np.ndarray, list)):
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1625, in _getitem_axis
    self._validate_integer(key, axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1557, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/merlin/dag/executors.py", line 237, in _run_node_transform
    transformed_data = node.op.transform(selection, input_data)
  File "/databricks/python/lib/python3.8/site-packages/merlin/core/dispatch.py", line 69, in inner2
    return func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 534, in transform
    raise RuntimeError(f"Failed to categorical encode column {name}") from e
RuntimeError: Failed to categorical encode column my_categorical_column

I noticed this happens when the dataset to be transformed has a categorical column (my_categorical_column) that is 100% NaNs. It looks like it fails at the line below 👇, where a dropna() is followed by iloc[0]: with every value NaN, dropna() returns an empty Series, so iloc[0] has nothing to index.

if isinstance(df[cl].dropna().iloc[0], (np.ndarray, list)):
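
The same failure can be reproduced with pandas alone, outside of NVTabular. This is a minimal sketch, assuming a toy DataFrame in place of my real dataset (the three-row frame and the variable names below are made up for illustration; only the dropna() followed by iloc[0] pattern comes from the traceback):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: a column that is entirely NaN.
df = pd.DataFrame({"my_categorical_column": [np.nan, np.nan, np.nan]})

# Dropping NaNs from an all-NaN column leaves an empty Series.
non_null = df["my_categorical_column"].dropna()
print(len(non_null))

# Positional access on the empty Series raises IndexError,
# which is the first exception in the traceback above.
try:
    non_null.iloc[0]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

Nothing here depends on the Categorify configuration, so any all-NaN categorical column in the transformed dataset should hit the same path.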

It's not a huge blocker for me right now, as this mostly happens on dataset samples, but I'm wondering whether that behavior is expected. Any thoughts? 😃
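
If the behavior is not intended, one possible guard is to check for an empty result before indexing. The sketch below is only an assumption about how that could look: first_non_null_is_list is a hypothetical helper, not an existing NVTabular function, it is not necessarily what a fix would do, and it only covers the pandas path shown in the traceback.

```python
import numpy as np
import pandas as pd


def first_non_null_is_list(series: pd.Series) -> bool:
    """Hypothetical helper: report whether the first non-null value is list-like.

    Returns False for an all-NaN (or empty) column instead of raising IndexError.
    """
    non_null = series.dropna()
    if non_null.empty:
        # Nothing to inspect, so treat the column as a plain (non-list) column.
        return False
    return isinstance(non_null.iloc[0], (np.ndarray, list))


print(first_non_null_is_list(pd.Series([np.nan, np.nan])))  # all-NaN column
print(first_non_null_is_list(pd.Series([None, [1, 2]])))    # list column
print(first_non_null_is_list(pd.Series(["a", None])))       # scalar column
```

Treating an all-NaN column as a non-list column is a design choice in this sketch; whether that is the right default for Categorify is exactly the open question.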

@lecardozo lecardozo linked a pull request Oct 18, 2023 that will close this issue