StatisticsGen treats zeros as missing data after FileBasedExampleGen with parquet_executor #6407

Open
tgrunzweig-cpacket opened this issue Oct 30, 2023 · 1 comment

Comments


tgrunzweig-cpacket commented Oct 30, 2023

If the bug is related to a specific library below, please raise an issue in the
respective repo directly:

TensorFlow Data Validation Repo

TensorFlow Model Analysis Repo

TensorFlow Transform Repo

TensorFlow Serving Repo

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows),
    Interactive Notebook, Google Cloud, etc): Linux, AWS EC2 instance, Jupyter notebook
  • TensorFlow version: 2.13.1
  • TFX Version: 1.14.0
  • Python version: 3.9.18
  • Python dependencies (from pip freeze output):

Describe the current behavior

Describe the expected behavior

Standalone code to reproduce the issue
```python
import os
import string

import numpy as np
import pandas as pd
from tfx import v1 as tfx
from tfx.components import FileBasedExampleGen
from tfx.components.example_gen.custom_executors import parquet_executor
from tfx.dsl.components.base import executor_spec
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

_pipeline_root = './pipeline/'
_data_root = './gen_data/'

# 100 rows x 5 integer columns (A-E) with values in {0, 1, 2}:
# plenty of zeros and no missing values.
arr_random = np.random.randint(low=0, high=3, size=(100, 5))
columns = list(string.ascii_uppercase[0:5])
df = pd.DataFrame(arr_random, columns=columns)
os.makedirs(_data_root, exist_ok=True)  # to_parquet does not create the directory
df.to_parquet('./gen_data/lots_of_zeros.parquet', index=False)

context = InteractiveContext(pipeline_root=_pipeline_root)

# Ingest the Parquet file through FileBasedExampleGen with the Parquet executor.
custom_executor_spec = executor_spec.BeamExecutorSpec(parquet_executor.Executor)
example_gen = FileBasedExampleGen(
    input_base=_data_root,
    custom_executor_spec=custom_executor_spec)
context.run(example_gen)

# Compute and display statistics over the ingested examples.
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])
context.run(statistics_gen)
context.show(statistics_gen.outputs['statistics'])
```
Visually inspecting this result, I find the following errors for the numeric features:

  1. The fraction of missing values is greater than zero (which is wrong; there are no missing values).
  2. The fraction of zeros is 0% (which is wrong; there are several zeros).
  3. The mean value is incorrect.
  4. The standard deviation is incorrect.
  5. The min value is 1 rather than 0, as it should be.
  6. The median value is wrong (it does not account for the zeros in the data).
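
All six symptoms are what one would expect if every zero were silently turned into a missing value before the statistics were computed. The following is a minimal sketch in plain Python, not TFX code; the `summarize` helper and the deterministic stand-in column are assumptions made purely for illustration.

```python
import statistics


def summarize(values):
    """Summary statistics for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "missing_frac": (len(values) - len(present)) / len(values),
        "zeros_frac": sum(v == 0 for v in present) / len(values),
        "mean": statistics.mean(present),
        "std": statistics.pstdev(present),
        "min": min(present),
        "median": statistics.median(present),
    }


# Deterministic stand-in for one column of the repro: 100 values in {0, 1, 2}.
column = [i % 3 for i in range(100)]

# Assumed failure mode: each zero is replaced by a missing value.
zeros_dropped = [v if v != 0 else None for v in column]

expected = summarize(column)
observed = summarize(zeros_dropped)

for name in expected:
    print(f"{name:>12}: expected={expected[name]:.3f}  observed={observed[name]:.3f}")
```

In this sketch the "observed" column shows a non-zero missing fraction, 0% zeros, a minimum of 1, and a shifted mean, standard deviation, and median, which matches the pattern in the list above.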

What I think is happening is that FileBasedExampleGen creates a sparse representation of the Parquet input file, and StatisticsGen interprets it as if there were no zeros in the input file.
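
One way to check this hypothesis is to look at the `tf.Example` records that ExampleGen wrote, before StatisticsGen reads them. The sketch below assumes the usual TFX output layout (GZIP-compressed TFRecord files under a `Split-train` directory of the examples artifact); `read_examples` and `missing_features` are hypothetical helper names, not part of TFX.

```python
import glob
import os


def read_examples(split_dir, limit=5):
    """Yield up to `limit` records as {feature name: list of values}."""
    import tensorflow as tf  # imported lazily; only this function needs TF

    files = glob.glob(os.path.join(split_dir, '*'))
    dataset = tf.data.TFRecordDataset(files, compression_type='GZIP')
    for raw in dataset.take(limit):
        example = tf.train.Example.FromString(raw.numpy())
        parsed = {}
        for name, feature in example.features.feature.items():
            # 'kind' is int64_list, float_list, bytes_list, or None if unset.
            kind = feature.WhichOneof('kind')
            parsed[name] = list(getattr(feature, kind).value) if kind else []
        yield parsed


def missing_features(parsed, expected_columns):
    """Columns that are absent from one record or carry no value at all."""
    return sorted(c for c in expected_columns if len(parsed.get(c, [])) == 0)


# Usage in the notebook from the report (needs the `example_gen` run above):
# split_dir = os.path.join(
#     example_gen.outputs['examples'].get()[0].uri, 'Split-train')
# for parsed in read_examples(split_dir):
#     print(parsed, missing_features(parsed, list('ABCDE')))
```

If columns show up as missing in records whose source rows held a zero, the values were lost during ingestion rather than in the statistics computation.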

This is in contrast to CsvExampleGen, which, for the same input (but saved as CSV), reports no missing values, shows the correct number of zeros, and shows the correct statistics.

Providing a bare minimum test case or step(s) to reproduce the problem will
greatly help us to debug the issue. If possible, please share a link to
Colab/Jupyter/any notebook.

Name of your Organization (Optional)
cpacket
Other info / logs

Include any logs or source code that would be helpful to diagnose the problem.
If including tracebacks, please include the full traceback. Large logs and files
should be attached.
Screenshot 2023-10-30 at 1 51 03 PM

@singhniraj08
Contributor

@tgrunzweig-cpacket, thank you for reporting this bug. The problem lies in the Parquet executor used by ExampleGen rather than in StatisticsGen. The Parquet executor does not pick up the zeros, so the dataset ends up with null values in place of zeros, which is why StatisticsGen reports missing values instead of zero values.
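
For readers unfamiliar with this class of bug, here is a hypothetical illustration of one common way zeros get lost when rows are converted to features in Python: a truthiness test that cannot tell `0` from `None`. This is not the TFX source, and the thread does not identify the actual cause inside the Parquet executor; both function names are invented for the sketch.

```python
def to_feature_truthy(value):
    """Hypothetical converter: `if value` is False for 0 as well as None."""
    return [value] if value else []


def to_feature_explicit(value):
    """Only None is treated as missing; 0 is kept as a real value."""
    return [value] if value is not None else []


row = {"A": 0, "B": 2, "C": None}

lossy = {name: to_feature_truthy(v) for name, v in row.items()}
kept = {name: to_feature_explicit(v) for name, v in row.items()}

print(lossy)  # {'A': [], 'B': [2], 'C': []}
print(kept)   # {'A': [0], 'B': [2], 'C': []}
```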

Let us debug more on this issue and we will update this thread. Thanks again!
