
[QST]how can i change int64 to float64 #1768

Closed
gukejun1 opened this issue Feb 21, 2023 · 7 comments


gukejun1 commented Feb 21, 2023

https://nvidia-merlin.github.io/Merlin/main/examples/scaling-criteo/01-Download-Convert.html#conversion-script-for-criteo-dataset-csv-to-parquet

Following the steps given on the official website, I ran the example, but it ends with an error.

File "/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/csv.py", line 285, in coerce_dtypes
   raise ValueError(msg)
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+--------+---------+----------+
| Column | Found   | Expected |
+--------+---------+----------+
| I12    | float64 | int64    |
| I2     | float64 | int64    |
| I7     | float64 | int64    |
+--------+---------+----------+

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'I12': 'float64',
      'I2': 'float64',
      'I7': 'float64'}

to the call to `read_csv`/`read_table`.

Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.

Then I modified some of the code as follows:

dtypes = {}
dtypes["label"] = np.int32
for x in cont_names:
    dtypes[x] = np.int32
for x in ["I12", "I2","I7"]:
    dtypes[x] = "float64"
for x in cat_names:
    dtypes[x] = "hex"

But it still fails with the same error, so I added assume_missing=True to the nvt.Dataset call:

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    assume_missing=True,  
    client=client,
)

A new error occurs:

File "/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py", line 1213, in astype_float_to_int_nansafe
   raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

The data is the first decompressed data file (/raid/data/criteo/crit_orig/day_0). How do I handle this format exception in the initial data?
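
The IntCastingNaNError is consistent with the Criteo integer feature columns containing missing values: pandas cannot cast NaN to int32, so forcing np.int32 on those columns fails. A quick diagnostic sketch, separate from the notebook, to confirm this on a sample of day_0 (the path is the one used in this issue; the sample size is arbitrary):

# Diagnostic sketch (not part of the notebook): count missing values in the
# integer feature columns of a small sample of day_0.
import pandas as pd

cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
cols = ["label"] + cont_names + cat_names

sample = pd.read_csv(
    "/raid/data/criteo/crit_orig/day_0",  # path used in this issue
    sep="\t",
    names=cols,
    nrows=1_000_000,  # a sample is enough; day_0 is very large
)
print(sample[cont_names].isna().sum())  # non-zero counts mean plain int32 cannot hold these columns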

gukejun1 added the question (Further information is requested) label on Feb 21, 2023
gukejun1 commented Feb 21, 2023

# Imports (not shown in the original snippet; these module paths are an
# assumption and may differ between Merlin versions)
import glob
import os

import numpy as np
import nvtabular as nvt
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from merlin.core.utils import device_mem_size, get_rmm_size

input_path = "/raid/data/criteo/crit_orig"
BASE_DIR = "/raid/data/criteo"
INPUT_PATH = os.environ.get("INPUT_DATA_DIR", input_path)
OUTPUT_PATH = os.environ.get("OUTPUT_DATA_DIR", os.path.join(BASE_DIR, "converted"))
# CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5")
CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "3")

cluster = None  # Connect to existing cluster if desired
if cluster is None:
    cluster = LocalCUDACluster(
        CUDA_VISIBLE_DEVICES=CUDA_VISIBLE_DEVICES,
        rmm_pool_size=get_rmm_size(0.8 * device_mem_size()),
        local_directory=os.path.join(OUTPUT_PATH, "dask-space"),
    )

client = Client(cluster)

# Specify column names
cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
cols = ["label"] + cont_names + cat_names

# Specify column dtypes. Note that "hex" means that
# the values will be hexadecimal strings that should
# be converted to int32
dtypes = {}
dtypes["label"] = np.int32
for x in cont_names:
    dtypes[x] = np.int32
for x in cat_names:
    dtypes[x] = "hex"

file_list = glob.glob(os.path.join(INPUT_PATH, "day_0"))

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
)

dataset.to_parquet(
    os.path.join(OUTPUT_PATH, "criteo"),
    preserve_files=True,
)

This is my original full code, from the notebook (https://nvidia-merlin.github.io/Merlin/main/examples/scaling-criteo/01-Download-Convert.html#conversion-script-for-criteo-dataset-csv-to-parquet).
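
The dtype mismatch from the first error can also be reproduced with dask alone, outside NVTabular. A minimal sketch, reusing INPUT_PATH and cols from the script above: with assume_missing=True, dask reads every column it would otherwise infer as integer as float64, which is why applying np.int32 afterwards fails on the columns that contain NaNs.

# Sketch: inspect dask's dtype inference for day_0 (reuses INPUT_PATH and cols from above)
import dask.dataframe as dd

ddf = dd.read_csv(
    os.path.join(INPUT_PATH, "day_0"),
    sep="\t",
    names=cols,
    assume_missing=True,  # unspecified integer columns are read as float64
)
print(ddf.dtypes)  # label and I1..I13 come back as float64, C1..C26 as object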

gukejun1 changed the title from "[QST]" to "[QST]how can i change int64 to float64" on Feb 21, 2023
gukejun1 commented:

@rnyak could you tell me the solution? I'm using the Docker image (nvcr.io/nvidia/merlin/merlin-tensorflow:22.12) from https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow


rnyak commented Feb 24, 2023

@gukejun1 hello. I'm not sure I understood: did you fix it or not? You wrote that you fixed it.

Are you running this notebook? Which line is giving you the error?

Are you running this code on GPU, and with multiple GPUs?


gukejun1 commented Feb 24, 2023

@rnyak No, I can't fix it using \Merlin\examples\scaling-criteo\01_download_convert.ipynb. I run this code on GPU. The data is from the Criteo dataset. The error comes from this code:

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
)


rnyak commented Feb 24, 2023

@gukejun1 thanks. How many GPUs are you using to run this notebook? I am still confused: if you could not make this notebook run properly, how did you generate the parquet files you mention in #1770?

gukejun1 commented:

@rnyak Later, I changed the notebook code to:

for x in cont_names:
    dtypes[x] = np.zeros(0)  # changed here
    # dtypes[x] = np.int32
for x in cat_names:
    dtypes[x] = "hex"
# .......
dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
    assume_missing=True,  # added here
)

I run it on 6 GPUs. My modification eventually worked, but I wonder why it fails when run as written in the notebook. I'm not sure whether my approach deviates from the idea in the notebook, and is there a better solution?
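
A more explicit way to express the same change, offered only as a sketch (not the notebook's official code, and not verified here), is to declare the continuous columns with a float dtype instead of passing an empty array, and keep assume_missing=True so dask reads them as floats up front. It reuses the variables defined earlier in this thread. Whether keeping these columns as floats affects downstream training is the question raised in the last comment below.

# Sketch of a more explicit variant of the change above (assumption: nvt.Dataset
# accepts plain numpy float dtypes here, as it does np.int32 in the notebook)
dtypes = {}
dtypes["label"] = np.int32
for x in cont_names:
    dtypes[x] = np.float64  # the I columns contain NaNs, so use a float dtype
for x in cat_names:
    dtypes[x] = "hex"

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
    assume_missing=True,  # let dask read the integer columns as floats
)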

rnyak added the P2 label on Feb 28, 2023
rnyak closed this as completed on Jun 8, 2023
hsezhiyan commented:

I also encountered this exact same problem and resolved it using the solution from @gukejun1. @gukejun1, did you find out whether the modifications you made affect the correctness of the training?
