
[QST]how can i change int64 to float64 #1768

Closed
gukejun1 opened this issue Feb 21, 2023 · 7 comments


gukejun1 commented Feb 21, 2023

https://nvidia-merlin.github.io/Merlin/main/examples/scaling-criteo/01-Download-Convert.html#conversion-script-for-criteo-dataset-csv-to-parquet

Following the steps given on the official website, I ran the example, but it ends with an error.

File "/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/csv.py", line 285, in coerce_dtypes
   raise ValueError(msg)
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+--------+---------+----------+
| Column | Found   | Expected |
+--------+---------+----------+
| I12    | float64 | int64    |
| I2     | float64 | int64    |
| I7     | float64 | int64    |
+--------+---------+----------+

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'I12': 'float64',
      'I2': 'float64',
      'I7': 'float64'}

to the call to `read_csv`/`read_table`.

Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.

Then I modified some of the code as follows:

dtypes = {}
dtypes["label"] = np.int32
for x in cont_names:
    dtypes[x] = np.int32
for x in ["I12", "I2","I7"]:
    dtypes[x] = "float64"
for x in cat_names:
    dtypes[x] = "hex"

But it still fails with the same error, so I added assume_missing=True to the nvt.Dataset call:

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    assume_missing=True,  
    client=client,
)

A new error occurs:

File "/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py", line 1213, in astype_float_to_int_nansafe
   raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

The data is the first decompressed data file (/raid/data/criteo/crit_orig/day_0). How do I handle this format exception in the initial data?
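
The IntCastingNaNError is consistent with the Criteo integer feature columns containing missing values: pandas cannot cast NaN to int32, so forcing np.int32 on those columns fails. A quick diagnostic sketch, separate from the notebook, to confirm this on a sample of day_0 (the path is the one used in this issue; the sample size is arbitrary):

# Diagnostic sketch (not part of the notebook): count missing values in the
# integer feature columns of a small sample of day_0.
import pandas as pd

cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
cols = ["label"] + cont_names + cat_names

sample = pd.read_csv(
    "/raid/data/criteo/crit_orig/day_0",  # path used in this issue
    sep="\t",
    names=cols,
    nrows=1_000_000,  # a sample is enough; day_0 is very large
)
print(sample[cont_names].isna().sum())  # non-zero counts mean plain int32 cannot hold these columns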

gukejun1 added the question (Further information is requested) label on Feb 21, 2023
gukejun1 commented Feb 21, 2023

# Imports (not shown in the original snippet; these module paths are an
# assumption and may differ between Merlin versions)
import glob
import os

import numpy as np
import nvtabular as nvt
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from merlin.core.utils import device_mem_size, get_rmm_size

input_path = "/raid/data/criteo/crit_orig"
BASE_DIR = "/raid/data/criteo"
INPUT_PATH = os.environ.get("INPUT_DATA_DIR", input_path)
OUTPUT_PATH = os.environ.get("OUTPUT_DATA_DIR", os.path.join(BASE_DIR, "converted"))
# CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5")
CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "3")

cluster = None  # Connect to existing cluster if desired
if cluster is None:
    cluster = LocalCUDACluster(
        CUDA_VISIBLE_DEVICES=CUDA_VISIBLE_DEVICES,
        rmm_pool_size=get_rmm_size(0.8 * device_mem_size()),
        local_directory=os.path.join(OUTPUT_PATH, "dask-space"),
    )

client = Client(cluster)

# Specify column names
cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
cols = ["label"] + cont_names + cat_names

# Specify column dtypes. Note that "hex" means that
# the values will be hexadecimal strings that should
# be converted to int32
dtypes = {}
dtypes["label"] = np.int32
for x in cont_names:
    dtypes[x] = np.int32
for x in cat_names:
    dtypes[x] = "hex"

file_list = glob.glob(os.path.join(INPUT_PATH, "day_0"))

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
)

dataset.to_parquet(
    os.path.join(OUTPUT_PATH, "criteo"),
    preserve_files=True,
)

This is my original full code, from the notebook (https://nvidia-merlin.github.io/Merlin/main/examples/scaling-criteo/01-Download-Convert.html#conversion-script-for-criteo-dataset-csv-to-parquet).
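
The dtype mismatch from the first error can also be reproduced with dask alone, outside NVTabular. A minimal sketch, reusing INPUT_PATH and cols from the script above: with assume_missing=True, dask reads every column it would otherwise infer as integer as float64, which is why applying np.int32 afterwards fails on the columns that contain NaNs.

# Sketch: inspect dask's dtype inference for day_0 (reuses INPUT_PATH and cols from above)
import dask.dataframe as dd

ddf = dd.read_csv(
    os.path.join(INPUT_PATH, "day_0"),
    sep="\t",
    names=cols,
    assume_missing=True,  # unspecified integer columns are read as float64
)
print(ddf.dtypes)  # label and I1..I13 come back as float64, C1..C26 as object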

gukejun1 changed the title from "[QST]" to "[QST]how can i change int64 to float64" on Feb 21, 2023
gukejun1 commented:

@rnyak could you tell me the solution? I'm using the Docker image (nvcr.io/nvidia/merlin/merlin-tensorflow:22.12) from https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow


rnyak commented Feb 24, 2023

@gukejun1 hello. I'm not sure I understood: did you fix it or not? You wrote that you fixed it.

Are you running this notebook? Which line is giving you the error?

Are you running this code on GPU, and with multiple GPUs?


gukejun1 commented Feb 24, 2023

@rnyak No, I can't fix it using \Merlin\examples\scaling-criteo\01_download_convert.ipynb. I run this code on GPU. The data is from the Criteo dataset. The error comes from this code:

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
)


rnyak commented Feb 24, 2023

@gukejun1 thanks. How many GPUs are you using to run this notebook? I am still confused: if you could not make this notebook run properly, how did you generate the parquet files you mention in #1770?

gukejun1 commented:

@rnyak Later, I changed the notebook code to:

for x in cont_names:
    dtypes[x] = np.zeros(0)  # changed here
    # dtypes[x] = np.int32
for x in cat_names:
    dtypes[x] = "hex"
# .......
dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
    assume_missing=True,  # added here
)

I run it on 6 GPUs. My modification eventually worked, but I wonder why it fails when run as written in the notebook. I'm not sure whether my approach deviates from the idea in the notebook, and is there a better solution?
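
A more explicit way to express the same change, offered only as a sketch (not the notebook's official code, and not verified here), is to declare the continuous columns with a float dtype instead of passing an empty array, and keep assume_missing=True so dask reads them as floats up front. It reuses the variables defined earlier in this thread. Whether keeping these columns as floats affects downstream training is the question raised in the last comment below.

# Sketch of a more explicit variant of the change above (assumption: nvt.Dataset
# accepts plain numpy float dtypes here, as it does np.int32 in the notebook)
dtypes = {}
dtypes["label"] = np.int32
for x in cont_names:
    dtypes[x] = np.float64  # the I columns contain NaNs, so use a float dtype
for x in cat_names:
    dtypes[x] = "hex"

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
    assume_missing=True,  # let dask read the integer columns as floats
)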

rnyak added the P2 label on Feb 28, 2023
rnyak closed this as completed on Jun 8, 2023
hsezhiyan commented:

I also encountered this exact same problem and resolved it using the solution from @gukejun1. @gukejun1, did you find out whether the modifications you made affect the correctness of the training?
