This repository has been archived by the owner on Nov 14, 2023. It is now read-only.

GridSearchCV, RandomizedSearchCV don't support big dataset train? #232

Open
ICESDHR opened this issue Jan 10, 2022 · 9 comments
@ICESDHR

ICESDHR commented Jan 10, 2022

GridSearchCV and RandomizedSearchCV work well when the dataset is small, but they break down when the dataset is large.

I read the GridSearchCV implementation. In the _fit() function, you put the dataset into the Ray object store, but ray.put() uses gRPC to transfer the data, and gRPC protobuf doesn't support messages larger than 2 GB. Is that right?

X_id = ray.put(X)
y_id = ray.put(y)

I ask this question because I want to know whether you are aware of this situation and whether you have a plan to optimize it.

Thanks for your reply!

@Yard1
Member

Yard1 commented Jan 10, 2022

ray.put should support objects bigger than 2 GB. Are you sure you are not simply running out of memory? What sort of errors are you getting?

@ICESDHR
Author

ICESDHR commented Jan 11, 2022

My Ray cluster worker nodes have 20 GB of memory. I asked this question in the Ray project too, and they advised me not to use ray.put() to transfer big data :(

 ERROR - Exception serializing message!
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/grpc/_common.py", line 86, in _transform
    return transformer(message)
ValueError: Message ray.rpc.DataRequest exceeds maximum protobuf size of 2GB: 5979803541
ERROR dataclient.py:150 -- Unrecoverable error in data channel.
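The 5,979,803,541 bytes in the traceback is the serialized payload, and the 2 GB figure is protobuf's hard per-message cap (2**31 - 1 bytes), which the Ray Client data channel hits when it sends one ray.put() payload as a single DataRequest. A minimal stdlib sketch of the arithmetic (not part of tune-sklearn; `fits_in_one_grpc_message` is a hypothetical pre-check, and pickle size only approximates the actual wire size):

```python
import pickle

# gRPC/protobuf caps a single message at 2 GiB - 1 byte; the Ray Client
# data channel sends a ray.put() payload as one DataRequest, so anything
# serialized above this limit fails as in the traceback above.
GRPC_MAX_MESSAGE_BYTES = 2**31 - 1

def fits_in_one_grpc_message(obj) -> bool:
    """Rough pre-check: serialize the object and compare to the cap."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(payload) <= GRPC_MAX_MESSAGE_BYTES

# The failing request in the log was 5_979_803_541 bytes -- well over the cap.
assert 5_979_803_541 > GRPC_MAX_MESSAGE_BYTES
```

Note this limit applies to the Ray Client path specifically; a driver running directly on the cluster puts objects into the local object store without going through this gRPC channel.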

@richardliaw
Collaborator

Are you using Kubernetes / ray client @ICESDHR ?

@ICESDHR
Author

ICESDHR commented Feb 24, 2022

Yeah, I use the Kubernetes Ray operator here, following these steps:

  1. deploy the RayCluster CRD,
  2. deploy the operator,
  3. create a RayCluster containing one head and three workers,
  4. create a Kubernetes job; in this job I connect to the RayCluster successfully, but if I use ray.put() on a dataset larger than 2 GB I always get this error. With a smaller dataset, it works well.

So I think there is a bug here, don't you? If so, I wonder if we could discuss a solution? If not, please show me how to use this function, thx : )

@Yard1
Member

Yard1 commented Feb 24, 2022

Ok, I can see this being a problem. Would it be possible for you to load the data from S3/NFS/disk on the nodes? If yes, we could add support for that. How does that sound?
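The idea above can be sketched as follows (a minimal stdlib illustration, not the tune-sklearn API; `save_dataset`/`load_dataset` are hypothetical helpers standing in for whatever S3/NFS/disk loading support would look like): the client writes the dataset once to a path visible to every node, and each trial loads it locally instead of receiving it through ray.put().

```python
import os
import pickle
import tempfile

def save_dataset(X, y, directory: str) -> str:
    """Write (X, y) once to shared storage; return the path workers will read."""
    path = os.path.join(directory, "dataset.pkl")
    with open(path, "wb") as f:
        pickle.dump((X, y), f, protocol=pickle.HIGHEST_PROTOCOL)
    return path

def load_dataset(path: str):
    """Called inside each trial, on the node where it runs."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip demo with a toy dataset; in practice `directory` would be an
# NFS mount or a staging area synced to S3.
with tempfile.TemporaryDirectory() as d:
    p = save_dataset([[0.0, 1.0], [1.0, 0.0]], [0, 1], d)
    X, y = load_dataset(p)
```

Because only a path crosses the client channel, the 2 GB message cap never comes into play.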

@ICESDHR
Author

ICESDHR commented Feb 25, 2022

Thanks for your reply! It would be nice if official support could be provided. How soon would this patch be released?

@ICESDHR
Author

ICESDHR commented Feb 25, 2022

After loading the big dataset, I face two more problems:

  1. with this method, if one node runs many trials, it loads multiple copies of the data into memory, which wastes memory;
  2. in my practice, with sufficient resources, after loading the big dataset from disk, the training process triggers the problems shown in the figure. I changed DEFAULT_GET_TIMEOUT in ray/tune/ray_trial_executor.py from 60 to 300, and it works.

Is there a better solution?

[two screenshots of the timeout error attached]

@Yard1
Member

Yard1 commented Feb 25, 2022

It's not possible right now, but with the proposed changes you should be able to use Ray Datasets, which should solve both the 2 GB issue and keep the amount of copying to a minimum. Will keep you updated.

@ICESDHR
Author

ICESDHR commented Feb 28, 2022

I'm trying to use the Ray Data functionality. Operating as above, I found that ray.data.from_pandas() still has this problem, but ray.data.read_csv() works :( Usually the data is processed with pandas first and then trained with Ray, so it would be great if this could be optimized.
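This matches the pattern above: from_pandas ships the in-memory object through the size-capped client channel, while read_csv lets the cluster pull the file itself. A stdlib stand-in for the workaround (csv module instead of pandas/Ray Data; the names here are illustrative only):

```python
import csv
import os
import tempfile

# "pandas-processed" data, standing in for a preprocessed DataFrame.
rows = [{"feature": 0.5, "label": 1}, {"feature": 0.9, "label": 0}]

# Write the processed data to CSV instead of handing the in-memory object
# to the client channel; with Ray this file would then be consumed via
# ray.data.read_csv() on the cluster side.
path = os.path.join(tempfile.mkdtemp(), "train.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["feature", "label"])
    writer.writeheader()
    writer.writerows(rows)

# A worker-side reader would open `path`; here we just verify the round trip.
with open(path, newline="") as f:
    loaded = list(csv.DictReader(f))
```

The extra write-then-read hop is the cost of the workaround until from_pandas-style ingestion avoids the client channel.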

@Yard1 Yard1 self-assigned this Mar 1, 2022