
flwr_datasets with custom/local dataset #3201

Open
gubertoli opened this issue Apr 3, 2024 · 1 comment
Labels
bug Something isn't working

Comments

gubertoli (Contributor) commented Apr 3, 2024

Describe the bug

FDS raises a warning for custom/local datasets because of the following check:

def _check_if_dataset_tested(dataset: str) -> None:
    """Check if the dataset is in the narrowed down list of the tested datasets."""
    if dataset not in tested_datasets:
        warnings.warn(
            f"The currently tested dataset are {tested_datasets}. Given: {dataset}.",
            stacklevel=1,
        )
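
A guard along these lines (just a sketch of a possible fix, not the current implementation) would skip the warning when a dataset object rather than a Hub name is passed:

# Sketch only: warn about untested datasets only when a Hub name (str) is given.
# Assumes the existing module-level `tested_datasets` list and `warnings` import.
def _check_if_dataset_tested(dataset) -> None:
    """Warn only for Hub dataset names outside the tested list."""
    if not isinstance(dataset, str):
        # An in-memory datasets.Dataset/DatasetDict was passed directly;
        # there is no Hub name to check against the tested list.
        return
    if dataset not in tested_datasets:
        warnings.warn(
            f"The currently tested datasets are {tested_datasets}. Given: {dataset}.",
            stacklevel=1,
        )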

Steps/Code to Reproduce

I am using the following code (a custom dataset loaded with pandas):

from datasets import Dataset
from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import IidPartitioner

# full_data is a pandas DataFrame loaded beforehand
dataset = Dataset.from_pandas(full_data, preserve_index=False, split="train")
partitioner = IidPartitioner(num_partitions=30)
fds = FederatedDataset(
    dataset=dataset,
    partitioners={"train": partitioner},
)

Expected Results

No warning for custom datasets.

Actual Results

In this case the dataset is a custom one, but I am receiving the following warning (feature names redacted):

utils.py:85: UserWarning: The currently tested dataset are ['mnist', 'cifar10', 'fashion_mnist', 'sasha/dog-food', 'zh-plus/tiny-imagenet']. Given: Dataset({
    features: ['xxx', 'xxx', 'xxx', 'label'],
    num_rows: 103904
}).
  warnings.warn(
gubertoli added the bug label Apr 3, 2024
gubertoli (Contributor, Author) commented Apr 3, 2024

It seems that the FederatedDataset class only downloads the dataset from the HF Hub. It should probably be changed so that a custom dataset already in HF's Dataset format can be referenced directly instead of relying on a download.

dataset : str
    The name of the dataset in the Hugging Face Hub.

self._dataset = datasets.load_dataset(
    path=self._dataset_name, name=self._subset
)


Evidenced by:

fds.load_partition(1, split="train")

And the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[137], line 1
----> 1 fds.load_partition(1, split="train")

File ~...\Lib\site-packages\flwr_datasets\federated_dataset.py:131, in FederatedDataset.load_partition(self, partition_id, split)
    108 """Load the partition specified by the idx in the selected split.
    109 
    110 The dataset is downloaded only when the first call to `load_partition` or
   (...)
    128     Single partition from the dataset split.
    129 """
    130 if not self._dataset_prepared:
--> 131     self._prepare_dataset()
    132 if self._dataset is None:
    133     raise ValueError("Dataset is not loaded yet.")

File ~...\Lib\site-packages\flwr_datasets\federated_dataset.py:237, in FederatedDataset._prepare_dataset(self)
    216 def _prepare_dataset(self) -> None:
    217     """Prepare the dataset (prior to partitioning) by download, shuffle, replit.
    218 
    219     Run only ONCE when triggered by load_* function. (In future more control whether
   (...)
    235     happen before the resplitting.
    236     """
--> 237     self._dataset = datasets.load_dataset(
    238         path=self._dataset_name, name=self._subset
    239     )
    240     if self._shuffle:
    241         # Note it shuffles all the splits. The self._dataset is DatasetDict
    242         # so e.g. {"train": train_data, "test": test_data}. All splits get shuffled.
    243         self._dataset = self._dataset.shuffle(seed=self._seed)

File ~...\Lib\site-packages\datasets\load.py:2538, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2536 if data_files is not None and not data_files:
   2537     raise ValueError(f"Empty 'data_files': '{data_files}'. It should be either non-empty or None (default).")
-> 2538 if Path(path, config.DATASET_STATE_JSON_FILENAME).exists():
   2539     raise ValueError(
   2540         "You are trying to load a dataset that was saved using `save_to_disk`. "
   2541         "Please use `load_from_disk` instead."
   2542     )
   2544 if streaming and num_proc is not None:

File ~...\Lib\pathlib.py:1162, in Path.__init__(self, *args, **kwargs)
   1159     msg = ("support for supplying keyword arguments to pathlib.PurePath "
   1160            "is deprecated and scheduled for removal in Python {remove}")
   1161     warnings._deprecated("pathlib.PurePath(**kwargs)", msg, remove=(3, 14))
-> 1162 super().__init__(*args)

File ~...\Lib\pathlib.py:373, in PurePath.__init__(self, *args)
    371             path = arg
    372         if not isinstance(path, str):
--> 373             raise TypeError(
    374                 "argument should be a str or an os.PathLike "
    375                 "object where __fspath__ returns a str, "
    376                 f"not {type(path).__name__!r}")
    377         paths.append(path)
    378 self._raw_paths = paths

TypeError: argument should be a str or an os.PathLike object where __fspath__ returns a str, not 'Dataset'
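
A minimal sketch of the kind of change suggested above (an assumption about how it could look, not the current flwr_datasets API): accept either a Hub name or an already-loaded Dataset/DatasetDict and only call load_dataset for the former.

# Sketch only: hypothetical helper mirroring FederatedDataset's constructor
# arguments; none of this is the library's current code.
from typing import Optional, Union

import datasets


def prepare_dataset(
    dataset_or_name: Union[str, datasets.Dataset, datasets.DatasetDict],
    subset: Optional[str] = None,
    shuffle: bool = True,
    seed: Optional[int] = 42,
) -> datasets.DatasetDict:
    """Return a DatasetDict, downloading from the Hub only when given a name."""
    if isinstance(dataset_or_name, datasets.DatasetDict):
        dataset = dataset_or_name
    elif isinstance(dataset_or_name, datasets.Dataset):
        # Wrap a single in-memory split so downstream code sees a DatasetDict.
        split = str(dataset_or_name.split) if dataset_or_name.split else "train"
        dataset = datasets.DatasetDict({split: dataset_or_name})
    else:
        dataset = datasets.load_dataset(path=dataset_or_name, name=subset)
    if shuffle:
        dataset = dataset.shuffle(seed=seed)
    return dataset

In the meantime, a possible workaround is to use the partitioner directly on the in-memory dataset (assuming the Partitioner.dataset setter that IidPartitioner inherits), bypassing FederatedDataset:

# Workaround sketch: partition a local datasets.Dataset without FederatedDataset.
from flwr_datasets.partitioner import IidPartitioner

partitioner = IidPartitioner(num_partitions=30)
partitioner.dataset = dataset  # the Dataset built with Dataset.from_pandas above
partition = partitioner.load_partition(1)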

gubertoli changed the title from "FDS giving UserWarning for custom/local dataset" to "flwr_datasets with custom/local dataset" Apr 3, 2024