Division error in function “divide_dataset” #3188

jmsw4bn · 2024-04-02T04:59:29Z

Describe the bug

When I divide a dataset with [0.2, 0.2, 0.2, 0.94], the resulting sub_datasets are wrong.
Of the 1st, 2nd, 3rd, and 4th sub_datasets, the 3rd has 0 samples.
I then tried [0.2, 0.2, 0.2, 0.2, 0.92] and found that the 3rd and 4th sub_datasets have 0 samples.

I eventually traced it to the "_create_division_indices_ranges" function in "utils.py".
The code "start_idx += end_idx" should be "start_idx = end_idx".
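To make the effect of that one-character change concrete, here is a minimal, dependency-free sketch. It is not the library's code: `division_ranges` and its `buggy` flag are hypothetical names, and the loop body is only an assumption about how the index ranges are accumulated, based on the two statements quoted above.

```python
def division_ranges(dataset_length, division, buggy=False):
    """Return one index range per fraction in `division` (illustrative only)."""
    ranges = []
    start_idx = 0
    end_idx = 0
    for fraction in division:
        # Each part ends where the cumulative sample count ends.
        end_idx += int(dataset_length * fraction)
        ranges.append(range(start_idx, end_idx))
        if buggy:
            # Reported bug: start_idx grows by the cumulative end index,
            # so it overshoots from the third part onward.
            start_idx += end_idx
        else:
            # Proposed fix: the next part starts where this one ended.
            start_idx = end_idx
    return ranges


division = [0.25, 0.25, 0.25, 0.25]
buggy_sizes = [len(r) for r in division_ranges(100, division, buggy=True)]
fixed_sizes = [len(r) for r in division_ranges(100, division)]
print(buggy_sizes)  # [25, 25, 0, 0]
print(fixed_sizes)  # [25, 25, 25, 25]
```

With "+=", the third range becomes range(75, 75) and the fourth range(150, 100), both empty, which matches the pattern of the first two parts being correct and later ones having 0 samples.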

Steps/Code to Reproduce

from flwr_datasets import FederatedDataset
from flwr_datasets.utils import divide_dataset
fds = FederatedDataset(dataset="cifar10", partitioners={"train": 1})
tds = fds.load_partition(0, "train")
partition = divide_dataset(dataset=tds, division=[0.2, 0.2, 0.2, 0.2, 0.92])

Expected Results

partition should have 5 datasets, whose sample sizes are respectively 10000, 10000, 10000, 10000, 46000 (total 50000 samples).

Actual Results

The sample sizes actually obtained are 10000, 10000, 0, 0, ...

adam-narozniak · 2024-04-02T09:49:27Z

Hi @jmsw4bn. Thanks for pointing it out and figuring out the fix. I've opened a PR that fixes it.
As for the expected results, [0.2, 0.2, 0.2, 0.94] won't be possible since the values sum to more than 1. An error will be raised in that case. It is unclear where the 0.94 would come from (it would have to overlap with other parts that are expected to be separate). Alternatively, I think you might have meant 0.02, ...; then the values would sum to 1 and everything would work.
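As a rough illustration of the kind of check meant here, the sketch below rejects fractions that sum to more than 1. The helper name `check_division`, the tolerance, and the error message are all assumptions made for the example; the actual validation and wording in flwr_datasets may differ.

```python
def check_division(division, tolerance=1e-9):
    """Raise ValueError if the fractions claim more than the whole dataset.

    Hypothetical helper, not part of flwr_datasets.
    """
    total = sum(division)
    # A small tolerance absorbs floating-point rounding in the sum.
    if total > 1.0 + tolerance:
        raise ValueError(
            f"Division fractions sum to {total:.2f}, which is more than 1."
        )
    return total


# Sums to 1 (up to rounding), so it is accepted.
check_division([0.02, 0.02, 0.02, 0.02, 0.92])

# Sums to about 1.54, so the parts would have to overlap; it is rejected.
try:
    check_division([0.2, 0.2, 0.2, 0.94])
except ValueError as err:
    print(err)
```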

jmsw4bn · 2024-04-02T15:12:20Z

I am sorry, I wrote the wrong values.
I actually tested the code with "division=[0.02, 0.02, 0.02, 0.02, 0.92]".
These values sum to 1, yet the 3rd and 4th sub_datasets have 0 samples.
You can verify the error by running the following code. The output shows that the 5 sub_datasets in "partition" have 1000, 1000, 0, 0, 40000 samples respectively (the correct output should be 1000, 1000, 1000, 1000, 46000):

from flwr_datasets import FederatedDataset
from flwr_datasets.utils import divide_dataset
fds = FederatedDataset(dataset="cifar10", partitioners={"train": 1})
tds = fds.load_partition(0, "train")
partition = divide_dataset(dataset=tds, division=[0.02, 0.02, 0.02, 0.02, 0.92])
print(len(partition[0]), len(partition[1]), len(partition[2]), len(partition[3]), len(partition[4]))

jmsw4bn added the "bug" label Apr 2, 2024

adam-narozniak mentioned this issue Apr 2, 2024

Fix divide_dataset in Federated Datasets #3192

Merged

adam-narozniak self-assigned this Apr 2, 2024
