Division error in function “divide_dataset” #3188

Open
jmsw4bn opened this issue Apr 2, 2024 · 2 comments

jmsw4bn commented Apr 2, 2024

Describe the bug

When I divide a dataset with [0.2, 0.2, 0.2, 0.94], the resulting sub_datasets are wrong.
Of the 1st, 2nd, 3rd, and 4th sub_datasets, the 3rd has 0 samples.
I then tried [0.2, 0.2, 0.2, 0.2, 0.92] and found that both the 3rd and 4th sub_datasets have 0 samples.

I traced the problem to the "_create_division_indices_ranges" function in "utils.py".
The line "start_idx += end_idx" should be "start_idx = end_idx".
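
To make the effect of that one-character change concrete, here is a minimal standalone sketch. It assumes the function accumulates "end_idx" per fraction and builds "range(start_idx, end_idx)" for each part; "division_ranges" and its "buggy" flag are hypothetical names for illustration, not the library's actual code. The fractions are chosen to sum to 1.

```python
def division_ranges(dataset_length, division, buggy=False):
    """Turn fractions into consecutive index ranges (illustrative only)."""
    ranges = []
    start_idx = 0
    end_idx = 0
    for fraction in division:
        end_idx += int(dataset_length * fraction)
        ranges.append(range(start_idx, end_idx))
        if buggy:
            # Reported bug: start_idx accumulates and overshoots end_idx.
            start_idx += end_idx
        else:
            # Proposed fix: the next range starts where this one ended.
            start_idx = end_idx
    return ranges


division = [0.02, 0.02, 0.02, 0.02, 0.92]
buggy_sizes = [len(r) for r in division_ranges(50_000, division, buggy=True)]
fixed_sizes = [len(r) for r in division_ranges(50_000, division)]
print(buggy_sizes)  # [1000, 1000, 0, 0, 40000]
print(fixed_sizes)  # [1000, 1000, 1000, 1000, 46000]
```

With "+=", the start index after the second part is already 3000 while the end index is only 3000, so the third range is empty, and the fourth starts at 6000 but ends at 4000.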

Steps/Code to Reproduce

from flwr_datasets import FederatedDataset
from flwr_datasets.utils import divide_dataset
fds = FederatedDataset(dataset="cifar10", partitioners={"train": 1})
tds = fds.load_partition(0, "train")
partition = divide_dataset(dataset=tds, division=[0.2, 0.2, 0.2, 0.2, 0.92])

Expected Results

partition should contain 5 datasets with sample sizes of 10000, 10000, 10000, 10000, 46000 respectively (total 50000 samples).

Actual Results

The sample sizes actually returned are 10000, 10000, 0, 0, ...

jmsw4bn added the bug label Apr 2, 2024

adam-narozniak (Member) commented

Hi @jmsw4bn. Thanks for pointing this out and figuring out the fix. I've opened a PR that fixes it.
As for the expected results, [0.2, 0.2, 0.2, 0.94] won't be possible since the values sum to more than 1. An error will be raised in that case. It would be unclear where the 0.94 should come from (it would have to overlap with other parts that are expected to be separate). Alternatively, I think you might have meant 0.02, ...; then the values would sum to 1 and everything would work.
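
As a rough illustration of the kind of check described above, the sketch below rejects fractions that sum to more than 1. The function name "check_division" and the tolerance value are assumptions made for this example; the library's actual validation and error message may differ.

```python
import math


def check_division(division, tolerance=1e-9):
    """Raise if the fractions sum to more than 1 (hypothetical helper)."""
    total = math.fsum(division)
    # A small tolerance avoids rejecting valid input due to float rounding.
    if total > 1.0 + tolerance:
        raise ValueError(f"Fractions sum to {total}, which is more than 1.")
    return total


try:
    check_division([0.2, 0.2, 0.2, 0.94])
except ValueError as err:
    print("rejected:", err)

check_division([0.02, 0.02, 0.02, 0.94])  # accepted: sums to 1
```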

adam-narozniak self-assigned this Apr 2, 2024

jmsw4bn commented Apr 2, 2024

I am sorry, I wrote the wrong values.
I actually tested the code with "division=[0.02, 0.02, 0.02, 0.02, 0.92]".
These values sum to 1, and the 3rd and 4th sub_datasets still have 0 samples.
You can verify the error by running the following code. The output shows that the 5 sub_datasets in "partition" have 1000, 1000, 0, 0, 40000 samples respectively (the correct output would be 1000, 1000, 1000, 1000, 46000):

from flwr_datasets import FederatedDataset
from flwr_datasets.utils import divide_dataset
fds = FederatedDataset(dataset="cifar10", partitioners={"train": 1})
tds = fds.load_partition(0, "train")
partition = divide_dataset(dataset=tds, division=[0.02, 0.02, 0.02, 0.02, 0.92])
print(len(partition[0]), len(partition[1]), len(partition[2]), len(partition[3]), len(partition[4]))
