Data is not shuffled prior to validation split #12
Comments
Thank you for identifying the bug. I think the same issue would persist with ... I will think about how the problem could best be fixed. In the meantime, you can avoid the problem by resolving the filenames, shuffling them in a dataframe, and then building the generators from that dataframe:

import pandas as pd
from pathlib import Path
...
image_filenames = list(Path('data').rglob('*.png'))  # or *.jpg etc., depending on your dataset's file format
data_df = pd.DataFrame({'filename': image_filenames})
validation_df = data_df.sample(frac=0.2)  # to get a fixed split, set the seed via the `random_state` parameter
train_df = data_df[~data_df.filename.isin(validation_df.filename)]
train_datagen = ImageDataAugmentor(
    augment=AUGMENTATIONS)
val_datagen = ImageDataAugmentor()
train_generator = train_datagen.flow_from_dataframe(
    train_df,
    target_size=image_size,
    class_mode='binary',
    seed=123)
validation_generator = val_datagen.flow_from_dataframe(
    validation_df,
    target_size=image_size,
    batch_size=1,
    class_mode='binary',
    seed=123)
...
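As a quick sanity check of the dataframe split suggested above, here is a minimal sketch that runs without any images. The filenames, the `SEED` value, and the `split` helper are hypothetical; only `DataFrame.sample` and `Series.isin` come from pandas. It shows that fixing `random_state` gives the same validation set on every run and that the two sets do not overlap.

```python
import pandas as pd

SEED = 123  # hypothetical value; any fixed integer makes the split repeatable

# Hypothetical filenames standing in for the result of Path(...).rglob(...)
data_df = pd.DataFrame({"filename": [f"img_{i:03d}.png" for i in range(100)]})


def split(df, frac, seed):
    """Randomly sample `frac` of the rows for validation; the rest is training."""
    validation_df = df.sample(frac=frac, random_state=seed)
    train_df = df[~df.filename.isin(validation_df.filename)]
    return train_df, validation_df


train_a, val_a = split(data_df, 0.2, SEED)
train_b, val_b = split(data_df, 0.2, SEED)

print(len(train_a), len(val_a))
# Same seed, same validation files in the same order
print(val_a.filename.tolist() == val_b.filename.tolist())
# No file appears in both sets
print(set(train_a.filename).isdisjoint(val_a.filename))
```

Without `random_state`, each call to `sample` draws a different validation set, which makes results harder to compare between runs.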
Thanks for the advice! I botched together a solution in a fork yesterday (Sanqui@e5d0dc0), but I think your way is more straightforward. I had to turn the paths into strings and also gather the classes, but otherwise this worked:

filepaths = list(Path(DATASET_DIRECTORY).rglob('*.png'))
filenames = [str(path) for path in filepaths]
classes = [str(path.parts[1]) for path in filepaths]
data_df = pd.DataFrame({'filename': filenames, 'class': classes})
validation_df = data_df.sample(frac=VALIDATION_SPLIT,
                               random_state=SEED)
train_df = data_df[~data_df.filename.isin(validation_df.filename)]

Cheers.
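One caveat on gathering the classes with `path.parts[1]`: that index only points at the class folder when the dataset directory is a single relative path component (for example `data/cat/0.png`). The sketch below is one possible alternative, not the code from the fork: it takes the class from the path relative to the dataset root, so it works at any depth. The helper name `list_images_with_classes` and the `cat`/`dog` folders are hypothetical, and the sketch assumes a `<root>/<class>/<image>` layout.

```python
import tempfile
from pathlib import Path


def list_images_with_classes(root, pattern="*.png"):
    """Return (filename, class) pairs; the class is the first folder below `root`."""
    root = Path(root)
    pairs = []
    for path in sorted(root.rglob(pattern)):
        # Relative to the root, the first component is the class folder
        class_name = path.relative_to(root).parts[0]
        pairs.append((str(path), class_name))
    return pairs


# Build a tiny throwaway dataset: <root>/<class>/<image>.png
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "nested" / "dataset"
    for class_name in ("cat", "dog"):
        (root / class_name).mkdir(parents=True)
        for i in range(3):
            (root / class_name / f"{i}.png").touch()
    pairs = list_images_with_classes(root)

classes = [class_name for _, class_name in pairs]
print(classes)  # ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']
```

The resulting pairs can be passed to `pd.DataFrame(pairs, columns=['filename', 'class'])` to get the same two columns as above. Files placed directly under the root would need separate handling, since their first relative component is the filename itself.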
The bug still exists. With tf.keras, the validation generator is not working: it cannot iterate to the next batch of data (it raises "Attribute Error: NoneType has shape").
@ikarus-999: it seems your comment is not related to the current issue. As a friendly suggestion for the future, please open a new issue whenever your problem does not coincide with an existing one. I am fairly certain that one of the paths in your dataframe is incorrect and ...
Hello,
I ran into an issue when trying to use ImageDataAugmentor with a single directory, asking it to split into a training and validation set.
I have figured out two ways to accomplish this, both with faults:
This approach runs into one big issue, namely, the validation dataset now has augmentations applied, which goes against best practice. My second approach fares better in this department:
However, I have discovered a second, large issue: although the flow_from_directory method can shuffle data, the list of filenames is not shuffled prior to the split, so the validation dataset receives the first 20% of the files in alphabetical order, which can lead to huge biases. This can be verified by printing validation_generator.filenames.
Please advise me on this issue. I think shuffling the dataset prior to applying the validation split would be the solution here.
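To make the bias concrete, here is a minimal, library-free sketch. It is not the library's actual splitting code: `split_first_fraction` is a hypothetical helper that mimics "take the first fraction of the list", and the `scanner_a`/`scanner_b` filenames are invented so that alphabetical order groups related files together.

```python
import random


def split_first_fraction(filenames, fraction):
    """Use the first `fraction` of the list for validation, the rest for training."""
    n_val = int(len(filenames) * fraction)
    return filenames[n_val:], filenames[:n_val]


# Hypothetical filenames: the prefix encodes which scanner produced the image
filenames = sorted(
    [f"scanner_a_{i:03d}.png" for i in range(20)]
    + [f"scanner_b_{i:03d}.png" for i in range(80)]
)

# Unshuffled: validation is the first 20 names alphabetically, all from scanner_a
_, val_sorted = split_first_fraction(filenames, 0.2)
sorted_sources = {name.rsplit("_", 1)[0] for name in val_sorted}

# Shuffled with a fixed seed before splitting: validation is a random sample
shuffled = filenames[:]
random.Random(123).shuffle(shuffled)
_, val_shuffled = split_first_fraction(shuffled, 0.2)
shuffled_sources = {name.rsplit("_", 1)[0] for name in val_shuffled}

print(sorted_sources)  # {'scanner_a'}
print(shuffled_sources)
```

In the unshuffled case the model is validated only on `scanner_a` images and trained only on `scanner_b` images, which is the kind of bias described above. Shuffling first makes the validation set a random sample of all files.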