Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve oxford_iiit_pet dataset #5349

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

enricoschroeder
Copy link

Improve oxford_iiit_pet dataset:

  • Include head bounding boxes for the training split.
  • Exclude samples with invalid or corrupt image files (preventing 'Corrupt JPEG data:...' error messages while using the dataset).
  • Use up-to-date download URL (from official website https://www.robots.ox.ac.uk/~vgg/data/pets/))

Thank you for your contribution!

Please read https://www.tensorflow.org/datasets/contribute#pr_checklist to make sure your PR follows the guidelines.

The checklist does not apply since this is only a minor fix to an existing dataset.

- Include head bounding boxes for the training split.
- Exclude samples with invalid or corrupt image files (preventing 'Corrupt JPEG data:...' error messages while using the dataset).
Copy link

google-cla bot commented Apr 4, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@fineguy fineguy self-requested a review April 16, 2024 12:00
Copy link
Collaborator

@fineguy fineguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution!

Overall it looks great, just a few minor comments.

@@ -63,6 +64,44 @@
]
_SPECIES_CLASSES = ["Cat", "Dog"]

# List of samples with corrupt image files
_SKIP_SAMPLES = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what the problem with these images is? Could it be that they're encoded in a different format? I suggest we check them in code and try our best to recover, similar to:
https://github.com/SirLefti/datasets/blob/9ae21497aee7177d931e359ffe939e18c7fc1531/tensorflow_datasets/image_classification/cats_vs_dogs.py#L102-L107

@@ -16,11 +16,12 @@
"""Oxford-IIIT pet dataset."""

import os
import xml.etree.ElementTree

from tensorflow_datasets.core.utils.lazy_imports_utils import tensorflow as tf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think tensorflow is only used to open files, could you please replace this with from etils import epath?


# Disable pytype to avoid attribute-error due to find returning
# Optional[Element]
# pytype: disable=attribute-error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This disables pytype for the rest of the file, which is undesirable. How about something like size: xml.etree.ElementTree.Element = root.find("size") # pytype: disable=annotation-type-mismatch?

@@ -79,6 +118,7 @@ def _info(self):
"segmentation_mask": tfds.features.Image(
shape=(None, None, 1), use_colormap=True
),
"head": tfds.features.BBoxFeature()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a more explicit feature name, e.g. "head_bbox"?

return tfds.features.BBox(
ymin / height, xmin / width, ymax / height, xmax / width
)


class Builder(tfds.core.GeneratorBasedBuilder):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please bump VERSION and add RELEASE_NOTES.

head_bbox = _get_head_bbox(os.path.join(xmls_dir_path, xml_name))
except tf.errors.NotFoundError:
# test samples do not have an annotation file
head_bbox = tfds.features.BBox(0., 0., 0., 0.)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to a constant _EMPTY_BBOX.

- Fix corrupt samples by re-encoding instead of skipping them
- Rename head bbox feature
- Update checksums
- Bump version
- Minor refactoring
@enricoschroeder
Copy link
Author

@fineguy I have addressed your comments. The corrupt samples are now being fixed instead of skipped. For that, we still need to import Tensorflow (even though data reading is now done via etils).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants