New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve oxford_iiit_pet dataset #5349
base: master
Are you sure you want to change the base?
Improve oxford_iiit_pet dataset #5349
Conversation
- Include head bounding boxes for the training split. - Exclude samples with invalid or corrupt image files (preventing 'Corrupt JPEG data:...' error messages while using the dataset).
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution!
Overall it looks great, just a few minor comments.
@@ -63,6 +64,44 @@ | |||
] | |||
_SPECIES_CLASSES = ["Cat", "Dog"] | |||
|
|||
# List of samples with corrupt image files | |||
_SKIP_SAMPLES = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know what the problem with these images is? Could it be that they're encoded in a different format? I suggest we check them in code and try our best to recover, similar to:
https://github.com/SirLefti/datasets/blob/9ae21497aee7177d931e359ffe939e18c7fc1531/tensorflow_datasets/image_classification/cats_vs_dogs.py#L102-L107
@@ -16,11 +16,12 @@ | |||
"""Oxford-IIIT pet dataset.""" | |||
|
|||
import os | |||
import xml.etree.ElementTree | |||
|
|||
from tensorflow_datasets.core.utils.lazy_imports_utils import tensorflow as tf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think tensorflow is only used to open files, could you please replace this with from etils import epath
?
|
||
# Disable pytype to avoid attribute-error due to find returning | ||
# Optional[Element] | ||
# pytype: disable=attribute-error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This disables pytype for the rest of the file, which is undesirable. How about something like size: xml.etree.ElementTree.Element = root.find("size") # pytype: disable=annotation-type-mismatch
?
@@ -79,6 +118,7 @@ def _info(self): | |||
"segmentation_mask": tfds.features.Image( | |||
shape=(None, None, 1), use_colormap=True | |||
), | |||
"head": tfds.features.BBoxFeature() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about a more explicit feature name, e.g. "head_bbox"?
return tfds.features.BBox( | ||
ymin / height, xmin / width, ymax / height, xmax / width | ||
) | ||
|
||
|
||
class Builder(tfds.core.GeneratorBasedBuilder): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please bump VERSION and add RELEASE_NOTES.
head_bbox = _get_head_bbox(os.path.join(xmls_dir_path, xml_name)) | ||
except tf.errors.NotFoundError: | ||
# test samples do not have an annotation file | ||
head_bbox = tfds.features.BBox(0., 0., 0., 0.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to a constant _EMPTY_BBOX
.
- Fix corrupt samples by re-encoding instead of skipping them - Rename head bbox feature - Update checksums - Bump version - Minor refactoring
@fineguy I have addressed your comments. The corrupt samples are now being fixed instead of skipped. For that, we still need to import Tensorflow (even though data reading is now done via etils). |
Improve oxford_iiit_pet dataset:
Thank you for your contribution!
Please read https://www.tensorflow.org/datasets/contribute#pr_checklist to make sure your PR follows the guidelines.
The checklist does not apply since this is only a minor fix to an existing dataset.