Core dump "Could not decode datum" during training #763

Open
3 of 7 tasks
YaYaB opened this issue Jul 24, 2020 · 2 comments

Comments

@YaYaB
Contributor

YaYaB commented Jul 24, 2020

If Ok, please give as many details as possible to help us solve the problem more efficiently.

Configuration

  • Version of DeepDetect:
    • Locally compiled on:
      • Ubuntu 14.04 LTS
      • Mac OSX
      • Other:
    • Docker
    • Amazon AMI
  • Commit (shown by the server when starting):
    ecdfad8

Your question / the problem you're facing:

I launched a training run for an image model. The LMDB creation went well (no errors seen), but at some point during training I got a core dump.
Note that this happened during the second epoch, so all the training data had already been seen once and the test set had been predicted once.

Error message (if any) / steps to reproduce the problem:

Here are the logs I obtained when it core dumped:

  • Server log output:
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng error: IDAT: CRC error
[2020-07-24 10:06:14.222] [caffe] [error] Could not decode datum 
terminate called after throwing an instance of 'CaffeErrorException'
  what():  src/caffe/data_transformer.cpp:895 / Check failed (custom): cv_cropped_image.data
[1]    5337 abort (core dumped)  ./dede --port 8081

I've searched a bit, and it might be due to a corrupted image, but if that is the case I don't understand how the first epoch ran correctly.

@beniz
Collaborator

beniz commented Jul 27, 2020

Hi, libpng says it: there's an issue with an image somewhere. The best way to find it is to write a script that decodes all the images.
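As a rough illustration of such a script, here is a minimal sketch in Python that uses only the standard library. It does not do a full decode: it only walks the PNG chunk structure and recomputes each chunk's CRC, which is the kind of problem behind the `IDAT: CRC error` line in the log. The function names `png_crc_errors` and `scan_pngs` are made up for this example, and the assumption that the dataset is a directory tree of `.png` files is mine. A full decode with the same library the training pipeline uses would be a more thorough test.

```python
import struct
import sys
import zlib
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_crc_errors(data):
    """Return a list of problems found in the PNG chunk structure of `data`.

    Hypothetical helper: checks the signature, chunk lengths and chunk CRCs
    only. It does not decompress or decode the pixel data.
    """
    if not data.startswith(PNG_SIGNATURE):
        return ["missing PNG signature"]
    problems = []
    pos = len(PNG_SIGNATURE)
    seen_iend = False
    while pos < len(data):
        if pos + 8 > len(data):
            problems.append("truncated chunk header")
            break
        # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        name = ctype.decode("latin-1")
        body_end = pos + 8 + length
        if body_end + 4 > len(data):
            problems.append(name + ": truncated chunk")
            break
        (stored,) = struct.unpack(">I", data[body_end:body_end + 4])
        # The CRC covers the chunk type and the chunk data, not the length.
        computed = zlib.crc32(data[pos + 4:body_end]) & 0xFFFFFFFF
        if stored != computed:
            problems.append(name + ": CRC error")
        pos = body_end + 4
        if ctype == b"IEND":
            seen_iend = True
            break
    if not seen_iend and not problems:
        problems.append("missing IEND chunk")
    return problems


def scan_pngs(root):
    """Map each suspicious .png file under `root` to its list of problems."""
    bad = {}
    for path in sorted(Path(root).rglob("*.png")):
        problems = png_crc_errors(path.read_bytes())
        if problems:
            bad[str(path)] = problems
    return bad


if __name__ == "__main__":
    # Usage (the script name is an assumption): python check_pngs.py /path/to/images
    for path, problems in scan_pngs(sys.argv[1]).items():
        print(path, "; ".join(problems))
```

Files that pass this check can still fail a real decode (for example, the "bad adaptive filter type" warnings come from the decompressed pixel data), so an empty result narrows the search rather than clearing the dataset.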

If the model being trained is an object detector, you can also debug by setting the check_size variable to true: https://github.com/jolibrain/deepdetect/blob/master/src/backends/caffe/caffeinputconns.cc#L871

If the two tests above do not show anything wrong, you can try deactivating all the pragmas in this layer, starting here: https://github.com/jolibrain/caffe/blob/master/src/caffe/layers/annotated_data_layer.cpp#L164

But my hunch is that you have a bad PNG somewhere. I don't know about the epochs; data augmentation is randomized and datums are prefetched with three threads.

@YaYaB
Contributor Author

YaYaB commented Jul 28, 2020

Yeah, I may have some weird PNGs. I tried decoding all of them and it seemed okay. I'll try again to see.
