Getting error when running run_training.py with custom dataset #50

Open
AlexCHEU opened this issue Feb 3, 2022 · 5 comments

Comments

@AlexCHEU

AlexCHEU commented Feb 3, 2022

Hello, thanks for your wonderful work!

I have a question about running run_training.py with a custom dataset.

  1. The dataset was pre-processed with create_from_images.py, which produced a single TFRecord file (2.07 GB, about 2k images). Is a single file OK?
  2. The following error was raised when running python run_training.py --data-dir=<>d --result-dir=<> --dataset="train" --num-gpus=1 --total-kimg=10000 --mirror-augment=True
Local submit - run_dir: /content/drive/MyDrive/co-mod-gan/results/00006-co-mod-gan-train_all-1gpu
dnnlib: Running training.training_loop.training_loop() on localhost...
Streaming data using training.dataset.TFRecordDataset...
tcmalloc: large alloc 4294967296 bytes == 0x562d81b88000 @  0x7f9c7abf2001 0x7f9c776d654f 0x7f9c77726b58 0x7f9c7772ab17 0x7f9c777c9203 0x562d79b9c424 0x562d79b9c120 0x562d79c10b80 0x562d79c0b66e 0x562d79b9e36c 0x562d79bdf7b9 0x562d79bdc6d4 0x562d79b9e571 0x562d79c0d633 0x562d79c0b02f 0x562d79adce2b 0x562d79c0d633 0x562d79c0b66e 0x562d79adce2b 0x562d79c0d633 0x562d79b9d9da 0x562d79c0beae 0x562d79b9d9da 0x562d79c0c108 0x562d79c0b02f 0x562d79adce2b 0x562d79c0d633 0x562d79c0b02f 0x562d79adce2b 0x562d79c0d633 0x562d79b9d9da
tcmalloc: large alloc 4294967296 bytes == 0x562e81b88000 @  0x7f9c7abf01e7 0x7f9c776d646e 0x7f9c77726c7b 0x7f9c7772735f 0x7f9c777c9103 0x562d79b9c424 0x562d79b9c120 0x562d79c10b80 0x562d79c0b02f 0x562d79b9daba 0x562d79c0ccd4 0x562d79c0b02f 0x562d79b9daba 0x562d79c0ccd4 0x562d79c0b02f 0x562d79b9daba 0x562d79c0ccd4 0x562d79b9d9da 0x562d79c0beae 0x562d79c0b02f 0x562d79b9daba 0x562d79c102c0 0x562d79c0b02f 0x562d79b9daba 0x562d79c0ccd4 0x562d79c0b66e 0x562d79b9e36c 0x562d79bdf7b9 0x562d79bdc6d4 0x562d79b9e571 0x562d79c0d633
tcmalloc: large alloc 4294967296 bytes == 0x562f834ea000 @  0x7f9c7abf01e7 0x7f9c776d646e 0x7f9c77726c7b 0x7f9c7772735f 0x7f9c22441235 0x7f9c21dc4792 0x7f9c21dc4d42 0x7f9c21d7daee 0x562d79b9c317 0x562d79b9c120 0x562d79c10679 0x562d79b9d9da 0x562d79c0c108 0x562d79c0b1c0 0x562d79adceb0 0x562d79c0d633 0x562d79c0b02f 0x562d79b9daba 0x562d79c0c108 0x562d79c0b66e 0x562d79b9daba 0x562d79c0c108 0x562d79b9d9da 0x562d79c0c108 0x562d79c0b02f 0x562d79b9e151 0x562d79b9e571 0x562d79c0d633 0x562d79c0b02f 0x562d79b9daba 0x562d79c0beae
Dataset shape = [3, 512, 512]
Dynamic range = [0, 255]
Label size    = 0
Traceback (most recent call last):
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)


 tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
	 [[{{node Dataset_1/IteratorGetNext}}]]



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_training.py", line 133, in <module>
    main()
  File "run_training.py", line 128, in main
    run(**vars(args))
  File "run_training.py", line 71, in run
    dnnlib.submit_run(**kwargs)
  File "/content/drive/MyDrive/co-mod-gan/dnnlib/submission/submit.py", line 343, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "/content/drive/MyDrive/co-mod-gan/dnnlib/submission/internal/local.py", line 22, in submit
    return run_wrapper(submit_config)
  File "/content/drive/MyDrive/co-mod-gan/dnnlib/submission/submit.py", line 280, in run_wrapper
    run_func_obj(**submit_config.run_func_kwargs)
  File "/content/drive/MyDrive/co-mod-gan/training/training_loop.py", line 142, in training_loop
    grid_size, grid_reals, grid_labels, grid_masks = misc.setup_snapshot_image_grid(training_set, **grid_args)
  File "/content/drive/MyDrive/co-mod-gan/training/misc.py", line 123, in setup_snapshot_image_grid
    reals[:], labels[:] = training_set.get_minibatch_val_np(gw * gh)
  File "/content/drive/MyDrive/co-mod-gan/training/dataset.py", line 189, in get_minibatch_val_np
    return tflib.run(self._tf_minibatch_val_np)
  File "/content/drive/MyDrive/co-mod-gan/dnnlib/tflib/tfutil.py", line 31, in run
    return tf.get_default_session().run(*args, **kwargs)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)


tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
	 [[node Dataset_1/IteratorGetNext (defined at /tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/ops.py:1748) ]]


Original stack trace for 'Dataset_1/IteratorGetNext':
  File "run_training.py", line 133, in <module>
    main()
  File "run_training.py", line 128, in main
    run(**vars(args))
  File "run_training.py", line 71, in run
    dnnlib.submit_run(**kwargs)
  File "/content/drive/MyDrive/co-mod-gan/dnnlib/submission/submit.py", line 343, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "/content/drive/MyDrive/co-mod-gan/dnnlib/submission/internal/local.py", line 22, in submit
    return run_wrapper(submit_config)
  File "/content/drive/MyDrive/co-mod-gan/dnnlib/submission/submit.py", line 280, in run_wrapper
    run_func_obj(**submit_config.run_func_kwargs)
  File "/content/drive/MyDrive/co-mod-gan/training/training_loop.py", line 142, in training_loop
    grid_size, grid_reals, grid_labels, grid_masks = misc.setup_snapshot_image_grid(training_set, **grid_args)
  File "/content/drive/MyDrive/co-mod-gan/training/misc.py", line 123, in setup_snapshot_image_grid
    reals[:], labels[:] = training_set.get_minibatch_val_np(gw * gh)
  File "/content/drive/MyDrive/co-mod-gan/training/dataset.py", line 188, in get_minibatch_val_np
    self._tf_minibatch_val_np = self.get_minibatch_val_tf()
  File "/content/drive/MyDrive/co-mod-gan/training/dataset.py", line 174, in get_minibatch_val_tf
    return self._tf_val_iterator.get_next()
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/data/ops/iterator_ops.py", line 426, in get_next
    name=name)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/ops/gen_dataset_ops.py", line 2518, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Is this related to the TensorFlow version? I am trying to run the training on Google Colab, which only provides TensorFlow 1.15.2 now...

Could you please help me figure out what I did wrong?

Thanks for any of your help and happy CNY :)

@zsyzzsoft
Owner

  1. Fine.
  2. This looks like an out-of-memory (OOM) issue. You can try reducing shuffle_mb and prefetch_mb in dataset.py.
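
For context on why these two parameters matter: each tcmalloc line in the log above reports an allocation of 4294967296 bytes, which is exactly 4096 MiB. The sketch below assumes that dataset.py follows the StyleGAN2 convention of turning a buffer size in megabytes into a number of buffered images; that convention, and the 4096 and 2048 values used here, are assumptions and were not checked against this repository. The helper name buffer_items is hypothetical.

```python
def buffer_items(buffer_mb, item_shape, bytes_per_value=1):
    """Number of items that fit in a buffer of buffer_mb MiB (rounded up).

    Assumption: mirrors the StyleGAN2-style formula
    ((mb << 20) - 1) // bytes_per_item + 1; a size of 0 disables the buffer.
    """
    bytes_per_item = bytes_per_value
    for dim in item_shape:
        bytes_per_item *= dim
    if buffer_mb <= 0:
        return 0
    return ((buffer_mb << 20) - 1) // bytes_per_item + 1


# Shape taken from the log: "Dataset shape = [3, 512, 512]", one byte per value.
shape = [3, 512, 512]

print(4096 << 20)                 # 4294967296, the size reported by tcmalloc
print(buffer_items(4096, shape))  # 5462 images
print(buffer_items(2048, shape))  # 2731 images
print(buffer_items(0, shape))     # 0, buffer disabled
```

Under these assumptions a 4096 MiB shuffle buffer holds more images than the roughly 2k in this dataset, so lowering the values mainly reduces host memory use.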

@AlexCHEU
Author

AlexCHEU commented Feb 3, 2022

Thank you so much for your prompt reply! 🙌
I kept the single TFRecord file and set shuffle_mb=0, prefetch_mb=0, sched.minibatch_size_base=4, and sched.minibatch_gpu_base=2, but it still gives errors:

tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
  (0) Out of range: End of sequence
	 [[{{node GPU0/DataFetch/IteratorGetNext}}]]
  (1) Out of range: End of sequence
	 [[{{node GPU0/DataFetch/IteratorGetNext}}]]
	 [[GPU0/DataFetch/IteratorGetNext/_2837]]
0 successful operations.
0 derived errors ignored.
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
  (0) Out of range: End of sequence
	 [[node GPU0/DataFetch/IteratorGetNext (defined at /tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Out of range: End of sequence
	 [[node GPU0/DataFetch/IteratorGetNext (defined at /tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[GPU0/DataFetch/IteratorGetNext/_2837]]
0 successful operations.
0 derived errors ignored.

So confused... there are just 2.5k images. 😮
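
One way to rule out an empty or truncated file is to count the records it contains. The following is a minimal sketch that walks the TFRecord framing (8-byte length, 4-byte length CRC, payload, 4-byte payload CRC) using only the standard library, so it does not depend on the installed TensorFlow version. It assumes an uncompressed file and does not verify the CRCs; the function name count_tfrecord_records and the demo file are hypothetical.

```python
import os
import struct
import tempfile


def count_tfrecord_records(path):
    """Count records in an uncompressed TFRecord file without TensorFlow.

    CRC fields are skipped, not verified.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)            # uint64 little-endian payload length
            if not header:
                break                     # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header")
            (length,) = struct.unpack("<Q", header)
            f.seek(4, os.SEEK_CUR)        # skip CRC of the length field
            f.seek(length, os.SEEK_CUR)   # skip the payload itself
            if len(f.read(4)) < 4:        # CRC of the payload
                raise ValueError("truncated record")
            count += 1
    return count


def _fake_record(payload):
    # Synthetic record with zeroed CRCs, for demonstration only.
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


# Demo on a tiny synthetic file with three records.
fd, demo_path = tempfile.mkstemp(suffix=".tfrecords")
with os.fdopen(fd, "wb") as f:
    f.write(_fake_record(b"a") + _fake_record(b"bb") + _fake_record(b"ccc"))
print(count_tfrecord_records(demo_path))  # 3
os.remove(demo_path)
```

Running count_tfrecord_records on the real file in the data directory should report roughly the number of source images. A much smaller count would suggest that the conversion step did not write everything.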

@zzz105120

Hello, may I ask if you have solved this problem?

@mingqizhang

I have the same problem. Have you solved it? @AlexCHEU

@hongruihuang

Hello, may I ask if you have solved this problem?
