
Evaluation issue using TPUEstimator #156

Open
nicholasbreckwoldt opened this issue Jun 3, 2020 · 2 comments
Assignees: @cweill
Labels: bug (Something isn't working)

Comments

@nicholasbreckwoldt commented Jun 3, 2020
I'm running into an issue when using the AdaNet TPUEstimator. Say, for example, the estimator is configured with max_iteration_steps=500 and I want to evaluate the model's performance during training after every 100 training steps (i.e. steps_per_evaluation=100) for 2 complete AdaNet iterations.

To achieve this, estimator.train(max_steps, train_input) followed by estimator.evaluate(eval_input) is run in a loop, incrementing max_steps by steps_per_evaluation at the end of each pass, until max_steps=1000 is reached (corresponding to 2 complete AdaNet iterations).

When running in local mode (use_tpu=False), training proceeds as expected: steps 0 to 500 for the first iteration, steps 500 to 1000 for the second, with evaluation every 100 steps. However, when running on Cloud TPU (use_tpu=True), training reaches max_steps=1000 without ever progressing to a second iteration.

On the other hand, a single call to estimator.train(max_steps=1000, train_input) on Cloud TPU, without the estimator.evaluate calls, results in 2 complete AdaNet iterations as expected. This makes me think the issue lies with the evaluation call. What could the cause be? If this is a TPUEstimator-related issue, am I then constrained to the standard Estimator if I want this kind of train-evaluate loop configuration?
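For reference, the loop described above looks roughly like the sketch below. The StubEstimator class is a hypothetical stand-in for the real adanet.TPUEstimator (assumed to be constructed elsewhere with max_iteration_steps=500); it only records the calls so the schedule of max_steps values is visible. The names train_input and eval_input are the input functions from the description.

```python
class StubEstimator:
    """Hypothetical stand-in for adanet.TPUEstimator; records calls made to it."""

    def __init__(self):
        self.calls = []

    def train(self, input_fn, max_steps):
        # The real estimator trains until the global step reaches max_steps.
        self.calls.append(("train", max_steps))

    def evaluate(self, input_fn):
        self.calls.append(("evaluate", None))


def train_input():  # placeholder for the real training input_fn
    pass


def eval_input():  # placeholder for the real evaluation input_fn
    pass


steps_per_evaluation = 100
total_steps = 1000  # 2 AdaNet iterations x max_iteration_steps=500

estimator = StubEstimator()
max_steps = steps_per_evaluation
while max_steps <= total_steps:
    # Train up to the next evaluation point, then evaluate.
    estimator.train(input_fn=train_input, max_steps=max_steps)
    estimator.evaluate(input_fn=eval_input)
    max_steps += steps_per_evaluation

# The train calls target max_steps = 100, 200, ..., 1000.
train_targets = [steps for kind, steps in estimator.calls if kind == "train"]
print(train_targets)
```

With the real estimator, the expectation is that the first five train calls fall inside AdaNet iteration 1 (steps 0-500) and the last five inside iteration 2 (steps 500-1000); the reported bug is that on Cloud TPU the second iteration never starts.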

@cweill (Contributor) commented Jul 9, 2020

@nicholasbreckwoldt: We just released adanet==0.9.0, which includes better TPU and TF 2 support. Please try installing it and let us know whether it resolves your issue.

@cweill cweill self-assigned this Jul 9, 2020
@cweill cweill added the bug Something isn't working label Jul 9, 2020
@nicholasbreckwoldt (Author) commented

@cweill Thanks for the update! I am running into a new issue with the upgrade to TF 2.2 and adanet==0.9.0 which has so far prevented me from establishing whether the above evaluation issue has been resolved. I've added a description of this new issue (#157).
