
Merging various fixes for Colab, Cloud TPU, TPU Pod, ... #247

Open

vochicong wants to merge 11 commits into master

Conversation

@vochicong commented Nov 20, 2019

Including #121 and #239, among other fixes.

bzantium and others added 11 commits July 4, 2019 17:11
When restarting training, since prev_step is -1, curr_loss for the first print would be calculated incorrectly.
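A hedged sketch of the kind of guard this describes (the names `prev_step`, `curr_step`, `prev_loss`, and `total_loss` are assumptions based on `train.py`'s logging loop, not the exact diff):

    # prev_step starts at -1; after restoring from a checkpoint, curr_step
    # resumes at the saved global step, so (curr_step - prev_step) would span
    # the entire previous run and the first printed loss would be wrong.
    if prev_step < 0:
        # First iteration after a (re)start: record the step, skip the print.
        prev_step, prev_loss = curr_step, total_loss
    else:
        curr_loss = (total_loss - prev_loss) / (curr_step - prev_step)
        tf.logging.info("step {} | loss {:.4f}".format(curr_step, curr_loss))
        prev_step, prev_loss = curr_step, total_loss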
I ran `train.py` on a TPU pod v3-256 and got the following error:

    ValueError: TPUConfig.num_shards is not set correctly ....

Found in https://cloud.google.com/tpu/docs/training-on-tpu-pods#providing_the_tpu_name_and_region_to_tpuclusterresolver that
> For single device training, you can specify either the TPU name or an IP address, for example: `grpc://1.2.3.4:8470`.
> For TPU Pods you must use the TPU name so that TensorFlow can discover the IP addresses of all the hosts available for training distribution.

So, in the case of a TPU pod, setting `master` doesn't work. I tried setting `cluster` instead, and it worked: all 32 hosts in the TPU pod were detected and used correctly.
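A minimal sketch of that change, in the TF 1.x style this repo uses (the flag names `FLAGS.tpu`, `FLAGS.tpu_zone`, `FLAGS.gcp_project`, `FLAGS.model_dir`, `FLAGS.iterations`, `FLAGS.num_hosts`, and `FLAGS.num_core_per_host` are assumptions, not verified against `train.py`):

    import tensorflow as tf

    # Resolve the TPU by *name*; for pods a grpc://IP:port address is not
    # enough, since TensorFlow must discover every host (see the docs above).
    tpu_cluster = tf.contrib.cluster_resolver.TPUClusterResolver(
        tpu=FLAGS.tpu, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)

    run_config = tf.contrib.tpu.RunConfig(
        cluster=tpu_cluster,  # instead of master=tpu_cluster.get_master()
        model_dir=FLAGS.model_dir,
        tpu_config=tf.contrib.tpu.TPUConfig(
            iterations_per_loop=FLAGS.iterations,
            num_shards=FLAGS.num_hosts * FLAGS.num_core_per_host))

With `cluster` set, the resolver can enumerate all hosts in the pod, which appears to be why the `TPUConfig.num_shards` error goes away.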
@vochicong vochicong changed the title Combine #121 and #239 Merging various fixes #121, #239, ... Nov 21, 2019
@vochicong vochicong changed the title Merging various fixes #121, #239, ... Merging various fixes for Colab, Cloud TPU, TPU Pod, ... Nov 21, 2019

@LifeIsStrange

You should contribute to the XLNet implementation in the transformers library:
https://huggingface.co/transformers/model_doc/xlnet.html
It is the de facto standard.
