nan loss encountered #506

Open
abrahamhwj opened this issue Mar 19, 2024 · 2 comments
Labels
type/bug An issue about a bug

Comments

@abrahamhwj

🐛 Describe the bug

I am a beginner. I tried to run a training job on a VM equipped with an A100 GPU as a test, but after several attempts I always got a "nan loss encountered" error. My training configuration file is attached:
OLMo-1B.yaml.txt
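
When triaging a report like this, it can help to pin down the step at which the loss first becomes non-finite, since that narrows the search to a specific batch or optimizer update. The following is a minimal sketch, assuming the per-step loss values have already been collected into a plain Python list. The function name `first_non_finite` and the sample values are hypothetical and are not part of the OLMo code base or taken from this run.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical per-step loss values, for illustration only (not from this issue).
example_losses = [11.2, 10.7, 10.9, float("nan"), float("nan")]
print(first_non_finite(example_losses))
```

If the first non-finite value appears at the very first step, the inputs or the initialization are the more likely suspects. If it appears only after many steps, the learning rate, numeric precision settings, or a particular batch are worth a closer look.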

Versions

Python 3.10.13
-e git+https://github.com/allenai/OLMo.git@666da70fbd89b50141a3e521ecc0d4e27b351004#egg=ai2_olmo
aiohttp==3.9.3
aiosignal==1.3.1
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.2.0
beaker-gantry==0.22.2
beaker-py==1.26.2
black==23.12.1
boltons==23.1.1
boto3==1.34.64
botocore==1.34.64
build==1.1.1
cached_path==1.6.2
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
cryptography==42.0.5
datasets==2.18.0
dill==0.3.8
docker==6.1.3
docker-pycreds==0.4.0
docutils==0.20.1
exceptiongroup==1.2.0
face==20.1.1
filelock==3.13.1
frozenlist==1.4.1
fsspec==2024.2.0
ftfy==6.2.0
gitdb==4.0.11
GitPython==3.1.42
glom==23.5.0
google-api-core==2.17.1
google-auth==2.28.2
google-cloud-core==2.4.1
google-cloud-storage==2.15.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.63.0
huggingface-hub==0.21.4
idna==3.6
importlib_metadata==7.0.2
iniconfig==2.0.0
isort==5.12.0
jaraco.classes==3.3.1
jeepney==0.8.0
Jinja2==3.1.3
jmespath==1.0.1
joblib==1.3.2
keyring==24.3.1
lightning-utilities==0.10.1
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
more-itertools==10.2.0
mpmath==1.3.0
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
mypy==1.3.0
mypy-extensions==1.0.0
necessary==0.4.3
networkx==3.2.1
nh3==0.2.15
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
packaging==24.0
pandas==2.2.1
pathspec==0.12.1
petname==2.6
pkginfo==1.10.0
platformdirs==4.2.0
pluggy==1.4.0
protobuf==4.25.3
psutil==5.9.8
pyarrow==15.0.1
pyarrow-hotfix==0.6
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.6.4
pydantic_core==2.16.3
Pygments==2.17.2
pyproject_hooks==1.0.0
pytest==8.1.1
pytest-sphinx==0.6.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
readme_renderer==43.0
regex==2023.12.25
requests==2.31.0
requests-toolbelt==1.0.0
requirements-parser==0.5.0
rfc3986==2.0.0
rich==13.7.1
rsa==4.9
ruff==0.3.3
s3transfer==0.10.1
safetensors==0.4.2
scikit-learn==1.4.1.post1
scipy==1.12.0
SecretStorage==3.3.3
sentry-sdk==1.42.0
setproctitle==1.3.3
six==1.16.0
smart-open==7.0.1
smashed==0.21.5
smmap==5.0.1
sympy==1.12
threadpoolctl==3.3.0
tokenizers==0.15.2
tomli==2.0.1
torch==2.2.1
torchmetrics==1.3.1
tqdm==4.66.2
transformers==4.38.2
triton==2.2.0
trouting==0.3.3
twine==5.0.0
types-setuptools==69.2.0.20240316
typing_extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
wandb==0.16.4
wcwidth==0.2.13
websocket-client==1.7.0
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4
zipp==3.18.1

@abrahamhwj abrahamhwj added the type/bug An issue about a bug label Mar 19, 2024
@natolambert
Contributor

Can you say more about your dataset and tooling used? Is it with the OLMo repo? What model are you training from?

@abrahamhwj
Author

> Can you say more about your dataset and tooling used? Is it with the OLMo repo? What model are you training from?

Thank you for your reply.
I followed the training section of the README. The model is OLMo-1B, and I downloaded the data locally as described there. Because this was only a test, I downloaded just a small part of the data. All of this can be confirmed in the OLMo-1B.yaml file I attached.
