I am a beginner. I tried to run a training job on a VM equipped with an A100 for testing, but across several attempts it always failed with a "nan loss encountered" error. My training configuration file is attached: OLMo-1B.yaml.txt
Can you say more about your dataset and tooling used? Is it with the OLMo repo? What model are you training from?
Thank you for your reply.
I followed the training section of the README. The model is OLMo-1B, and I downloaded the data locally as described there; because this was only a test, I downloaded just a small part of the data. This information can be confirmed in the OLMo-1B.yaml file I attached.
Versions
Python 3.10.13
-e git+https://github.com/allenai/OLMo.git@666da70fbd89b50141a3e521ecc0d4e27b351004#egg=ai2_olmo
aiohttp==3.9.3
aiosignal==1.3.1
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.2.0
beaker-gantry==0.22.2
beaker-py==1.26.2
black==23.12.1
boltons==23.1.1
boto3==1.34.64
botocore==1.34.64
build==1.1.1
cached_path==1.6.2
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
cryptography==42.0.5
datasets==2.18.0
dill==0.3.8
docker==6.1.3
docker-pycreds==0.4.0
docutils==0.20.1
exceptiongroup==1.2.0
face==20.1.1
filelock==3.13.1
frozenlist==1.4.1
fsspec==2024.2.0
ftfy==6.2.0
gitdb==4.0.11
GitPython==3.1.42
glom==23.5.0
google-api-core==2.17.1
google-auth==2.28.2
google-cloud-core==2.4.1
google-cloud-storage==2.15.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.63.0
huggingface-hub==0.21.4
idna==3.6
importlib_metadata==7.0.2
iniconfig==2.0.0
isort==5.12.0
jaraco.classes==3.3.1
jeepney==0.8.0
Jinja2==3.1.3
jmespath==1.0.1
joblib==1.3.2
keyring==24.3.1
lightning-utilities==0.10.1
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
more-itertools==10.2.0
mpmath==1.3.0
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
mypy==1.3.0
mypy-extensions==1.0.0
necessary==0.4.3
networkx==3.2.1
nh3==0.2.15
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
packaging==24.0
pandas==2.2.1
pathspec==0.12.1
petname==2.6
pkginfo==1.10.0
platformdirs==4.2.0
pluggy==1.4.0
protobuf==4.25.3
psutil==5.9.8
pyarrow==15.0.1
pyarrow-hotfix==0.6
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.6.4
pydantic_core==2.16.3
Pygments==2.17.2
pyproject_hooks==1.0.0
pytest==8.1.1
pytest-sphinx==0.6.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
readme_renderer==43.0
regex==2023.12.25
requests==2.31.0
requests-toolbelt==1.0.0
requirements-parser==0.5.0
rfc3986==2.0.0
rich==13.7.1
rsa==4.9
ruff==0.3.3
s3transfer==0.10.1
safetensors==0.4.2
scikit-learn==1.4.1.post1
scipy==1.12.0
SecretStorage==3.3.3
sentry-sdk==1.42.0
setproctitle==1.3.3
six==1.16.0
smart-open==7.0.1
smashed==0.21.5
smmap==5.0.1
sympy==1.12
threadpoolctl==3.3.0
tokenizers==0.15.2
tomli==2.0.1
torch==2.2.1
torchmetrics==1.3.1
tqdm==4.66.2
transformers==4.38.2
triton==2.2.0
trouting==0.3.3
twine==5.0.0
types-setuptools==69.2.0.20240316
typing_extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
wandb==0.16.4
wcwidth==0.2.13
websocket-client==1.7.0
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4
zipp==3.18.1