eval.py hangs when config yaml's model hparams don't match model checkpoint hparams #755

Open
growlix opened this issue Nov 21, 2023 · 0 comments
Labels
bug Something isn't working

growlix commented Nov 21, 2023

Environment

0: Collecting system information...
0: ---------------------------------
0: System Environment Report
0: Created: 2023-11-21 21:17:06 UTC
0: ---------------------------------
0:
0: PyTorch information
0: -------------------
0: PyTorch version: 2.1.0+cu121
0: Is debug build: False
0: CUDA used to build PyTorch: 12.1
0: ROCM used to build PyTorch: N/A
0:
0: OS: Ubuntu 20.04.6 LTS (x86_64)
0: GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
0: Clang version: Could not collect
0: CMake version: version 3.16.3
0: Libc version: glibc-2.31
0:
0: Python version: 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0] (64-bit runtime)
0: Python platform: Linux-5.15.0-1047-aws-x86_64-with-glibc2.31
0: Is CUDA available: True
0: CUDA runtime version: 11.8.89
0: CUDA_MODULE_LOADING set to: LAZY
0: GPU models and configuration:
0: GPU 0: NVIDIA A100-SXM4-40GB
0: GPU 1: NVIDIA A100-SXM4-40GB
0: GPU 2: NVIDIA A100-SXM4-40GB
0: GPU 3: NVIDIA A100-SXM4-40GB
0: GPU 4: NVIDIA A100-SXM4-40GB
0: GPU 5: NVIDIA A100-SXM4-40GB
0: GPU 6: NVIDIA A100-SXM4-40GB
0: GPU 7: NVIDIA A100-SXM4-40GB
0:
0: Nvidia driver version: 535.104.12
0: cuDNN version: Probably one of the following:
0: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
0: HIP runtime version: N/A
0: MIOpen runtime version: N/A
0: Is XNNPACK available: True
0:
0: CPU:
0: Architecture: x86_64
0: CPU op-mode(s): 32-bit, 64-bit
0: Byte Order: Little Endian
0: Address sizes: 46 bits physical, 48 bits virtual
0: CPU(s): 48
0: On-line CPU(s) list: 0-47
0: Thread(s) per core: 1
0: Core(s) per socket: 24
0: Socket(s): 2
0: NUMA node(s): 2
0: Vendor ID: GenuineIntel
0: CPU family: 6
0: Model: 85
0: Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
0: Stepping: 7
0: CPU MHz: 2999.998
0: BogoMIPS: 5999.99
0: Hypervisor vendor: KVM
0: Virtualization type: full
0: L1d cache: 1.5 MiB
0: L1i cache: 1.5 MiB
0: L2 cache: 48 MiB
0: L3 cache: 71.5 MiB
0: NUMA node0 CPU(s): 0-23
0: NUMA node1 CPU(s): 24-47
0: Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
0: Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
0: Vulnerability L1tf: Mitigation; PTE Inversion
0: Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
0: Vulnerability Meltdown: Mitigation; PTI
0: Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
0: Vulnerability Retbleed: Vulnerable
0: Vulnerability Spec rstack overflow: Not affected
0: Vulnerability Spec store bypass: Vulnerable
0: Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
0: Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
0: Vulnerability Srbds: Not affected
0: Vulnerability Tsx async abort: Not affected
0: Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
0:
0: Versions of relevant libraries:
0: [pip3] numpy==1.26.0
0: [pip3] pytorch-ranger==0.1.1
0: [pip3] torch==2.1.0
0: [pip3] torch-optimizer==0.3.0
0: [pip3] torchmetrics==1.0.3
0: [pip3] torchvision==0.15.2+cu118
0: [pip3] triton==2.1.0
0: [pip3] triton-pre-mlir==2.0.0
0: [conda] Could not collect
0:
0:
0: Composer information
0: --------------------
0: Composer version: 0.16.4
0: Composer commit hash: None
0: Host processor model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
0: Host processor core count: 48
0: Number of nodes: 1
0: Accelerator model name: NVIDIA A100-SXM4-40GB
0: Accelerators per node: 1
0: CUDA Device Count: 8

To reproduce

Steps to reproduce the behavior:

Run

composer --world_size ${WORLD_SIZE} --nproc ${NPROC} --node_rank ${NODE_RANK} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --verbose /llm-foundry/scripts/eval/eval.py {path/to/eval_yaml}

with an eval YAML that takes its model hparams from scripts/eval/yamls/mpt_eval.yaml but points to a checkpoint generated by scripts/train/yamls/pretrain/mpt-3b.yaml.

Then wait for the NCCL timeout: the run hangs until then instead of failing.
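As a workaround until the hang is fixed, the two model sections can be compared before launching. The sketch below is only an illustration: `diff_model_hparams` is a hypothetical helper, not part of llm-foundry, and the hparam values are made up for the example rather than copied from the actual YAML files. It assumes the `model` section of each YAML has already been loaded into a plain dict.

```python
from typing import Any, Mapping


def diff_model_hparams(
    eval_cfg: Mapping[str, Any],
    train_cfg: Mapping[str, Any],
) -> dict[str, tuple[Any, Any]]:
    """Return {key: (eval_value, train_value)} for every key whose values differ."""
    missing = object()  # sentinel so that a real value of None is not confused with "absent"
    diffs: dict[str, tuple[Any, Any]] = {}
    for key in sorted(set(eval_cfg) | set(train_cfg)):
        left = eval_cfg.get(key, missing)
        right = train_cfg.get(key, missing)
        if left != right:
            diffs[key] = (
                "<missing>" if left is missing else left,
                "<missing>" if right is missing else right,
            )
    return diffs


# Illustrative values only (assumption); not copied from the actual YAML files.
eval_model = {"name": "mpt_causal_lm", "d_model": 768, "n_heads": 12, "n_layers": 12}
train_model = {"name": "mpt_causal_lm", "d_model": 2560, "n_heads": 20, "n_layers": 32}

for key, (eval_value, train_value) in diff_model_hparams(eval_model, train_model).items():
    print(f"{key}: eval config has {eval_value}, training config has {train_value}")
```

An empty result means the two model sections agree; any entry is a candidate cause for the hang described here.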

Expected behavior

eval.py should probably fail fast with an error about the mismatch between the model architecture and the checkpoint's state dict, instead of hanging until the NCCL timeout.
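One possible shape for such a check, as a minimal sketch and not the actual eval.py or Composer code path: compare parameter names and shapes before loading, and raise if they differ. `find_state_dict_mismatches` and `check_or_raise` are hypothetical names, and the parameter names and shapes in the example are illustrative. The inputs are plain name-to-shape mappings so the sketch runs without PyTorch; with PyTorch they could be built as `{k: tuple(v.shape) for k, v in model.state_dict().items()}`.

```python
from typing import Mapping, Sequence

Shape = Sequence[int]


def find_state_dict_mismatches(
    model_shapes: Mapping[str, Shape],
    checkpoint_shapes: Mapping[str, Shape],
) -> list[str]:
    """List every difference in parameter names or shapes between model and checkpoint."""
    problems: list[str] = []
    for name in sorted(set(model_shapes) - set(checkpoint_shapes)):
        problems.append(f"missing from checkpoint: {name}")
    for name in sorted(set(checkpoint_shapes) - set(model_shapes)):
        problems.append(f"unexpected in checkpoint: {name}")
    for name in sorted(set(model_shapes) & set(checkpoint_shapes)):
        expected = tuple(model_shapes[name])
        found = tuple(checkpoint_shapes[name])
        if expected != found:
            problems.append(
                f"shape mismatch for {name}: model {expected}, checkpoint {found}"
            )
    return problems


def check_or_raise(
    model_shapes: Mapping[str, Shape],
    checkpoint_shapes: Mapping[str, Shape],
) -> None:
    """Raise ValueError immediately if the checkpoint does not fit the model."""
    problems = find_state_dict_mismatches(model_shapes, checkpoint_shapes)
    if problems:
        raise ValueError(
            "Model architecture does not match checkpoint:\n" + "\n".join(problems)
        )


# Hypothetical parameter names and shapes, for illustration only.
model = {"wte.weight": (50368, 768), "blocks.0.norm.weight": (768,)}
checkpoint = {"wte.weight": (50368, 2560), "blocks.0.norm.weight": (2560,)}

try:
    check_or_raise(model, checkpoint)
except ValueError as err:
    print(err)
```

In a multi-process run, a check like this would need to fail on every rank (or run before any collective call); otherwise the ranks that did not raise could still wait on the one that did.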

@growlix added the bug label on Nov 21, 2023
@growlix changed the title from "eval.py hangs when config yaml hparams don't match model checkpoint hparams" to "eval.py hangs when config yaml's model hparams don't match model checkpoint hparams" on Nov 21, 2023