terminate called after throwing an instance of 'std::runtime_error' #761

Open
HalFTeen opened this issue Sep 19, 2023 · 0 comments

My environment:
- GPU: 8× GeForce RTX 2080 Ti (10 GB each)
- Driver version: 455.23.05

I get a crash when running `./bin/multi_gpu_gpt_example` following gpt_guide.md. My steps:

  1. cmake -DSM=75 -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON ..
  2. make -j12
  3. pip install -r ../examples/pytorch/gpt/requirement.txt
  4. wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P ../models
  5. wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P ../models
  6. git clone https://huggingface.co/gpt2-xl
  7. python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-xl/ -o ../models/huggingface-models/c-model/gpt2-xl -i_g 1
  8. ./bin/gpt_gemm 8 1 32 25 64 6400 50257 0 1 0
  9. ./bin/multi_gpu_gpt_example
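The conversion in step 7 should leave per-tensor weight binaries under the output directory. As a side note, a sanity check like the sketch below, run before step 9, can catch an incomplete checkpoint early (the helper and the file names are assumptions for illustration, not part of gpt_guide.md):

```shell
# Hypothetical checkpoint sanity check: verify that a converted
# FasterTransformer model directory contains the expected weight files
# before launching ./bin/multi_gpu_gpt_example.  Adjust MODEL_DIR and
# the file list to match your own conversion output.
check_weights() {
  dir="$1"; shift
  missing=0
  for f in "$@"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return $missing
}

# Demo against a scratch directory standing in for
# ../models/huggingface-models/c-model/gpt2-xl/1-gpu
MODEL_DIR=$(mktemp -d)
touch "$MODEL_DIR/model.wte.bin"
if check_weights "$MODEL_DIR" model.wte.bin model.final_layernorm.weight.bin; then
  echo "all weights present"
else
  echo "checkpoint incomplete"   # printed here: the layernorm file is absent
fi
```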
Then I get the crash:
Total ranks: 1.
P0 is running with 0 GPU.
Device GeForce RTX 2080 Ti
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free.Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.intent_and_slot.weight.bin cannot be opened, loading model fails!
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.sentiment.weight.bin cannot be opened, loading model fails!
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.squad.weight.bin cannot be opened, loading model fails!

after allocation    : free:  9.63 GB, total: 10.76 GB, used:  1.13 GB
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: invalid argument /data/wangjie/code/github/FasterTransformer/src/fastertransformer/utils/memory_utils.cu:113 

[server40:134837] *** Process received signal ***
[server40:134837] Signal: Aborted (6)
[server40:134837] Signal code:  (-6)
[server40:134837] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7ff3797e66d0]
[server40:134837] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ff378d24277]
[server40:134837] [ 2] /lib64/libc.so.6(abort+0x148)[0x7ff378d25968]
[server40:134837] [ 3] /lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc)[0x7ff39d9253df]
[server40:134837] [ 4] /lib64/libstdc++.so.6(+0x9cb16)[0x7ff39d923b16]
[server40:134837] [ 5] /lib64/libstdc++.so.6(+0x9cb4c)[0x7ff39d923b4c]
[server40:134837] [ 6] /lib64/libstdc++.so.6(__cxa_rethrow+0x0)[0x7ff39d923d28]
[server40:134837] [ 7] ./bin/multi_gpu_gpt_example[0x9041da]
[server40:134837] [ 8] ./bin/multi_gpu_gpt_example[0x478a04]
[server40:134837] [ 9] ./bin/multi_gpu_gpt_example[0x4314f1]
[server40:134837] [10] ./bin/multi_gpu_gpt_example[0x407c1f]
[server40:134837] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff378d10445]
[server40:134837] [12] ./bin/multi_gpu_gpt_example[0x42b157]
[server40:134837] *** End of error message ***
Aborted (core dumped)

Any suggestions are welcome. Thanks.
