
Bloomz 176B inference doesn't work #15

Open

agemagician opened this issue Mar 18, 2023 · 9 comments

@agemagician

Hello,

I have converted the bloomz model successfully, but inference doesn't work.

 ./main -m ./models/ggml-model-bloomz-f16.bin -t 8 -n 128
main: seed = 1679167152
bloom_model_load: loading model from './models/ggml-model-bloomz-f16.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 14336
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 112
bloom_model_load: n_layer = 70
bloom_model_load: f16     = 1
bloom_model_load: n_ff    = 57344
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 333257.61 MB
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 349847586752, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 349847931776, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351081229760, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351081459328, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 350670590144, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 349848678784, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351081976768, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351082206336, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351493305664, available 349445931264)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 351493305664, available 349445931264)
Segmentation fault (core dumped)

I have enough CPU memory (420 GB). Any idea what the issue is?
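
For context on the failure mode: ggml reserves one fixed memory pool up front, sized by the loader's estimate, and every tensor is carved out of that pool, so ggml_new_tensor_impl fails when the estimate comes up short even with plenty of free system RAM. In the log above the pool is 349445931264 bytes (the reported 333257.61 MB), yet allocations ask for up to 351493305664 bytes, roughly a 2 GB shortfall. A minimal sketch of that arithmetic, with a hypothetical headroom workaround, using the numbers from the log:

/* Sketch: why ggml_new_tensor_impl fails despite free system RAM.
   ggml_init() reserves a single fixed pool of mem_size bytes and every
   tensor is carved from it; if the loader's estimate is short, allocation
   fails mid-load. Numbers are taken from the log above; the 4 GB pad is a
   hypothetical workaround, not the upstream fix. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t pool_size  = 349445931264ULL; /* "available" in the log (= 333257.61 MB) */
    const uint64_t max_needed = 351493305664ULL; /* largest "needed" in the log */

    printf("shortfall: %.2f MB\n",
           (max_needed - pool_size) / (1024.0 * 1024.0));

    /* Workaround: pad the ctx-size estimate in bloom_model_load before
       it is handed to ggml_init() as mem_size. */
    const uint64_t pad = 4ULL * 1024 * 1024 * 1024; /* 4 GB headroom (hypothetical) */
    printf("padded pool: %.2f MB\n",
           (pool_size + pad) / (1024.0 * 1024.0));
    return 0;
}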

@laurentperez

Out of curiosity, and adding a question on top of yours: how much of your 420 GB of RAM did you use to convert to ggml? I barely managed to convert bloomz-7b1 with 32 GB of RAM, so I wonder how much the 176B model needs.

@agemagician
Author

> Out of curiosity, and adding a question on top of yours: how much of your 420 GB of RAM did you use to convert to ggml? I barely managed to convert bloomz-7b1 with 32 GB of RAM, so I wonder how much the 176B model needs.

All of it, plus approx. 30 GB of virtual memory.

@bil-ash

bil-ash commented Mar 20, 2023

It seems you are running out of memory. Most probably I can help reduce the memory usage to about 1/6th (this was successful with the 7b1 model). What is the model size (disk usage) of the 176B model?
Please share a link to download the quantized model, because my server does not have the RAM (>400 GB) to quantize the 176B model. I will then see if I am able to run it.
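
As a hedged aside, the 1/6th figure presumably comes from 4-bit quantization. Assuming this repo ships a quantize tool in the llama.cpp style, where the trailing 2 selects the q4_0 format (matching the q4_0 filename in the run further down), the invocation would look roughly like:

./quantize ./models/ggml-model-bloomz-f16.bin ./models/ggml-model-bloomz-f16-q4_0.bin 2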

@agemagician
Author

The disk size of the model is approx. 360 GB.
Unfortunately, quantization doesn't work; please see:
huggingface/optimum#901

I don't think it is an out-of-memory problem, as there is 420 GB of main memory plus 50 GB of swap.

@ZhangYunchenY

Same question here, even though I have 1000 GB of RAM.

@barsuna

barsuna commented Apr 2, 2023

./main -m models/bloom/ggml-model-bloom-f16-q4_0.bin -t 96 -p "The most beautiful question is" -n 20
main: seed = 1680447842
bloom_model_load: loading model from 'models/bloom/ggml-model-bloom-f16-q4_0.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx = 512
bloom_model_load: n_embd = 14336
bloom_model_load: n_mult = 1
bloom_model_load: n_head = 112
bloom_model_load: n_layer = 70
bloom_model_load: f16 = 2
bloom_model_load: n_ff = 57344
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 106877.59 MB
bloom_model_load: memory_size = 3920.00 MB, n_mem = 35840
bloom_model_load: loading model part 1/1 from 'models/bloom/ggml-model-bloom-f16-q4_0.bin'
bloom_model_load: ......................................................................................................... done
bloom_model_load: model size = 107237.48 MB / num tensors = 846

main: prompt: 'The most beautiful question is'
main: number of tokens in prompt = 5
2175 -> 'The'
6084 -> ' most'
40704 -> ' beautiful'
5893 -> ' question'
632 -> ' is'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

The most beautiful question is the one you ask yourself.
What are we doing here?
I don't understand this at all!
L

main: mem per token = 192093564 bytes
main: load time = 65292.77 ms
main: sample time = 498.68 ms
main: predict time = 407606.25 ms / 16983.59 ms per token
main: total time = 537545.81 ms
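
For what it's worth, the numbers line up with q4_0's storage cost: each block of 32 weights is stored as 32 × 4 bits plus one fp32 scale, i.e. about 5 bits per weight versus fp16's 16, and indeed 333257.61 MB / 106877.59 MB ≈ 3.1.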

@barsuna

barsuna commented Apr 2, 2023

The above was produced with this commit:
barsuna@2d0e478

@bozo32

bozo32 commented May 28, 2023

I have a cluster running Scientific Linux with basically unlimited RAM but 4×15 GB of VRAM that I can test things on. If anybody gets a GGML model that is worth testing, tell me.

@linuxmagic-mp

Getting lost in this thread. I just converted the 176B model to GGML (fp16) and am now looking at using bloom.cpp, but I noticed that @barsuna's README appears to reflect that there are still problems. Could we get a status update? It doesn't look like his code was submitted as a pull request, or that this code has been updated to solve the issue, but I am not sure.
