
Falcon 40B: too slow and random answers #204

Open · ArnaudHureaux opened this issue Jun 6, 2023 · 7 comments
Labels: question (Further information is requested)

@ArnaudHureaux

Hi,
When I deployed the Falcon 40B model on the Basaran WebUI, I got:
- random answers; for example, when I said "hi", I got: "był AbramsPlayEvent磨}$,ocempreferred LaceKUZOOOoodlesWCHawaiiVEsecured cardvue ..."
- very slow inference, even though I was using a RunPod server costing $10 per hour with 4 A100 80GB GPUs

I tried to customize the settings like this:

kwargs = {
    "local_files_only": local_files_only,
    "trust_remote_code": trust_remote_code,
    "torch_dtype": torch.bfloat16,
    "device_map": "auto",
}

  • I used half precision, but nothing changed.

Any idea how I could handle this issue?

Thanks (and congrats on this beautiful WebUI!)

@peakji peakji added the bug Something isn't working label Jun 7, 2023
@peakji peakji added question Further information is requested and removed bug Something isn't working labels Jun 7, 2023
@peakji
Member

peakji commented Jun 7, 2023

Hi @ArnaudHureaux! I haven't used RunPod before, and there could be multiple reasons for this issue:

  1. Falcon models seem to require PyTorch 2.0, while Basaran's images use version 1.13.1.

  2. The custom settings you mentioned are not in the format accepted by Basaran. Options supported by Basaran can be found in the Dockerfile.

We will attempt to reproduce the issue using tiiuae/falcon-40b on our local machine later.
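
For anyone who wants to try reproducing this outside of Basaran, here is a minimal sketch using plain transformers (this is an editorial illustration, not Basaran's code path; it assumes PyTorch 2.x, accelerate installed, and enough GPU memory for the bfloat16 weights):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # Falcon ships custom modeling code
    torch_dtype=torch.bfloat16,
    device_map="auto",        # let accelerate spread layers across GPUs
)

inputs = tokenizer("hi", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))

If this standalone run produces coherent text while Basaran does not, the problem is likely in the serving layer rather than the model itself.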

@jgcb00

jgcb00 commented Jun 8, 2023

Hi,
The Falcon model is pretty bad with very short prompts like "hi", "hello", etc.; you often get exactly that kind of output. If you ask a longer question, you will get a proper answer. It's not related to the Basaran implementation.

@ArnaudHureaux
Author

In my case, the answer was totally random, with messages like "był AbramsPlayEvent磨}$,ocempreferred LaceKUZOOOoodlesWCHawaiiVEsecured cardvue ..."?

I didn't see this behavior with other implementations, so I think the problem comes from this implementation?

@jgcb00

jgcb00 commented Jun 8, 2023

Using only Hugging Face, I got the same result with load_in_8bit=True:

Question: hi
Answer:  (4).

'I don't think I'll ever be able to forget you.'

or:

Question: hi
Answer:  
It seems that the error is caused by a problem with your `onRequestSuccess` function. Specifically, the error message mentions that the function is returning an undefined value, and it seems like the `onRequestSuccess` is trying to return before the response from the server has been read.

To fix this error, you can try modifying the `onRequestSuccess` function to use Promises instead of callbacks. Instead of using `callback` to pass data to the next function, you can use `return` statements to return Promises.

Here's an example:


function onRequestSuccess(response) {
   return new Promise(function(resolve, reject) {
      console.log(response);

      // Parse JSON
      if (response.data && response.data.hasOwnProperty('success')) {
         resolve(response);
      } else {
         reject(response);
      }
   });
}

function onError(error) {
   console.log('Error:', error);
}

function sendRequest() {
  var requestData = { "username": "myusername", "password": "
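
(The generated answer above appears to be cut off mid-generation.) For reference, a loading sketch along the lines described here; it differs from the bfloat16 load shown earlier only in the quantization flag (an assumption: bitsandbytes is installed, and load_in_8bit is accepted by transformers releases of that era):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    load_in_8bit=True,    # bitsandbytes 8-bit quantization
    device_map="auto",
)

inputs = tokenizer("hi", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))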

@0xDigest

0xDigest commented Jun 9, 2023

If it helps:
I updated the Dockerfile to use nvcr.io/nvidia/pytorch:23.05-py3 and was able to load the model referenced above and run inference. I can confirm that it runs slowly for me, but I attribute that to the model not loading onto the GPUs, even in 8-bit mode, which should be able to run with just 45GB of RAM per https://huggingface.co/blog/falcon#fine-tuning-with-peft. I don't see the same quality issues as @ArnaudHureaux; to me that looks like a tokenizer problem, maybe?

Inference with a short prompt:

~/basaran$ curl -w 'Total: %{time_total}s\n' http://127.0.0.1/v1/completions -H 'Content-Type: application/json' -d '{ "prompt": ["once upon a time,"], "echo": true }'

{"id":"cmpl-8ba3deeed1b838469f2a0d6e","object":"text_completion","created":1686333906,"model":"/models/falcon-40b","choices":[{"text":"once upon a time, spring 2011 was going to be the beginning of the bandeau bikini.","index":0,"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"completion_tokens":21,"total_tokens":26}}
Total: 274.909453s

GPUs when loaded:

| 22%   25C    P8               14W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   25C    P8               15W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   26C    P8               15W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   26C    P8               15W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   24C    P8               13W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   25C    P8               15W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   23C    P8               14W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   23C    P8               15W / 250W|      6MiB / 12288MiB |      0%      Default |
| 22%   24C    P8               14W / 250W|      6MiB / 12288MiB |      0%      Default |
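
The ~6MiB per GPU in the nvidia-smi output above suggests the weights never left the CPU, which would explain the slow inference. One quick way to check where accelerate placed the model (a sketch, assuming you can load the model the same way from a Python shell inside the container):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    trust_remote_code=True,
    load_in_8bit=True,
    device_map="auto",
)

# hf_device_map records the device chosen for each module; if every entry
# says "cpu", accelerate did not pick up the GPUs.
print(model.hf_device_map)
print(torch.cuda.memory_allocated() / 1e9, "GB allocated on cuda:0")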

@Louanes1

Am I the only one who encountered an error saying I need to install the "einops" library when trying to deploy the Falcon 40B model? This library is not part of the requirements.txt of version 0.19.0.

@jgcb00

jgcb00 commented Jun 20, 2023

einops is only used by the Falcon model; it should not be a requirement for the package.
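
Until it is handled as an optional dependency, installing it manually (pip install einops) before loading Falcon should work. A small guard one could add is sketched below; the check itself is hypothetical and not part of Basaran:

try:
    import einops  # noqa: F401  # required by Falcon's remote modeling code
except ImportError as e:
    raise SystemExit(
        "Falcon's custom modeling code requires einops; "
        "run `pip install einops` first."
    ) from e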
