
Mamba #694

Open
JoaoVictorVP opened this issue Apr 25, 2024 · 10 comments
Comments

@JoaoVictorVP

Is Mamba already supported in the version of llama.cpp that this library uses?

(ggerganov/llama.cpp#5328)

@martindevans
Collaborator

The binaries in the latest release (0.11.1) are a little too old. The ones in the master branch were compiled after that PR was merged, so in theory they should include mamba support. I'd be interested to hear how that goes if you try it!

@JoaoVictorVP
Author

Oh, I'm using 0.11.2 (from NuGet). I tried copying the binaries from master and replacing the ones in /bin with them.
Surprisingly, it loaded the model, which was not the case before. But it crashed without printing any errors, with exit code -1073740791, about 3 seconds after I started an inference; I'm not sure whether that is because I'm still on the 0.11.2 package or not. Just after the inference started and before crashing, it also printed this to the console:

GGML_ASSERT: D:\a\LLamaSharp\LLamaSharp\llama.cpp:10282: n_threads > 0
(I tested with different values for 'Threads' in ModelParams.)
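For reference, this is roughly how I'm setting it up (a minimal sketch against the 0.11-era LLamaSharp API; the model path and values are placeholders):

```csharp
using LLama;
using LLama.Common;

// Sketch only: 0.11-era LLamaSharp API, placeholder model path.
var parameters = new ModelParams("path/to/mamba-model.gguf")
{
    ContextSize = 2048,
    Threads = 8   // explicitly positive, yet the n_threads > 0 assert still fires
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
```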

Is a new build of the NuGet package from master needed in this case?

@martindevans
Collaborator

I tried copying the binaries from master

That won't work, I'm afraid. The llama.cpp API is unstable, so every time the binaries are updated there are various internal changes on the C# side to work with the changed API. You always need to use the correct set of binaries with the correct version of the C# code.
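For example, the main package and the backend package in your project file should always be pinned to the same version (the version number below is only illustrative), or both should come from the same local build:

```xml
<ItemGroup>
  <!-- Both packages must come from the same release (or the same local build). -->
  <PackageReference Include="LLamaSharp" Version="0.11.2" />
  <PackageReference Include="LLamaSharp.Backend.Cpu" Version="0.11.2" />
</ItemGroup>
```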

@JoaoVictorVP
Author

Yep. I just compiled the main package and the CPU backend, and the results were the same: same exit code and the same assertion log.
Maybe I could inspect the sources this weekend to try to find the cause, or do you have any ideas about why this is happening?

@martindevans
Collaborator

I don't have any ideas at the moment. I know Mamba is a bit of an unusual architecture, just because I've seen various comments inside llama.cpp about how certain APIs need to be adjusted for Mamba, or don't quite make sense in a Mamba context. We'd definitely be interested in any investigations/PRs for Mamba support!

@JoaoVictorVP
Author

JoaoVictorVP commented Apr 25, 2024

Oops, correction.

It actually worked. I suspected the problem was that NuGet was caching the package (0.11.2) from the remote feed (because I built the project with the same 0.11.2 version number), so I deleted the cache and now it works.

The outputs are very strange though, but I suspect this is because I'm not formatting the inputs yet (for the tests), see here:
[screenshot of the model output]

Also, the token limit is not working, so I implemented my own limit with the output transformer for these tests.
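For context, this is roughly how I expected the limit to apply (a sketch against the 0.11-era API; the model path, prompt, and stop string are placeholders):

```csharp
using System;
using LLama;
using LLama.Common;

// Sketch only: 0.11-era LLamaSharp API; path, prompt and stop string are placeholders.
var parameters = new ModelParams("path/to/mamba-hermes-3b.Q4_K_M.gguf") { ContextSize = 2048 };
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

var inferenceParams = new InferenceParams
{
    MaxTokens = 128,                 // the built-in limit that did not seem to take effect here
    AntiPrompts = new[] { "User:" }  // hypothetical stop string, depends on the prompt format
};

await foreach (var token in executor.InferAsync("User: Hello!\nAssistant:", inferenceParams))
{
    Console.Write(token);
}
```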

(This is a very small model as well, but compared to something like Phi-3 it is very crude.)
(On the other hand, even with weird responses the time to the first token does not increase absurdly like it does with the Phi-3 model, so it seems like at least a partial win.)

@AsakusaRinne
Collaborator

I suspected the problem was that NuGet was caching the package (0.11.2) from the remote feed (because I built the project with the same 0.11.2 version number), so I deleted the cache and now it works.

Yes, NuGet caches the package and will not pick up your locally compiled one if it has the same version tag.

The outputs are very strange

That's unexpected. What prompt were you using? If you have cmake installed on your PC, you could also try running the same model and prompt directly in llama.cpp to see if the output is still a mess.
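Something along these lines (sketch only; the model path and prompt are placeholders, and the second command uses llama.cpp's main example binary):

```sh
# clear the local NuGet caches so the locally built package is picked up
dotnet nuget locals all --clear

# run the same model and prompt directly with llama.cpp's example binary
./main -m path/to/mamba-hermes-3b.Q4_K_M.gguf -p "User: Hello!\nAssistant:" -n 128
```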

@JoaoVictorVP
Author

JoaoVictorVP commented Apr 27, 2024

That's unexpected. What prompt were you using? If you have cmake installed on your PC, you could also try running the same model and prompt directly in llama.cpp to see if the output is still a mess.

About this: the model was one of the few I was able to find on Hugging Face in GGUF format that was actually a Mamba model (MambaHermes 3B).

I tested it with the same formatting, using the same processor I made for Phi-3, and it also "kinda worked" (the responses were very short, but more coherent). I also got it working a little better with the version quantized to 6 bits instead of 4.

But I noticed something a little strange: is there something in the implementation of llama.cpp that makes models run progressively slower? I thought it was because I was using transformer-based models before, but even with Mamba the time to the first token often increases absurdly with each message (like, from 1 second to the first token, to 5, then 10, then 26, etc.).
(I'm asking because I later tested the same Phi-3 model [not the Mamba one yet] in LM Studio and the time to the first token did not change much, more like 1-3 seconds per message at most.)

One of my tests where the models performed reasonably well:
Q6 https://gist.github.com/JoaoVictorVP/92f6f30ad9d3c3dc343fdf0d7685685f
Q4 https://gist.github.com/JoaoVictorVP/f4de9ee658108898eaefa2c58c37938d

@AsakusaRinne
Collaborator

is there something in the implementation of llama.cpp that makes models run progressively slower?

AFAIK, there's no such thing in llama.cpp. Could you please post the Hugging Face model link here so that we can try to reproduce this case?

(I'm asking because I later tested the same Phi-3 model [not the Mamba one yet] in LM Studio and the time to the first token did not change much, more like 1-3 seconds per message at most.)

Though LM Studio is not open source, if I remember correctly it also uses llama.cpp as the backend. As you mentioned above, Phi-3 works well in LM Studio while Mamba becomes slower in llama.cpp. That doesn't necessarily mean it's a problem in llama.cpp; it could also be a problem with the model. Could you please try Mamba in LM Studio, or try Phi-3 with llama.cpp/LLamaSharp?

@martindevans
Collaborator

You'll get a progressive slowdown if you are using a stateless executor and submitting a larger and larger chat history each time. The stateful executors store the chat history internally and should take around the same time for every token. I'm not sure exactly how the situation differs for Mamba, but it should be roughly the same AFAIK.
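Roughly, the difference looks like this (a minimal sketch, placeholder model path):

```csharp
using LLama;
using LLama.Common;

// Sketch only: placeholder model path.
var parameters = new ModelParams("path/to/model.gguf") { ContextSize = 2048 };
using var weights = LLamaWeights.LoadFromFile(parameters);

// Stateless: each call re-evaluates the entire (growing) transcript you pass in,
// so the time to the first token grows with every message.
var stateless = new StatelessExecutor(weights, parameters);

// Stateful: the executor keeps the already-evaluated history in its context,
// so each new message only has to evaluate the newly added text.
using var context = weights.CreateContext(parameters);
var stateful = new InteractiveExecutor(context);
```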
