Does Ollama currently plan to support multiple acceleration frameworks?
We understand that Ollama currently relies on llama.cpp for inference acceleration, which supports only Llama-architecture models. The GLM family makes some modifications to that architecture, so its models cannot run as-is.
We are very keen on seeing the GLM ecosystem supported by C++ inference capabilities. To this end, we have developed the following design proposal and would like to ask whether Ollama has plans to advance this work.
Ollama Project Integration with ChatGLM and CogVLM
The Ollama project is built on the llama.cpp acceleration framework and packages it as a one-click run tool. It leverages llama.cpp's inference and conversational capabilities and, on top of them, adds a service-distribution and execution layer: users pull quantized models from a remote registry and run them with a local client. The project currently supports Linux, macOS, and Windows, and the llama.cpp inference code accelerates inference on mainstream hardware.
Objective
To use Ollama's service-distribution mechanism to distribute models such as ChatGLM and CogVLM from the server side, supporting execution on multiple platforms (Linux/macOS/Windows).
Ollama Project Design Description
The Ollama framework relies on Go's cgo feature to build a local client runtime. It compiles the llama.cpp executables via cgo and exposes the HTTP server that llama.cpp provides. Go and C are connected through .h header files, which also enables model-quantization support. On top of this, a command module receives user commands and invokes the HTTP service from Go to maintain model instances. The Go layer additionally contains code for model management and retrieval.
Call Relationship Diagram
```mermaid
graph TD;
  A[cgo Layer] --> B[llama.cpp server];
  B --> C[httpserver Service];
  C --> D[.h Files];
  D --> E[Quantization Support];
  A --> F[Command Module];
  F --> C;
  F --> G[Go Execution];
  G --> H[Model Management];
  G --> I[Model Retrieval];
  G --> J[Task Push];
  E --> C;
  subgraph Ollama Framework
    A;
    B;
    C;
    D;
    E;
    F;
    G;
    H;
    I;
  end
  K[Compilation Adaptation];
  K --> L[llama.cpp server];
  L --> M[Task Scheduling];
```
Design Proposal
Based on the design shown above, we investigated the ChatGLM.cpp repository, which provides quantization support on top of the GGML inference stack. On this basis, we can write a GLM server executor, perform some adaptation at the compilation layer, check the compatibility of the llama.cpp and ChatGLM.cpp .h headers, and schedule the corresponding task allocation.
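A minimal sketch of the task-scheduling idea at the Go layer: dispatch each model to the server executor compiled for its architecture. The architecture strings and binary names below are illustrative assumptions, not actual Ollama registry metadata.

```go
package main

import (
	"fmt"
	"strings"
)

// serverBinary picks which compiled server executor should host a model,
// based on its architecture string. Names here are hypothetical.
func serverBinary(arch string) string {
	switch {
	case strings.HasPrefix(arch, "chatglm"), strings.HasPrefix(arch, "glm"):
		return "chatglm-server" // built from ChatGLM.cpp
	default:
		return "llama-server" // built from llama.cpp
	}
}

func main() {
	for _, arch := range []string{"llama", "chatglm2", "glm4"} {
		fmt.Printf("%s -> %s\n", arch, serverBinary(arch))
	}
}
```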
```mermaid
graph TD;
  A[cgo Layer];
  A --> F[Command Module];
  F --> C[llama.cpp & chatglm.cpp header];
  C --> D[.h Files];
  D --> E[Quantization Support];
  F --> G[Go Execution];
  G --> H[Model Management];
  G --> I[Model Retrieval];
  G --> J[Task Push];
  E --> C;
  subgraph Ollama Framework
    A;
    C;
    D;
    E;
    F;
    G;
    H;
    I;
  end
  K[Compilation Adaptation];
  K --> L[llama.cpp & chatglm.cpp server];
  L --> M[Task Scheduling];
```