Releases: ollama/ollama
v0.1.29
AMD Preview
Ollama now supports AMD graphics cards in preview on Windows and Linux. All of Ollama's features can now be accelerated by AMD graphics cards, and support is included by default in Ollama for Linux, Windows, and Docker.
Supported cards and accelerators
| Family | Supported cards and accelerators |
| --- | --- |
| AMD Radeon RX | 7900 XTX, 7900 XT, 7900 GRE, 7800 XT, 7700 XT, 7600 XT, 7600, 6950 XT, 6900 XTX, 6900 XT, 6800 XT, 6800, Vega 64, Vega 56 |
| AMD Radeon PRO | W7900, W7800, W7700, W7600, W7500, W6900X, W6800X Duo, W6800X, W6800, V620, V420, V340, V320, Vega II Duo, Vega II, VII, SSG |
| AMD Instinct | MI300X, MI300A, MI300, MI250X, MI250, MI210, MI200, MI100, MI60, MI50 |
What's Changed
- `ollama <command> -h` will now show documentation for supported environment variables
- Fixed issue where generating embeddings with `nomic-embed-text`, `all-minilm` or other embedding models would hang on Linux
- Experimental support for importing Safetensors models using the `FROM <directory with safetensors model>` command in the Modelfile (see the sketch after this list)
- Fixed issue where Ollama would hang when using JSON mode
- Fixed issue where `ollama run` would error when piping output to `tee` and other tools
- Fixed an issue where memory would not be released when running vision models
- Ollama will no longer show an error message when piping to stdin on Windows
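As a sketch of the experimental Safetensors import, a Modelfile's `FROM` can point at a local directory instead of a model name (the directory and model name here are hypothetical):

```
# Modelfile — FROM references a hypothetical local directory
# containing the model's Safetensors weights
FROM ./mistral-7b
```

Building it would then be `ollama create my-mistral -f Modelfile`.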
New Contributors
- @tgraupmann made their first contribution in #2582
- @andersrex made their first contribution in #2909
- @leonid20000 made their first contribution in #2440
- @hishope made their first contribution in #2973
- @mrdjohnson made their first contribution in #2759
- @mofanke made their first contribution in #3077
- @racerole made their first contribution in #3073
- @Chris-AS1 made their first contribution in #3094
Full Changelog: v0.1.28...v0.1.29
v0.1.28
New models
- StarCoder2: the next generation of transparently trained open code LLMs, available in three sizes: 3B, 7B, and 15B parameters.
- DolphinCoder: a chat model based on StarCoder2 15B that excels at writing code.
What's Changed
- Vision models such as `llava` should now respond better to text prompts
- Improved support for `llava` 1.6 models
- Fixed issue where switching between models repeatedly would cause Ollama to hang
- Installing Ollama on Windows no longer requires a minimum of 4GB of disk space
- Ollama on macOS will now more reliably determine available VRAM
- Fixed issue where running Ollama in `podman` would not detect Nvidia GPUs
- Ollama will correctly return an empty embedding when calling `/api/embeddings` with an empty `prompt`, instead of hanging
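As an illustration of the embeddings fix, calling `/api/embeddings` with an empty `prompt` now returns promptly with an empty embedding (a sketch; assumes `nomic-embed-text` has already been pulled):

```
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": ""
}'
```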
New Contributors
- @Bin-Huang made their first contribution in #1706
- @elthommy made their first contribution in #2737
- @peanut256 made their first contribution in #2354
- @tylinux made their first contribution in #2827
- @fred-bf made their first contribution in #2780
- @bmwiedemann made their first contribution in #2836
Full Changelog: v0.1.27...v0.1.28
v0.1.27
Gemma
Gemma is a new, top-performing family of lightweight open models built by Google, available in `2b` and `7b` parameter sizes:

- `ollama run gemma:2b`
- `ollama run gemma:7b` (default)
What's Changed
- Performance improvements (up to 2x) when running Gemma models
- Fixed performance issues on Windows without GPU acceleration. Systems with AVX and AVX2 instruction sets should be 2-4x faster.
- Reduced likelihood of false positive Windows Defender alerts.
New Contributors
- @joshyan1 made their first contribution in #2657
- @pfrankov made their first contribution in #2138
- @adminazhar made their first contribution in #2686
- @b-tocs made their first contribution in #2510
- @Yuan-ManX made their first contribution in #2249
- @langchain4j made their first contribution in #1690
- @logancyang made their first contribution in #1918
Full Changelog: v0.1.26...v0.1.27
v0.1.26
What's Changed
- Support for `bert` and `nomic-bert` embedding models
- Fixed issue where system prompt and prompt template would not be updated when loading a new model
- Quotes will now be trimmed around the value of `OLLAMA_HOST` on Windows (see the sketch after this list)
- Fixed duplicate button issue on the Windows taskbar menu
- Fixed issue where the system prompt would be overridden when using the `/api/chat` endpoint
- Hardened AMD driver lookup logic
- Fixed issue where two versions of Ollama on Windows would run at the same time
- Fixed issue where memory would not be released after a model is unloaded with modern CUDA-enabled GPUs
- Fixed issue where AVX2 was required for GPU acceleration on Windows
- Fixed issue where `/bye` or `/exit` would not work with trailing spaces or characters after them
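For instance, a value stored with surrounding quotes now works (a PowerShell sketch; the address is illustrative):

```
# The stored value carries literal quotes; Ollama now trims them on startup
$env:OLLAMA_HOST = '"127.0.0.1:11434"'
ollama serve
```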
New Contributors
- @tristanbob made their first contribution in #2545
- @justinh-rahb made their first contribution in #2563
- @gerazov made their first contribution in #2188
- @eddumelendez made their first contribution in #2164
- @lulzshadowwalker made their first contribution in #2381
- @jakobhoeg made their first contribution in #2466
- @jdetroyes made their first contribution in #1673
- @djcopley made their first contribution in #1767
- @pythops made their first contribution in #2329
- @ttsugriy made their first contribution in #2511
- @medoror made their first contribution in #2180
- @nikeshparajuli made their first contribution in #1775
- @n4ze3m made their first contribution in #2447
Full Changelog: v0.1.25...v0.1.26
v0.1.25
Windows Preview
Ollama is now available on Windows in preview. Download it here. Ollama on Windows makes it possible to pull, run and create large language models in a new native Windows experience. It includes built-in GPU acceleration, access to the full model library, and the Ollama API including OpenAI compatibility.
What's Changed
- Ollama on Windows is now available in preview
- Fixed an issue where requests would hang after being repeated several times
- Ollama will now correctly error when provided an unsupported image format
- Fixed issue where `ollama serve` wouldn't immediately quit when receiving a termination signal
- Fixed issues with prompt templating for the `/api/chat` endpoint, such as where Ollama would omit the second system prompt in a series of messages
- Fixed issue where providing an empty list of messages would return a non-empty response instead of loading the model (see the sketch after this list)
- Setting a negative `keep_alive` value (e.g. `-1`) will now correctly keep the model loaded indefinitely
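One consequence of the empty-messages fix: a `/api/chat` request with no messages now simply loads the model into memory (a sketch; assumes `llama2` has been pulled):

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama2",
  "messages": []
}'
```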
New Contributors
Full Changelog: v0.1.24...v0.1.25
v0.1.24
OpenAI Compatibility
This release adds initial compatibility support for the OpenAI Chat Completions API.
Usage with cURL
```
curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama2",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Hello!"
            }
        ]
    }'
```
New Models
- Qwen 1.5: Qwen 1.5 is a new family of large language models by Alibaba Cloud spanning from 0.5B to 72B parameters.
What's Changed
- Fixed issue where requests to `/api/chat` would hang when providing empty `user` messages repeatedly
- Fixed issue on macOS where Ollama would return a missing library error after being open for a long period of time
New Contributors
Full Changelog: v0.1.23...v0.1.24
v0.1.23
New vision models
The LLaVA model family on Ollama has been updated to version 1.6 and now includes a new 34b version:

- `ollama run llava`: a new 7B LLaVA model based on Mistral
- `ollama run llava:13b`: the 13B LLaVA model
- `ollama run llava:34b`: the 34B LLaVA model, one of the most powerful open-source vision models available
These new models share several improvements:
- More permissive licenses: LLaVA 1.6 models are distributed via the Apache 2.0 license or the LLaMA 2 Community License.
- Higher image resolution: support for up to 4x more pixels, allowing the model to grasp more details.
- Improved text recognition and reasoning capabilities: these models are trained on additional document, chart and diagram data sets.
`keep_alive` parameter: control how long models stay loaded

When making API requests, the new `keep_alive` parameter can be used to control how long a model stays loaded in memory:
```
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "keep_alive": "30s"
}'
```
- If set to a positive duration (e.g. `20m`, `1hr` or `30`), the model will stay loaded for the provided duration
- If set to a negative duration (e.g. `-1`), the model will stay loaded indefinitely (see the sketch after this list)
- If set to `0`, the model will be unloaded immediately once finished
- If not set, the model will stay loaded for 5 minutes by default
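For example, a request that keeps the model in memory indefinitely might look like this (a sketch; `mistral` stands in for any pulled model):

```
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "keep_alive": -1
}'
```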
Support for more Nvidia GPUs
| Family | Supported cards |
| --- | --- |
| GeForce GTX | TITAN X, 980 Ti, 980, 970, 960, 950, 750 Ti, 750 |
| GeForce GTX | 980M, 970M, 965M, 960M, 950M, 860M, 850M |
| GeForce | 940M, 930M, 910M, 840M, 830M |
| Quadro | M6000, M5500M, M5000, M2200, M1200, M620, M520 |
| Tesla | M60, M40 |
| NVS | 810 |
What's Changed
- New `keep_alive` API parameter to control how long models stay loaded
- Image paths can now be provided to `ollama run` when running multimodal models (see the sketch after this list)
- Fixed issue where downloading models via `ollama pull` would slow down to 99%
- Fixed error when running Ollama with Nvidia GPUs and CPUs without AVX instructions
- Support for additional Nvidia GPUs (compute capability 5)
- Fixed issue where the system prompt would be repeated in subsequent messages
- `ollama serve` will now print the prompt when `OLLAMA_DEBUG=1` is set
- Fixed issue where exceeding the context size would cause erroneous responses in `ollama run` and the `/api/chat` API
- `ollama run` will now allow sending messages without images to multimodal models
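As a sketch of the new image-path support, a local file can be referenced directly in the prompt (`./photo.jpg` is a hypothetical file):

```
ollama run llava "Describe this image: ./photo.jpg"
```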
New Contributors
- @jaglinux made their first contribution in #2224
- @textspur made their first contribution in #2252
- @rjmacarthy made their first contribution in #1950
- @hugo53 made their first contribution in #1957
- @RussellCanfield made their first contribution in #2313
Full Changelog: v0.1.22...v0.1.23
v0.1.22
New models
- Stable LM 2: A state-of-the-art 1.6B small language model.
What's Changed
- Fixed issue with Nvidia GPU detection that would cause Ollama to error instead of falling back to CPU
- Fixed issue where AMD integrated GPUs caused an error
Full Changelog: v0.1.21...v0.1.22
v0.1.21
New models
- Qwen: Qwen is a series of large language models by Alibaba Cloud spanning from 1.8B to 72B parameters.
- DuckDB-NSQL: A text-to-SQL LLM for DuckDB.
- Stable Code: A new code completion model on par with Code Llama 7B and similar models.
- Nous Hermes 2 Mixtral: The Nous Hermes 2 model from Nous Research, now trained over Mixtral.
Saving and loading models and messages
Models can now be saved and loaded with `/save <model>` and `/load <model>` when using `ollama run`. This saves or loads the conversation and any model changes made with `/set parameter`, `/set system` and more as a new model with the provided name.
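A short session sketch (the model and saved name are illustrative):

```
ollama run llama2
>>> /set system You are a concise assistant.
>>> /save concise-llama
>>> /bye
ollama run concise-llama
```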
`MESSAGE` Modelfile command

Messages can now be specified in a `Modelfile` ahead of time using the `MESSAGE` command:
```
# example Modelfile
FROM llama2
SYSTEM You are a friendly assistant that only answers with 'yes' or 'no'
MESSAGE user Is Toronto in Canada?
MESSAGE assistant yes
MESSAGE user Is Sacramento in Canada?
MESSAGE assistant no
MESSAGE user Is Ontario in Canada?
MESSAGE assistant yes
```
After creating this model, running it will restore the message history. This is useful for techniques such as chain-of-thought prompting:
```
ollama create -f Modelfile yesno
ollama run yesno
>>> Is Toronto in Canada?
yes
>>> Is Sacramento in Canada?
no
>>> Is Ontario in Canada?
yes
>>> Is Havana in Canada?
no
```
Python and JavaScript libraries
The first versions of the Python and JavaScript libraries for Ollama are now available.
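Both are published as `ollama` on their respective registries (assuming pip and npm as the installers):

```
pip install ollama
npm install ollama
```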
Intel & AMD CPU improvements
Ollama now supports CPUs without AVX. This means Ollama will run on older CPUs and in environments (such as virtual machines, Rosetta, and GitHub Actions runners) that don't support AVX instructions. On newer CPUs that support AVX2, Ollama gets a small performance boost, running models about 10% faster.
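To check which of these instruction sets a Linux CPU exposes, the kernel's CPU flags can be inspected (a sketch; macOS and Windows report CPU features differently):

```
# Prints "avx" and/or "avx2" if the CPU supports them
grep -Ewo 'avx2?' /proc/cpuinfo | sort -u
```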
What's Changed
- Support for a much broader set of CPUs, including CPUs without AVX instruction set support
- If a GPU detection error is hit when attempting to run a model, Ollama will fall back to the CPU
- Fixed issue where generating responses with the same prompt would hang after around 20 requests
- New `MESSAGE` Modelfile command to set the conversation history when building a model
- Ollama will now use AVX2 for faster performance if available
- Improved detection of Nvidia GPUs, especially in WSL
- Fixed issue where models with LoRA layers may not load
- Fixed incorrect error that would occur when retrying network connections in `ollama pull` and `ollama push`
- Fixed issue where `/show parameter` would round decimal numbers
- Fixed issue where requests would hang upon hitting the context window limit
New Contributors
- @fpreiss made their first contribution in #1921
- @eavanvalkenburg made their first contribution in #1931
- @0atman made their first contribution in #1924
- @sachinsachdeva made their first contribution in #2021
- @Arrendy made their first contribution in #2016
- @purificant made their first contribution in #1958
- @lainedfles made their first contribution in #1999
Full Changelog: v0.1.20...v0.1.21
v0.1.20
New models
- MegaDolphin: A new 120B version of the Dolphin model.
- OpenChat: Updated to the latest version, `3.5-0106`.
- Dolphin Mistral: Updated to the latest DPO Laser version, which achieves higher scores with more robust outputs.
What's Changed
- Fixed additional cases where Ollama would fail with `out of memory` CUDA errors
- Multi-GPU machines will now correctly allocate memory across all GPUs
- Fixed issue where Nvidia GPUs would not be detected by Ollama
Full Changelog: v0.1.19...v0.1.20