
Releases: intel/neural-speed

Intel® Neural Speed v1.0 Release

29 Mar 11:54 · 79c3537

Examples

  • Enable Mistral-base-v0.2 (ee40f28)

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • Ubuntu 22.04

Intel® Neural Speed v1.0a Release

22 Mar 11:10 · 1051182

Highlights

  • Improve performance on client CPUs
  • Support batching; GPT-J results submitted to MLPerf v4.0

Improvements

  • Support continuous batching and beam search inference (7c2199)
  • Improve performance on AVX2 platforms (bc5ee16, aa4a8a, 35c6d10)
  • Support FFN fusion for ChatGLM2 (96fadd)
  • Enable loading models from ModelScope (ad3d19); see the sketch after this list
  • Extend the supported input token length (eb41b9, e76a58e)
  • [BesTLA] Improve RTN quantization accuracy of int4 and int3 (a90aea)
  • [BesTLA] Add a new thread pool and hybrid dispatcher (fd19a44)
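
As an illustration of the ModelScope loading path above, here is a minimal sketch using the Transformers-style Python API from the project README; the `model_hub="modelscope"` keyword and the model ID are assumptions for illustration, not confirmed by these notes.

```python
# Minimal sketch (assumptions flagged): load a model hosted on ModelScope
# through the Transformers-style API re-exported by
# intel_extension_for_transformers. `model_hub="modelscope"` is an assumed
# keyword based on this release note; the model ID is a placeholder.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "qwen/Qwen-7B"  # hypothetical ModelScope model ID
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

# load_in_4bit routes inference through Neural Speed's int4 CPU kernels
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, model_hub="modelscope"
)
outputs = model.generate(inputs, streamer=TextStreamer(tokenizer),
                         max_new_tokens=64)
```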

Examples

  • Enable Mixtral 8x7B (9bcb612)
  • Enable Mistral-GPTQ (96dc55)
  • Implement the YaRN rope scaling feature (6c36f54)
  • Enable Qwen 1.5 (750b35)
  • Support GPTQ & AWQ inference for Qwen v1, v1.5 and Mixtral-8x7B (a129213)
  • Support GPTQ for Baichuan2-13B & Falcon 7B & Phi-1.5 (eed9b3)
  • Enable Baichuan-7B and refactor Baichuan-13B (8d5fe2d)
  • Enable StableLM2-1.6B & StableLM2-Zephyr-1.6B & StableLM-3B (872876)
  • Enable ChatGLM3 (94e74d)
  • Enable Gemma-2B (e4c5f71)

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • Ubuntu 22.04

Intel® Neural Speed v0.3 Release

23 Feb 12:57 · 150e752

Highlights

  • Contributed GPT-J inference to MLPerf v4.0 submission (mlperf commits)
  • Enabled 3-bit low precision inference (ee40f28)

Improvements

  • Optimize layer normalization (98ffee45)
  • Update the Qwen Python API (51088a)
  • Load processed models automatically (662553)
  • Support continuous batching in offline and server modes (66cb9f5)
  • Support loading models directly from Hugging Face (bb80273)
  • Support AutoRound (e2d3652)
  • Enable OpenMP in BesTLA (3afae427)
  • Enable logging with NEURAL_SPEED_VERBOSE (a8d9e7); see the sketch after this list
  • Add the YaRN rope scaling data structure (8c846d6)
  • Improve Windows support (464239)
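
For the NEURAL_SPEED_VERBOSE logging mentioned above, a minimal sketch of enabling it before running inference; the level value shown is an assumption, since these notes only name the variable.

```python
# Minimal sketch: enable Neural Speed's verbose logging through the
# NEURAL_SPEED_VERBOSE environment variable named in this release.
# The value "1" is an assumed level; check the README for the levels
# actually supported.
import os

os.environ["NEURAL_SPEED_VERBOSE"] = "1"  # set before the model is loaded

# ...then load and run a model as usual; log output goes to stdout.
```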

Examples

  • Enable Qwen 1.8B (ea4b713)
  • Enable Phi-2, Phi-1.5 and Phi-1 (c212d8)
  • Support 3-bit & 4-bit GPTQ for GPT-J 6B (4c9070)
  • Support Solar 10.7B with GPTQ (26c68c7, 90f5cbd)
  • Support Qwen GGUF inference (cd67b92)

Bug Fixing

  • Fix a performance problem introduced by log-level handling (6833b2f, 6f85518f)
  • Fix issues in the straightforward API (4c082b7)
  • Fix a blocker on Windows platforms (4adc15)
  • Fix the Whisper Python API (c97dbe)
  • Fix Qwen loading & Mistral GPTQ conversion (d47984c)
  • Fix clang-tidy issues (ad54a1f)
  • Fix Mistral online loading issues (0470b1f)
  • Handle models that require a Hugging Face access token (33ffaf07)
  • Fix a GGUF conversion issue (5293ffa5)
  • Fix GPTQ & AWQ conversion issues (150e752)

Validated Configurations

  • Python 3.10
  • Ubuntu 22.04

Intel® Neural Speed v0.2 Release

22 Jan 14:41 · abcc0f4

Highlights

  • Support Q4_0, Q5_0 and Q8_0 GGUF models and AWQ; see the sketch after this list
  • Enhance tensor parallelism with shared memory across multiple sockets in a single node
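
To illustrate the GGUF support called out above, a minimal sketch loading a Q4_0 GGUF file through the Transformers-style API; the repository and file names are placeholders, and `model_file=` is an assumed keyword for selecting the GGUF file.

```python
# Minimal sketch (assumptions flagged): run a Q4_0 GGUF model.
# Repo/file names below are hypothetical placeholders; `model_file=`
# is an assumed keyword for picking a GGUF file out of the repo.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

repo = "TheBloke/Llama-2-7B-Chat-GGUF"   # hypothetical GGUF repository
gguf = "llama-2-7b-chat.Q4_0.gguf"       # hypothetical Q4_0 file name

# GGUF repos usually ship no tokenizer; a base-model tokenizer (gated,
# hypothetical choice) is used here instead.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Hello", return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(repo, model_file=gguf)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=32)[0]))
```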

Improvements

  • Rename BesTLA files and update their usage (d5c26d4)
  • Update the Python API and reorganize scripts (40663e)
  • Enable AWQ with a Llama2 example (9be307f)
  • Enable clang-tidy (227e89)
  • Support multi-node tensor parallelism (6dbaa0)
  • Support accuracy calculation for GPTQ models (7b124aa)
  • Enable logging with NEURAL_SPEED_VERBOSE (a8d9e7)

Examples

  • Add a Magicoder example (749caca)
  • Enable the Whisper large example (24b270)
  • Add a Dockerfile and README (f57d4e1)
  • Support multi-batch ChatGLM-V1 inference (c9fb9d)

Bug Fixing

  • Fix avx512-s8-dequant and an asymmetric-quantization-related bug (fad80b14)
  • Fix warmup prompt length and add ns_log_level control (070b6b)
  • Fix conversion: remove hardcoded AWQ settings (7729bb)
  • Fix a ChatGLM conversion issue (7671467)
  • Fix a BesTLA Windows compilation issue (760e5f)

Validated Configurations

  • Python 3.10
  • Ubuntu 22.04

Intel® Neural Speed v0.1 Release

22 Dec 14:47 · 6d8bb4a

Highlights

  • Created the Neural Speed project as a spin-off from Intel Extension for Transformers

Features

  • Support GPTQ models
  • Enable beam-search post-processing; see the sketch after this list
  • Add MX-Format data types (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4)
  • Refactor Transformers Extension for Low-bit Inference Runtime based on the latest Jblas
  • Support tensor parallelism with Jblas and shared memory
  • Improve performance on client CPUs
  • Enable Streaming LLM for the runtime
  • Enhance QLoRA on CPU with an optimized dropout operator
  • Add a script for PPL (perplexity) evaluation
  • Refine the Python API
  • Allow CompileBF16 on GCC 11
  • Support multi-round chat with ChatGLM2
  • Add shift-RoPE-based Streaming LLM
  • Enable MHA fusion for LLMs
  • Support AVX_VNNI and AVX2
  • Optimize the QBits backend
  • Support GELU
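
Because beam-search post-processing is listed as a feature above, here is a minimal sketch of invoking it through the Transformers-style generate API; `num_beams` and `do_sample` follow Hugging Face conventions, and whether every option maps onto Neural Speed's runtime is an assumption.

```python
# Minimal sketch: beam-search decoding through the Transformers-style API.
# `num_beams`/`do_sample` follow Hugging Face generate() conventions; their
# exact behavior on Neural Speed's runtime is an assumption here.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

name = "EleutherAI/gpt-j-6b"  # hypothetical example model
tokenizer = AutoTokenizer.from_pretrained(name)
inputs = tokenizer("The capital of France is", return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(name, load_in_4bit=True)
outputs = model.generate(inputs, num_beams=4, do_sample=False,
                         max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```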

Examples

  • Enable fine-tuning for Qwen-7B-Chat on CPU
  • Enable the Whisper C++ API
  • Apply the STS task to BAAI/BGE models
  • Enable the Qwen graph
  • Enable instruction-tuning Stable Diffusion examples
  • Enable Mistral-7B
  • Enable Falcon-180B
  • Enable Baichuan/Baichuan2 examples

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • GCC 13.1, 11.1
  • CentOS 8.4, Ubuntu 20.04, and Windows 10