Releases: tinygrad/tinygrad
tinygrad 0.8.0
Close to the new limit of 5000 lines at 4981.
Release Highlights
- Real dtype support within kernels!
- New `.schedule()` API to separate the concerns of scheduling and running.
- New lazy.py implementation doesn't reorder at build time; `GRAPH=1` is usable to debug issues.
- 95 TFLOPS FP16->FP32 matmuls on the 7900 XTX.
- GPT2 runs (jitted) in 2 ms on an NVIDIA 3090.
- Powerful and fast kernel beam search with `BEAM=2`.
- GPU/CUDA/HIP backends switched to `gpuctypes`.
- New (alpha) multigpu sharding API with `.shard`.
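The beam search idea can be illustrated with a toy sketch: keep the `n` best candidates at each step, which is what `BEAM=n` does over kernel optimization choices. Everything concrete below (the actions and the cost function) is made up for illustration; tinygrad scores real kernels by measured runtime, not a toy cost model.

```python
# Toy beam search over a sequence of "optimization actions", illustrating
# the idea behind BEAM=n: keep only the n best candidates at each step.
# The cost model here is invented; tinygrad times real compiled kernels.

def beam_search(initial, actions, cost, steps, beam_width=2):
    """Return the lowest-cost candidate reachable in `steps` applications."""
    frontier = [initial]
    for _ in range(steps):
        candidates = [act(c) for c in frontier for act in actions]
        # Keep only the beam_width cheapest candidates.
        frontier = sorted(candidates, key=cost)[:beam_width]
    return min(frontier, key=cost)

# Hypothetical "kernel state": just a number; actions tweak it, and cost
# is the distance from an arbitrary optimum at 42.
actions = [lambda x: x + 7, lambda x: x * 2, lambda x: x - 1]
best = beam_search(0, actions, cost=lambda x: abs(x - 42), steps=4, beam_width=2)
```

With a beam width of 1 this degenerates to greedy search; widening the beam trades search time for a better chance of escaping local minima, which is why `BEAM=2` is a reasonable default.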
See the full changelog: v0.7.0...v0.8.0
Join the Discord!
tinygrad 0.7.0
Bigger again at 4311 lines :( But, tons of new features this time!
Just over 500 commits since 0.6.0.
Release Highlights
- Windows support has been dropped to focus on Linux and Mac OS.
  - Some functionality may still work on Windows, but no support will be provided; use WSL instead.
- DiskTensors: a way to store tensors on disk has been added.
  - This is coupled with functionality in `state.py`, which supports saving/loading safetensors and loading torch weights.
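As a rough sketch of the safetensors layout handled here: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/byte offsets, then the raw tensor bytes. This is a minimal illustration of the file format only, not tinygrad's implementation; real code should go through `state.py` or the safetensors library.

```python
# Minimal sketch of the safetensors file layout: u64-LE header length,
# JSON header with dtype/shape/data_offsets per tensor, then raw bytes.
import json
import struct

def save_safetensors(tensors: dict, shapes: dict) -> bytes:
    header, data, offset = {}, b"", 0
    for name, raw in tensors.items():
        header[name] = {"dtype": "F32", "shape": shapes[name],
                        "data_offsets": [offset, offset + len(raw)]}
        data += raw
        offset += len(raw)
    hjson = json.dumps(header).encode()
    return struct.pack("<Q", len(hjson)) + hjson + data

def load_safetensors(blob: bytes) -> dict:
    (hlen,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + hlen])
    body = blob[8 + hlen:]
    return {name: body[meta["data_offsets"][0]:meta["data_offsets"][1]]
            for name, meta in header.items()}
```

A round trip through `save_safetensors` and `load_safetensors` returns the original raw bytes for each tensor name.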
- Tensor Cores are supported on M1/Apple Silicon and on the 7900 XTX (WMMA).
  - Support on the 7900 XTX requires weights and data to be in float16; full float16 compute support will come in a later release.
  - Tensor Core behaviour/usage is controlled by the `TC` envvar.
- Kernel optimization with nevergrad.
  - This optimizes the shapes going into the kernel, gated by the `KOPT` envvar.
- P2P buffer transfers are supported on most AMD GPUs when using a single Python process.
  - This is controlled by the `P2P` envvar.
- LLaMA 2 support.
  - A requirement of this is bfloat16 support for loading the weights; this is semi-supported by casting them to float16. Proper bfloat16 support is tracked at #1290.
  - The LLaMA example now also supports 8-bit quantization via the `--quantize` flag.
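As a hedged sketch of what 8-bit quantization means here: store weights as int8 values plus a scale, and dequantize when the weights are used. The symmetric per-tensor scheme below is illustrative only; the scheme actually used by the LLaMA example may differ in its details.

```python
# Toy symmetric 8-bit quantization: map each weight to an int8 value and a
# shared per-tensor scale, so storage drops ~4x versus float32 at the cost
# of rounding error. Illustration only, not tinygrad's exact scheme.

def quantize(ws):
    """Return (int8-range values, scale) for a list of floats."""
    scale = max(abs(w) for w in ws) / 127 or 1.0  # avoid /0 for all-zero input
    return [round(w / scale) for w in ws], scale

def dequantize(qs, scale):
    """Recover approximate floats from quantized values."""
    return [q * scale for q in qs]

qs, scale = quantize([0.6, -1.0, 0.3])
restored = dequantize(qs, scale)
```

The largest-magnitude weight maps to ±127 and is recovered almost exactly; smaller weights pick up a rounding error of at most half a quantization step.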
- Most MLPerf models have working inference examples. Training these models is currently being worked on.
- Initial multigpu training support.
  - Slow multigpu training by copying through host shared memory.
  - Somewhat follows torch's multiprocessing and DistributedDataParallel high-level design.
  - See the hlb_cifar10.py example.
- SymbolicShapeTracker and Symbolic JIT.
  - These two things combined allow models with changing shapes, like transformers, to be jitted.
  - This means that LLaMA can now be jitted for a massive increase in performance.
  - Be warned that the API for this is very WIP and may change in the future, as may the rest of the tinygrad API.
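A toy sketch of why symbolic shapes matter for transformers: a JIT that caches one kernel per concrete input shape recompiles on every new sequence length, while keying the cache on a symbolic dimension reuses a single kernel for the whole family of shapes. The functions below are pure-Python stand-ins, not tinygrad's JIT.

```python
# Contrast a concrete-shape JIT cache with a symbolic-shape one.
# "Compiling" is faked with Python's sum; only the cache keys differ.
compiles = {"concrete": 0, "symbolic": 0}

def naive_jit(cache, xs):
    key = len(xs)              # concrete shape: recompile when length changes
    if key not in cache:
        compiles["concrete"] += 1
        cache[key] = sum       # stand-in for compiling a kernel
    return cache[key](xs)

def symbolic_jit(cache, xs):
    key = "seq_len"            # symbolic dimension: one kernel for any length
    if key not in cache:
        compiles["symbolic"] += 1
        cache[key] = sum
    return cache[key](xs)

c1, c2 = {}, {}
for n in range(1, 6):          # sequence grows each step, like autoregression
    naive_jit(c1, list(range(n)))
    symbolic_jit(c2, list(range(n)))
```

After five steps the concrete cache has compiled five times and the symbolic cache once, which is the gap that makes jitted LLaMA decoding viable.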
- aarch64 and PTX assembly backends.
- WebGPU backend; see the `compile_efficientnet.py` example.
- Support for torch-like tensor indexing by other tensors.
- Some more `nn` layers were promoted, namely `Embedding` and various `Conv` layers.
- VITS and so-vits-svc examples added.
- Initial documentation work.
  - Quickstart guide: /docs/quickstart.md
  - Environment variable reference: /docs/env_vars.md
And lots of small optimizations all over the codebase.
See the full changelog: v0.6.0...v0.7.0
See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc
Join the Discord!
tinygrad 0.6.0
2516 lines now. Some day I promise a release will make it smaller.
- float16 support (needed for LLaMA)
- Fixed critical bug in training BatchNorm
- Limited support for multiple GPUs
- ConvNeXt + several MLPerf models in models/
- More torch-like methods in tensor.py
- Big refactor of the codegen into the Linearizer and CStyle
- Removed CompiledBuffer; use the LazyBuffer ShapeTracker
tinygrad 0.5.0
An upsetting 2223 lines of code, but so much great stuff!
- 7 backends: CLANG, CPU, CUDA, GPU, LLVM, METAL, and TORCH
- A TinyJit for speed (decorate your GPU function today)
- Support for a lot of onnx, including all the models in the backend tests
- No more MLOP convs, all HLOP (autodiff for convs)
- Improvements to shapetracker and symbolic engine
- 15% faster at running the openpilot model
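The idea behind TinyJit can be sketched as a caching decorator: capture the work on the first call, then replay it on later calls with matching argument shapes. This is a stand-in illustration under made-up names; the real TinyJit captures and replays GPU kernels, not Python function objects.

```python
# Toy shape-keyed "jit": capture on first call, replay on shape match.
# Stand-in only; tinygrad's TinyJit records launched GPU kernels.
import functools

def tiny_jit(fn):
    trace = {}
    @functools.wraps(fn)
    def wrapped(*args):
        key = tuple(len(a) for a in args)   # the "shape" of each argument
        if key not in trace:
            wrapped.captures += 1
            trace[key] = fn                 # "capture" (stand-in: keep fn)
        return trace[key](*args)            # "replay"
    wrapped.captures = 0
    return wrapped

@tiny_jit
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

r = dot([1, 2], [3, 4])   # first call with this shape: captures
dot([5, 6], [7, 8])       # same shapes: replayed, no new capture
```

Keying on shapes is also why a shape-changing model needs the Symbolic JIT from 0.7.0: with concrete keys, every new shape would trigger a fresh capture.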
tinygrad 0.4.0
So many changes since 0.3.0.
Fairly stable and correct, though still not fast. The hlops/mlops are solid, just needs work on the llops.
The first automated release, so hopefully it works?