
Releases: tinygrad/tinygrad

tinygrad 0.8.0

09 Jan 18:16
2c6f2e8

Close to the new 5000-line limit at 4981 lines.

Release Highlights

  • Real dtype support within kernels!
  • New .schedule() API to separate the concerns of scheduling and running (sketch after this list)
  • New lazy.py implementation doesn't reorder at build time. GRAPH=1 can now be used to debug issues
  • 95 TFLOPS FP16->FP32 matmuls on the 7900 XTX
  • GPT2 runs (jitted) in 2 ms on NVIDIA 3090
  • Powerful and fast kernel beam search, enabled with BEAM=2 (example after this list)
  • GPU/CUDA/HIP backends switched to gpuctypes
  • New (alpha) multigpu sharding API with .shard (sketch after this list)
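
A minimal sketch of the new scheduling API. The notes above only name .schedule(); here it is assumed to live on the tensor's lazy buffer, which may not match the exact path in this release:

    from tinygrad.tensor import Tensor

    a = Tensor.rand(16, 16)
    out = (a @ a).sum()                # lazy: nothing has executed yet

    # build the schedule without running it (path is an assumption, see above)
    sched = out.lazydata.schedule()
    print(f"{len(sched)} kernels would run")

    out.realize()                      # now actually execute the schedule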
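BEAM is an ordinary environment variable, so the beam search needs no code changes; a sketch of both ways to set it (the script name is just an illustration):

    # from the shell: BEAM=2 python3 examples/gpt2.py
    # or in Python, before tinygrad is imported:
    import os
    os.environ["BEAM"] = "2"

    from tinygrad.tensor import Tensor
    (Tensor.rand(1024, 1024) @ Tensor.rand(1024, 1024)).realize()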
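A sketch of the alpha sharding API, assuming .shard takes a tuple of device strings plus the axis to split on (signature and device strings inferred from the name, not confirmed against this release):

    from tinygrad.tensor import Tensor

    # split along axis 0 across two GPUs; device strings are an assumption
    t = Tensor.rand(256, 256).shard(("GPU:0", "GPU:1"), axis=0)
    out = (t + 1).sum()    # ops on a sharded tensor run on every shard
    print(out.numpy())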

See the full changelog: v0.7.0...v0.8.0

Join the Discord!

tinygrad 0.7.0

27 Aug 16:40
8b354b3
Compare
Choose a tag to compare

Bigger again at 4311 lines :( But tons of new features this time!

Just over 500 commits since 0.6.0.

Release Highlights

  • Windows support has been dropped to focus on Linux and Mac OS.
    • Some functionality may work on Windows, but no support will be provided; use WSL instead.
  • DiskTensors: a way to store tensors on disk has been added.
    • This is coupled with functionality in state.py that supports saving/loading safetensors and loading torch weights (example after this list).
  • Tensor Cores are supported on M1/Apple Silicon and on the 7900 XTX (WMMA).
    • Support on the 7900 XTX requires weights and data to be in float16; full float16 compute support will come in a later release.
    • Tensor Core behaviour/usage is controlled by the TC envvar.
  • Kernel optimization with nevergrad.
    • This optimizes the shapes going into the kernel, gated by the KOPT envvar.
  • P2P buffer transfers are supported on most AMD GPUs when using a single python process.
    • This is controlled by the P2P envvar.
  • LLaMA 2 support.
    • This requires bfloat16 support for loading the weights, which is semi-supported by casting them to float16; proper bfloat16 support is tracked at #1290.
    • The LLaMA example now also supports 8-bit quantization using the flag --quantize.
  • Most MLPerf models have working inference examples. Training these models is currently being worked on.
  • Initial multigpu training support.
    • Multigpu training is slow for now; data is copied through host shared memory.
    • Somewhat follows torch's multiprocessing and DistributedDataParallel high-level design.
    • See the hlb_cifar10.py example.
  • SymbolicShapeTracker and Symbolic JIT.
    • These two things combined allow models with changing shapes, like transformers, to be jitted.
    • This means that LLaMA can now be jitted for a massive increase in performance.
    • Be warned that the API for this is very WIP and may change in the future, as may the rest of the tinygrad API.
  • aarch64 and PTX assembly backends.
  • WebGPU backend, see the compile_efficientnet.py example.
  • Support for torch-like tensor indexing by other tensors (example after this list).
  • Some more nn layers were promoted, namely Embedding and various Conv layers (example after this list).
  • VITS and so-vits-svc examples added.
  • Initial documentation work.
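
A sketch of the save/load round trip with the state helpers named above; the import path shown is the one from later releases (tinygrad.nn.state), so it may be plain state.py here:

    from tinygrad.nn import Linear
    from tinygrad.nn.state import get_state_dict, load_state_dict, safe_save, safe_load

    model = Linear(784, 10)
    safe_save(get_state_dict(model), "model.safetensors")   # weights to disk
    load_state_dict(model, safe_load("model.safetensors"))  # and back again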
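Indexing one tensor with another, torch style; a minimal example (the explicit int dtype is a precaution, and the dtypes import path has moved between releases):

    from tinygrad.tensor import Tensor
    from tinygrad.helpers import dtypes

    t = Tensor([[1, 2], [3, 4], [5, 6]])
    idx = Tensor([2, 0], dtype=dtypes.int32)
    print(t[idx].numpy())   # rows 2 and 0: [[5, 6], [1, 2]]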
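The promoted layers are used straight from tinygrad.nn; a short sketch (keyword names as in current tinygrad, which may differ slightly here):

    from tinygrad.tensor import Tensor
    from tinygrad.nn import Conv2d, Embedding

    emb = Embedding(vocab_size=100, embed_size=16)
    tokens = emb(Tensor([[1, 2, 3]]))          # -> shape (1, 3, 16)

    conv = Conv2d(3, 8, kernel_size=3)
    feats = conv(Tensor.rand(1, 3, 32, 32))    # -> shape (1, 8, 30, 30)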

And lots of small optimizations all over the codebase.

See the full changelog: v0.6.0...v0.7.0

See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc

Join the Discord!

tinygrad 0.6.0

26 May 01:02

2516 lines now. Some day I promise a release will make it smaller.

  • float16 support, needed for LLaMA (example after this list)
  • Fixed critical bug in training BatchNorm
  • Limited support for multiple GPUs
  • ConvNeXt + several MLPerf models in models/
  • More torch-like methods in tensor.py
  • Big refactor of the codegen into the Linearizer and CStyle
  • Removed CompiledBuffer; the LazyBuffer's ShapeTracker is used instead
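
A minimal cast to half precision; in releases of this era dtypes lived in tinygrad.helpers (it moved later):

    from tinygrad.tensor import Tensor
    from tinygrad.helpers import dtypes

    w = Tensor.rand(4, 4).cast(dtypes.float16)  # half the memory for weights
    print(w.dtype, w.numpy())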

tinygrad 0.5.0

07 Mar 02:21

An upsetting 2223 lines of code, but so much great stuff!

  • 7 backends: CLANG, CPU, CUDA, GPU, LLVM, METAL, and TORCH
  • A TinyJit for speed (decorate your GPU function today; example after this list)
  • Support for much of ONNX, including all the models in the backend tests
  • No more MLOP convs; all HLOP (autodiff for convs)
  • Improvements to shapetracker and symbolic engine
  • 15% faster at running the openpilot model
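
A sketch of TinyJit as a decorator: the first couple of calls capture the kernels, later calls replay them without Python overhead. The import path is the one from this era; jitted functions are expected to take the same shapes every call and realize their output:

    from tinygrad.tensor import Tensor
    from tinygrad.jit import TinyJit

    @TinyJit
    def step(x: Tensor) -> Tensor:
        return (x @ x).relu().realize()   # realize inside the jitted function

    for _ in range(5):                    # first calls capture, later calls replay
        out = step(Tensor.rand(64, 64))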

tinygrad 0.4.0

08 Nov 16:49
8dc28dd

So many changes since 0.3.0.

Fairly stable and correct, though still not fast. The hlops/mlops are solid; it just needs work on the llops.

The first automated release, so hopefully it works?