FBGEMM_GPU v0.7.0

@spcyppt released this 26 Apr 17:53

Release Note

Highlights

  • New optimizer and output type support for Table Batched Embedding (TBE) training
  • Improvements and bug fixes for TBE variable batch size
  • Enhanced TBE pipeline prefetching for UVM caching
  • Many improvements to TBE CPU kernels
  • New and enhanced low-precision operators
  • Code refactoring and reorganization for faster builds
  • New tests and benchmarks
  • PyTorch 2 support for various operators
  • Clang compilation support

Software Requirements

FBGEMM_GPU v0.7.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.3
  • CUDA: v11.8, 12.1
  • Python: v3.8, 3.9, 3.10, 3.11, 3.12

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
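
The tested setups listed above can be checked before installing with a short script. This is a minimal sketch, not part of FBGEMM_GPU: the helper names (python_is_tested, torch_is_tested, cuda_is_tested) are hypothetical, and treating a missing CUDA version as a CPU-only build is an assumption.

```python
import sys

# Tested versions copied from the list above; helper names are illustrative.
TESTED_PYTHON = {(3, 8), (3, 9), (3, 10), (3, 11), (3, 12)}
TESTED_TORCH = "2.3"
TESTED_CUDA = {"11.8", "12.1"}


def python_is_tested(version_info=sys.version_info):
    """Return True if the (major, minor) pair is in the tested set."""
    return (version_info[0], version_info[1]) in TESTED_PYTHON


def torch_is_tested(torch_version):
    """Return True for PyTorch 2.3.x strings such as '2.3.0+cu121'."""
    base = torch_version.split("+")[0]  # drop a local tag like '+cu121'
    return base == TESTED_TORCH or base.startswith(TESTED_TORCH + ".")


def cuda_is_tested(cuda_version):
    """Return True for a tested CUDA version.

    Assumption: None (no CUDA in the PyTorch build) means a CPU-only
    setup and is accepted.
    """
    return cuda_version is None or cuda_version in TESTED_CUDA


if __name__ == "__main__":
    print("Python tested:", python_is_tested())
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch tested:", torch_is_tested(torch.__version__))
        print("CUDA tested:", cuda_is_tested(torch.version.cuda))
```

Running the script inside the Conda or Docker environment reports whether that environment matches the tested combinations. A False result does not mean FBGEMM_GPU will fail, only that the combination is outside the list above.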

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available on PyPI)
pip install fbgemm-gpu==0.7.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.7.0

Alternatively, it can be fetched from the PyTorch package index:

# FBGEMM_GPU CUDA variant (run only the command that matches your CUDA version)
# CUDA 11.8
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cu118/
# CUDA 12.1
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cpu
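
After running one of the pip commands, the installed package version can be confirmed from Python using only the standard library. This is a sketch under stated assumptions: the distribution names are taken from the pip commands above, the helper names are hypothetical, and a local version tag such as "+cu121" on wheels from the PyTorch index is assumed rather than confirmed.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
CANDIDATES = ("fbgemm-gpu", "fbgemm-gpu-cpu")


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_fbgemm_gpu():
    """Return (name, version) for the first installed candidate, else None."""
    for name in CANDIDATES:
        version = installed_version(name)
        if version is not None:
            return name, version
    return None


def matches_release(version, release="0.7.0"):
    """Compare versions, ignoring a local tag such as '+cu121' (assumed)."""
    return version.split("+")[0] == release


if __name__ == "__main__":
    found = find_fbgemm_gpu()
    if found is None:
        print("FBGEMM_GPU is not installed in this environment")
    else:
        name, version = found
        print(name, version, "matches 0.7.0:", matches_release(version))
```

This only checks package metadata; it does not load the library or verify that the CUDA variant matches the local driver.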

Changes

Table batched embedding (TBE) operators

  • [New] Added BF16 output support in TBE training (#2382)
  • [New] Added int8 output support for sequence embeddings (#2316)
  • [New] Added an auto-vectorization implementation for CPU TBE-NBit kernel with user selection (#2182, #2299)
  • [New] Added CowClip optimizer (#2226, #2243)
  • [Improvement] Extended support and bug fixes for variable batch size TBE (#2256, #2388, #2394, #2333)
  • [Improvement] Optimized cache fetch for forward split (#2216, #2282, #2289, #2262, #2218)
  • [Improvement] Fixes and enhancements to caching and cache lookup for pipeline prefetching (#2164, #2309, #2287, #2308)
  • [Improvement] Built HIP rules by default (#2380)
  • [New] Added a method to TBE module to recompute buffers (#2338)
  • [New] Added meta functions for PyTorch 2 support (#2347)
  • [New] Added support for MTIA in TBE modules (#2273, #2286)
  • [Improvement] Improved TBE logging and stats report (#2379, #2378, #2377, #2386, #2337)
  • [Improvement] General fixes and enhancements (#2235, #2398, #2212, #2269, #1782, #2270, #2265, #2385, #2370, #2349, #2312, #2411, #2400)
  • [Deprecation] Optimizers deprecated (#2253, #2252)
  • [Deprecation] Removed double type support from fbgemm_cuda_utils.cuh (#2335)
  • [Deprecation] Removed INT8 weight/output support from TBE GPU training

Jagged Tensor Operators

  • [Improvement] Removed device-host synchronization from keyed jagged index select (#2315)
  • [Improvement] Fixed half->int build error (#2240)

Index Select Operators

  • [Improvement] Fixed BF16 group_index_select_2d on AMD GPU (#2321)

Low-precision operators

  • [New] Added a CPU implementation of the per-channel quantize operator (#2341)
  • [New] Added a CPU implementation of the qlinear_channelwise operator (#2343)
  • [New] Enabled dequantization of CPU int8 output to BF16 on CUDA (#2242)
  • [New] Enabled dequantization for bf16 (#2241)

Pooled Embedding

  • [Improvement] Used gpu_library_selector for permute_pooled_embedding_ops_gpu (#2340)

Misc

  • [New] Added a CPU implementation of all_to_one_device (#2251)
  • [Improvement] Improved performance of _block_bucketize_sparse_features_cuda_kernel1 (#2331)
  • [New] Created cumem_utils_cpu and added to all_deps_cpu (#2215)
  • [New] Added float support to asynchronous_complete_cumsum_cpu (#2383)
  • [Improvement] Added early exit to sparse ops (#2277, #2276, #2213, #2259)
  • [New] Added an STBE GPU coalescing kernel (#2275)
  • [Improvement] Removed symint from tbe_input_combine_with_length_abstract (#2336)
  • [New] Added a GPU timing and basic reporting framework (#2314)
  • [Improvement] Fixes and FBGEMM PT2 compliance (#2223, #2224, #2225, #2231, #2327)

Benchmarks / Tests

Build / CI improvements and Fixes