26 Apr 17:53

spcyppt

8c06a63

FBGEMM_GPU v0.7.0 Latest

Release Note

Highlights

New optimizer and output type supports for Table Batched Embedding (TBE) training
Improvement and bug fixes for TBE variable batch size
Enhanced TBE pipeline prefetching for UVM caching
Many improvements on TBE CPU kernels
New and enhanced low-precision operators
Code refactoring and reorganization for faster builds
New tests and benchmarks
PyTorch 2 support for various operators
Clang compilation support

Software Requirements

FBGEMM_GPU v0.7.0 has been tested and is known to work on the following setups:

PyTorch: v2.3
CUDA: v11.8, 12.1
Python: v3.8, 3.9, 3.10, 3.11, 3.12

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.7.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.7.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table batched embedding (TBE) operators

[New] Added BF16 output support in TBE training (#2382)
[New] Added int8 output support for sequence embeddings (#2316)
[New] Added an auto-vectorization implementation for CPU TBE-NBit kernel with user selection (#2182, #2299)
[New] Added CowClip optimizer (#2226, #2243)
[Improvement] Extended support and bug fixes for variable batch size TBE (#2256, #2388, #2394, #2333)
[Improvement] Optimized cache fetch for forward split (#2216, #2282, #2289, #2262, #2218)
[Improvement] Caching and cache lookup for pipeline prefetching fixes and enhancements (#2164, #2309, #2287, #2308)
[Improvement] Built hip rules by default (#2380)
[New] Added a method to TBE module to recompute buffers (#2338)
[New] Added meta functions for PyTorch 2 support (#2347)
[New] Added support for MTIA in TBE modules (#2273, #2286)
[Improvement] Improved TBE logging and stats report (#2379, #2378, #2377, #2386, #2337)
[Improvement] General fixes and enhancements (#2235, #2398, #2212, #2269, #1782, #2270, #2265, #2385, #2370, #2349, #2312, #2411, #2400)
[Deprecation] Optimizers deprecated (#2253, #2252)
[Deprecation] Removed double type support from fbgemm_cuda_utils.cuh (#2335)
[Deprecation] Removed INT8 weight/output support from TBE GPU training

Jagged Tensor Operators

[Improvement] Removed device-host synchronization from keyed jagged index select (#2315)
[Improvement] Fixed half->int build error (#2240)

Index Select Operators

[Improvement] Fixed BF16 group_index_select_2d on AMD GPU (#2321)

Low-precision operators

[New] CPU implementation of per-channel quantize operator (#2341)
[New] CPU implementation for qlinear_channelwise operator (#2343)
[New] Enabled CPU int8 output to dequantization to bf16 on CUDA (#2242)
[New] Enabled dequantization for bf16 (#2241)

Pooled Embedding

[Improvement] Used gpu_library_selector for permute_pooled_embedding_ops_gpu (#2340)

Misc

[New] Implementation of CPU version of all_to_one_device (#2251)
[Improvement] Performance improvement of _block_bucketize_sparse_features_cuda_kernel1 (#2331)
[New] Created cumem_utils_cpu and added to all_deps_cpu (#2215)
[New] Added float support to asynchronous_complete_cumsum_cpu (#2383)
[Improvement] Added early exit to sparse ops (#2277, #2276, #2213, #2259)
[New] STBE GPU coalescing kernel (#2275)
[Improvement] Removed symint from tbe_input_combine_with_length_abstract (#2336)
[New] GPU timing and basic reporting framework (#2314)
[Improvement] Fixes and FBGEMM PT2 compliance (#2223, #2224, #2225, #2231, #2327)

Benchmarks / Tests

[New] Added dynamic quantize GEMM benchmark (#2297, #2295, #2271)
[New] Added a new CPU nbit-TBE benchmark that tries to reduce CPU frequency noise (#2306)
[New] Added unit test for stochastic rounding for UVM caching (#2324)
[New] Added unit test AsyncSeriesTimer (#2364)
[New] Added int32 overflow unit test for TBE UVM caching (#2303)
[Improvement] Disabled dynamo testing in TBE (#2381)
[Improvement] Refactored and re-organized tests (#2305, #2292, #2291, #2284, #2281, #2274, #2272, #2266, #2263, #2260, #2407, #2406, #2402, #2304, #2399, #2393)
[Improvement] General fixes for tests and benchmarks (#2301, #2300, #2298, #2255, #2205, #2296)

Build / CI improvements and Fixes

[Improvement] Optimized EmbeddingSpMDMNBit_autovec (#2267)
[Improvement] Switched between hip and cuda c++ lib so load (#2236)
[Improvement] Fixed bf16 support issues (#2238)
[New] Enabled Clang compilation in OSS for fbgemm_gpu (CPU and CUDA) (#2334, #2345, #2330, #2323)
[New] Upgraded ROCm version (#2405)
[Improvement] Enabled -Winfinite-recursion in deeplearning/PACKAGE (#2329)
[Improvement] Fixed shadowed variable in deeplearning/fbgemm/src/GroupwiseConv.cc (#2268)
[Improvement] General CI and build system enhancement (#2489, #2430, #2427, #2423, #2356, #2348, #2342, #2328, #2307, #2211, #2219, #2220, #2228, #2233)
[Improvement] Documentation enhancement (#2294, #2278, #2258, #2249, #2227, #2232, #2244, #2239, #2237)

31 Jan 19:40

spcyppt

v0.6.0

e0d208e

FBGEMM_GPU v0.6.0

Release Note

Highlights

Improvement and bug fixes for TBE variable batch size
Many TBE extensions and benchmarks
Enhanced TBE pipeline prefetching for UVM caching
Code refactoring and reorganization for faster builds
Many improvements and new sparse ops added
Improved low precision ops
Support for Python 3.12
PyTorch 2 support for various operators

Software Requirements

FBGEMM_GPU v0.6.0 has been tested and is known to work on the following setups:

PyTorch: v2.2
CUDA: v11.8, 12.1
Python: v3.8, 3.9, 3.10, 3.11, 3.12

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.6.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.6.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table batched embedding (TBE) operators

[Improvement] Extended support and bug fixes for variable batch size (#2012, #2043, #2107, #2150, #2188)
[Improvement] caching and cache lookup for pipeline prefetching (#2147, #2154, #2151)
[New] Support MTIA device type in FBGEMM TBE training (#1994)
[New] Enable sequence TBE CPU via AVX (#2195)
[New] Enable subwarp only for unweighted (#2051)
[New] Add meta functions (#2094, #2102)
[New] Add reverse qparam option for MTIA (#2109)
[New] uvm_cache_stats for direct mapped (#1951, #1952)
[Improvement] use memcpy for cpu emb inplace update (#2166)
[Improvement] Remove indices and offsets copying from prefetch (#2186)
[Improvement] Improve perf for L=0 cases for TBE v2 (#2046)
[Improvement] General fixes and enhancements (#2030, #2009)

Jagged Tensor Operators

[Improvement] Fix incorrect SymInt signature on dense_to_jagged (#2039)
[Improvement] Fix non-contiguous tensor problem in jagged_index_select (#2060, #2061)

Index Select Operators

[Improvement] Get total D from CPU buffer in batch_index_select_dim0 (#2079)

Low-precision operators

[New] Add BF16 in padded FP8 quantize ops (#2010)
[Improvement] Improve quantize_comm error message (#2018)
[Improvement] Fix illegal memory access error and initialize empty values on fp8 quantize kernel (#2131, #2176)

Pooled Embedding

[New] Add permute_duplicate_pooled_embeddings op for CPU (#1939)
[Improvement] Use PyTorch's p2p access enable function (#2000)
[New] Add support for duplicate in permutations for permute_pooled_embs_split (#1940)
[Improvement] Improve all_to_one error message (#2019)
[New] Add meta function for fbgemm::merge_pooled_embeddings operator (#2069)
[New] Add variable batch per feature support to EBC (tw/cw only) (#1986)

Misc

[New] Add meta backend for new_managed_tensor and sparse ops (#1990, #2028, #2029, #2072)
[New] Use 4k page instead of 2M for managed tensor (#2058)
[New] Add BF16 support for reorder_batched_ad_indices (#2116)
[New] SymInts for sparse ops (#2017, #2089)
[New] Support for CPU/GPU compilation (#2040)
[New] Add impl_abstract (#2084, #2087, #2090, #2097, #2098, #2129, #2132)
[Improvement] Make FBGEMM PT2 compliant (#2174, #2172, #2170, #2180, #2181, #2201, #2198)
[Improvement] Fix invalid CUDA configuration error for the empty input (#1993)

Benchmarks / Tests

[New] Benchmark block_bucketize_sparse_features uneven sharding (#2140, #2169)
[New] Add unit test for unique cache lookup (#2160)
[New] Add autogenerated opcheck tests (#2050, #2069, #2073, #2092, #2118, #2139, #2152, #2173, #2193)
[New] Add test for fbgemm ops. (#2136, #2082)
[Improvement] Modified TBE testbench to use the FBGEMM generate_requests function to generate indices and offsets (#1882)
[Improvement] Remove FP64 from TBE CPU tests (#2049)
[Improvement] Add warmup_runs to TBE benchmarks and run at least 1 warmup iter (#2163)
[Improvement] Add --pooling in TBE nbit_cpu benchmark (#2200)
[Improvement] Fill embedding tables with randomized scales and bias in split-TBE benchmarks (#2031)

Build / CI improvements and Fixes

[Improvement] General CI and build system enhancement (#2065, #2071, #2078, #2149, #2189, #2203, #2204, #2209, #2047)
[Improvement] Reorganized code to enable faster builds (#1881, #2083, #2085, #2095, #2141, #2112, #2133, #2145, #2196, #2100, #2103)
[New] Add support for Python 3.12 (#2194)
[New] Updates for ROCm 5.6, 5.7 and 6.0 support and Hip.cmake changes (#2066, #2088, #2106)
[New] Add debug flags for HIP runs (#2206)
[Improvement] unknown c++ flag detection in CMake (#2057)
[Improvement] Fix inconsistent dll linkage warning (#2059, #2064)
[Improvement] Fix heap-buffer-overflow in radix_sort_parallel (#2075)
[Improvement] Update AVX2 and AVX512 flags (#2167)

05 Oct 23:52

spcyppt

v0.5.0

a4151dd

FBGEMM_GPU v0.5.0

Release Notes

Highlights

TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement)
Many TBE extensions including defused TBE backward-optimizer, variable batch size support, pipeline prefetching support for UVM caching
Many improvements and new sparse ops added
ARM support
SM 9.0 support for CUDA 12.1 for H100 GPUs
PyTorch 2 support for various operators, i.e., jagged tensor, pooled embedding ops

Software Requirements

FBGEMM_GPU v0.5.0 has been tested and is known to work on the following setups:

PyTorch: v2.1
CUDA: v11.8, 12.1
Python: v3.8, 3.9, 3.10, 3.11

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.5.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.5.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table batched embedding (TBE) operators

[Improvement] TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement) (#1641, #1804, #1787, #1904)
[New] Variable batch size support to TBE training (#1653, #1752, #1633, #1634, #1713, #1717, #1943)
[New] BFloat16 support for TBE CPU (#1839, #1851)
[New] Defused TBE backward-optimizer and SplitTBE optimizer (#1819, #1820, #1821)
[New] Max norm support for rowwise_adagrad (#1781)
[New] Support for 1024-2048 embedding dimension in TBE inference (#1656)
[Improvement] Backends via PyTorch dispatcher (#1948, #1976)
[Improvement] Deprecate many TBE optimizers (#1766, #1767, #1771, #1796, #1774, #1773, #1775, #1791, #1793)
[New] TBE UVM cache pipeline prefetching (#1883, #1893)

Jagged Tensor Operators

[New] New jagged tensor operators (#1690)
[New] Backends (Meta) (#1880, #1960)
[Improvement] Jagged operator optimizations (#1643, #1646, #1644, #1661, #1662, #1691, #1692, #1777)
[Improvement] Symbolic shape tracing on jagged operators for PyTorch 2 (#1758)

Index Select Operators

[New] batch_index_select_dim0 with TBE backend (#1897)
[New] Variable input sizes support for group_index_select_dim0 (#1968)
[Improvement] Improve group_index_select (#1764, #1884)

Low-precision operators

[New] Meta Backend FP8RowwiseQuantizedToFloat (#1890)
[New] Column-wise parallel quantization/dequantization (#1743)
[New] BF16 Support in FP8 quantize ops (#1961)
[Improvement] FP8 row-wise quantization optimization/improvement (#1729, #1858, #1981, #1909)

Pooled Embedding

[New] reduce_to_one (#1571)
[New] permute_duplicate_pooled_embeddings op (#1912)
[New] BF16 support for permute_pooled_embeddings op (#1937)
[New] Variable size input-output support for permute_pooled_embs_kernel (#1913)
[New] Backends (Meta) (#1853)
[Improvement] multi-gpu all_to_one enhancements (#1674, #1962)

Misc

[New] CUB kernel for 2D asynchronous_complete_cumsum (#1707)
[New] Backends (Meta) (#1709, #1905, #1970, #1971)
[New] BF16 support in permute_indices_weights_kernel_2 (#1852)
[New] FP16 and BF16 support in pack_segments (#1708)
[New] BF16 support for HBC ops. (#1744)
[New] BFloat16 support (#1832, #1865)
[Improvement] Speedup reorder_batched_ad_indices (#1901, #1902, #1932, #1933, #1711)

Benchmarks / Tests

[New] CLI support to GEMMsBenchmark (#1721, #1725)
[New] Benchmark for variable batch on TBE (#1559)
[New] BF16 output test coverage (#1835, #1838)
[New] Benchmark for reorder_batched_ad_indices (#1895)
[New] CPU support (#1874, #1926)
[Improvement] GroupIndexSelect Benchmark with zero_grad (#1559)
[Improvement] Add nbit-cpu-with-spec benchmark in FBGEMM-GPU's TBE benchmark suite (#1892)

Build / CI improvements and Fixes

[New] C++17 Support to FBGEMM and FBGEMM_GPU OSS builds (#1652)
[New] ARM Support in OSS CI (#1813)
[New] SM 9.0 Support for CUDA 12.1 (#1825, #2002)
[Improvement] General CI and build system enhancement (#1658, #1695, #1697, #1702, #1719, #1751, #1784, #1795, #1836, #1958, #2020, #2024)
[Improvement] Reorganized code to enable faster builds (#1843, #1849, #1856, #1860, #1863, #1864, #1866, #1886, #1694, #1705, #1710, #1723, #1757, #1783, #1871, #1873, #1879, #1944, #1816, #1753)

24 Mar 23:37

q10

v0.4.1

64833b5

FBGEMM_GPU v0.4.1

Release Notes

Software Requirements

FBGEMM_GPU v0.4.1 has been tested and is known to work on the following setups:

PyTorch: v2.0
CUDA: v11.7, 11.8
Python: v3.8, 3.9, 3.10, 3.11

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU may be fetched directly from PyPI:

# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.1

# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.1

Changes

This is a minor release whose main purpose is to deliver Python 3.11 support.

[New] Add support for Python 3.11 (#1646)
[Improvement] Add support for group size > 54 in group_index_select (#1611)
[Improvement] Implement cache miss emulation in UVM_CACHING (#1637)
[Improvement] Add TensorAccessor with memcheck (#1602)

15 Mar 17:08

q10

v0.4.0

ea96ea3

FBGEMM_GPU v0.4.0

Release Notes

Software Requirements

FBGEMM_GPU v0.4.0 has been tested and is known to work on the following setups:

PyTorch: v2.0
CUDA: v11.7, 11.8
Python: v3.8, 3.9, 3.10 (3.11 not supported yet)

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU may be fetched directly from PyPI:

# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.0

# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.0

Changes

Table batched embedding (TBE) operators

[New] SSD for inference TBE (#1473, #1479, #1485, #1517, #1533, #1535)
[New] Inplace TBE update (#1480, #1482, #1492, #1529)
[New] BF16 support for inference TBE (#1498, #1503)
[New] BF16 support for TBE on CPU (#1540, #1583)
[Improvement] Training TBE backward performance improvement (#1563)

UVM cache improvement

[New] Delta in-place update (#1436)
[New] UVM caching stats report (#1623, #1462, #1433, #1570)
[Improvement] [lfu|lru]_cache_insert_byte_kernel vectorization (#1475)

Jagged Tensor Operators

[New] Backends (Meta and Autograd) (#1461, #1466, #1467, #1469, #1468, #1477, #1556)
[New] BF16 support (#1472, #1560)
[New] FP32 + BF16 hybrid support for jagged_dense_dense_elementwise_add_jagged (#1487)
[New] Jagged tensors with no inner dense dimension support (#1267)
[New] New jagged tensor operators (#1557, #1577, #1578, #1579, #1594, #1595)

Index Select Operators

[New] group_index_select (#1421, #1592)
[New] index_select for selecting KeyJaggedTensor dim 1 (previously support only dim 0) (#1429)
[New] jagged_index_select for CPU (#1586)

Low-precision operators

[New] FP8 rowwise quantized communication (#1423)

Misc

Support 2D inputs for asynchronous_complete_cumsum (#1573)

Benchmarks / Tests

[New] nbit_device_with_spec for table batched embedding inference benchmark (#1455, #1465)
[New] Variable bag sizes for TBE benchmark (#1450)
[Improvement] Parallel bottom_unique_k_per_row for faster Zipf data generation (for FBGEMM benchmarks) (#1447)

Build / CI improvements and Fixes

[New] Linter integration (#1427)
[Improvement] General CI and build system enhancement (#1444, #1407, #1541, #1542, #1544, #1546, #1549, #1562, #1568, #1589, #1603, #1598, #1606, #1619, #1627, #1631, #1635)
[Improvement] AMD GPU CI and build system enhancement (#1537, #1552, #1543)

19 Jan 22:22

mjanderson09

v0.3.2

b2be702

v0.3.2

Minor release

28 Oct 21:29

mjanderson09

v0.3.0

9c4fa58

v0.3.0

New Features

Table Batched Embedding enhancements:

TBE performance optimizations (#1224, #1279, #1292, #1293, #1294, #1295, #1300, #1332, #1334, #1335, #1338, #1339, #1340, #1341, #1353, #1365)
Added FP16 weight type and output_dtype support for Dense TBE (#1343, #1348, #1370)
Direct Mapped UVM Cache (#1298)

AMD Support (beta) (#1102, #1193)

FBGEMM previously supported only NVIDIA accelerators, but FBGEMM 0.3.0 started to support AMD GPUs in collaboration with AMD. Although its support is still beta (e.g., we don't have a stable release build for AMD GPUs yet), the AMD GPU implementation covers almost all the FBGEMM operators supported by NVIDIA GPUs. AMD GPU support is tested using CI with AMD MI250 GPUs.

Quantized Communication Primitives (#1219, #1337)

Sparse kernel enhancements

New kernel: invert_permute (#1403)
New kernel: truncate_jagged_1d (#1345)
New kernel: jagged_index_select (#1157)
Jagged Tensor optimization for inference use cases (#1236)

Improved documentation for Jagged Tensors and SplitTableBatchedEmbeddingBagsCodegen

Optimized 2x2 kernel for AVX2 (#1280)

Full Changelog: https://github.com/pytorch/FBGEMM/commits/v0.3.0

20 Jul 21:26

mjanderson09

v0.2.0

6c8de10

v0.2.0

New Features

Inference Table Batched Embedding (TBE) Enhancements (#951, #984)
The table batched embedding (TBE) operator is an important base operation for embedding lookup for recommendation system inference on GPU. We added the following enhancements for performance and flexibility:

Alignment restriction removed: embedding dimension * data type size previously had to be a multiple of 4B; now it only has to be a multiple of 1B.
UVM caching kernels now scale linearly with the number of tables using UVM caching. Previously, the overhead was similar to that of all tables using UVM caching.
UVM caching kernel overhead is much smaller than before
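
As a rough illustration of the alignment rule above, the sketch below computes the row size for a given embedding dimension and weight type and checks it against the old (4B) and new (1B) requirement. The helper name and the bit-width table are illustrative assumptions, not FBGEMM APIs, and the real kernels may account for additional per-row metadata.

```python
# Hypothetical helper, not part of FBGEMM: checks the alignment rule described above.
BITS_PER_ELEMENT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4, "INT2": 2}


def row_is_aligned(embedding_dim: int, weight_type: str, alignment_bytes: int) -> bool:
    """Return True if embedding_dim * element size is a multiple of alignment_bytes."""
    row_bits = embedding_dim * BITS_PER_ELEMENT[weight_type]
    return row_bits % (8 * alignment_bytes) == 0


# An INT8 table with 6 columns has 6-byte rows: this fails the old 4B rule
# but satisfies the new 1B rule.
old_rule_ok = row_is_aligned(6, "INT8", alignment_bytes=4)
new_rule_ok = row_is_aligned(6, "INT8", alignment_bytes=1)
print(old_rule_ok, new_rule_ok)
```

Working in bits rather than bytes keeps the check exact for sub-byte types such as INT4 and INT2.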

Inference FP8 Table Batched Embedding (TBE) (#1091)
The table batched embedding (TBE) operator previously supported FP32, FP16, INT8, INT4, and INT2 embedding weight types. While these weight types work well in many models, we integrated FP8 weight types (in both GPU and CPU operations) to allow for numerical and performance evaluations of FP8 in our models. Compared to INT8, FP8 does not require the additional bias and scale storage and calculations. Additionally, the next-generation H100 GPUs support FP8 on Tensor Cores (mainly for matmul ops).
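
To make the INT8 overhead mentioned above concrete, here is a minimal sketch of a common min/max row-wise 8-bit quantization scheme, in which every row carries its own scale and bias. The function names are hypothetical and the exact storage format used by FBGEMM may differ; the point is only that each INT8 row needs two extra values and an extra multiply-add on lookup.

```python
# Illustrative row-wise 8-bit quantization; names are hypothetical, not FBGEMM APIs.
from typing import List, Tuple


def quantize_row_int8(row: List[float]) -> Tuple[List[int], float, float]:
    """Quantize one embedding row to 8-bit codes plus a per-row (scale, bias)."""
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for constant rows
    codes = [round((x - lo) / scale) for x in row]
    return codes, scale, lo


def dequantize_row_int8(codes: List[int], scale: float, bias: float) -> List[float]:
    """Recover approximate values: one multiply-add per element."""
    return [c * scale + bias for c in codes]


row = [-1.0, 0.0, 0.5, 1.0]
codes, scale, bias = quantize_row_int8(row)
restored = dequantize_row_int8(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(codes, scale, bias, max_error)
```

With this scheme the reconstruction error per element is bounded by about half of the row's scale, at the cost of storing `scale` and `bias` alongside every row.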

Jagged Tensor Kernels (#1006, #1008)
We added optimized kernels to speed up TorchRec Jagged Tensor. The purpose of JaggedTensor is to handle the case where one dimension of the input data is “jagged”, meaning that each consecutive row in a given dimension may be a different length, which is often the case with sparse feature inputs in recommendation systems.
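
The "jagged" layout described above can be sketched in plain Python as a flat list of values plus an offsets list that marks where each row starts and ends. This is an illustrative model with hypothetical helper names, not the FBGEMM kernels or the TorchRec class itself.

```python
# Plain-Python model of a jagged layout; helper names are hypothetical.
from typing import List, Tuple


def to_jagged(rows: List[List[float]]) -> Tuple[List[float], List[int]]:
    """Flatten variable-length rows into (values, offsets)."""
    values: List[float] = []
    offsets = [0]
    for r in rows:
        values.extend(r)
        offsets.append(len(values))
    return values, offsets


def jagged_to_padded_dense(
    values: List[float], offsets: List[int], max_len: int, pad: float = 0.0
) -> List[List[float]]:
    """Rebuild a rectangular table, truncating or padding each row to max_len."""
    dense = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        r = values[start:end][:max_len]
        dense.append(r + [pad] * (max_len - len(r)))
    return dense


rows = [[1.0, 2.0], [], [3.0, 4.0, 5.0]]
values, offsets = to_jagged(rows)
dense = jagged_to_padded_dense(values, offsets, max_len=3)
print(values, offsets, dense)
```

Row `i` occupies `values[offsets[i]:offsets[i + 1]]`, so empty rows cost nothing in `values` and no padding is stored until a dense view is requested.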

Optimized permute102-baddbmm-permute102 (#1048)
It is difficult to fuse matrix multiplications when their batch dimension is not the batch dimension of the model; switching the batch dimension is a quick solution. We created the permute102_baddbmm_permute102 operation, which swaps the first and second dimensions, performs the batched matrix multiplication, and then swaps them back. Currently only the forward pass with the FP16 data type is supported; FP32 and the backward pass will be supported in the future.
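
The three steps described above (swap, batched multiply-add, swap back) can be written as an unfused reference in plain Python. The shapes used here (`a` as [B][T][M], `w` as [T][M][N], `bias` as [T][N]) and the argument order are assumptions made for illustration, and all function names are hypothetical; the real operator runs as a fused kernel on FP16 tensors.

```python
# Unfused reference sketch on nested lists; shapes and names are assumptions.
def permute102(x):
    """Swap the first two dimensions: [d0][d1][d2] -> [d1][d0][d2]."""
    return [[x[i][j] for i in range(len(x))] for j in range(len(x[0]))]


def baddbmm(bias, a, w):
    """For each batch t: out[t] = bias[t] (broadcast over rows) + a[t] @ w[t]."""
    out = []
    for t in range(len(a)):
        n_cols = len(w[t][0])
        out.append(
            [
                [bias[t][n] + sum(r[m] * w[t][m][n] for m in range(len(r)))
                 for n in range(n_cols)]
                for r in a[t]
            ]
        )
    return out


def permute102_baddbmm_permute102_ref(bias, a, w):
    """Swap to [T][B][M], multiply-add per t, then swap back to [B][T][N]."""
    return permute102(baddbmm(bias, permute102(a), w))


a = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # [B=2][T=2][M=2]
w = [[[1], [1]], [[2], [0]]]              # [T=2][M=2][N=1]
bias = [[10], [20]]                       # [T=2][N=1]
result = permute102_baddbmm_permute102_ref(bias, a, w)
print(result)
```

Each output element is `bias[t][n] + sum_m a[b][t][m] * w[t][m][n]`; a fused kernel can compute this directly without materializing the two permuted intermediates.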

Optimized index_select for dim 0 index selection (#1113)
index_select is normally used as part of a sparse operation. While PyTorch supports a generic index_select for an arbitrary-dimension index selection, its performance for a special case like the dim 0 index selection is suboptimal. For this reason, we implement a specialized index_select for dim 0. In some cases, we have observed 1.4x performance gain from FBGEMM’s index_select compared to the one from PyTorch (using uniform index distribution).
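
For reference, the dim 0 special case discussed above amounts to gathering whole rows. The sketch below states that semantics in plain Python with a hypothetical function name; it says nothing about how the optimized kernel is implemented or how it achieves its speedup.

```python
# Reference semantics only; the function name is hypothetical.
from typing import List, Sequence


def index_select_dim0(table: Sequence[Sequence[float]], indices: Sequence[int]) -> List[List[float]]:
    """Gather whole rows: out[i] is a copy of table[indices[i]]."""
    return [list(table[i]) for i in indices]


table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
selected = index_select_dim0(table, [2, 0, 2])
print(selected)
```

Because every selected element of a row is contiguous in memory, a dim 0 selection reduces to copying entire rows, which is the property a specialized implementation can exploit.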

Full Changelog: https://github.com/pytorch/FBGEMM/commits/v0.2.0

Releases: pytorch/FBGEMM
