Commit

Pushing changes to GitHub Pages.

docs-sched-rebuild committed Apr 29, 2024
1 parent 73380ca commit c8152f9
Showing 2,232 changed files with 540,268 additions and 108,875 deletions.
731 changes: 565 additions & 166 deletions main/Introduction.html
563 changes: 433 additions & 130 deletions main/_modules/index.html
564 changes: 433 additions & 131 deletions main/_modules/merlin/dag/operator.html
564 changes: 433 additions & 131 deletions main/_modules/merlin/dag/ops/stat_operator.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/add_metadata.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/bucketize.html
574 changes: 436 additions & 138 deletions main/_modules/nvtabular/ops/categorify.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/clip.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/column_similarity.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/difference_lag.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/drop_low_cardinality.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/dropna.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/fill.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/filter.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/groupby.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/hash_bucket.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/hashed_cross.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/join_external.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/join_groupby.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/list_slice.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/logop.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/normalize.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/reduce_dtype_size.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/rename.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/target_encoding.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/ops/value_counts.html
564 changes: 433 additions & 131 deletions main/_modules/nvtabular/workflow/workflow.html

Large diffs are not rendered by default.
99 changes: 99 additions & 0 deletions main/_sources/Introduction.md
## [NVTabular](https://github.com/NVIDIA/NVTabular)

[![PyPI](https://img.shields.io/pypi/v/NVTabular?color=orange&label=version)](https://pypi.python.org/pypi/NVTabular/)
[![LICENSE](https://img.shields.io/github/license/NVIDIA-Merlin/NVTabular)](https://github.com/NVIDIA-Merlin/NVTabular/blob/stable/LICENSE)
[![Documentation](https://img.shields.io/badge/documentation-blue.svg)](https://nvidia-merlin.github.io/NVTabular/stable/Introduction.html)

NVTabular is a feature engineering and preprocessing library for tabular data that is designed to easily manipulate terabyte-scale datasets and train deep learning (DL) based recommender systems. It provides a high-level abstraction that simplifies code and accelerates computation on the GPU using the [RAPIDS Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) library.
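
The high-level API expresses feature engineering as a graph of operators applied to column selections. The following is a minimal sketch of that pattern; the column names and file paths are illustrative, not part of any particular dataset:

```python
import nvtabular as nvt
from nvtabular import ops

# Feature engineering is expressed as a graph of column selections and operators
cat_features = ["user_id", "item_id"] >> ops.Categorify()
cont_features = ["price"] >> ops.FillMissing() >> ops.Normalize()

workflow = nvt.Workflow(cat_features + cont_features)

dataset = nvt.Dataset("data/*.parquet")  # lazily reads Parquet files; larger than memory is fine
workflow.fit(dataset)                    # compute statistics (category mappings, means, etc.)
workflow.transform(dataset).to_parquet("processed/")
```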

NVTabular is a component of [NVIDIA Merlin](https://developer.nvidia.com/nvidia-merlin), an open source framework for building and deploying recommender systems. It works with the other Merlin components, including [Merlin Models](https://github.com/NVIDIA-Merlin/models), [HugeCTR](https://github.com/NVIDIA/HugeCTR), and [Merlin Systems](https://github.com/NVIDIA-Merlin/systems), to provide end-to-end acceleration of recommender systems on the GPU. Extending beyond model training, NVIDIA's [Triton Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) can automatically apply the feature engineering and preprocessing steps performed on the data during training to incoming data during inference.

<!-- <img src='https://developer.nvidia.com/blog/wp-content/uploads/2020/07/recommender-system-training-pipeline-1.png'/> -->

### Benefits

When training DL recommender systems, data scientists and machine learning (ML) engineers have been faced with the following challenges:

- **Huge Datasets**: Commercial recommenders are trained on huge datasets that may be several terabytes in scale.
- **Complex Data Feature Engineering and Preprocessing Pipelines**: Datasets need to be preprocessed and transformed so that they can be used with DL models and frameworks. In addition, feature engineering creates an extensive set of new features from existing ones, requiring multiple iterations to arrive at an optimal solution.
- **Input Bottleneck**: Data loading, if not well optimized, can be the slowest part of the training process, leading to under-utilization of high-throughput computing devices such as GPUs.
- **Extensive Repeated Experimentation**: The entire data engineering, training, and evaluation process can be repetitive and time-consuming, requiring significant computational resources.

NVTabular alleviates these challenges and helps data scientists and ML engineers:

- process datasets that exceed GPU and CPU memory without having to worry about scale.
- focus on what to do with the data and not how to do it by using abstraction at the operation level.
- prepare datasets quickly and easily for experimentation so that more models can be trained.
- deploy models into production by providing faster dataset transformation.

Learn more in the NVTabular [core features documentation](https://nvidia-merlin.github.io/NVTabular/stable/core_features.html).

### Performance

When running NVTabular on the Criteo 1TB Click Logs Dataset using a single V100 32GB GPU, feature engineering and preprocessing completed in 13 minutes. On a DGX-1 cluster with eight V100 GPUs, the same feature engineering and preprocessing completed in three minutes. Combined with [HugeCTR](http://www.github.com/NVIDIA/HugeCTR/), the dataset can be processed and a full model trained in only six minutes.

The performance of the Criteo DLRM workflow also demonstrates the effectiveness of the NVTabular library. The original ETL script, written with NumPy, took over five days to complete. Combined with CPU training, the total iteration time was over one week. By optimizing the ETL code in Spark and running on a DGX-1 equivalent cluster, the time to complete feature engineering and preprocessing was reduced to three hours, and training completed in one hour.

### Installation

NVTabular requires Python version 3.7+. Additionally, GPU support requires:

- CUDA version 11.0+
- NVIDIA Pascal GPU or later (Compute Capability >=6.0)
- NVIDIA driver 450.80.02+
- Linux or WSL

#### Installing NVTabular Using Conda

NVTabular can be installed with Anaconda from the `nvidia` channel by running the following command:

```
conda install -c nvidia -c rapidsai -c numba -c conda-forge nvtabular python=3.7 cudatoolkit=11.2
```

#### Installing NVTabular Using Pip

NVTabular can be installed with `pip` by running the following command:

```
pip install nvtabular
```

> Installing NVTabular with Pip causes NVTabular to run on the CPU only and might require installing additional dependencies manually.
> When you run NVTabular in one of our Docker containers, the dependencies are already installed.

#### Installing NVTabular with Docker

NVTabular Docker containers are available in the [NVIDIA Merlin container
repository](https://catalog.ngc.nvidia.com/?filters=&orderBy=scoreDESC&query=merlin).
The following table summarizes the key information about the containers:

| Container Name | Container Location | Functionality |
| ----------------- | ------------------------------------------------------------------------------------ | ------------------------------------------ |
| merlin-hugectr | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-hugectr | NVTabular, HugeCTR, and Triton Inference |
| merlin-tensorflow | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow | NVTabular, TensorFlow, and Triton Inference |
| merlin-pytorch | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-pytorch | NVTabular, PyTorch, and Triton Inference |

To use these Docker containers, you'll first need to install the [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker) to provide GPU support for Docker. The NGC links in the table above provide more information about how to launch and run these containers. For the software and model versions that NVTabular supports per container, see the [Support Matrix](https://github.com/NVIDIA/NVTabular/blob/stable/docs/source/resources/support_matrix.rst).

### Notebook Examples and Tutorials

We provide a [collection of examples](https://github.com/NVIDIA-Merlin/NVTabular/tree/stable/examples) to demonstrate feature engineering with NVTabular as Jupyter notebooks:

- Introduction to NVTabular's High-Level API
- Advanced workflows with NVTabular
- NVTabular on CPU
- Scaling NVTabular to multi-GPU systems

In addition, NVTabular is used in many of our examples in other Merlin libraries:

- [End-To-End Examples with Merlin](https://github.com/NVIDIA-Merlin/Merlin/tree/stable/examples)
- [Training Examples with Merlin Models](https://github.com/NVIDIA-Merlin/models/tree/stable/examples)
- [Training Examples with Transformer4Rec](https://github.com/NVIDIA-Merlin/Transformers4Rec/tree/stable/examples)

### Feedback and Support

If you'd like to contribute to the library directly, see [CONTRIBUTING.md](https://github.com/NVIDIA/NVTabular/blob/stable/CONTRIBUTING.md). We're particularly interested in contributions or feature requests for our feature engineering and preprocessing operations. To further advance our Merlin roadmap, we encourage you to share all the details regarding your recommender system pipeline in this [survey](https://developer.nvidia.com/merlin-devzone-survey).

If you're interested in learning more about how NVTabular works, see
[our NVTabular documentation](https://nvidia-merlin.github.io/NVTabular/stable/Introduction.html). We also have [API documentation](https://nvidia-merlin.github.io/NVTabular/stable/api/index.html) that outlines the specifics of the available calls within the library.
121 changes: 121 additions & 0 deletions main/_sources/api.rst
*****************
API Documentation
*****************

Workflow Constructors
---------------------

.. currentmodule:: nvtabular.workflow.workflow

.. autosummary::
   :toctree: generated

   Workflow
   WorkflowNode

.. currentmodule:: nvtabular.ops


Categorical Operators
---------------------

.. autosummary::
   :toctree: generated

   Bucketize
   Categorify
   DropLowCardinality
   HashBucket
   HashedCross
   TargetEncoding


Continuous Operators
--------------------

.. autosummary::
   :toctree: generated

   Clip
   LogOp
   Normalize
   NormalizeMinMax


Missing Value Operators
-----------------------

.. autosummary::
   :toctree: generated

   Dropna
   FillMissing
   FillMedian


Row Manipulation Operators
--------------------------

.. autosummary::
   :toctree: generated

   DifferenceLag
   Filter
   Groupby
   JoinExternal
   JoinGroupby


Schema Operators
----------------

.. autosummary::
   :toctree: generated

   AddMetadata
   AddProperties
   AddTags
   Rename
   ReduceDtypeSize
   TagAsItemFeatures
   TagAsItemID
   TagAsUserFeatures
   TagAsUserID


List Operators
--------------

.. autosummary::
   :toctree: generated

   ListSlice
   ValueCount


Vector Operators
----------------

.. autosummary::
   :toctree: generated

   ColumnSimilarity


User-Defined Function Operators
-------------------------------

.. autosummary::
   :toctree: generated

   LambdaOp


Operator Base Classes
---------------------

.. autosummary::
   :toctree: generated

   Operator
   StatOperator
83 changes: 83 additions & 0 deletions main/_sources/core_features.md
# Core Features

NVTabular supports the following core features:

- [TensorFlow and PyTorch Interoperability](#tensorflow-and-pytorch-interoperability)
- [HugeCTR Interoperability](#hugectr-interoperability)
- [Multi-GPU Support](#multi-gpu-support)
- [Multi-Node Support](#multi-node-support)
- [Multi-Hot Encoding and Pre-Existing Embeddings](#multi-hot-encoding-and-pre-existing-embeddings)
- [Shuffling Datasets](#shuffling-datasets)
- [Cloud Integration](#cloud-integration)
- [CPU Support](#cpu-support)

## TensorFlow and PyTorch Interoperability

In addition to providing mechanisms for transforming the data to prepare it for deep learning models, we also have framework-specific dataloaders implemented to help optimize getting that data to the GPU. Under a traditional dataloading scheme, data is read item by item and collated into a batch. With PyTorch, multiple processes can create many batches at the same time. However, this still leads to many individual rows of tabular data being accessed independently, which impacts I/O, especially when the data is on disk rather than in CPU memory. TensorFlow loads and shuffles TFRecords with a windowed buffering scheme that loads data sequentially into a buffer, from which it randomly samples batches and which it replenishes with the next sequential elements from disk. Larger buffer sizes ensure more randomness but can quickly bottleneck performance as TensorFlow tries to keep the buffer saturated. Smaller buffer sizes mean that datasets that aren't uniformly distributed on disk lead to biased sampling and potentially degraded convergence.

## HugeCTR Interoperability

NVTabular can also preprocess datasets that are then passed to HugeCTR for training. See the [HugeCTR Example Notebook](https://github.com/NVIDIA-Merlin/NVTabular/blob/stable/examples/scaling-criteo/03-Training-with-HugeCTR.ipynb) for details about how this works.

## Multi-GPU Support

NVTabular supports multi-GPU scaling with [Dask-CUDA](https://github.com/rapidsai/dask-cuda) and [dask.distributed](https://distributed.dask.org/en/latest/). To enable distributed parallelism, the NVTabular `Workflow` must be initialized with a `dask.distributed.Client` object as follows:

```python
import nvtabular as nvt
from dask.distributed import Client

# Connect to the scheduler address of an existing cluster
# (or deploy a new cluster object, for example dask_cuda.LocalCUDACluster)
cluster = "tcp://MachineA:8786"

client = Client(cluster)
workflow = nvt.Workflow(..., client=client)
...
```

Currently, there are many ways to deploy a "cluster" for Dask. This [article](https://blog.dask.org/2020/07/23/current-state-of-distributed-dask-clusters) gives a summary of all the practical options. For a single machine with multiple GPUs, the `dask_cuda.LocalCUDACluster` API is typically the most convenient option.
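
For a single machine, a cluster can also be deployed directly in Python. The following is a minimal sketch assuming `dask-cuda` is installed; `features` is a placeholder for a previously defined operator graph:

```python
import nvtabular as nvt
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Start one Dask-CUDA worker per visible GPU on this machine
cluster = LocalCUDACluster()
client = Client(cluster)

# Pass the client to the Workflow to distribute fit/transform across the workers
workflow = nvt.Workflow(features, client=client)  # `features` is a previously defined operator graph
```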

Since NVTabular already uses [Dask-cuDF](https://docs.rapids.ai/api/cudf/stable/) for internal data processing, there are no additional requirements for multi-GPU scaling. That said, parallel performance can depend strongly on (1) the size of `Dataset` partitions, (2) the shuffling procedure used for data output, and (3) the specific arguments used for both global-statistics and transformation operations. See [Multi-GPU](https://github.com/NVIDIA/NVTabular/blob/stable/examples/multi-gpu-toy-example/multi-gpu_dask.ipynb) for a simple step-by-step example.

## Multi-Node Support

NVTabular supports multi-node scaling with [Dask-CUDA](https://github.com/rapidsai/dask-cuda) and [dask.distributed](https://distributed.dask.org/en/latest/). To enable distributed parallelism, start a cluster and connect to it to run the application by doing the following:

1. Start the scheduler `dask-scheduler`.
2. Start the workers `dask-cuda-worker schedulerIP:schedulerPort`.
3. Run the NVTabular application where the NVTabular `Workflow` has been initialized as described in the Multi-GPU Support section.

For a detailed description of each existing method that is needed to start a cluster, please read this [article](https://blog.dask.org/2020/07/23/current-state-of-distributed-dask-clusters).

## Multi-Hot Encoding and Pre-Existing Embeddings

NVTabular supports the:

- processing of datasets with multi-hot categorical columns.
- passing of continuous vector features like pre-trained embeddings, which includes basic preprocessing and feature engineering, as well as full support in the dataloaders for training models with both TensorFlow and PyTorch.

Multi-hot lets you represent a set of categories as a single feature. For example, in a movie recommendation system, each movie might have a list of genres associated with it, such as comedy, drama, horror, or science fiction. Since movies can belong to more than one genre, we can't use single-hot encoding as we do for scalar columns. Instead, we train models with multi-hot embeddings for these features by having the deep learning model look up an embedding for each category in the list and then average all the embeddings for each row. Both multi-hot categoricals and vector continuous features are represented using list columns in our datasets. cuDF has added support for list columns, and we're leveraging that support in NVTabular to power this feature.

Our Categorify and HashBucket operators can map list columns down to small contiguous integers, which are suitable for use in an embedding lookup table. For example, if the dataset contains two rows like `[['comedy', 'horror'], ['comedy', 'sciencefiction']]`, NVTabular transforms the strings in each row into categorical IDs, such as `[[0, 1], [0, 2]]`, that can be used in our embedding layers.
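
As a small illustration of this encoding, the following sketch applies `Categorify` to a toy list column. It assumes a GPU environment with cuDF installed, and the exact integer IDs depend on the category mapping that gets fitted:

```python
import cudf
import nvtabular as nvt

# Toy DataFrame with a multi-hot (list) column
df = cudf.DataFrame({"genres": [["comedy", "horror"], ["comedy", "sciencefiction"]]})

genres = ["genres"] >> nvt.ops.Categorify()
workflow = nvt.Workflow(genres)

# fit_transform computes the category mapping and encodes the list column in one pass
encoded = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
print(encoded["genres"])  # list column of small contiguous integer IDs
```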

Our PyTorch and TensorFlow dataloaders have been extended to handle both categorical and continuous list columns. In TensorFlow, the KerasSequenceLoader class transforms each list column into two tensors representing the values and the offsets into those values for each batch. These tensors can be converted into RaggedTensors for multi-hot columns; for vector continuous columns, the offsets tensor can be safely ignored. We've provided a `nvtabular.framework_utils.tensorflow.layers.DenseFeatures` Keras layer that automatically handles these conversions for both continuous and categorical columns. For PyTorch, we've added support for multi-hot columns to our `nvtabular.framework_utils.torch.models.Model` class, which internally uses the PyTorch [EmbeddingBag](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html) layer to handle the multi-hot columns.
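
The following is a rough sketch of constructing such a loader. The column names and file path are illustrative, and the `nvtabular.loader.tensorflow` import path reflects an assumption about the module layout of these NVTabular releases:

```python
import nvtabular as nvt
from nvtabular.loader.tensorflow import KerasSequenceLoader

train_loader = KerasSequenceLoader(
    nvt.Dataset("processed/*.parquet"),   # output of a fitted Workflow (illustrative path)
    batch_size=65536,
    cat_names=["genres", "item_id"],      # "genres" is a multi-hot list column
    cont_names=["price"],
    label_names=["click"],
    shuffle=True,
)
# Each batch yields (features, labels); list columns arrive as (values, offsets) tensor pairs.
```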

## Shuffling Datasets

NVTabular makes it possible to shuffle during dataset creation. This creates a uniformly shuffled dataset that allows the dataloader to load large contiguous chunks of data that are already randomized across the entire dataset. NVTabular also lets you control the number of chunks that are combined into a batch, providing flexibility when trading off between performance and true randomization. This mechanism is critical when dealing with datasets that exceed CPU memory and per-epoch shuffling is desired during training. A full shuffle of such a dataset can exceed the training time for the epoch by several orders of magnitude.
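
As a hedged sketch of what this looks like at write time (assuming `workflow` and `dataset` are defined as in the multi-GPU example above, and that the `nvtabular.io.Shuffle` options and `to_parquet` arguments shown here are available in your release):

```python
import nvtabular as nvt

workflow.transform(dataset).to_parquet(
    output_path="shuffled/",                # illustrative output location
    shuffle=nvt.io.Shuffle.PER_PARTITION,   # shuffle rows within each output partition
    out_files_per_proc=8,                   # number of output files written per worker
)
```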

## Cloud Integration

NVTabular offers cloud integration with Amazon Web Services (AWS) and Google Cloud Platform (GCP), giving you the ability to build, train, and deploy models on the cloud using datasets. For additional information, see [Amazon Web Services](./resources/cloud_integration.md#amazon-web-services) and [Google Cloud Platform](./resources/cloud_integration.md#google-cloud-platform).

## CPU Support

NVTabular supports CPU execution using [pandas](https://pandas.pydata.org/), [pyarrow](https://arrow.apache.org/docs/python/), and [dask dataframe](https://examples.dask.org/dataframe.html). To enable it, initialize the `Dataset` class with the `cpu` parameter as follows:

```python
import nvtabular as nvt

dataset = nvt.Dataset(path, cpu=True)  # `path` points to the input data files
```

Processing will now take place on the CPU for that particular dataset, including feature engineering and preprocessing as well as TensorFlow and PyTorch training using NVTabular's dataloaders.
