
TensorFlow's Tensor

tensorflow::Tensor represents an n-dimensional array of values, like caffe2::Tensor.

Different from caffe2::Tensor<Context>, which is a class template, tensorflow::Tensor is a plain (non-template) class.

caffe2::Tensor<Context>'s constructor doesn't allocate memory; instead, allocation is delayed until mutable_data is called. In contrast, tensorflow::Tensor's constructor allocates the memory.

caffe2::Tensor<Context>'s template methods data<T> and mutable_data<T> can return an array of elements of any type -- caffe2::Tensor::meta_ records the most recently returned (and allocated) element type. In contrast, tensorflow::Tensor's constructor accepts a DataType parameter that specifies the element type up front.
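
To make the contrast concrete, here is a minimal usage sketch. It is illustrative only: the Caffe2 constructor signature below follows the old caffe2::Tensor<Context> API and the exact headers and types may differ by version.

#include <vector>
#include "caffe2/core/context.h"
#include "caffe2/core/tensor.h"
#include "tensorflow/core/framework/tensor.h"

void AllocationContrast() {
  // Caffe2: the constructor records only the shape; no memory is allocated
  // until the first mutable_data<T>() call, which also fixes the element type.
  caffe2::Tensor<caffe2::CPUContext> c2(std::vector<int64_t>{2, 3});
  float* c2_data = c2.mutable_data<float>();  // allocation happens here

  // TensorFlow: the element type and shape go to the constructor, which
  // allocates the buffer immediately.
  tensorflow::Tensor tf(tensorflow::DT_FLOAT, tensorflow::TensorShape({2, 3}));
  auto tf_data = tf.flat<float>();  // buffer already allocated
}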

caffe2::Tensor<Context> supports only numerical element types, whereas tensorflow::Tensor also supports string-typed elements.

caffe2::Tensor<Context> doesn't support accessing data in protobuf messages, whereas tensorflow::Tensor does.
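
On the TensorFlow side this is the TensorProto round-trip, presumably via Tensor::AsProtoField and Tensor::FromProto as declared in tensorflow/core/framework/tensor.h. A minimal sketch (error handling omitted):

#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor.pb.h"

void ProtoRoundTrip(const tensorflow::Tensor& t) {
  // Serialize the tensor's type, shape, and values into a TensorProto.
  tensorflow::TensorProto proto;
  t.AsProtoField(&proto);

  // Rebuild a Tensor from the protobuf message.
  tensorflow::Tensor restored;
  bool ok = restored.FromProto(proto);
  (void)ok;  // a real caller would check this
}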

caffe2::Tensor<Context>'s destructor doesn't free memory; its data member shared_ptr<T> data_ does. In contrast, tensorflow::Tensor's destructor is responsible for freeing the memory. In addition, tensorflow::Tensor does its own reference counting of the memory, whereas caffe2::Tensor<Context> relies on shared_ptr for that.

TensorShape

The shape of a tensor is represented by tensorflow::TensorShape, which can be constructed from a list of int64 values, or from a protobuf message TensorShapeProto.

TensorShape supports various representations of a shape because most tensors are low-dimensional. This brings more complexity than Caffe2's plain vector<int64_t>: tensor_shape.h and tensor_shape.cc take 759 lines of C++ code in total -- more than the rather handy majel::Dim, which takes 498 lines.
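
A minimal sketch of the two construction paths mentioned above (illustrative only; exact header paths may vary by TensorFlow version):

#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/framework/tensor_shape.pb.h"

void TensorShapeExamples() {
  // From a list of int64 dimension sizes.
  tensorflow::TensorShape from_dims({2, 3, 5});

  // From a TensorShapeProto protobuf message.
  tensorflow::TensorShapeProto proto;
  proto.add_dim()->set_size(2);
  proto.add_dim()->set_size(3);
  tensorflow::TensorShape from_proto(proto);
}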

Memory Management

The constructor of tensorflow::Tensor accepts a parameter Allocator* a and passes it to a newly created tensorflow::Buffer object tensorflow::Tensor::buf_:

Tensor::Tensor(Allocator* a, DataType type, const TensorShape& shape)
    : shape_(shape), buf_(nullptr) {
  set_dtype(type);
  CHECK_NOTNULL(a);
  if (shape_.num_elements() > 0 || a->ShouldAllocateEmptyTensors()) {
    CASES(type, buf_ = new Buffer<T>(a, shape.num_elements()));
  }
}
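
For example, a caller can pass the process-wide CPU allocator explicitly. This is a sketch; cpu_allocator() is the helper declared in tensorflow/core/framework/allocator.h.

#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/framework/tensor.h"

void ConstructWithAllocator() {
  // The buffer for this 2x3 float tensor is allocated inside the
  // constructor, through the Allocator passed as the first argument.
  tensorflow::Tensor t(tensorflow::cpu_allocator(),
                       tensorflow::DT_FLOAT,
                       tensorflow::TensorShape({2, 3}));
}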

tensorflow::Buffer then saves a into its parent class tensorflow::BufferBase's alloc_ field, and it calls Allocator::Allocate<T>:

template <typename T>
Buffer<T>::Buffer(Allocator* a, int64 n)
    : BufferBase(a), data_(a->Allocate<T>(n)), elem_(n) {}

Allocator::Allocate<T> calls Allocator::AllocateRaw and then calls type T's constructor via Allocator::RunCtor<T>:

  template <typename T>
  T* Allocate(size_t num_elements,
              const AllocationAttributes& allocation_attr) {
    ...
    void* p = AllocateRaw(kAllocatorAlignment, sizeof(T) * num_elements,
                          allocation_attr);
    T* typed_p = reinterpret_cast<T*>(p);
    if (typed_p) RunCtor<T>(typed_p, num_elements);
    return typed_p;
  }

By default, Allocator::RunCtor<T> is a no-op, so it doesn't construct basic types. A specialization runs the string type's constructor:

template <>
inline void Allocator::RunCtor(string* p, size_t n) {
  RunStringCtor(p, n);
}

Similarly, there are corresponding Allocator::RunDtor<T> specializations.
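
For completeness, the string specialization of the destructor hook mirrors the constructor one. This is paraphrased from allocator.h; treat the exact body as approximate:

template <>
inline void Allocator::RunDtor(string* p, size_t n) {
  RunStringDtor(p, n);
}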

Allocator::AllocateRaw calls port::AlignedMalloc:

  void* AllocateRaw(size_t alignment, size_t num_bytes) override {
    void* p = port::AlignedMalloc(num_bytes, alignment);
    ...
    return p;
  }

and Allocator::DeallocateRaw calls port::AlignedFree:

  void DeallocateRaw(void* ptr) override {
    ...
    port::AlignedFree(ptr);
  }

port::AlignedMalloc, port::AlignedFree, and the other platform-independent memory allocation functions are declared in tensorflow/core/platform/mem.h:

namespace tensorflow {
namespace port {

void* AlignedMalloc(size_t size, int minimum_alignment);
void AlignedFree(void* aligned_memory);

void* Malloc(size_t size);
void* Realloc(void* ptr, size_t size);
void Free(void* ptr);

}
}

There are two implementations:

  1. The POSIX implementation in tensorflow/core/platform/posix/port.cc just calls POSIX C runtime functions like malloc (a sketch of its AlignedMalloc also follows this list). For example:
void* Malloc(size_t size) {
#ifdef TENSORFLOW_USE_JEMALLOC
  return jemalloc_malloc(size);
#else
  return malloc(size);
#endif
}
  2. The Windows implementation in tensorflow/core/platform/windows/port.cc is almost identical to the POSIX one, because the C runtime functions are almost the same.
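
As a further example from the same POSIX port, AlignedMalloc is roughly the following. This is a simplified sketch; the jemalloc and Android branches are omitted.

#include <cstdlib>

// Simplified sketch of tensorflow/core/platform/posix/port.cc's AlignedMalloc.
void* AlignedMalloc(size_t size, int minimum_alignment) {
  void* ptr = nullptr;
  // posix_memalign requires the alignment to be at least sizeof(void*) and a
  // power of two; fall back to plain Malloc for smaller alignments.
  const int required_alignment = sizeof(void*);
  if (minimum_alignment < required_alignment) return Malloc(size);
  if (posix_memalign(&ptr, minimum_alignment, size) != 0) return nullptr;
  return ptr;
}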

Question: GPU Memory

Both implementations above allocate CPU memory, not GPU memory.

The TensorFlow codebase doesn't call cudaMalloc. Instead, there is one function, perftools::gputools::cuda::CUDADriver::DeviceAllocate, that calls cuMemAlloc:

/* static */ void *CUDADriver::DeviceAllocate(CudaContext *context,
                                              uint64 bytes) {
  ...
  CUresult res = cuMemAlloc(&result, bytes);

The class CUDADriver consists of a set of static methods, each of which corresponds to a CUDA driver API call. For example, CUDADriver::DeviceDeallocate calls cuMemFree:

/* static */ void CUDADriver::DeviceDeallocate(CudaContext* context,
                                               void *location) {
  ...
  CUresult res = cuMemFree(pointer);

Only CUDAExecutor::Allocate(uint64 size) calls CUDADriver::DeviceAllocate(context_, size):

void *CUDAExecutor::Allocate(uint64 size) {
  return CUDADriver::DeviceAllocate(context_, size);
}

I haven't yet figured out how (or whether) Tensor calls CUDAExecutor::Allocate for GPU memory.
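
If the pieces were wired together, one plausible shape would be an Allocator subclass whose AllocateRaw delegates to the executor. The sketch below is purely hypothetical: the class name and the delegation path are invented for illustration and are not TensorFlow's actual GPU allocator.

#include <string>
#include "tensorflow/core/framework/allocator.h"

// Hypothetical illustration only; not TensorFlow's real GPU allocator.
class HypotheticalGPUAllocator : public tensorflow::Allocator {
 public:
  explicit HypotheticalGPUAllocator(
      perftools::gputools::cuda::CUDAExecutor* exec)
      : exec_(exec) {}

  std::string Name() override { return "hypothetical_gpu"; }

  void* AllocateRaw(size_t alignment, size_t num_bytes) override {
    // Would ultimately reach CUDADriver::DeviceAllocate and cuMemAlloc.
    return exec_->Allocate(num_bytes);
  }

  void DeallocateRaw(void* ptr) override {
    // A real implementation would route back through the executor/driver
    // (CUDADriver::DeviceDeallocate and cuMemFree).
  }

 private:
  perftools::gputools::cuda::CUDAExecutor* exec_;
};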
