
Arrow CSV reader peak memory is very large #5766

Open
liujiayi771 opened this issue May 16, 2024 · 19 comments
Labels
bug Something isn't working triage

Comments

@liujiayi771
Contributor

Backend

VL (Velox)

Bug description

When reading large CSV files, for example when a single CSV file in a table is 300 MB, the peak memory usage of the Arrow memory pool during single-threaded reading can reach 500 MB. If the CSV is 2 GB, the peak memory usage grows to 1.7 GB. There does not appear to be a memory leak, but the peak memory usage is very high.

From the Arrow Dataset code, it appears we are using the streaming reader, so in theory memory consumption should not grow proportionally with the size of the CSV file.
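
For reference, a minimal sketch of how the Arrow C++ streaming CSV reader is normally driven (standard Arrow API, not Gluten code; assumes a recent Arrow version). With a streaming reader only roughly one decoded block plus read-ahead buffers should be resident at a time, so peak memory should track block_size rather than file size.

#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <iostream>

arrow::Status ScanCsv(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

  auto read_options = arrow::csv::ReadOptions::Defaults();
  read_options.block_size = 8 << 20;  // decode the file in 8 MB blocks
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  auto convert_options = arrow::csv::ConvertOptions::Defaults();

  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::csv::StreamingReader::Make(arrow::io::default_io_context(), input,
                                        read_options, parse_options,
                                        convert_options));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // With real streaming, bytes_allocated() should stay near block_size
    // instead of growing toward the full file size.
    std::cout << "allocated=" << arrow::default_memory_pool()->bytes_allocated()
              << std::endl;
  }
  return arrow::Status::OK();
}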

I added some code to the release method of ArrowNativeMemoryPool to check the peak memory.

@Override
public void release() throws Exception {
  System.out.println("peak=" + listener.peak() +", current=" + listener.current());
  if (arrowPool.getBytesAllocated() != 0) {
    LOGGER.warn(
        String.format(
            "Arrow pool still reserved non-zero bytes, "
                + "which may cause memory leak, size: %s. ",
            Utils.bytesToString(arrowPool.getBytesAllocated())));
  }
  arrowPool.close();
}

I also added some logging in the Arrow code to check the peak memory.

Result<RecordBatchGenerator> CsvFileFormat::ScanBatchesAsync(
    const std::shared_ptr<ScanOptions>& scan_options,
    const std::shared_ptr<FileFragment>& file) const {
  auto this_ = checked_pointer_cast<const CsvFileFormat>(shared_from_this());
  auto source = file->source();
  auto reader_fut =
      OpenReaderAsync(source, *this, scan_options, ::arrow::internal::GetCpuThreadPool());
  auto generator = GeneratorFromReader(std::move(reader_fut), scan_options->batch_size);
  WRAP_ASYNC_GENERATOR_WITH_CHILD_SPAN(
      generator, "arrow::dataset::CsvFileFormat::ScanBatchesAsync::Next");
  std::cout << "memory=" << default_memory_pool()->bytes_allocated()
            << ", max=" << default_memory_pool()->max_memory() << std::endl;
  return generator;
}

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

liujiayi771 added the bug and triage labels on May 16, 2024
@liujiayi771
Contributor Author

cc @jinchengchenghh @zhztheplayer, thanks.

@FelixYBW
Contributor

I remember Arrow caches all record batches before it streams them to Spark. In Gazelle we initially had the same issue and had to customize some logic to do real streaming. @zhztheplayer do you remember?

@zhztheplayer
Member

@zhztheplayer do you remember?

I can't recall that. But it doesn't make sense to buffer all data for a reader.

I suppose @jinchengchenghh is looking into this.

@jinchengchenghh
Contributor

I could not reproduce this issue. I tested TPC-H Q6 with 600 GB of data and printed the peak every time Arrow reserves memory.

  public void reserve(long size) {
    synchronized (this) {
      sharedUsage.inc(size);
    }
    System.out.println(sharedUsage.peak());
  }

This is the test result:

18350080
17825792:============================================>          (94 + 18) / 116]
18350080
17825792
18350080
17825792
18350080
17825792
18350080
17825792:=============================================>         (95 + 18) / 116]
18350080
17825792
18350080
17825792:=============================================>         (97 + 18) / 116]
18350080
17825792:==============================================>        (98 + 18) / 116]
18350080
17825792

After changing --master from local[18] to local[2], the peak memory was the same.

@liujiayi771
Contributor Author

@jinchengchenghh I will test the latest code.

@FelixYBW
Contributor

@jinchengchenghh can you print in the record batch constructor and destructor to confirm? There should be only 1 record batch alive, no more than 3.
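
One way to check this (illustrative instrumentation only, not existing Gluten code; OnBatchCreated/OnBatchDestroyed are hypothetical hooks to call wherever scan batches are created and released): count live record batches and log the high-water mark. With true streaming the count should stay around 1-3 regardless of file size.

#include <atomic>
#include <iostream>

static std::atomic<int> live_batches{0};
static std::atomic<int> peak_live_batches{0};

void OnBatchCreated() {
  int live = ++live_batches;
  // Keep the maximum observed number of simultaneously live batches.
  int peak = peak_live_batches.load();
  while (live > peak && !peak_live_batches.compare_exchange_weak(peak, live)) {
  }
}

void OnBatchDestroyed() {
  std::cout << "live record batches: " << --live_batches
            << " (peak " << peak_live_batches.load() << ")" << std::endl;
}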

@liujiayi771
Contributor Author

@jinchengchenghh Have you checked the size of a single CSV file?

@jinchengchenghh
Contributor

I assume you are using an intermediate commit of the CSV reader. There is a redundant colVector.retain() in ArrowUtil.loadBatch() in an intermediate version (not the merged version), which may cause the vector not to be released even when the column batch is closed. I removed colVector.retain() for another issue; I am not sure whether it is the root cause of this one. @liujiayi771

@jinchengchenghh
Contributor

The printed information is from each request to the Arrow memory pool, not per record batch.
On the Java side the batch consists of ArrowWritableColumnVectors; it uses ArrowArray to bridge to the C++ side, where it is converted to a Velox vector and released immediately. @FelixYBW
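
A rough sketch of the bridge step described above (assumed flow for illustration; ImportFromJava is a hypothetical helper, while ImportRecordBatch is the standard Arrow C data interface API): the Java-side batch is exported as ArrowArray/ArrowSchema and imported on the C++ side, after which it can be converted to a Velox vector and the Arrow batch dropped immediately.

#include <arrow/c/bridge.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>

arrow::Result<std::shared_ptr<arrow::RecordBatch>> ImportFromJava(
    struct ArrowArray* c_array, struct ArrowSchema* c_schema) {
  // ImportRecordBatch takes ownership of the exported structures; the
  // Java-side buffers are released once the returned RecordBatch is destroyed.
  return arrow::ImportRecordBatch(c_array, c_schema);
}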

@liujiayi771
Contributor Author

@jinchengchenghh I will test the latest code today.

@liujiayi771
Contributor Author

liujiayi771 commented May 28, 2024

@jinchengchenghh I tested the latest code, and the peak memory usage is still relatively high. I did not add logs in ArrowReservationListener.reserve; printing there produced no output in my case. Instead, I added two methods to ArrowReservationListener and printed peak and current in ArrowNativeMemoryPool.release.

public long peak() {
  return sharedUsage.peak();
}

public long current() {
  return sharedUsage.current();
}

@Override
public void release() throws Exception {
  System.out.println("peak=" + listener.peak() + ", current=" + listener.current());
  if (arrowPool.getBytesAllocated() != 0) {
    LOGGER.warn(
        String.format(
            "Arrow pool still reserved non-zero bytes, "
                + "which may cause memory leak, size: %s. ",
            Utils.bytesToString(arrowPool.getBytesAllocated())));
  }
  arrowPool.close();
}

I created a Parquet table and used spark-sql --local to read the data from a CSV table and insert overwrite it into the Parquet table. My dataset is 100 GB TPC-DS. I first tested the store_sales table, where each CSV file is 700 MB. The log output is as follows; the peak memory is about 920 MB:

peak=964689920, current=8388608
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=964689920, current=8388608
peak=956301312, current=8388608

I continued with the catalog_sales table, where each CSV file is 1.15 GB. The log output is as follows; the peak memory is about 1064 MB:

peak=1124073472, current=8388608
peak=1140850688, current=8388608
peak=1115684864, current=8388608
peak=1124073472, current=8388608
peak=1149239296, current=8388608
peak=1115684864, current=8388608

I also constructed a larger catalog_sales table with a single 30 GB CSV file. The log output is as follows; the peak memory is about 6 GB:

peak=6601834496, current=8388608

The peak memory I printed should be attributable only to the CSV reader. This issue is not urgent for me at the moment; after splitting the large CSV file into smaller files, everything works normally.

@jinchengchenghh
Contributor

I think it is because Arrow does not support passing a file start offset and length to split a file, so its peak memory is high for a very big CSV file.

@FelixYBW
Contributor

Do you mean the Arrow CSV reader doesn't support splits? Each partition must map to one or more whole CSV files rather than part of a large CSV file.

@jinchengchenghh
Contributor

Yes.

@jinchengchenghh
Contributor

It would be easy for Arrow to support a file offset and length; we just need to use RandomAccessFile to generate the InputStream.
The FileSource class constructor is:

  using CustomOpen = std::function<Result<std::shared_ptr<io::RandomAccessFile>>()>;

  FileSource(std::shared_ptr<io::RandomAccessFile> file, int64_t size,
             Compression::type compression = Compression::UNCOMPRESSED)
      : custom_open_([=] { return ToResult(file); }),
        custom_size_(size),
        compression_(compression) {}
  static Result<std::shared_ptr<InputStream>> GetStream(
      std::shared_ptr<RandomAccessFile> file, int64_t file_offset, int64_t nbytes);

https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/file_base.cc#L110

I can help implement it on demand.
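
As an illustration of that idea (just a sketch, not an existing Gluten or Arrow Dataset API; OpenCsvSplit is a hypothetical helper), a Spark split covering [file_offset, file_offset + nbytes) of a CSV file could be opened as an InputStream like this:

#include <arrow/io/api.h>
#include <arrow/result.h>

arrow::Result<std::shared_ptr<arrow::io::InputStream>> OpenCsvSplit(
    const std::string& path, int64_t file_offset, int64_t nbytes) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
  // GetStream returns a bounded stream over the requested byte range; the
  // caller would still need to align the range to row boundaries (as Spark
  // does for text formats) before handing it to the CSV reader.
  return arrow::io::RandomAccessFile::GetStream(std::move(file), file_offset,
                                                nbytes);
}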

@FelixYBW
Contributor

It would be easy for Arrow to support a file offset and length; we just need to use RandomAccessFile to generate the InputStream. [...] I can help implement it on demand.

Thank you, Chengcheng. Let's hold off until we get requests.

@liujiayi771
Contributor Author

@jinchengchenghh Spark will split a single CSV file into multiple partitions for reading, so we need to pass the split start and length to Arrow. I have worked around this for now with some hacks; otherwise the same CSV file would be read multiple times.

@FelixYBW
Contributor

FelixYBW commented May 31, 2024

@jinchengchenghh Spark will split a single CSV file into multiple partitions for reading, so we need to pass the split start and length to Arrow. I have worked around this for now with some hacks; otherwise the same CSV file would be read multiple times.

@jinchengchenghh Do we pass the CSV file to Arrow multiple times if it is split by Spark?
@zhztheplayer how did we solve this issue in Gazelle?

@jinchengchenghh
Contributor
