Limit indexing batches by data size in addition to the number of products #3239

Open
speller opened this issue Apr 6, 2024 · 0 comments

speller commented Apr 6, 2024

Problem
This is a combination of two problems we have:

  1. We use AWS OpenSearch Service, which has specific limits on the maximum HTTP request size: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/limits.html#network-limits . For example, an m6g.large.search instance is enough for our workload, but it has a limit of 10 MiB. If our batches exceed this limit, indexing fails, and we either have to buy a bigger instance, which increases cloud costs significantly, or decrease the batch size significantly to fit the limit.
  2. Our products have very different data sizes for indexing, from 5 KiB to 0.5 MiB, so a batch of 100 products can easily exceed the limit. The spread is very uneven: a batch of 100 documents may range from 1 MiB to 50 MiB. Decreasing the batch size hurts indexing performance, so it doesn't make sense to keep small batch sizes on our dataset (which is large, more than 1M products). A rough size estimate is sketched after this list.
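
For illustration (not part of the patch below), here is a rough sketch of how the bulk payload size compares to the 10 MiB limit; the document shape, the sizes, and the variable names are hypothetical:

<?php
// Illustrative only: estimate the OpenSearch bulk payload size for a batch.
// The bulk body is NDJSON (one action line plus one source line per document),
// so the JSON-encoded documents dominate the request size.
$maxRequestBytes = 10 * 1024 * 1024; // 10 MiB network limit of m6g.large.search

// Hypothetical batch: 100 documents of roughly 0.5 MiB each.
$documents = array_fill(0, 100, ['description' => str_repeat('x', 512 * 1024)]);

$payloadBytes = 0;
foreach ($documents as $id => $document) {
    $payloadBytes += strlen(json_encode(['index' => ['_id' => $id]])) + 1; // action line + "\n"
    $payloadBytes += strlen(json_encode($document)) + 1;                   // source line + "\n"
}

// Roughly 50 MiB here, far above the 10 MiB limit, so the whole request would be rejected.
printf("payload: %.1f MiB, fits: %s\n", $payloadBytes / 1048576, $payloadBytes <= $maxRequestBytes ? 'yes' : 'no');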

Solution
Add the possibility to limit the batch data size, not only the batch row count. We currently achieve this by applying the following patch:

--- a/src/module-elasticsuite-core/Indexer/GenericIndexerHandler.php
+++ b/src/module-elasticsuite-core/Indexer/GenericIndexerHandler.php
@@ -101,6 +101,7 @@
      */
     public function saveIndex($dimensions, \Traversable $documents)
     {
+        $maxBatchDataSize = $this->indexSettings->getBatchIndexingDataSize();
         foreach ($dimensions as $dimension) {
             $storeId   = $dimension->getValue();

@@ -120,8 +121,15 @@
                 }

                 if (!empty($batchDocuments)) {
-                    $bulk = $this->indexOperation->createBulk()->addDocuments($index, $batchDocuments);
-                    $this->indexOperation->executeBulk($bulk);
+                    if ($maxBatchDataSize !== null) {
+                        foreach (self::splitBatchByDataSize($batchDocuments, $maxBatchDataSize) as $subBatch) {
+                            $bulk = $this->indexOperation->createBulk()->addDocuments($index, $subBatch);
+                            $this->indexOperation->executeBulk($bulk);
+                        }
+                    } else {
+                        $bulk = $this->indexOperation->createBulk()->addDocuments($index, $batchDocuments);
+                        $this->indexOperation->executeBulk($bulk);
+                    }
                 }
             }

@@ -132,6 +140,48 @@

         return $this;
     }
+
+    private static function splitBatchByDataSize(array &$batch, int $maxBatchDataSize): array
+    {
+        // Measure the JSON size of each batch and keep splitting until every batch fits within the max batch data size.
+        $batches = [$batch];
+        $loopCount = 0;
+        for ($i = 0; $i < count($batches);) {
+            $subBatch = $batches[$i];
+            $jsonSize = strlen(json_encode($subBatch));
+            if ($jsonSize > $maxBatchDataSize) {
+                // If the batch is bigger, split it into two, replace the current one, append the second, and run the loop
+                // again on the same index to split again if needed.
+                $twoBatches = self::splitBatch($subBatch);
+                $batches[$i] = $twoBatches[0];
+                $batches[] = $twoBatches[1];
+                $loopCount++;
+                if ($loopCount > 100) {
+                    throw new \RuntimeException('Batch split loop limit reached');
+                }
+            } else {
+                $i++;
+                $loopCount = 0;
+            }
+        }
+
+        return $batches;
+    }
+
+    private static function splitBatch(array &$batch): array
+    {
+        if (count($batch) == 1) {
+            throw new \RuntimeException('Batch split failed. Batch size is 1');
+        }
+        $result = array_chunk($batch, (int)floor(count($batch) / 2));
+        if (count($result) > 2) {
+            $result[1] = array_merge($result[1], $result[2]);
+            unset($result[2]);
+        }
+
+        return $result;
+    }
+

     /**
      * {@inheritDoc}
--- a/src/module-elasticsuite-core/Helper/IndexSettings.php
+++ b/src/module-elasticsuite-core/Helper/IndexSettings.php
@@ -193,6 +193,15 @@
     }

     /**
+     * Get the max batch indexing data size from the configuration.
+     */
+    public function getBatchIndexingDataSize(): ?int
+    {
+        $value = $this->getIndicesSettingsConfigParam('batch_indexing_data_size');
+        return $value ? (int) $value : null;
+    }
+
+    /**
      * Get the indices pattern from the configuration.
      *
      * @return string
--- a/src/module-elasticsuite-core/Index/IndexSettings.php
+++ b/src/module-elasticsuite-core/Index/IndexSettings.php
@@ -188,6 +188,11 @@
         return $this->helper->getBatchIndexingSize();
     }

+    public function getBatchIndexingDataSize(): ?int
+    {
+        return $this->helper->getBatchIndexingDataSize();
+    }
+
     /**
      * {@inheritDoc}
      */
--- a/src/module-elasticsuite-core/Api/Index/IndexSettingsInterface.php
+++ b/src/module-elasticsuite-core/Api/Index/IndexSettingsInterface.php
@@ -90,6 +90,11 @@
     public function getBatchIndexingSize();

     /**
+     * Get the maximum batch data size for indexing.
+     */
+    public function getBatchIndexingDataSize(): ?int;
+
+    /**
      * Get dynamic index settings per store (language).
      *
      * @param integer|string|\Magento\Store\Api\Data\StoreInterface $store Store.

This patch evaluates the data size of the batch to be indexed and splits it into multiple sub-batches until every sub-batch is no larger than the limit. The algorithm is probably not ideal, but it was developed under time constraints. Since the data size can only be measured by converting the batch to JSON, I tried to minimize the number of json_encode calls to avoid hurting performance. I also haven't done enough performance testing to tell whether it's feasible to calculate the size of every row and pack batches so that they use the request size limit more efficiently. I think it would be nice to implement such functionality in the package.
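
For comparison, below is a minimal sketch of the per-row approach mentioned above: packing documents greedily by their individual JSON size. It is not part of the patch, the function name is hypothetical, and whether the extra json_encode call per document is acceptable would need benchmarking:

<?php
/**
 * Hypothetical helper, not part of the patch above: greedily pack documents into
 * sub-batches whose JSON payload stays under $maxBatchDataSize. It trades one
 * json_encode() call per document for sub-batches that fill the limit more tightly.
 */
function packBatchesByDataSize(array $documents, int $maxBatchDataSize): array
{
    $batches      = [];
    $current      = [];
    $currentBytes = 0;

    foreach ($documents as $id => $document) {
        $documentBytes = strlen(json_encode($document));

        // Start a new sub-batch when adding this document would exceed the limit.
        if ($current !== [] && $currentBytes + $documentBytes > $maxBatchDataSize) {
            $batches[]    = $current;
            $current      = [];
            $currentBytes = 0;
        }

        // Note: a single document larger than the limit still ends up in its own
        // over-limit sub-batch and would need separate handling.
        $current[$id]  = $document;
        $currentBytes += $documentBytes;
    }

    if ($current !== []) {
        $batches[] = $current;
    }

    return $batches;
}

Compared to the bisection in the patch, this packs each request closer to the limit, at the cost of an additional json_encode pass over every document.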

speller added the feature label Apr 6, 2024