

# ML-Based Optimum Number of CUDA Streams for the GPU Implementation of the Tridiagonal Partition Method

Milena Veneva\*, [milena.p.veneva@gmail.com](mailto:milena.p.veneva@gmail.com),

Toshiyuki Imamura\*, [imamura.toshiyuki@riken.jp](mailto:imamura.toshiyuki@riken.jp)

\*RIKEN Center for Computational Science, R-CCS, 7-1-26 Minatojima-minami-machi, Chuo-ku,  
Kobe, Hyogo 650-0047, Japan

## Abstract

This paper presents a heuristic for finding the optimum number of CUDA streams by using tools common to the modern AI-oriented approaches and applied to the parallel partition algorithm. A time complexity model for the GPU realization of the partition method is built. Further, a refined time complexity model for the partition algorithm being executed on multiple CUDA streams is formulated. Computational experiments for different SLAE sizes are conducted, and the optimum number of CUDA streams for each of them is found empirically. Based on the collected data a model for the sum of the times for the non-dominant GPU operations (that take part in the stream overlap) is formulated using regression analysis. A fitting non-linear model for the overhead time connected with the creation of CUDA streams is created. Statistical analysis is done for all the built models. An algorithm for finding the optimum number of CUDA streams is formulated. Using this algorithm, together with the two models mentioned above, predictions for the optimum number of CUDA streams are made. Comparing the predicted values with the actual data, the algorithm is deemed to be acceptably good.

## 1. Introduction

The parallel partition algorithm for solving systems of linear algebraic equations (SLAEs) suggested in [1] is an efficient numerical approach for solving SLAEs with tridiagonal coefficient matrices, splitting the matrix into sub-matrices and then solving smaller SLAEs in parallel. The algorithm was initially intended for a large number of processors and was implemented with the help of the MPI technology in [1].

The development of HPC applications involves two major steps: developing correct code, and improving the code for performance. We present one of the optimizations made to our CUDA [2] implementation, namely building a heuristic for finding the optimum number of CUDA streams by using tools common to the modern AI-oriented approaches.

CUDA devices contain engines for various tasks, e.g., memory copy and kernel execution. If memory transfers and kernels' execution are dispatched into different streams, these operations can be overlapped. The copy-compute overlap allows us to hide, partially or entirely, the memory transfers behind computations (or the other way around), and thus shorten the total time for the program. However, after a certain amount of streams, the performance would stop increasing, because the overhead of creating the additional streams would be bigger than the actual execution time boost. Thus, having a heuristic which predicts the optimum number of CUDA streams for each SLAE size would allow us to gain utmost of the GPU performance.

Most of the known heuristics (for instance, [9]) depend on the times for the memory transfers and the computational kernels being executed without copy-compute overlap which demands further trials constantly. This is impractical because it increases the computational time on the respective supercomputer or cluster. Therefore, we suggest a different approach.

## 2. Building a heuristic for the optimum number of CUDA streams

The heuristic processes of optimization in solving SLAEs on GPUs are summarized as of NVIDIA GPU RTX 2080 Ti [3], [4], for SLAE sizes  $10^i$ ,  $2.5 \times 10^i$ ,  $4 \times 10^i$ ,  $5 \times 10^i$ ,  $7.5 \times 10^i$ , and  $8 \times 10^i$ ,  $i = 3, 4, \dots, 7$ , sub-system size equal to 10; 256 CUDA threads per block; FP64 precision.

### 2.1. Hardware working queues and CUDA streams.

The Hyper-Q technology allows for multiple host threads to schedule work on the GPU simultaneously by using multiple host-device hardware connections. In case Hyper-Q is not supported by the device, the concurrency that can be achieved by copy-compute overlap is limited. If the number of created streams is bigger than the number of hardware working queues, multiple streams use one and the same queue. As the maximum number of hardware working queues for devices that support Hyper-Q is 32, using more than 32 streams is going to lead to serialization. Hence, we are considering only number of streams that are powers of 2, up to 32.

## 2.2. Time complexity model.

The time complexity model for the partition method (consisting of two stages on the GPU – Stage 1, and 3, and one on the CPU – Stage 2) is:

$$T_{non\_str} = (T_1^{H2D} + T_1^{COMP} + T_1^{D2H}) + T_2^{COMP} + (T_3^{H2D} + T_3^{COMP} + T_3^{D2H}), \quad (1)$$

where  $T_i^*$  is the memory transfer/kernel time of Stage  $i$ ,  $i = 1, 2, 3$ . Applying copy-compute overlap, and recalling that the algorithm is memory-bound, then the time for the computational kernels is going to be effectively hidden behind the memory transfers. Taking into account algorithm's nature, it is clear that  $T_1^{H2D} > T_1^{D2H}$ , and  $T_3^{H2D} < T_3^{D2H}$ . Up to an SLAE size  $\leq 10^5$ ,  $T_1^{COMP} > T_1^{D2H}$ , and  $T_3^{COMP} > T_3^{D2H}$ . Thus, the refined model is:

$$T_{str} = T_1^{H2D} + \frac{T_1^{COMP} + T_1^{D2H} + T_3^{H2D} + T_3^{COMP}}{\text{num\_str}} + T_2^{COMP} + T_3^{D2H} + T_{\text{overhead}}, \quad (2)$$

where  $T_{\text{overhead}}$  is the overhead from the creation of CUDA streams. This is a lower bound estimation for  $T_{str}$  ( $T_1^{H2D}$  is not always big enough to hide both  $T_1^{COMP}$ , and  $T_1^{D2H}$ ). Also, the GPU experiences idle times for SLAE sizes that do not saturate it. The same is true for Stage 3. This model is exact for big SLAE sizes though. The model is proven by the NVIDIA Nsight systems [5] profiles in Figure 1.



Figure 1: Nsight Systems profiles of the kernel responsible for Stage 1 (a), and the kernel responsible for Stage 3 (b); SLAE size  $10^8$ , sub-system size 10, 10 CUDA streams. Cyan/pink boxes denote H2D/D2H memory transfers; blue – kernels.

## 2.3. Overview of existing heuristics.

The overhead of forking CUDA streams in [6] is calculated as the product of the number of streams, and time for creating one stream  $\tau$  which should be computed for each GPU (it was found experimentally that  $\tau = 0.004448$  ms for NVIDIA RTX 2080 Ti). Thus, we can find the optimum number of streams by finding the first derivative of the time complexity model with regard to the number of streams. As a result, the optimum number of CUDA streams depends only on  $T_1^{COMP}$ ,  $T_1^{D2H}$ ,  $T_3^{H2D}$ ,  $T_3^{COMP}$  (all measured when no streams are used), and  $\tau$ . Applying this approach to SLAE sizes  $4 \times 10^i$ ,  $i = 3, 4, \dots, 7$ , it becomes clear that this model does not work well for the partition method, because it predicts a much higher number of streams than needed (for SLAE size  $4 \times 10^3$ : 8 vs. 1; for SLAE size  $4 \times 10^7$ : 140 vs. 32, etc.). The results can be seen in Table 1. Figure 3 also proves that this approximation does not satisfy the empirical data. We should consider a different model.

TABLE 1: Times for the GPU operations of the partition method which take part in the stream overlap (given in [ms]) for SLAE sizes  $4 \times 10^i$ ,  $i = 3, 4, \dots, 7$ , and comparison between the optimum number of CUDA streams according to [6] (column *optimum streams* [6]), and according to the computational experiments (column *actual optimum streams*).

| size            | $T_1^{COMP}$ | $T_1^{D2H}$ | $T_3^{H2D}$ | $T_3^{COMP}$ | sum       | optimum streams [6] | actual optimum streams |
|-----------------|--------------|-------------|-------------|--------------|-----------|---------------------|------------------------|
| $4 \times 10^3$ | 0.221312     | 0.014848    | 0.006592    | 0.030688     | 0.273440  | 7.8                 | 1                      |
| $4 \times 10^4$ | 0.216544     | 0.057312    | 0.015456    | 0.038112     | 0.327424  | 8.6                 | 1                      |
| $4 \times 10^5$ | 0.393184     | 0.402944    | 0.102784    | 0.205408     | 1.104320  | 15.8                | 4                      |
| $4 \times 10^6$ | 1.993980     | 3.897410    | 0.975392    | 2.130500     | 8.997282  | 45.0                | 32                     |
| $4 \times 10^7$ | 17.451500    | 38.836800   | 9.606720    | 20.981600    | 86.876620 | 139.8               | 32                     |

## 2.4. Building mathematical models for sum, and $T_{\text{overhead}}$ .

Let us define the sum of times of the GPU operations that take part in the stream overlap (see Eq. (2)) as:

$$\text{sum} = T_1^{\text{COMP}} + T_1^{\text{D2H}} + T_3^{\text{H2D}} + T_3^{\text{COMP}}. \quad (3)$$

According to the computational results (shown on Figure 2), sum is linearly dependent on the SLAE size. To build a heuristic for the optimum number of CUDA streams, we need to have a mechanism to predict the sum. We perform regression analysis (supervised machine learning), where the dependent variable is the sum, and the independent variable is the SLAE size. The data for the sum is split into training, and test set with the help of the `scikit-learn` [7] routine `train_test_split`, shuffle turned on, and splitting ratio 3 : 1. The obtained model (marked as *linear regressions model* on Figure 2) is:

$$\text{sum\_model} = 0.0000021890017149 \times \text{SLAE\_size} + 0.1470644998564126. \quad (4)$$

According to the R-squared coefficients (training: 0.9999813476643502, test: 0.9999942108504311) there is a perfect correlation between the observed and the predicted values. The mean squared error (MSE) is 0.02 (test), the residual being smaller for smaller sums of times. The model is acceptable.



Figure 2: Comparison of the sum of times for different SLAE sizes.



Figure 3:  $T_{\text{overhead}}$  for different SLAE sizes, on different number of CUDA streams.

Let us define:

$$T_{\text{overhead}} = (T_{\text{str}} - T_{\text{non\_str}}) + \frac{\text{num\_str} - 1}{\text{num\_str}} \times \text{sum}. \quad (5)$$

Figure 3 shows that  $T_{\text{overhead}}$  increases logarithmically with the increase in CUDA streams, following different patterns for SLAE sizes  $\leq 10^6$ , and  $> 10^6$  (when the GPU starts getting utilized better).

The optimum number of streams comes when (see Table 2 for an illustrative example):

$$T_{\text{overhead}} < \frac{\text{num\_str} - 1}{\text{num\_str}} \times \text{sum}, \quad (6)$$

TABLE 2: Times (given in [ms]) for the partition method for SLAE size  $10^6$ , for different number of streams.

| num_str | $T_{\text{str}}$ | $T_{\text{non\_str}}$ | sum      | $T_{\text{overhead}}$ | $\frac{\text{num\_str}-1}{\text{num\_str}} \times \text{sum} - T_{\text{overhead}}$ |
|---------|------------------|-----------------------|----------|-----------------------|-------------------------------------------------------------------------------------|
| 2       | 7.999136         | 8.817440              | 2.433568 | 0.398480              | 0.818304                                                                            |
| 4       | 7.533248         | 8.817440              | 2.433568 | 0.540984              | 1.284192                                                                            |
| 8       | 7.401472         | 8.817440              | 2.433568 | 0.713404              | 1.415968                                                                            |
| 16      | 7.445952         | 8.817440              | 2.433568 | 0.909982              | 1.371488                                                                            |
| 32      | 7.599968         | 8.817440              | 2.433568 | 1.140047              | 1.217472                                                                            |

and the difference between the term to the right-hand side of Eq. (6) and  $T_{\text{overhead}}$  is the biggest among the ones that fulfill the inequality. According to the computational experiments,  $T_{\text{overhead}}$  depends on both the SLAE size, and the number of CUDA streams (Figure 3), and the trend is non-linear. The equations of the models (*small* for SLAE sizes  $\leq 10^6$ , and *big* for SLAE sizes  $> 10^6$ ) are:

$$\begin{aligned}
 T_{\text{overhead\_model\_small}} &= 0.0000002245645331 \times \text{SLAE\_size} \\
 &+ 0.6009426920043296 \times \log_{10}(\text{num\_streams}) - 0.0605183610625299, \\
 T_{\text{overhead\_model\_big}} &= (0.0000000356594859 \times \text{SLAE\_size} \\
 &+ 0.0522781620855163) \times \log_2(\text{num\_streams}^{\frac{4}{3}}) + 0.3941472844770443.
 \end{aligned} \tag{7}$$

The two models are built with the help of the SciPy [8] routine `curve_fit`, and the form of the functions is preset (different fitting curves were tested). The data is split into training, and test set, shuffle turned on, and splitting ratio 3 : 1. The dependent variable is  $T_{\text{overhead}}$ , and the independent variables are the SLAE size, and num\_streams. The model metrics (Table 3) show that both the models are a good fit for the data. Further, Figure 4 shows that the fitted values for

TABLE 3:  $T_{\text{overhead}}$  models' metrics.

| set      | metric                     | model_small        | model_big          |
|----------|----------------------------|--------------------|--------------------|
| training | R-squared                  | 0.9531711290769591 | 0.9933780389080090 |
|          | MSE = mean squared error   | 0.0050126881205798 | 0.2451169015984794 |
|          | RMSE = $\sqrt{\text{MSE}}$ | 0.0708003398337877 | 0.4950928211946518 |
| test     | R-squared                  | 0.9549695579010460 | 0.9896761975222511 |
|          | MSE = mean squared error   | 0.0044441139999724 | 0.1447752928068124 |
|          | RMSE = $\sqrt{\text{MSE}}$ | 0.0666641882870588 | 0.3804934858927448 |

SLAE sizes  $\leq 10^6$  are reasonably close to the actual values since the two distributions overlap almost completely, while the distribution of the fitted values for SLAE sizes  $> 10^6$  has a smaller variance compared to the actual ones.

### 3. Discussion and Conclusions

Using the algorithm described earlier, and the models for sum (Eq. (4)), and  $T_{\text{overhead}}$  (Eq. (7)), the optimum number of CUDA streams for different SLAE sizes are predicted (the results are summarized in Table 4). For SLAE sizes  $\leq 10^5$  the optimum number of CUDA streams is 1, that is, the gain from using more streams is smaller than the overhead from creating them, while for SLAE sizes  $\geq 4 \times 10^6$  the optimum number is 32, because the SLAE size is big enough to require more GPU resources. There are only two wrongly predicted values: for SLAE size  $10^5$  the algorithm predicts 2 instead of 1 stream. The experiments show that the computational times for these number of streams are 1.239040 ms, and 1.230176 ms, respectively, so the difference is 0.008864 ms, which is negligible as a percentage of the computational time. The same is valid for size  $5 \times 10^5$  (0.016224 ms difference between the computational times). Thus, the algorithm is acceptably good. The performance improvement we achieved is up to 1.30 (for SLAE sizes  $8 \times 10^7$  and  $10^8$ ). The same ML-based approach could be applied to other applications as well.

#### 3.1. Experiments on other NVIDIA GPU cards.

The computational experiments that were performed on the basis of NVIDIA GPU card RTX A5000 [10], [11] showed that although this card has appox. 1.25 times bigger peak memory bandwidth than NVIDIA RTX 2080Ti [3], [4], the stream heuristic preserves. The reason for this invariance most probably could be explained by the fact that the amounts of registers and shared memory on both the cards are the same, and these are the most utilized GPU resources by the algorithm.



(a)



(b)

Figure 4: Distribution of the actual vs. predicted values for the model of  $T_{\text{overhead}}$  for SLAE sizes  $\leq 10^6$  (a), and  $> 10^6$  (b).

### 3.2. Experiments with FP32 precision.

Some additional tests with FP32 precision were made. The experimental results (summarized in Table 5) show that in 7 out of 16 SLAE sizes the optimum number of CUDA streams when using FP32 precision is half the optimum number of CUDA streams when using FP64 precision. This results were expected since when using single precision we need smaller number of memory transactions, and hence the memory utilization of the GPU is smaller than in comparison with the FP64 case. In the other 9 cases of SLAE sizes the optimum number of CUDA streams when using FP32 precision is the same as the optimum number of CUDA streams when using FP64 precision, and the difference between the computational times on these number of streams and half the number of streams is negligible as a percentage of the computational times. Overall, we recommend using the already built heuristic for FP64 precision, and to divide the optimum number of streams by two

TABLE 4: Optimum number of CUDA streams – experiments ( $N_{\text{act}}$  vs. model predictions ( $N_{\text{pre}}$ ).

| size            | $N_{\text{act}}$ | $N_{\text{pre}}$ | size              | $N_{\text{act}}$ | $N_{\text{pre}}$ | size              | $N_{\text{act}}$ | $N_{\text{pre}}$ |
|-----------------|------------------|------------------|-------------------|------------------|------------------|-------------------|------------------|------------------|
| $10^3$          | 1                | 1                | $4 \times 10^5$   | 4                | 4                | $10^7$            | 32               | 32               |
| $4 \times 10^3$ | 1                | 1                | $5 \times 10^5$   | 8                | <b>4</b>         | $2.5 \times 10^7$ | 32               | 32               |
| $5 \times 10^3$ | 1                | 1                | $8 \times 10^5$   | 8                | 8                | $4 \times 10^7$   | 32               | 32               |
| $8 \times 10^3$ | 1                | 1                | $10^6$            | 8                | 8                | $5 \times 10^7$   | 32               | 32               |
| $10^4$          | 1                | 1                | $2.5 \times 10^6$ | 16               | 16               | $7.5 \times 10^7$ | 32               | 32               |
| $4 \times 10^4$ | 1                | 1                | $4 \times 10^6$   | 32               | 32               | $8 \times 10^7$   | 32               | 32               |
| $5 \times 10^4$ | 1                | 1                | $5 \times 10^6$   | 32               | 32               | $10^8$            | 32               | 32               |
| $8 \times 10^4$ | 1                | 1                | $7.5 \times 10^6$ | 32               | 32               |                   |                  |                  |
| $10^5$          | 1                | <b>2</b>         | $8 \times 10^6$   | 32               | 32               |                   |                  |                  |

when using single precision.

TABLE 5: The optimum number of CUDA streams for certain SLAE sizes according to the experiments with FP32 precision, and comparison with the optimum number of CUDA streams for FP64 precision.

| SLAE size         | optimum<br>num_str<br>FP32 | optimum<br>num_str<br>FP64 | comparison  |
|-------------------|----------------------------|----------------------------|-------------|
| $\leq 10^5$       | 1                          | 1                          | same        |
| $4 \times 10^5$   | 2                          | <b>4</b>                   | <b>half</b> |
| $5 \times 10^5$   | 4                          | <b>8</b>                   | <b>half</b> |
| $8 \times 10^5$   | 8                          | 8                          | same        |
| $10^6$            | 4                          | <b>8</b>                   | <b>half</b> |
| $2.5 \times 10^6$ | 16                         | 16                         | same        |
| $4 \times 10^6$   | 16                         | <b>32</b>                  | <b>half</b> |
| $5 \times 10^6$   | 16                         | <b>32</b>                  | <b>half</b> |
| $7.5 \times 10^6$ | 32                         | 32                         | same        |
| $8 \times 10^6$   | 32                         | 32                         | same        |
| $10^7$            | 16                         | <b>32</b>                  | <b>half</b> |
| $2.5 \times 10^7$ | 16                         | <b>32</b>                  | <b>half</b> |
| $4 \times 10^7$   | 32                         | 32                         | same        |
| $5 \times 10^7$   | 32                         | 32                         | same        |
| $7.5 \times 10^7$ | 32                         | 32                         | same        |
| $8 \times 10^7$   | 32                         | 32                         | same        |
| $10^8$            | 32                         | 32                         | same        |

## Acknowledgement

The authors would like to thank Dr. Alexander Ayriyan (JINR) for their precious comments.

## References

- [1] Austin T.M., and Berndt M., and Moulton J.D. *A Memory Efficient Parallel Tridiagonal Solver*. Preprint LA-VR-03-4149, 13 p. (2004).
- [2] NVIDIA, *NVIDIA CUDA C++ Programming Guide*. <https://docs.nvidia.com/cuda/cuda-c-programming-guide/> (2024).
- [3] Techpowerup, *NVIDIA GeForce RTX 2080 Ti Specifications*, <https://www.techpowerup.com/gpu-specs/geforce-rtx-2080-ti.c3305> (2018).
- [4] NVIDIA, *GeForce RTX 20 Series*, <https://www.nvidia.com/en-eu/geforce/20-series/> (2018).
- [5] NVIDIA, *Nsight Systems User Guide*, <https://docs.nvidia.com/nsight-systems/UserGuide/index.html> (2024).
- [6] Gómez-Luna J., and González-Linares J.M., and Benavides J.I., and Guil N., *Performance models for CUDA streams on NVIDIA GeForce series*, J. Parallel Distrib. Comput., 72(9), pp. 1117–1126 (2011).
- [7] Pedregosa F., and Varoquaux G., and Gramfort A., and Michel V., and Thirion B., and Grisel O., and Blondel M., and Prettenhofer P., and Weiss R., and Dubourg V., and Vanderplas J., and Passos A., and Cournapeau D., and Brucher M., and Perrot M., and Duchesnay É., *Scikit-learn: Machine Learning in Python*, Journal of Machine Learning Research, 12, pp. 2825–2830, (2011).

- [8] Virtanen P., and Gommers R., and Oliphant T.E., and Haberland M., and Reddy T., and Cournapeau D., and Burovski E., and Peterson P., and Weckesser W., and Bright J., and van der Walt S. J., and Brett M., and Wilson J., and Millman K. J., and Mayorov N., and Nelson A. R. J., and Jones E., and Kern R., and Larson E., and Carey C. J., and Polat I., and Feng Y., and Moore E. W., and VanderPlas J., and Laxalde D., and Perktold J., and Cimrman R., and Henriksen I., and Quintero E. A., and Harris C. R., and Archibald A. M., and Ribeiro A. H., and Pedregosa F., and van Mulbregt P., *SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python*, Nature Methods, 17(3), pp. 261–272, (2020).
- [9] Werkhoven B. van, and Maassen J., and Steinstra F.J., and Bal H. E. , *Performance models for CPU-GPU data transfers*, 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, (2014), pp. 11–20, doi: 10.1109/CCGrid.2014.16.
- [10] NVIDIA, *NVIDIA RTX A5000 Datasheet*, <https://nvdam.widen.net/s/wrqrqt75vh/nvidia-rtx-a5000-datasheet> (2021).
- [11] NVIDIA, *Introducing NVIDIA RTX A5000 Graphics Card*, <https://www.nvidia.com/en-us/design-visualization/rtx-a5000/> (2021).