



# Accelerating domain propagation: An efficient GPU-parallel algorithm over sparse matrices<sup>☆</sup>

Boro Sofranac <sup>a,b,\*</sup>, Ambros Gleixner <sup>c,b</sup>, Sebastian Pokutta <sup>a,b</sup>

<sup>a</sup> Berlin Institute of Technology, Strasse des 17. Juni 135, 10623 Berlin, Germany

<sup>b</sup> Zuse Institute Berlin, Takustrasse 7, 14195 Berlin, Germany

<sup>c</sup> HTW Berlin, Treskowallee 8, 10318 Berlin, Germany

## ARTICLE INFO

### Keywords:

Mixed integer linear programming

MIP

GPU

Domain propagation

Bound tightening

Parallel algorithms

## ABSTRACT

- Currently, domain propagation in state-of-the-art MIP solvers is single thread only.
- The paper presents a novel, efficient GPU algorithm to perform domain propagation.
- Challenges are dynamic algorithmic behavior, dependency structures, sparsity patterns.
- The algorithm is capable of running entirely on the GPU with no CPU involvement.
- We achieve speed-ups of around 10x to 20x, up to 180x on favorably-large instances.

## 1. Introduction

Given a matrix  $A \in \mathbb{R}^{m \times n}$ , vectors  $b \in \mathbb{R}^m$ ,  $c \in \mathbb{R}^n$ , and a subset  $I \subseteq N = \{1, \dots, n\}$ , a *mixed integer linear program* (MIP) is an optimization problem that can be written as

$$\min\{c^T x \mid Ax \leq b, x \in \mathbb{R}^n, x_j \in \mathbb{Z} \text{ for all } j \in I\}. \quad (1)$$

Typically, MIPs are  $\mathcal{NP}$ -hard to solve, but surprisingly fast algorithms exist in practice [1,2]. The most successful methods to solve MIPs are the *branch-and-bound* algorithm and its extensions. The main idea of this algorithm is to split the original problem into subproblems which are easier to solve (*branching*). By repeatedly branching on the subproblems, a search tree is obtained. At each subproblem, the *bounding* step uses relaxations to compute lower bounds and prune suboptimal nodes of the tree in order to avoid enumerating exponentially many subproblems. Relaxations are typically obtained by dropping integrality requirements on variables  $x_j$  for  $j \in I$ , at which point a *linear program* (LP) is obtained and solved by, e.g., the simplex algorithm [3].

One of the supplementary techniques used to improve the initial problem formulation and decrease the size of the branching tree is to limit the domains of the variables  $x_j$  to the value assignments which

can be completed to *feasible solutions* of a given (sub)problem. The process of computing these ranges is called *domain propagation* [4,5].

Whereas many areas of applied mathematics (e.g., Deep Learning) have strongly profited from the increased computing power of parallel coprocessors like *Graphics Processing Units* (GPUs), success in using these resources to solve MIPs has been very limited [6]. The main challenges for developing GPU-accelerated MIP techniques have been (a) highly sparse input data with irregular sparsity patterns; (b) inherent sequential properties of many of the best methods used in the field; and (c) their non-uniform algorithmic behavior. On top of this, many of the existing algorithms were designed with a sequential execution model in mind, which does not allow to transform them easily to an efficient GPU implementation.

In this paper, we focus on domain propagation and propose an efficient algorithm for GPUs. In addition to being able to obtain results faster with this algorithm under almost all circumstances, we hope that this work can pave the way for the implementation of other methods or motivate the development of new optimization methods on GPUs. Our algorithm is capable of running propagation rounds entirely on the GPU, without the involvement of the CPU. Consequently, this also presents an opportunity to investigate the use of domain propagation in conjunction with existing or future methods that also run completely on

<sup>☆</sup> A conference version of this paper has appeared in the proceedings of IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), 2020. This research was partially conducted within the Research Campus MODAL funded by the German Federal Ministry of Education and Research (BMBF grant numbers 05M14ZAM, 05M20ZBM) as well as the DFG Cluster of Excellence MATH+ (EXC-2046/1, project id 390685689) funded by the Deutsche Forschungsgemeinschaft (DFG), Germany.

\* Corresponding author at: Zuse Institute Berlin, Takustrasse 7, 14195 Berlin, Germany.

E-mail addresses: [sofranac@zib.de](mailto:sofranac@zib.de) (B. Sofranac), [gleixner@zib.de](mailto:gleixner@zib.de) (A. Gleixner), [pokutta@zib.de](mailto:pokutta@zib.de) (S. Pokutta).

the GPU. This is especially interesting given the fact that state-of-the-art MIP solvers largely do not exploit GPUs during the solving process, leaving them idle.

We begin by formally introducing the necessary background and notation.

### 1.1. Domain propagation

Domain propagation is an iterative technique used to tighten the bounds of variables at each node of the branch-and-bound tree. It is also used as a pre-processing technique to improve the formulation of a given MIP model [7]. In the following, we consider linear constraints of the form

$$\underline{\beta} \leq a^T x \leq \bar{\beta}, \quad (2)$$

where  $\underline{\beta}, \bar{\beta} \in \mathbb{R} \cup \{-\infty, \infty\}$  are left and right hand sides, respectively, and  $a \in \mathbb{R}^n$  is the vector of constraint coefficients. We assume that variables  $x_j$  have lower and upper bounds  $\ell_j, u_j \in \mathbb{R} \cup \{-\infty, \infty\}$ . Before stating the formulas for updating variable bounds, we need the following definition.

**Definition 1 (Activity Bounds).** Given a constraint of form (2) and bounds  $\ell \leq x \leq u$ , we call

$$\underline{\alpha} := \sum_{i=0}^n a_i b_i \text{ with } b_i = \begin{cases} \ell_i & \text{if } a_i > 0 \\ u_i & \text{if } a_i \leq 0 \end{cases} \quad (3a)$$

the *minimum activity*, and

$$\bar{\alpha} := \sum_{i=0}^n a_i b_i \text{ with } b_i = \begin{cases} u_i & \text{if } a_i > 0 \\ \ell_i & \text{if } a_i \leq 0 \end{cases} \quad (3b)$$

the *maximum activity* of the constraint.

Then domain propagation is based on three observations [5, Sec. 7.1] and translates into the following algorithmic steps.

- 1: If  $\underline{\beta} \leq \underline{\alpha}$  and  $\bar{\beta} \leq \bar{\alpha}$ , then the constraint is redundant and can be removed.
- 2: If  $\underline{\alpha} > \bar{\beta}$  or  $\bar{\beta} > \bar{\alpha}$ , then the constraint cannot be satisfied and hence the entire (sub)problem is infeasible.
- 3: Let  $x$  satisfy (2), then for all  $j = 1, \dots, n$  with  $a_j > 0$ ,

$$\frac{\underline{\beta} - \bar{\alpha}}{a_j} + u_j \leq x_j \leq \frac{\bar{\beta} - \underline{\alpha}}{a_j} + \ell_j, \quad (4a)$$

and for all  $j = 1, \dots, n$  with  $a_j < 0$ ,

$$\frac{\bar{\beta} - \underline{\alpha}}{a_j} + u_j \leq x_j \leq \frac{\underline{\beta} - \bar{\alpha}}{a_j} + \ell_j. \quad (4b)$$

If  $x_j$  is integral, lower bounds can be rounded up and upper bounds can be rounded down to strengthen them further.

If Steps 1 and 2 are not applicable, the propagation algorithm computes new bounds  $\ell_j^{\text{new}}, u_j^{\text{new}}$  in Step 3. If  $\ell_j^{\text{new}} > \ell_j$  or  $u_j^{\text{new}} < u_j$  for some  $j$ , then the bounds of variable  $x_j$  are updated. To apply domain propagation to a MIP formulation, one simply applies these steps to all of its constraints.

An actual implementation may skip Steps 1 and 2 without changing the result. This is because for redundant constraints Step 3 correctly detects no bound tightenings, and for infeasible constraints, Step 3 leads to at least one variable with an empty domain, i.e.,  $\ell_j^{\text{new}} > u_j^{\text{new}}$ .

Let us consider what happens when the steps are applied to all constraints of the system and some bound changes have been found. First, note that the formulas for computing new variable bounds in (4a) and (4b) depend on the activity bounds  $\underline{\alpha}$  and  $\bar{\alpha}$ . The activity bounds themselves depend on  $\ell_j$  and  $u_j$ . Second, note that several constraints typically share the same variable  $x_j$ . In this case, updating the bounds of  $x_j$  during the processing of one constraint changes the bound and

activity values for all other constraints that contain  $x_j$ . Propagating the affected constraints again can then lead to further tightenings.

Hence, domain propagation is usually applied iteratively. Each iteration of applying Steps 1 to 3 to all or a subset of the constraints is also called a *propagation round*. If no bound updates are found during a given round, then no further improvements are possible, and the algorithm terminates.

As pointed out in [4], iterated domain propagation can be interpreted as a fixed-point iteration in the space of variable and activity bounds, and there exists a unique limit point of this fixed-point iteration. Iterated domain propagation converges to this well-defined result, however, not necessarily in finite time [4]. Also, convergence can be very slow in practice [5]. The most common way to deal with non-finite or slow convergence in practice is to terminate when the improvements made fall below a specified threshold. Such tolerance-based termination criteria yield finite termination, but may fail to compute the tightest bounds possible.

### 1.2. Graphics processing units

In this section we briefly introduce GPU concepts used throughout the paper. Although the key concepts are independent, we restrict our notation to NVIDIA's GPUs and the CUDA programming model. GPUs are massively parallel processors capable of running hundreds of thousands of threads. Threads are grouped into *warps* in hardware (usually 32 threads each) and forced to execute in a SIMD (Single Instruction Multiple Data) fashion and share the same resources. On the programming level, threads are divided into *thread blocks*. Different thread blocks are scheduled for execution independently, but they all have access to the same *global memory*. Threads inside a single thread block can be synchronized and a given amount of exclusive *shared memory* is available to each block. Accesses to shared memory are usually much faster than accesses to global memory.

### 1.3. Related work

While our motivation to study domain propagation are MIPs and we restrict ourselves to linear constraints, the concept is more general and has been rediscovered several times since the 1970s in different communities. First, it was used in the AI (Artificial Intelligence) and CP (Constraint Programming) communities, where its origins can be traced back to the Waltz algorithm [8]. Several other techniques related to domain propagation are widely used in AI and CP and known under the names *domain filtering*, *domain reduction*, *bound reduction*, *range reduction* and *constraint propagation*. In MIP, domain propagation was first discussed in the context of presolving [7,9]. In global optimization and mixed integer non-linear programming, it is most commonly known as *feasibility-based bounds tightening* (FBBT) and was first used in [10].

The literature on previous efforts to successfully exploit GPUs in exact MIP methods is scarce. We refer to the recent survey by Boyer et al. [6] and references therein and we only provide a very brief summary here. Most work in this direction has focused on the simplex method (especially its linear algebra), on dynamic programming approaches for solving the knapsack problem, and on branch-and-bound implementations for a few special problem classes. Somewhat more attention was paid to GPU implementations of metaheuristic methods which aid exact methods, as their compute tasks seem more amenable to an efficient implementation on GPUs.

### 1.4. Contribution

To the best of our knowledge, this paper presents the first algorithm for domain propagation on GPUs. We discuss and address two main sources of irregularity that challenge efficient GPU implementations: 1. the dynamic behavior of the existing CPU-based algorithms, and 2. the highly sparse and irregular structure of the constraint matrix  $A$ . We

propose a new algorithm which is better suited to the *throughput-based* model of GPUs.

To deal with the highly irregular structure of the input matrix  $A$ , we found inspiration in existing work on parallelizing sparse linear algebra and demonstrate that some of these ideas can be carried over to domain propagation. As a result, we obtain an efficient method that runs entirely on the GPU and yields significant speedups for domain propagation on the MIPLIB 2017 test bed, the *de-facto* standard for MIP solvers. The computational results are presented for both double- and single-precision arithmetic. Finally, we validate our baseline algorithms against the domain propagation implementation in the CPU-based PaPILO presolve library [11].

The rest of the paper is organized as follows. In Section 2 we discuss existing state-of-the-art implementations of domain propagation and the challenges they present for efficient GPU parallelization, alongside the main approach we take in overcoming these challenges. In Section 3 we focus on the handling of the irregular (sparse) structure of the constraint matrix and other implementational issues. Finally, in Section 4 we present computational results.

## 2. Parallelizing domain propagation

We strive to exploit two sources of parallelism from the domain propagation algorithm: First, we exploit the fact that we can apply the propagation reductions to each constraint, i.e., to each row of  $A$ , independently and thus do so fully in parallel. The repercussions of this on the algorithm are discussed in Section 2.2. Second, we exploit parallelism inside the processing of each constraint. This itself comes from two sources: (a) we parallelize the computation of minimum and maximum activities, see Sections 3.1 through 3.4, and (b) we parallelize the updating of bounds, see Section 3.5.

### 2.1. Features of sequential domain propagation

We first present the main features of the sequential domain propagation algorithm and discuss why this algorithm is not suited for efficient implementation on GPUs. CPU-based, sequential implementations of the domain propagation algorithm follow the *latency-based* sequential programming model. The main steps of this algorithm are summarized in Algorithm 1.

---

#### Algorithm 1 Sequential domain propagation

---

```

Input: System of linear constraints  $\underline{\beta} \leq Ax \leq \bar{\beta}$ ,  $\ell \leq x \leq u$ 
Output: Tightened variable bounds  $\underline{\ell}^{\text{new}} \leq x \leq u^{\text{new}}$ 
1: mark all constraints  $c$  in  $\underline{\beta} \leq Ax \leq \bar{\beta}$ 
2: bound_change_found  $\leftarrow$  true
3: while bound_change_found do
4:   bound_change_found  $\leftarrow$  false
5:   for each constraint  $c$  in  $\underline{\beta} \leq Ax \leq \bar{\beta}$  do
6:     if  $c$  marked then
7:       unmark  $c$ 
8:       compute  $\underline{\alpha}_c$ ,  $\bar{\alpha}_c$  via (3a) and (3b)
9:       if  $c$  propagate then
10:        for each variable  $v$  in  $c$  do
11:          if can  $v$  be tightened then
12:            compute  $\underline{\ell}_v^{\text{new}}$ ,  $u_v^{\text{new}}$  via (4a) and (4b)
13:            if  $\underline{\ell}_v^{\text{new}} > \underline{\ell}_v$  then
14:               $\underline{\ell}_v \leftarrow \underline{\ell}_v^{\text{new}}$ 
15:              bound_change_found  $\leftarrow$  true
16:            if  $u_v^{\text{new}} < u_v$  then
17:               $u_v \leftarrow u_v^{\text{new}}$ 
18:              bound_change_found  $\leftarrow$  true
19:            if bound_change_found then
20:              mark all constraints  $c$  with  $v$  in  $c$ 
21: return  $\underline{\ell}^{\text{new}}, u^{\text{new}}$ 
```

---

Informally speaking, this algorithm starts with the first non-zero entry of  $A$ , computes everything it can for the corresponding variable, tightens its bounds if possible, and then moves on to the next non-zero. It can stop processing a constraint or a variable early if the sufficient conditions in Line 9 and Line 11 are met. This avoids unnecessary work as these checks can be performed before the new bound candidates are computed and compared to the old bounds. In addition, this algorithm takes advantage of the fact that a constraint can trigger propagation of other constraints only if they share at least one variable by implementing a marking mechanism (Lines 1, 6, 7 and 20).

This irregular behavior poses challenges for an efficient implementation on a GPU. Threads assigned to units of work that are stopped early would remain idle during the rest of the execution in a given propagation round, leaving hardware underutilized. The remaining threads would potentially be accessing memory far apart, resulting in uncoalesced accesses. Notice that we do not know *a priori* which parts would terminate early and at which level, or which constraints will be marked in the next round. This induces highly dynamic behavior both inside a given propagation round and between different rounds as well.

Additionally, notice that the amount of work a given thread would perform heavily depends on where, if at all, it will terminate early. Computing  $\underline{\alpha}$  and  $\bar{\alpha}$  in Line 8, e.g., also requires a loop over the variables in the constraint, see (3a) and (3b). The marking operation in Line 20 involves iterating over the columns of  $A$  and finding all constraints that contain the variable in question. The dynamic behavior of the algorithm described in the previous paragraph would also make load balancing the work between different threads difficult.

### 2.2. The price of parallelism

As mentioned earlier, the constraints of a given system can be propagated independently. Exploiting this parallelism, however, comes at a cost; it will result in a less efficient algorithm in the worst-case scenario. The main reason for this is that a bound change found during the sequential execution of the algorithm becomes immediately available to the propagation of the subsequent constraints. If this bound change triggers propagation of one of the subsequent constraints, this propagation can be done during the same round. In the parallel case however, all constraints are propagated independently, hence propagation triggered in this way would have to wait until the next propagation round.

The worst case of such a propagation pattern is a *cascading propagation pattern*, where Constraint 1 triggers propagation of Constraint 2, then Constraint 2 triggers Constraint 3, and so forth. A sequential implementation could propagate this pattern in one round, while a parallel implementation that propagates each constraint independently would require  $m$  rounds, where  $m$  is the number of constraints in the system. In the best-case scenario, the parallel and sequential algorithm propagate the system in the same number of rounds.

To roughly gauge this “price of parallelism” effect, we conducted a preliminary experiment over 893 instances from the MIPLIB 2017 test set [12] on which both our parallel and sequential implementations converge to identical results. In this experiment, the average number of propagation rounds of the sequential implementation was 3.1, but increased to 4.4 rounds on average for the parallel implementation, hence an average increase by a factor 1.4. However, the maximum increase observed for an instance was as large as 22.0.

### 2.3. A GPU-targeted parallelization

In this section, we describe the proposed GPU-parallel domain propagation algorithm and discuss why it better matches the GPU model. Finally, we highlight how the missing features of this algorithm, compared to the sequential algorithm, are mostly remedied by the massively parallel GPU model.

The parallel algorithm we propose for implementation on the GPUs is shown in Algorithm 2. It does not contain the early termination checks and the marking mechanism of the Algorithm 1. This means that at the expense of performing more computations we avoid the irregular behavior of the CPU version induced by these checks. Taking constraint marking as an example, note that unmarked constraints would not be processed by the CPU algorithm, while the GPU algorithm would process such constraint even though it cannot yield improved bounds. What we gain though is an algorithm with a static sequence of computations that can exploit the massively parallel GPU model.

---

**Algorithm 2** GPU-parallel domain propagation (schematic)

---

**Input:** System of linear constraints  $\beta \leq Ax \leq \bar{\beta}$ ,  $\ell \leq x \leq u$   
**Output:** Tightened variable bounds  $\ell^{\text{new}} \leq x \leq u^{\text{new}}$

```

1: bound_change_found  $\leftarrow$  true
2: while bound_change_found do
3:   for constraint  $c$  in  $A$  parallel do
4:     compute  $\underline{\alpha}$ ,  $\bar{\alpha}$  via (3a) and (3b)
5:   for constraint  $c$  in  $A$  parallel do
6:     for variable  $v$  in  $c$  parallel do
7:       compute  $\ell_v^{\text{new}}$ ,  $u_v^{\text{new}}$  via (4a) and (4b)
8:       if  $\ell_v^{\text{new}} > \ell_v$  then
9:          $\ell_v \leftarrow \ell_v^{\text{new}}$ 
10:        bound_change_found  $\leftarrow$  true
11:       if  $u_v^{\text{new}} < u_v$  then
12:          $u_v \leftarrow u_v^{\text{new}}$ 
13:       bound_change_found  $\leftarrow$  true
14:   return  $\ell^{\text{new}}, u^{\text{new}}$ 

```

---

Notice the shift in the way the two algorithms progress through the computation. The CPU-based Algorithm 1 starts with the first constraint and its first variable and continues processing that variable until no more processing is possible. It then moves to the next variable and the process is repeated. Hence, the activities  $\underline{\alpha}$  and  $\bar{\alpha}$  are computed for the first constraint, then all the variables in that constraint are processed, then the process is repeated on the remaining constraints. The GPU-based Algorithm 2 instead starts with computing  $\underline{\alpha}$  and  $\bar{\alpha}$  for *all* constraints. Then it computes new bound candidates for *all* non-zeros of  $A$ . A useful analogy to graph algorithms here is that the CPU Algorithm 1 resembles a “depth-first search” and the GPU algorithm resembles a “breadth-first search” of a graph. In terms of the GPU model, the new algorithm allows for better load balancing, maximizing the *throughput*.

This parallel algorithm also allows us to access large parts of necessary memory in a coalesced way. However, to understand how the memory is accessed, we also need to take in account the sparsity pattern and the storage scheme of  $A$ . Thus, we discuss memory accesses in Section 3 which deals with the irregular structure of  $A$ .

For an example of why our proposed algorithm leads to better load balancing in the parallel case, notice that all variables in a given constraint share the same  $\underline{\alpha}$  and  $\bar{\alpha}$ . If we first use all available threads to precompute the activities, then synchronize and perform bound updates, no threads are left idle and no computation is repeated. Otherwise, either all the threads compute the same activity values (best-case scenario we get the result in the time a single thread takes to compute the activity) or some of them are idle and waiting for others to compute the activities before performing the bound updates. Additionally, having a group of threads cooperate on computation of activities will lead to a faster algorithm. This is because the computation of activities allows for some parallelism to be exploited; see Sections 3.1 through 3.4.

We now highlight how the *throughput-based* GPU programming model partly remedies the lack of early termination checks of Algorithm 1. Let us only consider the constraint marking feature; the conclusion drawn will be analogous for other early termination checks.

Assume we have  $m$  constraints in the system,  $k$  of which are marked for propagation, where  $k \leq m$ . Furthermore, assume that the costs of propagating one constraint on the hardware where the sequential and parallel algorithms are run are  $C_{\text{seq}}$  and  $C_{\text{par}}$ , respectively, and that the parallel algorithm has  $p$  processing units for parallel computation at its disposal. The sequential algorithm would only propagate the  $k$  marked constraints and thus finish the propagation with the cost of  $k \cdot C_{\text{seq}}$ . The parallel algorithm, on the other hand, would propagate the system with the cost of  $\lceil \frac{m}{p} \rceil \cdot C_{\text{par}}$ . For the case of  $m \leq p$ , it would not matter for the parallel algorithm whether it propagates all the  $m$  constraints or only the  $k$  marked constraints. For  $m > p$ , we pay the price for propagating one constraint, i.e.,  $C_{\text{par}}$  for each additional  $p$  constraints that are propagated.

In practice, it is hard to exactly quantify  $p$  for GPUs due to the complexity of their hardware and execution model. However, we can get a rough idea of what orders of magnitude are involved in the computation. The 987 instances from our MIPLIB 2017 test bed from Section 4.1 show on average 118,514 constraints, 64,611 variables, and 1,226,730 non-zeros, respectively. On the other hand, modern GPUs offer order of tens of thousands of threads running in parallel. Thus, given the size of practically relevant problems and the amount of parallelism modern GPUs offer, we can expect that the price for extra amounts of work in the parallel algorithm will be mostly negligible.

Further properties of GPUs help in reducing the cost we pay for extra computations. For example, the GPU hardware is optimized to combine multiple memory accesses into a single transaction (*memory coalescing*). As will be discussed in Section 3, the GPU algorithm allows us to access large parts of necessary memory in a fully coalesced manner. So the memory to be operated on by the extra threads will often be already loaded. In low arithmetic intensity algorithms like domain propagation memory accesses are typically dominant parts of GPU implementations in terms of runtime. Hence, we do not expect these extra arithmetic operations to have a significant effect on the total runtime of the algorithm; our computational results confirm this.

### 3. Handling irregular structure of the constraint matrix

In practice, constraint matrices  $A$  coming from MIPs are very sparse. To store  $A$ , we adopt a ubiquitously used *Compressed Sparse Row (CSR)* storage scheme [13]. Additionally, MIPs often contain so-called *connecting constraints* which are very dense. Consequently, even though matrix  $A$  might be very sparse overall, it may contain a few very dense rows. This poses a challenge for load balancing on GPUs. Consider two naive approaches of assigning a single thread per constraint and assigning a warp or block of threads per constraint. In the former case, threads assigned to dense rows would have much more work to do, leaving other threads idle. In the latter case, warps or blocks assigned to rows with a small number of non-zeros would remain underutilized.

We already discussed how GPU hardware is optimized around coalescing memory accesses of threads in a warp into one or as few as possible memory accesses. The approach of assigning one thread per constraint would perform poorly in this regard as neighboring threads running in parallel would access memory locations which are not adjacent in the row-major CSR format. Assigning a warp of threads per constraint, on the other hand, results in fully coalesced memory accesses: each thread of a warp takes one non-zero element, which are stored next to each other in the CSR format.

In this section, we address above-mentioned challenges and present an algorithm which combines good load balancing with coalesced memory accesses. To introduce the main idea, we first focus on the computation of the minimum and maximum activities  $\underline{\alpha}$  and  $\bar{\alpha}$  in Sections 3.1, 3.2, 3.3, and 3.4. The second part of the algorithm, which computes new bound candidates is discussed in Section 3.5.

### 3.1. Similarity between SpMV and computation of activities

The first step of the parallel algorithm computes minimum and maximum activities  $\underline{\alpha}$  and  $\bar{\alpha}$  for all constraints of the system via (3a) and (3b). Our main approach for implementing these formulas is the observation that their efficient implementation for a matrix  $A$  on GPUs is conditioned by the same factors that condition implementations of a sparse matrix–vector product (SpMV) of  $A$  and some right-hand side vector. SpMVs and their efficient implementations on GPUs have been extensively studied over the years as they play a crucial role in computational science. We will then be able to carry over ideas to overcome challenges described at the beginning of this section for computing activities.

To support our observation, we rely on the fact that SpMV is bandwidth bound (i.e., its arithmetic intensity is very low), so good memory access patterns greatly improve performance [13]. Notice that arithmetic intensity of matrix–vector products and computing  $\underline{\alpha}$  or  $\bar{\alpha}$  is the same. When it comes to memory needed to perform the operations to compute the activities, notice that each  $l_i$  and  $u_i$  is accessed exactly once: either for  $\underline{\alpha}$  or for  $\bar{\alpha}$ . This means that we need to access each element of  $A$ ,  $l$ , and  $u$  exactly once to implement (3a) and (3b). This is exactly the same memory needed to compute the matrix–vector product of  $A$  with two right hand sides  $l$  and  $u$ .

Greathouse and Daga introduced an algorithm in [13] which addresses issues discussed at the beginning of this section in the context of SpMVs. This algorithm is called *CSR-adaptive*. It handles well both structured and unstructured matrices and explicitly addresses the case of having very long rows in an otherwise sparse matrix. We now briefly introduce the main idea of this algorithm before discussing the changes made to facilitate computation of activities. For a detailed discussion of *CSR-adaptive* see [13].

### 3.2. The CSR-adaptive algorithm

The main idea of the algorithm is to divide the matrix  $A$  into *row blocks*, and have one CUDA threads block work on one row block. If the number of non-zeros in some rows is small, they will be grouped together in one row block. A so-called *CSR-stream* algorithm will then be applied to this row block. If a given row block consists of only one or few rows, a so-called *CSR-vector* variant will be applied.

The CSR-stream algorithm first assigns one thread to each non-zero in the row block and loads them into shared memory. This results in fully coalesced memory accesses. Afterwards, a number of threads is assigned per each row that is present in the row block. These threads then carry out the necessary computations and reductions. The CSR-vector algorithm assigns one warp of threads to the row block which then perform the computations and reductions.

### 3.3. Computing minimum and maximum activities

On the implementation level, we make the following changes to the CSR-adaptive algorithm as presented in [13]. First, instead of working on one right-hand side array, we adapt the algorithm to work on two arrays:  $\ell$  and  $u$ . As part of the same change, the algorithm now outputs two arrays, the minimum and maximum activities.

Second, the arithmetic operations are adjusted to (3a) and (3b). Note that this leaves the reductions (i.e., summations) unchanged as it only differs from SpMV in computing the local summands  $a_i b_i$ .

Third, we allow the CSR-vector algorithm to use all warps in a CUDA thread block if the rows are extremely long. In this case, each warp computes partial sums of its elements, after which the partial sums are reduced in shared memory. In our implementation, we use a (somewhat arbitrary) threshold value of 64 to switch between the two variants of CSR-vector.

Both the CSR-stream and the CSR-vector variant store the computed activities in local shared memory, as they will be utilized in forthcoming computations (see Section 3.5).

### 3.4. Numerical considerations for computing activities

Apart from the standard issues that numerical algorithms based on floating-point arithmetic have to deal with, one special case arises during domain propagation that warrants discussion. Eqs. (4a) and (4b) provide formulas for computing bound candidates of a variable  $j$  in a given constraint. Notice that these formulas always explicitly contain the value of  $\ell_j$  or  $u_j$ , even though they could be factored into the activity values:

$$\underline{\alpha}_j = \sum_{i=0, i \neq j}^n a_i b_i = \underline{\alpha} - a_j b_j \text{ with } b_k = \begin{cases} \ell_k & \text{if } a_k > 0 \\ u_k & \text{if } a_k \leq 0, \end{cases} \quad (5a)$$

and

$$\bar{\alpha}_j = \sum_{i=0, i \neq j}^n a_i b_i = \bar{\alpha} - a_j b_j \text{ with } b_k = \begin{cases} u_k & \text{if } a_k > 0 \\ \ell_k & \text{if } a_k \leq 0. \end{cases} \quad (5b)$$

The values  $\underline{\alpha}_j$  and  $\bar{\alpha}_j$  are called *residual activities*. Literature on domain propagation which does not focus on implementation usually defines the Eqs. (4a) and (4b) in terms of these values rather than the minimum and maximum activities (see [5] for example).

The main reason for not using residual activities directly in practical implementations of domain propagation is performance. Namely, residual activities need to be computed for each non-zero of the system. On the other hand, minimum and maximum activities are defined per constraint of the system. The artificial step of coming back to the residual activities costs almost nothing during the computation of new bound candidates (see Section 3.5), as the additional lower or upper bound value and the coefficient are readily available and already used at this point.

However, the case when infinite bounds are involved must be handled with care. Consider a constraint  $i$  with a number of variables whose bounds have finite values and one variable  $j$  whose bounds are infinite. The maximum and minimum activities of such a constraint are infinite. The residual activities of all but variable  $j$  are also infinite. However, the residual activities of variable  $j$  will be finite. A difficulty arises when computing the new bound candidates for variable  $j$ , where an infinite value would first need to be added and then subtracted from the result. (Notice that the case where constraint  $i$  contains more than one variable with infinite bounds is easier to handle, because the minimum and maximum activity and all the residual activities are infinite.)

Existing domain propagation implementations, such as the implementation in the PaPILO presolver [11], handle this case by keeping track of the number of infinite contributions to each of the minimum and maximum activities separately from the finite part of the activity sums. Then, the special case of having exactly one infinity contribution can be detected and the correct finite value computed. We take the same approach, but unlike sequential algorithms where keeping track of and aggregating a counter is a trivial problem, we have GPU threads accessing non-zeros of the constraint in parallel.

We observe that the problem of computing the number of infinity contributions in a constraint is a reduction problem equivalent to the computation of the activity value itself. Moreover, the memory that needs to be loaded from GPU global memory to compute these two reductions is exactly the same, we just need to use it in a different way: for activities, we compute the summands  $a_i b_i$  from Eqs. (3a) and (3b), while to compute the number of infinity contributions we check  $b_i$ , and set the summand value to 1 if  $b_i$  is infinite, and to 0 otherwise. These summands are then summed in an equal way. Hence we can compute the number of infinity contributions by extending the reductions we already use. This comes at the expense of additional shared and register memory and some computations, however, no additional expensive global memory loads are needed. As already discussed, the activity computation kernel is highly memory-bound, so this approach works in our favor.

### 3.5. Computing new bound candidates

The new bound candidates for each variable of a constraint  $\underline{\beta} \leq A^T x \leq \bar{\beta}$  are computed by (4a) and (4b). This operation maps each non-zero element  $a_{ij}$  of  $A$  to its lower and upper bound candidates for variable  $j$ .

Our variant of the *CSR-adaptive* algorithm for computing activities from Section 3.3 works on a granularity level of one thread per non-zero element of  $A$ , which is the same granularity we need to map non-zero elements of  $A$  to new bound candidates. Moreover, each thread loads the values  $a_{ij}$ ,  $l_j$  and  $u_j$  into shared memory, and the computed  $\underline{a}$  and  $\bar{a}$  are also saved in shared memory. Thus, we extend the activities kernel to also compute (4a) and (4b) after the computation of activities is complete. In most cases, this implementation benefits from reusing the values that are kept in shared memory and avoids expensive GPU global memory loads.

Finally, once individual threads have computed the corresponding lower and upper bound candidate, they are compared to the best current bounds and updated if necessary. Because a given variable can be present in several constraints, it is possible for multiple threads to hold bound candidates for the same variable. This can lead to race conditions if such threads attempt to update the same variable's bounds in parallel. To deal with these race conditions, we use CUDA's atomic operations to update bounds.

Having all threads of the GPU perform atomic operations at the same time can incur a significant performance cost. In order to limit the number of necessary atomic operations, we exploit the fact that the bounds are strictly improving during the course of the algorithm. For a variable  $j$  with old bounds  $\ell_j$  and  $u_j$  and new bounds  $\ell_j^{\text{new}}$  and  $u_j^{\text{new}}$ , the following holds:  $\ell_j^{\text{new}} \geq \ell_j$  and  $u_j^{\text{new}} \leq u_j$ . Each thread has its own new bound candidates  $\ell_{i,j}^{\text{cand}}$ ,  $u_{i,j}^{\text{cand}}$  for the constraint  $i$  and variable  $j$  it is processing, but all threads assigned to this variable have the same bound values  $\ell_j$  and  $u_j$  from the previous round. However, because  $\ell_j^{\text{new}} \geq \ell_j$  then the following holds:  $\ell_{i,j}^{\text{cand}} \geq \ell_j$ , or the bound candidate cannot become the new bound. An equivalent argument can be made for the upper bound. This means that we can check if the candidate improves on the bounds from previous round first, and perform an atomic operation to update the actual bound only if it does. In other words, the algorithm discards useless candidates directly and only uses atomics to choose the best of all improvements. The pseudocode in Algorithm 3 summarizes the final algorithm performing one propagation round.

### 3.6. Performance effects as a function of the input

Let us now consider the effect of the shape and the sparsity pattern of the constraint matrix  $A$  on the performance of the developed algorithm. We already discussed how separate constraints can be processed independently (fully parallel) to compute the activities and the new bound candidates. While computing activities, the algorithm also exploits parallelism inside each constraint. However, the amount of parallelism here is limited due to the summation operation.

Suppose now the number of non-zeros of  $A$  is fixed. Then, from the point of view of parallel performance and scalability, this step will favor matrices with (a) more rows and (b) fewer non-zeros per row. This is because different rows can be processed fully in parallel, while the computations inside a row have an inherent sequential part that increases with the non-zeros count.

By contrast, the discussed necessity to use atomic operations to update bounds from the same column means that matrices with more non-zeros per column will have more threads competing for hardware resources, reducing parallel performance and scalability. Different columns, on the other hand, can be processed independently and fully in parallel with no possibility for hardware resources conflict. Therefore, this step will favor matrices with (c) more columns and (d) fewer non-zeros per column. Observe that (a) and (b) are in opposition

**Algorithm 3** Kernel performing one propagation round in Algorithm 2.  $blockIdx$  is the index of the current CUDA block the thread executing the code belongs to.

---

```

1: start_row ← row_blocks[blockIdx]
2: end_row ← row_blocks[blockIdx + 1]
3: nnz_block ← row_ptrs[end_row] - row_ptrs[start_row]
4: if end_row - start_row > 1 then
5:   compute  $\underline{a}_{\text{row}}$ ,  $\bar{a}_{\text{row}}$  by CSR-stream
6: else
7:   if nnz_block < length_threshold then
8:     compute  $\underline{a}_{\text{row}}$ ,  $\bar{a}_{\text{row}}$  by CSR-vector with one warp
9:   else
10:    compute  $\underline{a}_{\text{row}}$ ,  $\bar{a}_{\text{row}}$  by CSR-vector with all warps
11: __syncthreads()
12: // each thread knows the row idx i and column idx j of the non-zeros
   it is processing
13: compute  $\ell_{i,j}^{\text{cand}}$ ,  $u_{i,j}^{\text{cand}}$  via (4a) and (4b)
14: if  $\ell_{i,j}^{\text{cand}} > \ell_j$  then
15:    $\ell_j \leftarrow \text{atomicMax}(\ell_j, \ell_{i,j}^{\text{cand}})$ 
16: if  $u_{i,j}^{\text{cand}} < u_j$  then
17:    $u_j \leftarrow \text{atomicMin}(u_j, u_{i,j}^{\text{cand}})$ 
```

---

to (c) and (d), and that the trade-off between them will define their final effect on performance.

Finally, observe that the above “static” analysis of constraint matrix shape and sparsity pattern does not take into account the dynamic behavior of domain propagation, which will also affect performance. Two instances with the same shape, the same number of non-zeros and sparsity pattern might still behave differently. For example, one instance might only require a few bound changes to reach the limit point, while the other one might require thousands in many propagation rounds, possibly changing the part of the algorithm with dominating effect on performance; the bound changes in one instance might all happen to be from the same column, resulting in a bottleneck due to many atomic operations, while the other might need no atomic operations at all because all bound changes happen to be from different columns. Even if the two instances have the same bound changes, the timing of the changes will define the amount of hardware resources conflicts the atomic operations will have, e.g., if they are evenly distributed throughout the propagation rounds this might reduce the amount of conflicts, while if the majority is clustered in the first propagation round this might result in a high amount of conflicts. To conclude, the true potential for parallelism can only be evaluated empirically over a sufficiently heterogeneous test set.

### 3.7. The final algorithm

Algorithm 3 performs one propagation round. The execution of rounds is iterated as long as at least one bound change is found. If no bound change is found, the limit point is reached and the algorithm terminates. We provide three implementations of this behavior.

The first approach launches a kernel with one thread block and one thread to iterate through propagation rounds. This kernel then uses CUDA's *dynamic parallelism* feature to spawn the kernel implementing Algorithm 3 for each round, as many times as necessary. The kernels communicate through a boolean variable stored in GPU global memory to check if at least one bound change was found during a particular propagation round. This way, the CPU does not have to communicate with the GPU during the execution at all and can continue doing other computations while the bounds propagation is running on the GPU. We denote this implementation as *gpu\_loop*.

The second approach is to move the loop iterating through propagation rounds to the CPU. This means that minimal communication between the GPU and the CPU is necessary between propagation rounds: a boolean variable holding the information if at least one bound change was found during a given round. With this algorithm, the CPU has to perform a minimal amount of work during the propagation, namely, check one boolean variable in a loop and either invoke another kernel execution or otherwise terminate. However, this is overall faster than the first variant for the following reason: GPUs are built to take advantage of massive parallelism through concurrent threads, but any of those individual threads is typically orders of magnitude slower than a CPU thread. Small as it is, the sequential point of iterating through the propagation rounds and checking if any bound changes have been found has a noticeable effect on performance. We denote this implementation as *cpu\_loop*.

The third approach is to avoid the situation where a kernel with one thread and one block has to be launched by launching only one kernel for the entire duration of the execution. This kernel is launched with a number of blocks and threads that allows full GPU occupancy but still does not need to synchronize with the CPU. On the implementation level, this is achieved by combining the *grid-stride loop* kernel design with the *cooperative groups* feature to achieve grid-wide synchronization within the kernel. However, this still does not eliminate the sequential point of the algorithm, as all threads but one remain idle during the synchronization phase. Even though this implementation did outperform the first version with a single-threaded kernel and dynamic parallelism by a measurable amount for some specific instances, it performed worse overall. We denote this implementation as *megakernel*.

In Appendix B we present results quantifying the difference in performance of the three implementations. In the main experiments of the paper presented in Section 4, we use the best-performing *cpu\_loop* variant.

#### 4. Computational experiments

In this section, we present our experimental setup for studying the behavior of the proposed algorithms and discuss the results obtained.

##### 4.1. Test set

To benchmark the algorithms, we use the MIPLIB 2017 collection set with 1065 MIP instances, which is currently the largest and most widely adopted general testbed for MIP algorithms [12]. The number of non-zeros in the constraint matrix  $A$  of this test set ranges from 3 to 256,963,402. Due to the errors coming from the open-source file reader and other issues, we were not able to obtain results for 78 instances, leaving the set at 987 instances.

As explained in Section 1.1, domain propagation may not converge to its limit point in finite time. We have set a limit on the maximum number of propagation rounds in our algorithm at 100. There are 30 instances that did not converge within this limit, and we do not use them in performance comparisons. Additionally, some instances may run into numerical difficulties due to floating-point arithmetic and other problems. This happened to 64 of the instances in this set. Out of 987 instances, this leaves us with 893 instances which terminate successfully and converge to the same limit point (see Section 4.3 for the definition of convergence to the same limit point).

Because small instances do not provide enough workload to justify their parallelization on GPUs, we further eliminate all 107 instances that have less than 1000 variables and 1000 constraints. Finally, this leaves us with a set of 786 instances that are used for performance comparisons.

To better capture the performance of the algorithms with respect to the size of the input problems, we partition this set into eight subsets of increasing size. We denote by  $[s, t)$  the set of instances that contain less than  $t$  variables and  $t$  constraints, but have at least  $s$  variables or

$s$  constraints. We consider the partitions  $[1k, 10k]$ ,  $[10k, 20k]$ ,  $[20k, 40k]$ ,  $[40k, 80k]$ ,  $[80k, 160k]$ ,  $[160k, 320k]$ ,  $[320k, 640k]$ , and  $[640k, \infty)$ , and refer to these sets as *Set-1* to *Set-8*. Number of instances in sets *Set-1* through *Set-8* are: 270, 129, 98, 91, 65, 57, 40, and 36, respectively.

The numbers presented in this section differ slightly for the execution on the *P400* GPU (see Section 4.2), which has much less resources than other machines we used. This results in insufficient memory and similar problems for a few instances. The differences in numbers are never larger than a single digit.

The numbers for all executions are compiled for runs with double-precision arithmetic. Differences for single-precision executions are discussed separately in Section 4.5.

##### 4.2. Algorithms and hardware

We compare three algorithms<sup>1</sup>:

1. *cpu\_seq* is a sequential implementation of Algorithm 1. It is always executed on a single-core, following closely the current state-of-the-art implementation [11].
2. *cpu\_omp* is a shared memory-parallel implementation of Algorithm 1. OpenMP is used to parallelize the loop at Line 5. To improve load balancing, the set of constraint indices is pre-processed and only those marked for propagation are assigned to available threads. OpenMP locks are used to prevent race conditions while updating bounds.
3. *gpu\_atomic* is the implementation of Algorithm 3 on GPUs in the variant which iterates through the propagation rounds on the CPU (see Section 3.7).

The algorithms are tested on the following architectures:

- **V100** NVIDIA Tesla V100 PCIe 32 GB GPU,
- **TITAN** NVIDIA Titan RTX 24 GB GPU,
- **RTXsuper** NVIDIA GEFORCE RTX 2080 SUPER 8 GB GPU,
- **P400** NVIDIA Quadro P400 2 GB GPU,
- **amdtr** 64-core AMD Ryzen Threadripper 3990X @ 3.30 GHz with 128 GB RAM CPU,
- **xeon** 24-core Intel Xeon Gold 6246 @ 3.30 GHz with 384 GB RAM CPU,
- **i7-9700K** 8-core Intel i7-9700K @ 3.60 GHz with 64 GB RAM

The list includes high-end GPUs, such as the data center-grade *V100* and *TITAN*, which is also suitable for desktop computing. The *P400*, on the other hand, is a very low-grade GPU that can often be found in personal use desktops. The *amdtr* and *xeon* are data center-grade CPU servers, while the *i7-9700K* machine is a desktop computer.

The *cpu\_seq* and *cpu\_omp* algorithms are executed in double-precision arithmetic, while the *gpu\_atomic* algorithm is executed in double-precision (discussed in Section 4.4) and single-precision (discussed in Section 4.5). The *cpu\_omp* algorithm is run on the *xeon* machine with 24 threads, on the *amdtr* machine with 64 threads, and on the *i7-9700K* machine with 8 threads.

##### 4.3. Evaluation methodology

The main metric we use to compare two executions is *speedup* defined over wall clock time. Average speedups are computed as geometric means. We choose the *cpu\_seq* algorithm running on the *xeon* machine as a base reference execution. All speedups are computed against this execution, unless stated otherwise.

Both the sequential and GPU versions have a certain amount of initialization work which only needs to be performed once. In the

<sup>1</sup> Our implementations can be found at: <https://github.com/Sofranac-Boro/gpu-domain-propagator>.



**Fig. 1.** The left graph (a) shows geometric means of speedups over the eight subsets of instances of increasing size. The right graph (b) shows the distribution of speedups by plotting the individual speedup for each instance, sorted in ascending order. Each curve belongs to one combination of algorithm and machine. All executions are in double-precision floating-point arithmetic. The baseline is the *cpu\_seq-xeon* execution.

sequential case, e.g., this is computing the column-major CSC storage of the matrix, which is needed for the marking mechanism (see Algorithm 1). For the GPU algorithm, the blocking of the matrix  $A$  is precomputed on the CPU (see Section 3.2), and the necessary memory is sent to the GPU. As in [13], we do not include these one-time initialization tasks in our time measurement. Timing starts just before the first propagation round and ends after the last propagation round is executed and the results are available (in CPU memory in the CPU case, in GPU memory in the GPU case).

When comparing the results of two executions, each individual bound is checked, and the results are deemed equal if all bounds are equal within tolerances as follows. Two values  $a$  and  $b$  are considered equal if  $|a - b| \leq (t_{abs} + t_{rel}|b|)$ , where  $t_{abs} = 10^{-8}$  and  $t_{rel} = 10^{-5}$ . The  $a$  value is always assigned to the bound of the reference *cpu\_seq* execution, while the  $b$  value is always assigned to the execution being evaluated, e.g., *cpu\_omp* or *gpu\_atomic*.

#### 4.4. Double-precision results

The speedups that can be observed for the seven algorithm-machine combinations executed with double-precision floating-point arithmetic are visualized in Fig. 1. Fig. 1(a) on the left shows the geometric means of speedups for the eight subsets *Set-1* to *Set-8* of instances with increasing size. Fig. 1(b) on the right additionally provides the distributions of the speedups over all instances sorted in ascending order. Table 1 provides a summary and reports the average speedup over all instances and over the subsets *Set-1* to *Set-8*, plus the 5th percentile, the median, and the 95th percentile speedup.

As can be seen immediately, the *gpu\_atomic* implementation running on the *V100*, *TITAN*, and *RTXsuper* GPUs drastically outperforms both the sequential and the CPU-parallel executions. Let us first evaluate the top-performing combination, *gpu\_atomic* on *V100*, in more detail. Its average speedup closely follows a linear trend in the size of the instances from *Set-1* to *Set-8*. It is always greater than 2.35, and for *Set-8*, which contains all instances with at least 640,000 variables or constraints, a speedup factor of 43.92 is achieved. This shows that, in general, the speedup of the GPU-parallel algorithms crucially depends on the size of the instances. Over the whole test set, the average speedup is a factor of 7.42, which reflects that a significant portion of the test set

are small instances that have limited potential for parallelism. For 5% of the instances, however, *gpu\_atomic* on *V100* is at least 51.65 times faster than the sequential implementation.

By contrast, on the low-end consumer-grade GPU *P400*, the *gpu\_atomic* kernel performs worse than *cpu\_seq* on all subsets, with an average speedup factor over the whole test set of 0.47. Compared to the other cards, the *P400* has less computational resources available to exploit the parallelism. Additionally, the Volta (*V100*) and Turing (*TITAN*, *RTXsuper*) architectures introduce changes which significantly improve performance [14] over the Pascal architecture (*P400*) cards. Nevertheless, keeping in mind that GPUs are currently a resource completely unused by MIP solvers and that the CPU can do other tasks while the GPU is computing, this result might prove to be interesting even in this very affordable setup, especially as the performance of consumer-grade GPUs will improve over the next years.

Also the shared memory-parallel algorithm *cpu\_omp* cannot compete with the GPU algorithms on *V100*, *TITAN*, and *RTXsuper*, which outperform it by a large margin. The comparatively high cost of managing CPU threads proves to not be justified by the low arithmetic intensity of the parallel units of work processed by the threads. This is especially visible for the 24-thread *xeon* and the 64-thread *amdr* executions, which are slower than the *cpu\_seq* on all subsets but *Set-8*. The 8-thread *i7-9700K* execution, however, outperforms the *cpu\_seq* execution on all subsets except for *Set-1*, but its average speedup over the subsets is never larger than 3.1.

The *gpu\_atomic* executions on *TITAN* and *RTXsuper* GPUs, which both consistently outperform the *cpu\_seq* execution by a large margin, perform quite similarly between themselves. Interestingly, their curves can be seen to cross on both plots. Their similarity in performance is practically relevant given that *RTXsuper* is a cheaper alternative to the high-end *TITAN* GPU, which still achieves comparable performance.

Fig. 1(b) allows two further observations. First, comparing the fastest GPU and CPU executions, the slope of the *V100* execution is noticeably steeper than the one of *i7-9700K*, showing that the GPU execution responds better to the growth in parallelism. The same effect can also be seen between the four GPU executions. Second, the break-even point from which on the *cpu\_omp* on *i7-9700K* becomes faster than the sequential implementation lies around the 41st percentile. By contrast, *gpu\_atomic* on *V100* breaks even at the 7th percentile.

**Table 1**

Geometric means of speedups across test sets and speedup percentiles for execution with double-precision arithmetic.

|                         | V100<br>gpu_atomic | TITAN<br>gpu_atomic | RTXsuper<br>gpu_atomic | P400<br>gpu_atomic | amdtr<br>cpu_omp | xeon<br>cpu_omp | i7-9700K<br>cpu_omp |
|-------------------------|--------------------|---------------------|------------------------|--------------------|------------------|-----------------|---------------------|
| Geometric mean speedups |                    |                     |                        |                    |                  |                 |                     |
| Set-1                   | 2.35               | 1.47                | 1.83                   | 0.45               | 0.01             | 0.04            | 0.48                |
| Set-2                   | 6.60               | 2.79                | 3.13                   | 0.45               | 0.03             | 0.11            | 1.00                |
| Set-3                   | 9.08               | 3.00                | 3.40                   | 0.43               | 0.05             | 0.19            | 1.53                |
| Set-4                   | 12.93              | 3.94                | 3.79                   | 0.39               | 0.09             | 0.30            | 1.90                |
| Set-5                   | 19.03              | 5.71                | 5.08                   | 0.53               | 0.17             | 0.43            | 2.24                |
| Set-6                   | 25.31              | 6.79                | 5.89                   | 0.56               | 0.27             | 0.57            | 2.30                |
| Set-7                   | 33.00              | 8.11                | 6.79                   | 0.56               | 0.50             | 0.63            | 2.00                |
| Set-8                   | 43.92              | 10.06               | 8.37                   | 0.74               | 1.20             | 1.15            | 3.05                |
| All                     | 7.42               | 2.98                | 3.19                   | 0.47               | 0.04             | 0.14            | 1.10                |
| Percentile speedups     |                    |                     |                        |                    |                  |                 |                     |
| 5%                      | 0.83               | 0.37                | 0.62                   | 0.12               | 0.003            | 0.01            | 0.20                |
| 50%                     | 7.50               | 3.35                | 3.38                   | 0.52               | 0.04             | 0.16            | 1.28                |
| 95%                     | 51.65              | 14.80               | 12.56                  | 1.15               | 0.80             | 1.22            | 3.91                |

It wins against the sequential implementation for about 93% of the instances. This highlights that the best GPU-parallel versions are not only significantly faster than the shared memory-parallel versions, but that they are also much more robust.

Additionally, for all GPU runs, we can observe a steep increase of the speedup distribution after the 95th percentile. This indicates that for particularly favorable structures GPU parallelization can have an even stronger impact beyond what our general observations over the heterogeneous MIPLIB 2017 test set suggest.

We also performed the *roofline analysis* [15] based on peak bandwidth and performance on the *V100* machine. The heterogeneous MIPLIB 2017 test set contains a number of smaller instances that have a low potential of reaching throughput limits on *V100*, making the roofline analysis unsuitable. Hence, for the purpose of this analysis, we remove all instances with less than 250,000 non-zeros from the test set. On the remaining 349 instances, the average recorded arithmetic intensity is 2.96, with the minimal and maximal recorded values 0.26 and 17.69, respectively. The machine balance on the *V100* is 8.53, indicating that the code is, on average, memory-bound. On average, we recorded 23.64% of attainable performance according to the roofline model, with the minimal and maximal recorded values at 1.5% and 89.14%, respectively. These results highlight the highly irregular and dynamic effects that the properties of the input have on the performance of the algorithm (see Section 3.6) and reveal both challenges and potential for future improvements.

#### 4.5. Single-precision results

For some algorithms running on GPUs, replacing double-precision by single-precision arithmetic may offer significant speedups, if the reduced precision is acceptable from the point of view of correctness of results for the application in question. Additionally, NVIDIA's fast math library (enabled by passing the `--use_fast_math` flag to the `nvcc` compiler) allows for usage of fast hardware implementations of specific operations at the expense of accuracy of results. In this section, we analyze the execution of our `gpu_atomic` algorithm using single-precision arithmetic both with and without the fast math option. In both cases (as well as for double-precision executions from Section 4.4), the `fused-multiply-add` instructions are enabled.

Fig. 2 shows the results for the three GPU machines defined in Section 4.2: *V100*, *TITAN*, and *P400*. *V100* and *P400* machines both have 64 FP32 and 32 FP64 cores per streaming multiprocessor [16,17]. *TITAN*, on the other hand, has 32 FP32 cores per streaming multiprocessor, with only 2 FP64 cores for correctness purposes [18]. First, we look into the executions without the "fast math" option. Out of 987 instances, 842 converged to the same limit point, 27 converged but not to the same limit point, while 118 instances hit the maximum number of rounds limit. By contrast, the double-precision executions converge

to the same limit point without reaching the maximum number of rounds for 893 instances (see Section 4.1). This shows the effects of reduced precision on the correctness of results. (As before, these numbers are taken from the *V100* machine execution and they might differ by a single-digit number across different machines.)

On the *V100*, we observe that the performance difference between the double- and single-precision executions is minimal: double-precision speedup over the whole test set was 7.42, while the single-precision speedup stands at 7.27. Executions on *TITAN* and *P400* gave a modest speedup: on *TITAN*, the average speedup over the whole set increased from 2.98 to 3.63, while on *P400*, it increased from 0.47 to 0.60.

The lack of a more significant speedup may be somewhat unexpected, but can be explained by the sparse input data and data structures of the implementations. Observe that a large part of the memory for the domain propagation kernels is not used for floating-point numbers. Without going into details, we note that the CSR storage format used to represent the sparse constraint matrix contains more integers than floating-point numbers, a memory that is accessed often to implement the indexing logic of the reductions. Due to the low arithmetic intensity of domain propagation, there are usually not many floating-point operations following the indexing. Additionally, apart from looking at global memory, internal parts of the computations also work on integers. For example, the reductions to keep track of the number of infinity contributions to activities works exclusively on integers, see Section 3.4. To summarize, we perform as many reductions on integers as we do on floating-point numbers to compute the activities, plus all the indexing accesses are on integers.

We performed the roofline analysis on single-precision executions on the *V100* machine over the same 349 instances used in Section 4.4. The results are similar to the double-precision executions. However, the average recorded percent of peak performance according to the model drops to 14.86%, with the maximal value recorded at 68.9%. Average arithmetic intensity was recorded at 2.74, with minimal and maximal values at 0.18 and 28.85, respectively. This makes the single-precision code even more memory-bound than the double-precision code. The machine balance value roughly doubles.

Unlike the standard single-precision execution, the combination of "fast math" and single-precision execution provided a speedup on the *V100* GPU. Overall, the speedup increased from 7.42 to 8.54. On *TITAN* and *P400*, the speedups increased from 2.98 to 4.38 and from 0.47 to 0.61, respectively. For this execution, out of 987 instances, 736 converged to the same limit point, 28 converged but not to the same limit point, while 223 instances hit the maximum number of rounds limit. (As before, these numbers are taken from the *V100* machine execution and they might differ by a single-digit number across different machines.) We can see that the numerics seems to be affected because significantly more instances hit the maximum number of rounds limit, however, the number of instances with inconsistent results only increased by one.



**Fig. 2.** The left graph (a) shows geometric means of speedups over the eight subsets of instances of increasing size. The right graph (b) shows the distribution of speedups by plotting the individual speedup for each instance, sorted in ascending order. Double-precision executions are shown in solid lines, standard single-precision executions are shown in dashed lines, while the “fast math” mode single-precision executions are shown in dotted lines. The baseline is *cpu\_seq-xeon* execution in double-precision arithmetic.

#### 4.6. Comparison with PaPILO

The *cpu\_seq* and *cpu\_omp* algorithms used as the baseline comparison were implemented by the authors for the purposes of this paper. In this section, we evaluate their performance and accuracy against an independent MIP presolver PaPILO, which is distributed as part of the SCIP Optimization Suite 7.0 [11]. PaPILO provides domain propagation implementation that can run either in single- or multi-thread setups. Our goal is not to provide a detailed and exact computational comparison of the two codes; rather, we want to show that our implementations are of “reasonable” performance compared to state-of-the-art domain propagation implementations.

Before interpreting results, we acknowledge that the comparison of the performance of our code against PaPILO is not on equal grounds. First, PaPILO is a much more generic framework, allowing interfacing with different MIP solvers and providing many additional presolving methods, apart from propagation of linear constraints as described in this paper. Even though we disable all other presolving methods but the domain propagation, PaPILO is tuned for the case of multiple methods being applied in combination. Second, PaPILO is meant to be used in conjunction with MIP solvers and will perform reductions useful to them while it performs domain propagation. Specifically, it will remove redundant constraints and variables that are no longer useful to MIP solvers. These operations cannot be turned off in PaPILO and will be included in the runtime. Our code, on the other hand, does not try to perform these reductions but focuses solely on domain propagation.

From the point of view of correctness and comparison of results, the above-mentioned behavior of PaPILO also presents difficulties. Because PaPILO will often reduce the problem formulation as described, we cannot directly compare the arrays of tightened bounds. Instead, we run our code and record the runtime, but afterwards, we pass the already propagated instance to PaPILO and let it process it. PaPILO will perform the necessary reductions such that the problem then (hopefully) matches the problem that is solved by PaPILO from scratch. However, apart from checking that the resulting tightened bounds are the same as described in Section 4.3, we now also check that PaPILO did not find any bound changes in the instance solved by our code first. We acknowledge that this is not a rigorous way to check if the results are the same, but we believe it suffices for our goal of establishing



**Fig. 3.** Geometric means of speedups over the eight subsets of instances of increasing size for the two PaPILO executions with 1 and 8 threads, and the *cpu\_omp* execution. All executions are in double-precision arithmetic. Baseline is *cpu\_seq* running on the *amdtr* machine.

that our baseline implementations are valid benchmarks in terms of accuracy and performance. Additionally, this approach re-reads and re-writes instances multiple times through two different MIP instance file readers, which in few cases results in errors and wrongly read values. These and other issues reduced the number of instances where PaPILO and our code produce the same results. Nevertheless, 701 instances did pass this check and they form the base for the results shown in Fig. 3.

All executions are done on the *amdtr* machine (see Section 4.2) in double-precision arithmetic. The base case is *cpu\_seq* running on the *amdtr* machine against which the speedups are computed for the two PaPILO executions with 1 and 8 threads, and *cpu\_omp* execution with 8 threads. On average, the single-threaded PaPILO run achieves a speedup of 0.08 compared to our single-threaded code. The PaPILO run with 8 threads performs slightly worse with a speedup of 0.07, on average. As is the case with our code, the multi-threaded PaPILO execution performs worse for subsets with smaller instances, while it eventually beats the single-threaded execution for the *Set-8* with

the largest instances. All in all, these results demonstrate that our implementations *cpu\_seq* and *cpu\_omp* provide a competitive baseline for the performance comparisons presented in the previous sections.

## 5. Conclusions and outlook

In this paper, we show that domain propagation is – with the right algorithmic design – amenable to efficient execution on GPUs. While this result is promising for the exploitation of GPUs in the MIP solving process, challenges still remain in this respect. Two primary use cases for domain propagation in MIP solvers are in *presolving* and after *branching* on variables during the main solving process. In the former case, the algorithm starts in the same setting as used in this paper — with no prior knowledge of the state of the constraint system. In the latter case, however, the system was usually already fully propagated before the branching took place. This means that the algorithm starts at a situation that is equivalent to just after a propagation round with a single bound change on the branching variable. Here, the *cpu\_seq* can in most cases avail of the constraint marking mechanism to significantly reduce the amount of work it needs to perform (see Section 2.3), to the point where there is not enough work to justify the cost of parallelization. On the other hand, the amount of time spent in domain propagation during presolving is usually small relative to the main solving time, and thus the motivation to speed it up “as is” is low.

In conclusion, while our new algorithm offers the ability to perform domain propagation on a much larger scale than before, new parent methods are required which will use domain propagation in a way suited for GPUs and to the benefit of MIP solvers. A particularly useful feature of our algorithm in this respect is that it is able to run without the need for synchronization with the CPU, making it possible to embed it in parent GPU-based algorithms as well as allowing the CPU to keep processing while the GPU propagation is taking place.

Execution of SpMVs on GPUs is an active research area and several improvements on the CSR-adaptive from [13] as well as new SpMV algorithms have been published since. For example, see [19–21]. A future research direction would be to analyze if the improvements of SpMV algorithms in the recent literature can be carried over to the computation of activities as described in this paper (see Section 3).

## Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

## Appendix A. Variability of baselines

The CPU architectures where we execute the *cpu\_seq* algorithm differ considerably among themselves (see Section 4.2). As explained, we chose the execution of *cpu\_seq* on the *xeon* machine as our baseline case and reported all speedups relative to this execution. However, one should keep in mind that executions of the *cpu\_seq* algorithm on other architectures will give different results due to architectural differences (e.g., amount of RAM or cache memory). This effect is shown in Fig. A.4, which plots the speedups of the *cpu\_seq* execution on *amdtr* and *i7-9700K* machines relative to the execution on the *xeon* machine on the instances from the test set defined in Section 4.1. We can see that the *amdtr* execution is faster on average than the *xeon* run, but not by a constant factor. In fact, there are instances where the *amdtr* is almost 4 times faster, and other instances where *xeon* beats it by a factor of 1.2. This shows that as the features of the instances change (e.g., size, shape of constraint matrix, etc.), the architecture of the machine affects performance by a different amount. Furthermore, the slopes on the curves are not linear, indicating that for some instances, the architectural differences have a disproportionately high effect.



**Fig. A.4.** Distribution of speedups by plotting the individual speedup for each instance, sorted in ascending order. All executions are in double-precision arithmetic. Baseline is *cpu\_seq* running on the *xeon* machine.



**Fig. B.5.** Geometric means of speedups over the eight subsets of instances of increasing size for the three algorithm variants *cpu\_loop*, *gpu\_loop*, and *megakernel*. Baseline is *cpu\_seq* running on the *xeon* machine. All executions are in double-precision arithmetic.

## Appendix B. GPU versus CPU synchronization

In Section 3.7 we presented three variants of kernel synchronization: *cpu\_loop*, *gpu\_loop*, and *megakernel*. We now present results quantifying the difference in their performance. The test set as defined in Section 4.1 is run on the *i7-9700K* CPU and the *RTXsuper* GPU (see Section 4.2). Results are shown in Fig. B.5.

Over the whole test set, the *cpu\_loop* is 1.72 times faster than the *gpu\_loop*, on average. However, most of this speedup comes from instances of smaller size, as we can see in Fig. B.5. The sequentialization point, which causes the slowdown on the GPU, remains constant as the sizes of instances increase, while the parallel part exploited by the GPU increases. Hence, as the sizes of the instances increase, the sequentialization point has less and less effect on the performance, an effect described in Amdahl’s law [22]. Indeed, we can see in Fig. B.5 that the two curves converge as the problem sizes increase.

On the other hand, the *megakernel* performs significantly worse than the other two variants over all subsets. Finally, keep in mind that our implementations were developed for the purpose of this paper, and while they contain all the described algorithmic steps, numerous performance optimizations are still possible. Due to the higher code complexity of the *megakernel* compared to the *cpu\_loop* and *gpu\_loop*, it could perhaps benefit the most from future optimization efforts.

## Appendix C. Differences to the conference version

A conference version of this paper was submitted and accepted for publication in the proceedings of IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3) [23]. Since then, the code has been significantly extended and improved, including bug fixes. Some of these additions are performance-neutral, while others affected it negatively. Most notably, we acknowledge a missing compiler flag during our test execution which negatively affected the performance of the *cpu\_seq* baseline used in [23]. On the other hand, many new additions improve performance, e.g., the reduced number of atomic operations described in Section 3.5, the possibility to run the rounds loop on the CPU described in Section 3.7, technical code improvements to reduce register usage and others. All these changes affect performance in a non-uniform way: in some cases, the performance is now superior to the results reported in [23], while in other cases it is inferior.

Also, the correctness of results was affected in some cases due to a missing feature described in Section 3.4. Moreover, numerical handling has been significantly improved, making the code more robust against round-off errors and other numerical difficulties. However, note that also in [23], performance comparisons were only reported over sets of instances for which competing implementations had converged to the same result.

## References

- [1] T. Achterberg, R. Wunderling, Mixed integer programming: Analyzing 12 years of progress, in: M. Jünger, G. Reinelt (Eds.), *Facets of Combinatorial Optimization: Festschrift for Martin Grötschel*, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 449–481, [http://dx.doi.org/10.1007/978-3-642-38189-8\\_18](http://dx.doi.org/10.1007/978-3-642-38189-8_18).
- [2] T. Koch, A. Martin, M.E. Pfetsch, Progress in academic computational integer programming, in: M. Jünger, G. Reinelt (Eds.), *Facets of Combinatorial Optimization: Festschrift for Martin Grötschel*, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 483–506, [http://dx.doi.org/10.1007/978-3-642-38189-8\\_19](http://dx.doi.org/10.1007/978-3-642-38189-8_19).
- [3] G. Nemhauser, L. Wolsey, *Integer and Combinatorial Optimization*, John Wiley & Sons, Inc., 1988, <http://dx.doi.org/10.1002/9781118627372>.
- [4] P. Belotti, S. Cafieri, J. Lee, L. Liberti, Feasibility-based bounds tightening via fixed points, in: W. Wu, O. Daescu (Eds.), *Combinatorial Optimization and Applications*, Proc. of COCOA 2010, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 65–76, [http://dx.doi.org/10.1007/978-3-642-17458-2\\_7](http://dx.doi.org/10.1007/978-3-642-17458-2_7).
- [5] T. Achterberg, *Constraint Integer Programming* (Ph.D. thesis), TU Berlin, 2009.
- [6] V. Boyer, D.E. Baz, M. Salazar-Aguilar, GPU computing applied to linear and mixed-integer programming, *Adv. GPU Res. Pract.* (2017) 247–271, <http://dx.doi.org/10.1016/B978-0-12-803738-6.00010-0>.
- [7] M.W.P. Savelsbergh, Preprocessing and probing techniques for mixed integer programming problems, *ORSA J. Comput.* 6 (1994) 445–454.
- [8] D. Waltz, Understanding line drawings of scenes with shadows, in: *The Psychology of Computer Vision*, McGraw-Hill, 1975, p. pages.
- [9] E.D. Andersen, K.D. Andersen, Presolving in linear programming, *Math. Program.* 71 (1995) 221–245, <http://dx.doi.org/10.1007/BF01586000>.
- [10] J. Shetman, N.V. Sahinidis, A finite algorithm for global minimization of separable concave programs, *J. Global Optim.* 12 (1998) 1–36.
- [11] G. Gamrath, D. Anderson, K. Bestuzheva, W.-K. Chen, L. Eifler, M. Gasse, P. Gemander, A. Gleixner, L. Gottwald, K. Halbig, G. Hendel, C. Hojny, T. Koch, P. Le Bodic, S.J. Maher, F. Matter, M. Miltenberger, E. Mühmer, B. Müller, M.E. Pfetsch, F. Schlösser, F. Serrano, Y. Shinano, C. Tawfik, S. Vigerske, F. Wegscheider, D. Weninger, J. Witzig, *The SCIP Optimization Suite 7.0*, Technical Report, Optimization Online, 2020.
- [12] A. Gleixner, G. Hendel, G. Gamrath, T. Achterberg, M. Bastubbe, T. Berthold, P.M. Christophel, K. Jarck, T. Koch, J. Linderoth, M. Lübbecke, H.D. Mittelmann, D. Ozyurt, T.K. Ralphs, D. Salvagnin, Y. Shinano, *MIPLIB 2017: Data-Driven Compilation of the 6th Mixed-Integer Programming Library*, Technical Report, Optimization Online, 2019.
- [13] J.L. Greathouse, M. Daga, Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format, in: *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis*, in: SC '14, IEEE Press, 2014, pp. 769–780, <http://dx.doi.org/10.1109/SC.2014.68>.
- [14] Z. Jia, M. Maggioni, B. Staiger, D.P. Scarpazza, Dissecting the NVIDIA volta GPU architecture via microbenchmarking, 2018, <arXiv:1804.06826>.
- [15] S. Williams, A. Waterman, D. Patterson, Roofline: An insightful visual performance model for multicore architectures, *Commun. ACM* 52 (4) (2009) 65–76, <http://dx.doi.org/10.1145/1498765.1498785>.
- [16] Nvidia corporation, nvidia tesla v100 gpu architecture: the world's most advanced data center gpu, 2017, URL [https://images.nvidia.com/content/pdf/volta-architecture-whitepaper.pdf](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf).
- [17] Nvidia corporation, nvidia tesla p100: the most advanced datacenter accelerator ever built featuring pascal gp100, the world's fastest gpu, 2016, URL <https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf>.
- [18] Nvidia corporation, nvidia turing gpu architecture: graphics reinvented, 2018, URL <https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf>.
- [19] M. Daga, J. Greathouse, Structural agnostic spmv: Adapting CSR-adaptive for irregular matrices, in: *2015 IEEE 22nd International Conference on High Performance Computing (HiPC)*, 2015, pp. 64–74.
- [20] M. Steinberger, R. Zayer, H.-P. Seidel, Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the GPU, in: *Proceedings of the International Conference on Supercomputing*, in: ICS '17, Association for Computing Machinery, New York, NY, USA, 2017, <http://dx.doi.org/10.1145/3079079.3079086>.
- [21] M. Li, Y. Ao, C. Yang, Adaptive spmv/SpMSpV on GPUs for input vectors of varied sparsity, *IEEE Trans. Parallel Distrib. Syst.* 32 (2021) 1842–1853.
- [22] G.M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, in: *Proceedings of the April 18-20, 1967, Spring Joint Computer Conference*, in: AFIPS '67 (Spring), Association for Computing Machinery, New York, NY, USA, 1967, pp. 483–485, <http://dx.doi.org/10.1145/1465482.1465560>.
- [23] B. Sofranac, A. Gleixner, S. Pokutta, Accelerating domain propagation: An efficient GPU-parallel algorithm over sparse matrices, in: *2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3)*, 2020, pp. 1–11, <http://dx.doi.org/10.1109/IA351965.2020.00007>.