Roadmap

NOTE: THIS IS NOW OBSOLETE

Prioritized tasks

Performance benchmarks
Add performance benchmarks to continuous integration. This can include adding an image capturing performance on benchmark problems to the top level.
Free surface boundary condition
Handle free surface boundary condition by smartening stencil behavior at boundary. Currently the free surface BC is handled by copying slices of wavefields per timestep into a mirror array with appropriate polarity reversal. This will is non optimal for performance and free surface modeling is critical to industry scale uptake.
GPU something something
Make it go fast
Handle complex arithmetic
Add capability to handle complex in Devito
Automated roofline model generation in TheMatrix for both CPUs and GPUs

Un-prioritized tasks

Eikonal solver hackathon
Use Devito to implement a fast sweeping method TTI anisotropic eikonal solver. This has many valuable use cases including improving illumination compensation (potential improvements for FWI workflows), enabling "expanding box" (potential improvements in throughput especially for large problems), and the ability to generate traveltime tables.
mpi collect function
A Data method to collect and return MPI distributed wavefields not requiring MPI calls directly. e.g. u.collect()
Test sincos for trig in TTI system
TTI systems often has sin(a) cos(a) nearby in mathematical expressions. The sincos function can be vectorized and is worth investigating for performance relative individual sin and cos calls.
No-copy operators Currently Devito's operators assume that they're always working on Devito-allocated memory. We could jump through a few hoops to repackage externally allocated memory as Function objects and have Devito operators work (partially) off them. This will:

Speed up checkpointing

Be useful in doing any kind of data movement, e.g. do the first time step while copying Data into Devito's memory from externally allocated memory

e.g. 2, do a timestep while doing CPU->GPU (and reverse) data transfers

e.g. 3, do a timestep while doing a data layout transformation, in the simplest case, even a transpose

remainder spatial loops
generated code has 4 calls to spatial loop functions per time step, for handling: x-interior, y-interior, x-interior, y-remainder, x-remainder, y-interior, x-remainder, y-remainder.
These could be replaced with a single call to the function that used a min or ternary function for the loop traversal.

old space loop (with 1 interior and 3 remainder calls):

#pragma omp parallel num_threads(nthreads) private(r36,r37,r38)
  {
    #pragma omp for collapse(2) schedule(dynamic,1)
    for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size) {
      for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size) {
        for (int x = x0_blk0 - 4, xs = 0; x <= x0_blk0 + x0_blk0_size + 2; x += 1, xs += 1) {
         for (int y = y0_blk0 - 4, ys = 0; y <= y0_blk0 + y0_blk0_size + 2; y += 1, ys += 1) {
            #pragma omp simd aligned(p0:32)
            for (int z = z_m - 4; z <= z_M + 3; z += 1) {}
          }
        }
        for (int x = x0_blk0, xs = 0; x <= x0_blk0 + x0_blk0_size - 1; x += 1, xs += 1) {
         for (int y = y0_blk0, ys = 0; y <= y0_blk0 + y0_blk0_size - 1; y += 1, ys += 1) {
            #pragma omp simd aligned(b,p0,vel,wOverQ:32)
            for (int z = z_m; z <= z_M; z += 1) {}
          }
        }
      }
    }
  }

new space loop (single call):

#pragma omp parallel num_threads(nthreads) private(r36,r37,r38)
  {
    int xmax, ymax;
    #pragma omp for collapse(2) schedule(dynamic,1)
    for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size) {
      for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size) {
        xmax = min((x0_blk0 + x0_blk0_size + 2), x_M);
        ymax = min((y0_blk0 + y0_blk0_size + 2), y_M);
        for (int x = x0_blk0 - 4, xs = 0; x <= xmax; x += 1, xs += 1) {
          for (int y = y0_blk0 - 4, ys = 0; y <= ymax; y += 1, ys += 1) {
            #pragma omp simd aligned(p0:32)
            for (int z = z_m - 4; z <= z_M + 3; z += 1) {}
          }
        }
        xmax = min((x0_blk0 + x0_blk0_size - 1), x_M);
        ymax = min((y0_blk0 + y0_blk0_size - 1), y_M);
        for (int x = x0_blk0, xs = 0; x <= xmax; x += 1, xs += 1) {
          for (int y = y0_blk0, ys = 0; y <= ymax; y += 1, ys += 1) {
            #pragma omp simd aligned(b,p0,vel,wOverQ:32)
            for (int z = z_m; z <= z_M; z += 1) {}
          }
        }
      }
    }
  }

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Roadmap

NOTE: THIS IS NOW OBSOLETE

Prioritized tasks

Un-prioritized tasks

Clone this wiki locally