Release v0.15 · spcl/dace

What's Changed

Work-Depth / Average Parallelism Analysis by @hodelcl in #1363 and #1327

A new analysis engine allows SDFGs to be statically analyzed for work and depth / average parallelism. The analysis accepts a list of assumptions about symbolic program parameters, which can help simplify and improve the results. The following example shows how to use it:

from dace.sdfg.work_depth_analysis import work_depth

# A dictionary mapping each SDFG element to a tuple (work, depth)
work_depth_map = {}
# Assumptions about symbolic parameters
assumptions = ['N>5', 'M<200', 'K>N']
work_depth.analyze_sdfg(mysdfg, work_depth_map, work_depth.get_tasklet_work_depth, assumptions)

# A dictionary mapping each SDFG element to its average parallelism
average_parallelism_map = {}
work_depth.analyze_sdfg(mysdfg, average_parallelism_map, work_depth.get_tasklet_avg_par, assumptions)
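
To make the three quantities concrete, here is a minimal sketch on a toy task graph. It does not use DaCe: the graph, the node costs, and the helper names `work` and `depth` are hypothetical, and the numbers are plain integers, whereas the analysis above works with symbolic program parameters. Work is the total cost of all nodes, depth is the cost of the longest dependency chain, and average parallelism is their ratio.

```python
from functools import lru_cache

# Hypothetical toy task graph (not a DaCe SDFG): node -> (cost, successors)
graph = {
    "load":  (1, ["mul_a", "mul_b"]),
    "mul_a": (2, ["add"]),
    "mul_b": (2, ["add"]),
    "add":   (1, []),
}


def work(graph):
    """Total work: the sum of all node costs."""
    return sum(cost for cost, _ in graph.values())


def depth(graph):
    """Depth: the cost of the most expensive dependency chain (critical path)."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        cost, successors = graph[node]
        return cost + max((longest_from(s) for s in successors), default=0)

    return max(longest_from(node) for node in graph)


w, d = work(graph), depth(graph)
average_parallelism = w / d  # work divided by depth
print(w, d, average_parallelism)
```

In this toy graph the two multiplications can run side by side, so the depth is smaller than the work and the average parallelism is greater than one.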

Symbol parameter reduction in generated code (#1338, #1344)

To improve our integration with external codes, we limit the symbolic parameters generated by DaCe to only the used symbols. Take the following code for example:

@dace
def addone(a: dace.float64[N]):
  for i in dace.map[0:10]:
    a[i] += 1

Since the internal code does not actually need N to process the array, it will not appear in the generated code. Before this release the signature of the generated code would be:

DACE_EXPORTED void __program_addone(addone_t *__state, double * __restrict__ a, int N);

After this release it is:

DACE_EXPORTED void __program_addone(addone_t *__state, double * __restrict__ a);

Note that this is a major, breaking change: users who interact manually with the generated .so files must adapt their calls to the new signatures.
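
The rule can be illustrated with a small, self-contained sketch. This is not how DaCe determines used symbols; it is a toy model that scans Python source with the standard `ast` module, and the helper names `used_names` and `reduced_signature` are hypothetical. A symbol is kept in the argument list only if the program body refers to it.

```python
import ast


def used_names(source: str) -> set:
    """Collect every identifier that appears in a piece of Python source."""
    return {node.id for node in ast.walk(ast.parse(source)) if isinstance(node, ast.Name)}


def reduced_signature(array_args, declared_symbols, body):
    """Keep all array arguments, but only the symbols that the body refers to."""
    used = used_names(body)
    return list(array_args) + [s for s in declared_symbols if s in used]


# Body with a fixed range: N is declared for the array shape but never used.
body_fixed = "for i in range(0, 10):\n    a[i] += 1"
# Body whose range depends on N: the symbol must stay in the signature.
body_symbolic = "for i in range(0, N):\n    a[i] += 1"

print(reduced_signature(["a"], ["N"], body_fixed))
print(reduced_signature(["a"], ["N"], body_symbolic))
```

The first call mirrors the addone example above, where N drops out of the signature; the second shows the case in which the symbol would have to remain.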

Externally-allocated memory (workspace) support (#1294)

A new allocation lifetime, dace.AllocationLifetime.External, has been introduced into DaCe. You can now use your DaCe code with external memory allocators (such as PyTorch) by asking DaCe (a) how much transient memory it will need and (b) to use a specific pre-allocated pointer. Example:

@dace
def some_workspace(a: dace.float64[N]):
  workspace = dace.ndarray([N], dace.float64, lifetime=dace.AllocationLifetime.External)
  workspace[:] = a
  workspace += 1
  a[:] = workspace

csdfg = some_workspace.to_sdfg().compile()

sizes = csdfg.get_workspace_sizes()  # Returns {dace.StorageType.CPU_Heap: N*8}
wsp = ...  # Allocate externally
csdfg.set_workspace(dace.StorageType.CPU_Heap, wsp)

The same interface is available in the generated code:

size_t __dace_get_external_memory_size_CPU_Heap(programname_t *__state, int N);
void __dace_set_external_memory_CPU_Heap(programname_t *__state, char *ptr, int N);
// or GPU_Global...
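
The two-step protocol (query the size, then hand over a buffer) can be sketched without DaCe. Everything in this block is a hypothetical stand-in: `ToyProgram` is not a DaCe class, and a `bytearray` plays the role of memory owned by an external allocator. The sketch only shows the order of operations a caller would follow.

```python
class ToyProgram:
    """Hypothetical stand-in for a compiled program with one external workspace."""

    ITEMSIZE = 8  # bytes per float64 element

    def __init__(self):
        self._workspace = None

    def get_workspace_size(self, n: int) -> int:
        """Step (a): report how many bytes of transient memory are needed."""
        return n * self.ITEMSIZE

    def set_workspace(self, buffer) -> None:
        """Step (b): use caller-owned memory instead of allocating internally."""
        self._workspace = memoryview(buffer)

    def run(self, a: list) -> None:
        nbytes = self.get_workspace_size(len(a))
        if self._workspace is None or self._workspace.nbytes < nbytes:
            raise RuntimeError("workspace missing or too small")
        wsp = self._workspace[:nbytes].cast("d")  # view the bytes as float64
        for i, value in enumerate(a):
            wsp[i] = value  # workspace[:] = a
            wsp[i] += 1     # workspace += 1
        for i in range(len(a)):
            a[i] = wsp[i]   # a[:] = workspace


prog = ToyProgram()
a = [1.0, 2.0, 3.0]
nbytes = prog.get_workspace_size(len(a))
wsp = bytearray(nbytes)  # stands in for an externally managed allocation
prog.set_workspace(wsp)
prog.run(a)
print(nbytes, a)
```

The caller keeps ownership of the buffer throughout, which is the point of an external lifetime: the program never allocates or frees the workspace itself.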

Schedule Trees (EXPERIMENTAL, #1145)

Schedule trees are an experimental feature that lets you analyze your SDFGs in a schedule-oriented format. The conversion takes in SDFGs (even after applying transformations) and outputs a tree of elements that can be printed out in a Python-like syntax. For example:

@dace.program
def matmul(A: dace.float32[10, 10], B: dace.float32[10, 10], C: dace.float32[10, 10]):
  for i in range(10):
    for j in dace.map[0:10]:
      atile = dace.define_local([10], dace.float32)
      atile[:] = A[i]
      for k in range(10):
        with dace.tasklet:
          # ...
sdfg = matmul.to_sdfg()

from dace.sdfg.analysis.schedule_tree.sdfg_to_tree import as_schedule_tree
stree = as_schedule_tree(sdfg)
print(stree.as_string())

will print:

for i = 0; (i < 10); i = i + 1:
  map j in [0:10]:
    atile = copy A[i, 0:10]
    for k = 0; (k < 10); k = (k + 1):
      C[i, j] = tasklet(atile[k], B(10) [k, j], C[i, j])

There are some new transformation classes and passes in dace.sdfg.analysis.schedule_tree.passes, for example, to remove empty control flow scopes:

from dace.sdfg.analysis.schedule_tree import treenodes as tn

class RemoveEmptyScopes(tn.ScheduleNodeTransformer):
  def visit_scope(self, node: tn.ScheduleTreeScope):
    if len(node.children) == 0:
      return None
    return self.generic_visit(node)
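
The pruning idea behind such a pass can be tried out on a toy tree without DaCe. The `Node` class and the `remove_empty_scopes` function below are hypothetical and only mimic the shape of a schedule tree; they are not the classes in dace.sdfg.analysis.schedule_tree. This sketch prunes bottom-up, so a scope that becomes empty once its own empty children are removed is dropped as well.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; a scope is any node created with is_scope=True."""
    label: str
    is_scope: bool = False
    children: List["Node"] = field(default_factory=list)


def remove_empty_scopes(node: Node) -> Optional[Node]:
    """Return the pruned node, or None if it is a scope left without children."""
    if node.is_scope:
        pruned = (remove_empty_scopes(child) for child in node.children)
        node.children = [child for child in pruned if child is not None]
        if not node.children:
            return None
    return node


tree = Node("for i", True, [
    Node("map j", True, [Node("if cond", True)]),  # empty once "if cond" is pruned
    Node("for k", True, [Node("tasklet")]),
])

result = remove_empty_scopes(tree)
labels = [child.label for child in result.children]
print(labels)
```

Returning None to signal removal follows the same convention as the transformer shown above, where a visit method that returns None drops the node.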

We hope you find new ways to analyze and optimize DaCe programs with this feature!

Other Major Changes

Support for tensor linear algebra (transpose, dot products) by @alexnick83 in #1309
(Experimental) support for nested data containers and structures by @alexnick83 in #1324
(Experimental) basic support for mpi4py syntax by @alexnick83 and @Com1t in #1070 and #1288
(Experimental) Added support for a subset of F77 and F90 language features by @acalotoiu and @mcopik in #1275, #1293, #1349 and #1367

Minor Changes

Support for Python 3.12 by @alexnick83 in #1386
Support attributes in symbolic expressions by @tbennun in #1369
GPU User Experience Improvements by @tbennun in #1283
State Fusion Extension with happens-before dependency edge by @acalotoiu in #1268
Add CPU_Persistent map schedule (OpenMP parallel regions) by @tbennun in #1330

Fixes and Smaller Changes

Fix transient bug in test with array_equal of empty arrays by @tbennun in #1374
Fixes GPUTransform bug when data are already in GPU memory by @alexnick83 in #1291
Fixed erroneous parsing of data slices when the data are defined inside a nested scope by @alexnick83 in #1287
Disable OpenMP sections by default by @tbennun in #1282
Make SDFG.name a proper property by @phschaad in #1289
Refactor and fix performance regression with GPU runtime checks by @tbennun in #1292
Fixed RW dependency violation when accessing data attributes by @alexnick83 in #1296
Externally-managed memory lifetime by @tbennun in #1294
External interaction fixes by @tbennun in #1301
Improvements to RefineNestedAccess by @alexnick83 and @Sajohn-CH in #1310
Fixed erroneous parsing of while-loop conditions by @alexnick83 in #1313
Improvements to MapFusion when the Map bodies contain NestedSDFGs by @alexnick83 in #1312
Fixed erroneous code generation of indirected accesses by @alexnick83 in #1302
RefineNestedAccess takes indices into account when checking for missing free symbols by @Sajohn-CH in #1317
Fixed SubgraphFusion erroneously removing/merging intermediate data nodes by @alexnick83 in #1307
Fixed SDFG DFS traversal missing InterstateEdges by @alexnick83 in #1320
Frontend now uses the AST nodes' context to infer read/write accesses by @alexnick83 in #1297
Added capability for non-strict shape validation by @alexnick83 in #1321
Fixes for persistent schedule and GPUPersistentFusion transformation by @tbennun in #1322
Relax test for inter-state edges in default schedules by @tbennun in #1326
Improvements to inference of an SDFGState's read and write sets by @Sajohn-CH in #1325 and #1329
Fixed ArrayElimination pass trying to eliminate data that were already removed in #1314
Bump certifi from 2023.5.7 to 2023.7.22 by @dependabot in #1332
Fix some underlying issues with tensor core sample by @computablee in #1336
Updated hlslib to support Xilinx Vitis >=2022.2 by @carljohnsen in #1340
Docs: mention FPGA backend tested with Intel Quartus PRO by @TizianoDeMatteis in #1335
Improved validation of NestedSDFG connectors by @alexnick83 in #1333
Remove unused global data descriptor shapes from arguments by @tbennun in #1338
Fixed Scalar data validation in NestedSDFGs by @alexnick83 in #1341
Fix for None set properties by @tbennun in #1345
Add Object to defined types in code generation and some documentation by @tbennun in #1343
Fix symbolic parsing for ternary operators by @tbennun in #1346
Fortran fix memlet indices by @Sajohn-CH in #1342
Have memory type as argument for fpga auto interleave by @TizianoDeMatteis in #1352
Eliminate extraneous branch-end gotos in code generation by @tbennun in #1355
TaskletFusion: Fix additional edges in case of none-connectors by @lukastruemper in #1360
Fix dynamic memlet propagation condition by @tbennun in #1364
Configurable GPU thread/block index types, minor fixes to integer code generation and GPU runtimes by @tbennun in #1357

New Contributors

@computablee made their first contribution in #1290
@Com1t made their first contribution in #1288
@mcopik made their first contribution in #1349

Full Changelog: v0.14.4...v0.15
