FPGA tutorial #1438

Draft
wants to merge 12 commits into base: master
1 change: 1 addition & 0 deletions README.md
@@ -68,6 +68,7 @@ For more information on how to use DaCe, see the [samples](samples) or tutorials
* [SDFG API](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/sdfg_api.ipynb)
* [Using and Creating Transformations](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/transformations.ipynb)
* [Extending the Code Generator](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/codegen.ipynb)
* [Targeting FPGAs with DaCe](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/fpga.ipynb)

Publication
-----------
324 changes: 324 additions & 0 deletions tutorials/fpga.ipynb
Collaborator review (comments not left inline):

  • date -> dace
  • explict -> explicit
  • commented code blocks can just be code blocks with a comment in them
  • optmize -> optimize
  • "library nodes Library nodes" -> "library nodes. Library nodes are high-level nodes that can represent a wide variety of operators. In this case, the two matrix multiplications"...
  • "expension this FPGA" -> "expansion. This FPGA"
  • "expensions" -> "expansions" (apply everywhere)
  • "need of off-chip" -> "need for off-chip"
  • "multiple memory buffer being" -> "multiple memory buffers being"
  • "we can distributed" -> "we can distribute"
  • ", that" -> ", which"
  • "of its program" -> "of their program"
  • "applying DaCe auto-optimization heuristic" -> either "heuristics" or "the DaCe". The former is likely better
  • "a set of simplification to" -> "a set of simplification passes to"
  • "of the [...] ones" -> "of [...]."
  • "can not currently being tuned by the user." - what does this mean?
  • "Let's start again from the SDFG parsing" -> "Let's start again from parsing the SDFG"
  • "explicitely" -> "explicitly"
  • "execute FPGA program" -> "execute FPGA programs"
  • "properly" -> ""
  • "For example, to enable Hardware execution via command line (other" -> "Hardware execution can be enabled via the command line, or other"
  • "as follow" -> "as follows"
  • ", that" -> ", which"
  • "bistream" -> "bitstream"
  • "executed the program" -> "execute the program"

Contributor (author) reply:

Thanks, these should be fixed now.

@@ -0,0 +1,324 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# FPGA programming with DaCe\n",
"\n",
"In this tutorial, we will see how a developer can write code using the Python DaCe frontend and generate efficient code for FPGAs.\n",
"We will discuss:\n",
"- how to parse, transform, and optimize the code for FPGA devices with maximal control (for experienced FPGA users)\n",
"- how to get this done automatically by DaCe auto-optimization heuristics (for non-experienced users or to quickly get a working example).\n",
"\n",
"Let's start with `ATAX`, a matrix-transpose-vector multiplication kernel included in the Polybench suite, which computes $y = A^T (A x)$.\n",
"\n",
"Following the [NumPy API](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/numpy_frontend.ipynb) tutorial, we start by writing the DaCe program as a regular Python method decorated with `@dace.program`, with explicit type annotations. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"M, N = 24, 24\n",
"\n",
"@dace.program\n",
"def atax(A: dace.float32[M, N], x: dace.float32[N]):\n",
" return (A @ x) @ A"
]
},
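{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can compute a reference result with plain NumPy on random inputs (a NumPy-only sketch, independent of DaCe; the names `A`, `x`, and `y_ref` are illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Random single-precision inputs matching the declared shapes\n",
"A = np.random.rand(M, N).astype(np.float32)\n",
"x = np.random.rand(N).astype(np.float32)\n",
"\n",
"# (A @ x) @ A computes the same vector as A.T @ (A @ x)\n",
"y_ref = (A @ x) @ A\n",
"assert np.allclose(y_ref, A.T @ (A @ x), rtol=1e-4)"
]
},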
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vanilla execution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by compiling the program for FPGA without applying any particular optimization. First, we can parse the program to build its SDFG and have a look at it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sdfg = atax.to_sdfg()\n",
"sdfg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, we need to transform it for FPGA execution: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dace.transformation.interstate import FPGATransformSDFG\n",
"sdfg.apply_transformations(FPGATransformSDFG)\n",
"sdfg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This transformation creates additional pre- and post-states that perform the memory transfers between host and device. \n",
"The actual computation is now scheduled to be executed on the FPGA as an FPGA kernel, and memories accessed by the transformed subgraph are replaced with their FPGA equivalents.\n",
"\n",
"Now we can compile and run it (commented out here so that this Jupyter notebook can run without an FPGA)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# sdfg.compile()\n",
"# <generate/load input data A and x>\n",
"# y = sdfg(A,x, N=N, M=M)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Manually optimize for FPGA execution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can notice how the current SDFG contains two library nodes. Library nodes are [high-level nodes](https://spcldace.readthedocs.io/en/latest/sdfg/ir.html#library-nodes) that can represent a wide variety of operators; in this case, they represent the two matrix multiplications. During compilation and optimization, library nodes are expanded by replacing them with a subgraph, lowering them towards a concrete implementation of their behavior. For FPGAs, it is convenient to do this explicitly. \n",
"\n",
"First, we specialize the two generic matrix multiplications. In this case, they are indeed two matrix-vector multiplications (one of which is transposed)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sdfg = atax.to_sdfg()\n",
"sdfg.apply_transformations(FPGATransformSDFG)\n",
"sdfg.expand_library_nodes(recursive=False)\n",
"sdfg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For all matrix-vector multiplications (`gemv` and `gemvt`) we can use the `FPGA_Accumulate` expansion. This FPGA-oriented expansion iterates over the input matrix in simple row-major order (with optional tiling). The user can also specify a different expansion for each library node; please refer to the documentation for [all available FPGA expansions](https://spcldace.readthedocs.io/en/latest/optimization/fpga.html#available-fpga-expansions). We now select this expansion and apply it. Since this implementation uses BRAM to store intermediate results, whose size must be known at compile time, we need to \"specialize\" the size of our input data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dace.libraries.blas import Gemv\n",
"Gemv.default_implementation = \"FPGA_Accumulate\"\n",
"sdfg.expand_library_nodes()\n",
"sdfg.specialize(dict(M=M, N=N))\n",
"sdfg"
]
},
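{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of setting a global default, an implementation can also be selected per library node before expanding. A sketch (commented out, since the nodes above have already been expanded):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pick an expansion for each Gemv library node individually\n",
"# for node, _ in sdfg.all_nodes_recursive():\n",
"#     if isinstance(node, Gemv):\n",
"#         node.implementation = \"FPGA_Accumulate\"\n",
"# sdfg.expand_library_nodes()"
]
},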
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the resulting SDFG, we can notice how the two `gemv` nodes have been replaced by their corresponding implementations. \n",
"We note how, in this computation, the memory access patterns (to the inputs `A` and `x` and to the output `return`) are known a priori. We can therefore decouple them from the computation by creating streaming memory accessors, to the benefit of a simplified circuit implementation. DaCe offers the `StreamingMemory` transformation, which does this automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dace.transformation.dataflow import StreamingMemory\n",
"from dace.transformation.interstate import InlineSDFG\n",
"sdfg.apply_transformations_repeated([InlineSDFG, StreamingMemory],\n",
" [{}, {\n",
" 'storage': dace.StorageType.FPGA_Local\n",
" }],\n",
" print_report=True)\n",
"sdfg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can notice from the SDFG, the transformation is applied 3 times: for the reads from `A` (transposed and non-transposed), for the reads from `x`, and for the writes of the final result to memory. While applying the transformation, we also inlined (\"flattened\") the SDFG so that data access patterns can be fully analyzed, and we specified that the resulting streams must be stored in FPGA local memory (BRAM).\n",
"\n",
"In more complicated use cases, this can be useful to exploit burst mode in the memory controller (see the [transformation documentation](https://spcldace.readthedocs.io/en/latest/source/dace.transformation.dataflow.html#dace.transformation.dataflow.streaming_memory.StreamingMemory)), or to broadcast off-chip memory to multiple processing elements. \n",
"\n",
"It could occur that subsequent computations share data through off-chip memory. If the memory access patterns are analyzable, we can avoid this undesirable situation by using the `StreamingComposition` transformation. Similar to `StreamingMemory`, this transformation analyzes data access patterns and, when applicable, converts two connected computations into two separate processing elements connected by a stream, removing the need for off-chip accesses and enabling concurrent execution of the two components. This transformation does not apply to the use case considered here, but the interested reader can refer to the related [documentation](https://spcldace.readthedocs.io/en/latest/source/dace.transformation.dataflow.html#dace.transformation.dataflow.streaming_memory.StreamingComposition)."
]
},
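{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, applying `StreamingComposition` follows the same pattern as `StreamingMemory` (commented out, since it does not match any subgraph in this example):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# from dace.transformation.dataflow import StreamingComposition\n",
"# sdfg.apply_transformations_repeated([InlineSDFG, StreamingComposition],\n",
"#                                     [{}, {\n",
"#                                         'storage': dace.StorageType.FPGA_Local\n",
"#                                     }],\n",
"#                                     print_report=True)"
]
},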
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, since in this case we have multiple memory buffers being accessed concurrently, we can distribute them across different memory banks (if the target device supports more than one memory bank)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dace.transformation.auto.auto_optimize import fpga_auto_opt\n",
"fpga_auto_opt.fpga_rr_interleave_containers_to_banks(sdfg, num_banks=4, memory_type=\"DDR\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `fpga_auto_opt` module contains FPGA-specific automatic optimizations. Another example is `fpga_global_to_local`, which moves containers allocated in global memory to local memory whenever possible.\n",
"\n",
"Finally, we can execute the program (commented out here):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# y = sdfg(A,x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Auto-Optimization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the discussion above enables an experienced programmer to tune the FPGA execution of their program, in many cases a good level of optimization can be achieved automatically by applying auto-optimization heuristics. When targeting FPGA devices, auto-optimization applies a set of simplification passes to the SDFG, followed by the transformations discussed above, with the exception of `StreamingMemory` (and `StreamingComposition`, where applicable). Let's start again from parsing the program into an SDFG:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sdfg = atax.to_sdfg()\n",
"from dace.transformation.auto.auto_optimize import auto_optimize\n",
"sdfg = auto_optimize(sdfg, dace.dtypes.DeviceType.FPGA)\n",
"sdfg.expand_library_nodes()\n",
"sdfg.specialize(dict(M=M, N=N))\n",
"sdfg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that there is no need to explicitly expand library nodes; here we did so to show the resulting SDFG. The program can then be executed as before."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# y = sdfg(A,x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hardware Execution\n",
"\n",
"By default, DaCe is configured to execute FPGA programs in software emulation mode. This behavior can be changed through the DaCe configuration, by setting the compilation mode either programmatically or via an environment variable. Hardware execution can be enabled via the command line, and other methods can be found in the [Configuring DaCe documentation](https://spcldace.readthedocs.io/en/latest/setup/config.html) and in the compilation configuration schema for [Xilinx](https://spcldace.readthedocs.io/en/latest/source/config_schema.html#envvar-compiler.xilinx.mode) and [Intel](https://spcldace.readthedocs.io/en/latest/source/config_schema.html#envvar-compiler.intel_fpga.mode) FPGAs.\n",
"\n",
"For example, to specify hardware execution via an environment variable, the user can launch their DaCe program as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"$ DACE_compiler_xilinx_mode=hardware python path_to_my_dace_program.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will trigger the hardware compilation flow, which generates the bitstream and executes the program on an FPGA-equipped machine. Note that if the bitstream was not previously compiled (or the DaCe program has changed), synthesis may take several hours, depending on the complexity of the generated FPGA program and on the machine's capabilities."
]
},
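{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same setting can also be changed programmatically from Python before compiling (a sketch assuming the `dace.config.Config` API; commented out to avoid triggering hardware compilation from this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# from dace.config import Config\n",
"# Config.set('compiler', 'xilinx', 'mode', value='hardware')\n",
"# y = sdfg(A, x)"
]
},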
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}