FPGA tutorial #1438
Draft
TizianoDeMatteis wants to merge 12 commits into master from fpga_tutorial
Commits
eaf26f0 FPGA tutorial, defined problem (ATAX) (TizianoDeMatteis)
6cc264c FPGA tutorial: lib node expansions (TizianoDeMatteis)
0c90169 FPGA Tutorial: vanilla execution (TizianoDeMatteis)
84c10a5 FPGA Tutorial, auto-opt (TizianoDeMatteis)
c9dd076 Merge branch 'master' into fpga_tutorial (TizianoDeMatteis)
ae29879 FPGA tutorial add hardware execution (TizianoDeMatteis)
0124b92 FPGA tutorial executed (TizianoDeMatteis)
5d504de Merge branch 'master' into fpga_tutorial (TizianoDeMatteis)
ea2783d Incorporate review comments (TizianoDeMatteis)
045b192 Merge branch 'master' into fpga_tutorial (TizianoDeMatteis)
25f0622 Execute the fpga tutorial (TizianoDeMatteis)
8dda104 Merge branch 'master' into fpga_tutorial (TizianoDeMatteis)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,324 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# FPGA programming with DaCe\n", | ||
"\n", | ||
"In this tutorial, we will see how a developer can write code using the python DaCe frontend and generate efficient code for FPGA.\n", | ||
"We will discuss:\n", | ||
"- how to parse, transform, and optimize the code for FPGA devices with maximal control (for experienced FPGA users)\n", | ||
"- how to get this done automatically by DaCe auto-optimization heuristics (for non-experienced users or to quickly get a working example).\n", | ||
"\n", | ||
"Let's start with `ATAX`, a Matrix Transpose vector multiplication included in the Polybench suite: the case of ATAX, that computes $y = A^T Ax$.\n", | ||
"\n", | ||
"Following the [Numpy API](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/numpy_frontend.ipynb) tutorial, we start by writing the DaCe program as a regular python method, annotated with the `dace.program` annotation, with explicit type annotation. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import dace" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"M, N = 24, 24\n", | ||
"\n", | ||
"@dace.program\n", | ||
"def atax(A: dace.float32[M, N], x: dace.float32[N]):\n", | ||
" return (A @ x) @ A" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Vanilla execution" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Let's start by compiling the program for FPGA without applying any particular optimization. First, we can parse the program to build its SDFG and have a look at it:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"sdfg = atax.to_sdfg()\n", | ||
"sdfg" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"At this point, we need to transform it for FPGA execution: " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from dace.transformation.interstate import FPGATransformSDFG\n", | ||
"sdfg.apply_transformations(FPGATransformSDFG)\n", | ||
"sdfg" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This transformation takes care of creating create additional pre- and post-states to perform memory transfers between host and device performing memory transfers between host and device. \n", | ||
"The actual computation is now scheduled to be executed on the FPGA as an FPGA kernel, and memories accessed by the transformed subgraph are replaced with their FPGA equivalents.\n", | ||
"\n", | ||
"Now we can compile and run it for execution (now commented for the sake of executing in the jupyter notebook)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# sdfg.compile()\n", | ||
"# <generate/load input data A and x>\n", | ||
"# y = sdfg(A,x, N=N, M=M)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Manually optmize for FPGA execution" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We can notice how the current SDFG contains two library nodes. Library nodes are [high-level nodes](https://spcldace.readthedocs.io/en/latest/sdfg/ir.html#library-nodes) that can represent a wide variety of operators. In this case, the two matrix multiplications. During compilation and optimization, Library Nodes are expanded by replacing them with a subgraph, lowering them towards a concrete implementation of their behavior. For FPGA, it is convenient to do this explicitly. \n", | ||
"\n", | ||
"First, we specialize the two generic matrix multiplications. In this case, they are indeed two matrix-vector multiplications (one of which is transposed)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"sdfg = atax.to_sdfg()\n", | ||
"sdfg.apply_transformations(FPGATransformSDFG)\n", | ||
"sdfg.expand_library_nodes(recursive=False)\n", | ||
"sdfg" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For all matrix-vector multiplications (`gemv` and `gemvt`) we can use the `FPGA_Accumulate` expansion. This FPGA-oriented expansion iterates over the input matrix in simple row-major order (with optional tiling). The user can also specify a different expansion for each library node. Please refer to the documentation to see [all available FPGA expansions](https://spcldace.readthedocs.io/en/latest/optimization/fpga.html#available-fpga-expansions). We now choose the expansion and apply it (expanding it). Since this implementation makes use of BRAMs to store intermediate results whose size must be known at compile time, we need to \"specialize\" the size of our input data." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from dace.libraries.blas import Gemv\n", | ||
"Gemv.default_implementation = \"FPGA_Accumulate\"\n", | ||
"sdfg.expand_library_nodes()\n", | ||
"sdfg.specialize(dict(M=M, N=N))\n", | ||
"sdfg" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In the resulting SDFG, we can notice how the two `gemv` have been replaced by the corresponding implementations. \n", | ||
"We note how, in this computation, the memory access pattern (to the inputs `A` and `x` and output `return`) are known a priori. We can therefore decouple them from the computation creating streaming memory accessors, for the benifit of a simplified circuit implementation. DaCe offers the `StreamingMemory` transformation that automatically does this." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from dace.transformation.dataflow import StreamingMemory\n", | ||
"from dace.transformation.interstate import InlineSDFG\n", | ||
"sdfg.apply_transformations_repeated([InlineSDFG, StreamingMemory],\n", | ||
" [{}, {\n", | ||
" 'storage': dace.StorageType.FPGA_Local\n", | ||
" }],\n", | ||
" print_report=True)\n", | ||
"sdfg" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As we can notice from the SDFG, the transformation is applied 3 times: for the reads from `A` (transposed and non-trasposed), for the reads from `x`, and for the writings of the final result in memory. While applying the transformation, we also Inlined (\"flattened\") the SDFG so that we can fully analyze data access patterns, and we specified that the resulting streams must be stored in FPGA local memory (BRAM).\n", | ||
"\n", | ||
"In more complicated use cases, this can be useful to make use of burst-mode in memory controller (see the [transformation documentation](https://spcldace.readthedocs.io/en/latest/source/dace.transformation.dataflow.html#dace.transformation.dataflow.streaming_memory.StreamingMemory)), or broadcasting off-chip memory to multiple processing elements. \n", | ||
"\n", | ||
"It could occur that subsequent computations share data through off-chip memory. If the memory access patterns are analyzable, we can avoid this undesirable situation by using the `StreamingComposition` transformation. Similar to `StreamingMemory`, this transformation will analyze data access patterns and, when applicable, converts two connected computations into two separate processing elements, with a stream connecting the results, removing the need for off-chip accesses and enabling the concurrent execution of the two components. This transformation does not apply in the considered use case, but the interested reader can refer to the related [documentation](https://spcldace.readthedocs.io/en/latest/source/dace.transformation.dataflow.html#dace.transformation.dataflow.streaming_memory.StreamingComposition)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Finally, since in this case we have multiple memory buffers being accessed concurrently, we can distribute them on different memory banks (if the target device supports more than one memory bank)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from dace.transformation.auto.auto_optimize import fpga_auto_opt\n", | ||
"fpga_auto_opt.fpga_rr_interleave_containers_to_banks(sdfg, num_banks = 4, memory_type = \"DDR\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The `fpga_auto_opt` module contains FPGA-specific optimizations. Another example of automatic optimization that can be applied is `fpga_global_to_local`, which changes the storage of containers allocated in global memory to local memory when this is possible.\n", | ||
"\n", | ||
"Finally, we can execute the program (here commented out)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# y = sdfg(A,x)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Auto-Optimization" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"While the discussion above enables an experienced programmer to tune the FPGA execution of their program, in many cases a good level of optimization can be achieved automatically by applying auto-optimization heuristics. If this targets FPGA devices, it will apply a set of simplification passes to the SDFG, and then applies the transformations discussed above, with the exception of the `StreamingMemory` (or `StreamingComposition` when applicable). Let's start again from parsing the SDFG:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"sdfg = atax.to_sdfg()\n", | ||
"from dace.transformation.auto.auto_optimize import auto_optimize\n", | ||
"sdfg = auto_optimize(sdfg, dace.dtypes.DeviceType.FPGA)\n", | ||
"sdfg.expand_library_nodes()\n", | ||
"sdfg.specialize(dict(M=M, N=N))\n", | ||
"sdfg" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Note that there is no need to explicitly expand library nodes. Here we did so to show the resulting SDFG. Then the program can be executed as before." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# y = sdfg(A,x)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Hardware Execution\n", | ||
"\n", | ||
"By default, DaCe is configured to execute FPGA programs in software emulation mode. This behavior can be changed through DaCe configuration settings, by setting the compilation mode either programmatically or via an environment variable. Hardware execution can be enabled via the command line, or other methods can be found in the [Configuring DaCe documentation](https://spcldace.readthedocs.io/en/latest/setup/config.html) and in the compilation configuration schema for [Xilinx](https://spcldace.readthedocs.io/en/latest/source/config_schema.html#envvar-compiler.xilinx.mode) and [Intel](https://spcldace.readthedocs.io/en/latest/source/config_schema.html#envvar-compiler.xilinx.mode) FPGAs).\n", | ||
"\n", | ||
"For example, to specify hardware execution via environment variable, the user can execute their DaCe program as follows:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "shellscript" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"$ DACE_compiler_xilinx_mode=hardware python path_to_my_dace_program.py" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This will trigger the hardware compilation flow, which will generate the bitstream and execute the program on a FPGA equipped machine. Note that if the bitstream was not previously compiled (or there have been changes to the DaCe program), synthesis may require several hours, depending on the complexity of the generated FPGA program and machine capabilities." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
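A possible companion to the "Manually optimize for FPGA execution" section: the notebook says the two library nodes are matrix-vector multiplications (one transposed) and that the `FPGA_Accumulate` expansion walks the input matrix in row-major order while keeping intermediate results in on-chip buffers. The plain-Python sketch below only illustrates that arithmetic; it is not DaCe's generated code, and the function names `gemv_row_major`, `gemv_transposed_row_major`, and `atax_reference` are hypothetical helpers introduced here. It may also serve as a host-side reference for the commented-out `y = sdfg(A, x)` cells.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def gemv_row_major(A: Matrix, x: Vector) -> Vector:
    """t = A x, visiting A one row at a time (row-major order)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def gemv_transposed_row_major(A: Matrix, t: Vector) -> Vector:
    """y = A^T t, still visiting A row by row.

    Row i contributes A[i][j] * t[i] to every output element j, so a
    buffer of N partial sums stays live for the whole traversal. This is
    only an illustration of why such a buffer needs a known size.
    """
    n = len(A[0])
    y = [0.0] * n
    for row, t_i in zip(A, t):
        for j, a_ij in enumerate(row):
            y[j] += a_ij * t_i
    return y


def atax_reference(A: Matrix, x: Vector) -> Vector:
    """y = A^T (A x), the same product the notebook writes as (A @ x) @ A."""
    return gemv_transposed_row_major(A, gemv_row_major(A, x))


# Tiny 3x2 example (M=3, N=2), small enough to check by hand.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, 1.0]
y = atax_reference(A, x)
print(y)  # [79.0, 100.0]
```

By hand: `A x = [3, 7, 11]`, and `A^T [3, 7, 11] = [3 + 21 + 55, 6 + 28 + 66] = [79, 100]`.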
No inline comments:
Thanks, these should be fixed now.
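On the "Hardware Execution" section: the notebook sets `DACE_compiler_xilinx_mode=hardware` on the command line. A minimal sketch of doing the same from a Python launcher script follows. The helper `dace_env_var` is hypothetical, and its naming rule (prefix `DACE_`, key path joined with underscores) is an assumption generalized from the single variable shown in the notebook; please check it against the Configuring DaCe documentation before relying on it for other settings.

```python
import os
import sys
from typing import Dict, List


def dace_env_var(*key_path: str) -> str:
    """Build an environment variable name from a configuration key path.

    Assumed rule: 'DACE_' + path joined by underscores, matching the
    notebook's DACE_compiler_xilinx_mode example.
    """
    return "DACE_" + "_".join(key_path)


# Copy the current environment and request hardware mode for Xilinx.
env: Dict[str, str] = dict(os.environ)
env[dace_env_var("compiler", "xilinx", "mode")] = "hardware"

# Placeholder script name taken from the notebook's shell example.
cmd: List[str] = [sys.executable, "path_to_my_dace_program.py"]

# Left commented out, as in the notebook: running it would start the
# hardware compilation flow, which may take hours.
# import subprocess
# subprocess.run(cmd, env=env, check=True)
```

Passing a copied `env` keeps the setting local to the launched process instead of changing the environment of the calling shell or notebook kernel.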