
fpga benchmark support #14

Open
jzhoulon opened this issue Oct 20, 2022 · 7 comments


@jzhoulon

Currently, NPBench seems to only support CPU and GPU. Is there any support for FPGA? Thanks

@alexnick83
Contributor

There is no support for automatically compiling the DaCe versions for FPGA through the run_framework and run_benchmark scripts. However, the capability exists in DaCe if you have the necessary toolchains installed. I will check if we can add experimental support in NPBench and get back to you.

@jzhoulon
Author

thanks

@jzhoulon
Author

@alexnick83 Is there any experimental code with which I can reproduce the FPGA performance data shown in the paper? Thanks

@alexnick83
Contributor

alexnick83 commented Oct 28, 2022

> @alexnick83 Is there any experimental code with which I can reproduce the FPGA performance data shown in the paper? Thanks

Yes, apart from the paper's artifact, there are tests in the DaCe repository. In the paper, the samples under polybench were run. Note that the FPGA tests may have some new transformations compared to the paper, but I suppose you are looking for the latest developments.

@jzhoulon
Author

@alexnick83 Thanks for the info. However, when I tried to benchmark the tests under polybench, dace_cpu and dace_gpu were much slower than NumPy (8-10x slower), for example with the following code (cholesky_test.py). I have precompiled the SDFG with sdfg.compile(). Do you have any suggestions? Thanks very much

```python
import argparse
import time

import dace
import numpy as np

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--target", default='cpu',
                        choices=['cpu', 'gpu', 'fpga'], help='Target platform')

    args = vars(parser.parse_args())
    target = args["target"]
    sdfg = None
    if target == "cpu":
        sdfg = run_cholesky(dace.dtypes.DeviceType.CPU)
    elif target == "gpu":
        sdfg = run_cholesky(dace.dtypes.DeviceType.GPU)
    elif target == "fpga":
        sdfg = run_cholesky(dace.dtypes.DeviceType.FPGA)

    N = sizes["medium"]
    A = init_data(N)
    gt_A = np.copy(A)
    sdfg_binary = sdfg.compile()

    start = time.time()
    for i in range(10):
        sdfg_binary(A=A, N=N)
    end = time.time()
    # (end - start) * 1000 ms / 10 iterations == (end - start) * 100
    print("accelerator", target, "time is", (end - start) * 100, "ms")

    start = time.time()
    for i in range(10):
        ground_truth(N, gt_A)
    end = time.time()
    print("numpy time is", (end - start) * 100, "ms")
```
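As a side note on the timing methodology (an editor's sketch, not code from the thread): `time.time()` with no warm-up folds one-time costs such as allocation, code loading, or JIT work into the measurement. A more robust harness uses `time.perf_counter`, a warm-up phase, and best-of-N timing. The `bench` helper and the `np.linalg.cholesky` stand-in workload below are illustrative, not part of NPBench or the DaCe tests:

```python
import time
import numpy as np

def bench(fn, *args, reps=10, warmup=2):
    """Return the best-of-`reps` runtime of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs absorb one-time setup costs
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return min(times) * 1000.0  # min is least noisy for short kernels

N = 256
rng = np.random.default_rng(0)
M = rng.random((N, N))
spd = M @ M.T + N * np.eye(N)  # symmetric positive definite input
ms = bench(np.linalg.cholesky, spd)
print(f"numpy cholesky ({N}x{N}): {ms:.3f} ms")
```

Reporting the minimum (or median) of repeated runs, rather than the total divided by the iteration count, also reduces the influence of OS jitter on short kernels.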

@alexnick83
Contributor

alexnick83 commented Oct 31, 2022

For performance runs on CPU and GPU, I would use NPBench rather than the DaCe tests. The tests' purpose is to track functional regressions in the auto-optimizer under parameters controlled by the CI (for example, whether to apply simplify when generating the initial SDFG). Furthermore, the tests use a very small dataset size by default so they finish quickly, so on some of them you may be measuring library overheads. Still, the CPU being 8-10x slower seems strange. In the latest NPBench data (latest master branch), DaCe CPU is 14.3x faster than NumPy, and GPU is 4.6x slower than NumPy, on the same hardware and dataset as in the paper, which more or less matches the published results. Another thing to note is that you must have optimized BLAS libraries installed for CPU execution: if a test contains a matrix multiplication but DaCe cannot find MKL (or OpenBLAS), it will generate the equivalent of the naive algorithm, which runs painfully slow.
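The BLAS point above can be made concrete. The sketch below is an editor's illustration, not code from the thread; `naive_matmul` is a stand-in for the naive code generated when no optimized BLAS is found, compared against NumPy's BLAS-backed `@` on a deliberately tiny matrix:

```python
import time
import numpy as np

def naive_matmul(A, B):
    """Triple-loop matrix multiply: roughly what you get without a BLAS."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for l in range(k):
                s += A[i, l] * B[l, j]
            C[i, j] = s
    return C

n = 64  # tiny on purpose; the gap grows with n (cubic work, cache effects)
rng = np.random.default_rng(0)
A, B = rng.random((n, n)), rng.random((n, n))

t0 = time.perf_counter(); C_naive = naive_matmul(A, B); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); C_blas = A @ B; t_blas = time.perf_counter() - t0

assert np.allclose(C_naive, C_blas)  # same result, very different speed
print(f"naive: {t_naive * 1e3:.2f} ms, BLAS: {t_blas * 1e3:.2f} ms")
```

Even at this size the pure-Python loop is typically orders of magnitude slower, which is why a missing MKL/OpenBLAS can easily explain a DaCe binary losing to NumPy.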

@alexnick83
Contributor

alexnick83 commented Oct 31, 2022

I just ran the modified Cholesky test you posted on my main machine (i7-7700). This is what I got for CPU with different parameters:

automatic_simplification=False

accelerator  cpu  time is  2.161860466003418 ms
numpy time is  202.2195816040039 ms

automatic_simplification=False, OMP_NUM_THREADS=4

accelerator  cpu  time is  2.292203903198242 ms
numpy time is  137.15152740478516 ms

automatic_simplification=True

accelerator  cpu  time is  1.9000768661499023 ms
numpy time is  134.70430374145508 ms

automatic_simplification=True, OMP_NUM_THREADS=4

accelerator  cpu  time is  1.9898653030395508 ms
numpy time is  136.15012168884277 ms
