
DE1-SoC Performance Different from Example Given! #143

Open
mingyi136 opened this issue Jul 2, 2020 · 8 comments


mingyi136 commented Jul 2, 2020

Hi @doonny, I have run inference on the DE1-SoC board with VEC_SIZE=8 and LANE_NUM=8 (other parameters unchanged).

However, the total kernel runtime is 236.344 ms instead of the 149.988 ms given in the example. Are there any additional changes in parameters or code compared to the old version?
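For reference, both knobs are compile-time macros in PipeCNN's hardware parameter header. A minimal sketch, assuming the device/hw_param.cl layout of the version I used; the exact file name and surrounding macros may differ between revisions:

    // Hypothetical excerpt of PipeCNN's hardware parameter header
    // (device/hw_param.cl in the version I checked); only the two
    // knobs discussed in this issue are shown.

    // Data vectorization width: input/weight pairs consumed per MAC cycle.
    #define VEC_SIZE  8

    // Number of parallel convolution lanes (output feature maps computed
    // concurrently); hardware parallelism scales with VEC_SIZE * LANE_NUM.
    #define LANE_NUM  8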

Here is my inference result:

***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs
***************************************************

Platform: Intel(R) FPGA SDK for OpenCL(TM)
Totally 1 device(s) are found
  Using Device 0: de1soc_sharedonly : Cyclone V SoC Development Kit
Device OpenCL Version: OpenCL 1.0 Intel(R) FPGA SDK for OpenCL(TM), Version 16.1
Device Max Compute Units: 1
Device Max WorkGroup Size: 2147483647
Device Max WorkItem Size: 2147483647
Device Global Memory Size: 512 MBytes
Device Local Memory Size: 16 KBytes
Device Max Clock Freq: 1000 Mhz

Loading kernel/binary from file conv.aocx
Reprogramming device [0] with handle 1

61063552 total weights read
1024 total output reference read


154587 bytes image data read from binary files

Executing Layer 1:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 55, 55, 96)

Launching single work-item kernel Pool

Launching kernel lrn with local size: 1, 1, 12  (global size: 27, 27, 12)

Executing Layer 2:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 27, 27, 256)

Launching single work-item kernel Pool

Launching kernel lrn with local size: 1, 1, 32  (global size: 13, 13, 32)

Executing Layer 3:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 13, 13, 384)

Executing Layer 4:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 13, 13, 384)

Executing Layer 5:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 13, 13, 256)

Launching single work-item kernel Pool

Executing Layer 6:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 1, 1, 4096)

Executing Layer 7:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 1, 1, 4096)

Executing Layer 8:

Launching single work-item kernel winbuffer

Launching single work-item kernel Conv

Launching kernel MemWr with local size: 1, 1, 8  (global size: 1, 1, 1024)

Copyed all batched results from fc_1 buffers.
Selected item = 0 from the combined batch results in fc buffers

Start verifying results ...

Check Pass !!!

The inference result is n02123045 tabby, tabby cat   (the prob is 56.00)


PipeCNN exited !!!


-------------------

Performance Summary

Kernel runtime summary:
  Layer-1:
    MemRd: 70.738 ms
    Conv : 70.578 ms
    Pool : 70.346 ms
    MemWr: 70.486 ms
    Lrn  : 1.383 ms
  Layer-2:
    MemRd: 56.435 ms
    Conv : 56.304 ms
    Pool : 56.106 ms
    MemWr: 56.241 ms
    Lrn  : 0.456 ms
  Layer-3:
    MemRd: 39.022 ms
    Conv : 38.899 ms
    Pool : 0.000 ms
    MemWr: 38.827 ms
    Lrn  : 0.000 ms
  Layer-4:
    MemRd: 28.978 ms
    Conv : 28.854 ms
    Pool : 0.000 ms
    MemWr: 28.788 ms
    Lrn  : 0.000 ms
  Layer-5:
    MemRd: 19.408 ms
    Conv : 19.272 ms
    Pool : 19.081 ms
    MemWr: 19.209 ms
    Lrn  : 0.000 ms
  Layer-6:
    MemRd: 14.490 ms
    Conv : 14.371 ms
    Pool : 0.000 ms
    MemWr: 14.262 ms
    Lrn  : 0.000 ms
  Layer-7:
    MemRd: 6.538 ms
    Conv : 6.423 ms
    Pool : 0.000 ms
    MemWr: 6.344 ms
    Lrn  : 0.000 ms
  Layer-8:
    MemRd: 1.758 ms
    Conv : 1.642 ms
    Pool : 0.000 ms
    MemWr: 1.562 ms
    Lrn  : 0.000 ms

Total kernel runtime 236.344 ms
Batch size = 1, average process time per batch: 236.344 ms

Total runtime: 0.241783s
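For context, per-kernel times like those in the summary are normally collected with OpenCL event profiling. Below is a minimal, self-contained sketch of that mechanism, not PipeCNN's actual host code: the queue is assumed to have been created with CL_QUEUE_PROFILING_ENABLE, and kernel_time_ms, global, and local are placeholder names.

    #include <CL/cl.h>

    /* Minimal sketch: time one kernel with OpenCL event profiling.
     * Assumes 'queue' was created with CL_QUEUE_PROFILING_ENABLE and that
     * 'kernel', 'global' and 'local' are already configured (placeholders). */
    static double kernel_time_ms(cl_command_queue queue, cl_kernel kernel,
                                 const size_t *global, const size_t *local)
    {
        cl_event ev;
        cl_ulong t0 = 0, t1 = 0;

        clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global, local,
                               0, NULL, &ev);
        clWaitForEvents(1, &ev);

        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(t1), &t1, NULL);
        clReleaseEvent(ev);

        return (double)(t1 - t0) * 1e-6;   /* counters are in nanoseconds */
    }

Note also that within each layer the MemRd/Conv/Pool/MemWr times above are nearly identical; as far as I understand, this is expected, since PipeCNN's kernels are connected by channels and run concurrently, so a layer costs roughly the time of its slowest kernel rather than the sum of all four.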

mingyi136 (Author) commented:

Referring to this issue: #46 (comment)

The total kernel runtime of 149.988 ms on the DE1-SoC board seems to have been achieved with VEC_SIZE=8 and LANE_NUM=8 as well.

mingyi136 (Author) commented:

@doonny I have regenerated both run.exe and conv.aocx using the 2018 version of PipeCNN from GitHub (VEC_SIZE=8 and LANE_NUM=8).

This time I managed to get a total kernel runtime of 157.928 ms. Just wondering, why does the latest version of PipeCNN give slower inference performance on the DE1-SoC board?

doonny (Owner) commented Jul 3, 2020

May I ask which version of the SDK you are using for compilation?

sergio14890 commented:

@mingyi136 Where did you download the BSP for the DE1-SoC board?

mingyi136 (Author) commented:

> May I ask which version of the SDK you are using for compilation?

@doonny I compiled conv.aocx using OpenCL SDK 17.1 on Windows, whereas run.exe was compiled using OpenCL SDK 16.1 on the DE1-SoC board (Linux).

sergio14890 commented:

> May I ask which version of the SDK you are using for compilation?
>
> @doonny I compiled conv.aocx using OpenCL SDK 17.1 on Windows, whereas run.exe was compiled using OpenCL SDK 16.1 on the DE1-SoC board (Linux).

Ah, OK. I tried compiling using OpenCL SDK 17.1, but I got this error:
https://github.com/doonny/PipeCNN/issues/135

Do you have a license for the 16.1 SDK?

mingyi136 (Author) commented Jul 3, 2020

@sergio14890, I downloaded the Linux SD card image (which includes OpenCL 16.1) from here:
https://software.intel.com/content/www/us/en/develop/topics/fpga-academic/learn/tutorials.html
and used it to compile run.exe.

conv.aocx, on the other hand, was compiled using OpenCL 17.1, pointing to the DE1-SoC BSP (OpenCL 16.0), which is available here:
https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=836&PartNo=4

doonny (Owner) commented Jul 5, 2020

The latest code is optimized for SDK v19.1 and uses some features that are not supported by older versions such as v16.1 and v17.1. We suggest upgrading the SDK to v19.1.
