FPGA: add state_machine code sample (WIP)#2
Conversation
whitepau
left a comment
There was a problem hiding this comment.
great start!
Some high-level remarks:
- I think this sample belongs in 'Design Patterns' rather than 'Features'.
- there's a comment in-line asking for some figures, please don't forget :)
- I noticed when simulating that the effective II is still not 1, please speak with Johnny (bowen.xue@intel.com) about this.
| - **Reduced Initiation Interval:** The feature isolates unnecessary dependencies between states, which minimizes the initiation interval and maximizes the FPGA's computational efficiency. This enables the compiler to hide the high II from loading the coefficients in the `State::LD_COEFF` state and hence achieves a better overall II for this design. | ||
|
|
||
|  | ||
|
|
There was a problem hiding this comment.
I'd like you to also include some simulation waveform screenshots to drive home what the impacts look like. Particularly, you should be able to show that the Optimized kernel is able to access its DataIn and DataOut stream every clock cycle, while the Naive kernel is not.
There was a problem hiding this comment.
I tried simulating it, and this is not what we are looking for; i expect there to be no gap between the successive pipe reads. Please speak with Johnny (bowen.xue@intel.com) about how to resolve this, he claimed to be able to get II=1.
From the HSD ES Case https://hsdes.intel.com/appstore/article/#/18033853137
I did encounter a failure in a different pass when compiling the design (LowerPipes). It was complaining about the datatype of StreamingBeat and expected a struct, so I had to do:
struct fake_float { float a; }; using MyStreamingBeat = sycl::ext::intel::experimental::StreamingBeat<fake_float, true, false>;I will be opening a case about this to the memory team.
| | Hardware | Intel® Agilex® 7, Arria® 10, and Stratix® 10 FPGAs | ||
| | Software | Intel® oneAPI DPC++/C++ Compiler | ||
| | What you will learn | Best practices for creating and managing a oneAPI FPGA project | ||
| | Time to complete | 10 minutes |
There was a problem hiding this comment.
it took me a lot longer, lol
|
|
||
| > **Note**: In oneAPI full systems, kernels that use SYCL Unified Shared Memory (USM) host allocations or USM shared allocations (and therefore the code in this tutorial) are only supported by Board Support Packages (BSPs) with USM support. Kernels that use these types of allocations can always be used to generate standalone IPs. | ||
|
|
||
| ## Key Implementation Details |
There was a problem hiding this comment.
Do we want to separates naive and optimized design into different sources files, similar to some of the other samples like task_sequence and hls_interfaces?
There was a problem hiding this comment.
that's a good idea because it lets people diff and compare the code before/after the optimiziation.
If you make this change, you should make two regtests: one to test 'naive' and one to test 'optimized'. This lets you copy existing regtests and avoids debugging extra control flows in the regtest itself.
|
|
||
| This code sample demonstrates two different implementations of a state machine using SYCL High-Level Synthesis (HLS) on Intel FPGAs: the naive version and the optimized version. We will compare and analyze the Quality of Result (QoR) differences between them. | ||
|
|
||
| ### Naive Implementation |
There was a problem hiding this comment.
Since the design is a state machine, a simple chart here would be nice
| - **Inefficient State Management:** State transitions and data processing are tightly coupled, leading to increased latency and reduced efficiency in state management. | ||
| - **Dependency Bottlenecks:** Each state depends linearly on the completion of the previous state, creating bottlenecks and increasing the total execution time. | ||
|
|
||
|  |
There was a problem hiding this comment.
There are a lot of information in this screenshot, could use some explanation
| state_machine.fpga_sim.exe | ||
| set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA= | ||
| ``` | ||
| 3. Alternatively, run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=<board-support-package>:<board-variant>`). |
There was a problem hiding this comment.
I was told to remove instruction for running full acceleration on windows because we no longer supports it
| State my_state = | ||
| (init_coeff_before_starting) ? State::LD_COEFF : State::PROCESS; | ||
| float coeff = 1.0f; | ||
|
|
There was a problem hiding this comment.
In the optimized design, this loop has [[intel::initiation_interval(1)]], I think we can add a comment here explain this loop cannot achieve II=1.
| ### Improvements brought by `task_sequence` | ||
|
|
||
| Utilizing `task_sequence` in the optimized implementation offers significant enhancements: | ||
| - **Enhanced Parallelism:** By decoupling computational dependencies, `task_sequence` allows for more parallel operations, improving overall execution speed and throughput. |
There was a problem hiding this comment.
Can you explain what operations won't executing in parallel in naive design but were executing in parallel in optimized design?
|
|
||
| Utilizing `task_sequence` in the optimized implementation offers significant enhancements: | ||
| - **Enhanced Parallelism:** By decoupling computational dependencies, `task_sequence` allows for more parallel operations, improving overall execution speed and throughput. | ||
| - **Reduced Initiation Interval:** The feature isolates unnecessary dependencies between states, which minimizes the initiation interval and maximizes the FPGA's computational efficiency. This enables the compiler to hide the high II from loading the coefficients in the `State::LD_COEFF` state and hence achieves a better overall II for this design. |
There was a problem hiding this comment.
Should also explain that it is beneficial to "hide" the high II state from the compiler when the high II state is invoked infrequently compare to the process state
|
|
||
| > **Note**: In oneAPI full systems, kernels that use SYCL Unified Shared Memory (USM) host allocations or USM shared allocations (and therefore the code in this tutorial) are only supported by Board Support Packages (BSPs) with USM support. Kernels that use these types of allocations can always be used to generate standalone IPs. | ||
|
|
||
| ## Key Implementation Details |
There was a problem hiding this comment.
that's a good idea because it lets people diff and compare the code before/after the optimiziation.
If you make this change, you should make two regtests: one to test 'naive' and one to test 'optimized'. This lets you copy existing regtests and avoids debugging extra control flows in the regtest itself.
| float Compute(float coeff, float data) { | ||
| return coeff * data; | ||
| } | ||
|
|
There was a problem hiding this comment.
| float Compute(float coeff, float data) { | |
| return coeff * data; | |
| } | |
| template<typename StreamIn, typename StreamOut> | |
| float Process(float coeff) { | |
| MyStreamingBeat beat = StreamIn::read(); | |
| // multiplication of input data with a coefficient can occur with II=1 | |
| beat.data = beat.data * coeff; | |
| StreamOut::write(beat); | |
| } |
If we move all the processing from the state machine into the task function, then we can remove the get() call. Does this allow II=1 during simulation?
| MyStreamingBeat beat = StreamIn_OptimizedStateMachine::read(); | ||
| // use task_sequence to hide long II from compiler | ||
| compute_task.async(coeff, beat.data); | ||
| beat.data = compute_task.get(); | ||
| StreamOut_OptimizedStateMachine::write(beat); |
There was a problem hiding this comment.
| MyStreamingBeat beat = StreamIn_OptimizedStateMachine::read(); | |
| // use task_sequence to hide long II from compiler | |
| compute_task.async(coeff, beat.data); | |
| beat.data = compute_task.get(); | |
| StreamOut_OptimizedStateMachine::write(beat); | |
| // This state should achieve II=1 | |
| compute_task.async(coeff); |
If we move all the processing from the state machine into the task function, then we can remove the get() call. Does this allow II=1 during simulation?
There was a problem hiding this comment.
Paul's suggestion should theoretically improve the II since you will not have the async-to-get dependence that will create the problem with capacity balancing.
There was a problem hiding this comment.
Understood. Thanks for pointing that out. However, Johnny suggested the same and we actually worked on several workarounds. Some of the ways went too hacky and introduced too much extra code just for this simple state machine. The II issue persisted regardless, which is very sad.
| // function for task_sequence | ||
| float Compute(float coeff, float data) { | ||
| return coeff * data; | ||
| } |
There was a problem hiding this comment.
| // function for task_sequence | |
| float Compute(float coeff, float data) { | |
| return coeff * data; | |
| } | |
| // function for task_sequence | |
| template<typename PipeIn, typename PipeOut> | |
| float Compute(float coeff) { | |
| MyStreamingBeat beat = PipeIn::read(); | |
| beat.data = beat.data * coeff; | |
| PipeOut::write(beat); | |
| } |
Can we get II=1 by moving the pipe interactions to the task_sequence?

Adding a New Sample(s)
Description
This pull request introduces a new code sample for implementing a state machine in SYCL HLS targeting Intel FPGAs. The sample showcases two different implementations: a naive version and an optimized (proper) version using task_sequence. The optimized implementation demonstrates how to use this feature for improved control of parallelism and finer granularity in dependency management, resulting in reduced initiation intervals and better performance.
Checklist
Administrative
Code Development
Security and Legal
Review