FPGA: add state_machine code sample (WIP) by haoyanwa · Pull Request #2 · haoyanwa/oneAPI-samples

haoyanwa · 2024-04-11T22:57:34Z

Adding a New Sample(s)

Description

This pull request introduces a new code sample for implementing a state machine in SYCL HLS targeting Intel FPGAs. The sample showcases two different implementations: a naive version and an optimized (proper) version using task_sequence. The optimized implementation demonstrates how to use this feature for improved control of parallelism and finer granularity in dependency management, resulting in reduced initiation intervals and better performance.

Checklist

Administrative

Review sample design with the appropriate Domain Expert:
If you have any new dependencies/binaries, inform the oneAPI Code Samples Project Manager

Code Development

Implement coding guidelines and ensure code quality. See the wiki for details.
Adhere to readme template
Enforce format via clang-format config file
Adhere to sample.json specification. https://github.com/oneapi-src/oneAPI-samples/wiki/sample-json-specification
Ensure/create CI test configurations for sample (ciTests field) https://github.com/oneapi-src/oneAPI-samples/wiki/sample-json-ci-test-object
Run jsonlint on sample.json to verify json syntax. www.jsonlint.com

Security and Legal

OSPDT Approval (see Project Manager for assistance)
Compile using the following compiler flags and fix any warnings. The flags are: "/Wall -Wformat-security -Werror=format-security"
Bandit Scans (Python only)
Virus scan

Review

Review DPC++ code with Paul Petersen. (GitHub User: pmpeter1)
Review readme with Tom Lenth(@tomlenth) and/or Project Manager
Tested using Dev Cloud when applicable

whitepau

great start!

Some high-level remarks:

I think this sample belongs in 'Design Patterns' rather than 'Features'.
There's an in-line comment asking for some figures; please don't forget :)
I noticed when simulating that the effective II is still not 1. Please speak with Johnny (bowen.xue@intel.com) about this.

whitepau · 2024-04-12T16:18:03Z

+- **Reduced Initiation Interval:** The feature isolates unnecessary dependencies between states, which minimizes the initiation interval and maximizes the FPGA's computational efficiency. This enables the compiler to hide the high II from loading the coefficients in the `State::LD_COEFF` state and hence achieves a better overall II for this design.
+
+![](assets/report_screenshot.png)
+


I'd like you to also include some simulation waveform screenshots to drive home what the impacts look like. Particularly, you should be able to show that the Optimized kernel is able to access its DataIn and DataOut stream every clock cycle, while the Naive kernel is not.

I tried simulating it, and this is not what we are looking for; I expect there to be no gap between successive pipe reads. Please speak with Johnny (bowen.xue@intel.com) about how to resolve this; he claimed to be able to get II=1.

From the HSD ES Case https://hsdes.intel.com/appstore/article/#/18033853137

I did encounter a failure in a different pass when compiling the design (LowerPipes). It was complaining about the datatype of StreamingBeat and expected a struct, so I had to do:

struct fake_float { float a; };
using MyStreamingBeat = sycl::ext::intel::experimental::StreamingBeat<fake_float, true, false>;

I will be opening a case about this to the memory team.
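The shape of this workaround can be tried without the FPGA toolchain. The sketch below is plain C++, not SYCL: `MockBeat` is a hypothetical stand-in that only mimics the reported "expected a struct" constraint with a `static_assert`. It is not the real `StreamingBeat`, and the helper names `MakeBeat` and `Payload` are assumptions added for illustration.

```cpp
#include <type_traits>

// Hypothetical stand-in for a beat type whose payload must be a struct.
// It does NOT reproduce the real StreamingBeat or the LowerPipes pass.
template <typename T, bool kUsePackets, bool kUseEmpty>
struct MockBeat {
  static_assert(std::is_class<T>::value, "payload must be a struct");
  T data;
  bool sop;  // start-of-packet flag (assumed field, for illustration)
  bool eop;  // end-of-packet flag (assumed field, for illustration)
};

// Wrapper from the comment above: a struct holding a single float.
struct fake_float {
  float a;
};

using MyMockBeat = MockBeat<fake_float, true, false>;

// Hypothetical helpers that keep call sites close to plain-float code.
inline MyMockBeat MakeBeat(float value) {
  return MyMockBeat{fake_float{value}, false, false};
}

inline float Payload(const MyMockBeat &beat) { return beat.data.a; }
```

With this model, `MockBeat<float, true, false>` fails to compile while `MockBeat<fake_float, true, false>` is accepted, which mirrors the behaviour described in the case.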

KevinUTAT · 2024-04-17T20:54:56Z

+| Hardware                          | Intel® Agilex® 7, Arria® 10, and Stratix® 10 FPGAs
+| Software                          | Intel® oneAPI DPC++/C++ Compiler
+| What you will learn               | Best practices for creating and managing a oneAPI FPGA project
+| Time to complete                  | 10 minutes


it took me a lot longer, lol

KevinUTAT · 2024-04-18T18:47:24Z

+
+> **Note**: In oneAPI full systems, kernels that use SYCL Unified Shared Memory (USM) host allocations or USM shared allocations (and therefore the code in this tutorial) are only supported by Board Support Packages (BSPs) with USM support. Kernels that use these types of allocations can always be used to generate standalone IPs.
+
+## Key Implementation Details


Do we want to separate the naive and optimized designs into different source files, similar to some of the other samples like task_sequence and hls_interfaces?

That's a good idea because it lets people diff and compare the code before/after the optimization.

If you make this change, you should make two regtests: one to test 'naive' and one to test 'optimized'. This lets you copy existing regtests and avoids debugging extra control flows in the regtest itself.

KevinUTAT · 2024-04-18T19:26:28Z

+
+This code sample demonstrates two different implementations of a state machine using SYCL High-Level Synthesis (HLS) on Intel FPGAs: the naive version and the optimized version. We will compare and analyze the Quality of Result (QoR) differences between them.
+
+### Naive Implementation


Since the design is a state machine, a simple chart here would be nice

KevinUTAT · 2024-04-18T20:14:43Z

+- **Inefficient State Management:** State transitions and data processing are tightly coupled, leading to increased latency and reduced efficiency in state management.
+- **Dependency Bottlenecks:** Each state depends linearly on the completion of the previous state, creating bottlenecks and increasing the total execution time.
+
+![](assets/bottleneck.png)


There is a lot of information in this screenshot; it could use some explanation.

KevinUTAT · 2024-04-18T20:17:19Z

+   state_machine.fpga_sim.exe
+   set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=
+   ```
+3. Alternatively, run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=<board-support-package>:<board-variant>`).


I was told to remove the instructions for running full acceleration on Windows because we no longer support it.

KevinUTAT · 2024-04-18T21:16:47Z

+    State my_state = 
+        (init_coeff_before_starting) ? State::LD_COEFF : State::PROCESS;
+    float coeff = 1.0f;
+


In the optimized design, this loop has [[intel::initiation_interval(1)]]. I think we can add a comment here explaining that this loop cannot achieve II=1.

KevinUTAT · 2024-04-18T21:36:41Z

+### Improvements brought by `task_sequence`
+
+Utilizing `task_sequence` in the optimized implementation offers significant enhancements:
+- **Enhanced Parallelism:** By decoupling computational dependencies, `task_sequence` allows for more parallel operations, improving overall execution speed and throughput.


Can you explain which operations do not execute in parallel in the naive design but do execute in parallel in the optimized design?

KevinUTAT · 2024-04-18T21:39:51Z

+
+Utilizing `task_sequence` in the optimized implementation offers significant enhancements:
+- **Enhanced Parallelism:** By decoupling computational dependencies, `task_sequence` allows for more parallel operations, improving overall execution speed and throughput.
+- **Reduced Initiation Interval:** The feature isolates unnecessary dependencies between states, which minimizes the initiation interval and maximizes the FPGA's computational efficiency. This enables the compiler to hide the high II from loading the coefficients in the `State::LD_COEFF` state and hence achieves a better overall II for this design.


Should also explain that it is beneficial to "hide" the high-II state from the compiler when that state is invoked infrequently compared to the process state.
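A rough cycle count makes this trade-off concrete. The sketch below is a back-of-the-envelope model in plain C++, not output from the compiler reports; the function names and the II values are hypothetical. It assumes that a single loop containing both states issues every iteration at the worst-case II, whereas a design that isolates the slow state pays the high II only on the iterations that actually load coefficients.

```cpp
#include <cstdint>

// Single loop holding both states: every iteration pays the worst-case II.
constexpr std::uint64_t NaiveCycles(std::uint64_t process_iters,
                                    std::uint64_t load_iters,
                                    std::uint64_t process_ii,
                                    std::uint64_t load_ii) {
  return (process_iters + load_iters) *
         (process_ii > load_ii ? process_ii : load_ii);
}

// Slow state isolated from the loop: each state pays only its own II.
constexpr std::uint64_t IsolatedCycles(std::uint64_t process_iters,
                                       std::uint64_t load_iters,
                                       std::uint64_t process_ii,
                                       std::uint64_t load_ii) {
  return process_iters * process_ii + load_iters * load_ii;
}
```

For 1000 processing iterations at II=1 and 2 coefficient loads at an assumed II=8, `NaiveCycles` gives 8016 and `IsolatedCycles` gives 1016. The gap shrinks as coefficient loads become a larger share of the iterations, which is why the benefit depends on the slow state being rare.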

whitepau · 2024-04-26T09:46:45Z

+
+> **Note**: In oneAPI full systems, kernels that use SYCL Unified Shared Memory (USM) host allocations or USM shared allocations (and therefore the code in this tutorial) are only supported by Board Support Packages (BSPs) with USM support. Kernels that use these types of allocations can always be used to generate standalone IPs.
+
+## Key Implementation Details


That's a good idea because it lets people diff and compare the code before/after the optimization.

If you make this change, you should make two regtests: one to test 'naive' and one to test 'optimized'. This lets you copy existing regtests and avoids debugging extra control flows in the regtest itself.

whitepau · 2024-04-26T10:04:07Z

+float Compute(float coeff, float data) {
  return coeff * data;
 }



Suggested change

-float Compute(float coeff, float data) {
-  return coeff * data;
-}
+template<typename StreamIn, typename StreamOut>
+float Process(float coeff) {
+  MyStreamingBeat beat = StreamIn::read();
+  // multiplication of input data with a coefficient can occur with II=1
+  beat.data = beat.data * coeff;
+  StreamOut::write(beat);
+}

If we move all the processing from the state machine into the task function, then we can remove the get() call. Does this allow II=1 during simulation?

whitepau · 2024-04-26T10:06:37Z

+          MyStreamingBeat beat = StreamIn_OptimizedStateMachine::read();
+          // use task_sequence to hide long II from compiler
+          compute_task.async(coeff, beat.data);
+          beat.data = compute_task.get();
+          StreamOut_OptimizedStateMachine::write(beat);


Suggested change

-          MyStreamingBeat beat = StreamIn_OptimizedStateMachine::read();
-          // use task_sequence to hide long II from compiler
-          compute_task.async(coeff, beat.data);
-          beat.data = compute_task.get();
-          StreamOut_OptimizedStateMachine::write(beat);
+          // This state should achieve II=1
+          compute_task.async(coeff);

If we move all the processing from the state machine into the task function, then we can remove the get() call. Does this allow II=1 during simulation?

Paul's suggestion should theoretically improve the II since you will not have the async-to-get dependence that will create the problem with capacity balancing.

Understood, thanks for pointing that out. However, Johnny suggested the same thing, and we worked through several workarounds together. Some of them were too hacky and introduced too much extra code for such a simple state machine. The II issue persisted regardless, which is very sad.

whitepau · 2024-04-26T19:14:11Z

+// function for task_sequence
+float Compute(float coeff, float data) {
+  return coeff * data;
+}


Suggested change

-// function for task_sequence
-float Compute(float coeff, float data) {
-  return coeff * data;
-}
+// function for task_sequence
+template<typename PipeIn, typename PipeOut>
+float Compute(float coeff) {
+  MyStreamingBeat beat = PipeIn::read();
+  beat.data = beat.data * coeff;
+  PipeOut::write(beat);
+}

Can we get II=1 by moving the pipe interactions to the task_sequence?
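To see the shape of this suggestion outside the FPGA toolchain, here is a minimal software model in plain C++. `MockPipe`, `Beat`, and the tag types are hypothetical stand-ins backed by `std::queue`; they are not the SYCL pipe API, and the model says nothing about the II actually achieved in hardware. It only shows the task function owning both the read and the write, so it has nothing to return (note the `void` return type), which is what would let the caller drop the `get()` call.

```cpp
#include <queue>

// Hypothetical beat type with a single float payload.
struct Beat {
  float data;
};

// Hypothetical software pipe: one static FIFO per tag type.
template <typename Tag>
struct MockPipe {
  static std::queue<Beat> &Fifo() {
    static std::queue<Beat> fifo;
    return fifo;
  }
  // Assumes the FIFO is non-empty; a real pipe read would block instead.
  static Beat read() {
    Beat beat = Fifo().front();
    Fifo().pop();
    return beat;
  }
  static void write(const Beat &beat) { Fifo().push(beat); }
};

struct InTag {};
struct OutTag {};
using PipeIn = MockPipe<InTag>;
using PipeOut = MockPipe<OutTag>;

// Task function that owns the pipe accesses and returns nothing.
template <typename In, typename Out>
void Process(float coeff) {
  Beat beat = In::read();
  beat.data = beat.data * coeff;
  Out::write(beat);
}
```

A caller in this model would push a beat with `PipeIn::write(Beat{3.0f})`, run `Process<PipeIn, PipeOut>(2.0f)`, and then read the scaled beat back from `PipeOut`.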

FPGA: state_machine code sample (WIP)

26cc75a

haoyanwa requested review from KevinUTAT and whitepau April 11, 2024 22:57

haoyanwa self-assigned this Apr 11, 2024

whitepau requested changes Apr 12, 2024


Incremental update based on comments.

6cc1ef5

haoyanwa requested a review from whitepau April 15, 2024 17:14

KevinUTAT reviewed Apr 18, 2024


whitepau reviewed Apr 26, 2024



4 participants
