Skip to content

FPGA: add state_machine code sample (WIP)#2

Open
haoyanwa wants to merge 2 commits into
developmentfrom
haoyanwa.state-machine
Open

FPGA: add state_machine code sample (WIP)#2
haoyanwa wants to merge 2 commits into
developmentfrom
haoyanwa.state-machine

Conversation

@haoyanwa
Copy link
Copy Markdown
Owner

@haoyanwa haoyanwa commented Apr 11, 2024

Adding a New Sample(s)

Description

This pull request introduces a new code sample for implementing a state machine in SYCL HLS targeting Intel FPGAs. The sample showcases two different implementations: a naive version and an optimized (proper) version using task_sequence. The optimized implementation demonstrates how to use this feature for improved control of parallelism and finer granularity in dependency management, resulting in reduced initiation intervals and better performance.

Checklist

Administrative

  • Review sample design with the appropriate Domain Expert:
  • If you have any new dependencies/binaries, inform the oneAPI Code Samples Project Manager

Code Development

Security and Legal

  • OSPDT Approval (see Project Manager for assistance)
  • Compile using the following compiler flags and fix any warnings, the falgs are: "/Wall -Wformat-security -Werror=format-security"
  • Bandit Scans (Python only)
  • Virus scan

Review

  • Review DPC++ code with Paul Peterseon. (GitHub User: pmpeter1)
  • Review readme with Tom Lenth(@tomlenth) and/or Project Manager
  • Tested using Dev Cloud when applicable

@haoyanwa haoyanwa requested review from KevinUTAT and whitepau April 11, 2024 22:57
@haoyanwa haoyanwa self-assigned this Apr 11, 2024
Copy link
Copy Markdown
Collaborator

@whitepau whitepau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great start!

Some high-level remarks:

  1. I think this sample belongs in 'Design Patterns' rather than 'Features'.
  2. there's a comment in-line asking for some figures, please don't forget :)
  3. I noticed when simulating that the effective II is still not 1, please speak with Johnny (bowen.xue@intel.com) about this.

Comment thread DirectProgramming/C++SYCL_FPGA/Tutorials/Features/state_machine/CMakeLists.txt Outdated
Comment thread DirectProgramming/C++SYCL_FPGA/Tutorials/Features/state_machine/README.md Outdated
Comment thread DirectProgramming/C++SYCL_FPGA/Tutorials/Features/state_machine/README.md Outdated
Comment thread DirectProgramming/C++SYCL_FPGA/Tutorials/Features/state_machine/README.md Outdated
Comment thread DirectProgramming/C++SYCL_FPGA/Tutorials/Features/state_machine/README.md Outdated
Comment thread DirectProgramming/C++SYCL_FPGA/Tutorials/Features/state_machine/README.md Outdated
- **Reduced Initiation Interval:** The feature isolates unnecessary dependencies between states, which minimizes the initiation interval and maximizes the FPGA's computational efficiency. This enables the compiler to hide the high II from loading the coefficients in the `State::LD_COEFF` state and hence achieves a better overall II for this design.

![](assets/report_screenshot.png)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like you to also include some simulation waveform screenshots to drive home what the impacts look like. Particularly, you should be able to show that the Optimized kernel is able to access its DataIn and DataOut stream every clock cycle, while the Naive kernel is not.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

I tried simulating it, and this is not what we are looking for; i expect there to be no gap between the successive pipe reads. Please speak with Johnny (bowen.xue@intel.com) about how to resolve this, he claimed to be able to get II=1.

From the HSD ES Case https://hsdes.intel.com/appstore/article/#/18033853137

I did encounter a failure in a different pass when compiling the design (LowerPipes). It was complaining about the datatype of StreamingBeat and expected a struct, so I had to do:

struct fake_float {
 float a;
};
using MyStreamingBeat = sycl::ext::intel::experimental::StreamingBeat<fake_float, true, false>;

I will be opening a case about this to the memory team.

Comment thread DirectProgramming/C++SYCL_FPGA/Tutorials/Features/state_machine/README.md Outdated
@haoyanwa haoyanwa requested a review from whitepau April 15, 2024 17:14
| Hardware | Intel® Agilex® 7, Arria® 10, and Stratix® 10 FPGAs
| Software | Intel® oneAPI DPC++/C++ Compiler
| What you will learn | Best practices for creating and managing a oneAPI FPGA project
| Time to complete | 10 minutes
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it took me a lot longer, lol


> **Note**: In oneAPI full systems, kernels that use SYCL Unified Shared Memory (USM) host allocations or USM shared allocations (and therefore the code in this tutorial) are only supported by Board Support Packages (BSPs) with USM support. Kernels that use these types of allocations can always be used to generate standalone IPs.

## Key Implementation Details
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to separates naive and optimized design into different sources files, similar to some of the other samples like task_sequence and hls_interfaces?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good idea because it lets people diff and compare the code before/after the optimiziation.

If you make this change, you should make two regtests: one to test 'naive' and one to test 'optimized'. This lets you copy existing regtests and avoids debugging extra control flows in the regtest itself.


This code sample demonstrates two different implementations of a state machine using SYCL High-Level Synthesis (HLS) on Intel FPGAs: the naive version and the optimized version. We will compare and analyze the Quality of Result (QoR) differences between them.

### Naive Implementation
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the design is a state machine, a simple chart here would be nice

- **Inefficient State Management:** State transitions and data processing are tightly coupled, leading to increased latency and reduced efficiency in state management.
- **Dependency Bottlenecks:** Each state depends linearly on the completion of the previous state, creating bottlenecks and increasing the total execution time.

![](assets/bottleneck.png)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of information in this screenshot, could use some explanation

state_machine.fpga_sim.exe
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=
```
3. Alternatively, run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=<board-support-package>:<board-variant>`).
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was told to remove instruction for running full acceleration on windows because we no longer supports it

State my_state =
(init_coeff_before_starting) ? State::LD_COEFF : State::PROCESS;
float coeff = 1.0f;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the optimized design, this loop has [[intel::initiation_interval(1)]], I think we can add a comment here explain this loop cannot achieve II=1.

### Improvements brought by `task_sequence`

Utilizing `task_sequence` in the optimized implementation offers significant enhancements:
- **Enhanced Parallelism:** By decoupling computational dependencies, `task_sequence` allows for more parallel operations, improving overall execution speed and throughput.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what operations won't executing in parallel in naive design but were executing in parallel in optimized design?


Utilizing `task_sequence` in the optimized implementation offers significant enhancements:
- **Enhanced Parallelism:** By decoupling computational dependencies, `task_sequence` allows for more parallel operations, improving overall execution speed and throughput.
- **Reduced Initiation Interval:** The feature isolates unnecessary dependencies between states, which minimizes the initiation interval and maximizes the FPGA's computational efficiency. This enables the compiler to hide the high II from loading the coefficients in the `State::LD_COEFF` state and hence achieves a better overall II for this design.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also explain that it is beneficial to "hide" the high II state from the compiler when the high II state is invoked infrequently compare to the process state


> **Note**: In oneAPI full systems, kernels that use SYCL Unified Shared Memory (USM) host allocations or USM shared allocations (and therefore the code in this tutorial) are only supported by Board Support Packages (BSPs) with USM support. Kernels that use these types of allocations can always be used to generate standalone IPs.

## Key Implementation Details
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good idea because it lets people diff and compare the code before/after the optimiziation.

If you make this change, you should make two regtests: one to test 'naive' and one to test 'optimized'. This lets you copy existing regtests and avoids debugging extra control flows in the regtest itself.

Comment on lines +56 to 59
float Compute(float coeff, float data) {
return coeff * data;
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
float Compute(float coeff, float data) {
return coeff * data;
}
template<typename StreamIn, typename StreamOut>
float Process(float coeff) {
MyStreamingBeat beat = StreamIn::read();
// multiplication of input data with a coefficient can occur with II=1
beat.data = beat.data * coeff;
StreamOut::write(beat);
}

If we move all the processing from the state machine into the task function, then we can remove the get() call. Does this allow II=1 during simulation?

Comment on lines +112 to +116
MyStreamingBeat beat = StreamIn_OptimizedStateMachine::read();
// use task_sequence to hide long II from compiler
compute_task.async(coeff, beat.data);
beat.data = compute_task.get();
StreamOut_OptimizedStateMachine::write(beat);
Copy link
Copy Markdown
Collaborator

@whitepau whitepau Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MyStreamingBeat beat = StreamIn_OptimizedStateMachine::read();
// use task_sequence to hide long II from compiler
compute_task.async(coeff, beat.data);
beat.data = compute_task.get();
StreamOut_OptimizedStateMachine::write(beat);
// This state should achieve II=1
compute_task.async(coeff);

If we move all the processing from the state machine into the task function, then we can remove the get() call. Does this allow II=1 during simulation?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paul's suggestion should theoretically improve the II since you will not have the async-to-get dependence that will create the problem with capacity balancing.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Thanks for pointing that out. However, Johnny suggested the same and we actually worked on several workarounds. Some of the ways went too hacky and introduced too much extra code just for this simple state machine. The II issue persisted regardless, which is very sad.

Comment on lines +55 to +58
// function for task_sequence
float Compute(float coeff, float data) {
return coeff * data;
}
Copy link
Copy Markdown
Collaborator

@whitepau whitepau Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// function for task_sequence
float Compute(float coeff, float data) {
return coeff * data;
}
// function for task_sequence
template<typename PipeIn, typename PipeOut>
float Compute(float coeff) {
MyStreamingBeat beat = PipeIn::read();
beat.data = beat.data * coeff;
PipeOut::write(beat);
}

Can we get II=1 by moving the pipe interactions to the task_sequence?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants