# iris.bench - Unified Benchmarking Harness

A standardized benchmarking infrastructure for Iris using a decorator-based approach.

## Quick Start

```python
from iris.bench import benchmark

@benchmark(name="my_kernel", warmup=5, iters=50)
def run_benchmark(shmem, size=1024):
    # shmem is automatically created by the decorator

    @setup
    def allocate():
        buffer = shmem.zeros(size, size)
        return buffer

    @measure
    def kernel_launch(buffer):
        my_kernel[grid](buffer)

result = run_benchmark(size=2048)
result.print_summary()
```

## Key Features

- ✅ **Automatic iris instance creation** - The decorator creates and manages the iris instance
- ✅ **Code annotation** - Use @setup, @preamble, and @measure to organize your code
- ✅ **Rich statistics** - mean, median, p50, p99, min, max automatically computed
- ✅ **Automatic barrier synchronization** - Built-in multi-GPU support
- ✅ **JSON export** - Structured results for CI/CD integration
- ✅ **Utility functions** - `torch_dtype_from_str`, `compute_bandwidth_gbps`

## Code Annotations

The benchmarking decorator uses three function annotations:

### @setup
Runs **once** before any timing starts. Use for:
- Tensor allocation
- Initial data setup
- One-time configuration

Return values are passed to the @preamble and @measure functions.

### @preamble
Runs **before each timed iteration**. Use for:
- Resetting output buffers
- Clearing flags/state
- Per-iteration setup

Receives the values returned by @setup.

### @measure
The code that is **actually timed**. Use for:
- Kernel launches
- The operation you want to benchmark

Receives the values returned by @setup.
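The ordering of the three phases can be sketched as a plain timing loop. This is only an illustration of how `@setup`, `@preamble`, and `@measure` relate, not the actual `iris.bench` implementation; the function and variable names here are hypothetical:

```python
import time

def run_phases(setup_fn, preamble_fn, measure_fn, iters=3):
    state = setup_fn()                  # @setup: runs once
    times_ms = []
    for _ in range(iters):
        preamble_fn(state)              # @preamble: untimed, before every iteration
        start = time.perf_counter()
        measure_fn(state)               # @measure: the only timed region
        times_ms.append((time.perf_counter() - start) * 1e3)
    return times_ms

# Toy stand-ins for "allocate", "reset output", and "launch kernel".
times = run_phases(
    setup_fn=lambda: {"out": 0},
    preamble_fn=lambda s: s.update(out=0),
    measure_fn=lambda s: s.update(out=s["out"] + 1),
)
```

Note that only the `@measure` region contributes to the recorded times; per-iteration resets in `@preamble` stay outside the timed window.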

## Full Example

```python
import torch
from iris.bench import benchmark

@benchmark(name="gemm", warmup=5, iters=50, heap_size=1 << 33)
def run_gemm(shmem, m=8192, n=4608, k=36864):

    @setup
    def allocate_matrices():
        # Runs once - allocate tensors
        A = shmem.randn(m, k, dtype=torch.float16)
        B = shmem.randn(k, n, dtype=torch.float16)
        C = shmem.zeros(m, n, dtype=torch.float16)
        return A, B, C

    @preamble
    def reset_output(A, B, C):
        # Runs before each iteration - clear output
        C.zero_()

    @measure
    def compute(A, B, C):
        # This gets timed - run kernel
        gemm_kernel[grid](A, B, C, m, n, k)

result = run_gemm(m=8192, n=4608, k=36864)
result.print_summary()

with open("results.json", "w") as f:
    f.write(result.to_json())  # export to JSON
```

## Documentation

- 📖 [Full API Documentation](bench_harness.md)
- 📖 [Migration Guide](bench_migration_example.md)
- 💻 [Complete Examples](../examples/benchmark/bench_harness_example.py)

## Testing

```bash
# Run basic tests (no GPU required)
python3 tests/unittests/test_bench_basic.py

# Run full test suite (requires GPU)
pytest tests/unittests/test_bench.py
```

## API Overview

### @benchmark decorator
Main decorator for benchmarking with automatic iris instance management.

**Parameters:**
- `name` - Benchmark name
- `warmup` - Number of warmup iterations (default: 25)
- `iters` - Number of timing iterations (default: 100)
- `heap_size` - Iris heap size (default: 1<<33)
- `auto_print` - Auto-print results (default: False)

### BenchmarkResult
Stores benchmark results with automatic statistics.

**Methods:**
- `print_summary()` - Human-readable output
- `to_dict()` - Convert to dictionary
- `to_json()` - Convert to JSON string

### Utilities
- `torch_dtype_from_str(dtype_str)` - Convert string to torch.dtype
- `compute_bandwidth_gbps(bytes, time_ms)` - Calculate bandwidth

## License

MIT License - Copyright (c) 2025-2026 Advanced Micro Devices, Inc.
# Benchmarking Harness (iris.bench)

The `iris.bench` module provides a unified, decorator-based infrastructure for benchmarking Iris operations.

## Overview

The benchmarking harness eliminates code duplication by providing:

- **Automatic iris instance management**: The decorator creates and manages the iris instance
- **Code organization**: Use @setup, @preamble, @measure annotations
- **Automatic statistics**: mean, median, p50, p99, min, max
- **Barrier synchronization**: Built-in multi-GPU support
- **Structured output**: JSON export for CI/CD

## Quick Start

```python
from iris.bench import benchmark

@benchmark(name="my_kernel", warmup=5, iters=50)
def run_benchmark(shmem, size=1024):
    # shmem is automatically created by the decorator

    @setup
    def allocate():
        buffer = shmem.zeros(size, size)
        return buffer

    @measure
    def kernel_launch(buffer):
        my_kernel[grid](buffer)

result = run_benchmark(size=2048)
result.print_summary()
```

## API Reference

### @benchmark Decorator

Main decorator for benchmarking with automatic iris instance management.

```python
@benchmark(
    name: str,
    warmup: int = 25,
    iters: int = 100,
    heap_size: int = 1 << 33,
    auto_print: bool = False,
)
```

**Parameters:**
- `name` - Benchmark name
- `warmup` - Number of warmup iterations (default: 25)
- `iters` - Number of timing iterations (default: 100)
- `heap_size` - Iris symmetric heap size (default: 1<<33)
- `auto_print` - Automatically print results (default: False)

**Returns:** BenchmarkResult

### Code Annotations

Within your benchmark function, use these decorators to organize code:

#### @setup
Runs **once** before any timing starts.

**Use for:**
- Tensor allocation
- Initial data setup
- One-time configuration

**Returns:** Values passed to @preamble and @measure

#### @preamble
Runs **before each timed iteration**.

**Use for:**
- Resetting output buffers
- Clearing flags/state
- Per-iteration setup

**Parameters:** Receives values from @setup

#### @measure (Required)
The code that gets **timed**.

**Use for:**
- Kernel launches
- The operation you want to benchmark

**Parameters:** Receives values from @setup

### BenchmarkResult

Dataclass storing benchmark results.

**Attributes:**
- `name: str` - Benchmark name
- `mean_ms: float` - Mean time in milliseconds
- `median_ms: float` - Median time
- `p50_ms: float` - 50th percentile
- `p99_ms: float` - 99th percentile
- `min_ms: float` - Minimum time
- `max_ms: float` - Maximum time
- `n_warmup: int` - Number of warmup iterations
- `n_repeat: int` - Number of timing iterations
- `params: Dict` - Benchmark parameters
- `raw_times: List[float]` - Raw timing measurements

**Methods:**
- `to_dict(include_raw_times=False)` - Convert to dictionary
- `to_json(include_raw_times=False, indent=2)` - Convert to JSON
- `print_summary()` - Print formatted summary
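All of the statistics can be derived from `raw_times` alone. The following dataclass is a minimal sketch of how such a result type might compute them (hypothetical names, nearest-rank percentiles; the real `iris.bench` implementation may differ):

```python
import statistics
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ResultSketch:
    name: str
    raw_times: List[float]                    # per-iteration times, milliseconds
    params: Dict = field(default_factory=dict)

    @property
    def mean_ms(self) -> float:
        return statistics.mean(self.raw_times)

    @property
    def median_ms(self) -> float:
        return statistics.median(self.raw_times)

    def percentile_ms(self, p: float) -> float:
        # nearest-rank percentile over the sorted samples
        s = sorted(self.raw_times)
        idx = min(len(s) - 1, int(round(p / 100 * (len(s) - 1))))
        return s[idx]

r = ResultSketch("demo", raw_times=[1.0, 2.0, 3.0, 10.0])
```

With a skewed sample like this, `mean_ms` (4.0) sits well above `median_ms` (2.5), which is why the harness reports both alongside p99.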

### Utility Functions

#### torch_dtype_from_str

```python
dtype = torch_dtype_from_str("fp16") # -> torch.float16
```

Supported: `"int8"`, `"fp16"`, `"bf16"`, `"fp32"`

#### compute_bandwidth_gbps

```python
bandwidth = compute_bandwidth_gbps(total_bytes, time_ms)
```

Computes bandwidth in GiB/s.
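The arithmetic behind this is simple: bytes divided by elapsed seconds, expressed in GiB (2^30 bytes). A sketch, assuming the function follows that binary-prefix convention:

```python
def compute_bandwidth_gbps_sketch(total_bytes: float, time_ms: float) -> float:
    # bytes / seconds, expressed in GiB/s (1 GiB = 2**30 bytes)
    return total_bytes / (time_ms / 1e3) / 2**30

# Moving 1 GiB in 500 ms corresponds to 2 GiB/s.
bw = compute_bandwidth_gbps_sketch(2**30, 500.0)
```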

## Examples

### Example 1: Simple Benchmark

```python
from iris.bench import benchmark

@benchmark(name="vector_add", warmup=5, iters=50)
def bench_add(shmem, size=1024):

    @setup
    def allocate():
        a = shmem.randn(size)
        b = shmem.randn(size)
        c = shmem.zeros(size)
        return a, b, c

    @measure
    def compute(a, b, c):
        c.copy_(a + b)

result = bench_add(size=1024)
result.print_summary()
```

### Example 2: With Preamble

```python
import torch

@benchmark(name="gemm", warmup=5, iters=50, heap_size=1 << 33)
def bench_gemm(shmem, m=8192, n=4608, k=36864):

    @setup
    def allocate():
        A = shmem.randn(m, k, dtype=torch.float16)
        B = shmem.randn(k, n, dtype=torch.float16)
        C = shmem.zeros(m, n, dtype=torch.float16)
        return A, B, C

    @preamble
    def reset(A, B, C):
        C.zero_()

    @measure
    def compute(A, B, C):
        gemm_kernel[grid](A, B, C, m, n, k)

result = bench_gemm()
```

### Example 3: Bandwidth Calculation

```python
import torch
from iris.bench import benchmark, compute_bandwidth_gbps

size = 1024 * 1024 * 256  # number of float16 elements

@benchmark(name="copy", warmup=5, iters=50)
def bench_copy(shmem, size=size):

    @setup
    def allocate():
        src = shmem.randn(size, dtype=torch.float16)
        dst = shmem.zeros(size, dtype=torch.float16)
        return src, dst

    @measure
    def copy(src, dst):
        dst.copy_(src)

result = bench_copy()

# Compute bandwidth
element_size = 2  # bytes per float16
total_bytes = size * element_size
bandwidth = compute_bandwidth_gbps(total_bytes, result.mean_ms)
print(f"Bandwidth: {bandwidth:.2f} GiB/s")
```

### Example 4: JSON Export

```python
result = bench_gemm(m=8192, n=4608, k=36864)

# Export to JSON
with open("results.json", "w") as f:
    f.write(result.to_json(include_raw_times=True))

# Or use to_dict for custom processing
data = result.to_dict()
print(f"Mean: {data['mean_ms']:.2f} ms")
```

## Integration

The harness uses `iris.do_bench` internally for timing, ensuring consistency with existing code. The @benchmark decorator:
- Creates the iris instance
- Manages barrier synchronization automatically
- Handles warmup and iteration loops
- Computes statistics automatically
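The warmup/measure split the decorator manages can be sketched as follows. This is a simplified stand-in (hypothetical names, `time.perf_counter` timing, no barriers) for what `iris.do_bench` does internally, not the actual implementation:

```python
import time

def bench_loop(measure_fn, preamble_fn=None, warmup=25, iters=100):
    # Warmup iterations: run the measured region but discard the timings.
    for _ in range(warmup):
        if preamble_fn:
            preamble_fn()
        measure_fn()
    # Timed iterations: only the measure_fn call contributes to the samples.
    samples_ms = []
    for _ in range(iters):
        if preamble_fn:
            preamble_fn()
        t0 = time.perf_counter()
        measure_fn()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    return samples_ms

# The measured callable runs warmup + iters times; only iters samples are kept.
calls = {"n": 0}
samples = bench_loop(lambda: calls.update(n=calls["n"] + 1), warmup=2, iters=5)
```

In the real harness, a barrier before each timed region keeps multi-GPU ranks in step so that no rank's measurement includes another rank's stragglers.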

## Notes

- The `shmem` parameter is automatically injected by the decorator
- `@setup`, `@preamble`, and `@measure` are injected at runtime
- At least one `@measure` decorated function is required
- `@setup` and `@preamble` are optional

## See Also

- [Quick Start Guide](README_bench.md)
- [Migration Examples](bench_migration_example.md)
- [Working Examples](../examples/benchmark/bench_harness_example.py)