Add scale_i32_bf16 operator #99

@albiol2004

Problem

INT8 GEMM outputs int32 accumulators that must be converted back to bf16 with a scale factor. The existing dequant_i32_bf16 operator (#96) uses per-group packed buffer formats that don't compose directly with GEMM output. Dequantizing on the CPU instead requires expensive NPU↔CPU round trips (a 16 MB i32 download plus an 8 MB bf16 upload per GEMM call).
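To make the round-trip cost concrete, here is a small back-of-the-envelope sketch. The element count is an assumption: the issue does not state it, but 4 Mi elements (for example a 2048 × 2048 output) is consistent with the 16 MB / 8 MB figures. The function name `round_trip_bytes` is hypothetical.

```python
# Bytes per element for the two dtypes involved.
I32_BYTES = 4   # int32 accumulator
BF16_BYTES = 2  # bfloat16 result


def round_trip_bytes(num_elements):
    """Return (download_bytes, upload_bytes) for CPU-side dequantization."""
    download = num_elements * I32_BYTES  # int32 accumulators, NPU -> CPU
    upload = num_elements * BF16_BYTES   # bf16 results, CPU -> NPU
    return download, upload


# Assumed element count (not stated in the issue): 4 Mi elements.
num_elements = 4 * 1024 * 1024
down, up = round_trip_bytes(num_elements)
print(down // 2**20, "MiB down,", up // 2**20, "MiB up")
```

Under that assumption, each GEMM call moves 24 MiB across the NPU↔CPU boundary just to apply a scale.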

Solution

New scale_i32_bf16 operator that takes:

  • Input 1: plain (size,) int32 buffer, directly from GEMM output
  • Input 2: tiny (num_cores × 16,) bf16 scale buffer (~256 bytes)
  • Output: plain (size,) bf16 buffer

No packed formats are involved. The scale ObjectFIFO is acquired once per core and reused across all tile iterations. The operator outperforms CPU-side dequantization at prompt lengths of roughly 3500 tokens and above.
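The intended semantics can be sketched as a pure-Python reference model. This is not the NPU kernel, and several details are assumptions not confirmed by the issue: that each core's 16-element slot holds one scale value replicated across lanes, that the input is split contiguously and evenly across cores, and that results are rounded to bf16 with round-to-nearest-even. All function names are hypothetical; bf16 values are represented as 16-bit integer patterns.

```python
import struct


def f32_to_bf16_bits(x):
    """Round a float to bf16 (round-to-nearest-even); return the 16-bit pattern.

    NaN/inf handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # bias that implements ties-to-even
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_f32(b):
    """Expand a 16-bit bf16 pattern to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def scale_i32_bf16_reference(acc, scale_buf, num_cores, lanes=16):
    """Reference model: out[i] = bf16(float(acc[i]) * scale_of_owning_core)."""
    assert len(scale_buf) == num_cores * lanes  # (num_cores x 16,) bf16 buffer
    assert len(acc) % num_cores == 0            # assumed even split across cores
    per_core = len(acc) // num_cores
    out = []
    for core in range(num_cores):
        # Read the scale once per core, mirroring the single ObjectFIFO acquire.
        # Assumption: all lanes in the slot hold the same value, so lane 0 suffices.
        scale = bf16_bits_to_f32(scale_buf[core * lanes])
        for v in acc[core * per_core:(core + 1) * per_core]:
            out.append(f32_to_bf16_bits(float(v) * scale))
    return out


# Usage: scale 0.5 is exactly representable in bf16 as 0x3F00.
result = scale_i32_bf16_reference([2, 4, -6, 8], [0x3F00] * 32, num_cores=2)
print([bf16_bits_to_f32(b) for b in result])
```

A model like this could serve as the golden reference in tests, though the hardware's rounding behavior may differ from the double-then-bf16 rounding used here.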

Tests

Seven non-extensive parameter combinations, all passing. Configurations cover 1-8 columns, 1-2 channels, and tile sizes from 256 to 8192.
