Add D-PACE loss option to DFlash training by KilJaeeun · Pull Request #578 · sgl-project/SpecForge

KilJaeeun · 2026-06-11T21:59:40Z

Summary

This PR adds D-PACE as an optional training objective for DFlash in SpecForge.

The goal is to make it easy to compare the original DFlash loss against the D-PACE loss proposed in "D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting", without changing the drafter architecture or the inference pipeline.

What changed

Added a new training argument:

--loss-type

Supported values:

dflash
dpace
dpace-cumulative-confidence-only
dpace-continuation-value-only

Added:

--dpace-alpha

for confidence smoothing.

Extended OnlineDFlashModel with:

loss_type
dpace_alpha

The existing DFlash behavior remains the default:

--loss-type dflash

so existing training workflows are unchanged.
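
A minimal sketch of how the two flags could be declared with `argparse`. The choices and the `0.5` default for `--dpace-alpha` are taken from this PR; the `build_parser` helper and the help strings are hypothetical and are not copied from `scripts/train_dflash.py`.

```python
import argparse

# Loss names as listed in this PR.
LOSS_TYPES = (
    "dflash",
    "dpace",
    "dpace-cumulative-confidence-only",
    "dpace-continuation-value-only",
)


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper; the real script may organize its arguments differently.
    parser = argparse.ArgumentParser(description="DFlash training (sketch)")
    parser.add_argument(
        "--loss-type",
        choices=LOSS_TYPES,
        default="dflash",  # existing behavior stays the default
        help="Training objective for the draft model.",
    )
    parser.add_argument(
        "--dpace-alpha",
        type=float,
        default=0.5,
        help="Confidence smoothing strength used by the D-PACE variants.",
    )
    return parser
```

With no flags, `build_parser().parse_args([])` yields `loss_type == "dflash"`, so existing launch scripts would keep their behavior; an unknown loss name is rejected by `choices`.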

The D-PACE implementation computes detached position weights from:

cumulative confidence
continuation value

and applies them to the per-position cross-entropy loss.
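
The weighting can be sketched in plain Python for a single draft block. This is an illustration under stated assumptions, not the code in `OnlineDFlashModel`: the PR text does not give the formulas, so the smoothing rule (interpolating each confidence toward 1 by `alpha`), the exclusive prefix product, and the recursive continuation value below are all assumed, and the function names are hypothetical.

```python
def dpace_weights(conf, alpha=0.5, mode="dpace"):
    """Sketch of per-position weights for one draft block.

    conf[k] is the probability the drafter assigns to the target token at
    draft position k. Every formula here is an assumption for illustration.
    """
    # Assumed smoothing: pull each confidence toward 1 (alpha=0 means none).
    q = [(1.0 - alpha) * c + alpha for c in conf]
    n = len(q)

    # Cumulative confidence: product of the smoothed confidences of all
    # earlier positions, an estimate of the chance position k is reached.
    cumulative = [1.0] * n
    for k in range(1, n):
        cumulative[k] = cumulative[k - 1] * q[k - 1]

    # Continuation value (assumed): positions gained if position k is
    # accepted, accumulated from the end of the block backwards.
    continuation = [1.0] * n
    for k in range(n - 2, -1, -1):
        continuation[k] = 1.0 + q[k + 1] * continuation[k + 1]

    if mode == "dpace-cumulative-confidence-only":
        return cumulative
    if mode == "dpace-continuation-value-only":
        return continuation
    if mode == "dpace":
        return [c * v for c, v in zip(cumulative, continuation)]
    raise ValueError(f"unknown mode: {mode}")


def weighted_ce(per_position_ce, weights):
    # Weights act as constants (detached): they rescale the per-position
    # cross-entropy but are not themselves optimized.
    total = sum(w * ce for w, ce in zip(weights, per_position_ce))
    return total / sum(weights)
```

For example, under these assumptions `dpace_weights([0.5, 0.5], alpha=0.0)` gives `[1.5, 0.5]`: the first position gets more weight because an error there also forfeits the second position. In a PyTorch implementation the weights would be computed from detached probabilities so that no gradient flows through them.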

Also included:

examples/run_qwen3_8b_dpace_online.sh
README updates
CPU-only unit tests

Motivation

DFlash uses a fixed position-dependent weighting schedule.

D-PACE replaces this with adaptive per-sample position weights derived from the drafter's current confidence estimates.

The intuition is that training signal should focus more on positions that currently limit accepted draft length rather than following a fixed position schedule throughout training.

This PR exposes that alternative objective while preserving the existing DFlash objective as the default path.

Validation

Unit tests

python -m pytest tests/test_utils/test_dflash_losses.py -q

Result:

9 passed

Coverage includes:

DFlash compatibility
D-PACE equivalence against naive references
cumulative-confidence-only ablation
continuation-value-only ablation
batch reduction behavior
alpha sensitivity
argument validation

Loss-path benchmark

Synthetic forward + backward benchmark on H200 GPUs.

loss_type	mean step time (ms)	relative to dflash
dflash	3.794	baseline
dpace	3.838	+1.16%
dpace-cumulative-confidence-only	3.784	-0.26%
dpace-continuation-value-only	3.816	+0.58%

The observed overhead of the D-PACE loss path is small (at most +1.16% mean step time in this benchmark), consistent with the paper's claim that the objective adds only modest training-time cost.

Surrogate sanity check

For a synthetic acceptance-length simulation:

Pearson correlation  = 0.990
Spearman correlation = 0.988

The confidence-prefix surrogate closely tracks simulated emitted/accepted length and preserves sample ranking, which is the intended signal used by D-PACE for adaptive weighting.
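
The PR does not show how the two correlations were computed. A dependency-free sketch of both statistics follows; the helper names are hypothetical, and average ranks are used for ties because accepted lengths are integers and repeat often.

```python
import math


def pearson(x, y):
    # Pearson correlation of two equal-length sequences.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    # 1-based ranks; tied values share the mean of the ranks they span.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold the surrogate value per sample and `y` the simulated accepted length; a high Spearman value indicates that the surrogate orders samples the same way the simulation does.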


Add `--loss-type dpace` (with `--dpace-alpha`, default 0.5) to scripts/train_dflash.py so DFlash training can switch to D-PACE (Dynamic Position-Aware Cross-Entropy) with a single flag.

- OnlineDFlashModel gains loss_type / dpace_alpha; the default ("dflash") preserves the existing weighted-mean CE exactly, including loss-decay-gamma behavior.
- D-PACE weights each draft position by detached smoothed cumulative confidence (prefix product) times continuation value (suffix sum); component ablations are exposed as dpace-cumulative-confidence-only and dpace-continuation-value-only.
- Add examples/run_qwen3_8b_dpace_online.sh and a short note in examples/README.md.
- Add formula-level unit tests covering dflash backward compatibility, naive-reference equivalence for dpace and both ablations, batch reduction, alpha sensitivity, and argument validation (9 tests, CPU-only).

jiapingW · 2026-06-12T11:03:37Z

Hi, this is interesting work. Can you show your draft model's performance with D-PACE using the sglang benchmark? We want to know how much it improves acceptance.

KilJaeeun · 2026-06-12T11:39:45Z

I'm training it now. Once it finishes, I will report the results here.

KilJaeeun · 2026-06-12T13:19:40Z

End-to-end training comparison: does D-PACE actually improve acceptance length?

To validate the core claim of the D-PACE paper beyond unit tests and surrogate correlation, I trained two draft models identical in every way except --loss-type and compared them on a held-out set and in real sglang serving.

Setup


Target model	`Qwen/Qwen3-8B`
Draft	`configs/qwen3-8b-dflash.json` (1 draft layer, block_size 16)
Data	PerfectBlend 52k prompts, responses regenerated by the target via `scripts/regenerate_train_data.py` (temp 0.7, top-p 0.8, thinking disabled, max 2048 tok) → 50k train / 2k held-out
Training	3 epochs (4,623 steps), batch 4 × 8 GPUs (H200), lr 6e-4, max-length 3072, `--chat-template qwen`, seed 42, identical data order for both runs
Command	`examples/run_qwen3_8b_dpace_online.sh` recipe; only `--loss-type` differs (`dpace` uses default `--dpace-alpha 0.5`)

Held-out acceptance length (2k samples, ~760k draft blocks)

Deterministic anchors at every loss-masked position; a block's accepted length = 1 + number of leading draft positions whose argmax matches the target token (greedy verify).

loss	sim. accept length	accepted draft tokens	position acc	full-block acc (15/15)	eval CE
`dflash`	4.610	3.610	0.373	0.038	3.118
`dpace`	5.602 (+21.5%)	4.602 (+27.5%)	0.407	0.074 (2.0×)	3.394
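
The accepted-length definition used for this table (greedy verify) can be sketched as follows. The function names and the list-of-token-ids input format are assumptions for illustration, not the evaluation script used in this PR.

```python
def accepted_length(draft_argmax, target_tokens):
    """1 + number of leading draft positions whose argmax matches the target."""
    matched = 0
    for drafted, target in zip(draft_argmax, target_tokens):
        if drafted != target:
            break  # the first mismatch ends the accepted prefix
        matched += 1
    return 1 + matched


def mean_accepted_length(blocks):
    # blocks: iterable of (draft_argmax, target_tokens) pairs, one per block.
    lengths = [accepted_length(d, t) for d, t in blocks]
    return sum(lengths) / len(lengths)
```

A block whose first draft token is wrong still counts 1, which is why the "sim. accept length" column equals the "accepted draft tokens" column plus one (4.610 = 1 + 3.610).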

Per-position top-1 accuracy improves at all 15 draft positions:

pos	1	2	3	4	5	8	11	15
dflash	0.778	0.656	0.566	0.495	0.438	0.319	0.243	0.174
dpace	0.817	0.701	0.614	0.545	0.488	0.358	0.267	0.177

End-to-end serving benchmark (sglang, `--speculative-algorithm DFLASH`)

Each trained checkpoint served with DFLASH speculative decoding on 1× H200 (--mem-fraction-static 0.8, bf16). Prompts are first user turns from the held-out set, max_new_tokens 1024. accept_length = Σ completion_tokens / Σ spec_verify_ct from server meta info.

config	baseline (no spec)	`dflash` draft	`dpace` draft	dpace vs dflash
bs1, greedy — accept len	1.00	2.183	2.497	+14.4%
bs1, greedy — tok/s (speedup)	197.3 (1.00×)	278.6 (1.41×)	322.0 (1.63×)	+15.6%
bs1, t0.7/top-p 0.8 — accept len	1.00	2.135	2.420	+13.4%
bs1, t0.7 — tok/s (speedup)	194.2 (1.00×)	265.9 (1.37×)	302.9 (1.56×)	+13.9%
bs8, greedy — accept len	1.00	2.171	2.473	+13.9%
bs8, greedy — tok/s (speedup)	1364.9 (1.00×)	1696.4 (1.24×)	1921.0 (1.41×)	+13.2%
bs8, t0.7 — accept len	1.00	2.148	2.439	+13.5%
bs8, t0.7 — tok/s (speedup)	1349.3 (1.00×)	1581.2 (1.17×)	1814.7 (1.35×)	+14.8%
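
The serving metric and the relative gains in the table can be reproduced with a few lines. The field names `completion_tokens` and `spec_verify_ct` come from the description above; representing each response's meta info as a Python dict, and the helper names, are assumptions.

```python
def serving_accept_length(meta_infos):
    # accept_length = sum(completion_tokens) / sum(spec_verify_ct),
    # aggregated over all responses rather than averaged per request.
    total_tokens = sum(m["completion_tokens"] for m in meta_infos)
    total_verifies = sum(m["spec_verify_ct"] for m in meta_infos)
    return total_tokens / total_verifies


def relative_gain_pct(new, old):
    # Percentage improvement of `new` over `old`.
    return (new / old - 1.0) * 100.0


# bs1, greedy row of the table above.
speedup_dflash = 278.6 / 197.3             # about 1.41x over no speculation
speedup_dpace = 322.0 / 197.3              # about 1.63x over no speculation
accept_gain = relative_gain_pct(2.497, 2.183)  # about +14.4%
```

Summing before dividing weights long completions more heavily than short ones, which matches the Σ/Σ form of the definition.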

Observations

The held-out acceptance-length gain (+21.5%) carries through to real autoregressive drafting: +13–14% accepted length and +13–16% output throughput, consistently across greedy/sampling and single/batched serving.
Unweighted eval CE is worse for D-PACE (3.12 → 3.39) while acceptance length is better — exactly the loss/acceptance mismatch the paper argues for: D-PACE reallocates training signal toward positions that extend the accepted prefix, at the cost of mean CE. This also indicates the gain is not explained by generically better optimization.
Full-block accuracy doubles (3.8% → 7.4%), which is particularly relevant for wall-clock speedup since whole-block acceptance amortizes verification best.

[Figures: training curves for dflash and dpace (same W&B project, identical batches per step)]

jiapingW · 2026-06-12T15:23:45Z

Great! I'll check it.

KilJaeeun requested review from FlamingoPg, FrankLeeeee, shuaills and sleepcoo as code owners June 11, 2026 21:59

KilJaeeun force-pushed the feat/dpace-loss branch from d201214 to ffec0c3 Compare June 11, 2026 22:13

KilJaeeun force-pushed the feat/dpace-loss branch from ffec0c3 to 9a30540 Compare June 11, 2026 23:02

jiapingW merged commit 8bd735c into sgl-project:main Jun 13, 2026
2 checks passed

jiapingW merged 1 commit into sgl-project:main from KilJaeeun:feat/dpace-loss
