
Add D-PACE loss option to DFlash training #578

Merged
jiapingW merged 1 commit into sgl-project:main from KilJaeeun:feat/dpace-loss on Jun 13, 2026

Conversation

@KilJaeeun

KilJaeeun commented Jun 11, 2026

Contributor

Summary

This PR adds D-PACE as an optional training objective for DFlash in SpecForge.

The goal is to make it easy to compare the original DFlash loss against the D-PACE loss proposed in D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting without changing the drafter architecture or inference pipeline.

What changed

Added a new training argument:

--loss-type

Supported values:

dflash
dpace
dpace-cumulative-confidence-only
dpace-continuation-value-only

Added:

--dpace-alpha

for confidence smoothing.

Extended OnlineDFlashModel with:

loss_type
dpace_alpha

The existing DFlash behavior remains the default:

--loss-type dflash

so existing training workflows are unchanged.
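As a rough illustration of how the two flags above could be wired up, here is a minimal argparse sketch. The flag names, the four choices, and the 0.5 default for `--dpace-alpha` come from this PR; the helper names `build_parser` and `parse_loss_args` and the [0, 1] range check on alpha are assumptions made for the example, and the actual parser in scripts/train_dflash.py may look different.

```python
import argparse

# Values listed in the PR description.
LOSS_TYPES = (
    "dflash",
    "dpace",
    "dpace-cumulative-confidence-only",
    "dpace-continuation-value-only",
)


def build_parser():
    """Hypothetical parser; not copied from scripts/train_dflash.py."""
    parser = argparse.ArgumentParser(description="DFlash training (sketch)")
    parser.add_argument(
        "--loss-type",
        choices=LOSS_TYPES,
        default="dflash",  # existing behavior stays the default
        help="training objective for the draft model",
    )
    parser.add_argument(
        "--dpace-alpha",
        type=float,
        default=0.5,
        help="confidence smoothing used by the D-PACE variants",
    )
    return parser


def parse_loss_args(argv):
    args = build_parser().parse_args(argv)
    # Assumption: alpha is treated as a mixing coefficient in [0, 1].
    if not 0.0 <= args.dpace_alpha <= 1.0:
        raise ValueError("--dpace-alpha must be in [0, 1]")
    return args
```

With no flags, `parse_loss_args([])` yields `loss_type == "dflash"`, which matches the "existing workflows are unchanged" guarantee.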

The D-PACE implementation computes detached position weights from:

  • cumulative confidence
  • continuation value

and applies them to the per-position cross-entropy loss.
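To make the weighting concrete, here is a pure-Python sketch for a single draft block, reading cumulative confidence as a prefix product and continuation value as a suffix sum. It is not the PR's implementation: the smoothing rule (`alpha + (1 - alpha) * p`), the exclusive/inclusive index conventions, and the normalization by the weight sum are all assumptions chosen for illustration, and the function names are hypothetical. In the real training code the weights are detached from the autograd graph; plain floats stand in for that here.

```python
def dpace_weights(target_probs, alpha=0.5, loss_type="dpace"):
    """Illustrative per-position weights for one block (assumed formulas)."""
    # Assumed smoothing: pull each confidence toward 1 by alpha.
    q = [alpha + (1.0 - alpha) * p for p in target_probs]

    # Cumulative confidence: product of smoothed confidences at earlier
    # positions, i.e. an estimate that the prefix before k is accepted.
    cum, running = [], 1.0
    for c in q:
        cum.append(running)
        running *= c

    # Continuation value: built right to left as
    # 1 + q[k+1] + q[k+1]*q[k+2] + ..., an estimate of how many positions
    # are accepted from k onward if position k itself is correct.
    cont, running = [0.0] * len(q), 0.0
    for k in range(len(q) - 1, -1, -1):
        nxt = q[k + 1] * running if k + 1 < len(q) else 0.0
        running = 1.0 + nxt
        cont[k] = running

    if loss_type == "dpace":
        return [a * b for a, b in zip(cum, cont)]
    if loss_type == "dpace-cumulative-confidence-only":
        return cum
    if loss_type == "dpace-continuation-value-only":
        return cont
    raise ValueError(f"not a D-PACE loss type: {loss_type}")


def weighted_ce(per_position_ce, weights):
    """Weighted mean of per-position cross-entropy (assumed reduction)."""
    return sum(w * l for w, l in zip(weights, per_position_ce)) / sum(weights)
```

Under these assumptions, a block whose early positions are uncertain gets small weights on late positions, since their cumulative confidence is low.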

Also included:

  • examples/run_qwen3_8b_dpace_online.sh
  • README updates
  • CPU-only unit tests

Motivation

DFlash uses a fixed position-dependent weighting schedule.

D-PACE replaces this with adaptive per-sample position weights derived from the drafter's current confidence estimates.

The intuition is that training signal should focus more on positions that currently limit accepted draft length rather than following a fixed position schedule throughout training.

This PR exposes that alternative objective while preserving the existing DFlash objective as the default path.

Validation

Unit tests

python -m pytest tests/test_utils/test_dflash_losses.py -q

Result:

9 passed

Coverage includes:

  • DFlash compatibility
  • D-PACE equivalence against naive references
  • cumulative-confidence-only ablation
  • continuation-value-only ablation
  • batch reduction behavior
  • alpha sensitivity
  • argument validation

Loss-path benchmark

Synthetic forward + backward benchmark on H200 GPUs.

| loss_type | mean step time (ms) | relative to dflash |
| --- | --- | --- |
| dflash | 3.794 | baseline |
| dpace | 3.838 | +1.16% |
| dpace-cumulative-confidence-only | 3.784 | -0.26% |
| dpace-continuation-value-only | 3.816 | +0.58% |

Observed overhead of the D-PACE loss path is small and consistent with the paper's claim that the objective introduces only modest training-time cost.

Surrogate sanity check

For a synthetic acceptance-length simulation:

Pearson correlation  = 0.990
Spearman correlation = 0.988

The confidence-prefix surrogate closely tracks simulated emitted/accepted length and preserves sample ranking, which is the intended signal used by D-PACE for adaptive weighting.
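A sanity check of this kind can be sketched in plain Python. The setup below is a made-up stand-in, not the simulation used for the numbers above: it assumes the surrogate is the sum of prefix products of per-position confidences, that each position is accepted independently with that probability, and that each sample has one confidence level shared by all positions. The function names and parameters are hypothetical, and the correlations it produces are not expected to reproduce 0.990 / 0.988.

```python
import random


def confidence_prefix_surrogate(confidences):
    """Sum of prefix products: expected number of leading accepted positions."""
    total, prefix = 0.0, 1.0
    for c in confidences:
        prefix *= c
        total += prefix
    return total


def simulate_accepted_length(confidences, rng):
    """Count leading positions accepted under independent Bernoulli draws."""
    accepted = 0
    for c in confidences:
        if rng.random() >= c:
            break
        accepted += 1
    return accepted


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def ranks(values):
    # Ties get arbitrary adjacent ranks (no averaging); adequate for a sketch.
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = float(rank)
    return out


def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))


def run_check(num_samples=200, block=8, trials=200, seed=0):
    rng = random.Random(seed)
    surrogate, simulated = [], []
    for _ in range(num_samples):
        level = rng.uniform(0.3, 0.95)  # assumed per-sample confidence level
        conf = [level] * block
        surrogate.append(confidence_prefix_surrogate(conf))
        runs = [simulate_accepted_length(conf, rng) for _ in range(trials)]
        simulated.append(sum(runs) / trials)
    return pearson(surrogate, simulated), spearman(surrogate, simulated)
```

Calling `run_check()` returns a `(pearson, spearman)` pair; averaging over `trials` is what keeps the simulated length from being dominated by sampling noise.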


Add `--loss-type dpace` (with `--dpace-alpha`, default 0.5) to
scripts/train_dflash.py so DFlash training can switch to D-PACE
(Dynamic Position-Aware Cross-Entropy) with a single flag.

- OnlineDFlashModel gains loss_type / dpace_alpha; the default
  ("dflash") preserves the existing weighted-mean CE exactly,
  including loss-decay-gamma behavior.
- D-PACE weights each draft position by detached smoothed
  cumulative confidence (prefix product) times continuation value
  (suffix sum); component ablations are exposed as
  dpace-cumulative-confidence-only and dpace-continuation-value-only.
- Add examples/run_qwen3_8b_dpace_online.sh and a short note in
  examples/README.md.
- Add formula-level unit tests covering dflash backward
  compatibility, naive-reference equivalence for dpace and both
  ablations, batch reduction, alpha sensitivity, and argument
  validation (9 tests, CPU-only).
@jiapingW

Collaborator

Hi, this is interesting work. Can you show your draft model's performance with D-PACE using the sglang benchmark? We want to know how much it improves acceptance.

@KilJaeeun

KilJaeeun commented Jun 12, 2026

Contributor Author

I'm training the models now. Once training finishes, I will post the results here.

@KilJaeeun

KilJaeeun commented Jun 12, 2026

Contributor Author

End-to-end training comparison: does D-PACE actually improve acceptance length?

To validate the core claim of the D-PACE paper beyond unit tests and surrogate correlation, I trained two draft models identical in every way except --loss-type and compared them on a held-out set and in real sglang serving.

Setup

| setting | value |
| --- | --- |
| Target model | Qwen/Qwen3-8B |
| Draft | configs/qwen3-8b-dflash.json (1 draft layer, block_size 16) |
| Data | PerfectBlend 52k prompts, responses regenerated by the target via scripts/regenerate_train_data.py (temp 0.7, top-p 0.8, thinking disabled, max 2048 tok) → 50k train / 2k held-out |
| Training | 3 epochs (4,623 steps), batch 4 × 8 GPUs (H200), lr 6e-4, max-length 3072, --chat-template qwen, seed 42, identical data order for both runs |
| Command | examples/run_qwen3_8b_dpace_online.sh recipe; only --loss-type differs (dpace uses default --dpace-alpha 0.5) |

Held-out acceptance length (2k samples, ~760k draft blocks)

Deterministic anchors at every loss-masked position; a block's accepted length = 1 + number of leading draft positions whose argmax matches the target token (greedy verify).

| loss | sim. accept length | accepted draft tokens | position acc | full-block acc (15/15) | eval CE |
| --- | --- | --- | --- | --- | --- |
| dflash | 4.610 | 3.610 | 0.373 | 0.038 | 3.118 |
| dpace | 5.602 (+21.5%) | 4.602 (+27.5%) | 0.407 | 0.074 (2.0×) | 3.394 |

Per-position top-1 accuracy improves at all 15 draft positions:

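| pos | 1 | 2 | 3 | 4 | 5 | 8 | 11 | 15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dflash | 0.778 | 0.656 | 0.566 | 0.495 | 0.438 | 0.319 | 0.243 | 0.174 |
| dpace | 0.817 | 0.701 | 0.614 | 0.545 | 0.488 | 0.358 | 0.267 | 0.177 |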
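The accepted-length definition used for this held-out evaluation (1 + number of leading draft positions whose argmax matches the target token) can be written out as a short sketch. The function names and the list-of-token-ids input format are assumptions for illustration; the actual evaluation script is not shown in this PR.

```python
def accepted_length(draft_argmax, target_tokens):
    """1 (the anchor token) plus the count of leading matching draft positions."""
    matched = 0
    for drafted, target in zip(draft_argmax, target_tokens):
        if drafted != target:
            break  # greedy verify stops at the first mismatch
        matched += 1
    return 1 + matched


def mean_accepted_length(blocks):
    """Average over (draft_argmax, target_tokens) pairs, one pair per block."""
    lengths = [accepted_length(d, t) for d, t in blocks]
    return sum(lengths) / len(lengths)
```

Under this definition the "accepted draft tokens" column is the accept length minus 1, and a fully matched block of 15 draft positions gives an accepted length of 16.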

End-to-end serving benchmark (sglang, --speculative-algorithm DFLASH)

Each trained checkpoint served with DFLASH speculative decoding on 1× H200 (--mem-fraction-static 0.8, bf16). Prompts are first user turns from the held-out set, max_new_tokens 1024. accept_length = Σ completion_tokens / Σ spec_verify_ct from server meta info.

| config | metric | baseline (no spec) | dflash draft | dpace draft | dpace vs dflash |
| --- | --- | --- | --- | --- | --- |
| bs1, greedy | accept len | 1.00 | 2.183 | 2.497 | +14.4% |
| bs1, greedy | tok/s (speedup) | 197.3 (1.00×) | 278.6 (1.41×) | 322.0 (1.63×) | +15.6% |
| bs1, t0.7/top-p 0.8 | accept len | 1.00 | 2.135 | 2.420 | +13.4% |
| bs1, t0.7 | tok/s (speedup) | 194.2 (1.00×) | 265.9 (1.37×) | 302.9 (1.56×) | +13.9% |
| bs8, greedy | accept len | 1.00 | 2.171 | 2.473 | +13.9% |
| bs8, greedy | tok/s (speedup) | 1364.9 (1.00×) | 1696.4 (1.24×) | 1921.0 (1.41×) | +13.2% |
| bs8, t0.7 | accept len | 1.00 | 2.148 | 2.439 | +13.5% |
| bs8, t0.7 | tok/s (speedup) | 1349.3 (1.00×) | 1581.2 (1.17×) | 1814.7 (1.35×) | +14.8% |
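For reference, the accept_length metric used in this serving benchmark (Σ completion_tokens / Σ spec_verify_ct) and the relative-gain column can be reproduced with a few lines. The field names `completion_tokens` and `spec_verify_ct` are the ones named above; representing the server meta info as one dict per request, and the helper names, are assumptions made for this sketch.

```python
def aggregate_accept_length(meta_infos):
    """Total completion tokens divided by total verify steps across requests."""
    total_completion = sum(m["completion_tokens"] for m in meta_infos)
    total_verify = sum(m["spec_verify_ct"] for m in meta_infos)
    if total_verify == 0:
        raise ValueError("no speculative verify steps recorded")
    return total_completion / total_verify


def relative_gain_percent(new, base):
    """Percentage change of `new` over `base`, as in the last table column."""
    return (new - base) / base * 100.0
```

Summing before dividing weights long completions more heavily than averaging per-request ratios would, which is why the two sums are taken separately.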

Observations

  • The held-out acceptance-length gain (+21.5%) carries through to real autoregressive drafting: +13–14% accepted length and +13–16% output throughput, consistently across greedy/sampling and single/batched serving.
  • Unweighted eval CE is worse for D-PACE (3.12 → 3.39) while acceptance length is better — exactly the loss/acceptance mismatch the paper argues for: D-PACE reallocates training signal toward positions that extend the accepted prefix, at the cost of mean CE. This also indicates the gain is not explained by generically better optimization.
  • Full-block accuracy doubles (3.8% → 7.4%), which is particularly relevant for wall-clock speedup since whole-block acceptance amortizes verification best.

Training curves (same W&B project, identical batches per step):

  • dflash: [image: training curves]
  • dpace: [image: training curves]

@jiapingW

Collaborator

Great! I'll check it.

@jiapingW jiapingW merged commit 8bd735c into sgl-project:main Jun 13, 2026
2 checks passed