Add D-PACE loss option to DFlash training#578
Merged
Add `--loss-type dpace` (with `--dpace-alpha`, default 0.5) to
scripts/train_dflash.py so DFlash training can switch to D-PACE
(Dynamic Position-Aware Cross-Entropy) with a single flag.
- OnlineDFlashModel gains loss_type / dpace_alpha; the default
("dflash") preserves the existing weighted-mean CE exactly,
including loss-decay-gamma behavior.
- D-PACE weights each draft position by detached smoothed
cumulative confidence (prefix product) times continuation value
(suffix sum); component ablations are exposed as
dpace-cumulative-confidence-only and dpace-continuation-value-only.
- Add examples/run_qwen3_8b_dpace_online.sh and a short note in
examples/README.md.
- Add formula-level unit tests covering dflash backward
compatibility, naive-reference equivalence for dpace and both
ablations, batch reduction, alpha sensitivity, and argument
validation (9 tests, CPU-only).
Collaborator
Hi, this is interesting work. Can you show your draft model's performance with D-PACE using the sglang benchmark? We want to know how much it improves acceptance.
Contributor
Author
Training is in progress now. Once it finishes, I will cover it on
Contributor
Author
Collaborator
Great! I'll check it.


Summary
This PR adds D-PACE as an optional training objective for DFlash in SpecForge.
The goal is to make it easy to compare the original DFlash loss against the D-PACE loss proposed in "D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting", without changing the drafter architecture or inference pipeline.
What changed
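Added a new training argument, `--loss-type`.
Supported values:
- `dflash` (default)
- `dpace`
- `dpace-cumulative-confidence-only`
- `dpace-continuation-value-only`
Added `--dpace-alpha` (default 0.5) for confidence smoothing.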
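The flags can be pictured with a small standalone `argparse` sketch. This is not the code in `scripts/train_dflash.py`: the helper names `build_parser` and `parse_loss_args` are hypothetical, and the range check on alpha is an assumption, since the PR only states that argument validation exists.

```python
import argparse

# Loss types named in this PR; "dflash" is the backward-compatible default.
LOSS_TYPES = (
    "dflash",
    "dpace",
    "dpace-cumulative-confidence-only",
    "dpace-continuation-value-only",
)


def build_parser():
    """Hypothetical parser holding only the two loss-related flags."""
    parser = argparse.ArgumentParser(description="Illustrative loss flags")
    parser.add_argument("--loss-type", choices=LOSS_TYPES, default="dflash")
    parser.add_argument("--dpace-alpha", type=float, default=0.5)
    return parser


def parse_loss_args(argv):
    """Parse argv and apply an assumed sanity check on alpha."""
    args = build_parser().parse_args(argv)
    # Assumption: alpha is a smoothing coefficient expected to lie in [0, 1].
    if not 0.0 <= args.dpace_alpha <= 1.0:
        raise ValueError("--dpace-alpha must be in [0, 1]")
    return args


if __name__ == "__main__":
    print(parse_loss_args(["--loss-type", "dpace", "--dpace-alpha", "0.25"]))
```

With no flags, the sketch resolves to `dflash`, mirroring the claim that existing workflows are unchanged.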
Extended `OnlineDFlashModel` with `loss_type` and `dpace_alpha`.
The existing DFlash behavior remains the default (`loss_type="dflash"`), including loss-decay-gamma behavior, so existing training workflows are unchanged.
The D-PACE implementation computes detached position weights from:
- smoothed cumulative confidence (prefix product)
- continuation value (suffix sum)
and applies their product to the per-position cross-entropy loss.
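A minimal pure-Python sketch of one possible reading of this weighting follows. It is not the PR's implementation. The smoothing rule `alpha + (1 - alpha) * p`, the exclusive prefix product, the recursive continuation value, the normalization to sum 1, and the function names are all assumptions made for illustration. In an autograd framework the weights would additionally be detached so that gradients flow only through the cross-entropy terms.

```python
def dpace_position_weights(confidences, alpha=0.5, loss_type="dpace"):
    """Illustrative per-position weights for a single draft block."""
    # Assumed smoothing: interpolate each confidence toward 1 by alpha.
    q = [alpha + (1.0 - alpha) * p for p in confidences]
    n = len(q)

    # Cumulative confidence: product of smoothed confidences before position k.
    prefix = []
    running = 1.0
    for k in range(n):
        prefix.append(running)
        running *= q[k]

    # Continuation value: cont[k] = q[k] * (1 + cont[k + 1]), i.e. a suffix
    # sum of the confidence products that start at position k.
    cont = [0.0] * n
    running = 0.0
    for k in reversed(range(n)):
        running = q[k] * (1.0 + running)
        cont[k] = running

    if loss_type == "dpace":
        raw = [a * b for a, b in zip(prefix, cont)]
    elif loss_type == "dpace-cumulative-confidence-only":
        raw = prefix
    elif loss_type == "dpace-continuation-value-only":
        raw = cont
    else:
        raise ValueError(f"unsupported loss_type: {loss_type}")

    total = sum(raw)  # assumed normalization so the weights sum to 1
    return [w / total for w in raw]


def weighted_loss(per_position_ce, weights):
    """Apply the (conceptually detached) weights to per-position CE values."""
    return sum(w * ce for w, ce in zip(weights, per_position_ce))
```

Under these assumptions, a low confidence at an early position shrinks the weight of every later position, which matches the stated intuition of focusing on positions that currently limit accepted draft length.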
Also included:
- `examples/run_qwen3_8b_dpace_online.sh`
- a short note in `examples/README.md`
Motivation
DFlash uses a fixed position-dependent weighting schedule.
D-PACE replaces this with adaptive per-sample position weights derived from the drafter's current confidence estimates.
The intuition is that training signal should focus more on positions that currently limit accepted draft length rather than following a fixed position schedule throughout training.
This PR exposes that alternative objective while preserving the existing DFlash objective as the default path.
Validation
Unit tests
Result:
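Coverage includes:
- `dflash` backward compatibility
- naive-reference equivalence for `dpace` and both ablations
- batch reduction
- alpha sensitivity
- argument validation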
Loss-path benchmark
Synthetic forward + backward benchmark on H200 GPUs.
Observed overhead of the D-PACE loss path is small and consistent with the paper's claim that the objective introduces only modest training-time cost.
Surrogate sanity check
For a synthetic acceptance-length simulation:
The confidence-prefix surrogate closely tracks simulated emitted/accepted length and preserves sample ranking, which is the intended signal used by D-PACE for adaptive weighting.
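The kind of check described here can be sketched in a few lines of standard-library Python. This is not the PR's simulation, whose setup and numbers are not reproduced above. It assumes each draft position is accepted independently with probability equal to its confidence and that drafting stops at the first rejection; the function names are hypothetical.

```python
import random


def confidence_prefix_surrogate(confidences):
    """Sum of prefix products: expected accepted length under independence."""
    total = 0.0
    running = 1.0
    for p in confidences:
        running *= p
        total += running
    return total


def simulate_accepted_length(confidences, trials=20000, seed=0):
    """Monte Carlo estimate of the mean number of accepted draft tokens."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        for p in confidences:
            if rng.random() >= p:
                break  # first rejection ends the accepted prefix
            accepted += 1
    return accepted / trials


if __name__ == "__main__":
    for conf in ([0.9, 0.8, 0.5], [0.5, 0.9, 0.9]):
        print(conf, confidence_prefix_surrogate(conf), simulate_accepted_length(conf))
```

Under the independence assumption the surrogate equals the expected accepted length exactly, so the simulated mean should agree with it up to sampling noise.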