Auto device fallback on provider failure and tokenizer_options passthrough in Text2TextGenerationPipeline by Suh0161 · Pull Request #1670 · huggingface/transformers.js

Suh0161 · 2026-05-01T18:02:54Z

Closes #425, closes #633, closes #1643, closes #1666, closes #1642, closes #1096.

Supersedes #1668 and #1669 (same changes, single PR).

Summary

- #425 ([Feature request] Return offset mapping using tokenizer) / #633 (Is 'aggregation_strategy' parameter available for token classification pipeline?): adds return_offsets_mapping on the tokenizer; token-classification outputs can include start / end character spans; GPT-2 ByteLevel offset_mapping includes the leading space where appropriate.
- #1643 (get_available_devices()) / #1666 (Gemma 4 generation passes num_logits_to_keep=0 in decoder_forward, causing full-prompt logits memory blowup): adds get_available_devices(); fixes the default num_logits_to_keep in decoder_forward so it matches generation behavior and avoids computing full-prompt logits (memory / OOM risk with a large vocabulary and long prompts).
- #1642 (Auto device on Linux x64 fails hard when CUDA shared library is unavailable instead of falling back) / #1096 (Allow configuring the tokenizer in Text2TextGenerationPipeline): with device: 'auto', execution providers are tried sequentially, so a higher-priority provider (e.g. CUDA) that fails to load no longer crashes session creation; tokenizer_options in Text2TextGenerationPipeline is forwarded to the tokenizer and stripped before generate().

Fix: Auto device falls back gracefully when a provider fails to load (#1642)

When device: 'auto' is used on a machine where a high-priority provider (e.g. CUDA on Linux without libcuda.so) is unavailable, the session previously crashed hard. constructSessions in session.js now tries each execution provider individually, logs a warning for each failure, and continues until one succeeds or all are exhausted. createInferenceSession is unchanged.

import { pipeline } from '@huggingface/transformers';
await pipeline('text-generation', 'onnx-community/LFM2-1.2B-ONNX', { device: 'auto' });
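The per-provider fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in session.js: `createSessionWithFallback` and the `createSession` callback are hypothetical names, and the fake factory only simulates a machine where CUDA cannot be loaded.

```javascript
// Hypothetical helper: `createSession` stands in for whatever actually builds
// an inference session for a single execution provider.
async function createSessionWithFallback(providers, createSession) {
  const errors = [];
  for (const provider of providers) {
    try {
      // Try exactly one provider per attempt, in priority order.
      const session = await createSession(provider);
      return { provider, session };
    } catch (err) {
      // Warn and move on instead of failing the whole load.
      console.warn(`Provider "${provider}" failed to load: ${err.message}. Trying the next one.`);
      errors.push(err);
    }
  }
  // Every provider failed: surface all collected messages.
  throw new Error(`All providers failed: ${errors.map((e) => e.message).join('; ')}`);
}

// Fake session factory (assumption for the demo): CUDA is unavailable.
const fakeCreate = async (provider) => {
  if (provider === 'cuda') throw new Error('libcuda.so not found');
  return { name: `session-on-${provider}` };
};

createSessionWithFallback(['cuda', 'cpu'], fakeCreate)
  .then(({ provider }) => console.log(`Selected provider: ${provider}`));
```

Trying one provider at a time keeps the failure attributable: the warning names the provider that could not be loaded, and the remaining candidates are still attempted in their original priority order.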

Feature: tokenizer_options passthrough in Text2TextGenerationPipeline (#1096)

// `generator` is assumed to be a text2text-generation pipeline created with pipeline().
const output = await generator('Translate to French: Hello world', {
  max_new_tokens: 100,
  tokenizer_options: { max_length: 64, truncation: true },
});
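One way to picture the passthrough is the sketch below: tokenizer_options is destructured out of the call options, merged over the pipeline defaults (padding: true, truncation: true, as stated in the commit message), and the remainder goes to generate(). The helper name `splitGenerateKwargs` is hypothetical and is not part of the library.

```javascript
// Pipeline defaults as described in the PR's commit message.
const TOKENIZER_DEFAULTS = { padding: true, truncation: true };

// Hypothetical helper: separates tokenizer settings from generation settings.
function splitGenerateKwargs(generate_kwargs = {}) {
  // Pull tokenizer_options out so it never reaches generate().
  const { tokenizer_options = {}, ...generation_kwargs } = generate_kwargs;
  return {
    // Caller-supplied options override the defaults.
    tokenizer_kwargs: { ...TOKENIZER_DEFAULTS, ...tokenizer_options },
    generation_kwargs,
  };
}

const { tokenizer_kwargs, generation_kwargs } = splitGenerateKwargs({
  max_new_tokens: 100,
  tokenizer_options: { max_length: 64, truncation: true },
});

console.log(tokenizer_kwargs);
console.log(generation_kwargs);
```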

…ing to tokenizer
- TokenClassificationPipeline now populates start/end character offsets on every raw token result by scanning forward through the original text. Grouped results (aggregation_strategy='simple') carry the span of the first-to-last token in the group.
- PreTrainedTokenizer._call now accepts return_offsets_mapping: true, which adds an offset_mapping field ([start, end) per token) to the encoding. Works for single strings and batched input; handles padding with [0,0] and strips the field before tensor conversion so it is never tensorized.
- Adds computeOffsets() helper with case-insensitive fallback for uncased tokenizers (e.g. bert-base-uncased).
Closes huggingface#425, closes huggingface#633.
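The forward scan with a case-insensitive fallback might look like the sketch below. It is an illustration only, not the PR's computeOffsets(): the function name `scanOffsets` is hypothetical, and the sketch ignores subword markers such as "##" and any normalization beyond lowercasing.

```javascript
// Hypothetical helper: finds [start, end) character spans for each word by
// scanning forward through the original text.
function scanOffsets(text, words) {
  // Simplification: assumes lowercasing does not change the string length.
  const lower = text.toLowerCase();
  const offsets = [];
  let cursor = 0;
  for (const word of words) {
    // Exact match first, starting at the end of the previous match.
    let start = text.indexOf(word, cursor);
    // Fall back to a case-insensitive match (uncased tokenizers).
    if (start === -1) start = lower.indexOf(word.toLowerCase(), cursor);
    if (start === -1) {
      offsets.push(null); // No span could be recovered for this word.
      continue;
    }
    const end = start + word.length;
    offsets.push([start, end]);
    cursor = end; // Never look behind the last match.
  }
  return offsets;
}

console.log(scanOffsets('Hello World', ['hello', 'world']));
// → [ [ 0, 5 ], [ 6, 11 ] ]
```

Advancing the cursor after each match is what keeps repeated words from all mapping to the first occurrence.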

- Fix decoder_forward() defaulting num_logits_to_keep to 0n instead of 1n. The comment correctly stated the value should be 1 to avoid computing logits for the entire prompt sequence, but the code contradicted it. For models like Gemma 4 with long contexts and large vocabularies this caused ~20 GB of unnecessary memory allocation during generation. Closes huggingface#1666.
- Add get_available_devices() to the public API. The underlying supportedDevices list already existed in the ONNX backend but was not accessible to users. Returns a copy of the device list sorted by priority/performance for the current environment (Node.js, browser, Electron). Closes huggingface#1643.
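A back-of-the-envelope calculation shows why the default matters. The logits tensor has shape [batch, positions, vocab], so keeping every prompt position scales memory linearly with prompt length. The numbers below (a 262,144-entry vocabulary, a 20,000-token prompt, float32 logits) are illustrative assumptions, not measurements from the PR, and `logitsBytes` is a hypothetical helper.

```javascript
// Hypothetical helper: size in bytes of a [batch, positions, vocab] logits tensor.
function logitsBytes(batch, positions, vocabSize, bytesPerElement = 4) {
  return batch * positions * vocabSize * bytesPerElement;
}

const VOCAB = 262144;        // assumed vocabulary size (illustrative)
const PROMPT_TOKENS = 20000; // assumed prompt length (illustrative)

// Keeping logits for every prompt position (the num_logits_to_keep = 0 case).
const full = logitsBytes(1, PROMPT_TOKENS, VOCAB);
// Keeping only the last position (the num_logits_to_keep = 1 case).
const last = logitsBytes(1, 1, VOCAB);

console.log(`all positions: ${(full / 2 ** 30).toFixed(2)} GiB`);
console.log(`last position: ${(last / 2 ** 20).toFixed(2)} MiB`);
```

Under these assumptions the full-prompt tensor is about 19.5 GiB against 1 MiB for the last position alone, which is the same order of magnitude as the figure quoted in the commit message.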

…rough
- createInferenceSession() now retries with progressively shorter provider lists when session creation fails. Previously, using device: 'auto' on Linux x64 would crash with a hard error if CUDA shared libraries were missing, even though WebGPU and CPU were available as fallbacks. The retry loop drops the failing provider and logs a warning before retrying. Closes huggingface#1642.
- Text2TextGenerationPipeline._call() now accepts a tokenizer_options key in generate_kwargs, which is merged on top of the pipeline defaults (padding: true, truncation: true). This gives callers control over tokenizer settings like max_length and truncation side without needing to subclass the pipeline. Closes huggingface#1096.

…ration of concerns
Move auto-device fallback logic from createInferenceSession into constructSessions, where device selection context is available. Each execution provider is now tried individually so a missing accelerator (e.g. CUDA on Linux without libcuda.so) falls back cleanly with a warning, without touching the web init chain.
Strip the fake Session creation fallback unit tests (they only verified array-slicing, not actual fallback behavior) and keep the real device utility tests.
Clean up tokenizer_options handling in Text2TextGenerationPipeline: tokenizer_options is destructured out of generate_kwargs so it never leaks into GenerationFunctionParameters. Comment updated to explain the contract.

xenova · 2026-05-03T14:40:56Z

Hi 👋 thanks for the PR. Could you maybe make separate PRs for each issue you are trying to fix? This makes it a lot easier to review.

Thanks

Suh0161 · 2026-05-04T09:19:30Z

Hi, thanks for the feedback! I've split everything into separate PRs:

feat(qa): return start/end character offsets for question answering (#1245) #1671 — Fix num_logits_to_keep default and add get_available_devices() (closes Gemma 4 generation passes num_logits_to_keep=0 in decoder_forward, causing full-prompt logits memory blowup #1666, get_available_devices() #1643)
transformers.js can't be used in a Bun --compile single binary without patches #1672 — Fix auto device falling back when a provider fails to load (closes Auto device on Linux x64 fails hard when CUDA shared library is unavailable instead of falling back #1642)
Fix num_logits_to_keep default and add get_available_devices() #1673 — Allow callers to pass tokenizer_options in Text2TextGenerationPipeline (closes Allow configuring the tokenizer in Text2TextGenerationPipeline #1096)

The offsets PR is still open separately as well. Closing this one.

Suh0161 added 4 commits May 1, 2026 18:16

This was referenced May 1, 2026

Add start/end character offsets to token classification and return_offsets_mapping to tokenizer #1668

Closed

Fix num_logits_to_keep default in decoder_forward and add get_available_devices() #1669

Closed

fix(tokenizer): align GPT-2 offset_mapping with ByteLevel space prefix

72d4c78

Suh0161 closed this May 4, 2026

Suh0161 wants to merge 5 commits into huggingface:main from Suh0161:feat/cuda-fallback-tokenizer-options
