Auto device fallback on provider failure and tokenizer_options passthrough in Text2TextGenerationPipeline #1670

Closed
Suh0161 wants to merge 5 commits into huggingface:main from Suh0161:feat/cuda-fallback-tokenizer-options

Conversation

@Suh0161 Suh0161 commented May 1, 2026

Closes #425, closes #633, closes #1643, closes #1666, closes #1642, closes #1096.

Supersedes #1668 and #1669 (same changes, single PR).

Summary

Fix: Auto device falls back gracefully when a provider fails to load (#1642)

When device: 'auto' is used on a machine where a high-priority provider (e.g. CUDA on Linux without libcuda.so) is unavailable, the session previously crashed hard. constructSessions in session.js now tries each execution provider individually, logs a warning for each failure, and continues until one succeeds or all are exhausted. createInferenceSession is unchanged.

import { pipeline } from '@huggingface/transformers';
const generator = await pipeline('text-generation', 'onnx-community/LFM2-1.2B-ONNX', { device: 'auto' });
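
For reviewers, the per-provider fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in session.js: `createSessionWithFallback`, the `createSession` callback, and `fakeCreateSession` are hypothetical names introduced here, and the real constructSessions may differ in structure and error handling.

```javascript
// Sketch only. `createSession` is a hypothetical callback standing in for
// whatever actually builds an inference session from a provider list.
async function createSessionWithFallback(providers, createSession) {
  const failures = [];
  for (const provider of providers) {
    try {
      // Try exactly one provider per attempt so a failure can be attributed to it.
      const session = await createSession([provider]);
      return { session, provider };
    } catch (err) {
      const reason = err?.message ?? String(err);
      console.warn(`Provider "${provider}" failed to load: ${reason}. Trying the next one.`);
      failures.push(`${provider}: ${reason}`);
    }
  }
  // Every provider was tried and none succeeded: report all failures at once.
  throw new Error(`No execution provider could be initialized (${failures.join('; ')})`);
}

// Hypothetical stand-in that simulates a machine without CUDA libraries.
const fakeCreateSession = async ([provider]) => {
  if (provider === 'cuda') throw new Error('libcuda.so not found');
  return { provider };
};

createSessionWithFallback(['cuda', 'cpu'], fakeCreateSession)
  .then(({ provider }) => console.log(`Session created with "${provider}"`));
```

Trying one provider at a time makes it possible to name the provider that failed in the warning, at the cost of one extra session-creation attempt per unavailable provider.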

Feature: tokenizer_options passthrough in Text2TextGenerationPipeline (#1096)

const output = await generator('Translate to French: Hello world', {
  max_new_tokens: 100,
  tokenizer_options: { max_length: 64, truncation: true },
});
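
The intended contract (caller options merged on top of the pipeline defaults, with tokenizer_options kept out of the generation parameters) could look roughly like this. It is a sketch under assumptions: `splitGenerateKwargs` and `DEFAULT_TOKENIZER_OPTIONS` are hypothetical names, and only the defaults `padding: true, truncation: true` come from the commit message.

```javascript
// Defaults as stated in the commit message; the constant name is illustrative.
const DEFAULT_TOKENIZER_OPTIONS = { padding: true, truncation: true };

// Hypothetical helper: separates tokenizer settings from generation settings.
function splitGenerateKwargs(generate_kwargs = {}) {
  // Destructure tokenizer_options out so it is never forwarded to generation.
  const { tokenizer_options = {}, ...generation_kwargs } = generate_kwargs;
  return {
    // Caller-supplied values override the defaults key by key.
    tokenizer_options: { ...DEFAULT_TOKENIZER_OPTIONS, ...tokenizer_options },
    generation_kwargs,
  };
}

const split = splitGenerateKwargs({
  max_new_tokens: 100,
  tokenizer_options: { max_length: 64, truncation: true },
});
console.log(split.tokenizer_options);
console.log(split.generation_kwargs);
```

With the call above, `tokenizer_options` ends up as the defaults plus `max_length: 64`, and `generation_kwargs` contains only `max_new_tokens`.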

Suh0161 added 4 commits May 1, 2026 18:16
…ing to tokenizer

- TokenClassificationPipeline now populates start/end character offsets on
  every raw token result by scanning forward through the original text.
  Grouped results (aggregation_strategy='simple') carry the span of the
  first-to-last token in the group.

- PreTrainedTokenizer._call now accepts return_offsets_mapping: true, which
  adds an offset_mapping field ([start, end) per token) to the encoding.
  Works for single strings and batched input; handles padding with [0,0] and
  strips the field before tensor conversion so it is never tensorized.

- Adds computeOffsets() helper with case-insensitive fallback for uncased
  tokenizers (e.g. bert-base-uncased).

Closes huggingface#425, closes huggingface#633.
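
To make the forward-scanning approach concrete, here is a rough sketch of how offsets with a case-insensitive fallback might be computed. It is not the computeOffsets() from this PR: the signature is assumed, tokens are assumed to be plain surface strings (subword markers such as `##` already stripped), and returning `null` for an unmatched token is a choice made for this example only.

```javascript
// Sketch: returns [start, end) character offsets for each token, or null
// when a token cannot be located after the current cursor position.
function computeOffsets(text, tokens) {
  // Note: toLowerCase() can change string length for a few Unicode
  // characters; this sketch ignores that edge case.
  const lowered = text.toLowerCase();
  const offsets = [];
  let cursor = 0;
  for (const token of tokens) {
    // First try an exact match at or after the cursor.
    let start = text.indexOf(token, cursor);
    if (start === -1) {
      // Fallback for uncased tokenizers: compare in lowercase.
      start = lowered.indexOf(token.toLowerCase(), cursor);
    }
    if (start === -1) {
      offsets.push(null); // unmatched; leave the cursor where it is
      continue;
    }
    const end = start + token.length;
    offsets.push([start, end]);
    cursor = end; // scan forward only, so repeated tokens map to later spans
  }
  return offsets;
}

console.log(computeOffsets('Hello World', ['hello', 'world']));
```

Advancing the cursor after each match is what keeps repeated tokens from all resolving to the first occurrence in the text.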
- Fix decoder_forward() defaulting num_logits_to_keep to 0n instead of 1n.
  The comment correctly stated the value should be 1 to avoid computing logits
  for the entire prompt sequence, but the code contradicted it. For models like
  Gemma 4 with long contexts and large vocabularies this caused ~20 GB of
  unnecessary memory allocation during generation.
  Closes huggingface#1666.

- Add get_available_devices() to the public API. The underlying supportedDevices
  list already existed in the ONNX backend but was not accessible to users.
  Returns a copy of the device list sorted by priority/performance for the
  current environment (Node.js, browser, Electron).
  Closes huggingface#1643.
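
As a back-of-the-envelope check on the memory figure above, the logits buffer size scales with the number of positions kept times the vocabulary size. The numbers below are illustrative assumptions (a 262,144-entry vocabulary, a 20,000-token prompt, float32 logits), not measurements from this PR.

```javascript
// Size in bytes of a logits tensor of shape [numPositions, vocabSize].
function logitsBytes(numPositions, vocabSize, bytesPerValue = 4) {
  return numPositions * vocabSize * bytesPerValue;
}

const vocabSize = 262144;   // assumed vocabulary size, for illustration only
const promptLength = 20000; // assumed prompt length, for illustration only

// num_logits_to_keep = 0 keeps logits for every prompt position.
const allPositions = logitsBytes(promptLength, vocabSize);
// num_logits_to_keep = 1 keeps logits for the last position only.
const lastPosition = logitsBytes(1, vocabSize);

console.log(`all positions: ${(allPositions / 1e9).toFixed(2)} GB`);
console.log(`last position: ${(lastPosition / 1e6).toFixed(2)} MB`);
```

Under these assumptions the full-sequence logits come to roughly 21 GB versus about 1 MB for the last position, which is the same order of magnitude as the allocation reported in the commit message.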
…rough

- createInferenceSession() now retries with progressively shorter provider
  lists when session creation fails. Previously, using device: 'auto' on
  Linux x64 would crash with a hard error if CUDA shared libraries were
  missing, even though WebGPU and CPU were available as fallbacks. The
  retry loop drops the failing provider and logs a warning before retrying.
  Closes huggingface#1642.

- Text2TextGenerationPipeline._call() now accepts a tokenizer_options key
  in generate_kwargs, which is merged on top of the pipeline defaults
  (padding: true, truncation: true). This gives callers control over
  tokenizer settings like max_length and truncation side without needing
  to subclass the pipeline.
  Closes huggingface#1096.
…ration of concerns

Move auto-device fallback logic from createInferenceSession into
constructSessions, where device selection context is available. Each
execution provider is now tried individually so a missing accelerator
(e.g. CUDA on Linux without libcuda.so) falls back cleanly with a
warning, without touching the web init chain.

Strip the fake Session creation fallback unit tests (they only verified
array-slicing, not actual fallback behavior) and keep the real device
utility tests.

Clean up tokenizer_options handling in Text2TextGenerationPipeline:
tokenizer_options is destructured out of generate_kwargs so it never
leaks into GenerationFunctionParameters. Comment updated to explain
the contract.
Collaborator

xenova commented May 3, 2026

Hi 👋 thanks for the PR. Could you maybe make separate PRs for each issue you are trying to fix? This makes it a lot easier to review.

Thanks

Author

Suh0161 commented May 4, 2026

> Hi 👋 thanks for the PR. Could you maybe make separate PRs for each issue you are trying to fix? This makes it a lot easier to review.
>
> Thanks

Hi, thanks for the feedback! I've split everything into separate PRs:

The offsets PR is still open separately as well. Closing this one.

@Suh0161 Suh0161 closed this May 4, 2026