Auto device fallback on provider failure and tokenizer_options passthrough in Text2TextGenerationPipeline #1670
Closed
Suh0161 wants to merge 5 commits into
…ing to tokenizer
- TokenClassificationPipeline now populates start/end character offsets on every raw token result by scanning forward through the original text. Grouped results (aggregation_strategy='simple') carry the span of the first-to-last token in the group.
- PreTrainedTokenizer._call now accepts return_offsets_mapping: true, which adds an offset_mapping field ([start, end) per token) to the encoding. Works for single strings and batched input; handles padding with [0,0] and strips the field before tensor conversion so it is never tensorized.
- Adds computeOffsets() helper with case-insensitive fallback for uncased tokenizers (e.g. bert-base-uncased).
Closes huggingface#425, closes huggingface#633.
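The forward scan with a case-insensitive fallback described above could look roughly like the following. This is a minimal sketch, not the PR's actual computeOffsets(): the function name computeOffsetsSketch, the handling of the WordPiece "##" prefix, and the choice to give unmatched tokens an empty [0, 0] span are all assumptions made for illustration.

```javascript
// Hypothetical sketch of forward-scanning offset computation.
// Not the PR's computeOffsets(); names and edge-case handling are assumed.
function computeOffsetsSketch(text, tokens) {
  // Lowercased copy used only for the case-insensitive fallback.
  // Caveat: toLowerCase() can change string length for a few characters,
  // which this sketch does not handle.
  const lowered = text.toLowerCase();
  const offsets = [];
  let cursor = 0; // never search before the end of the previous match

  for (const token of tokens) {
    // Assumed: strip a WordPiece continuation prefix before searching.
    const piece = token.startsWith("##") ? token.slice(2) : token;

    let start = text.indexOf(piece, cursor);
    if (start === -1) {
      // Fallback for uncased tokenizers, whose tokens are lowercased.
      start = lowered.indexOf(piece.toLowerCase(), cursor);
    }
    if (start === -1) {
      // Assumed: tokens with no match in the text (e.g. special tokens)
      // get an empty span and do not advance the cursor.
      offsets.push([0, 0]);
      continue;
    }

    const end = start + piece.length; // end is exclusive: [start, end)
    offsets.push([start, end]);
    cursor = end;
  }
  return offsets;
}

const offsets = computeOffsetsSketch("Hello World", [
  "[CLS]", "hello", "wor", "##ld", "[SEP]",
]);
console.log(offsets);
```

Because the cursor only moves forward, repeated words map to successive occurrences rather than all resolving to the first one.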
- Fix decoder_forward() defaulting num_logits_to_keep to 0n instead of 1n. The comment correctly stated the value should be 1 to avoid computing logits for the entire prompt sequence, but the code contradicted it. For models like Gemma 4 with long contexts and large vocabularies this caused ~20 GB of unnecessary memory allocation during generation. Closes huggingface#1666.
- Add get_available_devices() to the public API. The underlying supportedDevices list already existed in the ONNX backend but was not accessible to users. Returns a copy of the device list sorted by priority/performance for the current environment (Node.js, browser, Electron). Closes huggingface#1643.
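To see why keeping logits for every prompt position is expensive, here is a rough back-of-the-envelope estimate. The vocabulary size, prompt length, and float32 assumption below are hypothetical round numbers chosen for illustration, not figures taken from the PR or from any specific model.

```javascript
// Rough estimate of logits tensor size for a single sequence (batch size 1).
// All concrete numbers here are hypothetical.
function logitsBytes(numPositions, vocabSize, bytesPerValue = 4) {
  // One value per (position, vocabulary entry) pair.
  return numPositions * vocabSize * bytesPerValue;
}

const vocabSize = 262144;  // hypothetical large vocabulary (2^18)
const promptLength = 8192; // hypothetical long prompt (2^13)

// Logits computed for every prompt position.
const fullPrompt = logitsBytes(promptLength, vocabSize);
// Logits computed for the last position only (num_logits_to_keep = 1).
const lastOnly = logitsBytes(1, vocabSize);

console.log(fullPrompt / 2 ** 30, "GiB for the full prompt"); // 8
console.log(lastOnly / 2 ** 20, "MiB for the last position"); // 1
```

The cost scales linearly with prompt length, so only the final position's logits are needed when the goal is to sample the next token.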
…rough
- createInferenceSession() now retries with progressively shorter provider lists when session creation fails. Previously, using device: 'auto' on Linux x64 would crash with a hard error if CUDA shared libraries were missing, even though WebGPU and CPU were available as fallbacks. The retry loop drops the failing provider and logs a warning before retrying. Closes huggingface#1642.
- Text2TextGenerationPipeline._call() now accepts a tokenizer_options key in generate_kwargs, which is merged on top of the pipeline defaults (padding: true, truncation: true). This gives callers control over tokenizer settings like max_length and truncation side without needing to subclass the pipeline. Closes huggingface#1096.
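The retry loop described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: createWithFallback and the createSession callback are hypothetical names, and the sketch assumes the failing provider is always the first entry in the list.

```javascript
// Hypothetical sketch of a provider fallback loop.
// `createSession` stands in for whatever actually builds the session;
// it receives the remaining provider list and may reject.
async function createWithFallback(providers, createSession, warn = console.warn) {
  let remaining = [...providers]; // copy so the caller's array is untouched
  let lastError;

  while (remaining.length > 0) {
    try {
      return await createSession(remaining);
    } catch (error) {
      lastError = error;
      // Assumed: the highest-priority (first) provider is the one that failed.
      const [failed, ...rest] = remaining;
      if (rest.length > 0) {
        warn(`Provider "${failed}" failed (${error.message}); retrying with [${rest.join(", ")}]`);
      }
      remaining = rest;
    }
  }
  // Every provider failed, or the list was empty to begin with.
  throw lastError ?? new Error("No execution providers were supplied");
}

// Usage with a fake factory that simulates missing CUDA libraries.
const fakeCreate = async (list) => {
  if (list[0] === "cuda") throw new Error("libcuda.so not found");
  return { provider: list[0] };
};

createWithFallback(["cuda", "webgpu", "cpu"], fakeCreate)
  .then((session) => console.log("session created with", session.provider));
```

Rethrowing the last error when every provider fails keeps the original failure visible instead of replacing it with a generic message.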
…ration of concerns
Move auto-device fallback logic from createInferenceSession into constructSessions, where device selection context is available. Each execution provider is now tried individually so a missing accelerator (e.g. CUDA on Linux without libcuda.so) falls back cleanly with a warning, without touching the web init chain.
Strip the fake Session creation fallback unit tests (they only verified array-slicing, not actual fallback behavior) and keep the real device utility tests.
Clean up tokenizer_options handling in Text2TextGenerationPipeline: tokenizer_options is destructured out of generate_kwargs so it never leaks into GenerationFunctionParameters. Comment updated to explain the contract.
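The tokenizer_options contract described above (destructure it out of generate_kwargs, merge it over the pipeline defaults) can be illustrated with a small sketch. The helper name splitGenerateKwargs and the returned shape are hypothetical; only the defaults padding: true and truncation: true come from the commit message.

```javascript
// Pipeline defaults as stated in the commit message.
const DEFAULT_TOKENIZER_OPTIONS = { padding: true, truncation: true };

// Hypothetical helper: separates tokenizer settings from generation settings.
function splitGenerateKwargs(generate_kwargs = {}) {
  // Pull tokenizer_options out; everything else is left for generate().
  const { tokenizer_options = {}, ...generation_kwargs } = generate_kwargs;
  return {
    // Caller-supplied values win over the defaults.
    tokenizer_options: { ...DEFAULT_TOKENIZER_OPTIONS, ...tokenizer_options },
    generation_kwargs,
  };
}

const { tokenizer_options, generation_kwargs } = splitGenerateKwargs({
  max_new_tokens: 32,
  tokenizer_options: { max_length: 128, truncation: false },
});

console.log(tokenizer_options);  // { padding: true, truncation: false, max_length: 128 }
console.log(generation_kwargs);  // { max_new_tokens: 32 }
```

Using rest destructuring means the generation parameters object is built without tokenizer_options ever being present, rather than deleting the key afterwards.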
This was referenced May 1, 2026
Collaborator
Hi 👋 thanks for the PR. Could you maybe make separate PRs for each issue you are trying to fix? This makes it a lot easier to review. Thanks
Closes #425, closes #633, closes #1643, closes #1666, closes #1642, closes #1096.
Supersedes #1668 and #1669 (same changes, single PR).
Summary
- return_offsets_mapping on the tokenizer; token-classification outputs can include start/end character spans; GPT-2 ByteLevel offset_mapping includes the leading space where appropriate.
- get_available_devices(); fixes default num_logits_to_keep in decoder_forward so it matches generation behavior and avoids computing full-prompt logits (memory / OOM risk on large vocab + long prompts).
- Text2TextGenerationPipeline #1096: When device: 'auto', tries execution providers sequentially instead of crashing if a higher-priority provider (e.g. CUDA) fails to load; tokenizer_options in Text2TextGenerationPipeline is forwarded to the tokenizer and stripped before generate().
Fix: Auto device falls back gracefully when a provider fails to load (#1642)
When device: 'auto' is used on a machine where a high-priority provider (e.g. CUDA on Linux without libcuda.so) is unavailable, the session previously crashed hard. constructSessions in session.js now tries each execution provider individually, logs a warning for each failure, and continues until one succeeds or all are exhausted. createInferenceSession is unchanged.
Feature: tokenizer_options passthrough in Text2TextGenerationPipeline (#1096)