Add start/end character offsets to token classification and return_offsets_mapping to tokenizer by Suh0161 · Pull Request #1668 · huggingface/transformers.js

Suh0161 · 2026-05-01T17:20:48Z

Closes #425, closes #633.

Two related feature-parity gaps with the Python transformers library, fixed in one PR since they share the same underlying mechanism (character-offset computation).

return_offsets_mapping on the tokenizer ([Feature request] Return offset mapping using tokenizer #425)

PreTrainedTokenizer now accepts return_offsets_mapping: true:

const encoding = tokenizer('My name is Sarah', {
  return_tensor: false,
  return_offsets_mapping: true,
});
// encoding.offset_mapping → [[0,0],[0,2],[3,7],[8,10],[11,16],[0,0]]
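As a rough illustration of how the offsets might be consumed, the sketch below slices the original string using the pairs from the example above. `spansFromOffsets` is a hypothetical helper that is not part of this PR or of transformers.js. It assumes offsets are half-open [start, end) pairs and that [0,0] marks a special or padding token.

```javascript
// Hypothetical helper, not part of transformers.js.
// Assumes each offset is a half-open [start, end) pair into `text`,
// and that special/padding tokens are reported as [0, 0].
function spansFromOffsets(text, offsets) {
  return offsets
    .filter(([start, end]) => end > start) // drop empty spans such as [0, 0]
    .map(([start, end]) => text.slice(start, end));
}

// Offsets copied from the example above instead of calling a real tokenizer.
const text = 'My name is Sarah';
const offsetMapping = [[0, 0], [0, 2], [3, 7], [8, 10], [11, 16], [0, 0]];

console.log(spansFromOffsets(text, offsetMapping));
// → [ 'My', 'name', 'is', 'Sarah' ]
```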

…ing to tokenizer

- TokenClassificationPipeline now populates start/end character offsets on every raw token result by scanning forward through the original text. Grouped results (aggregation_strategy='simple') carry the span of the first-to-last token in the group.
- PreTrainedTokenizer._call now accepts return_offsets_mapping: true, which adds an offset_mapping field ([start, end) per token) to the encoding. Works for single strings and batched input; handles padding with [0,0] and strips the field before tensor conversion so it is never tensorized.
- Adds computeOffsets() helper with case-insensitive fallback for uncased tokenizers (e.g. bert-base-uncased).

Closes huggingface#425, closes huggingface#633.
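The forward scan with a case-insensitive fallback could look roughly like the sketch below. This is not the PR's actual computeOffsets() implementation, which is not shown here. The names `computeOffsetsSketch` and `groupSpan`, the `##` WordPiece prefix handling, and the default special-token set are all assumptions made for illustration.

```javascript
// Hypothetical sketch; NOT the PR's computeOffsets().
// Assumes WordPiece-style tokens ('##' marks a continuation piece) and
// that special tokens should map to [0, 0].
function computeOffsetsSketch(
  text,
  tokens,
  specialTokens = new Set(['[CLS]', '[SEP]', '[PAD]']),
) {
  // Caveat: toLowerCase() can change string length for a few Unicode
  // characters; this sketch ignores that case.
  const lowered = text.toLowerCase();
  const offsets = [];
  let cursor = 0; // only ever moves forward through the text

  for (const token of tokens) {
    if (specialTokens.has(token)) {
      offsets.push([0, 0]);
      continue;
    }
    const piece = token.startsWith('##') ? token.slice(2) : token;

    // Exact match first, then the case-insensitive fallback.
    let start = text.indexOf(piece, cursor);
    if (start === -1) start = lowered.indexOf(piece.toLowerCase(), cursor);
    if (start === -1) {
      offsets.push([0, 0]); // token could not be located; leave cursor alone
      continue;
    }
    const end = start + piece.length;
    offsets.push([start, end]); // half-open [start, end)
    cursor = end;
  }
  return offsets;
}

// A grouped entity spans from its first token's start to its last token's end.
function groupSpan(tokenOffsets) {
  return [tokenOffsets[0][0], tokenOffsets[tokenOffsets.length - 1][1]];
}

console.log(computeOffsetsSketch(
  'My name is Sarah',
  ['[CLS]', 'my', 'name', 'is', 'sarah', '[SEP]'], // lowercased, as an uncased tokenizer would emit
));
// → [ [ 0, 0 ], [ 0, 2 ], [ 3, 7 ], [ 8, 10 ], [ 11, 16 ], [ 0, 0 ] ]
```

Scanning from a moving cursor rather than from the start of the string keeps repeated words mapped to successive occurrences, which is presumably why the PR describes the scan as going forward through the original text.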

Suh0161 · 2026-05-01T18:30:22Z

Closing in favor of #1670, which contains this work in the same commit stack.

Suh0161 closed this May 1, 2026

Suh0161 mentioned this pull request May 1, 2026

Auto device fallback on provider failure and tokenizer_options passthrough in Text2TextGenerationPipeline #1670

Closed

Suh0161 wants to merge 1 commit into huggingface:main from Suh0161:feat/offsets-and-token-classification
