Add start/end character offsets to token classification and return_offsets_mapping to tokenizer #1668

Closed
Suh0161 wants to merge 1 commit into huggingface:main from Suh0161:feat/offsets-and-token-classification

Suh0161 commented May 1, 2026

Closes #425, closes #633.

This PR closes two related feature-parity gaps with the Python transformers library. They are fixed together because both rely on the same underlying mechanism: character-offset computation.

  1. return_offsets_mapping on the tokenizer ([Feature request] Return offset mapping using tokenizer #425)

PreTrainedTokenizer now accepts return_offsets_mapping: true:

const encoding = tokenizer('My name is Sarah', {
  return_tensor: false,
  return_offsets_mapping: true,
});
// encoding.offset_mapping → [[0,0],[0,2],[3,7],[8,10],[11,16],[0,0]]
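
To illustrate how the mapping might be consumed, here is a minimal sketch that slices the original text with each [start, end) pair. It operates on a plain array shaped like the output above and does not call the library. The helper name `tokensFromOffsets` is hypothetical, and treating every `[0, 0]` pair as a special token is an assumption based on the example.

```javascript
// Hypothetical helper (not part of the library): recover the surface text of
// each token from an offset mapping of [start, end) character pairs.
function tokensFromOffsets(text, offsetMapping) {
  return offsetMapping
    // Assumption: an empty span such as [0, 0] marks a special or padding token.
    .filter(([start, end]) => end > start)
    // The end index is exclusive, which matches String.prototype.slice.
    .map(([start, end]) => text.slice(start, end));
}

const text = 'My name is Sarah';
// Offsets copied from the example above.
const offsetMapping = [[0, 0], [0, 2], [3, 7], [8, 10], [11, 16], [0, 0]];

const surfaces = tokensFromOffsets(text, offsetMapping);
console.log(surfaces); // [ 'My', 'name', 'is', 'Sarah' ]
```

Because the spans index into the original string, the recovered text keeps its original casing even when the tokenizer lowercases its input.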

…ing to tokenizer

- TokenClassificationPipeline now populates start/end character offsets on
  every raw token result by scanning forward through the original text.
  Grouped results (aggregation_strategy='simple') carry the span of the
  first-to-last token in the group.

- PreTrainedTokenizer._call now accepts return_offsets_mapping: true, which
  adds an offset_mapping field (one half-open [start, end) character span per
  token) to the encoding. Works for single strings and batched input; pads
  with [0,0] and strips the field before tensor conversion so it is never
  tensorized.

- Adds computeOffsets() helper with case-insensitive fallback for uncased
  tokenizers (e.g. bert-base-uncased).
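
The forward scan, the case-insensitive fallback, and the first-to-last group span described above can be sketched roughly as follows. This is not the PR's code: the actual computeOffsets() is not shown here, so the function names, the signature, the special-token list, and the handling of "##" continuation pieces are all assumptions made for illustration.

```javascript
// Sketch only. Names and signature are hypothetical, not the PR's computeOffsets().
function computeOffsetsSketch(text, tokens, specialTokens = ['[CLS]', '[SEP]', '[PAD]']) {
  // Caveat: toLowerCase() can change string length for a few Unicode
  // characters, which this sketch does not handle.
  const lowered = text.toLowerCase();
  const offsets = [];
  let cursor = 0; // only moves forward, so repeated words get successive spans

  for (const token of tokens) {
    if (specialTokens.includes(token)) {
      offsets.push([0, 0]); // special tokens have no span in the text
      continue;
    }
    // Assumption: WordPiece-style continuation pieces start with "##".
    const piece = token.startsWith('##') ? token.slice(2) : token;

    // Exact match first, then a case-insensitive fallback for uncased vocabularies.
    let start = text.indexOf(piece, cursor);
    if (start === -1) start = lowered.indexOf(piece.toLowerCase(), cursor);

    if (start === -1) {
      offsets.push([cursor, cursor]); // unmatched token: emit an empty span
      continue;
    }
    const end = start + piece.length;
    offsets.push([start, end]);
    cursor = end;
  }
  return offsets;
}

// A grouped entity spans from its first token's start to its last token's end.
function groupSpan(group) {
  return { start: group[0].start, end: group[group.length - 1].end };
}

const text = 'My name is Sarah';
// Hand-written token list for illustration, not real tokenizer output.
const tokens = ['[CLS]', 'my', 'name', 'is', 'sar', '##ah', '[SEP]'];
const offsets = computeOffsetsSketch(text, tokens);
console.log(offsets); // [[0,0],[0,2],[3,7],[8,10],[11,14],[14,16],[0,0]]

const span = groupSpan([{ start: 11, end: 14 }, { start: 14, end: 16 }]);
console.log(text.slice(span.start, span.end)); // Sarah
```

Scanning forward from a cursor, rather than searching from the start of the text for every token, is what keeps repeated words from all mapping to the first occurrence.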

Closes huggingface#425, closes huggingface#633.

Suh0161 commented May 1, 2026

Closing in favor of #1670, which contains this work in the same commit stack.

Development

Successfully merging this pull request may close these issues.

Is 'aggregation_strategy' parameter available for token classification pipeline?
[Feature request] Return offset mapping using tokenizer