analysis-nori cannot analyze NFD-form Hangul as Korean morphemes

KoreanTokenizer expects precomposed Hangul syllables. When the same Korean text is supplied in NFD form, modern Hangul syllables become conjoining jamo sequences and Nori falls back to UNKNOWN eojeol-sized tokens instead of dictionary morphemes.
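To make the input shape concrete, here is a small sketch using only the JDK's `java.text.Normalizer`. It does not touch Lucene. The class name `NfdHangulDemo` and the helper `codePoints` are made up for illustration. It shows that a single precomposed syllable becomes three conjoining jamo under NFD, which is the form KoreanTokenizer then sees.

```java
import java.text.Normalizer;

public class NfdHangulDemo {
    /** Formats the code points of s as space-separated U+XXXX values. */
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String nfc = "\uD55C"; // the precomposed syllable 한
        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);

        System.out.println(codePoints(nfc)); // U+D55C
        System.out.println(codePoints(nfd)); // U+1112 U+1161 U+11AB
    }
}
```

Both strings render identically in most fonts, which is why the mismatch tends to go unnoticed until a search fails.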

Minimal repro on current main: NFC "한국어 형태소를 분석합니다" produces 한국어/한국/어/형태소/형태/소/를/분석/합니다/하/ᄇ니다. NFD of the same text produces one UNKNOWN token for each whitespace-delimited Korean span. As a result, NFD-sourced text indexed beside NFC Korean text is silently unfindable: neither form yields tokens that match the other.

This input shape is common in the wild: it is the same NFD text Korean users see as "jamo-separated" filenames when macOS-created archives are opened elsewhere. The same bytes reach indexes through filenames, extracted document text, and metadata pipelines.

Proposed fix: an opt-in analysis-nori CharFilter that composes only modern Hangul conjoining jamo sequences (L U+1100..U+1112, V U+1161..U+1175, optional T U+11A8..U+11C2) into precomposed syllables before KoreanTokenizer, with offset correction back to the original input. It is deliberately narrower than NFC: a precomposed LV syllable followed by a trailing jamo is left unchanged (that shape does not occur in NFD output). For general Unicode normalization the ICU module's ICUNormalizer2CharFilter remains the right tool; this covers the common Korean-only case without adding the ICU dependency to nori deployments.
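The composition rule itself can be sketched in plain Java as below. This is not the proposed CharFilter: the class name `ModernJamoComposer` and its `compose` method are hypothetical, it works on a whole string rather than a `Reader`, and it omits offset correction entirely (a real Lucene filter would presumably extend `BaseCharFilter` and record the length changes there). It only illustrates the standard Hangul syllable arithmetic restricted to the ranges listed above.

```java
public final class ModernJamoComposer {
    private static final int S_BASE = 0xAC00;   // first precomposed syllable (가)
    private static final int L_FIRST = 0x1100;  // modern leading consonants
    private static final int L_LAST = 0x1112;
    private static final int V_FIRST = 0x1161;  // modern vowels
    private static final int V_LAST = 0x1175;
    private static final int T_BASE = 0x11A7;   // trailing index 0 means "no trailing consonant"
    private static final int T_FIRST = 0x11A8;  // modern trailing consonants
    private static final int T_LAST = 0x11C2;
    private static final int V_COUNT = 21;
    private static final int T_COUNT = 28;

    private ModernJamoComposer() {}

    /** Composes L V [T] jamo runs into precomposed syllables; everything else is copied as-is. */
    public static String compose(CharSequence in) {
        StringBuilder out = new StringBuilder(in.length());
        int n = in.length();
        int i = 0;
        while (i < n) {
            char l = in.charAt(i);
            boolean isL = l >= L_FIRST && l <= L_LAST;
            if (isL && i + 1 < n) {
                char v = in.charAt(i + 1);
                if (v >= V_FIRST && v <= V_LAST) {
                    int s = S_BASE + ((l - L_FIRST) * V_COUNT + (v - V_FIRST)) * T_COUNT;
                    i += 2;
                    // The trailing consonant is optional.
                    if (i < n) {
                        char t = in.charAt(i);
                        if (t >= T_FIRST && t <= T_LAST) {
                            s += t - T_BASE;
                            i++;
                        }
                    }
                    out.append((char) s);
                    continue;
                }
            }
            // Not the start of an L V sequence (this includes an already
            // precomposed LV syllable followed by a trailing jamo): copy unchanged.
            out.append(l);
            i++;
        }
        return out.toString();
    }
}
```

For example, `ModernJamoComposer.compose("\u1112\u1161\u11AB")` returns `"\uD55C"` (한), while `"\uAC00\u11A8"` (precomposed 가 followed by a trailing ᆨ) comes back unchanged, where full NFC would compose it to 각.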


analysis-nori cannot analyze NFD-form Hangul as Korean morphemes #16241
