analysis-nori cannot analyze NFD-form Hangul as Korean morphemes #16241

@Incheonkirin

Description

KoreanTokenizer expects precomposed Hangul syllables. When the same Korean text is supplied in NFD form, modern Hangul syllables become conjoining jamo sequences and Nori falls back to UNKNOWN eojeol-sized tokens instead of dictionary morphemes.

Minimal repro on current main: NFC "한국어 형태소를 분석합니다" produces 한국어/한국/어/형태소/형태/소/를/분석/합니다/하/ᄇ니다. The NFD form of the same text produces one UNKNOWN token for each whitespace-delimited Korean span. As a result, NFD-sourced text indexed beside NFC Korean text is silently unfindable.
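To make the input shape concrete, here is a minimal JDK-only sketch of what NFD does to precomposed Hangul. It does not call Nori. The class and method names are hypothetical and exist only for illustration; the only library call is `java.text.Normalizer.normalize`.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Lucene or analysis-nori.
class NfdHangulDemo {
    /** Returns the NFD form of the input using the JDK normalizer. */
    public static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** True if the string contains any char in the Hangul Jamo block U+1100..U+11FF. */
    public static boolean containsConjoiningJamo(String s) {
        return s.chars().anyMatch(c -> c >= 0x1100 && c <= 0x11FF);
    }

    public static void main(String[] args) {
        String nfc = "\uD55C\uAD6D\uC5B4"; // 한국어 as three precomposed syllables
        String nfd = toNfd(nfc);
        // Each syllable decomposes into 2 or 3 conjoining jamo.
        System.out.println("NFC chars: " + nfc.length() + ", NFD chars: " + nfd.length());
        System.out.println("NFD contains conjoining jamo: " + containsConjoiningJamo(nfd));
        // The two forms render identically but are not equal as Java strings.
        System.out.println("equal as strings: " + nfc.equals(nfd));
    }
}
```

The two strings look the same on screen, which is why the mismatch tends to go unnoticed until a search misses.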

This input shape is common in the wild: it is the same NFD text Korean users see as "jamo-separated" filenames when macOS-created archives are opened elsewhere. The same bytes reach indexes through filenames, extracted document text, and metadata pipelines.

Proposed fix: an opt-in analysis-nori CharFilter that composes only modern Hangul conjoining jamo sequences (L U+1100..U+1112, V U+1161..U+1175, optional T U+11A8..U+11C2) into precomposed syllables before KoreanTokenizer, with offset correction back to the original input. It is deliberately narrower than NFC: a precomposed LV syllable followed by a trailing jamo is left unchanged (that shape does not occur in NFD output). For general Unicode normalization the ICU module's ICUNormalizer2CharFilter remains the right tool; this covers the common Korean-only case without adding the ICU dependency to nori deployments.
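The composition rule described above can be sketched in plain Java. This is not the proposed CharFilter and uses no Lucene API. `ModernHangulComposer`, `Result`, and `compose` are hypothetical names, and the per-character offset array is a simplification of whatever offset correction the real filter would register with Lucene. The arithmetic is the standard Unicode Hangul syllable composition formula, restricted to the L/V/T ranges listed in this issue.

```java
import java.util.Arrays;

// Hypothetical sketch, not the actual analysis-nori implementation.
class ModernHangulComposer {
    static final int S_BASE = 0xAC00; // first precomposed syllable
    static final int L_BASE = 0x1100; // first leading consonant
    static final int V_BASE = 0x1161; // first vowel
    static final int T_BASE = 0x11A7; // one before the first trailing consonant
    static final int V_COUNT = 21;
    static final int T_COUNT = 28;

    static boolean isL(char c) { return c >= 0x1100 && c <= 0x1112; }
    static boolean isV(char c) { return c >= 0x1161 && c <= 0x1175; }
    static boolean isT(char c) { return c >= 0x11A8 && c <= 0x11C2; }

    /** Composed text plus, for each output char, the input index where it started. */
    static final class Result {
        public final String text;
        public final int[] inputOffsets;
        Result(String text, int[] inputOffsets) {
            this.text = text;
            this.inputOffsets = inputOffsets;
        }
    }

    /** Composes L V [T] jamo runs; every other char is copied through unchanged. */
    public static Result compose(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int[] offsets = new int[in.length()]; // output is never longer than input
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            offsets[out.length()] = i;
            if (isL(c) && i + 1 < in.length() && isV(in.charAt(i + 1))) {
                int l = c - L_BASE;
                int v = in.charAt(i + 1) - V_BASE;
                int t = 0;
                int consumed = 2;
                if (i + 2 < in.length() && isT(in.charAt(i + 2))) {
                    t = in.charAt(i + 2) - T_BASE;
                    consumed = 3;
                }
                out.append((char) (S_BASE + (l * V_COUNT + v) * T_COUNT + t));
                i += consumed;
            } else {
                // Includes precomposed LV syllables followed by a trailing jamo:
                // composition only ever starts at a leading jamo, so they stay as-is.
                out.append(c);
                i++;
            }
        }
        return new Result(out.toString(), Arrays.copyOf(offsets, out.length()));
    }

    public static void main(String[] args) {
        // NFD form of 한국어: 8 conjoining jamo.
        Result r = compose("\u1112\u1161\u11AB\u1100\u116E\u11A8\u110B\u1165");
        System.out.println(r.text + " " + Arrays.toString(r.inputOffsets));
    }
}
```

Because composition is only triggered by a leading jamo, the sketch is narrower than NFC in the way described above: U+AC00 followed by U+11A8 passes through untouched, whereas NFC would merge the pair into a single LVT syllable.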
