Every validation layer depends on accurate syllable boundaries. The RegexSegmenter uses three complementary regex patterns to identify where one syllable ends and the next begins, handling stacked consonants, Kinzi formations, and mixed-script text.

Overview

Myanmar script does not reliably mark word boundaries with whitespace, which makes segmentation challenging. The syllable segmenter breaks continuous Myanmar text into individual syllables using regex-based pattern matching combined with syllable rule validation.

Algorithm Design

The RegexSegmenter uses three regex patterns to identify syllable boundaries:

Pattern 1: Myanmar Consonant Syllable Start

(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])
This pattern identifies consonants (U+1000-U+1021) that start new syllables:
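  • Negative lookbehind: (?<!(?<!\u103a)\u1039) - NOT preceded by a stacking Virama, unless that Virama itself follows an Asat (the Kinzi sequence)
  • Negative lookahead: (?![\u103a\u1039]) - NOT followed by Asat or Virama, so a consonant that closes a syllable or sits on top of a stack is not treated as a start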
This ensures stacked consonants stay together while allowing breaks after Kinzi formations.
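
To see how the two lookarounds interact, the following minimal sketch applies the pattern above with Python's standard `re` module and lists the indices where it matches. The helper name `syllable_start_indices` is made up for illustration and is not part of the myspellchecker API.

```python
import re

# Pattern 1, copied from the text above.
CONSONANT_START = re.compile(
    r"(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])"
)


def syllable_start_indices(text: str) -> list[int]:
    """Return the index of every consonant that Pattern 1 treats as a syllable start.

    Hypothetical helper for illustration only.
    """
    return [m.start() for m in CONSONANT_START.finditer(text)]


# မြန်မာနိုင်ငံ, written with escapes so the code points are unambiguous.
myanmar = (
    "\u1019\u103c\u1014\u103a"        # မြန်
    "\u1019\u102c"                    # မာ
    "\u1014\u102d\u102f\u1004\u103a"  # နိုင်
    "\u1004\u1036"                    # ငံ
)
print(syllable_start_indices(myanmar))  # [0, 4, 6, 11]

# ပစ္စည်း: the second စ follows a plain Virama, so it does not start a syllable.
stacked = "\u1015\u1005\u1039\u1005\u100a\u103a\u1038"
print(syllable_start_indices(stacked))  # [0]
```

The consonants at indices 2 and 9 of the first string are skipped because each is followed by Asat, which marks it as a final rather than a syllable start.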

Pattern 2: Other Syllable Starters

[\u1023-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a\u104b]
Matches:
  • Independent vowels (U+1023-U+102A)
  • Great Sa (U+103F)
  • Symbols (U+104C-U+104F)
  • Digits (U+1040-U+1049)
  • Punctuation (U+104A-U+104B)
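
As a rough illustration of what this character class accepts, the sketch below pairs the pattern with a small lookup table that mirrors the list above. Both `CATEGORIES` and `classify_starter` are hypothetical names introduced here, not library code.

```python
import re
from typing import Optional

# Pattern 2, copied from the text above.
OTHER_STARTERS = re.compile(
    r"[\u1023-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a\u104b]"
)

# Hypothetical lookup table that mirrors the bullet list above.
CATEGORIES = [
    (0x1023, 0x102A, "independent vowel"),
    (0x103F, 0x103F, "great sa"),
    (0x104C, 0x104F, "symbol"),
    (0x1040, 0x1049, "digit"),
    (0x104A, 0x104B, "punctuation"),
]


def classify_starter(ch: str) -> Optional[str]:
    """Name the Pattern 2 category of one character, or None if it does not match."""
    if not OTHER_STARTERS.fullmatch(ch):
        return None
    code = ord(ch)
    for low, high, name in CATEGORIES:
        if low <= code <= high:
            return name
    return None


print(classify_starter("\u1025"))  # independent vowel
print(classify_starter("\u1041"))  # digit
print(classify_starter("\u1000"))  # None: consonants belong to Pattern 1
```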

Pattern 3: Non-Myanmar Characters

[^\u1000-\u104F]+
Groups consecutive non-Myanmar characters (English, punctuation, whitespace) to avoid over-fragmentation.
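
One plausible way to turn the three patterns into a splitter is to join them into a single alternation and cut the text at the start of every match. The sketch below does that with the standard `re` module. It assumes this combination strategy, which the text above does not spell out, and `split_at_boundaries` is a hypothetical name rather than the RegexSegmenter implementation.

```python
import re

# The three patterns, copied from the text above.
PATTERN_1 = r"(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])"
PATTERN_2 = r"[\u1023-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a\u104b]"
PATTERN_3 = r"[^\u1000-\u104F]+"

# Assumption: every match of any pattern marks the start of a new segment.
BOUNDARY = re.compile(f"{PATTERN_1}|{PATTERN_2}|{PATTERN_3}")


def split_at_boundaries(text: str) -> list[str]:
    """Cut text at the start of each boundary match (illustrative sketch)."""
    if not text:
        return []
    starts = [m.start() for m in BOUNDARY.finditer(text)]
    # Keep leading characters that match no pattern, such as a stray medial.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


# "Hello မြန်မာ World", with the Myanmar part written as escapes.
mixed = "Hello \u1019\u103c\u1014\u103a\u1019\u102c World"
print(split_at_boundaries(mixed))
# ['Hello ', 'မြန်', 'မာ', ' World']
```

Dependent signs such as medials, vowels, and Asat match none of the patterns, so they stay attached to the segment opened by the preceding match.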

Implementation

RegexSegmenter Class

from myspellchecker.segmenters.regex import RegexSegmenter

# Create segmenter
segmenter = RegexSegmenter()

# Segment text into syllables
syllables = segmenter.segment_syllables("မြန်မာနိုင်ငံ")
# Result: ["မြန်", "မာ", "နိုင်", "ငံ"]

# Note: RegexSegmenter does NOT support word segmentation.
# segment_words() raises NotImplementedError.
# For word segmentation, use DefaultSegmenter instead:
from myspellchecker.segmenters.default import DefaultSegmenter
word_segmenter = DefaultSegmenter(word_engine='myword')
words = word_segmenter.segment_words("မြန်မာနိုင်ငံ")
# Result: ["မြန်မာ", "နိုင်ငံ"]

Configuration Options

# Allow extended Myanmar Unicode blocks (Shan, Mon, etc.)
segmenter = RegexSegmenter(allow_extended_myanmar=True)

# Strict Burmese only (default)
segmenter = RegexSegmenter(allow_extended_myanmar=False)

Syllable Validation

After segmentation, each syllable is validated using SyllableRuleValidator:
from myspellchecker.core.syllable_rules import SyllableRuleValidator

validator = SyllableRuleValidator()

# Check if syllable has valid structure
is_valid = validator.validate("မြန်")  # True
is_valid = validator.validate("ြမန်")  # False (medial without consonant)

Myanmar Syllable Structure

A valid Myanmar syllable follows this pattern:
Consonant + [Medial(s)] + [Vowel] + [Tone] + [Final]
| Component | Unicode Range | Examples |
|---|---|---|
| Consonant | U+1000-U+1021 | က, ခ, ဂ, မ, န |
| Medials | U+103B-U+103E | ျ, ြ, ွ, ှ |
| Vowels | U+102B-U+1032 | ါ, ာ, ိ, ု, ေ |
| Tone marks | U+1036-U+1038 | ံ, ့, း |
| Asat | U+103A | ် (syllable killer) |
| Virama | U+1039 | ္ (consonant stacker) |
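
The ranges in this table are enough to label each code point in a syllable, which helps when debugging why a string is rejected. The sketch below is a simplified illustration built only from the table. It is not how SyllableRuleValidator works internally, and the names `COMPONENT_RANGES`, `label_components`, and `starts_with_consonant` are hypothetical.

```python
# Hypothetical lookup built from the ranges in the table above.
COMPONENT_RANGES = [
    (0x1000, 0x1021, "consonant"),
    (0x103B, 0x103E, "medial"),
    (0x102B, 0x1032, "vowel"),
    (0x1036, 0x1038, "tone"),
    (0x103A, 0x103A, "asat"),
    (0x1039, 0x1039, "virama"),
]


def label_components(syllable: str) -> list[str]:
    """Label every code point with its component name, or 'other' if unlisted."""
    labels = []
    for ch in syllable:
        code = ord(ch)
        label = next(
            (name for low, high, name in COMPONENT_RANGES if low <= code <= high),
            "other",
        )
        labels.append(label)
    return labels


def starts_with_consonant(syllable: str) -> bool:
    """Check only the first rule of the structure: a syllable begins with a consonant."""
    return bool(syllable) and label_components(syllable[0]) == ["consonant"]


valid = "\u1019\u103c\u1014\u103a"    # မြန်
invalid = "\u103c\u1019\u1014\u103a"  # medial placed before the consonant

print(label_components(valid))         # ['consonant', 'medial', 'consonant', 'asat']
print(starts_with_consonant(valid))    # True
print(starts_with_consonant(invalid))  # False
```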

Performance

The segmenter has two implementations:
| Implementation | Speed | Use Case |
|---|---|---|
| Pure Python (RegexSegmenter) | ~10ms/1K chars | Default, portable |
| Cython (normalize_c.pyx) | ~1ms/1K chars | Production, auto-enabled |
The Cython version is automatically used when available:
# Check which implementation is active
from myspellchecker.segmenters.regex import _HAS_CYTHON_SEGMENTER
print(f"Cython segmenter: {_HAS_CYTHON_SEGMENTER}")
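
Throughput depends on hardware and input, so it can be worth measuring on your own text. The harness below is a minimal sketch using only the standard library. `ms_per_1k_chars` is a hypothetical helper, and the built-in `list` stands in for a real segmenter so that the example runs without the package installed.

```python
import time
from typing import Callable


def ms_per_1k_chars(segment: Callable[[str], list], text: str, repeats: int = 20) -> float:
    """Average wall-clock milliseconds that `segment` needs per 1,000 characters."""
    if not text:
        raise ValueError("text must not be empty")
    start = time.perf_counter()
    for _ in range(repeats):
        segment(text)
    elapsed = time.perf_counter() - start
    per_call_ms = elapsed / repeats * 1000.0
    return per_call_ms * (1000.0 / len(text))


# 3,000 code points of repeated Myanmar text (မြန်မာ x 500).
sample = "\u1019\u103c\u1014\u103a\u1019\u102c" * 500

# `list` only splits the string into code points; it is a placeholder callable.
print(f"{ms_per_1k_chars(list, sample):.4f} ms per 1K chars")
```

To time the real segmenter, pass `RegexSegmenter().segment_syllables` in place of `list`.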

Edge Cases

Stacked Consonants

Stacked consonants (using Virama U+1039) stay together:
သင်္ဘော → ["သင်္ဘော"] (not ["သ", "င်္", "ဘော"])

Kinzi Formation

Kinzi (Nga + Asat + Virama, rendered above the following consonant) is handled correctly:
ကင်္ခ → ["ကင်္ခ"]

Mixed Script

Non-Myanmar text is grouped together:
Hello မြန်မာ World → ["Hello ", "မြန်", "မာ", " World"]

See Also