RegexSegmenter uses three complementary regex patterns to identify where one syllable ends and the next begins, handling stacked consonants, Kinzi formations, and mixed-script text.
Overview
Myanmar text has no whitespace between words, making segmentation challenging. The syllable segmenter breaks continuous Myanmar text into individual syllables using regex-based pattern matching combined with syllable rule validation.
Algorithm Design
The RegexSegmenter uses three regex patterns to identify syllable boundaries:
Pattern 1: Myanmar Consonant Syllable Start
- Negative lookbehind: (?<!(?<!\u103a)\u1039) - NOT preceded by a stacking Virama (unless that Virama is preceded by Asat, as in Kinzi)
- Negative lookahead: (?![\u103a\u1039]) - NOT followed by Asat or Virama
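The two assertions above can be tried out with a minimal sketch. It assumes the consonant class is [\u1000-\u1021] (the range given in the syllable structure table); the names CONSONANT_START and consonant_starts are illustrative and not part of the library.

```python
import re

# Sketch of Pattern 1; the consonant range is an assumption taken from
# the syllable structure table, not from the library source.
CONSONANT_START = re.compile(
    "(?<!(?<!\u103a)\u1039)"  # not preceded by a stacking Virama (Kinzi excepted)
    "[\u1000-\u1021]"         # Myanmar consonant
    "(?![\u103a\u1039])"      # not followed by Asat or Virama
)


def consonant_starts(text: str) -> list[int]:
    """Return the indices at which a consonant begins a new syllable."""
    return [m.start() for m in CONSONANT_START.finditer(text)]


# Ma + medial Ra + Na + Asat + Ma + Aa: the Na is followed by Asat, so it
# closes the first syllable instead of opening a new one.
print(consonant_starts("\u1019\u103c\u1014\u103a\u1019\u102c"))
```

A consonant directly after a plain Virama is rejected by the lookbehind, while a consonant after Asat + Virama (Kinzi) is accepted because the inner lookbehind cancels the outer one.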
Pattern 2: Other Syllable Starters
- Independent vowels (U+1023-U+102A)
- Great Sa (U+103F)
- Symbols (U+104C-U+104F)
- Digits (U+1040-U+1049)
- Punctuation (U+104A-U+104B)
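The ranges listed above fit in a single character class. The following is a sketch only; OTHER_STARTERS and is_other_starter are hypothetical names, and the library may organise these ranges differently.

```python
import re

# One character class built from the ranges listed above.
OTHER_STARTERS = re.compile(
    "["
    "\u1023-\u102a"  # independent vowels
    "\u103f"         # Great Sa
    "\u104c-\u104f"  # symbols
    "\u1040-\u1049"  # digits
    "\u104a-\u104b"  # punctuation
    "]"
)


def is_other_starter(ch: str) -> bool:
    """True if a single character always begins a new segment."""
    return OTHER_STARTERS.fullmatch(ch) is not None


print(is_other_starter("\u1041"))  # Myanmar digit one
print(is_other_starter("\u1000"))  # consonant Ka, covered by Pattern 1 instead
```

Unlike Pattern 1, these characters need no lookaround: each one starts a segment wherever it appears.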
Pattern 3: Non-Myanmar Characters
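The exact expression for this pattern is not shown here. One plausible sketch, assuming "non-Myanmar" means any code point outside the Myanmar block (U+1000-U+109F) and that a whole run of such characters forms one segment, looks like this. The names NON_MYANMAR_RUN and non_myanmar_runs are illustrative, and whitespace handling is left unspecified.

```python
import re

# Assumption: "non-Myanmar" = outside the Myanmar block U+1000-U+109F.
NON_MYANMAR_RUN = re.compile("[^\u1000-\u109f]+")


def non_myanmar_runs(text: str) -> list[str]:
    """Return each maximal run of non-Myanmar characters as one group."""
    return [m.group() for m in NON_MYANMAR_RUN.finditer(text)]


# Ma + Aa, then Latin letters, then Ma + Aa, then ASCII digits.
print(non_myanmar_runs("\u1019\u102cABC\u1019\u102c12"))
```

Under this assumption a boundary falls only before the first character of each run, so Latin words and ASCII numbers are not split character by character.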
Implementation
RegexSegmenter Class
Configuration Options
Syllable Validation
After segmentation, each syllable is validated using SyllableRuleValidator.
Myanmar Syllable Structure
A valid Myanmar syllable follows this pattern:
| Component | Unicode Range | Examples |
|---|---|---|
| Consonant | U+1000-U+1021 | က, ခ, ဂ, မ, န |
| Medials | U+103B-U+103E | ျ, ြ, ွ, ှ |
| Vowels | U+102B-U+1032 | ါ, ာ, ိ, ု, ေ |
| Tone marks | U+1036-U+1038 | ံ, ့, း |
| Asat | U+103A | ် (syllable killer) |
| Virama | U+1039 | ္ (consonant stacker) |
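The table can be turned into a small lookup for experimenting with syllable structure. This is a deliberately loose sketch, not the rules applied by SyllableRuleValidator: it only checks that a syllable starts with a consonant and contains nothing outside the listed ranges. COMPONENTS, classify, and looks_like_syllable are hypothetical names.

```python
# Ranges copied from the component table; the check itself is a simplification.
COMPONENTS = {
    "consonant": (0x1000, 0x1021),
    "medial": (0x103B, 0x103E),
    "vowel": (0x102B, 0x1032),
    "tone": (0x1036, 0x1038),
    "asat": (0x103A, 0x103A),
    "virama": (0x1039, 0x1039),
}


def classify(ch: str):
    """Return the component name for one character, or None if unlisted."""
    cp = ord(ch)
    for name, (lo, hi) in COMPONENTS.items():
        if lo <= cp <= hi:
            return name
    return None


def looks_like_syllable(syllable: str) -> bool:
    """Loose check: consonant first, then only characters from the table."""
    if not syllable:
        return False
    kinds = [classify(ch) for ch in syllable]
    return kinds[0] == "consonant" and None not in kinds


# Ma + medial Ra + Na + Asat
print([classify(ch) for ch in "\u1019\u103c\u1014\u103a"])
print(looks_like_syllable("\u1019\u103c\u1014\u103a"))
```

A real validator would also constrain the order and count of components; this sketch ignores both.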
Performance
The segmenter has two implementations:
| Implementation | Speed | Use Case |
|---|---|---|
| Pure Python (RegexSegmenter) | ~10ms/1K chars | Default, portable |
| Cython (normalize_c.pyx) | ~1ms/1K chars | Production, auto-enabled |
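Figures like these depend on hardware and input, so it can be worth measuring locally. The sketch below times only a stand-in regex scan (the Pattern 1 expression with an assumed consonant range), not the library itself; ms_per_1k_chars is a hypothetical helper that could be pointed at any callable.

```python
import re
import timeit

# Stand-in workload: the Pattern 1 expression, with an assumed consonant range.
BOUNDARY = re.compile("(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])")


def ms_per_1k_chars(text: str, repeats: int = 20) -> float:
    """Average milliseconds spent scanning 1,000 characters of `text`."""
    seconds = timeit.timeit(lambda: BOUNDARY.findall(text), number=repeats)
    return seconds / repeats * 1000 * (1000 / len(text))


sample = "\u1019\u103c\u1014\u103a\u1019\u102c" * 500  # 3,000 characters
print(f"{ms_per_1k_chars(sample):.3f} ms per 1K chars")
```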
Edge Cases
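The cases in this section can be explored with a small stand-in that combines the three patterns into one alternation and cuts the text at every match. This is a sketch under stated assumptions, not the RegexSegmenter source: the consonant range comes from the syllable structure table, "non-Myanmar" is taken to mean outside U+1000-U+109F, and SYLLABLE_START and segment are illustrative names.

```python
import re

# Hypothetical combination of the three patterns described earlier.
SYLLABLE_START = re.compile(
    "(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])"  # consonant start
    "|[\u1023-\u102a\u103f\u1040-\u104f]"                       # other starters
    "|(?<![^\u1000-\u109f])[^\u1000-\u109f]"                    # first char of a non-Myanmar run
)


def segment(text: str) -> list[str]:
    """Cut `text` before every position where a syllable or run starts."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading characters that matched nothing
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends) if s < e]


# Stacked consonants: Ba + U + Da + Virama + Dha stays in one piece.
print(segment("\u1017\u102f\u1012\u1039\u1013"))
# Kinzi: A + Nga + Asat + Virama, then Ga, then La + I + Pa + Asat.
print(segment("\u1021\u1004\u103a\u1039\u1002\u101c\u102d\u1015\u103a"))
# Mixed script: Ma + Aa followed by Latin letters.
print(segment("\u1019\u102cABC"))
```

In this sketch the Kinzi sequence stays attached to the preceding syllable and the consonant after it opens a new one, which follows from the nested lookbehind in Pattern 1.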
Stacked Consonants
Stacked consonants (using Virama U+1039) stay together.
Kinzi Formation
Kinzi (Asat + Virama + Consonant) is handled correctly.
Mixed Script
Non-Myanmar text is grouped together.
See Also
- Syllable Validation - Validation rules
- Word Segmentation - Word-level assembly
- Cython Guide - Performance optimization