Syllable Segmentation Algorithm

Every validation layer depends on accurate syllable boundaries. The RegexSegmenter uses three complementary regex patterns to identify where one syllable ends and the next begins, handling stacked consonants, Kinzi formations, and mixed-script text.

Overview

Myanmar text is written without spaces between words, so syllable boundaries must be inferred from the characters themselves. The syllable segmenter breaks continuous Myanmar text into individual syllables using regex-based pattern matching combined with syllable rule validation.

Algorithm Design

The RegexSegmenter uses three regex patterns to identify syllable boundaries:

Pattern 1: Myanmar Consonant Syllable Start

(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])

This pattern identifies consonants (U+1000-U+1021) that start new syllables:

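Negative lookbehind: (?<!(?<!\u103a)\u1039) - the consonant is NOT preceded by a stacking Virama (U+1039). The nested lookbehind exempts a Virama that itself follows Asat (U+103A), which is the Kinzi case.
Negative lookahead: (?![\u103a\u1039]) - the consonant is NOT followed by Asat or Virama, so it is neither a syllable-final consonant nor the upper consonant of a stack.

Together, the two lookarounds keep a stacked consonant attached to the syllable it belongs to, while still allowing a new syllable to start on the consonant that follows a Kinzi.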
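The effect of the two lookarounds is easiest to see by running Pattern 1 on its own. The following is a minimal sketch using Python's standard re module; the helper name consonant_start_offsets is hypothetical and is not part of the library.

```python
import re

# Pattern 1 exactly as given above; the helper around it is illustrative only.
CONSONANT_START = re.compile(
    r"(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])"
)

def consonant_start_offsets(text: str) -> list:
    """Return the offsets of consonants that Pattern 1 accepts as syllable starts."""
    return [m.start() for m in CONSONANT_START.finditer(text)]

# မြန်မာ = မ ြ န ် မ ာ: the န at offset 2 is followed by Asat, so it is skipped.
print(consonant_start_offsets("\u1019\u103c\u1014\u103a\u1019\u102c"))  # [0, 4]

# ပစ္စည်း = ပ စ ္ စ ည ် း: both stacked စ and the final ည are skipped.
print(consonant_start_offsets("\u1015\u1005\u1039\u1005\u100a\u103a\u1038"))  # [0]
```

In the second word, only the first consonant qualifies: the upper စ is followed by Virama, the lower စ is preceded by it, and ည is followed by Asat.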

Pattern 2: Other Syllable Starters

[\u1023-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a\u104b]

Matches:

Independent vowels (U+1023-U+102A)
Great Sa (U+103F)
Symbols (U+104C-U+104F)
Digits (U+1040-U+1049)
Punctuation (U+104A-U+104B)
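As a rough illustration of the categories listed above, the sketch below maps a single character matched by Pattern 2 to its category. The CATEGORIES table and the starter_category helper are hypothetical names introduced here for illustration; only the character class itself comes from the pattern above.

```python
import re
from typing import Optional

# Pattern 2 exactly as given above.
OTHER_START = re.compile(
    r"[\u1023-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a\u104b]"
)

# Hypothetical lookup table mirroring the list of ranges above.
CATEGORIES = [
    (0x1023, 0x102A, "independent vowel"),
    (0x103F, 0x103F, "great sa"),
    (0x1040, 0x1049, "digit"),
    (0x104A, 0x104B, "punctuation"),
    (0x104C, 0x104F, "symbol"),
]

def starter_category(ch: str) -> Optional[str]:
    """Name the Pattern 2 category of a single character, or None if it does not match."""
    if not OTHER_START.fullmatch(ch):
        return None
    code_point = ord(ch)
    for low, high, name in CATEGORIES:
        if low <= code_point <= high:
            return name
    return None

print(starter_category("\u1041"))  # digit
print(starter_category("\u1000"))  # None (a consonant, handled by Pattern 1)
```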

Pattern 3: Non-Myanmar Characters

[^\u1000-\u104F]+

Groups consecutive non-Myanmar characters (Latin letters, non-Myanmar punctuation, whitespace) into a single segment to avoid over-fragmentation.
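One way the three patterns could work together is to treat the start of every match as a boundary and cut the text at those offsets. The sketch below assumes exactly that and nothing more: it is not the library's implementation, it skips syllable rule validation, and the names BOUNDARY and split_at_starts are hypothetical.

```python
import re

# The three patterns above, joined into one alternation.
BOUNDARY = re.compile(
    r"(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])"       # Pattern 1
    r"|[\u1023-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a\u104b]"  # Pattern 2
    r"|[^\u1000-\u104F]+"                                            # Pattern 3
)

def split_at_starts(text: str) -> list:
    """Cut text at the start offset of every pattern match."""
    starts = [m.start() for m in BOUNDARY.finditer(text)]
    if not starts or starts[0] != 0:
        # Leading marks with no starter before them stay in the first chunk.
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends) if s < e]

# မြန်မာနိုင်ငံ written with escapes so the code points are unambiguous.
word = "\u1019\u103c\u1014\u103a\u1019\u102c\u1014\u102d\u102f\u1004\u103a\u1004\u1036"
print(split_at_starts(word))
print(split_at_starts("Hello " + word[:6] + " World"))
```

Under these assumptions the first call yields four chunks (မြန်, မာ, နိုင်, ငံ) and the second keeps "Hello " and " World" as single segments, because Pattern 3 consumes each non-Myanmar run in one match.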

Implementation

RegexSegmenter Class

from myspellchecker.segmenters.regex import RegexSegmenter

# Create segmenter
segmenter = RegexSegmenter()

# Segment text into syllables
syllables = segmenter.segment_syllables("မြန်မာနိုင်ငံ")
# Result: ["မြန်", "မာ", "နိုင်", "ငံ"]

# Note: RegexSegmenter does NOT support word segmentation.
# segment_words() raises NotImplementedError.
# For word segmentation, use DefaultSegmenter instead:
from myspellchecker.segmenters.default import DefaultSegmenter
word_segmenter = DefaultSegmenter(word_engine='myword')
words = word_segmenter.segment_words("မြန်မာနိုင်ငံ")
# Result: ["မြန်မာ", "နိုင်ငံ"]

Configuration Options

# Allow extended Myanmar Unicode blocks (Shan, Mon, etc.)
segmenter = RegexSegmenter(allow_extended_myanmar=True)

# Strict Burmese only (default)
segmenter = RegexSegmenter(allow_extended_myanmar=False)
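To decide which setting a given input needs, it can help to check whether the text contains Myanmar code points outside the U+1000-U+104F range used by the patterns above. The sketch below is an assumption-laden illustration: it treats "extended" as U+1050-U+109F plus the Myanmar Extended-A and Extended-B blocks, which may not be the exact set the option covers, and uses_extended_myanmar is a hypothetical helper.

```python
import re

# Assumption: "extended" means Myanmar code points beyond U+104F, i.e. the rest of
# the Myanmar block plus Myanmar Extended-B (U+A9E0-U+A9FF) and Extended-A (U+AA60-U+AA7F).
EXTENDED_MYANMAR = re.compile(r"[\u1050-\u109f\ua9e0-\ua9ff\uaa60-\uaa7f]")

def uses_extended_myanmar(text: str) -> bool:
    """Return True if text contains any code point from the assumed extended ranges."""
    return EXTENDED_MYANMAR.search(text) is not None

print(uses_extended_myanmar("\u1075"))        # True: a Shan letter
print(uses_extended_myanmar("\u1019\u102c"))  # False: plain Burmese မာ
```

A caller could use a check like this to choose between the two configurations shown above.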

Syllable Validation

After segmentation, each syllable is validated using SyllableRuleValidator:

from myspellchecker.core.syllable_rules import SyllableRuleValidator

validator = SyllableRuleValidator()

# Check if syllable has valid structure
is_valid = validator.validate("မြန်")  # True
is_valid = validator.validate("ြမန်")  # False (medial appears before its base consonant)

Myanmar Syllable Structure

A valid Myanmar syllable follows this pattern:

Consonant + [Medial(s)] + [Vowel] + [Tone] + [Final]

Component	Unicode Range	Examples
Consonant	U+1000-U+1021	က, ခ, ဂ, မ, န
Medials	U+103B-U+103E	ျ, ြ, ွ, ှ
Vowels	U+102B-U+1032	ါ, ာ, ိ, ု, ေ
Tone marks	U+1036-U+1038	ံ, ့, း
Asat	U+103A	် (vowel killer)
Virama	U+1039	္ (consonant stacker)
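The ranges in the table can be turned into a small lookup that labels each character of a syllable, which is handy when checking why a syllable does or does not fit the pattern. This is an illustrative sketch only; COMPONENT_RANGES and components are hypothetical names and say nothing about how SyllableRuleValidator works internally.

```python
# Ranges copied from the table above; the names are illustrative.
COMPONENT_RANGES = [
    (0x1000, 0x1021, "consonant"),
    (0x102B, 0x1032, "vowel"),
    (0x1036, 0x1038, "tone"),
    (0x1039, 0x1039, "virama"),
    (0x103A, 0x103A, "asat"),
    (0x103B, 0x103E, "medial"),
]

def components(syllable: str) -> list:
    """Label every character of a syllable with its component name from the table."""
    labels = []
    for ch in syllable:
        code_point = ord(ch)
        label = next(
            (name for low, high, name in COMPONENT_RANGES if low <= code_point <= high),
            "other",
        )
        labels.append(label)
    return labels

# မြန် = မ ြ န ်
print(components("\u1019\u103c\u1014\u103a"))  # ['consonant', 'medial', 'consonant', 'asat']
# The same characters with the medial first, which breaks the pattern.
print(components("\u103c\u1019\u1014\u103a"))  # ['medial', 'consonant', 'consonant', 'asat']
```

The second sequence starts with a medial rather than a consonant, so it cannot match the Consonant + [Medial(s)] + ... shape.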

Performance

The segmenter has two implementations:

Implementation	Time per 1K chars	Use Case
Pure Python (`RegexSegmenter`)	~10 ms	Default, portable
Cython (`normalize_c.pyx`)	~1 ms	Production, auto-enabled

The Cython version is automatically used when available:

# Check which implementation is active
from myspellchecker.segmenters.regex import _HAS_CYTHON_SEGMENTER
print(f"Cython segmenter: {_HAS_CYTHON_SEGMENTER}")
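The source does not show how the automatic selection is wired up. A common way to get this behaviour in Python is an optional import with a pure-Python fallback, sketched below. The helper load_accelerated and the module name in the usage lines are hypothetical and are not taken from the library.

```python
import importlib

def load_accelerated(module_name: str, attr: str, fallback):
    """Return (callable, accelerated): the compiled function if importable, else the fallback."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr), True
    except (ImportError, AttributeError):
        # Extension not built or not installed: stay on the portable implementation.
        return fallback, False

def pure_python_segment(text: str) -> list:
    """Placeholder standing in for a pure-Python implementation."""
    return list(text)

# "hypothetical_fast_segmenter" is a made-up module name, so the fallback is chosen.
segment, accelerated = load_accelerated(
    "hypothetical_fast_segmenter", "segment", pure_python_segment
)
print(accelerated)  # False
```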

Edge Cases

Stacked Consonants

Stacked consonants (using Virama U+1039) stay together:

သင်္ဘော → ["သင်္ဘော"] (not ["သ", "င်္", "ဘော"])

Kinzi Formation

Kinzi (Nga + Asat + Virama, U+1004 U+103A U+1039, written above the following consonant) is handled correctly:

ကင်္ခ → ["ကင်္ခ"]
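When debugging these two edge cases it is useful to tell a Kinzi apart from an ordinary stack, since both contain Virama. In Unicode, Kinzi is encoded as U+1004 U+103A U+1039, so a Virama preceded by Asat marks a Kinzi and any other Virama marks a plain stack. The sketch below reports the offsets of each; virama_report is a hypothetical helper, not a library function.

```python
import re

KINZI = re.compile(r"\u1004\u103a\u1039")       # Nga + Asat + Virama
PLAIN_STACK = re.compile(r"(?<!\u103a)\u1039")  # Virama not preceded by Asat

def virama_report(text: str) -> dict:
    """Return the start offsets of Kinzi sequences and of plain stacking Viramas."""
    return {
        "kinzi": [m.start() for m in KINZI.finditer(text)],
        "stack": [m.start() for m in PLAIN_STACK.finditer(text)],
    }

# သင်္ဘော = သ င ် ္ ဘ ေ ာ
print(virama_report("\u101e\u1004\u103a\u1039\u1018\u1031\u102c"))  # {'kinzi': [1], 'stack': []}
# ပစ္စည်း = ပ စ ္ စ ည ် း
print(virama_report("\u1015\u1005\u1039\u1005\u100a\u103a\u1038"))  # {'kinzi': [], 'stack': [2]}
```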

Mixed Script

Non-Myanmar text is grouped together:

Hello မြန်မာ World → ["Hello ", "မြန်", "မာ", " World"]

