Myanmar text segmentation operates at two levels — syllable and word — because Myanmar script has no spaces between words. mySpellChecker separates these concerns: a fast regex-based segmenter handles syllables, while a configurable word engine (myword, CRF, or transformer) handles word boundaries.
## Architecture

| Level | Default Engine | How It Works | Downloads |
|---|---|---|---|
| Syllable | RegexSegmenter | Regex-based Sylbreak algorithm | None |
| Word | myword | Viterbi with unigram/bigram dictionary | segmentation.mmap from HuggingFace |
## Quick Start

```python
from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter()

# Syllable segmentation (RegexSegmenter internally, no download)
syllables = segmenter.segment_syllables("မြန်မာနိုင်ငံ")
# ['မြန်', 'မာ', 'နိုင်', 'ငံ']

# Word segmentation (myword engine, downloads dictionary on first call)
words = segmenter.segment_words("မြန်မာနိုင်ငံသည်")
# ['မြန်မာ', 'နိုင်ငံ', 'သည်']

# Sentence segmentation
sentences = segmenter.segment_sentences("ပထမစာ။ ဒုတိယစာ။")
# ['ပထမစာ။', 'ဒုတိယစာ။']
```
## Syllable Segmentation

### RegexSegmenter
All syllable segmentation in mySpellChecker uses RegexSegmenter — a pure-Python, rule-based segmenter with zero external dependencies and no network downloads.
```python
from myspellchecker.segmenters import RegexSegmenter

segmenter = RegexSegmenter(
    allow_extended_myanmar=False,  # Include Extended Myanmar Unicode blocks (default: False)
)

syllables = segmenter.segment_syllables("မြန်မာစကား")
# ['မြန်', 'မာ', 'စ', 'ကား']
```
Characteristics:
- Pure Python with optional Cython acceleration
- No downloads, no model, no dictionary needed
- Fork-safe for multiprocessing
- Handles stacked consonants (Virama ္), Kinzi sequences, and non-Myanmar text
RegexSegmenter only supports syllable and sentence segmentation. It raises NotImplementedError for segment_words(). Use DefaultSegmenter for word segmentation.
### Sylbreak Algorithm

The segmenter uses an adapted Sylbreak algorithm:
```python
# Pattern components:

# 1. Myanmar consonant not preceded by a stacking Virama
p_my_cons = r"(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])"

# 2. Independent vowels, digits, symbols
p_other_starters = r"[\u1023-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a\u104b]"

# 3. Non-Myanmar characters (grouped into runs)
p_non_myanmar = r"[^\u1000-\u109F]+"
```
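These components can be combined into a working break pattern. The sketch below (which repeats the component definitions so it is self-contained) illustrates the Sylbreak idea of inserting a break marker before every syllable starter; it is not the library's exact implementation:

```python
import re

# The three starter patterns from above.
p_my_cons = r"(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])"
p_other_starters = r"[\u1023-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a\u104b]"
p_non_myanmar = r"[^\u1000-\u109F]+"

# A new syllable starts wherever one of the three patterns matches,
# so insert a marker before each match and split on it.
SYL_BREAK = re.compile(f"({p_my_cons}|{p_other_starters}|{p_non_myanmar})")

def segment_syllables(text: str) -> list[str]:
    return SYL_BREAK.sub(r" \1", text).split()

segment_syllables("မြန်မာနိုင်ငံ")
# ['မြန်', 'မာ', 'နိုင်', 'ငံ']
```

Note how the lookahead `(?![\u103a\u1039])` keeps a consonant attached to a preceding syllable when it carries an Asat or Virama, which is why နိုင် stays intact.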
### Cython Acceleration

RegexSegmenter automatically uses a Cython-compiled implementation when available:
```python
from myspellchecker.segmenters.regex import _HAS_CYTHON_SEGMENTER

if _HAS_CYTHON_SEGMENTER:
    print("Using fast Cython implementation")
else:
    print("Using pure Python implementation")
```
## Word Segmentation

Word segmentation is handled by DefaultSegmenter, which delegates to one of three word engines. Both myword and crf download resources from HuggingFace on first use.
### Word Engines

| Engine | Accuracy | Speed | Model Source | Dependencies |
|---|---|---|---|---|
| myword (default) | ~90% | Fast | segmentation.mmap from HuggingFace | None (pure Python + Cython) |
| crf | ~92% | Medium | wordseg_c2_crf.crfsuite from HuggingFace | pycrfsuite |
| transformer | ~96% | Slow (CPU) / Fast (GPU) | chuuhtetnaing/myanmar-text-segmentation-model | transformers, torch |
### Engine Selection

```python
from myspellchecker.segmenters import DefaultSegmenter

# myword (default) — Viterbi with unigram/bigram dictionary
segmenter = DefaultSegmenter(word_engine="myword")

# CRF — Conditional Random Field sequence tagger
segmenter = DefaultSegmenter(word_engine="crf")

# Transformer — XLM-RoBERTa fine-tuned for word boundaries
segmenter = DefaultSegmenter(
    word_engine="transformer",
    seg_model="chuuhtetnaing/myanmar-text-segmentation-model",  # default
    seg_device=-1,  # -1 = CPU, 0+ = GPU
)
```
### HuggingFace Resource Downloads

The myword and crf engines download their resources from the thettwe/myspellchecker-resources HuggingFace dataset repository on first use:

| Engine | Resource | File | Format |
|---|---|---|---|
| myword | Word segmentation dictionary | segmentation/segmentation.mmap | Memory-mapped |
| crf | CRF model | models/wordseg_c2_crf.crfsuite | CRF model file |
Resources are cached at ~/.cache/myspellchecker/resources/ and only downloaded once.
```shell
# Override the cache directory
export MYSPELL_CACHE_DIR="/path/to/cache"

# Prevent network downloads (fail if a resource is not cached)
export MYSPELL_OFFLINE=true
```
Word segmenters use lazy initialization — no download occurs when you create a DefaultSegmenter or SpellChecker. The download happens on the first call to segment_words().
### myword Engine

The default word segmentation engine, based on myWord by Ye Kyaw Thu. Uses a Viterbi algorithm with unigram and bigram probabilities from a memory-mapped dictionary.
```python
segmenter = DefaultSegmenter(word_engine="myword")

words = segmenter.segment_words("မြန်မာနိုင်ငံသည်ကောင်းသည်")
# ['မြန်မာ', 'နိုင်ငံ', 'သည်', 'ကောင်း', 'သည်']

# Load additional custom words into the myword dictionary
segmenter.load_custom_dictionary(["ကျွန်တော်တို့", "မိသားစု"])
```
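For intuition, the Viterbi search over candidate words can be sketched in a few lines. This toy version uses unigram probabilities only and a made-up dictionary; the library's real engine also uses bigrams and a memory-mapped store:

```python
import math

def viterbi_segment(syllables: list[str], unigram_probs: dict[str, float],
                    max_len: int = 4) -> list[str]:
    """Unigram-only Viterbi: best[i] is the max log-probability of
    segmenting syllables[:i]; back[i] records where the last word starts."""
    n = len(syllables)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = "".join(syllables[j:i])
            # Unknown words get a heavy penalty instead of zero probability.
            logp = math.log(unigram_probs.get(word, 1e-10))
            if best[j] + logp > best[i]:
                best[i] = best[j] + logp
                back[i] = j
    # Recover the best segmentation by walking the backpointers.
    words, i = [], n
    while i > 0:
        words.append("".join(syllables[back[i]:i]))
        i = back[i]
    return list(reversed(words))

viterbi_segment(["မြန်", "မာ", "နိုင်", "ငံ"],
                {"မြန်မာ": 0.5, "နိုင်ငံ": 0.5})
# ['မြန်မာ', 'နိုင်ငံ']
```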
### CRF Engine

CRF-based sequence tagger trained on the myPOS corpus by Ye Kyaw Thu. Requires pycrfsuite.
```python
segmenter = DefaultSegmenter(word_engine="crf")
words = segmenter.segment_words("မြန်မာနိုင်ငံသည်")
```
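A CRF tagger classifies each unit as B (begin) or I (inside) based on features from a small context window. The feature extractor below is purely illustrative, since the actual wordseg_c2_crf feature template is not documented here:

```python
def char_features(chars: list[str], i: int) -> dict[str, str]:
    """Toy character-window features for a B/I word-boundary tagger."""
    feats = {"c0": chars[i]}  # the current character
    if i > 0:
        feats["c-1"] = chars[i - 1]             # previous character
        feats["c-1c0"] = chars[i - 1] + chars[i]  # bigram ending here
    if i < len(chars) - 1:
        feats["c+1"] = chars[i + 1]             # next character
    return feats
```

In a real pipeline each position's feature dict is fed to pycrfsuite, and the predicted B/I sequence is decoded into words.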
### Transformer Engine

XLM-RoBERTa model fine-tuned for Myanmar word boundary detection by Chuu Htet Naing. Uses B/I (Begin/Inside) token classification.

```shell
pip install myspellchecker[transformers]
```

```python
segmenter = DefaultSegmenter(
    word_engine="transformer",
    seg_device=0,  # GPU for speed
)

words = segmenter.segment_words("မြန်မာနိုင်ငံသည်")
```
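Decoding the model's B/I labels back into words is simple. A minimal sketch, assuming one label per character:

```python
def decode_bi(chars: list[str], labels: list[str]) -> list[str]:
    """Merge characters into words: 'B' starts a new word, 'I' extends it."""
    words: list[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words

decode_bi(list("abcd"), ["B", "I", "B", "I"])
# ['ab', 'cd']
```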
| Attribute | Value |
|---|---|
| Model | chuuhtetnaing/myanmar-text-segmentation-model |
| Base | XLM-RoBERTa |
| Accuracy | 96.17% |
| F1 Score | 78.66% |
## Segmenter Interface

All segmenters implement the Segmenter abstract base class:
```python
from abc import ABC, abstractmethod
# Importable as: from myspellchecker.segmenters.base import Segmenter

class Segmenter(ABC):
    @abstractmethod
    def segment_syllables(self, text: str) -> list[str]:
        """Segment text into syllables."""

    @abstractmethod
    def segment_words(self, text: str) -> list[str]:
        """Segment text into words."""

    @abstractmethod
    def segment_sentences(self, text: str) -> list[str]:
        """Segment text into sentences."""

    def segment_and_tag(self, text: str) -> tuple[list[str], list[str]]:
        """Segment and POS-tag simultaneously. Optional — raises NotImplementedError by default."""
        raise NotImplementedError
```
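A custom segmenter only needs to implement the three abstract methods. The sketch below mirrors the interface with a local stand-in so it runs on its own; in real code you would subclass myspellchecker.segmenters.base.Segmenter instead:

```python
from abc import ABC, abstractmethod

class Segmenter(ABC):
    """Local stand-in mirroring the interface above (illustration only)."""
    @abstractmethod
    def segment_syllables(self, text: str) -> list[str]: ...
    @abstractmethod
    def segment_words(self, text: str) -> list[str]: ...
    @abstractmethod
    def segment_sentences(self, text: str) -> list[str]: ...

class WhitespaceSegmenter(Segmenter):
    """Toy segmenter that assumes the text is already space-delimited."""
    def segment_syllables(self, text: str) -> list[str]:
        return text.split()

    def segment_words(self, text: str) -> list[str]:
        return text.split()

    def segment_sentences(self, text: str) -> list[str]:
        # Split on ။ and reattach the terminator to each sentence.
        return [s.strip() + "။" for s in text.split("။") if s.strip()]
```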
### DefaultSegmenter

The production segmenter that combines RegexSegmenter (syllables) with a configurable word engine:
```python
from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter(
    word_engine="myword",          # "myword" | "crf" | "transformer"
    allow_extended_myanmar=False,  # Extended Myanmar Unicode blocks
    seg_model=None,                # Custom model path (transformer only)
    seg_device=-1,                 # -1 = CPU, 0+ = GPU (transformer only)
)
```
## Usage with SpellChecker

### Via Configuration

```python
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    word_engine="myword",  # or "crf" or "transformer"
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
```
### Custom Segmenter

```python
from myspellchecker import SpellChecker
from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter(word_engine="crf")
checker = SpellChecker(segmenter=segmenter)
```
### Via Builder

```python
from myspellchecker.core.builder import SpellCheckerBuilder
from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter(word_engine="crf")
checker = SpellCheckerBuilder().with_segmenter(segmenter).build()
```
## Comparison

| Segmenter | Level | Speed | Memory | Dependencies |
|---|---|---|---|---|
| RegexSegmenter | Syllable | Very Fast | Very Low | None |
| DefaultSegmenter (myword) | Word | Fast | Low | Downloads mmap dictionary |
| DefaultSegmenter (crf) | Word | Medium | Low | pycrfsuite + downloads CRF model |
| DefaultSegmenter (transformer) | Word | Slow (CPU) / Fast (GPU) | High (~500MB) | transformers, torch |
## Sentence Boundaries

All segmenters split on the Myanmar sentence separator (။):

```python
text = "ပထမစာကြောင်း။ ဒုတိယစာကြောင်း။ တတိယစာကြောင်း။"
sentences = segmenter.segment_sentences(text)
# ['ပထမစာကြောင်း။', 'ဒုတိယစာကြောင်း။', 'တတိယစာကြောင်း။']
```
DefaultSegmenter also detects sentence-final particles (SFPs) as implicit sentence boundaries in longer texts.
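The separator split (SFP detection aside) can be sketched with a single regex that keeps the terminator attached to each sentence:

```python
import re

def segment_sentences(text: str) -> list[str]:
    # Grab runs of non-။ characters together with their trailing ။, if any.
    return [s.strip() for s in re.findall("[^။]+။?", text) if s.strip()]

segment_sentences("ပထမစာကြောင်း။ ဒုတိယစာကြောင်း။")
# ['ပထမစာကြောင်း။', 'ဒုတိယစာကြောင်း။']
```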
## See Also