The tokenizers module provides low-level text splitting utilities for Myanmar text. Unlike Segmenters (which may involve complex logic and dictionary lookups), Tokenizers provide direct access to segmentation algorithms.

Overview

| Tokenizer | Algorithm | Purpose | Speed |
| --- | --- | --- | --- |
| SyllableTokenizer | Regex-based | Split text into syllables | Very fast |
| WordTokenizer | CRF or Viterbi | Split text into words | Fast |
| TransformerWordSegmenter | HuggingFace token classification | Split text into words using B/I labels | Model-dependent |

SyllableTokenizer

A fast, regex-based tokenizer that splits Myanmar text into syllables using the Sylbreak algorithm rules.

Initialization

from myspellchecker.tokenizers import SyllableTokenizer

tokenizer = SyllableTokenizer()

Basic Usage

tokenizer = SyllableTokenizer()

# Basic syllable tokenization
syllables = tokenizer.tokenize("မြန်မာနိုင်ငံ")
# Output: ['မြန်', 'မာ', 'နိုင်', 'ငံ']

# Handles stacked consonants correctly
syllables = tokenizer.tokenize("သင်္ဘော")
# Output: ['သင်္ဘော']  # Kinzi preserved as single unit

# Handles mixed content
syllables = tokenizer.tokenize("မြန်မာ123abc")
# Output: ['မြန်', 'မာ', '1', '2', '3', 'a', 'b', 'c']

How It Works

The tokenizer uses regex patterns to identify syllable boundaries based on:
  1. Myanmar consonants (U+1000-U+1021)
  2. Virama/Asat markers (္ and ်) for stacking detection
  3. Negative lookbehind to preserve stacked consonants
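# Internal pattern logic (simplified)
# Break before a consonant that is not preceded by virama (္)
# and not followed by asat (်) or virama (္)
pattern = r"((?<!္)[က-အ](?![်္])|[a-zA-Z0-9\s...])"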
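
The rule above can be illustrated with a minimal, self-contained sketch. It is not the library's actual pattern: it covers only the Myanmar consonant rule (no Latin, digit, or punctuation handling), and the names BREAK_BEFORE and split_syllables are hypothetical.

```python
import re

# Assumption: a simplified Sylbreak-style rule, not myspellchecker's real pattern.
# A syllable starts at a consonant (U+1000-U+1021) that is
#   - not preceded by virama U+1039 (so it is not the lower half of a stack), and
#   - not followed by asat U+103A or virama U+1039 (so it is not a final).
BREAK_BEFORE = re.compile(r"(?<!\u1039)[\u1000-\u1021](?![\u103A\u1039])")


def split_syllables(text: str) -> list[str]:
    """Split text at every position where BREAK_BEFORE matches."""
    starts = [m.start() for m in BREAK_BEFORE.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading non-consonant characters
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


print(split_syllables("မြန်မာနိုင်ငံ"))
print(split_syllables("သင်္ဘော"))
```

Collecting match positions and slicing, rather than substituting a separator character, avoids corrupting input that already contains the separator.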

Internal Usage

SyllableTokenizer is the building block for:
  • WordTokenizer (inherits from it)
  • FrequencyBuilder (data pipeline)

WordTokenizer

A word tokenizer supporting two segmentation engines:
| Engine | Algorithm | Accuracy | Speed | Notes |
| --- | --- | --- | --- | --- |
| myword | Viterbi + mmap | ~95% | Fast | Recommended (default) |
| CRF | CRF model | ~92% | Medium | Requires pycrfsuite |

Initialization

from myspellchecker.tokenizers import WordTokenizer

# Default: myword engine (recommended)
tokenizer = WordTokenizer(engine="myword")

# Alternative: CRF engine
tokenizer = WordTokenizer(engine="CRF")

Basic Usage

tokenizer = WordTokenizer(engine="myword")

# Word segmentation
words = tokenizer.tokenize("မြန်မာနိုင်ငံသည်အရှေ့တောင်အာရှတွင်တည်ရှိသည်")
# Output: ['မြန်မာ', 'နိုင်ငံ', 'သည်', 'အရှေ့တောင်', 'အာရှ', 'တွင်', 'တည်ရှိ', 'သည်']

# Handles numerals
words = tokenizer.tokenize("လူ၃ယောက်")
# Output: ['လူ', '၃', 'ယောက်']

Engine: myword (Viterbi)

The myword engine uses a Viterbi algorithm with unigram/bigram probabilities stored in a memory-mapped file for fork-safe, high-performance segmentation. Features:
  • Memory-mapped dictionary (Copy-on-Write for multiprocessing)
  • Cython-optimized Viterbi implementation
  • Post-processing for fragment merging and numeral splitting
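
As a rough illustration of Viterbi-style segmentation, the sketch below finds the most probable split of a string under a unigram model. It is a toy under several assumptions: it works on characters rather than syllables, uses only unigram probabilities held in a plain dict (no bigrams, no mmap, no Cython), and the function name, parameters, and probabilities are all hypothetical.

```python
import math


def viterbi_segment(text, unigram_probs, max_word_len=6, unknown_logp=-20.0):
    """Return the highest-probability segmentation of text (toy unigram model)."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in unigram_probs:
                logp = math.log(unigram_probs[word])
            elif len(word) == 1:
                logp = unknown_logp  # fallback so every position stays reachable
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical probabilities, Latin text used only to keep the example readable.
probs = {"my": 0.2, "word": 0.2, "myword": 0.01}
print(viterbi_segment("myword", probs))
```

Here the two-word split wins because 0.2 × 0.2 = 0.04 exceeds the 0.01 assigned to the single long entry.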
Initialization Flow:
1. Load segmentation.mmap file
2. Initialize Cython mmap reader
3. Configure Viterbi function
Post-Processing Steps:
  1. Fragment merging: Merge invalid consonant+asat patterns
  2. Numeral splitting: Split word+numeral concatenations (e.g., လ၁ → ['လ', '၁'])
  3. Re-merge: Handle fragments created by splitting
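
The numeral-splitting step can be sketched as follows. This is a minimal illustration that assumes "numeral" means a run of Myanmar digits (U+1040-U+1049); the helper name split_numerals is hypothetical, and the fragment-merging and re-merge steps are not shown.

```python
import re

# Assumption: numerals are runs of Myanmar digits U+1040-U+1049.
MYANMAR_DIGITS = "\u1040-\u1049"
# Match either a run of digits or a run of non-digits.
_RUNS = re.compile(f"[{MYANMAR_DIGITS}]+|[^{MYANMAR_DIGITS}]+")


def split_numerals(tokens):
    """Split each token into alternating digit and non-digit runs."""
    out = []
    for token in tokens:
        out.extend(_RUNS.findall(token))
    return out


print(split_numerals(["လ၁"]))
```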

Engine: CRF

The CRF engine uses a trained Conditional Random Fields model for syllable-based word boundary detection. Features:
  • Uses pycrfsuite library
  • Feature extraction includes bigrams, trigrams, BOS/EOS markers
  • Good accuracy without requiring large dictionary files
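
To make the feature list concrete, here is a hedged sketch of per-syllable feature extraction with bigrams, a trigram, and BOS/EOS markers. The feature names and the function are hypothetical and are not taken from the library; the dict-per-item shape is the kind of input python-crfsuite style taggers typically consume.

```python
def syllable_features(syllables, i):
    """Build a feature dict for the syllable at index i (illustrative names)."""
    cur = syllables[i]
    # Sentence-boundary markers stand in for missing neighbours.
    prev = syllables[i - 1] if i > 0 else "<BOS>"
    nxt = syllables[i + 1] if i < len(syllables) - 1 else "<EOS>"
    return {
        "bias": 1.0,
        "cur": cur,
        "prev": prev,
        "next": nxt,
        "bigram_prev": f"{prev}|{cur}",
        "bigram_next": f"{cur}|{nxt}",
        "trigram": f"{prev}|{cur}|{nxt}",
    }


syllables = ["မြန်", "မာ", "နိုင်", "ငံ"]
features = [syllable_features(syllables, i) for i in range(len(syllables))]
print(features[0]["trigram"])
```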

Checking Custom Words

For the myword engine, you can check whether words exist in the dictionary. Note that add_custom_words() only reports which words are found; it does not add them:
tokenizer = WordTokenizer(engine="myword")

# Check if custom words exist in the mmap dictionary
tokenizer.add_custom_words(["ဆော့ဖ်ဝဲ", "ဒေတာဘေ့စ်"])
# Logs: "2/2 words found in dictionary." or warnings for missing words
Note: With mmap-only mode, new words cannot be added dynamically at runtime.

Zero/Wa Normalization

The tokenizer automatically normalizes Myanmar numeral zero (၀, U+1040) to letter wa (ဝ, U+101D) when not in numeric context:
# Automatic normalization
words = tokenizer.tokenize("ဝါကျ")  # wa as letter
# Output: ['ဝါကျ']

words = tokenizer.tokenize("၂၀၂၄")  # zeros in number preserved
# Output: ['၂၀၂၄']
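
One way to picture this normalization is the sketch below. It assumes a deliberately simple definition of "numeric context" (a zero that touches another Myanmar digit), which may differ from the library's actual rule; the function name is hypothetical.

```python
import re

# Assumption: a zero is "numeric" only when adjacent to another Myanmar digit.
# The real tokenizer may use a different or richer rule.
_LONE_ZERO = re.compile("(?<![\u1040-\u1049])\u1040(?![\u1040-\u1049])")


def normalize_zero_to_wa(text: str) -> str:
    """Replace digit zero (U+1040) with letter wa (U+101D) outside digit runs."""
    return _LONE_ZERO.sub("\u101D", text)


# U+1040 typed in place of wa at the start of a word is rewritten to U+101D.
print(normalize_zero_to_wa("\u1040\u102B\u1000\u103B"))
# Zeros inside a number are left untouched.
print(normalize_zero_to_wa("\u1042\u1040\u1042\u1044"))
```

Under this simplification a standalone zero with no neighbouring digits is also rewritten, so a production rule would need more context than this sketch uses.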

Cython Extensions

Performance-critical tokenization code uses Cython extensions:
| Module | File | Purpose |
| --- | --- | --- |
| word_segment | tokenizers/cython/word_segment.pyx | Viterbi algorithm |
| mmap_reader | tokenizers/cython/mmap_reader.pyx | Memory-mapped file access |

Checking Cython Status

tokenizer = WordTokenizer(engine="myword")

# Check if using Cython
print(f"Using Cython: {tokenizer._using_cython}")
print(f"Using mmap: {tokenizer._using_mmap}")

Error Handling

from myspellchecker.tokenizers import WordTokenizer

# Invalid engine
try:
    tokenizer = WordTokenizer(engine="invalid")
except ValueError as e:
    print(e)  # "Unknown engine: invalid. Must be one of: CRF, myword"

# Missing mmap file
try:
    tokenizer = WordTokenizer(engine="myword")
except RuntimeError as e:
    print(e)  # "segmentation.mmap is required for myword engine..."

TransformerWordSegmenter

A model-agnostic word segmenter that uses any HuggingFace token classification model with B/I (Beginning/Inside) labels to identify word boundaries in Myanmar text.

Requirements

Requires the optional transformers dependency:
pip install myspellchecker[transformers]
This installs:
  • transformers>=4.30.0
  • torch>=2.0.0

Initialization

from myspellchecker.tokenizers.transformer_word_segmenter import (
    TransformerWordSegmenter,
)

# Use the default model
segmenter = TransformerWordSegmenter()

# Use a custom model
segmenter = TransformerWordSegmenter(
    model_name="your-org/your-model",
    device=0,  # GPU
)

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | Optional[str] | "chuuhtetnaing/myanmar-text-segmentation-model" | HuggingFace model ID or local path |
| device | int | -1 | Device for inference. -1 for CPU, 0+ for GPU index |
| batch_size | int | 32 | Batch size for segment_batch(). Auto-tuned to 64 on CPU if left at default |
| max_length | int | 512 | Maximum sequence length for the tokenizer |
| cache_dir | Optional[str] | None | Directory for caching downloaded models |
| **pipeline_kwargs | | | Additional arguments passed to transformers.pipeline() |

Basic Usage

segmenter = TransformerWordSegmenter()

# Single text segmentation
words = segmenter.segment("မြန်မာနိုင်ငံသည်")
# Output depends on model: e.g., ['မြန်မာ', 'နိုင်ငံ', 'သည်']

# Batch segmentation (more efficient for multiple texts)
results = segmenter.segment_batch([
    "မြန်မာနိုင်ငံ",
    "ကျွန်တော်သွားပါမယ်",
])
# Output: list of word lists, one per input text

How It Works

The segmenter uses a HuggingFace token-classification pipeline with aggregation_strategy="simple". The model labels each token as:
  • B (Beginning): Start of a new word
  • I (Inside): Continuation of the current word
The _merge_bi_tags() method groups consecutive B+I* sequences into complete words:
Input tokens:  [B:"မြန်", I:"မာ", B:"နိုင်", I:"ငံ", B:"သည်"]
Merged words:  ["မြန်မာ",          "နိုင်ငံ",          "သည်"]
Edge cases handled:
  • I without preceding B: Treated as a new word start
  • Unknown tag: Treated as B (new word start)
  • Empty tokens: Skipped
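
The merging rules above can be sketched as a standalone function. This is an illustrative re-implementation, not the library's _merge_bi_tags(); it assumes each item is a dict with the "entity_group" and "word" keys that a HuggingFace token-classification pipeline returns when an aggregation strategy is set.

```python
def merge_bi_tags(entities):
    """Group B/I-tagged tokens into words (illustrative sketch)."""
    words = []
    for item in entities:
        token = item.get("word", "")
        if not token:
            continue  # empty tokens are skipped
        if item.get("entity_group") == "I" and words:
            words[-1] += token  # continuation of the current word
        else:
            # "B", an unknown tag, or an "I" with nothing before it
            # all start a new word.
            words.append(token)
    return words


tagged = [
    {"entity_group": "B", "word": "မြန်"},
    {"entity_group": "I", "word": "မာ"},
    {"entity_group": "B", "word": "နိုင်"},
    {"entity_group": "I", "word": "ငံ"},
    {"entity_group": "B", "word": "သည်"},
]
print(merge_bi_tags(tagged))
```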

Device Support

The segmenter supports CPU, CUDA GPU, and Apple Silicon MPS:
| Device Value | Hardware | Notes |
| --- | --- | --- |
| -1 (default) | CPU | Always available, batch_size auto-tuned to 64 |
| 0 | CUDA GPU 0 | Requires CUDA-capable GPU and PyTorch with CUDA |
| 0 (on macOS) | MPS (Apple Silicon) | Auto-detected when CUDA unavailable but MPS available |
| 1, 2, … | CUDA GPU N | Falls back to CPU if GPU index unavailable |
Device fallback behavior:
  • If a GPU index is requested but unavailable, falls back to CPU with a warning
  • If PyTorch is not installed, falls back to CPU with a warning
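
The fallback behavior can be approximated with the sketch below. It is an assumption about how such a resolver might look, not the segmenter's internal code; resolve_device and its string return values are hypothetical, while the PyTorch calls used (torch.cuda.is_available(), torch.cuda.device_count(), torch.backends.mps.is_available()) are standard.

```python
import warnings


def resolve_device(device: int) -> str:
    """Map a requested device index to 'cpu', 'cuda:N', or 'mps' (sketch)."""
    if device < 0:
        return "cpu"
    try:
        import torch
    except ImportError:
        warnings.warn("PyTorch is not installed; falling back to CPU")
        return "cpu"
    if torch.cuda.is_available():
        if device < torch.cuda.device_count():
            return f"cuda:{device}"
        warnings.warn(f"CUDA device {device} is unavailable; falling back to CPU")
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if device == 0 and mps is not None and mps.is_available():
        return "mps"
    warnings.warn(f"GPU device {device} is unavailable; falling back to CPU")
    return "cpu"


print(resolve_device(-1))
```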

Batch Processing

segment_batch() is significantly more efficient than calling segment() in a loop:
# Efficient: single batch call
results = segmenter.segment_batch(texts)

# Inefficient: individual calls
results = [segmenter.segment(t) for t in texts]
If batch processing fails (e.g., GPU memory), it automatically falls back to processing each text individually.
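
The fallback pattern described above can be sketched generically. The source does not say which exceptions trigger the fallback, so catching RuntimeError (the base class of PyTorch's out-of-memory error) is an assumption, and the function and parameter names are hypothetical.

```python
def segment_with_fallback(texts, batch_fn, single_fn):
    """Try one batched call; on failure, process texts one at a time."""
    try:
        return batch_fn(texts)
    except RuntimeError:
        # Assumption: batch failures surface as RuntimeError (e.g. GPU OOM).
        return [single_fn(text) for text in texts]


def failing_batch(texts):
    # Stand-in for a batch call that runs out of memory.
    raise RuntimeError("simulated out-of-memory")


print(segment_with_fallback(["a b", "c d"], failing_batch, str.split))
```

With a real segmenter, batch_fn and single_fn would be segmenter.segment_batch and segmenter.segment.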

Data Pipeline Integration

The transformer engine integrates with the data pipeline for building dictionaries from corpus files.

CLI Usage

# Build with transformer word segmentation
myspellchecker build \
    --input corpus.txt \
    --output dict.db \
    --word-engine transformer

# Use a custom model
myspellchecker build \
    --input corpus.txt \
    --output dict.db \
    --word-engine transformer \
    --seg-model "your-org/your-model"

# Use GPU (CUDA or MPS auto-detected)
myspellchecker build \
    --input corpus.txt \
    --output dict.db \
    --word-engine transformer \
    --seg-device 0

CLI Flags

| Flag | Default | Description |
| --- | --- | --- |
| --word-engine transformer | myword | Select the transformer segmentation engine |
| --seg-model MODEL | chuuhtetnaing/myanmar-text-segmentation-model | HuggingFace model ID or local path |
| --seg-device DEVICE | -1 (CPU) | Device for inference. -1 for CPU, 0+ for GPU |

Python API

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    word_engine="transformer",
    seg_model="your-org/your-model",  # optional, uses default if None
    seg_device=0,                     # optional, -1 for CPU
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

Pipeline Processing Behavior

When using the transformer engine, the pipeline processes chunks sequentially in the main process rather than using multiprocessing. There are two reasons: PyTorch's internal C++ state (thread pools, memory allocators, CUDA contexts) does not survive fork(), and loading the model separately in each spawned worker would be impractical (~1.1GB per worker). The pipeline automatically:
  1. Loads the transformer model once in the main process
  2. Processes chunks sequentially with per-chunk progress reporting
  3. Uses batch inference (segment_batch()) for efficient processing within each chunk
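
The sequential flow above can be sketched as follows. This is not the pipeline's actual code: the chunk size, the progress callback, and both function names are hypothetical, and segment_batch stands for any callable that maps a list of texts to a list of word lists.

```python
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    iterator = iter(lines)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def process_sequentially(lines, segment_batch, chunk_size=1000, on_progress=None):
    """Segment chunks one after another in the current process."""
    results = []
    for index, chunk in enumerate(iter_chunks(lines, chunk_size), start=1):
        results.extend(segment_batch(chunk))  # one batched call per chunk
        if on_progress is not None:
            on_progress(index, len(chunk))  # per-chunk progress reporting
    return results


def fake_segment_batch(texts):
    # Stand-in for a model-backed segment_batch(); splits on whitespace.
    return [text.split() for text in texts]


progress = []
out = process_sequentially(
    ["a b", "c", "d e f"],
    fake_segment_batch,
    chunk_size=2,
    on_progress=lambda i, n: progress.append((i, n)),
)
print(out, progress)
```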

Compatible Model Requirements

The TransformerWordSegmenter is model-agnostic. Any HuggingFace model can be used as long as it meets these requirements:
  1. Task: Must be a token-classification model (compatible with transformers.pipeline("token-classification", ...))
  2. Labels: Must output entity_group values of "B" and "I":
    • B = Beginning of a new word
    • I = Inside/continuation of the current word
  3. Tokenizer: Must include a compatible tokenizer (automatically loaded by the HuggingFace pipeline)
  4. Hosting: Can be hosted on HuggingFace Hub (loaded by model ID) or stored locally (loaded by file path)
The default model is chuuhtetnaing/myanmar-text-segmentation-model, an XLM-RoBERTa model fine-tuned for Myanmar text segmentation.

Error Handling

# Missing transformers package
try:
    segmenter = TransformerWordSegmenter()
except ImportError as e:
    print(e)
    # "Transformer-based word segmentation requires the 'transformers' library.
    #  Install with: pip install myspellchecker[transformers]"

# Invalid model
try:
    segmenter = TransformerWordSegmenter(model_name="nonexistent/model")
except ValueError as e:
    print(e)
    # "Failed to load model 'nonexistent/model': ..."

Properties

| Property | Type | Description |
| --- | --- | --- |
| model_name | str | The model ID or path being used |
| device | int | The device being used (-1 = CPU, 0+ = GPU) |
| batch_size | int | The batch size for batch processing |
| max_length | int | Maximum sequence length |
| is_fork_safe | bool | True for CPU mode, False for GPU mode |

Default Model Attribution

The default model is chuuhtetnaing/myanmar-text-segmentation-model:
  • Author: Chuu Htet Naing
  • Base: XLM-RoBERTa fine-tuned for token classification
  • Labels: B (beginning), I (inside)
  • License: See model page for details

Performance Comparison

| Operation | SyllableTokenizer | WordTokenizer (myword) | WordTokenizer (CRF) |
| --- | --- | --- | --- |
| Short text (10 chars) | ~5μs | ~50μs | ~100μs |
| Medium text (100 chars) | ~20μs | ~200μs | ~500μs |
| Long text (1000 chars) | ~100μs | ~1ms | ~3ms |
Benchmarks on Apple M1, Python 3.11
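
Timings like these vary with hardware and input, so it can be useful to measure on your own data. The helper below is a minimal sketch using the standard-library timeit module; the function name is hypothetical, and str.split is only a stand-in for a tokenizer's tokenize method.

```python
import timeit


def best_call_time_us(fn, text, repeat=5, number=1000):
    """Return the best observed time per call of fn(text), in microseconds."""
    # timeit.repeat returns total seconds for `number` calls, once per repeat.
    totals = timeit.repeat(lambda: fn(text), repeat=repeat, number=number)
    return min(totals) / number * 1e6


# Stand-in callable; replace with e.g. tokenizer.tokenize to benchmark a tokenizer.
elapsed = best_call_time_us(str.split, "a b c " * 20)
print(f"{elapsed:.2f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes, which is the approach the timeit documentation recommends.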

Attribution

The word segmentation algorithms are based on research by Ye Kyaw Thu. The transformer word segmentation uses the model by Chuu Htet Naing.
