Tokenizers API - mySpellChecker

The tokenizers module provides low-level text splitting utilities for Myanmar text. Unlike Segmenters (which may involve complex logic and dictionary lookups), Tokenizers provide direct access to segmentation algorithms.

Overview

Tokenizer	Algorithm	Purpose	Speed
`SyllableTokenizer`	Regex-based	Split text into syllables	Very fast
`WordTokenizer`	CRF or Viterbi	Split text into words	Fast
`TransformerWordSegmenter`	HuggingFace token classification	Split text into words using B/I labels	Model-dependent

SyllableTokenizer

A fast, regex-based tokenizer that splits Myanmar text into syllables using the Sylbreak algorithm rules.

Initialization

from myspellchecker.tokenizers import SyllableTokenizer

tokenizer = SyllableTokenizer()

Basic Usage

tokenizer = SyllableTokenizer()

# Basic syllable tokenization
syllables = tokenizer.tokenize("မြန်မာနိုင်ငံ")
# Output: ['မြန်', 'မာ', 'နိုင်', 'ငံ']

# Handles stacked consonants correctly
syllables = tokenizer.tokenize("သင်္ဘော")
# Output: ['သင်္ဘော']  # Kinzi preserved as single unit

# Handles mixed content
syllables = tokenizer.tokenize("မြန်မာ123abc")
# Output: ['မြန်', 'မာ', '1', '2', '3', 'a', 'b', 'c']

How It Works

The tokenizer uses regex patterns to identify syllable boundaries based on:

Myanmar consonants (U+1000-U+1021)
Virama/Asat markers (္ and ်) for stacking detection
Negative lookbehind to preserve stacked consonants

# Internal pattern logic (simplified)
pattern = r"((?<!္)[က-အ](?![်္])|[a-zA-Z0-9\s...])"
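
As a rough illustration of the rules above, the following sketch places a syllable boundary before every consonant that is neither stacked (preceded by U+1039) nor closed by an asat or virama. It is a simplified stand-in rather than the library's actual pattern: `split_syllables` and the constant names are hypothetical, and digits, Latin text, and independent vowels are not handled.

```python
import re

CONSONANTS = "\u1000-\u1021"   # Myanmar consonants, U+1000 .. U+1021
VIRAMA = "\u1039"              # stacking marker
ASAT = "\u103A"                # final-consonant marker

# A consonant starts a new syllable unless it is stacked under the previous
# consonant or is itself a syllable-final consonant.
_BREAK_BEFORE = re.compile(
    rf"(?<!{VIRAMA})([{CONSONANTS}])(?![{ASAT}{VIRAMA}])"
)

def split_syllables(text):
    """Split text at every position where the pattern finds a syllable start."""
    starts = [m.start() for m in _BREAK_BEFORE.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)          # keep any leading non-matching text
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]

print(split_syllables("မြန်မာနိုင်ငံ"))
```

Because the stacked consonant in a kinzi cluster follows U+1039, the negative lookbehind keeps such clusters inside one syllable.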

Internal Usage

SyllableTokenizer is the building block for:

WordTokenizer (inherits from it)
FrequencyBuilder (data pipeline)

WordTokenizer

A word tokenizer supporting two segmentation engines:

Engine	Algorithm	Accuracy	Speed	Notes
`myword`	Viterbi + mmap	~95%	Fast	Recommended (default)
`CRF`	CRF model	~92%	Medium	Requires pycrfsuite

Initialization

from myspellchecker.tokenizers import WordTokenizer

# Default: myword engine (recommended)
tokenizer = WordTokenizer(engine="myword")

# Alternative: CRF engine
tokenizer = WordTokenizer(engine="CRF")

Basic Usage

tokenizer = WordTokenizer(engine="myword")

# Word segmentation
words = tokenizer.tokenize("မြန်မာနိုင်ငံသည်အရှေ့တောင်အာရှတွင်တည်ရှိသည်")
# Output: ['မြန်မာ', 'နိုင်ငံ', 'သည်', 'အရှေ့တောင်', 'အာရှ', 'တွင်', 'တည်ရှိ', 'သည်']

# Handles numerals
words = tokenizer.tokenize("လူ၃ယောက်")
# Output: ['လူ', '၃', 'ယောက်']

Engine: myword (Viterbi)

The myword engine uses a Viterbi algorithm with unigram/bigram probabilities stored in a memory-mapped file for fork-safe, high-performance segmentation. Features:

Memory-mapped dictionary (Copy-on-Write for multiprocessing)
Cython-optimized Viterbi implementation
Post-processing for fragment merging and numeral splitting
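
To make the idea concrete, here is a minimal unigram-only Viterbi segmenter over an in-memory dictionary. It is a sketch under simplifying assumptions, not the myword implementation: there are no bigram scores, no mmap, and no Cython, and `viterbi_segment`, `unigram_logp`, `max_word_len`, and `unknown_logp` are hypothetical names.

```python
import math

def viterbi_segment(text, unigram_logp, max_word_len=12, unknown_logp=-20.0):
    """Return the segmentation of text with the highest total log-probability."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            # Unknown spans are penalised per character so known words win.
            score = unigram_logp.get(word, unknown_logp * len(word))
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Toy log-probabilities, invented for illustration only.
toy_logp = {"မြန်မာ": -2.0, "နိုင်ငံ": -2.5, "သည်": -1.5}
print(viterbi_segment("".join(toy_logp), toy_logp))
```

The dynamic program considers every candidate word ending at each position, so its cost grows with `len(text) * max_word_len`.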

Initialization Flow:

Load segmentation.mmap file
Initialize Cython mmap reader
Configure Viterbi function

Post-Processing Steps:

Fragment merging: Merge invalid consonant+asat patterns
Numeral splitting: Split word+numeral concatenations (e.g., လ၁ → ['လ', '၁'])
Re-merge: Handle fragments created by splitting
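
The three steps can be sketched as plain list transformations. This is an illustrative approximation only: the helper names are hypothetical, and the real rules for what counts as an invalid fragment are likely more detailed than the single consonant+asat check assumed here.

```python
import re

_DIGITS = "\u1040-\u1049"                                # Myanmar digits
_DIGIT_RUNS = re.compile(rf"[{_DIGITS}]+|[^{_DIGITS}]+")  # digit or non-digit runs
_BARE_FINAL = re.compile("[\u1000-\u1021]\u103A")         # consonant + asat only

def merge_fragments(tokens):
    """Attach a token that is only consonant+asat to the preceding token."""
    merged = []
    for token in tokens:
        if merged and _BARE_FINAL.fullmatch(token):
            merged[-1] += token
        else:
            merged.append(token)
    return merged

def split_numerals(tokens):
    """Split each token into alternating runs of digits and non-digits."""
    out = []
    for token in tokens:
        out.extend(_DIGIT_RUNS.findall(token))
    return out

def post_process(tokens):
    # Merge, split, then merge again to repair fragments created by splitting.
    return merge_fragments(split_numerals(merge_fragments(tokens)))

print(post_process(["လ၁"]))
```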

Engine: CRF

The CRF engine uses a trained Conditional Random Fields model for syllable-based word boundary detection. Features:

Uses pycrfsuite library
Feature extraction includes bigrams, trigrams, BOS/EOS markers
Good accuracy without requiring large dictionary files
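
The kind of per-syllable features listed above might look like the following sketch. The feature names and the `syllable_features` helper are assumptions made for illustration; the trained model's actual feature template is not documented here.

```python
def syllable_features(syllables, i):
    """Build a feature dict for the syllable at position i."""
    feats = {"bias": 1.0, "syl": syllables[i]}
    last = len(syllables) - 1
    if i == 0:
        feats["BOS"] = True                       # beginning of sequence
    else:
        feats["prev_bigram"] = syllables[i - 1] + syllables[i]
    if i == last:
        feats["EOS"] = True                       # end of sequence
    else:
        feats["next_bigram"] = syllables[i] + syllables[i + 1]
    if 0 < i < last:
        feats["trigram"] = syllables[i - 1] + syllables[i] + syllables[i + 1]
    return feats

def sentence_features(syllables):
    """One feature dict per syllable, in order."""
    return [syllable_features(syllables, i) for i in range(len(syllables))]

for feats in sentence_features(["မြန်", "မာ", "နိုင်", "ငံ"]):
    print(feats)
```

A CRF tagger would consume one such dict per syllable and predict, for each position, whether a word boundary follows.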

Checking Custom Words

For the myword engine, you can check if words exist in the dictionary:

tokenizer = WordTokenizer(engine="myword")

# Check if custom words exist in the mmap dictionary
tokenizer.add_custom_words(["ဆော့ဖ်ဝဲ", "ဒေတာဘေ့စ်"])
# Logs: "2/2 words found in dictionary." or warnings for missing words

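Note: In mmap-only mode, new words cannot be added dynamically at runtime. Despite its name, add_custom_words() only reports which of the given words are already in the dictionary.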

Zero/Wa Normalization

The tokenizer automatically normalizes Myanmar numeral zero (၀, U+1040) to letter wa (ဝ, U+101D) when not in numeric context:

# Automatic normalization
words = tokenizer.tokenize("ဝါကျ")  # wa as letter
# Output: ['ဝါကျ']

words = tokenizer.tokenize("၂၀၂၄")  # zeros in number preserved
# Output: ['၂၀၂၄']
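
One way to approximate this rule is shown below. It assumes that "numeric context" means the zero is adjacent to another Myanmar digit; that is a guess made for illustration, and the library's actual heuristic may differ. `normalize_zero_to_wa` is a hypothetical helper.

```python
import re

# A zero (U+1040) with no Myanmar digit (U+1040 .. U+1049) on either side.
_ZERO_AS_LETTER = re.compile(r"(?<![\u1040-\u1049])\u1040(?![\u1040-\u1049])")

def normalize_zero_to_wa(text):
    """Replace an isolated Myanmar zero with the letter wa (U+101D)."""
    return _ZERO_AS_LETTER.sub("\u101D", text)

# Zero typed in place of wa, followed by a vowel sign: rewritten to wa.
print(normalize_zero_to_wa("\u1040\u102B\u1000\u103B"))
# Zero inside a number: left unchanged.
print(normalize_zero_to_wa("\u1042\u1040\u1042\u1044"))
```

Under this assumption a lone zero surrounded by non-digits is also rewritten, which is the main limitation of such a simple rule.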

Cython Extensions

Performance-critical tokenization code uses Cython extensions:

Module	File	Purpose
`word_segment`	`tokenizers/cython/word_segment.pyx`	Viterbi algorithm
`mmap_reader`	`tokenizers/cython/mmap_reader.pyx`	Memory-mapped file access

Checking Cython Status

tokenizer = WordTokenizer(engine="myword")

# Check if using Cython
print(f"Using Cython: {tokenizer._using_cython}")
print(f"Using mmap: {tokenizer._using_mmap}")

Error Handling

from myspellchecker.tokenizers import WordTokenizer

# Invalid engine
try:
    tokenizer = WordTokenizer(engine="invalid")
except ValueError as e:
    print(e)  # "Unknown engine: invalid. Must be one of: CRF, myword"

# Missing mmap file
try:
    tokenizer = WordTokenizer(engine="myword")
except RuntimeError as e:
    print(e)  # "segmentation.mmap is required for myword engine..."

TransformerWordSegmenter

A model-agnostic word segmenter that uses any HuggingFace token classification model with B/I (Beginning/Inside) labels to identify word boundaries in Myanmar text.

Requirements

Requires the optional transformers dependency:

pip install myspellchecker[transformers]

This installs:

transformers>=4.30.0
torch>=2.0.0

Initialization

from myspellchecker.tokenizers.transformer_word_segmenter import (
    TransformerWordSegmenter,
)

# Use the default model
segmenter = TransformerWordSegmenter()

# Use a custom model
segmenter = TransformerWordSegmenter(
    model_name="your-org/your-model",
    device=0,  # GPU
)

Constructor Parameters

Parameter	Type	Default	Description
`model_name`	`Optional[str]`	`"chuuhtetnaing/myanmar-text-segmentation-model"`	HuggingFace model ID or local path
`device`	`int`	`-1`	Device for inference. `-1` for CPU, `0+` for GPU index
`batch_size`	`int`	`32`	Batch size for `segment_batch()`. Auto-tuned to `64` on CPU if left at default
`max_length`	`int`	`512`	Maximum sequence length for the tokenizer
`cache_dir`	`Optional[str]`	`None`	Directory for caching downloaded models
`**pipeline_kwargs`			Additional arguments passed to `transformers.pipeline()`

Basic Usage

segmenter = TransformerWordSegmenter()

# Single text segmentation
words = segmenter.segment("မြန်မာနိုင်ငံသည်")
# Output depends on model: e.g., ['မြန်မာ', 'နိုင်ငံ', 'သည်']

# Batch segmentation (more efficient for multiple texts)
results = segmenter.segment_batch([
    "မြန်မာနိုင်ငံ",
    "ကျွန်တော်သွားပါမယ်",
])
# Output: list of word lists, one per input text

How It Works

The segmenter uses a HuggingFace token-classification pipeline with aggregation_strategy="simple". The model labels each token as:

B (Beginning): Start of a new word
I (Inside): Continuation of the current word

The _merge_bi_tags() method groups consecutive B+I* sequences into complete words:

Input tokens:  [B:"မြန်", I:"မာ", B:"နိုင်", I:"ငံ", B:"သည်"]
Merged words:  ["မြန်မာ",          "နိုင်ငံ",          "သည်"]

Edge cases handled:

I without preceding B: Treated as a new word start
Unknown tag: Treated as B (new word start)
Empty tokens: Skipped
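
The merge rule and its edge cases can be sketched as follows. This is not the library's `_merge_bi_tags()` source; it is a hypothetical re-implementation that assumes each entity is a dict with `entity_group` and `word` keys, which is the shape the HuggingFace token-classification pipeline returns when an aggregation strategy is set.

```python
def merge_bi_tags(entities):
    """Group B/I-labelled pieces into words."""
    words = []
    for entity in entities:
        piece = (entity.get("word") or "").strip()
        if not piece:
            continue                    # empty tokens are skipped
        if entity.get("entity_group") == "I" and words:
            words[-1] += piece          # continue the current word
        else:
            # "B", an unknown tag, or an "I" with no word in progress
            # all start a new word.
            words.append(piece)
    return words

entities = [
    {"entity_group": "B", "word": "မြန်"},
    {"entity_group": "I", "word": "မာ"},
    {"entity_group": "B", "word": "နိုင်"},
    {"entity_group": "I", "word": "ငံ"},
    {"entity_group": "B", "word": "သည်"},
]
print(merge_bi_tags(entities))
```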

Device Support

The segmenter supports CPU, CUDA GPU, and Apple Silicon MPS:

Device Value	Hardware	Notes
`-1` (default)	CPU	Always available, batch_size auto-tuned to 64
`0`	CUDA GPU 0	Requires CUDA-capable GPU and PyTorch with CUDA
`0` (on macOS)	MPS (Apple Silicon)	Auto-detected when CUDA unavailable but MPS available
`1`, `2`, …	CUDA GPU N	Falls back to CPU if GPU index unavailable

Device fallback behavior:

If a GPU index is requested but unavailable, falls back to CPU with a warning
If PyTorch is not installed, falls back to CPU with a warning
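
The fallback logic can be pictured as a small pure function. The sketch below is an assumption about how such a resolver could be written, not the segmenter's actual code: `resolve_device` and its string return values are hypothetical. In real use, `cuda_count` and `mps_available` would come from `torch.cuda.device_count()` and `torch.backends.mps.is_available()`, with both treated as zero/false when PyTorch is not installed.

```python
import warnings

def resolve_device(requested, cuda_count, mps_available):
    """Map a requested device index to 'cpu', 'cuda:N', or 'mps'."""
    if requested < 0:
        return "cpu"
    if requested < cuda_count:
        return f"cuda:{requested}"
    if requested == 0 and mps_available:
        return "mps"                     # no CUDA, but Apple Silicon MPS exists
    warnings.warn(f"GPU index {requested} is unavailable; falling back to CPU")
    return "cpu"

print(resolve_device(-1, cuda_count=0, mps_available=False))
print(resolve_device(0, cuda_count=0, mps_available=True))
```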

Batch Processing

segment_batch() is significantly more efficient than calling segment() in a loop:

# Efficient: single batch call
results = segmenter.segment_batch(texts)

# Inefficient: individual calls
results = [segmenter.segment(t) for t in texts]

If batch processing fails (e.g., GPU memory), it automatically falls back to processing each text individually.
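
The fallback pattern looks roughly like this. The wrapper is a hypothetical illustration that takes the batch and single-text functions as arguments, and it assumes failures surface as `RuntimeError`; the library's internal handling may catch a different set of exceptions.

```python
def segment_with_fallback(texts, batch_fn, single_fn):
    """Try one batched call; on failure, process texts one at a time."""
    try:
        return batch_fn(texts)
    except RuntimeError:
        # For example, an out-of-memory error raised during batched inference.
        return [single_fn(text) for text in texts]

def failing_batch(texts):
    raise RuntimeError("simulated out-of-memory error")

# Stand-in functions; with the real segmenter these would be
# segmenter.segment_batch and segmenter.segment.
print(segment_with_fallback(["a b", "c d"], failing_batch, str.split))
```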

Data Pipeline Integration

The transformer engine integrates with the data pipeline for building dictionaries from corpus files.

CLI Usage

# Build with transformer word segmentation
myspellchecker build \
    --input corpus.txt \
    --output dict.db \
    --word-engine transformer

# Use a custom model
myspellchecker build \
    --input corpus.txt \
    --output dict.db \
    --word-engine transformer \
    --seg-model "your-org/your-model"

# Use GPU (CUDA or MPS auto-detected)
myspellchecker build \
    --input corpus.txt \
    --output dict.db \
    --word-engine transformer \
    --seg-device 0

CLI Flags

Flag	Default	Description
`--word-engine transformer`	`myword`	Select the transformer segmentation engine
`--seg-model MODEL`	`chuuhtetnaing/myanmar-text-segmentation-model`	HuggingFace model ID or local path
`--seg-device DEVICE`	`-1` (CPU)	Device for inference. `-1` for CPU, `0+` for GPU

Python API

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    word_engine="transformer",
    seg_model="your-org/your-model",  # optional, uses default if None
    seg_device=0,                     # optional, -1 for CPU
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

Pipeline Processing Behavior

When using the transformer engine, the pipeline processes chunks sequentially in the main process rather than using multiprocessing. PyTorch's internal C++ state (thread pools, memory allocators, CUDA contexts) does not survive fork(), and loading the model in each spawned worker would be impractical (~1.1 GB per worker). The pipeline automatically:

Loads the transformer model once in the main process
Processes chunks sequentially with per-chunk progress reporting
Uses batch inference (segment_batch()) for efficient processing within each chunk
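
A minimal sketch of this sequential loop is shown below. `iter_chunks` and `segment_corpus` are hypothetical helpers, the chunk size is arbitrary, and the batch function is passed in so that the example runs without a model; in the pipeline it would be the loaded segmenter's `segment_batch`.

```python
from itertools import islice

def iter_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

def segment_corpus(lines, segment_batch, chunk_size=1000):
    """Segment all lines in one process, one chunk at a time."""
    results = []
    for number, chunk in enumerate(iter_chunks(lines, chunk_size), start=1):
        results.extend(segment_batch(chunk))      # one batched call per chunk
        print(f"chunk {number}: {len(chunk)} lines segmented")
    return results

# Stand-in batch function that splits on whitespace.
fake_batch = lambda chunk: [text.split() for text in chunk]
print(segment_corpus(["a b", "c", "d e f"], fake_batch, chunk_size=2))
```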

Compatible Model Requirements

The TransformerWordSegmenter is model-agnostic. Any HuggingFace model can be used as long as it meets these requirements:

Task: Must be a token-classification model (compatible with transformers.pipeline("token-classification", ...))
Labels: Must output entity_group values of "B" and "I":
- B = Beginning of a new word
- I = Inside/continuation of the current word
Tokenizer: Must include a compatible tokenizer (automatically loaded by the HuggingFace pipeline)
Hosting: Can be hosted on HuggingFace Hub (loaded by model ID) or stored locally (loaded by file path)
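
A quick pre-check of the label requirement could look like the sketch below. It assumes you have the model's `id2label` mapping at hand (HuggingFace model configs expose one); `has_bi_labels` is a hypothetical helper, not part of the library.

```python
def has_bi_labels(id2label):
    """Return True if the label set contains both 'B' and 'I'."""
    labels = {str(label) for label in id2label.values()}
    return {"B", "I"} <= labels

print(has_bi_labels({0: "B", 1: "I"}))
print(has_bi_labels({0: "O", 1: "B-PER", 2: "I-PER"}))
```

A model that fails this check, such as a typical NER model with `B-PER`/`I-PER` labels, would not produce the `entity_group` values the segmenter expects.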

The default model is chuuhtetnaing/myanmar-text-segmentation-model, an XLM-RoBERTa model fine-tuned for Myanmar text segmentation.

Error Handling

# Missing transformers package
try:
    segmenter = TransformerWordSegmenter()
except ImportError as e:
    print(e)
    # "Transformer-based word segmentation requires the 'transformers' library.
    #  Install with: pip install myspellchecker[transformers]"

# Invalid model
try:
    segmenter = TransformerWordSegmenter(model_name="nonexistent/model")
except ValueError as e:
    print(e)
    # "Failed to load model 'nonexistent/model': ..."

Properties

Property	Type	Description
`model_name`	`str`	The model ID or path being used
`device`	`int`	The device being used (`-1` = CPU, `0+` = GPU)
`batch_size`	`int`	The batch size for batch processing
`max_length`	`int`	Maximum sequence length
`is_fork_safe`	`bool`	`True` for CPU mode, `False` for GPU mode

Default Model Attribution

The default model is chuuhtetnaing/myanmar-text-segmentation-model:

Author: Chuu Htet Naing
Base: XLM-RoBERTa fine-tuned for token classification
Labels: B (beginning), I (inside)
License: See model page for details

Performance Comparison

Operation	SyllableTokenizer	WordTokenizer (myword)	WordTokenizer (CRF)
Short text (10 chars)	~5μs	~50μs	~100μs
Medium text (100 chars)	~20μs	~200μs	~500μs
Long text (1000 chars)	~100μs	~1ms	~3ms

Benchmarks on Apple M1, Python 3.11
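
To take comparable measurements on your own hardware, a small timing harness such as the one below can be used. It is a generic sketch built on the standard `timeit` module, not the script that produced the figures above; `best_call_time_us` is a hypothetical helper, and `str.split` stands in for a tokenizer's `tokenize` method.

```python
import timeit

def best_call_time_us(fn, text, repeat=5, number=1000):
    """Best-of-repeat average time per call, in microseconds."""
    totals = timeit.repeat(lambda: fn(text), repeat=repeat, number=number)
    return min(totals) / number * 1e6

# Replace str.split with tokenizer.tokenize to time a real tokenizer.
print(f"{best_call_time_us(str.split, 'a b c d e'):.2f} us per call")
```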

Attribution

The word segmentation algorithms are based on research by Ye Kyaw Thu.

The transformer word segmentation uses the model by Chuu Htet Naing:

myanmar-text-segmentation-model
