Stemmer
The `Stemmer` provides rule-based stemming that strips common suffixes from Myanmar words to identify their root forms.
Use Cases
- Identifying out-of-vocabulary (OOV) words that are conjugated forms of known words
- Aggregating statistics for root words
- Improving POS tagging by mapping inflected words to the POS of their root
Usage
Performance Features
- LRU caching for frequently stemmed words
- Pre-computed suffix list sorted by length, so the longest suffix is matched first
- O(n) suffix collection: stripped suffixes are appended, then reversed once
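
The points above can be illustrated with a minimal sketch. It is not the library's implementation: the `stem` function, the `SUFFIXES` tuple, and the cache size are assumptions chosen for illustration, and the suffix list covers only a few common markers.

```python
from functools import lru_cache
from typing import Tuple

# Illustrative suffixes only (plural and case markers); not the library's real list.
SUFFIXES = ("များ", "တွေ", "ကို", "မှာ")

# Sort once at import time, longest first, so the longest matching suffix is tried first.
SORTED_SUFFIXES = tuple(sorted(SUFFIXES, key=len, reverse=True))


@lru_cache(maxsize=4096)  # hypothetical cache size
def stem(word: str) -> Tuple[str, Tuple[str, ...]]:
    """Return (root, stripped suffixes in their original left-to-right order)."""
    stripped = []
    found = True
    while found:
        found = False
        for suffix in SORTED_SUFFIXES:
            # Never strip the whole word: keep at least one character of root.
            if len(word) > len(suffix) and word.endswith(suffix):
                word = word[: -len(suffix)]
                stripped.append(suffix)  # O(1) append, outermost suffix first
                found = True
                break
    stripped.reverse()  # a single O(n) reversal restores reading order
    return word, tuple(stripped)
```

Appending each suffix and reversing once at the end avoids the quadratic cost of inserting every suffix at the front of the list.
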
Configuration
Phonetic Hasher
The `PhoneticHasher` generates phonetic codes for Myanmar text, enabling fuzzy matching based on pronunciation.
Features
- Groups phonetically similar characters
- Normalizes tone markers and medials
- Handles visually confusable characters
- LRU caching for performance
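
As a rough illustration of the idea behind these features, the sketch below maps a few sound-alike characters to one representative and drops tone marks, so that similar-sounding spellings share a code. The mapping, the function names `phonetic_code` and `group_by_code`, and the cache size are assumptions for illustration; the library's actual character groups and code format are not shown in this document.

```python
from collections import defaultdict
from functools import lru_cache
from typing import Dict, Iterable, List

# Illustrative sound-alike groups, each mapped to one representative.
SOUND_ALIKE = {
    "\u103c": "\u103b",  # medial ra (ြ) -> medial ya (ျ)
    "\u1013": "\u1012",  # ဓ -> ဒ
    "\u1018": "\u1017",  # ဘ -> ဗ
    "\u100f": "\u1014",  # ဏ -> န
    "\u1020": "\u101c",  # ဠ -> လ
    "\u102b": "\u102c",  # tall aa (ါ) -> aa (ာ), a visual variant of the same vowel
}
TONE_MARKS = "\u1037\u1038"  # dot below (့) and visarga (း)

# One translation table: substitute grouped characters, delete tone marks.
_TABLE = str.maketrans({**SOUND_ALIKE, **{mark: None for mark in TONE_MARKS}})


@lru_cache(maxsize=8192)  # hypothetical cache size
def phonetic_code(text: str) -> str:
    """Return a tone-insensitive code in which sound-alike characters coincide."""
    return text.translate(_TABLE)


def group_by_code(words: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket words that share a phonetic code, preserving input order."""
    groups = defaultdict(list)
    for word in words:
        groups[phonetic_code(word)].append(word)
    return dict(groups)
```

Words that land in the same bucket of `group_by_code` are candidates for fuzzy matching; a real hasher would need a far more complete mapping.
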
Basic Usage
Find Phonetically Similar Words
Generate Phonetic Variants
Configuration
Batch Processing
Tone Disambiguator
The `ToneDisambiguator` uses context to resolve tone-ambiguous words in Myanmar text.
Myanmar Tone System
| Tone | Marker | Example |
|---|---|---|
| Low | unmarked | ငါ (I/me) |
| High | း (visarga) | ငါး (five/fish) |
| Creaky | ့ (aukmyit/dot below) | လေ့ (habit/practice) |
| Checked | final ် (asat) | သပ် (sparse) |
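
The marker column can be turned into a small classifier. The sketch below is an illustration, not the library's logic: `classify_tone` is a hypothetical name, the input is assumed to be a single syllable in Unicode order, and the checked tone is narrowed to a final stop consonant (က, စ, တ, ပ) plus asat. That narrowing is an assumption added here, because asat also closes nasal finals and appears in vowel spellings such as ော်, which are not checked syllables.

```python
import re

VISARGA = "\u1038"    # း, marks the high tone
DOT_BELOW = "\u1037"  # ့, marks the creaky tone

# A stop consonant (က စ တ ပ) followed by asat (်) at the end of the syllable.
_CHECKED_FINAL = re.compile("[\u1000\u1005\u1010\u1015]\u103a$")


def classify_tone(syllable: str) -> str:
    """Classify one syllable as 'checked', 'high', 'creaky', or 'low'."""
    if _CHECKED_FINAL.search(syllable):
        return "checked"
    if VISARGA in syllable:
        return "high"
    if DOT_BELOW in syllable:
        return "creaky"
    return "low"  # no tone marker present
```

With this rule, တော် is classified as low rather than checked, since its asat follows a vowel sign and not a stop consonant.
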
Common Ambiguities
| Word | Meanings |
|---|---|
| ငါ / ငါး | I/me vs five/fish |
| တော / တော့ | forest vs emphatic particle |
| တော / တော် | forest vs royal/suitable |
| ပဲ | only/just vs bean |
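
To make the idea of context-based resolution concrete, here is a toy sketch for the ငါ / ငါး pair. It is not the library's algorithm: the cue words, the window size, and the `disambiguate` function are hypothetical, the input is assumed to be already segmented into tokens, and the table handles a single ambiguity group only.

```python
from typing import Dict, List, Set

NGA = "\u1004\u102b"     # ငါ (I/me)
NGAR = NGA + "\u1038"    # ငါး (five/fish)

# Hypothetical cue words that tend to appear next to each reading.
CONTEXT_CUES: Dict[str, Set[str]] = {
    NGA: {"က", "တို့"},        # subject marker, plural marker (ငါတို့ = we)
    NGAR: {"ကောင်", "ဟင်း"},   # animal classifier, curry
}


def disambiguate(tokens: List[str], index: int, window: int = 2) -> str:
    """Return the reading of tokens[index] best supported by nearby cue words."""
    current = tokens[index]
    if current not in CONTEXT_CUES:
        return current  # not a word this toy table knows about
    start = max(0, index - window)
    end = min(len(tokens), index + window + 1)
    context = set(tokens[start:index] + tokens[index + 1:end])
    # Score each candidate by how many of its cue words occur in the window.
    scores = {cand: len(cues & context) for cand, cues in CONTEXT_CUES.items()}
    best = max(scores, key=scores.get)
    # Only override the written form when another reading scores strictly higher.
    return best if scores[best] > scores[current] else current
```

A production disambiguator would more plausibly use corpus statistics or a language model than a hand-written cue table; the sketch only shows the shape of the decision.
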
Usage
Context-Based Disambiguation
Check Full Sentence
Configuration
Zawgyi Support
Zawgyi is a legacy encoding for Myanmar script. The library detects and handles Zawgyi-encoded text.
Detection
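
One simple way to spot Zawgyi text is to look for code-point patterns that well-formed Unicode Burmese does not produce. The sketch below is a heuristic illustration, not the library's detector; `looks_like_zawgyi` is a hypothetical name, and the two signals are assumptions about typical Zawgyi data: the vowel sign ေ (U+1031) stored before its consonant, and code points in U+1060 to U+1097 that Zawgyi reuses for glyph variants.

```python
import re

_ZAWGYI_HINT = re.compile(
    r"(?:^|\s)\u1031"    # vowel sign E at the start of a word, before its consonant
    r"|[\u1060-\u1097]"  # range Zawgyi repurposes for stacked and variant glyphs
)


def looks_like_zawgyi(text: str) -> bool:
    """Return True if the text shows a pattern typical of Zawgyi encoding."""
    return _ZAWGYI_HINT.search(text) is not None
```

The second signal will misfire on text in languages that legitimately use that range in Unicode, such as Shan or Karen, so this heuristic only suits input known to be Burmese. Statistical detectors are more reliable on short or mixed text.
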
Conversion
The library includes built-in Zawgyi to Unicode conversion.
Text Validation
Validate Myanmar text structure using module-level functions.
Normalization
Text normalization for consistent processing.
Cython Optimization
Normalization has a Cython-optimized version for performance.
Integration
All text utilities integrate with the main spell checker.
Performance Tips
- Enable caching: All utilities support LRU caching
- Batch operations: Use batch methods when processing many texts
- Adjust cache sizes: Increase for high-throughput scenarios
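
The effect of these tips can be measured with the standard library alone. The sketch below uses `functools.lru_cache` around a stand-in function; `process_word`, `process_batch`, and the cache size are hypothetical and do not correspond to the library's own methods or configuration options.

```python
from functools import lru_cache
from typing import Iterable, List


@lru_cache(maxsize=10_000)  # hypothetical size; raise it if the hit rate stays low
def process_word(word: str) -> str:
    """Stand-in for an expensive per-word operation such as stemming."""
    return word.strip()


def process_batch(words: Iterable[str]) -> List[str]:
    """Process many words in one call; repeated words are served from the cache."""
    return [process_word(word) for word in words]


results = process_batch(["a", "b", "a", "a"])
info = process_word.cache_info()
hit_rate = info.hits / (info.hits + info.misses)
```

A hit rate that stays low while `info.currsize` sits at `info.maxsize` suggests the cache is too small for the workload.
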
See Also
- Morphology Analysis - Word structure analysis
- POS Tagging - Part-of-speech tagging
- Normalization - Unicode normalization