mySpellChecker has a pluggable POS tagging system with multiple backends — from zero-dependency rule-based inference to high-accuracy transformer models. POS tags drive grammar checking, disambiguation, and context-aware suggestions throughout the validation pipeline.
Introduction
What is POS Tagging?
Part-of-Speech tagging assigns grammatical categories (noun, verb, adjective, etc.) to words:
မြန်မာ → N (Noun)
ကောင်း → ADJ (Adjective)
သည် → P_SENT (Sentence-ending Particle)
Why Use POS Tagging?
- Improved Accuracy: Context-aware spell checking (85-93% accuracy vs 70% without)
- Better Suggestions: Grammatically appropriate correction suggestions
- Disambiguation: Distinguish between homonyms based on context
- Validation: Detect grammatical errors and inconsistencies
Integration Points
The POS tagger integrates at two levels:
- Build-Time: The inference engine assigns POS tags when building your dictionary from a corpus
- Runtime: On-the-fly tagging for OOV (out-of-vocabulary) words during spell checking
Two Tag Systems
mySpellChecker has two POS sources that produce different tag granularities:
- Inference Engine (rule-based): Produces granular particle tags (P_SUBJ, P_OBJ, P_SENT, P_MOD, P_LOC) by analyzing suffixes and morphology. Used during dictionary building and for OOV word fallback.
- Transformer Model (HuggingFace): Produces coarse particle tags (PPM, PART) because the underlying model was trained on a coarser tag set. Used for high-accuracy runtime tagging.
The transformer model does not distinguish between particle types — it outputs PPM (postpositional marker) or PART (general particle) for all particles. Granular particle tags (P_SUBJ, P_OBJ, etc.) come from the inference engine and the dictionary.
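To make the relationship concrete, a granular tag can be collapsed to its coarse transformer equivalent with a simple lookup (the mapping and `coarsen` helper below are illustrative, not part of the library API):

```python
# Illustrative mapping from granular (inference-engine) particle tags to the
# coarse tags the transformer model emits. Not a library API.
GRANULAR_TO_COARSE = {
    "P_SUBJ": "PPM",
    "P_OBJ": "PPM",
    "P_SENT": "PPM",
    "P_MOD": "PPM",
    "P_LOC": "PPM",
    "P": "PART",
}

def coarsen(tag: str) -> str:
    """Collapse a granular particle tag to its coarse equivalent; pass others through."""
    return GRANULAR_TO_COARSE.get(tag, tag)

print(coarsen("P_SUBJ"))  # PPM
print(coarsen("N"))       # N (non-particle tags are unchanged)
```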
Complete POS Tag Set
| Tag | Description | Examples | Source |
|---|---|---|---|
| N | Noun | မြန်မာ (Myanmar), နိုင်ငံ (country), လူ (person) | All |
| V | Verb | စား (eat), သွား (go), လုပ် (do), ရေး (write) | All |
| ADJ | Adjective | ကောင်း (good), လှ (beautiful), ကြီး (big) | All |
| ADV | Adverb | အလွန် (very), မြန်မြန် (quickly), ဖြည်းဖြည်း (slowly) | All |
| NUM | Number | တစ် (one), နှစ် (two), ၁၀ (10) | All |
| PRON | Pronoun | ကျွန်တော် (I), သူ (he/she), သူတို့ (they) | All |
| CONJ | Conjunction | နှင့် (and), သို့မဟုတ် (or), ဒါပေမယ့် (but) | All |
| INT | Interjection | ဟယ် (hey), အို (oh), ဟာ (wow) | All |
| UNK | Unknown | — | All |
Granular Particle Tags (inference engine only)
These tags are produced by the rule-based inference engine and stored in the dictionary. The transformer model cannot distinguish between particle types — it uses coarse tags instead.
| Tag | Description | Examples |
|---|---|---|
| P_SUBJ | Subject/Topic Particle | က, ကား, ဟာ, မှာ |
| P_OBJ | Object Particle | ကို, အား |
| P_SENT | Sentence Ending | သည်, တယ်, မယ်, ပြီ, ပါ |
| P_MOD | Modifier Particle | သော, တဲ့, နဲ့, လို, ဖြင့် |
| P_LOC | Location/Direction | မှ (from), သို့ (to), ဆီ (towards), တွင် (in/at) |
| P | General Particle | လည်း (also), ပဲ (only), တော့ (as for) |
Coarse Tags (transformer model only)
These tags come from the HuggingFace transformer model. They are broader categories that don’t distinguish particle subtypes.
| Tag | Description | Notes |
|---|---|---|
| PPM | Postpositional Marker | Covers all particles (P_SUBJ, P_OBJ, P_SENT, P_MOD, P_LOC) |
| PART | General Particle | Catch-all for particles not classified as PPM |
| PUNCT | Punctuation | ။, ၊ |
| ABB | Abbreviation | Shortened forms |
| FW | Foreign Word | Non-Myanmar words |
| SB | Symbol | Special symbols |
| TN | Text Number | Numbers written in text form |
The HuggingFace model (chuuhtetnaing/myanmar-pos-model) outputs lowercase tags. The TransformerPOSTagger maps them to the internal uppercase convention via HF_TO_INTERNAL_TAG_MAP:
# HuggingFace → Internal mapping (pos_tagger_transformer.py)
HF_TO_INTERNAL_TAG_MAP = {
"n": "N", "v": "V", "adj": "ADJ",
"adv": "ADV", "pron": "PRON", "num": "NUM",
"conj": "CONJ", "int": "INT", "punc": "PUNCT",
"ppm": "PPM", "part": "PART",
"abb": "ABB", "fw": "FW", "sb": "SB", "tn": "TN",
}
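Applying the map to a raw model label might look like this (`normalize_hf_tag` is a hypothetical helper, not the library's API; unrecognized labels fall back to UNK):

```python
# Same mapping as documented in pos_tagger_transformer.py, repeated here
# so the sketch is self-contained.
HF_TO_INTERNAL_TAG_MAP = {
    "n": "N", "v": "V", "adj": "ADJ",
    "adv": "ADV", "pron": "PRON", "num": "NUM",
    "conj": "CONJ", "int": "INT", "punc": "PUNCT",
    "ppm": "PPM", "part": "PART",
    "abb": "ABB", "fw": "FW", "sb": "SB", "tn": "TN",
}

def normalize_hf_tag(label: str) -> str:
    """Map a lowercase HuggingFace label to the internal uppercase convention."""
    return HF_TO_INTERNAL_TAG_MAP.get(label.lower(), "UNK")

print(normalize_hf_tag("ppm"))      # PPM
print(normalize_hf_tag("mystery"))  # UNK
```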
Tag Disambiguation Guidelines
Many Myanmar words can have multiple POS tags depending on context. Here are common ambiguities:
1. Noun vs. Verb Ambiguity
Some words function as both nouns and verbs:
| Word | As Noun | As Verb | Resolution |
|---|---|---|---|
| စာ | book/letter (N) | to read (V) | Check for preceding ကို or following particle |
| အလုပ် | work/job (N) | to work (V) | Check sentence structure |
| ပညာ | knowledge (N) | to educate (V) | Rare as verb, default to N |
Resolution rule: If followed by ကို/အား, it’s likely a noun. If followed by တယ်/ပြီ, it’s a verb.
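The resolution rule can be sketched as a check on the following token (a toy illustration, not the library's actual disambiguator):

```python
# Particles that signal the tag of the word they follow (from the rule above).
OBJ_PARTICLES = {"ကို", "အား"}    # following particle -> preceding word is a noun
SENT_PARTICLES = {"တယ်", "ပြီ"}  # following particle -> preceding word is a verb

def resolve_n_v(next_token: str) -> str:
    """Guess N vs V for an ambiguous word from the token that follows it."""
    if next_token in OBJ_PARTICLES:
        return "N"
    if next_token in SENT_PARTICLES:
        return "V"
    return "UNK"  # no evidence either way

print(resolve_n_v("ကို"))   # N
print(resolve_n_v("တယ်"))  # V
```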
2. Adjective vs. Verb Ambiguity
Myanmar adjectives often function as stative verbs:
| Word | Context | Tag | Example |
|---|---|---|---|
| ကောင်း | standalone predicate | V | အဲဒါ ကောင်းတယ် (That is good) |
| ကောင်း | modifier before noun | ADJ | ကောင်းတဲ့ လူ (good person) |
| လှ | sentence-final | V | သူ လှတယ် (She is beautiful) |
| လှ | with သော/တဲ့ | ADJ | လှသော မိန်းကလေး (beautiful girl) |
Resolution rule: With modifier particle (သော, တဲ့) → ADJ. As predicate → V.
3. Particle Disambiguation
Particles require careful context analysis:
| Word | Possible Tags | Context | Example |
|---|---|---|---|
| က | P_SUBJ | After subject noun | သူက သွားတယ် |
| က | P_LOC | With location meaning | ရန်ကုန်က လာတယ် (came from Yangon) |
| မှာ | P_SUBJ | Topic marker | ဒါမှာ ကောင်းတယ် |
| မှာ | P_LOC | Location marker | စားပွဲမှာ ရှိတယ် (is on the table) |
| မှာ | V | “to order” meaning | ထမင်းမှာမယ် (will order rice) |
Annotation Guidelines
When annotating Myanmar text for POS tagging:
- Segment first: Ensure proper word boundaries before tagging
- Context matters: Always consider surrounding words for disambiguation
- Particle chains: Tag each particle in a chain separately
- Example: သွားပါမယ် = V(သွား) + P_SENT(ပါ) + P_SENT(မယ်)
- Compound words: Tag as single unit if dictionary entry exists
- Example: ကျောင်းသား (student) = N (not N + N)
- Numbers: Use NUM for digits and number words
- Punctuation: Exclude from POS tagging (handled separately)
Common Annotation Errors to Avoid
| Error | Incorrect | Correct | Reason |
|---|---|---|---|
| Particle as noun | သည် (N) | သည် (P_SENT) | Sentence-final particles aren’t nouns |
| Missing particle | ကောင်းတယ် (V) | ကောင်း (V) + တယ် (P_SENT) | Segment particles separately |
| Verb as adjective | ကောင်း (ADJ) in predicate | ကောင်း (V) | Predicative = verb |
| Wrong particle type | က (P) | က (P_SUBJ) | Use specific particle tags when available |
Quick Start
Default Configuration (Rule-Based)
No setup required; it works out of the box with zero dependencies:
from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
# Uses default rule-based tagger
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)
result = checker.check("မြန်မာ နိုင်ငံ")
Transformer Configuration (High Accuracy)
Install the transformers package and configure:
# Install with transformer support
pip install myspellchecker[transformers]
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig
from myspellchecker.providers import SQLiteProvider
config = SpellCheckerConfig(
pos_tagger=POSTaggerConfig(
tagger_type="transformer",
device=0, # GPU (use -1 for CPU)
)
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
Using Custom Models
Point to your fine-tuned HuggingFace model:
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig
config = SpellCheckerConfig(
pos_tagger=POSTaggerConfig(
tagger_type="transformer",
model_name="your-username/your-myanmar-pos-model",
device=-1, # CPU
)
)
Tagger Types
1. Rule-Based (Default)
Best for: Quick setup, no dependencies, production environments with tight resource constraints
Characteristics:
- Fast suffix-based morphological analysis
- Produces granular particle tags (P_SUBJ, P_OBJ, P_SENT, P_MOD, P_LOC)
- No external dependencies
- Fork-safe for multiprocessing
- Lowest memory footprint
Performance:
- Speed: Very Fast
- Accuracy: ~70%
- Memory: Very Low
- Dependencies: None
How it works:
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger
tagger = RuleBasedPOSTagger(
use_morphology_fallback=True,
cache_size=10000,
unknown_tag="UNK"
)
tag = tagger.tag_word("စားပြီ") # Returns: P_SENT
Fallback chain:
- Check pos_map (if provided)
- Morphological suffix analysis
- Return “UNK” for unknown words
2. Transformer (HuggingFace)
Best for: Maximum accuracy, when GPU is available, offline processing
Characteristics:
- Pre-trained neural models from HuggingFace
- Context-aware sequence tagging
- Produces coarse particle tags (PPM, PART) — mapped from HF lowercase tags
- Requires GPU for optimal speed
- Not fork-safe (CUDA limitations)
Performance:
- Speed: Slow (CPU), Fast (GPU)
- Accuracy: ~93%
- Memory: ~500 MB (model) + ~100 MB (buffer)
- Dependencies: transformers>=4.30.0, torch>=2.0.0
Default model: chuuhtetnaing/myanmar-pos-model (XLM-RoBERTa-based, 93.37% accuracy)
How it works:
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
tagger = TransformerPOSTagger(
model_name="chuuhtetnaing/myanmar-pos-model",
device=0, # GPU
batch_size=32,
max_length=128
)
# Single word
tag = tagger.tag_word("မြန်မာ") # Returns: N
# Sequence (context-aware)
tags = tagger.tag_sequence(["မြန်မာ", "နိုင်ငံ", "သည်"])
# Returns: ['N', 'N', 'PPM'] (coarse particle tag)
# With confidence scores
prediction = tagger.tag_word_with_confidence("ကောင်း")
print(f"{prediction.tag} (confidence: {prediction.confidence:.2f})")
# Output: ADJ (confidence: 0.95)
3. Viterbi HMM
Best for: Context-aware tagging without GPU, balanced accuracy/speed
Characteristics:
- Hidden Markov Model with Viterbi algorithm
- Uses trigram transition probabilities
- Requires pre-built probability tables
- Fork-safe
Performance:
- Speed: Fast
- Accuracy: ~85% (with probability tables), ~70% (fallback to morphology)
- Memory: ~50 MB (probability tables)
- Dependencies: None (pure Python + optional Cython)
Database Requirements:
The Viterbi tagger requires POS probability tables in the database:
| Table | Description |
|---|---|
| pos_unigrams | P(tag) - Prior tag probabilities |
| pos_bigrams | P(tag2 \| tag1) - Tag transition probabilities |
| pos_trigrams | P(tag3 \| tag1, tag2) - Trigram context |
Building database with POS probabilities:
# Build with POS tagging enabled (populates probability tables)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer
# Or use sample database (includes pre-computed probabilities)
myspellchecker build --sample
Note: If probability tables are empty, Viterbi falls back to morphological analysis with reduced accuracy (~70%).
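To verify a database before choosing the Viterbi tagger, a plain sqlite3 query against the table names above is enough (`has_pos_probabilities` is a hypothetical helper, demonstrated here against an in-memory stand-in for your dict.db):

```python
import sqlite3

def has_pos_probabilities(con: sqlite3.Connection) -> bool:
    """Return True if pos_unigrams exists and holds at least one row."""
    row = con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='pos_unigrams'"
    ).fetchone()
    if row is None:
        return False  # table missing: Viterbi would fall back to morphology
    (count,) = con.execute("SELECT COUNT(*) FROM pos_unigrams").fetchone()
    return count > 0

con = sqlite3.connect(":memory:")  # stand-in for sqlite3.connect("mydict.db")
print(has_pos_probabilities(con))  # False: table missing
con.execute("CREATE TABLE pos_unigrams (tag TEXT, prob REAL)")
con.execute("INSERT INTO pos_unigrams VALUES ('N', 0.3)")
print(has_pos_probabilities(con))  # True
```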
How it works:
from myspellchecker.algorithms.pos_tagger_factory import POSTaggerFactory
from myspellchecker.providers import SQLiteProvider
# Requires provider with POS probability tables
provider = SQLiteProvider("mydict.db")
tagger = POSTaggerFactory.create("viterbi", provider=provider, beam_width=10)
# Context-aware sequence tagging
tags = tagger.tag_sequence(["မြန်မာ", "နိုင်ငံ", "သည်"])
4. Custom Tagger
Best for: Domain-specific requirements, research experiments
Implement your own tagger by inheriting from POSTaggerBase:
from myspellchecker.algorithms.pos_tagger_base import POSTaggerBase, TaggerType
class MyCustomTagger(POSTaggerBase):
def tag_word(self, word: str) -> str:
# Your logic here
return "N"
def tag_sequence(self, words: list[str]) -> list[str]:
# Your logic here
return ["N"] * len(words)
@property
def tagger_type(self) -> TaggerType:
return TaggerType.CUSTOM
# Use via factory
from myspellchecker.algorithms.pos_tagger_factory import POSTaggerFactory
from myspellchecker.core.config import POSTaggerConfig
tagger = POSTaggerFactory.create("custom", provider=provider)
Configuration
POSTaggerConfig
Central configuration for POS tagger system:
from myspellchecker.core.config import POSTaggerConfig
config = POSTaggerConfig(
# Tagger selection
tagger_type="transformer", # "rule_based" | "transformer" | "viterbi"
# Transformer settings
model_name="chuuhtetnaing/myanmar-pos-model",
device=-1, # -1=CPU, 0+=GPU index
batch_size=32,
cache_dir=None, # Model cache directory
# Rule-based settings
cache_size=10000,
unknown_tag="UNK",
use_morphology_fallback=True,
# Viterbi settings
beam_width=10,
emission_weight=1.2,
min_prob=1e-10,
)
Environment Variables
Configure via environment variables (useful for deployment):
# Tagger type
export MYSPELL_POS_TAGGER_TYPE="transformer"
# Model selection
export MYSPELL_POS_TAGGER_MODEL_NAME="your-username/model"
# Beam width for Viterbi tagger
export MYSPELL_POS_TAGGER_BEAM_WIDTH="15"
Configuration Priority
- Explicit config in code (highest priority)
- Environment variables
- Default values (lowest priority)
Build-Time Usage
CLI - Building Dictionaries
Default (Rule-Based)
myspellchecker build \
-i corpus.txt \
-o mydict.db \
--sample=false
With Transformer
myspellchecker build \
-i corpus.txt \
-o mydict.db \
--pos-tagger transformer \
--pos-model chuuhtetnaing/myanmar-pos-model \
--pos-device 0 \
--sample=false
With Custom Model
myspellchecker build \
-i corpus.txt \
-o mydict.db \
--pos-tagger transformer \
--pos-model /path/to/my/finetuned/model \
--pos-device -1 \
--sample=false
Python API - Building Dictionaries
from myspellchecker.data_pipeline.pipeline import Pipeline
from myspellchecker.data_pipeline.config import PipelineConfig
from myspellchecker.core.config import POSTaggerConfig
# Configure pipeline with POS tagger
config = PipelineConfig(
pos_tagger=POSTaggerConfig(
tagger_type="transformer",
model_name="chuuhtetnaing/myanmar-pos-model",
device=0, # GPU
batch_size=64, # Larger batch for build-time
),
keep_intermediate=False,
)
# Build database
pipeline = Pipeline(config=config, work_dir="temp_build")
pipeline.build_database(
input_files=["corpus1.txt", "corpus2.txt"],
database_path="mydict.db",
sample=False,
)
Runtime Usage
SpellChecker Configuration
Default (Rule-Based)
from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
# Explicit provider (required — no bundled database)
provider = SQLiteProvider(database_path="mydict.db")
checker = SpellChecker(provider=provider)
With Transformer
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig
from myspellchecker.providers import SQLiteProvider
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
# Create transformer tagger
tagger = TransformerPOSTagger(
model_name="chuuhtetnaing/myanmar-pos-model",
device=0, # GPU
)
# Create provider with tagger
provider = SQLiteProvider(database_path="mydict.db", pos_tagger=tagger)
# Create config and spell checker
config = SpellCheckerConfig(
pos_tagger=POSTaggerConfig(tagger_type="transformer", device=0)
)
checker = SpellChecker(config=config, provider=provider)
# Use spell checker
result = checker.check("မြန်မာ နိုင်ငံ ကောင်း သည်")
OOV Word Handling
The POS tagger provides fallback for out-of-vocabulary words:
from myspellchecker.providers import SQLiteProvider
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger
provider = SQLiteProvider(database_path="mydict.db")
# For known word: database lookup
pos = provider.get_word_pos("မြန်မာ") # Returns: N (from database)
# For OOV word: tagger fallback
pos = provider.get_word_pos("CompletelyUnknownWord123") # Returns: UNK
Fallback chain:
- Database lookup
- Stemming + root lookup
- POS tagger
- Morphology analyzer (backward compatibility)
- Return None or “UNK”
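A simplified sketch of that chain, with toy stand-ins for the database, stemmer, and tagger (none of these names are the provider's real internals):

```python
def lookup_pos(word, db, stem_db, tagger):
    # 1. Direct database lookup
    if word in db:
        return db[word]
    # 2. Stemming + root lookup (toy stemmer: strip a known suffix)
    for suffix in ("တယ်", "ပြီ"):
        if word.endswith(suffix) and word[: -len(suffix)] in stem_db:
            return stem_db[word[: -len(suffix)]]
    # 3. POS tagger fallback
    tag = tagger(word)
    if tag != "UNK":
        return tag
    # 4. A morphology analyzer would sit here; 5. otherwise report unknown
    return "UNK"

db = {"မြန်မာ": "N"}
stem_db = {"စား": "V"}
print(lookup_pos("မြန်မာ", db, stem_db, lambda w: "UNK"))   # N (database)
print(lookup_pos("စားတယ်", db, stem_db, lambda w: "UNK"))  # V (stem lookup)
print(lookup_pos("xyz", db, stem_db, lambda w: "UNK"))      # UNK
```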
Comparison
| Tagger | Speed | Accuracy | Memory | Context-Aware |
|---|---|---|---|---|
| Rule-Based | Very Fast | ~70% | Very Low | No |
| Viterbi | Fast | ~85% | Low | Yes |
| Transformer | Slow (CPU) / Fast (GPU) | ~93% | High | Yes |
Recommendation Matrix
| Use Case | Recommended Tagger | Reason |
|---|---|---|
| Production API | Rule-Based or Viterbi | Fast, low memory, no GPU needed |
| Batch Processing | Transformer (GPU) | Highest accuracy, GPU parallelism |
| Offline Analysis | Transformer (CPU) | Accuracy over speed |
| Embedded Systems | Rule-Based | Minimal footprint |
| Research | Transformer or Custom | Flexibility and accuracy |
Troubleshooting
Missing Dependencies
Error: ImportError: transformers required
Solution:
pip install myspellchecker[transformers]
# Or manually:
pip install transformers>=4.30.0 torch>=2.0.0
Verification:
try:
from transformers import pipeline
print("Transformers installed")
except ImportError:
print("Transformers not installed")
CUDA Errors
Error: RuntimeError: CUDA out of memory
Solutions:
- Reduce batch size:
config = POSTaggerConfig(
tagger_type="transformer",
batch_size=8, # Reduce from default 32
)
- Use CPU:
config = POSTaggerConfig(
tagger_type="transformer",
device=-1, # Force CPU
)
- Clear GPU cache:
import torch
torch.cuda.empty_cache()
Error: RuntimeError: CUDA error: device-side assert triggered
Solution: Usually model/data mismatch. Verify:
tagger = TransformerPOSTagger(device=0)
# Ensure input is valid Myanmar Unicode text
tag = tagger.tag_word("မြန်မာ") # Valid
# tag = tagger.tag_word(None) # Invalid - will crash
Model Loading Failures
Error: OSError: Can't load model from 'nonexistent/model'
Solutions:
- Verify model exists:
# Check HuggingFace model
curl -I https://huggingface.co/chuuhtetnaing/myanmar-pos-model
# Or use local path
ls /path/to/my/model/config.json
- Check internet connection (for HuggingFace downloads):
import requests
response = requests.get("https://huggingface.co")
print(f"Status: {response.status_code}")
- Use cache directory:
config = POSTaggerConfig(
tagger_type="transformer",
cache_dir="/path/to/cache", # Persistent cache
)
- Download manually:
# Download model to local directory
huggingface-cli download chuuhtetnaing/myanmar-pos-model --local-dir ./my-model
# Use local path
python -c "
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
tagger = TransformerPOSTagger(model_name='./my-model')
"
Fork-Safety Issues
Error: RuntimeError: Cannot re-initialize CUDA in forked subprocess
Cause: Transformer models use CUDA which is not fork-safe.
Solution: Use rule-based or Viterbi tagger for multiprocessing:
from multiprocessing import Pool
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger
# Fork-safe
tagger = RuleBasedPOSTagger()
def process_batch(words):
return [tagger.tag_word(w) for w in words]
with Pool(4) as pool:
results = pool.map(process_batch, batches)
# NOT fork-safe
# tagger = TransformerPOSTagger() # Will crash in forked processes
Alternative: Use spawn instead of fork:
from multiprocessing import get_context
with get_context("spawn").Pool(4) as pool:
results = pool.map(process_batch, batches)
Performance Issues
Slow tagging with transformer:
- Use GPU:
config = POSTaggerConfig(device=0) # GPU 0
- Increase batch size:
config = POSTaggerConfig(batch_size=64) # Default: 32
- Use quantization (trade accuracy for speed):
# Requires torch>=2.0
from transformers import AutoModelForTokenClassification
import torch
model = AutoModelForTokenClassification.from_pretrained(
"chuuhtetnaing/myanmar-pos-model"
)
model = torch.quantization.quantize_dynamic(
model, {torch.nn.Linear}, dtype=torch.qint8
)
Joint Segmentation and Tagging
Overview
Joint segmentation and tagging is an advanced mode that performs word segmentation and POS tagging simultaneously in a single Viterbi pass. This is different from the default sequential approach where text is first segmented, then tagged.
Default Behavior (Sequential Mode):
Input Text -> Segmenter -> Words -> POS Tagger -> Tagged Words
Joint Mode:
Input Text -> Joint Viterbi Decoder -> Words + Tags (simultaneously)
Why It’s Disabled by Default
Joint mode is disabled by default (config.joint.enabled=False) for several important reasons:
| Reason | Explanation |
|---|---|
| Increased Complexity | State space is O(positions x word_lengths x tags^2) vs O(words x tags^2) for sequential |
| Higher Memory Usage | Beam search over joint state space requires more memory |
| Less Tested | Sequential pipeline has more extensive production testing |
| Similar Accuracy | For most use cases, sequential mode achieves comparable results |
| Startup Overhead | Joint mode requires loading additional probability tables |
When to Enable Joint Mode
Joint mode may provide benefits in specific scenarios:
| Use Case | Benefit | Enable Joint? |
|---|---|---|
| Ambiguous segmentation | POS context helps resolve word boundaries | Yes |
| OOV-heavy text | Joint optimization handles unknown words better | Yes |
| Research/Experiments | Comparing segmentation approaches | Yes |
| Production API | Latency-sensitive, well-segmented text | No |
| Simple validation | Basic spell checking | No |
Configuration
Enable Joint Mode
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, JointConfig
from myspellchecker.providers import SQLiteProvider
# Enable joint segmentation-tagging
config = SpellCheckerConfig(
joint=JointConfig(
enabled=True,
beam_width=15, # Larger beam for joint state space
max_word_length=20,
emission_weight=1.2,
word_score_weight=1.0,
)
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
Using SpellCheckerBuilder
from myspellchecker.core.builder import SpellCheckerBuilder
checker = (
SpellCheckerBuilder()
.with_joint_segmentation(enabled=True)
.build()
)
# Perform joint segmentation and tagging
words, tags = checker.segment_and_tag("မြန်မာနိုင်ငံ")
print(list(zip(words, tags)))
# Output: [('မြန်မာ', 'N'), ('နိုင်ငံ', 'N')]
JointConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | False | Enable joint segmentation-tagging mode |
| beam_width | int | 15 | Beam width for Viterbi decoding (larger = more accurate, slower) |
| max_word_length | int | 20 | Maximum word length in characters |
| emission_weight | float | 1.2 | Weight for P(tag \| word) emission probabilities |
| word_score_weight | float | 1.0 | Weight for word n-gram scores |
| min_prob | float | 1e-10 | Minimum probability threshold to prevent underflow |
| use_morphology_fallback | bool | True | Use morphology analyzer for OOV word tagging |
Performance Comparison
| Mode | Speed | Memory | Best For |
|---|---|---|---|
| Sequential | Fast | Low | Production, latency-sensitive |
| Joint | Moderate | Higher | Ambiguous text, research |
Note: Performance varies based on text complexity and hardware.
Usage Example
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, JointConfig
from myspellchecker.providers import SQLiteProvider
# Sequential mode (default)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
sequential_checker = SpellChecker(provider=provider)
words_seq, tags_seq = sequential_checker.segment_and_tag("မြန်မာနိုင်ငံသည်ကောင်းသည်")
# Joint mode
joint_config = SpellCheckerConfig(
joint=JointConfig(enabled=True)
)
joint_checker = SpellChecker(config=joint_config, provider=provider)
words_joint, tags_joint = joint_checker.segment_and_tag("မြန်မာနိုင်ငံသည်ကောင်းသည်")
# Compare results
print(f"Sequential: {list(zip(words_seq, tags_seq))}")
print(f"Joint: {list(zip(words_joint, tags_joint))}")
Technical Details
The joint decoder uses a unified Viterbi algorithm that optimizes:
argmax P(words, tags | text)
= argmax prod_i P(word_i | word_{i-1}) x P(tag_i | tag_{i-1}, tag_{i-2}) x P(tag_i | word_i)
State representation: (position, word_start, current_tag, prev_tag)
Scoring components:
- Word score: log P(word | prev_word) - N-gram language model
- Transition score: log P(tag | prev_tags) - POS tag sequence model
- Emission score: log P(tag | word) - Word-to-tag emission probability
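In log space the three weighted components combine additively, mirroring the JointConfig weights (`joint_score` is an illustrative sketch, not the decoder's actual code):

```python
import math

def joint_score(p_word, p_trans, p_emit,
                emission_weight=1.2, word_score_weight=1.0, min_prob=1e-10):
    """Weighted log-space score for one (word, tag) step of the joint decoder."""
    clamp = lambda p: max(p, min_prob)  # min_prob guards against log(0) underflow
    return (word_score_weight * math.log(clamp(p_word))   # word score
            + math.log(clamp(p_trans))                    # transition score
            + emission_weight * math.log(clamp(p_emit)))  # emission score

s = joint_score(p_word=0.01, p_trans=0.2, p_emit=0.5)
print(round(s, 3))  # -7.046
```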
Limitations
- Requires probability tables: Joint mode needs bigram/trigram probabilities in the database
- Not all segmenters support it: Only
JointSegmentTagger implements joint mode
- Base segmenters raise NotImplementedError: Individual segmenters don’t support joint mode; use
SpellChecker.segment_and_tag() instead
Advanced Topics
Fine-Tuning Custom Models
Train your own Myanmar POS tagger on domain-specific data:
# 1. Prepare training data (word, POS tag pairs)
training_data = [
("မြန်မာ", "N"),
("နိုင်ငံ", "N"),
("ကောင်း", "ADJ"),
# ... more examples
]
# 2. Use HuggingFace Trainer (example)
from transformers import (
AutoModelForTokenClassification,
AutoTokenizer,
TrainingArguments,
Trainer
)
model = AutoModelForTokenClassification.from_pretrained(
"xlm-roberta-base",
num_labels=len(pos_tags)
)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# 3. Train (simplified)
trainer = Trainer(
model=model,
args=TrainingArguments(output_dir="./my-myanmar-pos"),
train_dataset=train_dataset,
)
trainer.train()
# 4. Use your model
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
tagger = TransformerPOSTagger(model_name="./my-myanmar-pos")
Extending with Custom Taggers
Create domain-specific taggers:
from myspellchecker.algorithms.pos_tagger_base import POSTaggerBase, POSPrediction, TaggerType
class DomainSpecificTagger(POSTaggerBase):
"""Medical domain POS tagger."""
def __init__(self, medical_terms_dict):
self.medical_terms = medical_terms_dict
def tag_word(self, word: str) -> str:
# Check medical terminology first
if word in self.medical_terms:
return self.medical_terms[word]
# Fallback to heuristics
if word.endswith("ရောဂါ"):
return "N_DISEASE"
return "UNK"
def tag_sequence(self, words: list[str]) -> list[str]:
return [self.tag_word(w) for w in words]
@property
def tagger_type(self) -> TaggerType:
return TaggerType.CUSTOM
# Usage
medical_dict = {
"ငှက်ဖျားရောဂါ": "N_DISEASE",
"ဆေးဝါး": "N_MEDICINE",
}
tagger = DomainSpecificTagger(medical_dict)
Combining Multiple Taggers
Ensemble approach for higher accuracy:
class EnsembleTagger(POSTaggerBase):
def __init__(self, taggers: list[POSTaggerBase], weights: list[float]):
self.taggers = taggers
self.weights = weights
def tag_word_with_confidence(self, word: str) -> POSPrediction:
predictions = [
t.tag_word_with_confidence(word) for t in self.taggers
]
# Weighted voting
votes = {}
for pred, weight in zip(predictions, self.weights):
votes[pred.tag] = votes.get(pred.tag, 0) + weight * pred.confidence
best_tag = max(votes, key=votes.get)
confidence = votes[best_tag] / sum(self.weights)
return POSPrediction(word=word, tag=best_tag, confidence=confidence)
# ... implement other methods
# Usage
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
ensemble = EnsembleTagger(
taggers=[
RuleBasedPOSTagger(),
TransformerPOSTagger(),
],
weights=[0.3, 0.7] # Trust transformer more
)
Caching Strategies
Optimize performance with intelligent caching:
from functools import lru_cache
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
class CachedTransformerTagger(TransformerPOSTagger):
def __init__(self, *args, cache_size=10000, **kwargs):
super().__init__(*args, **kwargs)
self._setup_cache(cache_size)
def _setup_cache(self, cache_size):
self._tag_word_cached = lru_cache(maxsize=cache_size)(
super().tag_word
)
def tag_word(self, word: str) -> str:
return self._tag_word_cached(word)
# Usage - 10x speedup for repeated words
tagger = CachedTransformerTagger(cache_size=50000)
Acknowledgments
The default transformer-based POS tagger uses the myanmar-pos-model by Chuu Htet Naing:
| Attribute | Value |
|---|---|
| Model | chuuhtetnaing/myanmar-pos-model |
| Author | Chuu Htet Naing |
| Base Model | XLM-RoBERTa |
| Accuracy | 93.37% |
| F1 Score | 92.24% |
| License | Please refer to the model’s Hugging Face page for license information |
This model was trained specifically for Myanmar/Burmese Part-of-Speech tagging and provides state-of-the-art accuracy for the language.
Citation: If you use the transformer POS tagger in your research, please cite the original model:
@misc{chuuhtetnaing-myanmar-pos,
author = {Chuu Htet Naing},
title = {Myanmar POS Model},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/chuuhtetnaing/myanmar-pos-model}
}
We express our gratitude to Chuu Htet Naing for making this model publicly available, which significantly enhances the accuracy of Myanmar language processing in mySpellChecker.
See Also